Machine Learning--week2: Multivariate Linear Regression, Gradient Descent Improvements, Feature Scaling, Mean Normalization, Polynomial Regression, Normal Equation and the Design Matrix

2023-04-28

For problems with multiple features (suppose there are n features), the hypothesis should be rewritten as

\[\mathit{h} _{\theta}(x) = \theta_{0} + \theta_{1}\cdot x_{1}+\theta_{2}\cdot x_{2}+\theta_{3}\cdot x_{3}+\dots+\theta_{n}\cdot x_{n}
\]

where:

\[x=\begin{bmatrix}x_{1}\\ x_{2}\\ x_{3}\\ \vdots \\ x_{n} \end{bmatrix}\in {\Bbb R}^n \;,\; \theta=\begin{bmatrix}\theta_{1}\\ \theta_{2}\\ \theta_{3}\\ \vdots \\ \theta_{n} \end{bmatrix}\in {\Bbb R}^n
\]

For convenience of notation, let \(x_{0}=1\); then

\[\mathit{h} _{\theta}(x) = \theta_{0}\cdot x_{0} + \theta_{1}\cdot x_{1}+\theta_{2}\cdot x_{2}+\theta_{3}\cdot x_{3}+\dots+\theta_{n}\cdot x_{n}
\]

\[\quad\; x=\begin{bmatrix}x_{0} \\ x_{1}\\ x_{2}\\ x_{3}\\ \vdots \\ x_{n} \end{bmatrix}\in {\Bbb R}^{n+1}\;,\; \theta=\begin{bmatrix}\theta_{0} \\ \theta_{1}\\ \theta_{2}\\ \theta_{3}\\ \vdots \\ \theta_{n} \end{bmatrix}\in {\Bbb R}^{n+1}
\]

That is:

\[h_{\theta}(x) = \theta^{\rm T}x
\]

Multivariate linear regression: \(h_{\theta}(x) = \theta^{\rm T}x\)

cost function:

\[J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^{2} = \frac{1}{2m} \sum_{i=1}^{m} (\theta^{\rm T}x^{(i)}-y^{(i)})^{2} = \frac{1}{2m} \sum_{i=1}^{m} \Big(\sum_{j=0}^{n} \theta_{j}x_{j}^{(i)}-y^{(i)}\Big)^{2}
\]
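
A minimal vectorized Octave sketch of this cost computation (the names X, y, theta and computeCost are my own; X is assumed to be the \(m\times(n+1)\) matrix of training examples with the \(x_0=1\) column included):

```octave
% computeCost: evaluate J(theta) for linear regression
% X     : m x (n+1) matrix of examples (first column all ones, i.e. x0 = 1)
% y     : m x 1 vector of targets
% theta : (n+1) x 1 parameter vector
function J = computeCost(X, y, theta)
  m = length(y);                      % number of training examples
  h = X * theta;                      % h_theta(x^(i)) for every example at once
  J = (1 / (2 * m)) * sum((h - y) .^ 2);
end
```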

\(\therefore\) the update inside the gradient descent loop becomes \(\theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial \theta_{j}}J(\theta) \qquad (j = 0,1,2,\dots,n)\)

\(\therefore\) gradient descent algorithm (\(n \ge 1\); simultaneously update \(\theta_{j}\) for \(j=0,1,2,\dots,n\)):

\[\begin{array}{l}
\text{repeat until convergence}\ \{\\
\qquad \theta_{j} := \theta_{j} - \alpha\dfrac{1}{m} \displaystyle\sum_{i=1}^{m} \left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_{j}^{(i)} \qquad (j = 0,1,2,\dots,n)\\
\}
\end{array}
\]

[Implementation note: in one of the homework problems I used \(\sum\Delta\theta_{j}^2>C\) as the loop condition, where \(C\) must be a very small constant (e.g. 0.0000000001), otherwise the result is inaccurate; \(\alpha\) can be relatively larger, but not too large (e.g. 0.001)]
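
A minimal Octave sketch of this batch update, assuming the computeCost sketch above and a fixed number of iterations (the note above uses a convergence threshold instead; num_iters is my own choice):

```octave
% gradientDescent: run num_iters steps of batch gradient descent
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    h = X * theta;                                   % predictions for all m examples
    theta = theta - (alpha / m) * (X' * (h - y));    % simultaneous update of every theta_j
    J_history(iter) = computeCost(X, y, theta);      % track J(theta) to check convergence
  end
end
```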

Tricks:

Feature Scaling

Divide each feature of x by the maximum value of that feature, so that each value becomes a fraction of the maximum; this keeps differences in magnitude between features from degrading the performance of gradient descent.

In other words, constrain the value of every feature of x to roughly the range \([-1,1]\).

Features whose ranges fall between roughly \([-3,3]\) and \([-\frac{1}{3}, \frac{1}{3}]\) are acceptable; features with much larger or much smaller ranges need feature scaling.

Mean Normalization

Replace \(x_{i}\) with \(x_{i}-\mu_{i}\) to make features have approximately zero mean (but do not apply this to \(x_{0} = 1\): since every \(x_0=1\), its mean can never be 0).

In other words, shift each feature so that its mean becomes 0, where \(\mu_{i}\) is the average value of \(x_i\).

e.g. \(x_1= \frac{size-1000}{2000},\quad x_2 = \frac{\#bedrooms-2}{5},\qquad \text{s.t. } -0.5\le x_1\le 0.5,\; -0.5\le x_2\le 0.5\)

Expressed as a formula:

\[x_i := \frac{x_i-\mu_i}{s_i}
\]

where \(\mu_i\) is the average value of the feature and \(s_i\) is either the range of the feature (\(\max - \min\)) or its standard deviation.
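A minimal Octave sketch of this normalization (featureNormalize and X_norm are my own names; here the standard deviation is used for \(s_i\), and it is applied to the raw features only, not to the \(x_0=1\) column):

```octave
% featureNormalize: subtract the mean and divide by the standard deviation, per feature
function [X_norm, mu, sigma] = featureNormalize(X)
  mu    = mean(X);               % 1 x n row vector of feature means
  sigma = std(X);                % per-feature standard deviation (max - min also works as s_i)
  X_norm = (X - mu) ./ sigma;    % broadcasting subtracts mu and divides by sigma column-wise
end
```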

[Ah, good — at this point the course explains how to choose \(\alpha\), so I don't have to figure it out myself]

Declare convergence if \(J(\theta)\) decreases by less than \(10^{-3}\) in one iteration. (# the loop termination condition)

To choose \(\alpha\), try \(\dots,0.001,0.003,0.01,0.03,0.1,0.3,1,\dots\) (roughly multiplying by 3 each time) (# how to choose \(\alpha\))

Try to pick the largest possible value, or a value just slightly smaller than the largest reasonable value found.
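
One way to run this search in Octave, assuming the gradientDescent sketch above and an already-scaled X with the \(x_0=1\) column: plot \(J(\theta)\) against the iteration number for each candidate \(\alpha\) and keep the largest one whose curve still decreases on every iteration.

```octave
alphas = [0.001 0.003 0.01 0.03 0.1 0.3 1];   % candidate learning rates
num_iters = 50;
figure; hold on;
for a = alphas
  theta0 = zeros(size(X, 2), 1);                             % restart from theta = 0 each time
  [theta_a, J_history] = gradientDescent(X, y, theta0, a, num_iters);
  plot(1:num_iters, J_history);                              % J should drop on every iteration
end
xlabel('iteration'); ylabel('J(\theta)'); hold off;
```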

Combine features where it makes sense, e.g. use the area instead of separate length and width features.

Polynomial Regression

Example:

\(\begin{aligned}h_{\theta}(x) &= \theta_0 + \theta_1\cdot x_1+ \theta_2\cdot x_2+\theta_3\cdot x_3\\&=\theta_0 + \theta_1\cdot (size)+ \theta_2\cdot (size)^2+\theta_3\cdot (size)^3 \end{aligned}\)

Since the values of the different powers of size eventually differ by orders of magnitude, mean normalization / feature scaling is required.

The powers do not have to keep increasing; for a relationship that should only go up (never come back down), one can also use:

\(h_{\theta}(x) =\theta_0 + \theta_1\cdot (size)+ \theta_2\cdot \sqrt{size}\)

The feature-scaling process for this model (given ①②③):

① The model is \(h_{\theta}(x) =\theta_0 + \theta_1\cdot (size)+ \theta_2\cdot \sqrt{size}\)

② size ranges from 1 to 1000 (feet\(^2\))

③ Implement this by fitting a model \(h_{\theta}(x) =\theta_0 + \theta_1\cdot x_1+ \theta_2\cdot x_2\)

\(\therefore\) \(x_1,x_2\) should satisfy \(x_1 = \frac{size}{1000}, \quad x_2=\frac{\sqrt{size}}{\sqrt{1000}}\)

One important thing to keep in mind: if you choose your features this way, then feature scaling becomes very important.
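
A sketch of building and scaling such polynomial features in Octave (size_vec is an assumed \(m\times 1\) column of house sizes; featureNormalize is the sketch from the feature-scaling section):

```octave
% size_vec : m x 1 column of house sizes (named to avoid shadowing the built-in size())
X_poly = [size_vec  size_vec.^2  size_vec.^3];    % size, size^2, size^3 as three features
[X_poly, mu, sigma] = featureNormalize(X_poly);   % essential: the columns differ by orders of magnitude
X_poly = [ones(length(size_vec), 1)  X_poly];     % prepend the x0 = 1 column after scaling
```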

Normal Equation

Solves for the optimal \(\theta\) directly, in closed form.

Essentially: take the derivative and solve for where the derivative equals 0.

The approach that comes to mind first is to set the partial derivative with respect to each parameter to zero and solve: \(\frac{\partial}{\partial \theta_j}J(\theta) = 0\)

In practice it can be done like this:

Let \(X = \begin{bmatrix}x_{0}^{(1)} & x_{1}^{(1)} & x_{2}^{(1)} & \cdots & x_{n}^{(1)} \\ x_{0}^{(2)} & x_{1}^{(2)} & x_{2}^{(2)} & \cdots & x_{n}^{(2)} \\ \vdots & \vdots &\vdots & \ddots & \vdots \\ x_{0}^{(m)} & x_{1}^{(m)} & x_{2}^{(m)} & \cdots & x_{n}^{(m)} \end{bmatrix}\quad,\quad y = \begin{bmatrix} y_1\\ y_2 \\ \vdots \\ y_m \end{bmatrix}\)

Then \(\large\theta = (X^TX)^{-1}X^Ty\)

If \(x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix}\), then the design matrix is \(X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T\\ (x^{(3)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}\)

theta = pinv(X' * X) * X' * y   % Octave: the normal equation in one line

(When using the normal equation there is no need to scale/normalize the features.)
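
Putting it together in Octave (X_raw and y are assumed names for the raw feature matrix and the targets; note the \(x_0=1\) column and that the features are left unscaled):

```octave
% X_raw : m x n matrix of raw (unscaled) features, y : m x 1 vector of targets
m = size(X_raw, 1);
X = [ones(m, 1)  X_raw];          % design matrix with the x0 = 1 column prepended
theta = pinv(X' * X) * X' * y;    % theta = (X'X)^(-1) X'y, solved in one step
```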

Comparison with Gradient Descent:

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose \(\alpha\) | No need to choose \(\alpha\) |
| Needs many iterations | No need to iterate |
| \(O(kn^2)\) | \(O(n^3)\), need to calculate the inverse of \(X^TX\) |
| Works well when \(n\) is large | Slow if \(n\) is very large |

How to choose between them:

\(\lg(n)\ge 4\): gradient descent; \(\lg(n)\le 4\): normal equation (i.e. once \(n\) is around \(10^4\) or more, the normal equation becomes too slow)

Computing the normal equation requires \(X^TX\) to be invertible, but what if it is non-invertible?

In Octave, both pinv() and inv() can compute an inverse, but pinv() (the pseudo-inverse) still returns a mathematically usable \(\theta\) even when the matrix is non-invertible.

If \(X^TX\) is non-invertible:

First check whether there are redundant features, e.g. one feature measured in feet and another that is just the same quantity converted to meters; if so, delete the redundant feature.
Then check whether there are too many features (e.g. more features than training examples). If so, either delete some features, if you can afford to use fewer, or consider using regularization.
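
As an illustration of the redundant-feature case (all numbers below are made up): a column that is just another column rescaled makes \(X^TX\) singular, yet pinv() still returns a usable \(\theta\):

```octave
size_ft2 = [1000; 1500; 2000; 2500; 3000];   % size in square feet (made-up data)
size_m2  = size_ft2 * 0.0929;                % the same feature converted to square metres -> redundant
y        = [200; 280; 370; 450; 540];        % made-up prices
X = [ones(5, 1)  size_ft2  size_m2];         % columns 2 and 3 are linearly dependent, so X'X is singular
theta = pinv(X' * X) * X' * y                % pinv() still returns a usable theta
% inv(X' * X) * X' * y                       % inv() would warn that the matrix is singular
```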
