Linear Regression with One Variable

Model and Cost Function

Model Representation

• Supervised Learning (监督学习): Given the “right answer” for each example in the data.
• Regression Problem (回归问题): Predict real-valued output.
• Classification Problem (分类问题): Predict discrete-valued output.
• Training set (训练集)
• m: number of training examples
• x‘s: “input” variable / features
• y‘s: “output” variable / “target” variable
• $(x, y)$: one training example
• $(x^i, y^i)$: $i^{th}$ training example
• Training Set -> Learning Algorithm -> h(hypothesis, 假设)
• h is a function maps from x’s to y’s
• e.g. Size of house -> h -> Estimated price
• Linear regression with one variable
• $h_\theta (x) = \theta_0 + \theta_1 x$
• Shorthand: $h(x)$
• Or named Univariate linear regression (单变量线性回归)

Cost Function

• Hypothesis: $h_\theta (x) = \theta_0 + \theta_1 x$

• $\theta_i$’s: Parameters (模型参数)
• How to choose $\theta_i$’s ?
• Idea: Choose $\theta_0, \theta_1$ so that $h_\theta (x)$ is close to $y$ for our training example $(x,y)$
• Cost function (代价函数)
• $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m \left(h_\theta(x^{(i)})-y^{(i)}\right)^2$
• Sometimes called Square error function (平方误差代价函数)
• Goal: minimise $J(\theta_0, \theta_1)$

Parameter Learning

• Goal
• Have some function $J(\theta_0, \theta_1)$
• Want $\theta_0, \theta_1$ of $min J(\theta_0, \theta_1)$
• Outline
• Start with some $\theta_0, \theta_1$, usually all set to $0$.
• Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at minimum

repeat until convergence (收敛) {
​ $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_j)$ (for $j=0$ and $j=1$)
}

• := denotes assignment

• $\alpha$ denotes learning rate

• if too small, gradient descent can be slow
• If too large, gradient descent can overshoot the minimum. It may fail to converge or even diverge.
• You should simultaneously update $\theta_0$ and $\theta_1$

• That is, you should compute the right-hand sides of $\theta_0$ and $\theta_1$, then save them to temporary variables, and finally update $\theta_0$ and $\theta_1$.

$temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_j)$

$temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_j)$

$\theta_0 := temp0$

$\theta_1 :=temp1$

Intuition

• If $\theta_1$ at local optima, it leaves $\theta_1$ unchanged.

• gradient descent can converge to a local minimum, even with the learning rate $\alpha$ fixed.

• As we approach a local minimum, gradient descent will automatically take smaller steps. So, no need to decrease $\alpha$ over time.

We can compute that

$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum^m_{i=1}\left(h_\theta(x^{(i)})-y^{(i)}\right)$

$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum^m_{i=1}\left(h_\theta(x^{(i)})-y^{(i)}\right) \cdot x^{(i)}$

Thus the Gradient descent algorithm can be expressed as

repeat until convergence {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum^m_{i=1}\left(h_\theta(x^{(i)})-y^{(i)}\right)​$

$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum^m_{i=1}\left(h_\theta(x^{(i)})-y^{(i)}\right) \cdot x^{(i)}$
}

And the cost funciton of linear refression is always a convex function (凸函数), or called Bowl-shaped function (弓形函数). It doesn’t have any local optima except for the one global optimum.