# Linear Regression

Let's start this tutorial from the classic Linear Regression ([1]) model.

In this chapter, you will build a model to predict house price with real datasets and learn about several important concepts about machine learning.

The source code of this tutorial is in book/fit_a_line. For the new users, please refer to Running This Book .

## Background

Given a $n$ dataset ${\{y_{i}, x_{i1}, ..., x_{id}\}}_{i=1}^{n}$, of which $x_{i1}, \ldots, x_{id}$ are the values of the $d$th attribute of $i$ sample, and $y_i$ is the target to be predicted for this sample.

The linear regression model assumes that the target $y_i$ can be described by a linear combination among attributes, i.e.

$$y_i = \omega_1x_{i1} + \omega_2x_{i2} + \ldots + \omega_dx_{id} + b, i=1,\ldots,n$$

For example, in the problem of prediction of house price we are going to explore, $x_{ij}$ is a description of the various attributes of the house $i$ (such as the number of rooms, the number of schools and hospitals around, traffic conditions, etc.). $y_i$ is the price of the house.

At first glance, this assumption is too simple, and the true relationship among variables is unlikely to be linear. However, because the linear regression model has the advantages of simple form and easy to be modeled and analyzed, it has been widely applied in practical problems. Many classic statistical learning and machine learning books [2,3,4] also focus on linear model in a chapter.

## Result Demo

We used the Boston house price dataset obtained from UCI Housing dataset to train and predict the model. The scatter plot below shows the result of price prediction for parts of house with model. Each point on x-axis represents the median of the real price of the same type of house, and the y-axis represents the result of the linear regression model based on the feature prediction. When the two values are completely equal, they will fall on the dotted line. So the more accurate the model is predicted, the closer the point is to the dotted line.

Figure One. Predict value V.S Ground-truth value

## Model Overview

### Model Definition

In the dataset of Boston house price, there are 14 values associated with the home: the first 13 are used to describe various information of house, that is $x_i$ in the model; the last value is the medium price of the house we want to predict, which is $y_i$ in the model.

Therefore, our model can be expressed as:

$$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$

$\hat{Y}$ represents the predicted result of the model and is used to distinguish it from the real value $Y$. The parameters to be learned by the model are: $\omega_1, \ldots, \omega_{13}, b$.

After building the model, we need to give the model an optimization goal so that the learned parameters can make the predicted value $\hat{Y}$ get as close to the true value $Y$. Here we introduce the concept of loss function (Loss Function, or Cost Function. Input the target value $y_{i}$ of any data sample and the predicted value $\hat{y_{i}}$ given by a model. Then the loss function outputs a non-negative real number, which is usually used to represent model error.

For linear regression models, the most common loss function is the Mean Squared Error (MSE), which is:

$$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$

That is, for a test set in size of $n$, $MSE$ is the mean of the squared error of the $n$ data prediction results.

The method used to optimize the loss function is generally the gradient descent method. The gradient descent method is a first-order optimization algorithm. If $f(x)$ is defined and divisible at point $x_n$, then $f(x)$ is considered to be the fastest in the negative direction of the gradient $-▽f(x_n)$ at point of $x_n$. Adjust $x$ repeatedly to make $f(x)$ close to the local or global minimum value. The adjustment is as follows:

$$x_n+1=x_n-λ▽f(x), n≧0$$

Where λ represents the learning rate. This method of adjustment is called the gradient descent method.

### Training Process

After defining the model structure, we will train the model through the following steps.

1. Initialize parameters, including weights $\omega_i$ and bias $b$, to initialize them (eg. 0 as mean, 1 as variance).
2. Forward propagation of network calculates network output and loss functions.  3. Reverse error propagation according to the loss function ( backpropagation ), passing forward the network error from the output layer and updating the parameters in the network.  4. Repeat steps 2~3 until the network training error reaches the specified level or the training round reaches the set value.

## Dataset

### Dataset Introduction

The dataset consists of 506 lines, each containing information about a type of houses in a suburb of Boston and the median price of that type of house. The meaning of each dimensional attribute is as follows:

Property Name Explanation Type
CRIM Per capita crime rate in the town Continuous value
ZN Proportion of residential land with an area of over 25,000 square feet Continuous value
INDUS Proportion of non-retail commercial land Continuous value
CHAS Whether it is adjacent to Charles River Discrete value, 1=proximity; 0=not adjacent
NOX Nitric Oxide Concentration Continuous value
RM Average number of rooms per house Continuous value
AGE Proportion of self-use units built before 1940 Continuous value
DIS Weighted Distance to 5 Job Centers in Boston Continuous value
TAX Tax Rate of Full-value Property Continuous value
PTRATIO Proportion of Student and Teacher Continuous value
B 1000(BK - 0.63)^2, where BK is black ratio Continuous value
LSTAT Low-income population ratio Continuous value
MEDV Median price of a similar home Continuous value

### Data Pre-processing

#### Continuous value and discrete value

Analyzing the data, first we find that all 13-dimensional attributes exist 12-dimensional continuous value and 1-dimensional discrete values (CHAS). Discrete value is often represented by numbers like 0, 1, and 2, but its meaning is different from continuous value's because the difference of discrete value here has no meaning. For example, if we use 0, 1, and 2 to represent red, green, and blue, we cannot infer that the distance between blue and red is longer than that between green and red. So usually for a discrete property with $d$ possible values, we will convert them to $d$ binary properties with a value of 0 or 1 or map each possible value to a multidimensional vector. However, there is no this problem for CHAS, since CHAS itself is a binary attribute .

#### Normalization of attributes

Another fact that can be easily found is that the range of values of each dimensional attribute is largely different (as shown in Figure 2). For example, the value range of attribute B is [0.32, 396.90], and the value range of attribute NOX is [0.3850, 0.8170]. Here is a common operation - normalization. The goal of normalization is to scale the value of each attribute to a similar range, such as [-0.5, 0.5]. Here we use a very common operation method: subtract the mean and divide by the range of values.

There are at least three reasons for implementing normalization (or Feature scaling):

• A range of values that are too large or too small can cause floating value overflow or underflow during calculation.

• Different ranges of number result in different attributes being different for the model (at least in the initial period of training), and this implicit assumption is often unreasonable. This can make the optimization process difficult and the training time greatly longer.

• Many machine learning techniques/models (such as L1, L2 regular items, Vector Space Model) are based on the assumption that all attribute values are almost zero and their ranges of value are similar.

Figure 2. Value range of attributes for all dimensions

#### Organizing training set and testing set

We split the dataset into two parts: one is used to adjust the parameters of the model, that is, to train the model, the error of the model on this dataset is called training error ; the other is used to test.The error of the model on this dataset is called the test error. The goal of our training model is to predict unknown new data by finding the regulation from the training data, so the test error is an better indicator for the performance of the model. When it comes to the ratio of the segmentation data, we should take into account two factors: more training data will reduce the square error of estimated parameters, resulting in a more reliable model; and more test data will reduce the square error of the test error, resulting in more credible test error. The split ratio set in our example is $8:2$

In a more complex model training process, we often need more than one dataset: the validation set. Because complex models often have some hyperparameters (Hyperparameter) that need to be adjusted, we will try a combination of multiple hyperparameters to train multiple models separately and then compare their performance on the validation set to select the relatively best set of hyperparameters, and finally use the model with this set of parameters to evaluate the test error on the test set. Since the model trained in this chapter is relatively simple, we won't talk about this process at present.

## Training

fit_a_line/trainer.py demonstrates the overall process of training.

### Configuring the Data feeder

First we import the libraries:

import paddle
import numpy
import math
import sys
from __future__ import print_function


We introduced the dataset UCI Housing dataset via the uci_housing module

It is encapsulated in the uci_housing module:

2. The process of data preprocessing.

Next we define the data feeder for training. The data feeder reads a batch of data in the size of BATCH_SIZE each time. If the user wants the data to be random, it can define data in size of a batch and a cache. In this case, each time the data feeder randomly reads as same data as the batch size from the cache.

BATCH_SIZE = 20

batch_size=BATCH_SIZE)

batch_size=BATCH_SIZE)


If you want to read data directly from *.txt file, you can refer to the method as follows.

feature_names = [ 'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'convert' ]

feature_num = len(feature_names)

data = numpy.fromfile(filename, sep=' ') # Read primary data from file

data = data.reshape(data.shape[0] // feature_num, feature_num)

maximums, minimums, avgs = data.max(axis=0), data.min(axis=0), data.sum(axis=0)/data.shape[0]

for i in six.moves.range(feature_num-1): data[:, i] = (data[:, i] - avgs[i]) / (maximums[i] - minimums[i]) # six.moves is compatible to python2 and python3

ratio = 0.8 # distribution ratio of train dataset and verification dataset

offset = int(data.shape[0]*ratio)

train_data = data[:offset]

test_data = data[offset:]

### Configure Program for Training

The aim of the program for training is to define a network structure of a training model. For linear regression, it is a simple fully connected layer from input to output. More complex result, such as Convolutional Neural Network and Recurrent Neural Network, will be introduced in later chapters. It must return mean error as the first return value in program for training, for that mean error will be used for BackPropagation.

x = fluid.layers.data(name='x', shape=[13], dtype='float32') # define shape and data type of input
y = fluid.layers.data(name='y', shape=[1], dtype='float32') # define shape and data type of output
y_predict = fluid.layers.fc(input=x, size=1, act=None) # fully connected layer connecting input and output

main_program = fluid.default_main_program() # get default/global main function
startup_program = fluid.default_startup_program() # get default/global launch program

cost = fluid.layers.square_error_cost(input=y_predict, label=y) # use label and output predicted data to estimate square error
avg_loss = fluid.layers.mean(cost) # compute mean value for square error and get mean loss

For details, please refer to: fluid.default_main_program fluid.default_startup_program

### Optimizer Function Configuration

SGD optimizer, learning_rate below are learning rate, which is related to rate of convergence for train of network.

sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_loss)

#Clone main_program to get test_program
# operations of some operators are different between train and test. For example, batch_norm use parameter for_test to determine whether the program is for training or for testing.
#The api will not delete any operator, please apply it before backward and optimization.
test_program = main_program.clone(for_test=True)


### Define Training Place

We can define whether an operation runs on the CPU or on the GPU.

use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() # define the execution space of executor

###executor can accept input program and add data input operator and result fetch operator based on feed map and fetch list. Use close() to close executor and call run(...) to run the program.
exe = fluid.Executor(place)

For details, please refer to: fluid.executor

### Create Training Process

To train, it needs a train program and some parameters and creates a function to get test error in the process of train necessary parameters contain executor, program, reader, feeder, fetch_list, executor represents executor created before. Program created before represents program executed by executor. If the parameter is undefined, then it is defined default_main_program by default. Reader represents data read. Feeder represents forward input variable and fetch_list represents variable user wants to get or name.

num_epochs = 100

def train_test(executor, program, reader, feeder, fetch_list):
accumulated = 1 * [0]
count = 0
outs = executor.run(program=program,
feed=feeder.feed(data_test),
fetch_list=fetch_list)
accumulated = [x_c[0] + x_c[1][0] for x_c in zip(accumulated, outs)] # accumulate loss value in the process of test
count += 1 # accumulate samples in test dataset
return [x_d / count for x_d in accumulated] # compute mean loss


### Train Main Loop

give name of directory to be stored and initialize an executor

%matplotlib inline
params_dirname = "fit_a_line.inference.model"
feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
exe.run(startup_program)
train_prompt = "train cost"
test_prompt = "test cost"
plot_prompt = Ploter(train_prompt, test_prompt)
step = 0

exe_test = fluid.Executor(place)

Paddlepaddle provides reader mechanism to read training data. Reader provide multiple columns of data at one time. Therefore, we need a python list to read sequence. We create a loop to train until the result of train is good enough or time of loop is enough. If the number of iterations for train is equal to the number of iterations for saving parameters, you can save train parameter into params_dirname. Set main loop for training.
for pass_id in range(num_epochs):
avg_loss_value, = exe.run(main_program,
feed=feeder.feed(data_train),
fetch_list=[avg_loss])
if step % 10 == 0: # record and output train loss for every 10 batches.
plot_prompt.append(train_prompt, step, avg_loss_value[0])
plot_prompt.plot()
print("%s, Step %d, Cost %f" %
(train_prompt, step, avg_loss_value[0]))
if step % 100 == 0:  # record and output test loss for every 100 batches.
test_metics = train_test(executor=exe_test,
program=test_program,
fetch_list=[avg_loss.name],
feeder=feeder)
plot_prompt.append(test_prompt, step, test_metics[0])
plot_prompt.plot()
print("%s, Step %d, Cost %f" %
(test_prompt, step, test_metics[0]))
if test_metics[0] < 10.0: # If the accuracy is up to the requirement, the train can be stopped.
break

step += 1

if math.isnan(float(avg_loss_value[0])):
sys.exit("got NaN loss, training failed.")

#save train parameters into the path given before
if params_dirname is not None:
fluid.io.save_inference_model(params_dirname, ['x'], [y_predict], exe)


## Predict

It needs to create trained parameters to run program for prediction. The trained parameters is in params_dirname.

### Prepare Environment for Prediction

Similar to the process of training, predictor needs a program for prediction. We can slightly modify our training program to include the prediction value.

infer_exe = fluid.Executor(place)
inference_scope = fluid.core.Scope()


### Predict

Save pictures

def save_result(points1, points2):
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
x1 = [idx for idx in range(len(points1))]
y1 = points1
y2 = points2
l1 = plt.plot(x1, y1, 'r--', label='predictions')
l2 = plt.plot(x1, y2, 'g--', label='GT')
plt.plot(x1, y1, 'ro-', x1, y2, 'g+-')
plt.title('predictions VS GT')
plt.legend()
plt.savefig('./image/prediction_gt.png')


Via fluid.io.load_inference_model, predictor will read well-trained model from params_dirname to predict unknown data.

with fluid.scope_guard(inference_scope):
[inference_program, feed_target_names,
batch_size = 10

paddle.dataset.uci_housing.test(), batch_size=batch_size) # prepare test dataset

infer_feat = numpy.array(
[data[0] for data in infer_data]).astype("float32") # extract data in test dataset
infer_label = numpy.array(
[data[1] for data in infer_data]).astype("float32") # extract label in test dataset

assert feed_target_names[0] == 'x'
results = infer_exe.run(inference_program,
feed={feed_target_names[0]: numpy.array(infer_feat)},
fetch_list=fetch_targets) # predict
#print predict result and label and visualize the result
print("infer results: (House Price)")
for idx, val in enumerate(results[0]):
print("%d: %.2f" % (idx, val)) # print predict result

print("\nground truth:")
for idx, val in enumerate(infer_label):
print("%d: %.2f" % (idx, val)) # print label

save_result(results[0], infer_label) # save picture


## Summary

In this chapter, we analyzed dataset of Boston House Price to introduce the basic concepts of linear regression model and how to use PaddlePaddle to implement training and testing. A number of models and theories are derived from linear regression model. Therefore, it is not unnecessary to figure out the principle and limitation of linear regression model.

## References

1. https://en.wikipedia.org/wiki/Linear_regression
2. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning[M]. Springer, Berlin: Springer series in statistics, 2001.
3. Murphy K P. Machine learning: a probabilistic perspective[M]. MIT press, 2012.
4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128.