- When a machine learning model has a large error, which aspect should we modify? Take house-price prediction as an example: we have trained a regularized linear regression and obtained the parameter vector, but when we test it on a new sample set we find a huge error. At this point we have the following candidate improvements:

(1) Get more training samples.

(2) Try a smaller set of features, i.e. keep only a subset of the current features.

(3) Try getting additional features (the current features may not be informative enough).

(4) Add higher-order polynomial features.

(5) Increase or decrease the regularization parameter to counter overfitting or underfitting.

Many people improving a model pick one of these methods at random and study it for months, only to end up on a road to nowhere. The improvement method should not be chosen by gut feeling. Next, we discuss how to judge which method will actually improve the algorithm.

- Next, we introduce machine learning diagnostics: tests that first tell us whether an algorithm is working, and if it is not, which changes are most likely to improve it

### New evaluation method of hypothesis function

- So far, we have determined the parameters of the hypothesis function by choosing the parameters that minimize the training error. However, a low training error only tells us how the parameters perform on the old data (the training set); it says nothing about whether they generalize well (i.e. apply to new data)
- Therefore, we evaluate the generalization ability of the hypothesis function with a new procedure: split the data set into two parts, a training set and a test set. A typical split is 7:3, and note that the 70% used for training should be selected at random (of course, if the data are already in random order, simply taking the first 70% is fine)

(1) In linear regression, we first learn the parameters by minimizing the training error, then apply the trained parameters to the test set and compute the test error as the average squared error

(2) Logistic regression (classification) is similar, except that when computing the error we can use a 0/1 misclassification function: it returns 1 when the prediction disagrees with the label (the hypothesis outputs h ≥ 0.5 but the actual y = 0, or h < 0.5 but the actual y = 1), and returns 0 when the prediction agrees with the label
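The two evaluation procedures above can be sketched in NumPy on synthetic data (a minimal sketch; the data-generating process and all variable names are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic house-price-style data: one feature, linear trend plus noise
X = rng.uniform(50, 200, size=100)
y = 3.0 * X + 10 + rng.normal(0, 5, size=100)

# Random 70/30 train/test split
idx = rng.permutation(len(X))
train, test = idx[:70], idx[70:]

# (1) Linear regression: minimize training error by least squares,
# then report the average squared error on the held-out test set
A_train = np.c_[np.ones(70), X[train]]
theta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
A_test = np.c_[np.ones(30), X[test]]
test_error = np.mean((A_test @ theta - y[test]) ** 2) / 2
print(f"linear regression test error: {test_error:.2f}")

# (2) Classification: 0/1 misclassification error, 1 when the
# thresholded prediction disagrees with the label, 0 when it agrees
def misclassification_error(h, y):
    return np.mean((h >= 0.5).astype(int) != y)

h = np.array([0.9, 0.2, 0.6, 0.4])   # hypothetical predicted probabilities
labels = np.array([1, 0, 0, 1])      # hypothetical true labels
print(f"0/1 test error: {misclassification_error(h, labels):.2f}")  # 0.50
```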

- If the test error is too large, we need to go back and modify the model

### Model selection problem

- How to determine the most appropriate polynomial degree for the hypothesis function, how to select the right features for the learning algorithm, and how to choose a suitable regularization parameter are all instances of the model selection problem
- We can solve it by dividing the data into three sets: a training set, a validation set and a test set

### How to choose the proper degree of polynomial

- Recall overfitting: parameters tuned to fit the training set extremely well guarantee good performance on the training set, but say nothing about predictions on new samples outside it
- To choose an appropriate polynomial degree, we introduce a degree parameter d. Suppose first that we split the data into only two parts (training and test); the procedure would be:

(1) For each candidate degree d, minimize the training error and obtain the parameter vector of that hypothesis function

(2) For each hypothesis function, compute the error on the test set, and pick the model with the smallest test error; its degree d is the one we want

- However, the resulting test error is not a fair estimate of how the hypothesis generalizes. We used the test set both to select the degree d and to evaluate the hypothesis: we picked d precisely because its test error was small, so the model may fit the test set well while its performance on genuinely new samples remains unknown
- To solve this problem, we further divide the data into three parts: a training set, a validation set (cross-validation set) and a test set, typically in the ratio 6:2:2

Training error, cross-validation error and test error are defined analogously, each as the average squared error over its own set.

- Unlike the flawed procedure above, which used the test set both to evaluate the hypothesis parameters and to select the degree d, we now use the cross-validation set to select d and reserve the test set for evaluating the chosen hypothesis

First, for each candidate degree, we minimize the training error and obtain the parameter vector of the hypothesis function; then we compute the error of each model on the cross-validation set and select the degree d with the smallest error; finally, we use the test set to estimate the generalization error of the selected model.
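The three-way procedure can be sketched as follows (synthetic data with illustrative names; `np.polyfit` stands in for "minimize the training error"):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=150)
y = x**3 - 2 * x + rng.normal(0, 0.5, size=150)  # true relation is cubic

# 60/20/20 split: train / cross-validation / test
idx = rng.permutation(len(x))
tr, cv, te = idx[:90], idx[90:120], idx[120:]

def half_mse(theta, xs, ys):
    """Average squared error (no regularization term)."""
    return np.mean((np.polyval(theta, xs) - ys) ** 2) / 2

# Step 1: for each degree d, minimize the training error
models = {d: np.polyfit(x[tr], y[tr], d) for d in range(1, 9)}
# Step 2: select d by the smallest cross-validation error
cv_err = {d: half_mse(models[d], x[cv], y[cv]) for d in models}
best_d = min(cv_err, key=cv_err.get)
# Step 3: estimate generalization error on the untouched test set
test_err = half_mse(models[best_d], x[te], y[te])
print(best_d, round(test_err, 3))
```

Because the test set played no role in choosing d, `test_err` is an unbiased estimate of the selected model's generalization error.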

- It is worth mentioning that although splitting into just a training set and a test set is outdated for model selection, many people still do it, because when the sample size is large the resulting estimate is still reasonably accurate

### Why the algorithm performs poorly

- If the algorithm performs poorly, there are usually two causes: high variance or high bias, that is, overfitting or underfitting. Judging whether the problem is bias or variance is very important, because it determines which remedies are worth trying

- As shown in the figure, the two curves represent the training-set error and the cross-validation-set error (squared error), and the horizontal axis is the polynomial degree d. As d increases, the training error keeps decreasing, eventually approaching 0, while the cross-validation error first decreases and then increases (suppose degree 5 is the most appropriate).

- When the cross-validation error and/or training error of the model are large, how do we judge whether the problem is high variance or high bias?

(1) The left end of the figure is high bias: the model uses too small a polynomial degree d. High bias corresponds to underfitting, and the two errors are both large and close to each other (even the training set is not fitted well)

(2) The right side is high variance: the model uses too large a polynomial degree d. High variance corresponds to overfitting: the training error is small, the validation error is large, and the gap between the two errors is large
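These two rules of thumb can be written as a tiny diagnostic helper (the thresholds and names are illustrative, not from the text; in practice one would look at the full curves rather than two numbers):

```python
def diagnose(train_err, cv_err, target_err):
    """Rough bias/variance diagnosis against an acceptable error level."""
    if train_err > target_err and cv_err - train_err < target_err:
        return "high bias"       # both errors large and close: underfitting
    if train_err <= target_err and cv_err - train_err >= target_err:
        return "high variance"   # small training error, large gap: overfitting
    return "ok"

print(diagnose(train_err=2.0, cv_err=2.2, target_err=0.5))  # high bias
print(diagnose(train_err=0.1, cv_err=2.0, target_err=0.5))  # high variance
```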

### The relationship between regularization and bias/variance

- Regularization can prevent overfitting, but over-regularizing leads to underfitting. Since bias and variance are closely tied to underfitting and overfitting, let us discuss how the regularization parameter relates to bias and variance
- We use the regularization term to keep the parameters as small as possible (the intercept θ₀ is not penalized).

(1) If the regularization parameter is very large (e.g. 1000), the parameters are penalized heavily and driven close to 0, the hypothesis degenerates into a horizontal line, and we get high bias and underfitting

(2) If the regularization parameter is too small (close to 0), the penalty is insufficient, the model pursues the training data too aggressively, and we get overfitting and high variance

##### How to automatically select a reasonable regularization parameter

- First, we restate the cost functions. The cost function J that we minimize is the average squared error between the model's predictions and the training data, plus a regularization term. The validation error and test error used for evaluation, however, are defined as plain average squared errors with no regularization term
- Automatic selection of regularization parameters

(1) Take the unregularized case (regularization parameter 0) plus a range of values to try; for example 0.01, 0.02, 0.04, …, 10.24, doubling the step each time. That gives 12 candidate models.

(2) For each candidate, minimize the cost function (the one with the regularization term) and obtain the corresponding parameter vector

(3) Then evaluate each hypothesis on the cross-validation set (without the regularization term), select the one with the smallest average squared error, and finally measure that model's average squared error on the test set

In essence, the cross-validation error measures how well each regularization parameter lets the model fit new samples, and the test set is used to evaluate the generalization ability of the selected model
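The three steps above can be sketched with ridge-style regularized linear regression (synthetic data; the closed-form solve and all names are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 8))
y = X @ np.array([1.5, -2.0, 0, 0, 0, 0, 0, 0]) + rng.normal(0, 0.3, size=120)

# 60/20/20 split: train / cross-validation / test
idx = rng.permutation(120)
tr, cv, te = idx[:72], idx[72:96], idx[96:]

def fit(X, y, lam):
    """Minimize the regularized cost in closed form."""
    A = np.c_[np.ones(len(X)), X]
    P = np.eye(A.shape[1])
    P[0, 0] = 0.0                       # do not penalize the intercept
    return np.linalg.solve(A.T @ A + lam * P, A.T @ y)

def half_mse(theta, X, y):
    """Evaluation error: average squared error, NO regularization term."""
    A = np.c_[np.ones(len(X)), X]
    return np.mean((A @ theta - y) ** 2) / 2

# (1) 0 plus 0.01, 0.02, ..., 10.24 (doubling): 12 candidates
lams = [0.0] + [0.01 * 2**k for k in range(11)]
# (2) fit each candidate on the training set
thetas = {lam: fit(X[tr], y[tr], lam) for lam in lams}
# (3) pick by cross-validation error, then report the test error
cv_err = {lam: half_mse(thetas[lam], X[cv], y[cv]) for lam in lams}
best = min(cv_err, key=cv_err.get)
print(best, round(half_mse(thetas[best], X[te], y[te]), 3))
```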

- The key point to understand is why the regularization term is sometimes included and sometimes not: when computing the parameter vector we need the regularization term to penalize the parameters, so it is included; afterwards we are only evaluating fit, so it is left out
- Next, look at how the cross-validation error and training error (both without the regularization term) change as the regularization parameter varies. When the parameter is too small, the penalty is insufficient: the training error is small, the validation error is large, high variance, overfitting. When the parameter is too large, the hypothesis degenerates into a horizontal line: both the training and validation errors are large, high bias, underfitting

### Learning curve

- Combining all the previous concepts, we build a model-diagnosis tool: the learning curve, used to diagnose whether the learning algorithm suffers from high bias, high variance, or both

##### Learning curve

- The learning curve helps us observe whether the algorithm is in a high-bias or high-variance state, and thus how to further improve it
- Before drawing the learning curve, let us first understand how the average squared errors on the training set and validation set change with the number of training samples m.

(1) When m is small, the model easily fits every data point and the training error is close to 0; as m grows, fitting every point becomes harder and the average training error gradually increases

(2) For the cross-validation error: when the number of samples is small, the model generalizes poorly and the cross-validation error is large; as the amount of data grows, generalization improves and the cross-validation error shrinks
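These two trends can be reproduced numerically (synthetic linear data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=200)
y = 2 * x + 1 + rng.normal(0, 1.0, size=200)

x_cv, y_cv = x[150:], y[150:]          # fixed cross-validation set
sizes = [2, 5, 10, 20, 50, 100, 150]
train_err, cv_err = [], []

for m in sizes:
    theta = np.polyfit(x[:m], y[:m], 1)   # fit a line on the first m samples
    train_err.append(np.mean((np.polyval(theta, x[:m]) - y[:m]) ** 2) / 2)
    cv_err.append(np.mean((np.polyval(theta, x_cv) - y_cv) ** 2) / 2)

for m, tr, cv in zip(sizes, train_err, cv_err):
    print(f"m={m:3d}  train={tr:6.3f}  cv={cv:6.3f}")
```

Plotting `train_err` and `cv_err` against `sizes` gives the learning curve: the training error starts near 0 and rises toward the noise level, while the cross-validation error starts high and falls toward it.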

##### What the learning curve looks like under high bias or high variance

- High bias: for example, suppose we fit the model with a straight line. As the number of samples grows, the generalization ability improves somewhat, but because a straight line cannot fit the data well, the cross-validation error decreases only to a point and then flattens out at a high level. With very few samples the training set is fitted almost exactly and the training error is tiny; as m increases, the training error grows and approaches the cross-validation error.

In short, under high bias both the cross-validation error and the training error are large, and as we keep increasing the number of samples m, the cross-validation error flattens out and no longer decreases. Recognizing high bias is therefore very valuable: it stops us from wasting time collecting more data (with high bias, more data does little to improve the model's generalization ability)

- High variance: when the number of samples m is small, the fit is very good and the training error is tiny; as m grows, fitting becomes harder and the training error gradually increases, but it remains small. The cross-validation error, by contrast, is large, and a wide gap persists between the two curves; as more samples are added, the gap slowly narrows.

In short, under high variance, increasing the number of samples does help the algorithm. Recognizing high variance is therefore also meaningful: it helps us decide whether collecting more sample data is worthwhile

- When the algorithm does not achieve the desired performance, we usually draw a learning curve first, determine whether the problem is bias or variance, and from that judge which of the following remedies are effective and which are futile

(1) Getting more training data helps with high variance, but not with high bias.

(2) Trying a smaller set of features is effective for high variance but meaningless for high bias (fewer features means fewer independent variables, so the curve is simpler and less prone to overfitting)

(3) Getting additional features is effective for high bias and ineffective for high variance (more independent variables make the function more expressive)

(4) Adding polynomial features (increasing the degree) is effective for high bias and ineffective for high variance

(5) Decreasing the regularization parameter corrects high bias, and increasing the regularization parameter corrects high variance
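The remedies above and the problem each addresses can be summarized as a small lookup table (a sketch; the phrasing is my own shorthand for the five rules):

```python
# Which diagnosis each remedy addresses, per the rules above
REMEDIES = {
    "high variance": [
        "get more training examples",
        "try a smaller set of features",
        "increase the regularization parameter",
    ],
    "high bias": [
        "try getting additional features",
        "add polynomial features",
        "decrease the regularization parameter",
    ],
}

def suggest(diagnosis):
    """Return the remedies worth trying for a given diagnosis."""
    return REMEDIES[diagnosis]

print(suggest("high bias"))
```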

### The relationship between the previous content and neural networks

- Generally speaking, a neural network with few hidden units and only one hidden layer is relatively simple: fewer neurons, fewer parameters and less computation, but it is prone to underfitting

Conversely, a neural network with more hidden units or more hidden layers is more complex: more neurons, more parameters and a larger amount of computation, and it is prone to overfitting

- But the large number of parameters and the computational cost are not the main concerns, and overfitting can be addressed by regularization. Therefore, a large neural network with regularization usually performs better than a small neural network
- When choosing the number of hidden layers, a single hidden layer is generally the default choice