How to build Gradient Boosting Regressor in Python?

See the Jupyter Notebook for the concepts covered in this article on building machine learning models, and my Medium and LinkedIn for other Data Science and Machine Learning tutorials.

Ensemble, in general, means a group of things that are usually seen as a whole. We have three main categories:

[Image: the three main ensemble categories: Bagging, Boosting, and Voting]

Bagging

Used for building multiple models (typically of the same type) from different subsets of the training dataset.

Boosting

Used for constructing multiple models (typically of the same type), where each model learns to correct the errors generated by the previous model within the sequence of created models.

Voting

Used for building multiple models (typically of different types). Simple statistics (such as average) are used to combine predictions.

Ensemble methods have proven to be a powerful way to improve the accuracy and robustness of supervised, semi-supervised, and unsupervised solutions.
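To make the three categories concrete, here is a minimal sketch of how each could be instantiated in scikit-learn; the estimators and parameters below are illustrative and are not part of the example we build in this article.

from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Bagging: many trees, each trained on a bootstrap subset of the training data
bagging = BaggingRegressor(n_estimators = 50, random_state = 0)

# Boosting: each new tree corrects the errors of the previous ones
boosting = GradientBoostingRegressor(n_estimators = 100, random_state = 0)

# Voting: different model types combined by averaging their predictions
voting = VotingRegressor(estimators = [('lr', LinearRegression()),
                                       ('dt', DecisionTreeRegressor(max_depth = 3))])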

Previously we saw a type of Ensemble method that is considered quite sophisticated: Gradient Boosting. The Gradient Boosting method combines the Boosting technique with gradient descent to predict the residuals of the base estimators. In other words, the algorithm creates a sequence of base estimators, and each new estimator is fit to the residuals left by the previous ones, so that the residuals are reduced successively and the ensemble becomes more accurate.
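To illustrate the idea, here is a simplified sketch with squared loss (not the scikit-learn internals): each new tree is fit to the residuals left by the current ensemble prediction.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_boosting(X, y, n_estimators = 100, learning_rate = 0.1):
    # Start from a constant prediction (the mean of the target)
    prediction = np.full(y.shape, y.mean())
    trees = []
    for _ in range(n_estimators):
        residuals = y - prediction                      # errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth = 1).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)   # shrink and add the new tree
        trees.append(tree)
    return trees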

In the case of Gradient Boosting, even in classification models, the base estimators are regression trees. 

Previously, we created classification models with Gradient Boosting and could see that the base estimators were regression trees, even though we were building a classifier.
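A quick way to verify this is to inspect the estimators_ attribute of a fitted classifier; this sketch uses a synthetic classification dataset rather than the data from this article.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

Xc, yc = make_classification(n_samples = 200, random_state = 0)
clf = GradientBoostingClassifier(n_estimators = 10).fit(Xc, yc)

# Even in a classifier, each base estimator is a DecisionTreeRegressor
print(type(clf.estimators_[0, 0]))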

Building A Gradient Boosting Regressor in Python

From now on, we will build a Gradient Boosting Regressor. But first, let's create a chart to visualize a simulated mass of data and identify the ideal regression line; we will then work on top of this scenario from various perspectives.

First, we import NumPy and train_test_split from sklearn's model_selection module to divide the data into training and testing. We also add %matplotlib inline and %pylab inline to build the graphics within the Jupyter Notebook itself, including each of the chart layers. Every chart is constructed under the grammar of graphics concept, where multiple layers compose a single chart.

To have all the layers organized, we use %matplotlib inline and %pylab inline.

1. We define the size of the figure, that is, the size of the chart, through the FIGSIZE object;

2. We define a function called reg_line that will draw the approximation of the regression function, which is what we want to predict;

3. We define another function called gen_data that will create a mass of data, with the number of samples defaulting to 200;

4. For X, we generate a uniform random distribution based on the n_samples we pass as a parameter to np.random.uniform;

5. For y, we apply ravel to X to flatten it into a vector, since X is in matrix format. To create a regression model, we must put the data in a shape that relates X and y;

6. We divide the data into training and test sets with train_test_split;

7. We call gen_data and pass 100 as a parameter; the function returns four sets;

8. We generate a linspace, a sequence of values on which we will draw the regression line;

9. Finally, we define a plot_data function that sets the size of the figure, calls the plot function with reg_line over x_plot, applies the alpha value and labels, and finally the formatting.

import numpy as np
from sklearn.model_selection import train_test_split
%matplotlib inline
%pylab inline
FIGSIZE = (11, 7)

# Function approximation (optimal regression line)
def reg_line(x):
    return x * np.sin(x) + np.sin(2 * x)

# Generating training and test data
def gen_data(n_samples = 200):
    # Generating random data mass
    np.random.seed(15)
    X = np.random.uniform(0, 10, size = n_samples)[:, np.newaxis]
    y = reg_line(X.ravel()) + np.random.normal(scale = 2, size = n_samples)

    # Dividing into training and testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size = 0.2, random_state = 3)
    return X_train, X_test, y_train, y_test

# Constructing datasets
X_train, X_test, y_train, y_test = gen_data(100)

# Data for the regression line
x_plot = np.linspace(0, 10, 500)

# Plotting data
def plot_data(alpha = 0.4, s = 20):

    # Creating figure
    fig = plt.figure(figsize = FIGSIZE)

    # Generating plot (ideal regression line)
    gt = plt.plot(x_plot, reg_line(x_plot), alpha = alpha)

    # Plotting training and test data
    plt.scatter(X_train, y_train, s = s, alpha = alpha)
    plt.scatter(X_test, y_test, s = s, alpha = alpha, color = 'red')
    plt.xlim((0, 10))
    plt.ylabel('y')
    plt.xlabel('x')

# Formatting for the annotations used in the charts below
annotation_kw = {'xycoords': 'data', 'textcoords': 'data', 'arrowprops': {'arrowstyle': '->', 'connectionstyle': 'arc'}}

# Plot
plot_data()

# Blue - Training
# red - Test        
[Plot: training data (blue) and test data (red) scattered around the ideal regression line]

X is the predictor (independent) variable, while y is our dependent target variable.

There is some relationship between the two variables. The blue data points represent the training data, and the data points in red are the test data.

The blue line represents the regression line that contains the model predictions. In our case, we have an ideal regression curve that captures part of the variance in the data without overfitting (the model learns too much) or underfitting (the model learns too little). Therefore, we want the regression line to capture as much of the variance as possible.

Plotting 2 Trees with Different Depths

First, we import an individual model rather than the ensemble methods we have seen so far: the DecisionTreeRegressor class from Scikit-Learn's tree module.

Let's create two Regression Tree Machine Learning models. The only difference between the two models is the adjustment we will make to the max_depth parameter, which determines the depth of the Regression Trees; we also adjust the alpha used for plotting.

We apply both models to the dataset we created earlier:

from sklearn.tree import DecisionTreeRegressor
plot_data()
# Decision trees with max-depth = 1
est = DecisionTreeRegressor(max_depth = 1).fit(X_train, y_train)
plt.plot(x_plot, est.predict(x_plot[:, np.newaxis]), label = 'max_depth=1', color = 'g', alpha = 0.9, linewidth = 3)

# Decision trees with max-depth = 3
est = DecisionTreeRegressor(max_depth = 3).fit(X_train, y_train)
plt.plot(x_plot, est.predict(x_plot[:, np.newaxis]), label = 'max_depth=3', color = 'g', alpha = 0.7, linewidth = 1)

# Legend position
plt.legend(loc = 'upper left')        
[Plot: regression trees with max_depth = 1 and max_depth = 3 against the ideal regression line]

The thin blue line is our ideal regression line. The thickest green line belongs to the first regression model, with max_depth of 1, that is, a regression tree with minimal depth. The thicker line shows the lack of learning of this shallower model, an underfitting problem where it cannot capture most of the variance in the data.

In the second model, with the thin green line, we increased the depth of the regression tree. Despite still underfitting at some points, the model captures the variance of the data much better and gets closer to the ideal blue regression line.

A straightforward adjustment to the hyperparameters makes all the difference in the model. All we did was change a couple of parameters from one model to the other, essentially the max_depth value.

Each model will fit better according to the problem at hand and the available dataset. However, for the dataset we are simulating, the two individual models above are not ideal for making predictions. So, let's work with the Gradient Boosting model.

Applying Gradient Boosting Regressor

Here we import the GradientBoostingRegressor and the islice function, which we will use to iterate over the staged predictions.

Instead of plotting the green lines as we did earlier, we will plot the line produced by the Gradient Boosting model. First, we create the estimator, establishing 1,000 base estimators for our regression model. Next, we fit the model and put everything together in a plot.

With the islice function, we step through the staged_predict method of the estimator we named est, that is, prediction in stages. Every ten estimators, we take the predictions at that point, add them to the plot, and annotate them with the annotate method according to the position at that stage.

from itertools import islice
from sklearn.ensemble import GradientBoostingRegressor
plot_data()

# Gradient Boosting Regressor
est = GradientBoostingRegressor(n_estimators = 1000, max_depth = 1, learning_rate = 1.0)

# Training Model
est.fit(X_train, y_train)
ax = plt.gca()
first = True

# Steps through forecasts as we add more trees
for pred in islice(est.staged_predict(x_plot[:, np.newaxis]), 0, est.n_estimators, 10):
    plt.plot(x_plot, pred, color = 'r', alpha = 0.2)
    if first:
        ax.annotate('High Bias - Low Variance', 
                    xy = (x_plot[x_plot.shape[0] // 2], pred[x_plot.shape[0] // 2]), 
                    xytext = (4, 4), 
                    **annotation_kw)
        first = False

# Predictions
pred = est.predict(x_plot[:, np.newaxis])
plt.plot(x_plot, pred, color = 'r', label = 'GBRT max_depth=1')
ax.annotate('Low Bias - High Variance', 
            xy = (x_plot[x_plot.shape[0] // 2], pred[x_plot.shape[0] // 2]), 
            xytext = (6.25, -6), 
            **annotation_kw)

# Legend position
plt.legend(loc = 'upper left')        
[Plot: staged Gradient Boosting predictions converging to an overfit red curve around the ideal blue line]

Once again, we have the blue line, our ideal regression line. Our model built with Gradient Boosting is clearly suffering from overfitting: the larger the number of trees, the harder the model tries to fit the variance in the data.

The blue line alone can explain much of the variance of the data, even if it misses some data points, which is entirely natural. However, the red line follows all the data points; that is, the red line representing the regression model with Gradient Boosting learned excessively from the data. In practice, it did not learn the underlying mathematical function, but rather the details of the data.

Therefore, when we present new data to this model, it will not generalize to new entries, because it did not reach an approximate mathematical function, only the details contained in the training data.

Diagnosing If the Model Suffers from Overfitting

Here we can diagnose objectively if the model suffers from overfitting.

We define the deviance_plot function, which builds the plot; it receives the estimator we created, the test data, and the attributes needed to customize the chart.

After retrieving the number of estimators, we create a for loop to walk through each of the prediction stages with the staged_predict method and collect the loss function results, which measure the difference between the observed values and the values predicted by the model.

def deviance_plot(est, X_test, y_test, ax=None, label='', train_color='#2c7bb6', test_color='#d7191c', alpha=1.0, ylim = (0, 10)):
    n_estimators = len(est.estimators_)
    test_dev = np.empty(n_estimators)
    # Collects the loss (deviance) on the test data at each boosting stage
    for i, pred in enumerate(est.staged_predict(X_test)):
        test_dev[i] = est.loss_(y_test, pred)

    if ax is None:
        fig = plt.figure(figsize = FIGSIZE)
        ax = plt.gca()
        
    ax.plot(np.arange(n_estimators) + 1, test_dev, color = test_color, label = 'Test %s' % label, linewidth = 2, alpha = alpha)
    ax.plot(np.arange(n_estimators) + 1, est.train_score_, color = train_color, label = 'Train %s' % label, linewidth = 2, alpha = alpha)
    ax.set_ylabel('Error')
    ax.set_xlabel('Number of Base Estimators')
    ax.set_ylim(ylim)
    return test_dev, ax

# Applies the function to the test data to measure the overfitting of our model (est)
test_dev, ax = deviance_plot(est, X_test, y_test)
ax.legend(loc = 'upper right')

# Legend
ax.annotate('Lower level of error in test dataset', 
            xy = (test_dev.argmin() + 1, test_dev.min() + 0.02), 
            xytext = (150, 3.5), 
            **annotation_kw)
ann = ax.annotate('', xy = (800, test_dev[799]),  xycoords = 'data',
                  xytext = (800, est.train_score_[799]), textcoords = 'data',
                  arrowprops = {'arrowstyle': '<->'})
ax.text(810, 3.5, 'Gap Training-Test')        

Once the definition is complete, we apply deviance_plot to the estimator and the test data, build the legend, and put it all on the chart.

[Plot: training and test error versus number of base estimators, with the training-test gap annotated]

On the X-axis, we have the number of base estimators, and on the Y-axis, we have the error rate, which goes from 0 to 10. The red line shows the test data and the blue line the training data.

At first, when we have few estimators, the error tends to be very high; then the error begins to fall gradually as we increase the number of base estimators.

When it reaches 1,000 estimators, the training error (blue line) is very low, while in the test data the error falls until it reaches a minimum point and then something goes wrong: the error in the test data increases!

There is no approximate mathematical function!

The error levels begin to move in opposite directions, with the training error decreasing while the test error increases; that is what characterizes overfitting. Our model has learned so much from the training data that it can no longer make good predictions on new test data.

There is no approximate mathematical function; what we have are memorized details of the training data, and the performance ends up being terrible when the model gets the test data.

What is the ideal model?

The ideal model would stop training at the lowest point indicated by the arrow, that is, just before the test error begins to rise abruptly. We then need to train the model up to that point and observe its performance on the test set.

A widely used technique in Deep Learning is to use validation sets. When the dataset is very large, we train and validate almost simultaneously and stop training when the validation performance reaches its minimum error level, the ideal point of the model, which for the model in question is around 80 estimators.

[Plot: the deviance curves with the ideal stopping point, around 80 estimators, before the test error rises]
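As a sketch of how this early-stopping idea could be applied here (assuming the est, X_train, y_train, X_test, and y_test objects defined above, and a recent scikit-learn version for the built-in early-stopping parameters):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Option 1: pick the boosting stage with the lowest test error
test_errors = [mean_squared_error(y_test, pred)
               for pred in est.staged_predict(X_test)]
best_n = int(np.argmin(test_errors)) + 1
print('Estimators at the minimum test error:', best_n)

# Option 2: let scikit-learn hold out an internal validation split and
# stop adding trees when the validation score stops improving
est_early = GradientBoostingRegressor(n_estimators = 1000, max_depth = 1,
                                      learning_rate = 1.0,
                                      validation_fraction = 0.2,
                                      n_iter_no_change = 10,
                                      random_state = 1)
est_early.fit(X_train, y_train)
print('Trees actually built:', est_early.n_estimators_)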

Gradient Boosting models are much more sophisticated and increase the accuracy of predictions. Still, they learn so aggressively that we must regularize them so they do not suffer from overfitting. Stochastic Gradient Boosting is a handy form of regularization for mitigating this type of obstacle.

Now we will discuss some model regularization techniques that help avoid overfitting. Initially, we created a mass of data with a relationship between X and y, plotted both training and test data points, and drew the optimal regression curve; that is, when creating the Machine Learning model, the result should be something like the blue line below:

[Plot: the simulated data with the ideal regression line]

Next, we created two individual models, and clearly the two models were not able to perform well; that is, they could not learn enough from the mass of data we generated:

[Plot: the two underfitting decision tree models]

Therefore, we created an Ensemble model with Gradient Boosting. However, while individual models learned little, the Ensemble model ended up learning too much! 

[Plot: the overfitting Gradient Boosting model]

Above, we also found that the error rate starts at a maximum value and decreases for both the training and test data until, at a given point, it starts to increase again in the test data. In other words, when the error rate in the training data keeps decreasing while the error rate rises in the test data, that is an indication of model overfitting.

The gap pointed out in the image above has to be as small as possible! And it is with regularization that we can adjust this divergence. 

Regularization (Avoid Overfitting)

We have 3 main techniques when working with Ensemble method models: 

  1. Change the structure of the tree
  2. Shrinkage
  3. Stochastic Gradient Boosting

Changing the Tree Structure

Let's start with the technique of changing the tree structure. Here we create the estimator with the same hyperparameters as the first version of our GradientBoostingRegressor: 1,000 base estimators, max_depth of 1, and learning rate of 1.0.

When applying regularization or tuning hyperparameters, we should avoid changing many parameters simultaneously. Instead, we should change a maximum of two parameters at a time so that we do not get lost.

Below we create a function called fmt_params that receives the params dictionary as input. Then we define the colors, change the hyperparameters, create the regressor, train the model, build the plot with the test data, and add annotations to the chart:

def fmt_params(params):
    return ", ".join("{0}={1}".format(key, val) for key, val in params.items())


fig = plt.figure(figsize = FIGSIZE)
ax = plt.gca()


for params, (test_color, train_color) in [({}, ('#d7191c', '#2c7bb6')), ({'min_samples_leaf': 3}, ('#fdae61', '#abd9e9'))]:
    est = GradientBoostingRegressor(n_estimators = 1000, max_depth = 1, learning_rate = 1.0)
    est.set_params(**params)
    est.fit(X_train, y_train)
    test_dev, ax = deviance_plot(est, 
                                 X_test, 
                                 y_test, 
                                 ax = ax, 
                                 label = fmt_params(params),
                                 train_color = train_color, 
                                 test_color = test_color)
    
ax.annotate('High Bias', xy = (900, est.train_score_[899]), xytext= ( 600, 3), **annotation_kw)
ax.annotate('Low Variance', xy = (900, test_dev[899]), xytext = (600, 3.5), **annotation_kw)
plt.legend(loc = 'upper right')        
[Plot: deviance curves before and after setting min_samples_leaf = 3]

The lines in red and dark blue represent the first version of our model, while the lines in orange and light blue represent the change of only one hyperparameter, min_samples_leaf, set to 3 to ensure a larger number of samples per tree leaf.

We got a good result! We were able to reduce the test error and increase the training error, i.e., reduce the gap and reduce the overfitting, composing a more generalizable model that learns fewer details of the training data and makes better predictions on new datasets.

Shrinkage

Another regularization technique is to reduce how much each tree learns, by reducing the learning_rate. Ensemble methods, in general, are so efficient and learn so much about the data that we need to limit their learning ability!

If the method learns too much, that is bad; we always need to maintain a level of generalization. We can reduce the learning rate while keeping the rest of the model's hyperparameters unchanged and evaluate whether this helps reduce overfitting.

fig = plt.figure(figsize = FIGSIZE)
ax = plt.gca()


for params, (test_color, train_color) in [({}, ('#d7191c', '#2c7bb6')), ({'learning_rate': 0.1}, ('#fdae61', '#abd9e9'))]:
    est = GradientBoostingRegressor(n_estimators = 1000, max_depth = 1, learning_rate = 1.0)
    est.set_params(**params)
    est.fit(X_train, y_train)
    
    test_dev, ax = deviance_plot(est, 
                                 X_test, 
                                 y_test, 
                                 ax = ax, 
                                 label = fmt_params(params),
                                 train_color = train_color, 
                                 test_color = test_color)
    
ax.annotate('Requires more trees', xy = (200, est.train_score_[199]), xytext=(300, 1.75), **annotation_kw)
ax.annotate('Minor error in test dataset', xy = (900, test_dev[899]), xytext=(600, 1.75), **annotation_kw)


plt.legend(loc = 'upper right')        
[Plot: deviance curves with learning_rate = 0.1 versus the original learning_rate = 1.0]

With a learning_rate of 0.1, the error rate dropped considerably, reducing the gap between the training error and the test error, that is, getting closer and closer to eliminating the overfitting of the model.

When the model generalizes well and has a low error rate on the test data, it is ready for use in production and for solving the problem for which it was created.
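As a side note, here is a minimal sketch of how the trained estimator could be persisted for later use; it assumes the joblib package is installed, and the file name is purely illustrative.

import joblib

# Save the trained estimator to disk (file name is hypothetical)
joblib.dump(est, 'gbr_model.joblib')

# Later, load it back and predict on new data
loaded_est = joblib.load('gbr_model.joblib')
print(loaded_est.predict(X_test[:5]))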

Stochastic Gradient Boosting

Stochastic Gradient Boosting is a statistical technique widely used in artificial intelligence. The idea resembles what is done in Deep Learning, where datasets are often huge and training on everything at once is impractical.

With this in mind, instead of using the entire dataset, we can draw subsamples from the complete dataset and train the model on those subsamples.

This time we will change two hyperparameters, since we have already seen in the shrinkage technique that reducing the learning_rate produces an expressive result.

fig = plt.figure(figsize=FIGSIZE)
ax = plt.gca()
for params, (test_color, train_color) in [({}, ('#d7191c', '#2c7bb6')), ({'learning_rate': 0.1, 'subsample': 0.7}, ('#fdae61', '#abd9e9'))]:
    est = GradientBoostingRegressor(n_estimators = 1000, max_depth = 1, learning_rate = 1.0, random_state = 1)
    est.set_params(**params)
    est.fit(X_train, y_train)
    test_dev, ax = deviance_plot(est, 
                                 X_test, 
                                 y_test, 
                                 ax = ax, 
                                 label = fmt_params(params), 
                                 train_color=train_color, 
                                 test_color=test_color)
    
ax.annotate('Lowest Error Rate in Test Dataset', xy = (400, test_dev[399]), xytext = (500, 3.0), **annotation_kw)


plt.legend(loc = 'upper right', fontsize='small')        
[Plot: deviance curves with learning_rate = 0.1 and subsample = 0.7, with the lowest test error annotated]

In addition to reducing the learning_rate, we used subsampling. Training was slower but gradually reduced errors both in training and in testing. Near the end, the error rate in the test data begins to increase slightly. We can apply another technique to stop training early when we reach the minimum point indicated; that is, about 400 estimators would be ideal for training instead of 1,000.

What the technique does is draw a subsample of the training dataset before growing each tree. We can also subsample the attributes before finding the best split at each node (max_features).

In practice, Stochastic Gradient Boosting is a regularization technique that allows a more uniform learning/training of the Ensemble model, avoiding overfitting and also helping when the data does not fit in the computer's memory.
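As a sketch of combining the two forms of subsampling, the values below are illustrative; note that max_features only has a visible effect when there is more than one feature, unlike our single-feature example.

from sklearn.ensemble import GradientBoostingRegressor

est_sgb = GradientBoostingRegressor(n_estimators = 1000, max_depth = 1,
                                    learning_rate = 0.1,
                                    subsample = 0.7,     # 70% of the rows for each tree
                                    max_features = 0.5,  # fraction of features tried per split
                                    random_state = 1)
est_sgb.fit(X_train, y_train)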

There is no single right method or technique; there is no one correct answer! What defines the success of the model is whether we end up with a generalizable model with the lowest possible error rate and a high level of accuracy.

Hyperparameter Tuning with Grid Search

We still have one more card to play: we have not applied hyperparameter tuning at any point yet.

So far, we have covered several Gradient Boosting models, and in all cases we manually experimented with the hyperparameter values. Nothing prevents us from applying fine-tuning to the hyperparameters to find an ideal combination.

Below we import GridSearchCV from the model_selection module of Scikit-learn and then create a grid of parameters, assigning it to the param_grid object. Then, we let the cross-validated search find the best combination of the parameters defined in param_grid.

from sklearn.model_selection import GridSearchCV
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)


# Parameters Grid
param_grid = {'learning_rate': [0.1, 0.01, 0.001],
              'max_depth': [4, 5, 6],
              'min_samples_leaf': [3, 4, 5],
              'subsample': [0.3, 0.5, 0.7],
              'n_estimators': [400, 700, 1000, 2000, 3000]
              }


# Regressor
est = GradientBoostingRegressor()


# Template created with GridSearchCV
gs_cv = GridSearchCV(est, param_grid, scoring = 'neg_mean_squared_error', n_jobs = 4).fit(X_train, y_train)


# Prints the best parameters
print('Best Hyperparameters %r' % gs_cv.best_params_)        

We create the estimator with GradientBoostingRegressor, call the GridSearchCV function to create several models with different combinations of hyperparameters, pass the estimator as a parameter, then the param_grid, and fit them all.

Finally, we print best_params_, the best hyperparameter values found by GridSearchCV, which recommends a learning_rate of 0.001, max_depth of 5, min_samples_leaf of 3, n_estimators of 3000, and subsample of 0.5.

Recreating the Model with the Best Parameters

Here we create our estimator using the best parameters returned by GridSearchCV. When we look at the resulting red line, it is very similar to the ideal blue regression line.

From everything we've seen so far, hyperparameter tuning is probably the best alternative, though computationally intensive.
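As a sketch of that final step (assuming the gs_cv object from the grid search above and the plotting helpers defined earlier):

# Rebuild the regressor with the best hyperparameters found by GridSearchCV
best_est = GradientBoostingRegressor(**gs_cv.best_params_)
best_est.fit(X_train, y_train)

# Compare the tuned model's curve with the ideal regression line
plot_data()
plt.plot(x_plot, best_est.predict(x_plot[:, np.newaxis]),
         color = 'r', label = 'GBRT tuned')
plt.legend(loc = 'upper left')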


Leonardo Anello
