How to model with gradient boosting machine in R

Data Journalism in R, How to
Share on FacebookShare on Google+Tweet about this on TwitterPin on PinterestShare on LinkedInEmail this to someone

The tutorial is part 2 of our #tidytuesday post from last week, which explored bike rental data from Washington, D.C. Check it out here. The following tutorial will use a gradient boosting machine (GBM) to figure out what drives bike rental behavior.

GBM is unique compared to other decision tree algorithms because it builds models sequentially with higher weights given to those cases that were poorly predicted in previous models, thus improving accuracy incrementally instead of simply taking an average of all models like a random forest algorithm would. By reducing the error iteratively to produce what will become the final model, GBM is an efficient and powerful algorithm for classification and regression problems.

Implementing GBM in R allows for a nice selection of exploratory plots including parameter contribution, and partial dependence plots which provide a visual representation of the effect across values of a feature in the model.

Load the packages

The packages below are needed to complete this analysis.

# packages --------------------------------------------------------------
library(rsample)      
library(caret)        
library(ggthemes)
library(scales)
library(wesanderson)
library(tidyverse)
library(gbm)
library(Metrics)
library(here)

Run previous scripts to import and wrangle data.

base::source("code/01-import.R")
base::source("code/02.3-wrangle.R")
base::source("code/03.3-visualize.R") # we don't neeed to run 03-vizualize.R,
# because it does not change the underlying data structure, but we wantto view 
# the EDA outputs

What did we learn in part 1?

The exploratory data analysis showed us that bike rentals seem to drop off at a certain temperature (~20˚C).

ggRentalsByTemp

And that rentals were lower on holidays compared to non-holidays.

So now that we have some understanding of the variables in BikeData, we can use the Generalized Boosted Regression Models (gbm) package to model the bike rental data.

The data

BikeData %>% glimpse()
 
    ## Observations: 731
    ## Variables: 30
    ## $ weekday         6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2…
    ## $ weekday_chr     "Saturday", "Sunday", "Monday", "Tuesday", "Wednesda…
    ## $ weekday_fct     Saturday, Sunday, Monday, Tuesday, Wednesday, Thursd…
    ## $ holiday         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
    ## $ holiday_chr     "Non-Holiday", "Non-Holiday", "Non-Holiday", "Non-Ho…
    ## $ holiday_fct     Non-Holiday, Non-Holiday, Non-Holiday, Non-Holiday, …
    ## $ season          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
    ## $ season_chr      "Spring", "Spring", "Spring", "Spring", "Spring", "S…
    ## $ season_fct      Spring, Spring, Spring, Spring, Spring, Spring, Spri…
    ## $ workingday      0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1…
    ## $ workingday_chr  "Non-Working Day", "Non-Working Day", "Working Day",…
    ## $ workingday_fct  Non-Working Day, Non-Working Day, Working Day, Worki…
    ## $ month_ord       Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Ja…
    ## $ month_fct       January, January, January, January, January, January…
    ## $ yr              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ yr_chr          "2011", "2011", "2011", "2011", "2011", "2011", "201…
    ## $ yr_fct          2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
    ## $ weathersit      2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2…
    ## $ weathersit_chr  "Clouds/Mist", "Clouds/Mist", "Good", "Good", "Good"…
    ## $ weathersit_fct  Clouds/Mist, Clouds/Mist, Good, Good, Good, Good, Cl…
    ## $ instant         1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
    ## $ dteday          2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 201…
    ## $ mnth            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
    ## $ temp            8, 9, 1, 1, 2, 1, 1, 0, -1, 0, 0, 0, 0, 0, 2, 2, 0, …
    ## $ atemp           28.36325, 28.02713, 22.43977, 23.21215, 23.79518, 23…
    ## $ hum             80, 69, 43, 59, 43, 51, 49, 53, 43, 48, 68, 59, 47, …
    ## $ windspeed       10, 16, 16, 10, 12, 6, 11, 17, 24, 14, 8, 20, 20, 8,…
    ## $ casual          331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25,…
    ## $ registered      654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 12…
    ## $ cnt             985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 13…

Variables of interest

First we want build a data frame for our model, which includes our outcome variable cnt (the ‘count of total rental bikes including both casual and registered’) and the features we want to help explain:

season_fct = Factor w/ 4 levels, “Spring”, “Summer”, “Fall”, “Winter”

yr_fct = Factor w/ 2 levels, “2011”, “2012”

month_fct = Factor w/ 12 levels, “January”, “February”, “March”, “April”, “May”, “June”, “July”, “August”, “September”, “October”, “November”, “December”

holiday_fct = Factor w/ 2 levels “Non-Holiday”, “Holiday”

weekday_fct = Factor w/ 7 levels, “Sunday”, “Monday”, “Tuesday”, “Wednesday”, “Thursday”, “Friday”,”Saturday”

workingday_fct = Factor w/ 2 levels, “Non-Working Day”, “Working Day”

weathersit_fct = Factor w/ 3 levels, “Good”, “Clouds/Mist”, “Rain/Snow/Storm”

temp = int [1:731] 8 9 1 1 2 1 1 0 -1 0

atemp = num [1:731] 28.4 28 22.4 23.2 23.8 ..

hum = int [1:731] 80 69 43 59 43 51 49 53 43 48 …

windspeed = int [1:731] 10 16 16 10 12 6 11 17 24 14 …

BikeDataModel <- BikeData %>% dplyr::select(cnt,
                season_fct,
                yr_fct,
                month_fct,
                holiday_fct,
                weekday_fct,
                workingday_fct,
                weathersit_fct,
                temp,
                atemp,
                hum,
                windspeed)
BikeDataModel %>% dplyr::glimpse(78)
 
    ## Observations: 731
    ## Variables: 12
    ## $ cnt             985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 13…
    ## $ season_fct      Spring, Spring, Spring, Spring, Spring, Spring, Spri…
    ## $ yr_fct          2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
    ## $ month_fct       January, January, January, January, January, January…
    ## $ holiday_fct     Non-Holiday, Non-Holiday, Non-Holiday, Non-Holiday, …
    ## $ weekday_fct     Saturday, Sunday, Monday, Tuesday, Wednesday, Thursd…
    ## $ workingday_fct  Non-Working Day, Non-Working Day, Working Day, Worki…
    ## $ weathersit_fct  Clouds/Mist, Clouds/Mist, Good, Good, Good, Good, Cl…
    ## $ temp            8, 9, 1, 1, 2, 1, 1, 0, -1, 0, 0, 0, 0, 0, 2, 2, 0, …
    ## $ atemp           28.36325, 28.02713, 22.43977, 23.21215, 23.79518, 23…
    ## $ hum             80, 69, 43, 59, 43, 51, 49, 53, 43, 48, 68, 59, 47, …
    ## $ windspeed       10, 16, 16, 10, 12, 6, 11, 17, 24, 14, 8, 20, 20, 8,…

The testing and training split

We want to build training and testing data sets that are representative of the original bike data set. To achieve this, we will randomly select observations for two subsets of data. We’ll also specify the sampling process, so we get an equal proportion of bike rentals (cnt) in our BikeTest and BikeTrain data sets (we’ll randomize our data into training and test sets with a 70% / 30% split).

The BikeTrain has the data we will use to build a model and demonstrate it’s performance. However, because our goal is to generalize as much as possible using ‘real world data’, we’ll be testing the model on data our model hasn’t seen (i.e. the BikeTest data).

Having testing and training data sets allows us to 1) build and select the best model, and 2) then assess the final model’s performance using ‘fresh’ data.

set.seed(123)
BikeSplit <- initial_split(BikeDataModel, prop = .7)
BikeTrain <- training(BikeSplit)
BikeTest  <- testing(BikeSplit)

We can call the gbm function and select a number of parameters, including cross-fold validation (cv.folds = 10).

Cross-fold validation randomly divides our training data into k sets that are relatively equal in size. Our model will be fit using all the sets with the exclusion of the first fold. The model error of the fit is estimated with the hold-out sets.

Each set is used to measure the model error and an average is calculated across the various sets. shrinkage, interaction.depth, n.minobsinnode and n.trees can be adjusted for model accuracy using the caret package in R http://topepo.github.io/caret/index.html.

# model
set.seed(123)
bike_fit_1 <- gbm::gbm(cnt ~., 
             # the formula for the model (recall that the period means, "all 
             # variables in the data set")
             data = BikeTrain, 
             # data set
             verbose = TRUE, 
             # Logical indicating whether or not to print
             #  out progress and performance indicators
             shrinkage = 0.01, 
             # a shrinkage parameter applied to each tree in the expansion. 
             # Also known as the learning rate or step-size reduction; 0.001 
             # to 0.1 usually work, but a smaller learning rate typically 
             # requires more trees.
             interaction.depth = 3, 
             # Integer specifying the maximum depth of each tree (i.e., the 
             # highest level of variable interactions allowed). A value of 1 
             # implies an additive model, a value of 2 implies a model with up
             #  to 2-way interactions
             n.minobsinnode = 5,
             # Integer specifying the minimum number of observations in the 
             # terminal nodes of the trees. Note that this is the actual number 
             # of observations, not the total weight.
             n.trees = 5000, 
             # Integer specifying the total number of trees to fit. This is 
             # equivalent to the number of iterations and the number of basis 
             # functions in the additive expansion.
             cv.folds = 10
             # Number of cross-validation folds to perform. If cv.folds>1 then
             # gbm, in addition to the usual fit, will perform a 
             # cross-validation, calculate an estimate of generalization error
             #  returned in cv.error
             )
Distribution not specified, assuming gaussian ...
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1  3887192.4522             nan     0.0100 53443.4349
     2  3835212.0554             nan     0.0100 51509.5429
     3  3781613.7580             nan     0.0100 46152.5706
     4  3734439.6068             nan     0.0100 48076.6689
     5  3684049.7222             nan     0.0100 47545.5212
     6  3636533.0042             nan     0.0100 48434.4787
     7  3589347.3456             nan     0.0100 49416.0562

NOTE: The gbm::gbm() function creates a lot of output (we included the verbose = TRUE argument).

Store and explore

Now we can inspect the model object (bike_fit_1) beginning with the optimal number of learners that GBM has produced and the error rate.

We can use the gbm::gbm.perf() function to see the the error rate at each number of learners.

# model performance
perf_gbm1 = gbm.perf(bike_fit_1, method = "cv")

In the visualization below we can see that the blue line represents the optimal number of trees with our cross validation (cv). GBM can be sensitive to over-fitting, so using the method = "cv" in our estimate protects against this.

We can see that the optimal number of trees is 1794

Make predictions

Now we can predict our bike rentals using the predict() function with our test set and the optimal number of trees based on our perf.gbm1 estimate.

RMSE = The root mean squared error (RMSE) is used to measure the prediction error in our model(s). As the name suggests, these errors are weighted by means of squaring them. The RMSE is also pretty straightforward to interpret, because it's in the same units as our outcome variable. Additional attractive qualities include the fact equally captures the overestimates and underestimates, and the misses are penalized according to their relative size.

Metrics::rmse computes the root mean squared error between two numeric vectors, so we will use it to compare our predicted values with the residuals to calculate the error of our model.

bike_prediction_1 <- stats::predict(
                           # the model from above
                          object = bike_fit_1, 
                          # the testing data
                          newdata = BikeTest,
                          # this is the number we calculated above
                          n.trees = perf_gbm1)
rmse_fit1 <- Metrics::rmse(actual = BikeTest$cnt, 
                           predicted = bike_prediction_1)
print(rmse_fit1)

GBM offers partial dependency plots to explore the correlations between a feature in our model and our outcome. For example, you can see in the graph below that ambient temperature is associated with increased numbers of bike rentals until close to 35 degrees when riders tend to be less likely to rent a bike.

gbm::plot.gbm(bike_fit_1, i.var = 9)

Similarly we can look at the interaction of two features on bike rentals. Below we can see that riders are more likely to rent a bike after Monday, despite wind speed.

gbm::plot.gbm(bike_fit_1, i.var = c(5, 11))

We can visualize the impact of different features on predicting bike rentals using the relative influence provided by GBM. First we can summarize our model then assign these data to a tibble using gbm::summary.gbm() and passing this to the tibble::as_tibble() function.

# summarize model
BikeEffects <- tibble::as_tibble(gbm::summary.gbm(bike_fit_1, 
                                         plotit = FALSE))
BikeEffects %>% utils::head()
    ## # A tibble: 6 x 2
    ##   var            rel.inf
    ##               
    ## 1 atemp            35.3 
    ## 2 yr_fct           22.1 
    ## 3 month_fct         9.81
    ## 4 hum               8.41
    ## 5 season_fct        8.35
    ## 6 weathersit_fct    3.91

This creates a new data set with var, a factor variable with the variables in our model, and rel.inf, the relative influence each variable had on our model predictions.

We can then plot the top ten features by impact using ggpplot and our new data frame containing our model summary (BikeEffects).

# plot effects
BikeEffects %>% 
  # arrange descending to get the top influencers
  dplyr::arrange(desc(rel.inf)) %>%
  # sort to top 10
  dplyr::top_n(10) %>%
  # plot these data using columns
  ggplot(aes(x = forcats::fct_reorder(.f = var, 
                                      .x = rel.inf), 
             y = rel.inf, 
             fill = rel.inf)) +
  geom_col() +
  # flip
  coord_flip() +
  # format
  scale_color_brewer(palette = "Dark2") +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) + 
  xlab('Features') +
  ylab('Relative Influence') +
  ggtitle("Top 10 Drivers of Bike Rentals")

This let's us know what is being used for selection: Selecting by rel.inf

We can visualize the distribution of our predicted compared with actual bike rentals by predicting these values and plotting the difference.

# Predicted bike rentals
BikeTest$predicted <- base::as.integer(predict(bike_fit_1, 
                                         newdata = BikeTest, 
                                         n.trees = perf_gbm1))

# plot predicted v actual
ggplot(BikeTest) +
  geom_point(aes(y = predicted, 
                 x = cnt, 
                 color = predicted - cnt), alpha = 0.7) +
  # add theme
  theme_fivethirtyeight() +
  # strip text
  theme(axis.title = element_text()) + 
  # add format to labels
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  # add axes/legend titles
  scale_color_continuous(name = "Predicted - Actual", 
                         labels = comma) +
  ylab('Predicted Bike Rentals') +
  xlab('Actual Bike Rentals') +
  ggtitle('Predicted vs Actual Bike Rentals') 

We can see that our model did a fairly good job predicting bike rentals.

What did we learn?

We learned the ambient temperature is the largest influencer for predicting bike rentals and that rental numbers go down when the temperature reaches ~35 degrees. We can also see holiday (or non-holiday) was not much of an influencer for predicting bike rentals.

Peter Spangler
Peter is an entrepreneurial minded data scientist whose expertise in building analytic solutions and insights to tell data driven stories have impacted organizations including Alibaba, Citrix and Lyft. Peter has lead experimentation design and analytics projects focused on retention, user acquisition and channel optimization in the SaaS and rideshare spaces. The solutions he has produced include incrementality testing, segmentation, ML models and fraud. Peter is a passionate advocate for building analytics teams in cross-functional environments.

Leave a Reply