NYC Taxi Fare Predictions

Author

Clay Ndugga

Summary

The goal of this project is to analyze a subset of the New York City Taxi dataset and build two models to predict a trip's fare price. First, exploratory data analysis will be conducted to provide insight into useful features and to help engineer features that will aid in prediction. Then, XGBoost will be used to make initial predictions on the data. Finally, an ensemble model (stacking) will combine the results from multiple models to achieve an even greater prediction accuracy.

Figure 1: Diagram of Stacked Model Ensemble

Multiple models were considered as candidates for the final ensemble, since even models that perform poorly on their own can uncover trends that other models miss because of their differing structure. Level 0 models were trained directly on the training data, and their predictions were used as inputs, in addition to the training data, to train the meta-model, which outputs the final predictions. A stepwise model selector was used to pick the best-performing set of level 0 and meta-models. The models considered for both the level 0 and meta-model roles were: Linear Regression, Lasso, Huber Regression, Ridge Regression, Elastic Net, K-Nearest Neighbors, XGB Regressor, and LGBM Regressor.
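The same stacking structure can be sketched with scikit-learn's StackingRegressor. This is only an illustration of the idea, not the custom pipeline built in Section 4.2, and the estimator settings below are placeholders.

# Illustrative only: the stacking idea expressed with scikit-learn's StackingRegressor.
# The project itself builds the ensemble manually with out-of-fold predictions and a
# stepwise selector (Section 4.2); the estimator settings here are placeholders.
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
import xgboost as xgb

level0 = [
    ('xgb', xgb.XGBRegressor(n_estimators=200)),
    ('knn', KNeighborsRegressor(n_neighbors=15)),
    ('enet', ElasticNet(alpha=0.05)),
]

# passthrough=True feeds the original features to the meta-model alongside the
# level 0 predictions, matching the description above.
stack = StackingRegressor(estimators=level0, final_estimator=Ridge(), passthrough=True, cv=5)
# stack.fit(X_train, y_train); stack.predict(X_test)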

1 Library Imports

Import libraries
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
import plotly.express as px
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
Show Code
# Join the train and test data so feature engineering only has to be done once
# The datasets are separated before any scaling or predictive analysis to prevent leakage

df = pd.read_csv('raw data/W23P1_train.csv')
df['train'] = 1

test_df = pd.read_csv('raw data/W23P1_test.csv')
test_df['train'] = 0

df = pd.concat([df,test_df])
print(f"Dataframe size: {df.shape}")

df.head()
Dataframe size: (70000, 9)
uid fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count train
0 31722 9.0 2013-01-07 01:50:51 UTC -73.991421 40.750160 -73.989490 40.726085 2 1
1 14674 14.0 2013-01-15 20:08:00 UTC -73.997945 40.741057 -73.956223 40.767312 6 1
2 37571 19.5 2013-01-20 00:25:55 UTC -73.999161 40.688531 -74.026611 40.616634 1 1
3 47583 6.0 2013-01-01 02:30:00 UTC -73.991490 40.744257 -73.980912 40.748492 1 1
4 29473 33.5 2013-01-02 10:45:00 UTC -73.972773 40.677702 -73.862242 40.768117 1 1

2 Feature Engineering

Feature engineering is the most important part of building a successful predictive model, as a model can only be as good as the information it has access to. Thus, the goal of exploratory data analysis is to provide insight into which features will be the most useful in predicting taxi fare. According to the NYC government website, in 2013 taxi fares were determined using the following calculation:

  • An initial charge of $2.50
  • $0.30 improvement surcharge
  • $0.50 charge per 1/5 mile OR $0.50 per minute in slow traffic
  • $0.50 nighttime surcharge 8pm to 6am
  • $1.00 rush hour surcharge 4pm-8pm weekdays
  • Tolls for bridges and tunnels
  • Airport special rates for JFK and Newark

The initial $2.50 charge and $0.30 improvement surcharge are present on every trip, meaning they have no effect on our feature selection. Most of the charge comes from the trip distance/time, which implies that features containing information about the trip distance/time will be the most useful. There are additional small charges for certain time periods and for travel through tunnels and bridges. Finally, there are significantly higher rates for trips to and from the nearby airports.
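As a rough illustration of the rate card above, a rule-of-thumb fare estimate can be written directly. This ignores tolls, the per-minute slow-traffic billing, and airport flat rates, and is for intuition only, not part of the model.

# Rough rule-of-thumb fare estimate from the 2013 rate card above. Ignores tolls,
# the per-minute slow-traffic billing, and airport flat rates -- intuition only.
def estimate_fare_2013(distance_miles, pickup_hour, is_weekday):
    fare = 2.50 + 0.30                      # initial charge + improvement surcharge
    fare += 0.50 * (distance_miles / 0.2)   # $0.50 per 1/5 mile
    if pickup_hour >= 20 or pickup_hour < 6:
        fare += 0.50                        # nighttime surcharge (8pm to 6am)
    if is_weekday and 16 <= pickup_hour < 20:
        fare += 1.00                        # weekday rush hour surcharge (4pm to 8pm)
    return fare

estimate_fare_2013(distance_miles=2.7, pickup_hour=18, is_weekday=True)  # ~10.55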

2.1 One-hot encoding categorical features

#Reorder columns
df = df[['uid','fare_amount', 'pickup_datetime','pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude','passenger_count','train']]

# Extract time-based features from the pickup timestamp
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
df['pickup_hour'] = df['pickup_datetime'].dt.hour
df['pickup_day'] = df['pickup_datetime'].dt.day_name().astype("category")
df['pickup_day_num'] = df['pickup_datetime'].dt.day
df['pickup_day_type'] = np.where((df['pickup_day'] == 'Sunday') | (df['pickup_day'] =='Saturday'), 'Weekend', 'Weekday')
df['pickup_day_type'] = df['pickup_day_type'].astype("category")

# One-hot encode the categorical columns, then re-add pickup_day for later EDA plots
df = pd.get_dummies(data=df)
df['pickup_day'] = df['pickup_datetime'].dt.day_name().astype("category")

2.2 Creating Indicator features

# Airport flag
df['pickup_JFK_airport_flag'] = np.where((df['pickup_latitude'] < 40.67) & (df['pickup_longitude'] > -73.82), 1, 0) 
df['dropoff_JFK_airport_flag'] = np.where((df['dropoff_latitude'] < 40.67) & (df['dropoff_longitude'] > -73.82), 1, 0) 
df['pickup_laguardia_airport_flag'] = np.where((df['pickup_latitude'] > 40.7667) & (df['pickup_longitude'] > -73.888), 1, 0) 
df['dropoff_laguardia_airport_flag'] = np.where((df['dropoff_latitude'] > 40.7667) & (df['dropoff_longitude'] > -73.888), 1, 0) 
df['jfk_airport_flag'] = np.where((df['dropoff_JFK_airport_flag'] == 1) |  (df['pickup_JFK_airport_flag'] == 1), 1, 0) 
df['lgr_airport_flag'] = np.where((df['dropoff_laguardia_airport_flag'] == 1) |  (df['pickup_laguardia_airport_flag'] == 1), 1, 0) 
df['airport'] = np.where((df['dropoff_laguardia_airport_flag'] == 1) |  (df['pickup_laguardia_airport_flag'] == 1) |  (df['dropoff_JFK_airport_flag'] == 1) |  (df['pickup_JFK_airport_flag'] == 1), 1, 0) 

# Night time surcharge 8pm to 6am
df['night_flag'] = np.where((df['pickup_hour'] >= 20) | (df['pickup_hour'] < 6), 1, 0)

# Weekday rush hour surcharge 4pm to 8pm
df['rushhour_flag'] = np.where((df['pickup_hour'] >= 16) & (df['pickup_hour'] < 20), 1, 0)

2.3 Creating distance features

The trip distance can be extracted from the raw data using a function that translates the earth's latitude/longitude coordinates into the straight-line distance between them. However, when driving between destinations it is unlikely that the travel path is a straight line. Two possible remedies were developed. "Manhattan distance" is the sum of the x and y components of a vector and gives the total distance travelled between two points on a grid, much like the streets of New York. By also computing the bearing angle between the trip's origin and destination, the model has enough information to extract a more accurate distance estimate.

def haversine_dist(lat1, lon1, lat2, lon2): #returns trip distance in kilometers
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1)*np.cos(lat2)*(np.sin(dlon/2)**2)
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    return 6371 * c

def bearing(lat1, lon1, lat2, lon2):    #returns trip bearing angle in degrees
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    y = np.sin(dlon)*np.cos(lat2)
    x = np.cos(lat1)*np.sin(lat2) - np.sin(lat1)*np.cos(lat2)*np.cos(dlon)
    theta = np.arctan2(y,x)
    return (theta * 180/np.pi +360) % 360

df['trip_distance'] =  np.vectorize(haversine_dist)(df['dropoff_latitude'], df['dropoff_longitude'],df['pickup_latitude'],df['pickup_longitude'])
df['trip_bearing'] = np.vectorize(bearing)(df['pickup_latitude'],df['pickup_longitude'],df['dropoff_latitude'], df['dropoff_longitude'])
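The Manhattan-style distance mentioned above is not computed in the block; a minimal sketch of one way to approximate it with the same haversine helper (summing a north-south leg and an east-west leg) would be:

# Grid-style ("Manhattan") distance approximation: sum of the north-south and
# east-west haversine legs. Illustrative sketch only -- not part of the feature set above.
def manhattan_dist(lat1, lon1, lat2, lon2):
    ns_leg = haversine_dist(lat1, lon1, lat2, lon1)  # change latitude only
    ew_leg = haversine_dist(lat2, lon1, lat2, lon2)  # change longitude only
    return ns_leg + ew_leg

# df['trip_manhattan_distance'] = manhattan_dist(df['pickup_latitude'], df['pickup_longitude'],
#                                                df['dropoff_latitude'], df['dropoff_longitude'])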

The next largest factor in trip fare price from the NYC government website was travel to and from an airport. The two airports found in the dataset were LGR (LaGuardia) and JFK. The latitude and longitude coordinates of each airport were taken from Google Maps and used to create five engineered features: dropoff distance to JFK, pickup distance to JFK, dropoff distance to LGR, pickup distance to LGR, and an airport flag that is 0 when no airport is involved in the trip and 1 when one is.

Show Code
df['pickup_dist_to_jfk'] = np.vectorize(haversine_dist)(40.65209631703223, -73.7833074176638,df['pickup_latitude'],df['pickup_longitude'])
df['dropoff_dist_to_jfk'] = np.vectorize(haversine_dist)(40.65209631703223, -73.7833074176638,df['dropoff_latitude'],df['dropoff_longitude'])

df['pickup_dist_to_lgr'] = np.vectorize(haversine_dist)(40.77958127849166, -73.87315146148384,df['pickup_latitude'],df['pickup_longitude'])
df['dropoff_dist_to_lgr'] = np.vectorize(haversine_dist)(40.77958127849166, -73.87315146148384,df['dropoff_latitude'],df['dropoff_longitude'])

2.4 Google Maps API data

The Google Maps API was used to calculate the driving distance and travel time between the pickup and drop-off locations. A final feature, distance delta, was created as the absolute difference between the straight-line distance and the API distance.

Show Code
train_mask = df['train'] == 1
test_df = df[~train_mask]
df = df[train_mask]

train_dist_df = pd.read_csv('data/api_train.csv')
df = pd.merge(df, train_dist_df, left_on='uid', right_on='uid', how='left')
df['dist_delta'] = abs(df['trip_distance'] - df['api_trip_distance'])

test_dist_df = pd.read_csv('data/api_test.csv')
test_df = pd.merge(test_df, test_dist_df, left_on='uid', right_on='uid', how='left')

test_df.api_trip_distance.fillna(test_df.trip_distance, inplace=True)
test_df.api_trip_duration.fillna(test_df.trip_distance/0.5, inplace=True)   #0.5 km/minute

test_df['dist_delta'] = abs(test_df['trip_distance'] - test_df['api_trip_distance'])
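The api_train.csv and api_test.csv files are assumed to have been generated offline. One plausible way to produce them with the googlemaps client library is sketched below; the exact API product, the helper function, and the response handling are assumptions, since the original only states that the Google Maps API was used.

# Hypothetical sketch of how the per-trip API distance/duration files could be built.
# Assumes the `googlemaps` package and a valid API key; column names match the merge above.
import googlemaps
import pandas as pd

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key

def fetch_api_row(row):
    resp = gmaps.distance_matrix(
        origins=[(row['pickup_latitude'], row['pickup_longitude'])],
        destinations=[(row['dropoff_latitude'], row['dropoff_longitude'])],
        mode="driving",
    )
    element = resp['rows'][0]['elements'][0]
    return pd.Series({
        'uid': row['uid'],
        'api_trip_distance': element['distance']['value'] / 1000,  # metres -> km
        'api_trip_duration': element['duration']['value'] / 60,    # seconds -> minutes
    })

# api_train_df = df.apply(fetch_api_row, axis=1)
# api_train_df.to_csv('data/api_train.csv', index=False)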

2.5 Create PCA Features

The numeric feature columns were run through the PCA algorithm to generate five principal components identifying the main axes of variance within the data set; the first three components were kept as features.

Show Code
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
numComponents = 5
pca = PCA(n_components=numComponents)
scaler = StandardScaler()

SCALED_FEATURES = ['pickup_latitude','pickup_longitude', 'dropoff_latitude', 'dropoff_longitude','pickup_hour', 'pickup_day_num','trip_distance',
                    'trip_bearing', 'pickup_dist_to_jfk', 'dropoff_dist_to_jfk',
                    'pickup_dist_to_lgr', 'dropoff_dist_to_lgr', 'api_trip_distance',
                    'api_trip_duration', 'dist_delta']

scaled_df = df.copy()          
scaled_df[SCALED_FEATURES] = scaler.fit_transform(df[SCALED_FEATURES])
scaled_test_df = test_df.copy()          
scaled_test_df[SCALED_FEATURES] = scaler.transform(test_df[SCALED_FEATURES])


PCA_DIMENSION_REDUCTION_FEATURES = ['trip_bearing','dropoff_dist_to_jfk','trip_distance','dropoff_dist_to_lgr','pickup_dist_to_jfk','pickup_dist_to_lgr','api_trip_duration','pickup_latitude','dist_delta','pickup_longitude','dropoff_latitude','dropoff_longitude','pickup_hour']

pca_result = pca.fit_transform(scaled_df[PCA_DIMENSION_REDUCTION_FEATURES].values)
df['pca-one'] = pca_result[:,0]
df['pca-two'] = pca_result[:,1] 
df['pca-three'] = pca_result[:,2] 

pca_result = pca.transform(scaled_test_df[PCA_DIMENSION_REDUCTION_FEATURES].values)
test_df['pca-one'] = pca_result[:,0]
test_df['pca-two'] = pca_result[:,1] 
test_df['pca-three'] = pca_result[:,2] 
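A quick check of how much variance the retained components capture can be read off the fitted pca object above; this is a usage note rather than part of the original pipeline.

# Share of variance captured by each principal component (the first three are kept as features).
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {ratio:.1%}")
print(f"Total across all {numComponents} components: {pca.explained_variance_ratio_.sum():.1%}")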
Show Code
print(f"The final feature columns are:")
df.head()
The final feature columns are:
uid fare_amount pickup_datetime pickup_latitude pickup_longitude dropoff_latitude dropoff_longitude passenger_count train pickup_hour pickup_day_num pickup_day_Friday pickup_day_Monday pickup_day_Saturday pickup_day_Sunday pickup_day_Thursday pickup_day_Tuesday pickup_day_Wednesday pickup_day_type_Weekday pickup_day_type_Weekend pickup_day pickup_JFK_airport_flag dropoff_JFK_airport_flag pickup_laguardia_airport_flag dropoff_laguardia_airport_flag jfk_airport_flag lgr_airport_flag airport night_flag rushhour_flag trip_distance trip_bearing pickup_dist_to_jfk dropoff_dist_to_jfk pickup_dist_to_lgr dropoff_dist_to_lgr api_trip_distance api_trip_duration dist_delta pca-one pca-two pca-three
0 31722 9.0 2013-01-07 01:50:51+00:00 40.750160 -73.991421 40.726085 -73.989490 2 1 1 7 0 1 0 0 0 0 0 1 0 Monday 0 0 0 0 0 0 0 0 0 2.681957 176.521589 20.656392 19.232673 10.483987 11.463754 3.7 15 1.018043 0.067223 1.114535 -0.108649
1 14674 14.0 2013-01-15 20:08:00+00:00 40.741057 -73.997945 40.767312 -73.956223 6 1 20 15 0 0 0 0 0 1 0 1 0 Tuesday 0 0 0 0 0 0 0 0 1 4.568758 50.269314 20.622338 19.405047 11.350059 7.127057 6.6 19 2.031242 0.331220 -0.484364 -1.585731
2 37571 19.5 2013-01-20 00:25:55+00:00 40.688531 -73.999161 40.616634 -74.026611 1 1 0 20 0 0 0 1 0 0 0 0 1 Sunday 0 0 0 0 0 0 0 0 0 8.323209 196.163064 18.650063 20.906095 14.670655 22.263499 10.2 12 1.876791 2.684151 6.474439 0.776196
3 47583 6.0 2013-01-01 02:30:00+00:00 40.744257 -73.991490 40.748492 -73.980912 1 1 2 1 0 0 0 0 0 1 0 1 0 Tuesday 0 0 0 0 0 0 0 0 0 1.007887 62.142206 20.323242 19.808678 10.712804 9.711654 1.4 7 0.392113 -0.854687 0.616436 -1.010020
4 29473 33.5 2013-01-02 10:45:00+00:00 40.677702 -73.972773 40.768117 -73.862242 1 1 10 2 0 0 0 0 0 0 1 1 0 Wednesday 0 0 0 1 0 1 1 0 0 13.705437 42.778729 16.232151 14.515472 14.099731 1.571294 18.2 25 4.494563 4.657020 -2.091841 -4.205518

3 Exploratory Data Analysis

3.1 Pickup location

Finding the locations with the most pickups may provide insight into predicting trip fare, as busy areas have more traffic, which means longer trips and therefore higher fares. The locations with the most pickups can be found by running the DBSCAN clustering algorithm on the latitude and longitude coordinates of the trips. Unlike other clustering algorithms, the primary factor DBSCAN uses to form clusters is point density, making it well suited to this application. Running the algorithm on the location data and coloring the data points by the number of points in each cluster produces the plot in Figure 1.

Show Code
from sklearn.cluster import DBSCAN

coords = df[['pickup_latitude','pickup_longitude']].to_numpy()
max_dist = 0.05
kms_per_radian = 6371.0088
epsilon = max_dist / kms_per_radian
db = DBSCAN(eps=epsilon, min_samples = 30, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
df['db_cluster_labels'] = db.labels_
dbcluster_to_count_map = df['db_cluster_labels'].value_counts().to_dict()
df['db_cluster_count'] = df['db_cluster_labels'].apply(lambda cluster: dbcluster_to_count_map[cluster])

tPickup_fig = px.scatter_mapbox(df.drop(df[df['db_cluster_labels'] == -1].index), lat='pickup_latitude', lon='pickup_longitude', hover_name='db_cluster_labels', hover_data=['db_cluster_labels'],
    color='db_cluster_count', color_continuous_scale=px.colors.diverging.RdBu, opacity=.2,
    center={'lat': 40.75153, 'lon': -73.98987}, 
    zoom=15,
    height=500,width=1150,
    mapbox_style="carto-positron", labels=None);
tPickup_fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0}).update_coloraxes(showscale=True);
tPickup_fig.show();

Figure 1: Locations with highest pickup density

The locations with the highest density of pickups are Pennsylvania Station, and Grand Central Terminal. This makes sense as both locations are large hubs for public transit.

3.2 Trip Time

Determining whether different days of the week have different average trip fares will help discern whether the trip day will be a useful feature for the predictive model. It is likely that people exhibit different behaviour on weekends, as they are not working. Additionally, according to the NYC government website, some charges occur only on weekdays. Aggregating the data by day of the week produces the following plots.

Show Code
fig, ax = plt.subplots(nrows=1, ncols=2,figsize=(12, 4), dpi = 175)

sns.set_style("darkgrid", {"axes.facecolor": ".9",'axes.edgecolor': 'white','patch.edgecolor': '.9',
                'patch.force_edgecolor': False})

sns.barplot(data=df,x="pickup_day",y="fare_amount",order=['Sunday','Monday', 'Tuesday', 'Wednesday', 'Thursday','Friday', 'Saturday'],
            errwidth=1.2,capsize=0.05, ax=ax[1],palette=sns.color_palette("rocket",7))

sns.countplot(data=df,x="pickup_day", ax=ax[0],order=['Sunday','Monday', 'Tuesday', 'Wednesday', 'Thursday','Friday', 'Saturday'],palette=sns.color_palette("rocket",7))#,palette=sns.color_palette("mako",7))#,hue='pickup_day')

ax[0].tick_params(axis='x', labelrotation=20)
ax[1].tick_params(axis='x', labelrotation=20)
ax[1].set_ylim(11,13);

Figure 2: Fare price by trip day

By similar reasoning, trip hour may also be a useful predictor.

Show Code
fig, ax = plt.subplots(nrows=1, ncols=2,figsize=(12, 4), dpi = 175)
# sns.lineplot(x=df3.index, y=df3['fare_amount']['mean'], ax=ax)
sns.set_style("darkgrid", {"axes.facecolor": ".9",'axes.edgecolor': 'white','patch.edgecolor': 'black',
                'patch.force_edgecolor': False})#,'image.cmap': 'mako')

sns.barplot(data=df,x="pickup_hour",y="fare_amount",order=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23],
            errwidth=1.2,capsize=0.05,palette=sns.color_palette("rocket",24),ax=ax[1])#palette=sns.color_palette("mako",24))

sns.countplot(data=df,x="pickup_hour", ax=ax[0],order=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23],palette=sns.color_palette("rocket",24))#,palette=sns.color_palette("mako",7))#,hue='pickup_day')

ax[1].set_ylim(10,20);

Figure 3: Fare price by trip hour

Both trip weekday and trip hour exhibit varying average fare prices and will therefore be useful feature columns.

3.3 Trip distance

Show Code
fig, ax = plt.subplots(figsize=(12, 4), dpi = 175)
sns.set_style("darkgrid", {"axes.facecolor": ".9",'axes.edgecolor': 'white','patch.edgecolor': 'black','patch.force_edgecolor': False})
sns.scatterplot(data=df, x='trip_distance', y='fare_amount',s=2,ax=ax,hue='airport',palette=[sns.color_palette("rocket",5)[0],sns.color_palette("rocket",5)[3]])
# sns.boxplot(data=df, x='api_trip_duration', y='fare_amount',ax=ax,hue='airport')#,fliersize=0)
ax.set_xlim(0,24);
ax.set_ylim(0,60);

Figure 4: Trip distance vs fare price

As indicated by the NYC government website, fare price is approximately a linear function of trip distance, which is reflected in Figure 4. Trips involving an airport are colored orange and have a higher average fare price than normal trips. The horizontal bands at the top right of the plot indicate fixed airport trip fees. This confirms that distance and airport-related metrics will be helpful to the predictive model.
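A quick, rough check of the per-kilometre rate can be made by fitting a line to the non-airport trips shown in the plot; this is an exploratory sanity check, not part of the modelling pipeline.

# Fit fare ~ slope * distance + intercept on non-airport trips within the plotted range
# to gauge the approximate per-km rate.
mask = (df['airport'] == 0) & (df['trip_distance'] < 24) & (df['fare_amount'] < 60)
slope, intercept = np.polyfit(df.loc[mask, 'trip_distance'], df.loc[mask, 'fare_amount'], 1)
print(f"Approximate rate: ${slope:.2f}/km on top of a ${intercept:.2f} base")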

4 Prediction Models

4.1 XGBoost

4.1.1 Hyperparameter Tuning

A common approach to selecting the best model parameters is to create a grid containing all possible parameter configurations and evaluate each one. However, for a model like XGBoost, which has more than seven tunable parameters, each taking a wide range of values, this is not computationally feasible. Instead, we will use the Hyperopt library, which applies the Tree-structured Parzen Estimator (TPE) algorithm to evaluate a subset of parameter configurations and estimate the best combination.

from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

#Define the sample space of parameter combinations
space={'max_depth': hp.quniform("max_depth", 3, 18, 1),
        'gamma': hp.uniform ('gamma', 0,50),
        'alpha' : hp.uniform('alpha', 0,50),
        'reg_lambda' : hp.uniform('reg_lambda', 0,50),
        'colsample_bytree' : hp.uniform('colsample_bytree', 0.4,1),
        'min_child_weight' : hp.choice('min_child_weight',[0,1,2,3,5,8]),
        'n_estimators': 3000,
        'eta': hp.choice('eta',[0.001,0.01,0.02,0.03])
    }
Show Code
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)

def RMS_error(Y,Y_hat):
    return np.sqrt(np.mean(np.power(np.subtract(Y,Y_hat),2)))

#Objective function that the TPE algorithm will attempt to minimize.
#Returns the RMSE of a set of parameters calculated by 5-fold cross validation
def objective(space):
    model=xgb.XGBRegressor(
                    max_depth = int(space['max_depth']),
                    gamma = space['gamma'],
                    alpha = space['alpha'],
                    reg_lambda = space['reg_lambda'],
                    colsample_bytree=space['colsample_bytree'],
                    min_child_weight=int(space['min_child_weight']),
                    n_estimators = int(space['n_estimators']), 
                    eta = space['eta'],
                    early_stopping_rounds = 15,
                    eval_metric='rmse'
                    )

    crossval_error = []
    for i, (train_idx, test_idx) in enumerate(kf.split(x_train)):
        fold_train_x = np.take(x_train, train_idx, axis=0)
        fold_train_y = np.take(y_train, train_idx, axis=0)
        fold_test_x = np.take(x_train, test_idx, axis=0)
        fold_test_y = np.take(y_train, test_idx, axis=0)

        val_idxs = np.arange(int(fold_train_x.shape[0]*0.1))
        val_x = np.take(fold_train_x, val_idxs, axis=0)
        val_y = np.take(fold_train_y, val_idxs, axis=0)

        fold_train_x = np.delete(fold_train_x,val_idxs, axis=0)
        fold_train_y = np.delete(fold_train_y,val_idxs, axis=0)
        model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],verbose=0)
        pred = model.predict(fold_test_x)
        error = RMS_error(fold_test_y, pred)
        crossval_error.append(error)
    error = np.mean(crossval_error)
    print (f"max_depth: {int(space['max_depth'])} | gamma: {space['gamma']:.5} | alpha : {space['alpha']:.5} | reg_lambda : {space['reg_lambda']:.5} | colsample_bytree : {space['colsample_bytree']:.5} | min_child_weight : {int(space['min_child_weight'])} | eta : {space['eta']} | error: {error:.5}")

    return {'loss': error, 'status': STATUS_OK }
FEATURE_COLUMNS = ['trip_bearing', 'dropoff_dist_to_jfk','trip_distance', 'dropoff_dist_to_lgr','pickup_dist_to_jfk','pickup_dist_to_lgr',
       'api_trip_duration','pickup_latitude','dist_delta',
       'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude','pickup_hour','airport',
       'pickup_day_Monday', 'pickup_day_Saturday', 'pickup_day_Sunday',
       'pickup_day_Thursday','pickup_day_Tuesday', 'pickup_day_Wednesday','pickup_day_Friday','pca-one', 'pca-two', 'pca-three','night_flag','rushhour_flag']
TARGET_COLUMN = ['fare_amount']

x_train = df[FEATURE_COLUMNS].to_numpy()
y_train = df[TARGET_COLUMN].to_numpy()

trials = Trials()

# Hyperopt can take multiple hours to run, commented out to decrease render time
# best_hyperparams = fmin(fn = objective, 
#                         space = space, 
#                         algo = tpe.suggest, 
#                         max_evals = 200,
#                         trials = trials)

Hyperopt returned the following parameters to minimize the RMSE:

{'max_depth': 18 , 'gamma': 8.6581 , 'alpha' : 45.457 , 'reg_lambda' : 48.254 , 'colsample_bytree' : 0.555 , 'min_child_weight' : 8 , 'eta' : 0.03,'n_estimators': 2000} 
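Note that fmin itself returns indices for the hp.choice parameters (such as eta and min_child_weight) rather than their values; hyperopt's space_eval maps them back to the concrete settings listed above. A short usage note:

from hyperopt import space_eval

# best_hyperparams (from the commented-out fmin call above) stores indices for the
# hp.choice parameters; space_eval decodes them back to actual values:
# best_params = space_eval(space, best_hyperparams)
# print(best_params)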

4.1.2 Model Fitting

import xgboost as xgb

params = {'max_depth': 18 , 'gamma': 8.6581 , 'alpha' : 45.457 , 'reg_lambda' : 48.254 , 'colsample_bytree' : 0.555 , 'min_child_weight' : 8 , 'eta' : 0.03,'n_estimators': 2000,'verbose_eval': True} #best params

num_rounds = 5000
model = None

val_df = df[FEATURE_COLUMNS + TARGET_COLUMN].sample(n=5000, random_state=1)       
X_val = val_df[FEATURE_COLUMNS].to_numpy()
y_val = val_df[TARGET_COLUMN].to_numpy().reshape(5000,)

train_df = df[FEATURE_COLUMNS + TARGET_COLUMN].drop(val_df.index)
X_train = train_df[FEATURE_COLUMNS].to_numpy()
y_train = train_df[TARGET_COLUMN].to_numpy().reshape(30000,)

dtrain = xgb.DMatrix(X_train, y_train, feature_names=FEATURE_COLUMNS)
dval = xgb.DMatrix(X_val, y_val, feature_names=FEATURE_COLUMNS)


main_model = xgb.train(params, dtrain, num_rounds, early_stopping_rounds=10, #10
                      evals=[(dtrain, 'train'), (dval, 'eval')],
                      xgb_model=model, verbose_eval=False)
print(f"The best validation RMSE achieved: {main_model.best_score:.5}")
[16:23:35] WARNING: C:/buildkite-agent/builds/buildkite-windows-cpu-autoscaling-group-i-0fc7796c793e6356f-1/xgboost/xgboost-ci-windows/src/learner.cc:767: 
Parameters: { "n_estimators", "verbose_eval" } are not used.
The best validation RMSE achieved: 3.9169

4.1.3 Output Predictions

Show Code
X_testmain = test_df[FEATURE_COLUMNS]
dtest = xgb.DMatrix(X_testmain, feature_names=FEATURE_COLUMNS)
y_pred = main_model.predict(dtest)

submission_df = pd.DataFrame(columns=['uid','fare_amount'])
submission_df['uid'] = test_df['uid']
submission_df['fare_amount'] = y_pred
# submission_df[['uid','fare_amount']].to_csv('submissions/XGB_pca123_night_rushhour_earlystop50_21.csv',index=False)
display(submission_df)
uid fare_amount
0 3 5.879074
1 10 4.249468
2 15 10.557648
3 16 16.221519
4 17 19.020933
... ... ...
34995 69986 14.189826
34996 69989 5.366324
34997 69993 18.416437
34998 69996 13.369841
34999 70000 7.867848

35000 rows × 2 columns

After uploading the submissions to Kaggle, XGBoost scored an RMSE of 3.36 on the test dataset.

4.2 Ensemble Model

4.2.1 Prepare Model Candidates

The hyperparameters for each model (aside from XGB and LightGBM) were selected by conducting an exhaustive grid search over all possible combinations of hyperparameters.

Show Code
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, HuberRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
import xgboost as xgb
import lightgbm as lgbm
from lightgbm import early_stopping
from lightgbm import log_evaluation
from sklearn.model_selection import KFold
import itertools

def RMS_error(Y,Y_hat):
    return np.sqrt(np.mean(np.power(np.subtract(Y,Y_hat),2)))

def param_grid(*lists):
    return list(itertools.product(*lists))

df = pd.read_csv('processed dataframes/scaled_train.csv')
test_df = pd.read_csv('processed dataframes/scaled_test.csv')
svr_model = SVR(kernel = 'linear',epsilon=.1)
ridge_model = Ridge(alpha=198)                                                                           
lasso_model = Lasso(alpha=0.04)                                                                          
elastic_model = ElasticNet(alpha=0.062 , l1_ratio= 1)                                                    
huber_model = HuberRegressor(epsilon=7,max_iter=3000,alpha=30)                                             
knn_model = KNeighborsRegressor(n_neighbors = 15, weights='distance', leaf_size=15,p=2)                          
linear_model = LinearRegression()                                                                     

xgb_model = xgb.XGBRegressor(alpha = 45.457, colsample_bytree = 0.555, reg_lambda = 48.254,n_estimators=2000, min_child_weight = 8, gamma=8.6581, max_depth=18, eta=0.03, eval_metric='rmse', early_stopping_rounds = 12) #best xgb
lgbm_model = lgbm.LGBMRegressor(num_leaves = 80, max_depth = 25, min_child_samples = 60, learning_rate = 0.05, subsample_freq = 20, subsample = 1, colsample_bytree = 0.8, n_estimators = 50000, max_bin = 10000, boosting_type='gbdt', objective= 'regression',n_jobs= -1, metric= 'rmse',early_stopping_round=10, verbose=-1)   

forest_model = RandomForestRegressor(bootstrap= 1, ccp_alpha= 2.973, max_depth= 16, max_features= 18, min_samples_leaf= 5, min_samples_split= 6, n_estimators= 718) 

chosen_models = [elastic_model, knn_model, xgb_model, lgbm_model,lasso_model,ridge_model]# chosen_models = [elastic_model, knn_model, xgb_model, lgbm_model,lasso_model,huber_model,ridge_model]
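The exhaustive search itself is not shown. A minimal sketch of what it might have looked like for a single model (Ridge) follows; the grid values are assumptions, not the grids actually used.

# Illustrative only: a minimal grid search over the Ridge alpha with 5-fold cross-validation.
# The grids actually searched for each model are not shown in the original.
from sklearn.model_selection import cross_validate

best_alpha, best_rmse = None, np.inf
for (alpha,) in param_grid(range(50, 400, 25)):
    scores = cross_validate(Ridge(alpha=alpha), df[FEATURE_COLUMNS], df['fare_amount'],
                            scoring='neg_root_mean_squared_error', cv=5)
    rmse = -scores['test_score'].mean()
    if rmse < best_rmse:
        best_alpha, best_rmse = alpha, rmse
print(f"Best alpha: {best_alpha} (CV RMSE: {best_rmse:.3f})")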

4.2.2 Get Out-of-Fold (OOF) Predictions

Using 10 folds, each model was fitted on 9 folds and then used to predict the fare amount on the remaining fold. This was repeated with each fold left out once, generating out-of-fold predictions for the entire training dataset. Doing this prevents data leakage, as otherwise the models would be predicting on data points they have already seen.
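For the simpler models, the same kind of out-of-fold predictions could be produced directly with scikit-learn's cross_val_predict; the helper below does it manually so that XGBoost and LightGBM can use early stopping inside each fold. For example:

# Out-of-fold predictions for one level 0 model, equivalent in spirit to the manual
# loop below (which additionally carves out a validation set for XGB/LGBM early stopping).
from sklearn.model_selection import cross_val_predict

oof_ridge = cross_val_predict(Ridge(alpha=198), df[FEATURE_COLUMNS], df['fare_amount'], cv=10)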

Show Code
def get_oof_predictions(x, y, models):

    # instantiate a KFold with 10 splits
    kfold = KFold(n_splits=10, shuffle=True, random_state=1)
    
    # prepare lists to hold the re-ordered x and y values
    data_x, data_y  = [], []
    
    # run the following block for each of the 10 kfold splits
    for i, (train_idx, test_idx) in enumerate(kfold.split(x)):
        print(f"Working Fold: {i}")

        fold_train_x = np.take(x, train_idx, axis=0)
        fold_train_y = np.take(y, train_idx, axis=0)
        fold_test_x = np.take(x, test_idx, axis=0)
        fold_test_y = np.take(y, test_idx, axis=0)
    
        data_x.extend(fold_test_x)
        data_y.extend(fold_test_y)
    
        for item in models:
            print(f"Fold: {i} | Fitting model: {item}")
            model = models[item][0] 

            if item == 'XGBRegressor':
                val_idxs = np.arange(int(fold_train_x.shape[0]*0.1))
                val_x = np.take(fold_train_x, val_idxs, axis=0)
                val_y = np.take(fold_train_y, val_idxs, axis=0)

                fold_train_x = np.delete(fold_train_x,val_idxs, axis=0)
                fold_train_y = np.delete(fold_train_y,val_idxs, axis=0)

                model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],verbose=0)
                print(f"Fold: {i} | Validation RMSE: {model.best_score}")
                predictions = model.predict(fold_test_x)
            elif item == 'LGBMRegressor':
                val_idxs = np.arange(int(fold_train_x.shape[0]*0.1))
                val_x = np.take(fold_train_x, val_idxs, axis=0)
                val_y = np.take(fold_train_y, val_idxs, axis=0)

                fold_train_x = np.delete(fold_train_x,val_idxs, axis=0)
                fold_train_y = np.delete(fold_train_y,val_idxs, axis=0)

                model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],callbacks=[log_evaluation(0),early_stopping(10,verbose=False)])
                print(f"Fold: {i} | Validation RMSE: {model.best_score_['valid_0']['rmse']}")
                predictions = model.predict(fold_test_x)
            else:
                model.fit(fold_train_x, fold_train_y) 
                predictions = model.predict(fold_test_x)

            models[item][1].extend(predictions) 
    
    return data_x, data_y, models
x_train = df[FEATURE_COLUMNS].to_numpy()
y_train = df['fare_amount'].to_numpy()

model_dict = {str(model).split('(')[0]: [model,[]] for model in chosen_models}
data_x, data_y, trained_models = get_oof_predictions(x_train,y_train,model_dict)
Working Fold: 0
Fold: 0 | Fitting model: ElasticNet
Fold: 0 | Fitting model: KNeighborsRegressor
Fold: 0 | Fitting model: XGBRegressor
Fold: 0 | Validation RMSE: 3.34340422098736
Fold: 0 | Fitting model: LGBMRegressor
Fold: 0 | Validation RMSE: 2.5347304308734233
Fold: 0 | Fitting model: Lasso
Fold: 0 | Fitting model: Ridge
Working Fold: 1
Fold: 1 | Fitting model: ElasticNet
Fold: 1 | Fitting model: KNeighborsRegressor
Fold: 1 | Fitting model: XGBRegressor
Fold: 1 | Validation RMSE: 3.0859730253115103
Fold: 1 | Fitting model: LGBMRegressor
Fold: 1 | Validation RMSE: 2.5913792900382613
Fold: 1 | Fitting model: Lasso
Fold: 1 | Fitting model: Ridge
Working Fold: 2
Fold: 2 | Fitting model: ElasticNet
Fold: 2 | Fitting model: KNeighborsRegressor
Fold: 2 | Fitting model: XGBRegressor
Fold: 2 | Validation RMSE: 3.11564875559071
Fold: 2 | Fitting model: LGBMRegressor
Fold: 2 | Validation RMSE: 2.5491994753213163
Fold: 2 | Fitting model: Lasso
Fold: 2 | Fitting model: Ridge
Working Fold: 3
Fold: 3 | Fitting model: ElasticNet
Fold: 3 | Fitting model: KNeighborsRegressor
Fold: 3 | Fitting model: XGBRegressor
Fold: 3 | Validation RMSE: 3.2941259665900944
Fold: 3 | Fitting model: LGBMRegressor
Fold: 3 | Validation RMSE: 2.5414740589396803
Fold: 3 | Fitting model: Lasso
Fold: 3 | Fitting model: Ridge
Working Fold: 4
Fold: 4 | Fitting model: ElasticNet
Fold: 4 | Fitting model: KNeighborsRegressor
Fold: 4 | Fitting model: XGBRegressor
Fold: 4 | Validation RMSE: 3.159533681316195
Fold: 4 | Fitting model: LGBMRegressor
Fold: 4 | Validation RMSE: 2.542693798429106
Fold: 4 | Fitting model: Lasso
Fold: 4 | Fitting model: Ridge
Working Fold: 5
Fold: 5 | Fitting model: ElasticNet
Fold: 5 | Fitting model: KNeighborsRegressor
Fold: 5 | Fitting model: XGBRegressor
Fold: 5 | Validation RMSE: 3.3497618072423796
Fold: 5 | Fitting model: LGBMRegressor
Fold: 5 | Validation RMSE: 2.6832628410176644
Fold: 5 | Fitting model: Lasso
Fold: 5 | Fitting model: Ridge
Working Fold: 6
Fold: 6 | Fitting model: ElasticNet
Fold: 6 | Fitting model: KNeighborsRegressor
Fold: 6 | Fitting model: XGBRegressor
Fold: 6 | Validation RMSE: 3.1627186555477724
Fold: 6 | Fitting model: LGBMRegressor
Fold: 6 | Validation RMSE: 2.708755394817404
Fold: 6 | Fitting model: Lasso
Fold: 6 | Fitting model: Ridge
Working Fold: 7
Fold: 7 | Fitting model: ElasticNet
Fold: 7 | Fitting model: KNeighborsRegressor
Fold: 7 | Fitting model: XGBRegressor
Fold: 7 | Validation RMSE: 3.330761072419705
Fold: 7 | Fitting model: LGBMRegressor
Fold: 7 | Validation RMSE: 2.6401084772362657
Fold: 7 | Fitting model: Lasso
Fold: 7 | Fitting model: Ridge
Working Fold: 8
Fold: 8 | Fitting model: ElasticNet
Fold: 8 | Fitting model: KNeighborsRegressor
Fold: 8 | Fitting model: XGBRegressor
Fold: 8 | Validation RMSE: 3.342755661846802
Fold: 8 | Fitting model: LGBMRegressor
Fold: 8 | Validation RMSE: 2.6109948358743615
Fold: 8 | Fitting model: Lasso
Fold: 8 | Fitting model: Ridge
Working Fold: 9
Fold: 9 | Fitting model: ElasticNet
Fold: 9 | Fitting model: KNeighborsRegressor
Fold: 9 | Fitting model: XGBRegressor
Fold: 9 | Validation RMSE: 3.2663211603775753
Fold: 9 | Fitting model: LGBMRegressor
Fold: 9 | Validation RMSE: 2.611919757939597
Fold: 9 | Fitting model: Lasso
Fold: 9 | Fitting model: Ridge

4.2.3 Stepwise Model Selection

Not every level 0 model will necessarily improve the performance of the final meta-model, so they must be selected carefully. To do this, level 0 models are added to the meta-model's inputs forward stepwise until the performance of the meta-model stops improving, much like a stepwise feature selector with level 0 models in place of features.
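Condensed, the selection loop works as follows. This is a simplified restatement of the full implementation shown after this paragraph; cv_rmse stands in for the 5-fold RMSE computation, and the full version additionally handles early stopping for the XGB and LGBM meta-models.

# Simplified sketch of the forward stepwise selection over level 0 predictions.
def forward_select(base_x, y, meta_model, oof_preds, cv_rmse):
    """oof_preds: {model_name: out-of-fold predictions}; cv_rmse: cross-validated RMSE helper."""
    included = []
    best = cv_rmse(meta_model, np.array(base_x), y)   # meta-model on the raw features only
    while True:
        candidates = {
            name: cv_rmse(meta_model,
                          np.column_stack([base_x] + [oof_preds[m] for m in included + [name]]),
                          y)
            for name in oof_preds if name not in included
        }
        if not candidates:
            return included, best
        name, rmse = min(candidates.items(), key=lambda kv: kv[1])
        if rmse >= best:                               # stop once no candidate improves the RMSE
            return included, best
        included.append(name)
        best = rmse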

Show Code
from sklearn.model_selection import cross_validate
def stepwise_model_selector(X, y, meta_model, models_dict, model_label, verbose=False):
    
    print(f"Current Meta Model: {str(meta_model).split('(')[0]}",end="")

    included_models = []
    
    while True:
        changed=False
        
        excluded_models = list(set(models_dict.keys())-set(included_models)) 
        new_rmse = pd.Series(index=excluded_models)
        current_meta_x = np.array(X)
        
        if len(included_models) > 0:
            for included in included_models:
                included = np.array(models_dict[included][1]).reshape((len(models_dict[included][1]), 1))
                current_meta_x = np.hstack((current_meta_x, included))


        kf = KFold(n_splits=5)
        crossval_error = []
        for i, (train_idx, test_idx) in enumerate(kf.split(current_meta_x)):

            fold_train_x = np.take(current_meta_x, train_idx, axis=0)
            fold_train_y = np.take(y, train_idx, axis=0)
            fold_test_x = np.take(current_meta_x, test_idx, axis=0)
            fold_test_y = np.take(y, test_idx, axis=0)

            if (meta_model == xgb_model):
                val_idxs = np.arange(int(fold_train_x.shape[0]*0.1))
                val_x = np.take(fold_train_x, val_idxs, axis=0)
                val_y = np.take(fold_train_y, val_idxs, axis=0)

                fold_train_x = np.delete(fold_train_x,val_idxs, axis=0)
                fold_train_y = np.delete(fold_train_y,val_idxs, axis=0)
                meta_model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],verbose=0)
                pred = meta_model.predict(fold_test_x)
                error = RMS_error(fold_test_y, pred)
                crossval_error.append(error)
            elif (meta_model == lgbm_model):
                val_idxs = np.arange(int(fold_train_x.shape[0]*0.1))
                val_x = np.take(fold_train_x, val_idxs, axis=0)
                val_y = np.take(fold_train_y, val_idxs, axis=0)

                fold_train_x = np.delete(fold_train_x,val_idxs, axis=0)
                fold_train_y = np.delete(fold_train_y,val_idxs, axis=0)
                meta_model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],callbacks=[log_evaluation(0),early_stopping(10,verbose=False)])
                pred = meta_model.predict(fold_test_x)
                error = RMS_error(fold_test_y, pred)
                crossval_error.append(error)
            else:
                meta_model.fit(fold_train_x, fold_train_y)
                pred = meta_model.predict(fold_test_x)
                error = RMS_error(fold_test_y, pred)
                crossval_error.append(error)
        starting_rmse = np.mean(crossval_error)*-1


        for excluded in excluded_models:
            
            new_yhat = np.array(models_dict[excluded][1]).reshape(-1, 1) 
            meta_x = np.hstack((current_meta_x, new_yhat)) 

            kf = KFold(n_splits=5)
            crossval_error = []
            for i, (train_idx, test_idx) in enumerate(kf.split(meta_x)):

                fold_train_x = np.take(meta_x, train_idx, axis=0)
                fold_train_y = np.take(y, train_idx, axis=0)
                fold_test_x = np.take(meta_x, test_idx, axis=0)
                fold_test_y = np.take(y, test_idx, axis=0)

                if (meta_model == xgb_model):
                    val_idxs = np.arange(int(fold_train_x.shape[0]*0.1))
                    val_x = np.take(fold_train_x, val_idxs, axis=0)
                    val_y = np.take(fold_train_y, val_idxs, axis=0)

                    fold_train_x = np.delete(fold_train_x,val_idxs, axis=0)
                    fold_train_y = np.delete(fold_train_y,val_idxs, axis=0)
                    meta_model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],verbose=0)
                    pred = meta_model.predict(fold_test_x)
                    error = RMS_error(fold_test_y, pred)
                    crossval_error.append(error)
                elif (meta_model == lgbm_model):
                    val_idxs = np.arange(int(fold_train_x.shape[0]*0.1))
                    val_x = np.take(fold_train_x, val_idxs, axis=0)
                    val_y = np.take(fold_train_y, val_idxs, axis=0)

                    fold_train_x = np.delete(fold_train_x,val_idxs, axis=0)
                    fold_train_y = np.delete(fold_train_y,val_idxs, axis=0)
                    meta_model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],callbacks=[log_evaluation(0),early_stopping(10,verbose=False)])
                    pred = meta_model.predict(fold_test_x)
                    error = RMS_error(fold_test_y, pred)
                    crossval_error.append(error)
                else:
                    meta_model.fit(fold_train_x, fold_train_y)
                    pred = meta_model.predict(fold_test_x)
                    error = RMS_error(fold_test_y, pred)
                    crossval_error.append(error)
            rmse = np.mean(crossval_error)*-1
            new_rmse[excluded] = rmse 
        
        best_rmse = new_rmse.max() 
        
        if best_rmse > starting_rmse:  # if the best rmse is better than the initial rmse
            best_feature = new_rmse.idxmax()  # define this as the new best feature
            included_models.append(str(best_feature)) # append this model name to the included list
            changed=True # flag that we changed it
            print(f" | Adding {str(best_feature).split('(')[0]}",end="")
        else: changed = False
        
        if not changed:
            break
    print("")
    return included_models, starting_rmse
meta_model_df = pd.DataFrame(columns=['meta_model','rmse','level0-models'])
meta_errors = []
level0s = []

for model in chosen_models:
    level0_models = model_dict.copy()
    level0_models.pop(str(model).split('(')[0])
    level0, error = stepwise_model_selector(data_x, data_y, model, level0_models, f"{str(model).split('(')[0]} Meta Regressor", verbose=False)
    meta_errors.append(error)
    level0s.append(level0)

meta_model_df['meta_model'] = [str(model).split('(')[0] for model in chosen_models]
meta_model_df['rmse'] = meta_errors
meta_model_df['level0-models'] = level0s

display(meta_model_df.sort_values('rmse', ascending=False))
Current Meta Model: ElasticNet
 | Adding XGBRegressor
 | Adding LGBMRegressor

Current Meta Model: KNeighborsRegressor
 | Adding XGBRegressor
 | Adding Ridge
 | Adding ElasticNet

Current Meta Model: XGBRegressor

Current Meta Model: LGBMRegressor

Current Meta Model: Lasso
 | Adding XGBRegressor
 | Adding LGBMRegressor

Current Meta Model: Ridge
 | Adding XGBRegressor
 | Adding Lasso
 | Adding ElasticNet
 | Adding LGBMRegressor
 | Adding KNeighborsRegressor
meta_model rmse level0-models
0 ElasticNet -3.465764 [XGBRegressor, LGBMRegressor]
4 Lasso -3.473172 [XGBRegressor, LGBMRegressor]
2 XGBRegressor -3.493721 []
1 KNeighborsRegressor -3.496053 [XGBRegressor, Ridge, ElasticNet]
5 Ridge -3.497683 [XGBRegressor, Lasso, ElasticNet, LGBMRegresso...
3 LGBMRegressor -3.505002 []

4.2.4 Output Predictions

Using the best meta-model and level 0 models identified by the stepwise model selector, predictions will be made on the test dataset.

Show Code
#Get models selected by stepwise model selector

meta_model_name = meta_model_df.sort_values('rmse', ascending=False).to_numpy()[0][0]
best_models = meta_model_df.sort_values('rmse', ascending=False).to_numpy()[0][2]

chosen_models_names = [str(model).split('(')[0] for model in chosen_models]
best_models_idxs = np.where(np.isin(chosen_models_names,best_models))[0]

meta_model = chosen_models[np.where(np.isin(chosen_models_names,meta_model_name))[0][0]]

stepwise_models = []
for idx in best_models_idxs:
    stepwise_models.append(chosen_models[idx])

stepwise_models_names = [str(model).split('(')[0] for model in stepwise_models]
Show Code
#Fit the level0 models to the training data

X_train = df[FEATURE_COLUMNS].to_numpy()
y_train = df[TARGET_COLUMN].to_numpy()

for model in stepwise_models:
    if str(model).split('(')[0] == 'LGBMRegressor':
        val_df = df[FEATURE_COLUMNS + TARGET_COLUMN].sample(n=5000, random_state=1)       
        X_val = val_df[FEATURE_COLUMNS].to_numpy()
        y_val = val_df[TARGET_COLUMN].to_numpy()

        train_df = df[FEATURE_COLUMNS + TARGET_COLUMN].drop(val_df.index)
        X_train = train_df[FEATURE_COLUMNS].to_numpy()
        y_train = train_df[TARGET_COLUMN].to_numpy()
        model.fit(X_train, y_train, eval_set=[(X_val, y_val)],callbacks=[log_evaluation(0),early_stopping(10,verbose=False)])

    elif str(model).split('(')[0] == 'XGBRegressor':
        val_df = df[FEATURE_COLUMNS + TARGET_COLUMN].sample(n=5000, random_state=1)       
        X_val = val_df[FEATURE_COLUMNS].to_numpy()
        y_val = val_df[TARGET_COLUMN].to_numpy()

        train_df = df[FEATURE_COLUMNS + TARGET_COLUMN].drop(val_df.index)
        X_train = train_df[FEATURE_COLUMNS].to_numpy()
        y_train = train_df[TARGET_COLUMN].to_numpy()
        model.fit(X_train,y_train,eval_set=[(X_val, y_val)],verbose=0)
    else:
        model.fit(X_train,y_train)
Show Code
# Output the meta model predictions

def create_level1_data(x_data, models):
    meta_x_data = x_data
    for model in models:
        pred = model.predict(x_data)
        meta_x_data = np.column_stack((meta_x_data,pred))
    return meta_x_data

# print("shape without meta features",np.array(data_x).shape)

predictions = np.array([model_dict[model][1] for model in model_dict if str(model).split('(')[0] in stepwise_models_names]).T
meta_x_train = np.column_stack((data_x,predictions))

x_test = test_df[FEATURE_COLUMNS].to_numpy()
meta_x_test = create_level1_data(x_test,stepwise_models)


meta_model.fit(meta_x_train, data_y)
ensb_pred = meta_model.predict(meta_x_test)

submission_df = pd.DataFrame(columns=['uid','fare_amount'])
submission_df['uid'] = test_df['uid']
submission_df['fare_amount'] = ensb_pred
# submission_df[['uid','fare_amount']].to_csv('submissions/Metamodel',index=False)
print("Final Submission")
display(submission_df)
Final Submission
uid fare_amount
0 3 5.818966
1 10 4.357765
2 15 10.557566
3 16 16.233197
4 17 19.386255
... ... ...
34995 69986 14.424490
34996 69989 5.409180
34997 69993 18.343031
34998 69996 13.543978
34999 70000 7.939707

35000 rows × 2 columns

After uploading the submissions to Kaggle, the ensemble model scored an RMSE of 3.35 on the test dataset.