NYC Taxi Fare Predictions

Author

Clay Ndugga

Summary

The goal of the project is to analyze a subset of data from the New York City Taxi dataset and build 2 models to predict a trips fare price. First, exploratory data analysis will be conducted to provide insight into useful features, and help engineer features that will aid in predictions. Then, XGBoost will be used to make initial predictions on the data. Finally, an Ensemble Model (Stacking) will combine the results from multiple models to achieve an even greater prediction accuracy.

Figure 1: Diagram of Stacked Model Ensemble

Multiple models were considered as candidates for the final ensemble, as even models that perform poorly on their own could uncover trends that other models may not be able too due to their differing structure Level 0 models were trained directly on the training data, and their predictions were used as inputs in addition to the training data to train the meta model which outputs the final predictions. A stepwise model selector was used to pick the best perfoming set of level0 and meta models. The list of models considered for both level0 and the meta model were: Linear Regression, Lasso, Huber Regression, Regression, Ridge Regression, Elastic Net, K-Nearest Neighbors, XGB Regressor, and LGBM Regressor.

1 Library Imports

Import libraries

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
import seaborn as sns
import matplotlib.pyplot as plt
from  matplotlib.colors import LinearSegmentedColormap
import plotly.express as px
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

Show Code

# Join the train and test data so feature engineering only has to be done once
# The datasets are seperated before any scaling or predictive analysis to prevent leakage

df = pd.read_csv('raw data/W23P1_train.csv')
df['train'] = 1

test_df = pd.read_csv('raw data/W23P1_test.csv')
test_df['train'] = 0

df = pd.concat([df,test_df])
print(f"Dataframe size: {df.shape}")

df.head()

Dataframe size: (70000, 9)

	uid	fare_amount	pickup_datetime	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	passenger_count	train
0	31722	9.0	2013-01-07 01:50:51 UTC	-73.991421	40.750160	-73.989490	40.726085	2	1
1	14674	14.0	2013-01-15 20:08:00 UTC	-73.997945	40.741057	-73.956223	40.767312	6	1
2	37571	19.5	2013-01-20 00:25:55 UTC	-73.999161	40.688531	-74.026611	40.616634	1	1
3	47583	6.0	2013-01-01 02:30:00 UTC	-73.991490	40.744257	-73.980912	40.748492	1	1
4	29473	33.5	2013-01-02 10:45:00 UTC	-73.972773	40.677702	-73.862242	40.768117	1	1

2 Feature Engineering

Feature engineering is the most important part of building a successful predictive model, as a model can only be as good as the information it has access to. Thus, the goal of exploratory data analysis is to provide insight into which features will be the most useful in predicting taxi fare. According to NYC government website, in 2013, taxi fare prices were determined using the following calculation:

An initial charge of $2.5
$0.30 improvement surcharge
$0.50 charge per 1/5 mile OR $0.50 per minute in slow traffic
$0.50 nighttime surcharge 8pm to 6am
$1.00 rush hour surcharge 4pm-8pm weekdays
Tolls for bridges and tunnels
Airport special rates for JFK and Newark

The initial charge of $2.50 and $0.30 charge are present on every trip meaning they have no effect on our feature selection. Most of the charge comes from the trip distance/time, which implies features that contain information about the trip distance/time will be the most useful. There are additional small charges for certain time periods, and travel through tunnels and bridges. Finally, there are significantly higher rates for trips to and from the nearby airports.

2.1 Onehot encoding categorical features

#Reorder columns
df = df[['uid','fare_amount', 'pickup_datetime','pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude','passenger_count','train']]

df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
df['pickup_hour'] = df['pickup_datetime'].dt.hour   
df['pickup_day'] = df['pickup_datetime'].dt.day_name().astype("category")
df['pickup_day_num'] = df['pickup_datetime'].dt.day
df['pickup_day_type'] = np.where((df['pickup_day'] == 'Sunday') | (df['pickup_day'] =='Saturday'), 'Weekend', 'Weekday')
df['pickup_day_type'] = df['pickup_day_type'].astype("category")

df = pd.get_dummies(data=df)
df['pickup_day'] = df['pickup_datetime'].dt.day_name().astype("category")

2.2 Creating Indicator features

# Airport flag
df['pickup_JFK_airport_flag'] = np.where((df['pickup_latitude'] < 40.67) & (df['pickup_longitude'] > -73.82), 1, 0) 
df['dropoff_JFK_airport_flag'] = np.where((df['dropoff_latitude'] < 40.67) & (df['dropoff_longitude'] > -73.82), 1, 0) 
df['pickup_laguardia_airport_flag'] = np.where((df['pickup_latitude'] > 40.7667) & (df['pickup_longitude'] > -73.888), 1, 0) 
df['dropoff_laguardia_airport_flag'] = np.where((df['dropoff_latitude'] > 40.7667) & (df['dropoff_longitude'] > -73.888), 1, 0) 
df['jfk_airport_flag'] = np.where((df['dropoff_JFK_airport_flag'] == 1) |  (df['pickup_JFK_airport_flag'] == 1), 1, 0) 
df['lgr_airport_flag'] = np.where((df['dropoff_laguardia_airport_flag'] == 1) |  (df['pickup_laguardia_airport_flag'] == 1), 1, 0) 
df['airport'] = np.where((df['dropoff_laguardia_airport_flag'] == 1) |  (df['pickup_laguardia_airport_flag'] == 1) |  (df['dropoff_JFK_airport_flag'] == 1) |  (df['pickup_JFK_airport_flag'] == 1), 1, 0) 

# Night time surcharge 8pm to 6am
df['night_flag'] = np.where((df['pickup_hour'] <= 6) & (df['pickup_hour'] >= 20), 1, 0) 

# Weekday rush hour surcharge 4pm to 8pm
df['rushhour_flag'] = np.where((df['pickup_hour'] >= 16) & (df['pickup_hour'] <= 20), 1, 0)

2.3 Creating distance features

The trip distance can be extracted from the raw data using a function that translates earths latitude longitude coordinates into the straight-line distance between them. However, when driving between destinations it is unlikely the travel path is a straight line. Two possible solutions to this problem were developed. “Manhattan distance” is the sum of the x and y components of a vector and provides the total distance travelled between two points on a grid, much like the streets of New York. By also computing the angle between the trips origin and destination the model will have enough information to extract a more accurate distance estimate.

def haversine_dist(lat1, lon1, lat2, lon2): #returns trip distance in kilometers
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1)*np.cos(lat2)*(np.sin(dlon/2)**2)
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    return 6371 * c

def bearing(lat1, lon1, lat2, lon2):    #returns trip bearing angle in degrees
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    y = np.sin(dlon)*np.cos(lat2)
    x = np.cos(lat1)*np.sin(lat2) - np.sin(lat1)*np.cos(lat2)*np.cos(dlon)
    theta = np.arctan2(y,x)
    return (theta * 180/np.pi +360) % 360

df['trip_distance'] =  np.vectorize(haversine_dist)(df['dropoff_latitude'], df['dropoff_longitude'],df['pickup_latitude'],df['pickup_longitude'])
df['trip_bearing'] = np.vectorize(bearing)(df['pickup_latitude'],df['pickup_longitude'],df['dropoff_latitude'], df['dropoff_longitude'])

The next largest factor in trip fare price from the NYC government website was travel to and from an airport. The two airports found in the dataset were LGR and JFK. The latitude and longitude coordinates of each airport was extracted from google maps and used to create 5 engineered features: dropoff dist to JFK, pickup dist to JFK, dropoff dist to LGR, pickup dist to LGR, and airport flag which was 0 for no airport involved in the trip and 1 for airport involved in trip.

Show Code

df['pickup_dist_to_jfk'] = np.vectorize(haversine_dist)(40.65209631703223, -73.7833074176638,df['pickup_latitude'],df['pickup_longitude'])
df['dropoff_dist_to_jfk'] = np.vectorize(haversine_dist)(40.65209631703223, -73.7833074176638,df['dropoff_latitude'],df['dropoff_longitude'])

df['pickup_dist_to_lgr'] = np.vectorize(haversine_dist)(40.77958127849166, -73.87315146148384,df['pickup_latitude'],df['pickup_longitude'])
df['dropoff_dist_to_lgr'] = np.vectorize(haversine_dist)(40.77958127849166, -73.87315146148384,df['dropoff_latitude'],df['dropoff_longitude'])

2.4 Google Maps API data

The google maps street API was used to calculate the travel distance and travel time between the origin and drop off. A final feature distance delta was created as the difference between the straight-line distance and API distance.

Show Code

train_mask = df['train'] == 1
test_df = df[~train_mask]
df = df[train_mask]

train_dist_df = pd.read_csv('data/api_train.csv')
df = pd.merge(df, train_dist_df, left_on='uid', right_on='uid', how='left')
df['dist_delta'] = abs(df['trip_distance'] - df['api_trip_distance'])

test_dist_df = pd.read_csv('data/api_test.csv')
test_df = pd.merge(test_df, test_dist_df, left_on='uid', right_on='uid', how='left')

test_df.api_trip_distance.fillna(test_df.trip_distance, inplace=True)
test_df.api_trip_duration.fillna(test_df.trip_distance/0.5, inplace=True)   #0.5 km/minute

test_df['dist_delta'] = abs(test_df['trip_distance'] - test_df['api_trip_distance'])

2.5 Create PCA Features

The numeric feature columns were ran through the PCA algorithm to generate 5 vectors for identifying the main axes of variance within a data set.

Show Code

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
numComponents = 5
pca = PCA(n_components=numComponents)
scaler = StandardScaler()

SCALED_FEATURES = ['pickup_latitude','pickup_longitude', 'dropoff_latitude', 'dropoff_longitude','pickup_hour', 'pickup_day_num','trip_distance',
                    'trip_bearing', 'pickup_dist_to_jfk', 'dropoff_dist_to_jfk',
                    'pickup_dist_to_lgr', 'dropoff_dist_to_lgr', 'api_trip_distance',
                    'api_trip_duration', 'dist_delta']

scaled_df = df.copy()          
scaled_df[SCALED_FEATURES] = scaler.fit_transform(df[SCALED_FEATURES])
scaled_test_df = test_df.copy()          
scaled_test_df[SCALED_FEATURES] = scaler.transform(test_df[SCALED_FEATURES])


PCA_DIMENSION_REDUCTION_FEATURES = ['trip_bearing','dropoff_dist_to_jfk','trip_distance','dropoff_dist_to_lgr','pickup_dist_to_jfk','pickup_dist_to_lgr','api_trip_duration','pickup_latitude','dist_delta','pickup_longitude','dropoff_latitude','dropoff_longitude','pickup_hour']

pca_result = pca.fit_transform(scaled_df[PCA_DIMENSION_REDUCTION_FEATURES].values)
df['pca-one'] = pca_result[:,0]
df['pca-two'] = pca_result[:,1] 
df['pca-three'] = pca_result[:,2] 

pca_result = pca.transform(scaled_test_df[PCA_DIMENSION_REDUCTION_FEATURES].values)
test_df['pca-one'] = pca_result[:,0]
test_df['pca-two'] = pca_result[:,1] 
test_df['pca-three'] = pca_result[:,2]

Show Code

print(f"The final feature columns are:")
df.head()

The final feature columns are:

	uid	fare_amount	pickup_datetime	pickup_latitude	pickup_longitude	dropoff_latitude	dropoff_longitude	passenger_count	train	pickup_hour	pickup_day_num	pickup_day_Monday	pickup_day_Sunday	pickup_day_Tuesday	pickup_day_Wednesday	pickup_day_type_Weekday	pickup_day_type_Weekend	pickup_day	dropoff_laguardia_airport_flag	lgr_airport_flag	airport	rushhour_flag	trip_distance	trip_bearing	pickup_dist_to_jfk	dropoff_dist_to_jfk	pickup_dist_to_lgr	dropoff_dist_to_lgr	api_trip_distance	api_trip_duration	dist_delta	pca-one	pca-two	pca-three
0	31722	9.0	2013-01-07 01:50:51+00:00	40.750160	-73.991421	40.726085	-73.989490	2	1	1	7	1	0	0	0	1	0	Monday	0	0	0	0	2.681957	176.521589	20.656392	19.232673	10.483987	11.463754	3.7	15	1.018043	0.067223	1.114535	-0.108649
1	14674	14.0	2013-01-15 20:08:00+00:00	40.741057	-73.997945	40.767312	-73.956223	6	1	20	15	0	0	1	0	1	0	Tuesday	0	0	0	1	4.568758	50.269314	20.622338	19.405047	11.350059	7.127057	6.6	19	2.031242	0.331220	-0.484364	-1.585731
2	37571	19.5	2013-01-20 00:25:55+00:00	40.688531	-73.999161	40.616634	-74.026611	1	1	0	20	0	1	0	0	0	1	Sunday	0	0	0	0	8.323209	196.163064	18.650063	20.906095	14.670655	22.263499	10.2	12	1.876791	2.684151	6.474439	0.776196
3	47583	6.0	2013-01-01 02:30:00+00:00	40.744257	-73.991490	40.748492	-73.980912	1	1	2	1	0	0	1	0	1	0	Tuesday	0	0	0	0	1.007887	62.142206	20.323242	19.808678	10.712804	9.711654	1.4	7	0.392113	-0.854687	0.616436	-1.010020
4	29473	33.5	2013-01-02 10:45:00+00:00	40.677702	-73.972773	40.768117	-73.862242	1	1	10	2	0	0	0	1	1	0	Wednesday	1	1	1	0	13.705437	42.778729	16.232151	14.515472	14.099731	1.571294	18.2	25	4.494563	4.657020	-2.091841	-4.205518

3 Exploratory Data Analysis

3.1 Pickup location

Finding the location with the most pickups may provide insight into predicting trip fare, as busy areas will have more traffic, which means longer trips, which means higher trip fare. The location with the most pickups can be found by running the DBSCAN clustering algorithm on the latitude and longitude coordinates of the trips. Unlike other clustering algorithms, the primary factor used to create clusters in DBSCAN is point density making it suitable for this application. Running the algorithm on the location data and coloring the data points based on the number of points in each cluster produces the plot Figure 1

Show Code

from sklearn.cluster import DBSCAN

coords = df[['pickup_latitude','pickup_longitude']].to_numpy()
max_dist = 0.05
kms_per_radian = 6371.0088
epsilon = max_dist / kms_per_radian
db = DBSCAN(eps=epsilon, min_samples = 30, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
df['db_cluster_labels'] = db.labels_
dbcluster_to_count_map = df['db_cluster_labels'].value_counts().to_dict()
df['db_cluster_count'] = df['db_cluster_labels'].apply(lambda cluster: dbcluster_to_count_map[cluster])

tPickup_fig = px.scatter_mapbox(df.drop(df[df['db_cluster_labels'] == -1].index), lat='pickup_latitude', lon='pickup_longitude', hover_name='db_cluster_labels', hover_data=['db_cluster_labels'],
    color='db_cluster_count', color_continuous_scale=px.colors.diverging.RdBu, opacity=.2,
    center={'lat': 40.75153, 'lon': -73.98987}, 
    zoom=15,
    height=500,width=1150,
    mapbox_style="carto-positron", labels=None);
tPickup_fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0}).update_coloraxes(showscale=True);
tPickup_fig.show();
# tPickup_fig.show()

Figure 1: Locations with highest pickup density

The locations with the highest density of pickups are Pennsylvania Station, and Grand Central Terminal. This makes sense as both locations are large hubs for public transit.

3.2 Trip Time

Determining whether the different days of the week have a different average trip fare will help discern whether weekday will be a useful feature for the predictive model. It is likely people exhibit different behaviour on the weekends as they are not working. Additionally, from the NYC government website, there are charges the occur only on weekdays. By aggregating the data by weekday, the following plots are produced.

Show Code

fig, ax = plt.subplots(nrows=1, ncols=2,figsize=(12, 4), dpi = 175)

sns.set_style("darkgrid", {"axes.facecolor": ".9",'axes.edgecolor': 'white','patch.edgecolor': '.9',
                'patch.force_edgecolor': False})

sns.barplot(data=df,x="pickup_day",y="fare_amount",order=['Sunday','Monday', 'Tuesday', 'Wednesday', 'Thursday','Friday', 'Saturday'],
            errwidth=1.2,capsize=0.05, ax=ax[1],palette=sns.color_palette("rocket",7))

sns.countplot(data=df,x="pickup_day", ax=ax[0],order=['Sunday','Monday', 'Tuesday', 'Wednesday', 'Thursday','Friday', 'Saturday'],palette=sns.color_palette("rocket",7))#,palette=sns.color_palette("mako",7))#,hue='pickup_day')

ax[0].tick_params(axis='x', labelrotation=20)
ax[1].tick_params(axis='x', labelrotation=20)
ax[1].set_ylim(11,13);

For similar reasoning to trip weekday, trip hour may be a useful predictor.

Show Code

fig, ax = plt.subplots(nrows=1, ncols=2,figsize=(12, 4), dpi = 175)
# sns.lineplot(x=df3.index, y=df3['fare_amount']['mean'], ax=ax)
sns.set_style("darkgrid", {"axes.facecolor": ".9",'axes.edgecolor': 'white','patch.edgecolor': 'black',
                'patch.force_edgecolor': False})#,'image.cmap': 'mako')

sns.barplot(data=df,x="pickup_hour",y="fare_amount",order=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23],
            errwidth=1.2,capsize=0.05,palette=sns.color_palette("rocket",24),ax=ax[1])#palette=sns.color_palette("mako",24))

sns.countplot(data=df,x="pickup_hour", ax=ax[0],order=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23],palette=sns.color_palette("rocket",24))#,palette=sns.color_palette("mako",7))#,hue='pickup_day')

ax[1].set_ylim(10,20);

Both trip weekday, and trip hour exhibit a varyinga average fare price and therefore will be useful feature columns.

3.3 Trip distance

Show Code

fig, ax = plt.subplots(figsize=(12, 4), dpi = 175)
sns.set_style("darkgrid", {"axes.facecolor": ".9",'axes.edgecolor': 'white','patch.edgecolor': 'black','patch.force_edgecolor': False})
sns.scatterplot(data=df, x='trip_distance', y='fare_amount',s=2,ax=ax,hue='airport',palette=[sns.color_palette("rocket",5)[0],sns.color_palette("rocket",5)[3]])
# sns.boxplot(data=df, x='api_trip_duration', y='fare_amount',ax=ax,hue='airport')#,fliersize=0)
ax.set_xlim(0,24);
ax.set_ylim(0,60);

As indicated by the NYC government website, fare price is a linear function of the trip distance, which is reflected in Figure 4. Trips involving the airport are colored in orange and have a higher average fare price than normal trips. The horizontal lines at the top right of the plot indicate fixed airport trip fees. This confirms distance and airport related metrics will be helpful to the predictive model.

4 Prediction Models

4.1 XGBoost

4.1.1 Hyperparameter Tuning

A common approach to select the best model parameters is to create a grid containing all possible parameter configurations and evaluate each one, however, for a model like XGBoost that has over 7 parameters—each taking a wide range of values—this is not computationally feasible. Instead, we will use the Hyperopt library that applies the Tree of Parzens (TPE) Algorithm to check a subset of parameter configurations and estimate the best combination.

from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

#Define the sample space of parameter combinations
space={'max_depth': hp.quniform("max_depth", 3, 18, 1),
        'gamma': hp.uniform ('gamma', 0,50),
        'alpha' : hp.uniform('alpha', 0,50),
        'reg_lambda' : hp.uniform('reg_lambda', 0,50),
        'colsample_bytree' : hp.uniform('colsample_bytree', 0.4,1),
        'min_child_weight' : hp.choice('min_child_weight',[0,1,2,3,5,8]),
        'n_estimators': 3000,
        'eta': hp.choice('eta',[0.001,0.01,0.02,0.03])
    }

Show Code

from sklearn.model_selection import KFold
kf = KFold(n_splits=5)

def RMS_error(Y,Y_hat):
    return np.sqrt(np.mean(np.power(np.subtract(Y,Y_hat),2)))

#Objective function that the TPE algorithm will attempt to minimize. 
#Returns the RMSE of a set of parameters calcualted by 5-fold cross validation
def objective(space):
    model=xgb.XGBRegressor(
                    max_depth = int(space['max_depth']),
                    gamma = space['gamma'],
                    alpha = space['alpha'],
                    reg_lambda = space['reg_lambda'],
                    colsample_bytree=space['colsample_bytree'],
                    min_child_weight=int(space['min_child_weight']),
                    n_estimators = int(space['n_estimators']), 
                    eta = space['eta'],
                    early_stopping_rounds = 15,
                    eval_metric='rmse'
                    )

    crossval_error = []
    for i, (train_idx, test_idx) in enumerate(kf.split(x_train)):
        fold_train_x = np.take(x_train, train_idx, axis=0)
        fold_train_y = np.take(y_train, train_idx, axis=0)
        fold_test_x = np.take(x_train, test_idx, axis=0)
        fold_test_y = np.take(y_train, test_idx, axis=0)

        val_idxs = np.arange(int(fold_train_x.shape[0]*0.1))
        val_x = np.take(fold_train_x, val_idxs, axis=0)
        val_y = np.take(fold_train_y, val_idxs, axis=0)

        fold_train_x = np.delete(fold_train_x,val_idxs, axis=0)
        fold_train_y = np.delete(fold_train_y,val_idxs, axis=0)
        model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],verbose=0)
        pred = model.predict(fold_test_x)
        error = RMS_error(fold_test_y, pred)
        crossval_error.append(error)
    error = np.mean(crossval_error)
    print (f"max_depth: {int(space['max_depth'])} | gamma: {space['gamma']:.5} | alpha : {space['alpha']:.5} | reg_lambda : {space['reg_lambda']:.5} | colsample_bytree : {space['colsample_bytree']:.5} | min_child_weight : {int(space['min_child_weight'])} | eta : {space['eta']} | error: {error:.5}")

    return {'loss': error, 'status': STATUS_OK }

FEATURE_COLUMNS = ['trip_bearing', 'dropoff_dist_to_jfk','trip_distance', 'dropoff_dist_to_lgr','pickup_dist_to_jfk','pickup_dist_to_lgr',
       'api_trip_duration','pickup_latitude','dist_delta',
       'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude','pickup_hour','airport',
       'pickup_day_Monday', 'pickup_day_Saturday', 'pickup_day_Sunday',
       'pickup_day_Thursday','pickup_day_Tuesday', 'pickup_day_Wednesday','pickup_day_Friday','pca-one', 'pca-two', 'pca-three','night_flag','rushhour_flag']
TARGET_COLUMN = ['fare_amount']

x_train = df[FEATURE_COLUMNS].to_numpy()
y_train = df[TARGET_COLUMN].to_numpy()

trials = Trials()

# Hyperopt can take multiple hours to run, commented out to decrease render time
# best_hyperparams = fmin(fn = objective, 
#                         space = space, 
#                         algo = tpe.suggest, 
#                         max_evals = 200,
#                         trials = trials)

Hyperopt returned the following parameters to minimize the RMSE:

{'max_depth': 18 , 'gamma': 8.6581 , 'alpha' : 45.457 , 'reg_lambda' : 48.254 , 'colsample_bytree' : 0.555 , 'min_child_weight' : 8 , 'eta' : 0.03,'n_estimators': 2000}

4.1.2 Model Fitting

import xgboost as xgb

params = {'max_depth': 18 , 'gamma': 8.6581 , 'alpha' : 45.457 , 'reg_lambda' : 48.254 , 'colsample_bytree' : 0.555 , 'min_child_weight' : 8 , 'eta' : 0.03,'n_estimators': 2000,'verbose_eval': True} #best params

num_rounds = 5000
model = None

val_df = df[FEATURE_COLUMNS + TARGET_COLUMN].sample(n=5000, random_state=1)       
X_val = val_df[FEATURE_COLUMNS].to_numpy()
y_val = val_df[TARGET_COLUMN].to_numpy().reshape(5000,)

train_df = df[FEATURE_COLUMNS + TARGET_COLUMN].drop(val_df.index)
X_train = train_df[FEATURE_COLUMNS].to_numpy()
y_train = train_df[TARGET_COLUMN].to_numpy().reshape(30000,)

dtrain = xgb.DMatrix(X_train, y_train, feature_names=FEATURE_COLUMNS)
dval = xgb.DMatrix(X_val, y_val, feature_names=FEATURE_COLUMNS)


main_model = xgb.train(params, dtrain, num_rounds, early_stopping_rounds=10, #10
                      evals=[(dtrain, 'train'), (dval, 'eval')],
                      xgb_model=model, verbose_eval=False)
print(f"The best validation RMSE acheived: {main_model.best_score:.5}")

[16:23:35] WARNING: C:/buildkite-agent/builds/buildkite-windows-cpu-autoscaling-group-i-0fc7796c793e6356f-1/xgboost/xgboost-ci-windows/src/learner.cc:767: 
Parameters: { "n_estimators", "verbose_eval" } are not used.

The best validation RMSE acheived: 3.9169

4.1.3 Output Predictions

Show Code

X_testmain = test_df[FEATURE_COLUMNS]
dtest = xgb.DMatrix(X_testmain, feature_names=FEATURE_COLUMNS)
y_pred = main_model.predict(dtest)

submission_df = pd.DataFrame(columns=['uid','fare_amount'])
submission_df['uid'] = test_df['uid']
submission_df['fare_amount'] = y_pred
# submission_df[['uid','fare_amount']].to_csv('submissions/XGB_pca123_night_rushhour_earlystop50_21.csv',index=False)
display(submission_df)

	uid	fare_amount
0	3	5.879074
1	10	4.249468
2	15	10.557648
3	16	16.221519
4	17	19.020933
...	...	...
34995	69986	14.189826
34996	69989	5.366324
34997	69993	18.416437
34998	69996	13.369841
34999	70000	7.867848

35000 rows × 2 columns

Uploading the submissions to kaggle, XGBoost scored an RMSE of 3.36 on the test dataset.

4.2 Ensemble Model

4.2.1 Prepare Model Candidates

The hyper parameters for each model (aside from XGB, and LightGBM) were selected by conducting and exhuastive grid search of all possible combiantions of hyper parameters.

Show Code

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, HuberRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
import xgboost as xgb
import lightgbm as lgbm
from lightgbm import early_stopping
from lightgbm import log_evaluation
from sklearn.model_selection import KFold
import itertools

def RMS_error(Y,Y_hat):
    return np.sqrt(np.mean(np.power(np.subtract(Y,Y_hat),2)))

def param_grid(*lists):
    return list(itertools.product(*lists))

df = pd.read_csv('processed dataframes/scaled_train.csv')
test_df = pd.read_csv('processed dataframes/scaled_test.csv')

svr_model = SVR(kernel = 'linear',epsilon=.1)
ridge_model = Ridge(alpha=198)                                                                           
lasso_model = Lasso(alpha=0.04)                                                                          
elastic_model = ElasticNet(alpha=0.062 , l1_ratio= 1)                                                    
huber_model = HuberRegressor(epsilon=7,max_iter=3000,alpha=30)                                             
knn_model = KNeighborsRegressor(n_neighbors = 15, weights='distance', leaf_size=15,p=2)                          
linear_model = LinearRegression()                                                                     

xgb_model = xgb.XGBRegressor(alpha = 45.457, colsample_bytree = 0.555, reg_lambda = 48.254,n_estimators=2000, min_child_weight = 8, gamma=8.6581, max_depth=18, eta=0.03, eval_metric='rmse', early_stopping_rounds = 12) #best xgb
lgbm_model = lgbm.LGBMRegressor(num_leaves = 80, max_depth = 25, min_child_samples = 60, learning_rate = 0.05, subsample_freq = 20, subsample = 1, colsample_bytree = 0.8, n_estimators = 50000, max_bin = 10000, boosting_type='gbdt', objective= 'regression',n_jobs= -1, metric= 'rmse',early_stopping_round=10, verbose=-1)   

forest_model = RandomForestRegressor(bootstrap= 1, ccp_alpha= 2.973, max_depth= 16, max_features= 18, min_samples_leaf= 5, min_samples_split= 6, n_estimators= 718) 

chosen_models = [elastic_model, knn_model, xgb_model, lgbm_model,lasso_model,ridge_model]# chosen_models = [elastic_model, knn_model, xgb_model, lgbm_model,lasso_model,huber_model,ridge_model]

4.2.2 Get Out-of-Fold (OOF) Predictions

Using 10 folds each model was fitted on 9 folds then used to predict the fare amount on the remaining fold. This was repeated with each fold being left out once, generating out-of-fold predictions for the entire train dataset. Doing this prevents data leakage, as otherwise the models would be predicting on datapoints they have already seen.

Show Code

def get_oof_predictions(x, y, models):

    # instantiate a KFold with 10 splits
    kfold = KFold(n_splits=10, shuffle=True, random_state=1)
    
    # prepare lists to hold the re-ordered x and y values
    data_x, data_y  = [], []
    
    # run the following block for each of the 10 kfold splits
    for i, (train_idx, test_idx) in enumerate(kfold.split(x_train)):
        print(f"Working Fold: {i}")

        fold_train_x = np.take(x_train, train_idx, axis=0)
        fold_train_y = np.take(y_train, train_idx, axis=0)
        fold_test_x = np.take(x_train, test_idx, axis=0)
        fold_test_y = np.take(y_train, test_idx, axis=0)
    
        data_x.extend(fold_test_x)
        data_y.extend(fold_test_y)
    
        for item in models:
            print(f"Fold: {i} | Fitting model: {item}")
            model = models[item][0] 

            if item == 'XGBRegressor':
                val_idxs = np.arange(int(fold_train_x.shape[0]*0.1))
                val_x = np.take(fold_train_x, val_idxs, axis=0)
                val_y = np.take(fold_train_y, val_idxs, axis=0)

                fold_train_x = np.delete(fold_train_x,val_idxs, axis=0)
                fold_train_y = np.delete(fold_train_y,val_idxs, axis=0)

                model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],verbose=0)
                print(f"Fold: {i} | Validation RMSE: {model.best_score}")
                predictions = model.predict(fold_test_x)
            elif item == 'LGBMRegressor':
                val_idxs = np.arange(int(fold_train_x.shape[0]*0.1))
                val_x = np.take(fold_train_x, val_idxs, axis=0)
                val_y = np.take(fold_train_y, val_idxs, axis=0)

                fold_train_x = np.delete(fold_train_x,val_idxs, axis=0)
                fold_train_y = np.delete(fold_train_y,val_idxs, axis=0)

                model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],callbacks=[log_evaluation(0),early_stopping(10,verbose=False)])
                print(f"Fold: {i} | Validation RMSE: {model.best_score_['valid_0']['rmse']}")
                predictions = model.predict(fold_test_x)
            else:
                model.fit(fold_train_x, fold_train_y) 
                predictions = model.predict(fold_test_x)

            models[item][1].extend(predictions) 
    
    return data_x, data_y, models

x_train = df[FEATURE_COLUMNS].to_numpy()
y_train = df['fare_amount'].to_numpy()

model_dict = {str(model).split('(')[0]: [model,[]] for model in chosen_models}
data_x, data_y, trained_models = get_oof_predictions(x_train,y_train,model_dict)

Working Fold: 0
Fold: 0 | Fitting model: ElasticNet
Fold: 0 | Fitting model: KNeighborsRegressor

Fold: 0 | Fitting model: XGBRegressor

Fold: 0 | Validation RMSE: 3.34340422098736
Fold: 0 | Fitting model: LGBMRegressor

Fold: 0 | Validation RMSE: 2.5347304308734233
Fold: 0 | Fitting model: Lasso

Fold: 0 | Fitting model: Ridge
Working Fold: 1
Fold: 1 | Fitting model: ElasticNet
Fold: 1 | Fitting model: KNeighborsRegressor

Fold: 1 | Fitting model: XGBRegressor

Fold: 1 | Validation RMSE: 3.0859730253115103
Fold: 1 | Fitting model: LGBMRegressor

Fold: 1 | Validation RMSE: 2.5913792900382613
Fold: 1 | Fitting model: Lasso

Fold: 1 | Fitting model: Ridge
Working Fold: 2
Fold: 2 | Fitting model: ElasticNet
Fold: 2 | Fitting model: KNeighborsRegressor

Fold: 2 | Fitting model: XGBRegressor

Fold: 2 | Validation RMSE: 3.11564875559071
Fold: 2 | Fitting model: LGBMRegressor

Fold: 2 | Validation RMSE: 2.5491994753213163
Fold: 2 | Fitting model: Lasso

Fold: 2 | Fitting model: Ridge
Working Fold: 3
Fold: 3 | Fitting model: ElasticNet

Fold: 3 | Fitting model: KNeighborsRegressor
Fold: 3 | Fitting model: XGBRegressor

Fold: 3 | Validation RMSE: 3.2941259665900944
Fold: 3 | Fitting model: LGBMRegressor

Fold: 3 | Validation RMSE: 2.5414740589396803
Fold: 3 | Fitting model: Lasso

Fold: 3 | Fitting model: Ridge
Working Fold: 4
Fold: 4 | Fitting model: ElasticNet
Fold: 4 | Fitting model: KNeighborsRegressor

Fold: 4 | Fitting model: XGBRegressor

Fold: 4 | Validation RMSE: 3.159533681316195
Fold: 4 | Fitting model: LGBMRegressor

Fold: 4 | Validation RMSE: 2.542693798429106
Fold: 4 | Fitting model: Lasso

Fold: 4 | Fitting model: Ridge
Working Fold: 5
Fold: 5 | Fitting model: ElasticNet
Fold: 5 | Fitting model: KNeighborsRegressor

Fold: 5 | Fitting model: XGBRegressor

Fold: 5 | Validation RMSE: 3.3497618072423796
Fold: 5 | Fitting model: LGBMRegressor

Fold: 5 | Validation RMSE: 2.6832628410176644
Fold: 5 | Fitting model: Lasso

Fold: 5 | Fitting model: Ridge
Working Fold: 6
Fold: 6 | Fitting model: ElasticNet

Fold: 6 | Fitting model: KNeighborsRegressor
Fold: 6 | Fitting model: XGBRegressor

Fold: 6 | Validation RMSE: 3.1627186555477724
Fold: 6 | Fitting model: LGBMRegressor

Fold: 6 | Validation RMSE: 2.708755394817404
Fold: 6 | Fitting model: Lasso

Fold: 6 | Fitting model: Ridge
Working Fold: 7
Fold: 7 | Fitting model: ElasticNet
Fold: 7 | Fitting model: KNeighborsRegressor

Fold: 7 | Fitting model: XGBRegressor

Fold: 7 | Validation RMSE: 3.330761072419705
Fold: 7 | Fitting model: LGBMRegressor

Fold: 7 | Validation RMSE: 2.6401084772362657
Fold: 7 | Fitting model: Lasso

Fold: 7 | Fitting model: Ridge
Working Fold: 8
Fold: 8 | Fitting model: ElasticNet
Fold: 8 | Fitting model: KNeighborsRegressor

Fold: 8 | Fitting model: XGBRegressor

Fold: 8 | Validation RMSE: 3.342755661846802
Fold: 8 | Fitting model: LGBMRegressor

Fold: 8 | Validation RMSE: 2.6109948358743615
Fold: 8 | Fitting model: Lasso

Fold: 8 | Fitting model: Ridge
Working Fold: 9
Fold: 9 | Fitting model: ElasticNet
Fold: 9 | Fitting model: KNeighborsRegressor

Fold: 9 | Fitting model: XGBRegressor

Fold: 9 | Validation RMSE: 3.2663211603775753
Fold: 9 | Fitting model: LGBMRegressor

Fold: 9 | Validation RMSE: 2.611919757939597
Fold: 9 | Fitting model: Lasso

Fold: 9 | Fitting model: Ridge

4.2.3 Stepwise Model Selection

Not all level 0 model may necessarily improve the performance of the final meta model, so they must be selected carefully. To do this, level0 models will be added to the meta model forward stepwise until performance of the meta model stops increasing, like a stepwise feature selector with level 0 models instead of features.

Show Code

from sklearn.model_selection import cross_validate
def stepwise_model_selector(X, y, meta_model, models_dict, model_label, verbose=False):
    
    print(f"Current Meta Model: {str(meta_model).split('(')[0]}",end="")

    included_models = []
    
    while True:
        changed=False
        
        excluded_models = list(set(models_dict.keys())-set(included_models)) 
        new_rmse = pd.Series(index=excluded_models)
        current_meta_x = np.array(X)
        
        if len(included_models) > 0:
            for included in included_models:
                included = np.array(models_dict[included][1]).reshape((len(models_dict[included][1]), 1))
                current_meta_x = np.hstack((current_meta_x, included))


        kf = KFold(n_splits=5)
        crossval_error = []
        for i, (train_idx, test_idx) in enumerate(kf.split(current_meta_x)):

            fold_train_x = np.take(current_meta_x, train_idx, axis=0)
            fold_train_y = np.take(y, train_idx, axis=0)
            fold_test_x = np.take(current_meta_x, test_idx, axis=0)
            fold_test_y = np.take(y, test_idx, axis=0)

            if (meta_model == xgb_model):
                val_idxs = np.arange(int(fold_train_x.shape[0]*0.1))
                val_x = np.take(fold_train_x, val_idxs, axis=0)
                val_y = np.take(fold_train_y, val_idxs, axis=0)

                fold_train_x = np.delete(fold_train_x,val_idxs, axis=0)
                fold_train_y = np.delete(fold_train_y,val_idxs, axis=0)
                meta_model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],verbose=0)
                pred = meta_model.predict(fold_test_x)
                error = RMS_error(fold_test_y, pred)
                crossval_error.append(error)
            elif (meta_model == lgbm_model):
                val_idxs = np.arange(int(fold_train_x.shape[0]*0.1))
                val_x = np.take(fold_train_x, val_idxs, axis=0)
                val_y = np.take(fold_train_y, val_idxs, axis=0)

                fold_train_x = np.delete(fold_train_x,val_idxs, axis=0)
                fold_train_y = np.delete(fold_train_y,val_idxs, axis=0)
                meta_model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],callbacks=[log_evaluation(0),early_stopping(10,verbose=False)])
                pred = meta_model.predict(fold_test_x)
                error = RMS_error(fold_test_y, pred)
                crossval_error.append(error)
            else:
                meta_model.fit(fold_train_x, fold_train_y)
                pred = meta_model.predict(fold_test_x)
                error = RMS_error(fold_test_y, pred)
                crossval_error.append(error)
        starting_rmse = np.mean(crossval_error)*-1


        for excluded in excluded_models:
            
            new_yhat = np.array(models_dict[excluded][1]).reshape(-1, 1) 
            meta_x = np.hstack((current_meta_x, new_yhat)) 

            kf = KFold(n_splits=5)
            crossval_error = []
            for i, (train_idx, test_idx) in enumerate(kf.split(meta_x)):

                fold_train_x = np.take(meta_x, train_idx, axis=0)
                fold_train_y = np.take(y, train_idx, axis=0)
                fold_test_x = np.take(meta_x, test_idx, axis=0)
                fold_test_y = np.take(y, test_idx, axis=0)

                if (meta_model == xgb_model):
                    val_idxs = np.arange(int(fold_train_x.shape[0]*0.1))
                    val_x = np.take(fold_train_x, val_idxs, axis=0)
                    val_y = np.take(fold_train_y, val_idxs, axis=0)

                    fold_train_x = np.delete(fold_train_x,val_idxs, axis=0)
                    fold_train_y = np.delete(fold_train_y,val_idxs, axis=0)
                    meta_model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],verbose=0)
                    pred = meta_model.predict(fold_test_x)
                    error = RMS_error(fold_test_y, pred)
                    crossval_error.append(error)
                elif (meta_model == lgbm_model):
                    val_idxs = np.arange(int(fold_train_x.shape[0]*0.1))
                    val_x = np.take(fold_train_x, val_idxs, axis=0)
                    val_y = np.take(fold_train_y, val_idxs, axis=0)

                    fold_train_x = np.delete(fold_train_x,val_idxs, axis=0)
                    fold_train_y = np.delete(fold_train_y,val_idxs, axis=0)
                    meta_model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],callbacks=[log_evaluation(0),early_stopping(10,verbose=False)])
                    pred = meta_model.predict(fold_test_x)
                    error = RMS_error(fold_test_y, pred)
                    crossval_error.append(error)
                else:
                    meta_model.fit(fold_train_x, fold_train_y)
                    pred = meta_model.predict(fold_test_x)
                    error = RMS_error(fold_test_y, pred)
                    crossval_error.append(error)
            rmse = np.mean(crossval_error)*-1
            new_rmse[excluded] = rmse 
        
        best_rmse = new_rmse.max() 
        
        if best_rmse > starting_rmse:  # if the best rmse is better than the initial rmse
            best_feature = new_rmse.idxmax()  # define this as the new best feature
            included_models.append(str(best_feature)) # append this model name to the included list
            changed=True # flag that we changed it
            print(f" | Adding {str(best_feature).split('(')[0]}",end="")
        else: changed = False
        
        if not changed:
            break
    print("")
    return included_models, starting_rmse

meta_model_df = pd.DataFrame(columns=['meta_model','rmse','level0-models'])
meta_errors = []
level0s = []

for model in chosen_models:
    level0_models = model_dict.copy()
    level0_models.pop(str(model).split('(')[0])
    level0, error = stepwise_model_selector(data_x, data_y, model, level0_models, f"{str(model).split('(')[0]} Meta Regressor", verbose=False)
    meta_errors.append(error)
    level0s.append(level0)

meta_model_df['meta_model'] = [str(model).split('(')[0] for model in chosen_models]
meta_model_df['rmse'] = meta_errors
meta_model_df['level0-models'] = level0s

display(meta_model_df.sort_values('rmse', ascending=False))

Current Meta Model: ElasticNet

 | Adding XGBRegressor

 | Adding LGBMRegressor


Current Meta Model: KNeighborsRegressor

 | Adding XGBRegressor

 | Adding Ridge

 | Adding ElasticNet


Current Meta Model: XGBRegressor


Current Meta Model: LGBMRegressor


Current Meta Model: Lasso

 | Adding XGBRegressor

 | Adding LGBMRegressor


Current Meta Model: Ridge

 | Adding XGBRegressor

 | Adding Lasso

 | Adding ElasticNet

 | Adding LGBMRegressor

 | Adding KNeighborsRegressor

	meta_model	rmse	level0-models
0	ElasticNet	-3.465764	[XGBRegressor, LGBMRegressor]
4	Lasso	-3.473172	[XGBRegressor, LGBMRegressor]
2	XGBRegressor	-3.493721	[]
1	KNeighborsRegressor	-3.496053	[XGBRegressor, Ridge, ElasticNet]
5	Ridge	-3.497683	[XGBRegressor, Lasso, ElasticNet, LGBMRegresso...
3	LGBMRegressor	-3.505002	[]

4.2.4 Output Predictions

Using the best identified meta a level0 models from the stepwise model selector, predictions will be made on the test dataset.

Show Code

#Get models selected by stepwise model selector

meta_model_name = meta_model_df.sort_values('rmse', ascending=False).to_numpy()[0][0]
best_models = meta_model_df.sort_values('rmse', ascending=False).to_numpy()[0][2]

chosen_models_names = [str(model).split('(')[0] for model in chosen_models]
best_models_idxs = np.where(np.isin(chosen_models_names,best_models))[0]

meta_model = chosen_models[np.where(np.isin(chosen_models_names,meta_model_name))[0][0]]

stepwise_models = []
for idx in best_models_idxs:
    stepwise_models.append(chosen_models[idx])

stepwise_models_names = [str(model).split('(')[0] for model in stepwise_models]

Show Code

#Fit the level0 models to the training data

X_train = df[FEATURE_COLUMNS].to_numpy()
y_train = df[TARGET_COLUMN].to_numpy()

for model in stepwise_models:
    if str(model).split('(')[0] == 'LGBMRegressor':
        val_df = df[FEATURE_COLUMNS + TARGET_COLUMN].sample(n=5000, random_state=1)       
        X_val = val_df[FEATURE_COLUMNS].to_numpy()
        y_val = val_df[TARGET_COLUMN].to_numpy()

        train_df = df[FEATURE_COLUMNS + TARGET_COLUMN].drop(val_df.index)
        X_train = train_df[FEATURE_COLUMNS].to_numpy()
        y_train = train_df[TARGET_COLUMN].to_numpy()
        model.fit(X_train, y_train, eval_set=[(X_val, y_val)],callbacks=[log_evaluation(0),early_stopping(10,verbose=False)])

    elif str(model).split('(')[0] == 'XGBRegressor':
        val_df = df[FEATURE_COLUMNS + TARGET_COLUMN].sample(n=5000, random_state=1)       
        X_val = val_df[FEATURE_COLUMNS].to_numpy()
        y_val = val_df[TARGET_COLUMN].to_numpy()

        train_df = df[FEATURE_COLUMNS + TARGET_COLUMN].drop(val_df.index)
        X_train = train_df[FEATURE_COLUMNS].to_numpy()
        y_train = train_df[TARGET_COLUMN].to_numpy()
        model.fit(X_train,y_train,eval_set=[(X_val, y_val)],verbose=0)
    else:
        model.fit(X_train,y_train)

Show Code

# Output the meta model predictions

def create_level1_data(x_data, models):
    meta_x_data = x_data
    for model in models:
        pred = model.predict(x_data)
        meta_x_data = np.column_stack((meta_x_data,pred))
    return meta_x_data

# print("shape without meta features",np.array(data_x).shape)

predictions = np.array([model_dict[model][1] for model in model_dict if str(model).split('(')[0] in stepwise_models_names]).T
meta_x_train = np.column_stack((data_x,predictions))

x_test = test_df[FEATURE_COLUMNS].to_numpy()
meta_x_test = create_level1_data(x_test,stepwise_models)


meta_model.fit(meta_x_train, data_y)
ensb_pred = meta_model.predict(meta_x_test)

submission_df = pd.DataFrame(columns=['uid','fare_amount'])
submission_df['uid'] = test_df['uid']
submission_df['fare_amount'] = ensb_pred
# submission_df[['uid','fare_amount']].to_csv('submissions/Metamodel',index=False)
print("Final Submission")
display(submission_df)

Final Submission

	uid	fare_amount
0	3	5.818966
1	10	4.357765
2	15	10.557566
3	16	16.233197
4	17	19.386255
...	...	...
34995	69986	14.424490
34996	69989	5.409180
34997	69993	18.343031
34998	69996	13.543978
34999	70000	7.939707

35000 rows × 2 columns

Uploading the submissions to kaggle, the Ensemble model scored an RMSE of 3.35 on the test dataset.