The goal of the project is to analyze a subset of data from the New York City Taxi dataset and build 2 models to predict a trips fare price. First, exploratory data analysis will be conducted to provide insight into useful features, and help engineer features that will aid in predictions. Then, XGBoost will be used to make initial predictions on the data. Finally, an Ensemble Model (Stacking) will combine the results from multiple models to achieve an even greater prediction accuracy.
Multiple models were considered as candidates for the final ensemble, as even models that perform poorly on their own could uncover trends that other models may not be able too due to their differing structure Level 0 models were trained directly on the training data, and their predictions were used as inputs in addition to the training data to train the meta model which outputs the final predictions. A stepwise model selector was used to pick the best perfoming set of level0 and meta models. The list of models considered for both level0 and the meta model were: Linear Regression, Lasso, Huber Regression, Regression, Ridge Regression, Elastic Net, K-Nearest Neighbors, XGB Regressor, and LGBM Regressor.
1 Library Imports
Import libraries
import numpy as npimport pandas as pdpd.set_option('display.max_columns', 500)import seaborn as snsimport matplotlib.pyplot as pltfrom matplotlib.colors import LinearSegmentedColormapimport plotly.express as pximport warningswarnings.filterwarnings("ignore", category=DeprecationWarning)
Show Code
# Join the train and test data so feature engineering only has to be done once# The datasets are seperated before any scaling or predictive analysis to prevent leakagedf = pd.read_csv('raw data/W23P1_train.csv')df['train'] =1test_df = pd.read_csv('raw data/W23P1_test.csv')test_df['train'] =0df = pd.concat([df,test_df])print(f"Dataframe size: {df.shape}")df.head()
Dataframe size: (70000, 9)
uid
fare_amount
pickup_datetime
pickup_longitude
pickup_latitude
dropoff_longitude
dropoff_latitude
passenger_count
train
0
31722
9.0
2013-01-07 01:50:51 UTC
-73.991421
40.750160
-73.989490
40.726085
2
1
1
14674
14.0
2013-01-15 20:08:00 UTC
-73.997945
40.741057
-73.956223
40.767312
6
1
2
37571
19.5
2013-01-20 00:25:55 UTC
-73.999161
40.688531
-74.026611
40.616634
1
1
3
47583
6.0
2013-01-01 02:30:00 UTC
-73.991490
40.744257
-73.980912
40.748492
1
1
4
29473
33.5
2013-01-02 10:45:00 UTC
-73.972773
40.677702
-73.862242
40.768117
1
1
2 Feature Engineering
Feature engineering is the most important part of building a successful predictive model, as a model can only be as good as the information it has access to. Thus, the goal of exploratory data analysis is to provide insight into which features will be the most useful in predicting taxi fare. According to NYC government website, in 2013, taxi fare prices were determined using the following calculation:
An initial charge of $2.5
$0.30 improvement surcharge
$0.50 charge per 1/5 mile OR $0.50 per minute in slow traffic
$0.50 nighttime surcharge 8pm to 6am
$1.00 rush hour surcharge 4pm-8pm weekdays
Tolls for bridges and tunnels
Airport special rates for JFK and Newark
The initial charge of $2.50 and $0.30 charge are present on every trip meaning they have no effect on our feature selection. Most of the charge comes from the trip distance/time, which implies features that contain information about the trip distance/time will be the most useful. There are additional small charges for certain time periods, and travel through tunnels and bridges. Finally, there are significantly higher rates for trips to and from the nearby airports.
The trip distance can be extracted from the raw data using a function that translates earths latitude longitude coordinates into the straight-line distance between them. However, when driving between destinations it is unlikely the travel path is a straight line. Two possible solutions to this problem were developed. “Manhattan distance” is the sum of the x and y components of a vector and provides the total distance travelled between two points on a grid, much like the streets of New York. By also computing the angle between the trips origin and destination the model will have enough information to extract a more accurate distance estimate.
The next largest factor in trip fare price from the NYC government website was travel to and from an airport. The two airports found in the dataset were LGR and JFK. The latitude and longitude coordinates of each airport was extracted from google maps and used to create 5 engineered features: dropoff dist to JFK, pickup dist to JFK, dropoff dist to LGR, pickup dist to LGR, and airport flag which was 0 for no airport involved in the trip and 1 for airport involved in trip.
The google maps street API was used to calculate the travel distance and travel time between the origin and drop off. A final feature distance delta was created as the difference between the straight-line distance and API distance.
Finding the location with the most pickups may provide insight into predicting trip fare, as busy areas will have more traffic, which means longer trips, which means higher trip fare. The location with the most pickups can be found by running the DBSCAN clustering algorithm on the latitude and longitude coordinates of the trips. Unlike other clustering algorithms, the primary factor used to create clusters in DBSCAN is point density making it suitable for this application. Running the algorithm on the location data and coloring the data points based on the number of points in each cluster produces the plot Figure 1
The locations with the highest density of pickups are Pennsylvania Station, and Grand Central Terminal. This makes sense as both locations are large hubs for public transit.
3.2 Trip Time
Determining whether the different days of the week have a different average trip fare will help discern whether weekday will be a useful feature for the predictive model. It is likely people exhibit different behaviour on the weekends as they are not working. Additionally, from the NYC government website, there are charges the occur only on weekdays. By aggregating the data by weekday, the following plots are produced.
As indicated by the NYC government website, fare price is a linear function of the trip distance, which is reflected in Figure 4. Trips involving the airport are colored in orange and have a higher average fare price than normal trips. The horizontal lines at the top right of the plot indicate fixed airport trip fees. This confirms distance and airport related metrics will be helpful to the predictive model.
4 Prediction Models
4.1 XGBoost
4.1.1 Hyperparameter Tuning
A common approach to select the best model parameters is to create a grid containing all possible parameter configurations and evaluate each one, however, for a model like XGBoost that has over 7 parameters—each taking a wide range of values—this is not computationally feasible. Instead, we will use the Hyperopt library that applies the Tree of Parzens (TPE) Algorithm to check a subset of parameter configurations and estimate the best combination.
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe#Define the sample space of parameter combinationsspace={'max_depth': hp.quniform("max_depth", 3, 18, 1),'gamma': hp.uniform ('gamma', 0,50),'alpha' : hp.uniform('alpha', 0,50),'reg_lambda' : hp.uniform('reg_lambda', 0,50),'colsample_bytree' : hp.uniform('colsample_bytree', 0.4,1),'min_child_weight' : hp.choice('min_child_weight',[0,1,2,3,5,8]),'n_estimators': 3000,'eta': hp.choice('eta',[0.001,0.01,0.02,0.03]) }
Show Code
from sklearn.model_selection import KFoldkf = KFold(n_splits=5)def RMS_error(Y,Y_hat):return np.sqrt(np.mean(np.power(np.subtract(Y,Y_hat),2)))#Objective function that the TPE algorithm will attempt to minimize. #Returns the RMSE of a set of parameters calcualted by 5-fold cross validationdef objective(space): model=xgb.XGBRegressor( max_depth =int(space['max_depth']), gamma = space['gamma'], alpha = space['alpha'], reg_lambda = space['reg_lambda'], colsample_bytree=space['colsample_bytree'], min_child_weight=int(space['min_child_weight']), n_estimators =int(space['n_estimators']), eta = space['eta'], early_stopping_rounds =15, eval_metric='rmse' ) crossval_error = []for i, (train_idx, test_idx) inenumerate(kf.split(x_train)): fold_train_x = np.take(x_train, train_idx, axis=0) fold_train_y = np.take(y_train, train_idx, axis=0) fold_test_x = np.take(x_train, test_idx, axis=0) fold_test_y = np.take(y_train, test_idx, axis=0) val_idxs = np.arange(int(fold_train_x.shape[0]*0.1)) val_x = np.take(fold_train_x, val_idxs, axis=0) val_y = np.take(fold_train_y, val_idxs, axis=0) fold_train_x = np.delete(fold_train_x,val_idxs, axis=0) fold_train_y = np.delete(fold_train_y,val_idxs, axis=0) model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],verbose=0) pred = model.predict(fold_test_x) error = RMS_error(fold_test_y, pred) crossval_error.append(error) error = np.mean(crossval_error)print (f"max_depth: {int(space['max_depth'])} | gamma: {space['gamma']:.5} | alpha : {space['alpha']:.5} | reg_lambda : {space['reg_lambda']:.5} | colsample_bytree : {space['colsample_bytree']:.5} | min_child_weight : {int(space['min_child_weight'])} | eta : {space['eta']} | error: {error:.5}")return {'loss': error, 'status': STATUS_OK }
FEATURE_COLUMNS = ['trip_bearing', 'dropoff_dist_to_jfk','trip_distance', 'dropoff_dist_to_lgr','pickup_dist_to_jfk','pickup_dist_to_lgr','api_trip_duration','pickup_latitude','dist_delta','pickup_longitude', 'dropoff_latitude', 'dropoff_longitude','pickup_hour','airport','pickup_day_Monday', 'pickup_day_Saturday', 'pickup_day_Sunday','pickup_day_Thursday','pickup_day_Tuesday', 'pickup_day_Wednesday','pickup_day_Friday','pca-one', 'pca-two', 'pca-three','night_flag','rushhour_flag']TARGET_COLUMN = ['fare_amount']x_train = df[FEATURE_COLUMNS].to_numpy()y_train = df[TARGET_COLUMN].to_numpy()trials = Trials()# Hyperopt can take multiple hours to run, commented out to decrease render time# best_hyperparams = fmin(fn = objective, # space = space, # algo = tpe.suggest, # max_evals = 200,# trials = trials)
Hyperopt returned the following parameters to minimize the RMSE:
Uploading the submissions to kaggle, XGBoost scored an RMSE of 3.36 on the test dataset.
4.2 Ensemble Model
4.2.1 Prepare Model Candidates
The hyper parameters for each model (aside from XGB, and LightGBM) were selected by conducting and exhuastive grid search of all possible combiantions of hyper parameters.
Using 10 folds each model was fitted on 9 folds then used to predict the fare amount on the remaining fold. This was repeated with each fold being left out once, generating out-of-fold predictions for the entire train dataset. Doing this prevents data leakage, as otherwise the models would be predicting on datapoints they have already seen.
Show Code
def get_oof_predictions(x, y, models):# instantiate a KFold with 10 splits kfold = KFold(n_splits=10, shuffle=True, random_state=1)# prepare lists to hold the re-ordered x and y values data_x, data_y = [], []# run the following block for each of the 10 kfold splitsfor i, (train_idx, test_idx) inenumerate(kfold.split(x_train)):print(f"Working Fold: {i}") fold_train_x = np.take(x_train, train_idx, axis=0) fold_train_y = np.take(y_train, train_idx, axis=0) fold_test_x = np.take(x_train, test_idx, axis=0) fold_test_y = np.take(y_train, test_idx, axis=0) data_x.extend(fold_test_x) data_y.extend(fold_test_y)for item in models:print(f"Fold: {i} | Fitting model: {item}") model = models[item][0] if item =='XGBRegressor': val_idxs = np.arange(int(fold_train_x.shape[0]*0.1)) val_x = np.take(fold_train_x, val_idxs, axis=0) val_y = np.take(fold_train_y, val_idxs, axis=0) fold_train_x = np.delete(fold_train_x,val_idxs, axis=0) fold_train_y = np.delete(fold_train_y,val_idxs, axis=0) model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],verbose=0)print(f"Fold: {i} | Validation RMSE: {model.best_score}") predictions = model.predict(fold_test_x)elif item =='LGBMRegressor': val_idxs = np.arange(int(fold_train_x.shape[0]*0.1)) val_x = np.take(fold_train_x, val_idxs, axis=0) val_y = np.take(fold_train_y, val_idxs, axis=0) fold_train_x = np.delete(fold_train_x,val_idxs, axis=0) fold_train_y = np.delete(fold_train_y,val_idxs, axis=0) model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],callbacks=[log_evaluation(0),early_stopping(10,verbose=False)])print(f"Fold: {i} | Validation RMSE: {model.best_score_['valid_0']['rmse']}") predictions = model.predict(fold_test_x)else: model.fit(fold_train_x, fold_train_y) predictions = model.predict(fold_test_x) models[item][1].extend(predictions) return data_x, data_y, models
x_train = df[FEATURE_COLUMNS].to_numpy()y_train = df['fare_amount'].to_numpy()model_dict = {str(model).split('(')[0]: [model,[]] for model in chosen_models}data_x, data_y, trained_models = get_oof_predictions(x_train,y_train,model_dict)
Not all level 0 model may necessarily improve the performance of the final meta model, so they must be selected carefully. To do this, level0 models will be added to the meta model forward stepwise until performance of the meta model stops increasing, like a stepwise feature selector with level 0 models instead of features.
Show Code
from sklearn.model_selection import cross_validatedef stepwise_model_selector(X, y, meta_model, models_dict, model_label, verbose=False):print(f"Current Meta Model: {str(meta_model).split('(')[0]}",end="") included_models = []whileTrue: changed=False excluded_models =list(set(models_dict.keys())-set(included_models)) new_rmse = pd.Series(index=excluded_models) current_meta_x = np.array(X)iflen(included_models) >0:for included in included_models: included = np.array(models_dict[included][1]).reshape((len(models_dict[included][1]), 1)) current_meta_x = np.hstack((current_meta_x, included)) kf = KFold(n_splits=5) crossval_error = []for i, (train_idx, test_idx) inenumerate(kf.split(current_meta_x)): fold_train_x = np.take(current_meta_x, train_idx, axis=0) fold_train_y = np.take(y, train_idx, axis=0) fold_test_x = np.take(current_meta_x, test_idx, axis=0) fold_test_y = np.take(y, test_idx, axis=0)if (meta_model == xgb_model): val_idxs = np.arange(int(fold_train_x.shape[0]*0.1)) val_x = np.take(fold_train_x, val_idxs, axis=0) val_y = np.take(fold_train_y, val_idxs, axis=0) fold_train_x = np.delete(fold_train_x,val_idxs, axis=0) fold_train_y = np.delete(fold_train_y,val_idxs, axis=0) meta_model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],verbose=0) pred = meta_model.predict(fold_test_x) error = RMS_error(fold_test_y, pred) crossval_error.append(error)elif (meta_model == lgbm_model): val_idxs = np.arange(int(fold_train_x.shape[0]*0.1)) val_x = np.take(fold_train_x, val_idxs, axis=0) val_y = np.take(fold_train_y, val_idxs, axis=0) fold_train_x = np.delete(fold_train_x,val_idxs, axis=0) fold_train_y = np.delete(fold_train_y,val_idxs, axis=0) meta_model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],callbacks=[log_evaluation(0),early_stopping(10,verbose=False)]) pred = meta_model.predict(fold_test_x) error = RMS_error(fold_test_y, pred) crossval_error.append(error)else: meta_model.fit(fold_train_x, fold_train_y) pred = meta_model.predict(fold_test_x) error = RMS_error(fold_test_y, pred) crossval_error.append(error) starting_rmse = np.mean(crossval_error)*-1for excluded in excluded_models: new_yhat = np.array(models_dict[excluded][1]).reshape(-1, 1) meta_x = np.hstack((current_meta_x, new_yhat)) kf = KFold(n_splits=5) crossval_error = []for i, (train_idx, test_idx) inenumerate(kf.split(meta_x)): fold_train_x = np.take(meta_x, train_idx, axis=0) fold_train_y = np.take(y, train_idx, axis=0) fold_test_x = np.take(meta_x, test_idx, axis=0) fold_test_y = np.take(y, test_idx, axis=0)if (meta_model == xgb_model): val_idxs = np.arange(int(fold_train_x.shape[0]*0.1)) val_x = np.take(fold_train_x, val_idxs, axis=0) val_y = np.take(fold_train_y, val_idxs, axis=0) fold_train_x = np.delete(fold_train_x,val_idxs, axis=0) fold_train_y = np.delete(fold_train_y,val_idxs, axis=0) meta_model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],verbose=0) pred = meta_model.predict(fold_test_x) error = RMS_error(fold_test_y, pred) crossval_error.append(error)elif (meta_model == lgbm_model): val_idxs = np.arange(int(fold_train_x.shape[0]*0.1)) val_x = np.take(fold_train_x, val_idxs, axis=0) val_y = np.take(fold_train_y, val_idxs, axis=0) fold_train_x = np.delete(fold_train_x,val_idxs, axis=0) fold_train_y = np.delete(fold_train_y,val_idxs, axis=0) meta_model.fit(fold_train_x, fold_train_y, eval_set=[(val_x, val_y)],callbacks=[log_evaluation(0),early_stopping(10,verbose=False)]) pred = meta_model.predict(fold_test_x) error = RMS_error(fold_test_y, pred) crossval_error.append(error)else: meta_model.fit(fold_train_x, fold_train_y) pred = meta_model.predict(fold_test_x) error = RMS_error(fold_test_y, pred) crossval_error.append(error) rmse = np.mean(crossval_error)*-1 new_rmse[excluded] = rmse best_rmse = new_rmse.max() if best_rmse > starting_rmse: # if the best rmse is better than the initial rmse best_feature = new_rmse.idxmax() # define this as the new best feature included_models.append(str(best_feature)) # append this model name to the included list changed=True# flag that we changed itprint(f" | Adding {str(best_feature).split('(')[0]}",end="")else: changed =Falseifnot changed:breakprint("")return included_models, starting_rmse
meta_model_df = pd.DataFrame(columns=['meta_model','rmse','level0-models'])meta_errors = []level0s = []for model in chosen_models: level0_models = model_dict.copy() level0_models.pop(str(model).split('(')[0]) level0, error = stepwise_model_selector(data_x, data_y, model, level0_models, f"{str(model).split('(')[0]} Meta Regressor", verbose=False) meta_errors.append(error) level0s.append(level0)meta_model_df['meta_model'] = [str(model).split('(')[0] for model in chosen_models]meta_model_df['rmse'] = meta_errorsmeta_model_df['level0-models'] = level0sdisplay(meta_model_df.sort_values('rmse', ascending=False))
Current Meta Model: ElasticNet
| Adding XGBRegressor
| Adding LGBMRegressor
Current Meta Model: KNeighborsRegressor
| Adding XGBRegressor
| Adding Ridge
| Adding ElasticNet
Current Meta Model: XGBRegressor
Current Meta Model: LGBMRegressor
Current Meta Model: Lasso
| Adding XGBRegressor
| Adding LGBMRegressor
Current Meta Model: Ridge
| Adding XGBRegressor
| Adding Lasso
| Adding ElasticNet
| Adding LGBMRegressor
| Adding KNeighborsRegressor
meta_model
rmse
level0-models
0
ElasticNet
-3.465764
[XGBRegressor, LGBMRegressor]
4
Lasso
-3.473172
[XGBRegressor, LGBMRegressor]
2
XGBRegressor
-3.493721
[]
1
KNeighborsRegressor
-3.496053
[XGBRegressor, Ridge, ElasticNet]
5
Ridge
-3.497683
[XGBRegressor, Lasso, ElasticNet, LGBMRegresso...
3
LGBMRegressor
-3.505002
[]
4.2.4 Output Predictions
Using the best identified meta a level0 models from the stepwise model selector, predictions will be made on the test dataset.
Show Code
#Get models selected by stepwise model selectormeta_model_name = meta_model_df.sort_values('rmse', ascending=False).to_numpy()[0][0]best_models = meta_model_df.sort_values('rmse', ascending=False).to_numpy()[0][2]chosen_models_names = [str(model).split('(')[0] for model in chosen_models]best_models_idxs = np.where(np.isin(chosen_models_names,best_models))[0]meta_model = chosen_models[np.where(np.isin(chosen_models_names,meta_model_name))[0][0]]stepwise_models = []for idx in best_models_idxs: stepwise_models.append(chosen_models[idx])stepwise_models_names = [str(model).split('(')[0] for model in stepwise_models]
# Output the meta model predictionsdef create_level1_data(x_data, models): meta_x_data = x_datafor model in models: pred = model.predict(x_data) meta_x_data = np.column_stack((meta_x_data,pred))return meta_x_data# print("shape without meta features",np.array(data_x).shape)predictions = np.array([model_dict[model][1] for model in model_dict ifstr(model).split('(')[0] in stepwise_models_names]).Tmeta_x_train = np.column_stack((data_x,predictions))x_test = test_df[FEATURE_COLUMNS].to_numpy()meta_x_test = create_level1_data(x_test,stepwise_models)meta_model.fit(meta_x_train, data_y)ensb_pred = meta_model.predict(meta_x_test)submission_df = pd.DataFrame(columns=['uid','fare_amount'])submission_df['uid'] = test_df['uid']submission_df['fare_amount'] = ensb_pred# submission_df[['uid','fare_amount']].to_csv('submissions/Metamodel',index=False)print("Final Submission")display(submission_df)
Final Submission
uid
fare_amount
0
3
5.818966
1
10
4.357765
2
15
10.557566
3
16
16.233197
4
17
19.386255
...
...
...
34995
69986
14.424490
34996
69989
5.409180
34997
69993
18.343031
34998
69996
13.543978
34999
70000
7.939707
35000 rows × 2 columns
Uploading the submissions to kaggle, the Ensemble model scored an RMSE of 3.35 on the test dataset.