Exploratory Analysis, Machine Learning, Large Datasets

Flight delay prediction

The main purpose of this study is to create a model to predict departure delay for flights in the US, where a delay is defined as 15-minute delay or more. Data is a subset of the flight's on-time performance data provided by the United States Bureau of Transportation Statistics. The data comprises data of flights departing from all major US airports in 2019.

The exploratory data analysis clearly pointed to non-linearities in correlation between the various features that made up the predictive model and a flight's delay. This non-linearity carried over into the viability of an adaptive machine learning algorithm that would keep to that narrative. And as expected, linear models like logistic regression, didn't perform as well, but Gradient-Boosted Trees did.

The fact that the data captured by the Department of Transportation from the carriers had substantially less cases of flight delays, led to complement the imbalanced dataset with synthetic minority class data through oversampling. That led to a weighted F-1 score of 0.81.

A calculated PageRank statistic to measure airport importance based on their connections proved to be a very important feature in the models. Also, as expected, a feature that accounts for previous aircraft delays (based on the tail number) is a good predictor for delays.

Overall, this study showed that predicting flight delays in a more comprehensive way is indeed possible. What would make up for better predictive power of-course, will depend on additional methods of implementation. The underbalanced aspects of the data could be supplanted with better methods of synthetic data generation (e.g., SMOTE); time-series forecasting methods (e.g., ARIMA or RNNs) could provide for complementing features. Surely those and others would make for a better calibrated approach on future model extensions of the work.

- Tools: Spark, MLlib, PySpark.
- Related documents and code: HERE