Modelling and exploring airline booking data

25 January 2019

From April to December 2018, Rosa Filgueira and I worked on the Dynamic Forecasting project as members of the Research Engineering Group of the Alan Turing Institute. In collaboration with British Airways, this project is examining how machine learning techniques can be used to improve dynamic forecasting using large-scale business datasets. During this period we worked alongside Radka Jersakova (Research Data Scientist), Evelina Gabasova (Senior Research Data Scientist), and James Geddes (Principal Research Data Scientist).

The goal of this particular project was not strictly defined; instead, it centred around two main aims:

  • Contributing to a deeper understanding of a British Airways dataset, by answering certain questions about the data it contains.
  • Creating a report that could serve as a starting guide for a new member of the project.

Our work was performed on a sample of the raw dataset. This sample (7.2 GB) contained information about flights between one particular pair of departure and arrival airports, for one particular period of time.
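A file of this size does not fit comfortably in memory on a typical workstation. As a minimal sketch (assuming a CSV-like dump; the real dump's format and column names differ), the data can be streamed in chunks, for example with pandas:

```python
import io
import pandas as pd

def count_rows(csv_source, chunksize=100_000):
    """Stream a (potentially multi-GB) CSV in chunks to bound memory use."""
    total = 0
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        total += len(chunk)
    return total
```

Calling `count_rows("sample_dump.csv")` (a hypothetical filename) would walk the whole dump without ever holding more than one chunk in memory.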

The work was carried out in Python 3, with Jupyter notebooks used to present the data-processing and data-wrangling pipeline.

For the first goal, the complex raw format of the dataset required a thorough initial data-wrangling step to bring the data into a form suitable for analysis and modelling. Moreover, because the sample was a data dump generated from a database with an unknown schema, a certain amount of work was needed to infer that schema from the available data. After pre-processing, we performed exploratory analysis to answer questions such as “Which part of the year has the most overbooked flights?” or “What effect do holidays have on the number of bookings for a flight?”. Finally, we used Facebook Prophet to construct a time-series model forecasting the number of bookings for a flight.

For the second part of the project, we built a report generator which ‘takes over’ part of the data scientists' work by automatically generating a large portion of the exploratory analysis. Data scientists only need to provide as input:

  • The entities extracted from the dataset and their relationships with each other
  • Normalised tables for each extracted entity.
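As a hedged illustration of that input (the entity names, columns, and rows below are invented, not taken from the real dataset), it can be pictured as a set of normalised tables plus a description of how the entities relate:

```python
# Hypothetical entities and their columns -- the real schema differs.
entities = {
    "flight": {"columns": ["flight_id", "departure_airport", "arrival_airport"]},
    "booking": {"columns": ["booking_id", "flight_id", "booking_date"]},
}

# One relationship: each booking references exactly one flight.
relationships = [
    {"from": "booking", "to": "flight", "on": "flight_id", "kind": "many-to-one"},
]

# Normalised tables, one per entity (tiny made-up rows).
tables = {
    "flight": [
        {"flight_id": 1, "departure_airport": "LHR", "arrival_airport": "JFK"},
    ],
    "booking": [
        {"booking_id": 10, "flight_id": 1, "booking_date": "2018-06-01"},
    ],
}
```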

The output of the report generator is a LaTeX document which plots information about each attribute of the given entities and summarises the relations between the different entities. We used PyLaTeX and latexmk for the automatic generation of this report.

Finally, we also wrote a best-practice document, in which we described the entire procedure that was followed, starting from pre-processing and finishing with the construction of the model.

With this best-practice document and the report that is automatically built by our generator, we aim to reduce the time required to understand this complex dataset.

Working on a project without a strictly defined goal was sometimes a challenge, as we had to be flexible enough to adjust to changes in its direction. Processing the sample data itself was another challenge, since detecting and extracting the schema (and its entities) were not trivial tasks.

Of course, facing such challenges is business as usual for a data scientist, so dealing with them provided us with valuable experience. Moreover, collaborating with the data experts of the Alan Turing Institute was a very stimulating experience in itself.

This work was part of the Alan Turing Institute-Scottish Enterprise Data Engineering Programme and was funded by Scottish Enterprise.