Innovative Learning Week: doing a Kaggle Competition

Author: Marc Sabate
Posted: 3 Mar 2016 | 15:31

During the week of 15-19 February an Innovative Learning Week event was held at the University of Edinburgh, where students and staff are invited to explore teaching and learning in new and creative ways. EPCC organized a three-day event for its HPC and data science MSc students that consisted of participating in a Kaggle competition.

Kaggle is a platform built to be a home for a community of data scientists. It is an ideal site for companies that do not yet have in-house Machine Learning expertise, or that have an unsolved problem and plenty of data, to upload their data and their goals, and to offer a non-negligible prize to motivate talented data scientists around the world to work on the problem. The prize could be cash, or an interview with an employer such as Facebook or Airbnb. An overview of the process is shown schematically in the diagram below.

During Innovative Learning Week, three EPCC staff and seven Masters students tackled the Home Depot competition. Home Depot is an American retailer of home improvement and construction products and services whose website has a search engine that returns several results for each query. For this competition, Home Depot asked Kagglers to develop a model that would accurately predict the relevance of these search results. Search relevance is a proxy for how quickly customers reach the desired products: the quicker a customer reaches the right product, the more likely a purchase becomes. The outcome is a happy customer and a happy retailer – a win-win result. Of course, this can only be achieved if the search engine provides relevant results.

The data provided contained several thousand products and real customer search terms taken from Home Depot's website. To create a relevance score, which measures how relevant a product result is to a search, Home Depot manually rated a few thousand search/product pairs. For each search/product pair, the training data set contained the search text, the title of the product, the description of the product, and the manually assigned relevance score, ranging from 1 (not relevant) to 3 (highly relevant). Home Depot was interested in a model that would predict this relevance score for the search/product pairs that had not been manually rated. Additional tables provided further attributes of the Home Depot products, such as colour, material, brand and product dimensions.

This competition offered interesting challenges for the students, the most important being that we had real-world data in our hands to play with. The students formed teams of two or three people and used either R or Python (they were free to choose their tools) to explore the data, clean the text fields and perform the appropriate joins and aggregations, gaining the first insights into what the data could be used to model and predict. A rough sketch of this stage is shown below.
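As an illustration, here is a minimal sketch in Python of loading and joining the data with pandas. The file and column names are those distributed by Kaggle for this competition, but the cleaning pass is only an assumed, simplified example of the kind of text normalisation each team did:

```python
import pandas as pd

# Load the competition files (names as distributed by Kaggle).
train = pd.read_csv("train.csv", encoding="ISO-8859-1")
descriptions = pd.read_csv("product_descriptions.csv", encoding="ISO-8859-1")

# Attach the product description to each search/product pair.
train = train.merge(descriptions, how="left", on="product_uid")

# A very simple cleaning pass: lower-case and strip punctuation.
for col in ("search_term", "product_title", "product_description"):
    train[col] = (train[col]
                  .str.lower()
                  .str.replace(r"[^a-z0-9 ]", " ", regex=True))

print(train[["search_term", "product_title", "relevance"]].head())
```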

At this point, the real challenge was to create numeric, categorical and boolean predictive features from the raw text fields. We discussed which features we could extract, based on what we had learned from the data, and each team generated a couple of them. The final training set contained features related to the material, the colour and the brand, the percentage of words shared between the search term and the product title/description returned by the search engine, and the presence in both of certain relevant words identified earlier. The shared-word feature, for example, can be computed as sketched below.
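Continuing from the loading sketch above, this is a hedged sketch of one such feature (the fraction of search-term words that also appear in the product title or description), not the exact code any team used:

```python
def common_word_fraction(search_term: str, text: str) -> float:
    """Fraction of words in the search term that also occur in the text."""
    search_words = search_term.split()
    if not search_words:
        return 0.0
    text_words = set(text.split())
    matches = sum(1 for w in search_words if w in text_words)
    return matches / len(search_words)

# One numeric feature per search/product pair.
train["title_match"] = [
    common_word_fraction(s, t)
    for s, t in zip(train["search_term"], train["product_title"])
]
train["description_match"] = [
    common_word_fraction(s, d)
    for s, d in zip(train["search_term"], train["product_description"])
]
```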

Finally, since the value to predict was a numeric score between 1 and 3, we decided to train a Random Forest regressor on our generated training set.
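A minimal sketch of this step using scikit-learn, assuming the two example feature columns built above (the actual feature set and hyperparameters the teams used will have differed):

```python
from sklearn.ensemble import RandomForestRegressor

# Assumed feature columns from the feature-engineering sketch above.
features = ["title_match", "description_match"]
X = train[features]
y = train["relevance"]

# 100 trees is a reasonable default; the teams may have tuned this.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
```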

In order to evaluate the participants' models, Kaggle provides a test data set without the relevance scores, although Kaggle knows the true scores for the search/product pairs in this set. When a team submits predicted relevance scores for the records in the test set, Kaggle evaluates the model using a performance metric. In this competition, the evaluation metric was the Root Mean Squared Error (RMSE).
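For reference, the RMSE of a set of predictions can be computed as in this short sketch; Kaggle applies the same formula against the held-out true scores:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between true and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy example: three true scores against three predictions.
print(rmse([3.0, 2.0, 1.0], [2.5, 2.0, 1.5]))  # ~0.408
```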

After submitting our results we obtained a Root Mean Squared Error of 0.50296. Not bad for a first attempt. There was still plenty of room for improvement in our model, but we ran out of time. These three days were a fun experience in which the students were encouraged to apply their originality and the Machine Learning techniques they had learned to a real-world problem.