Analysing humanities data using Cray Urika-GX

11 October 2018

During the last six months, in our role as members of the Research Engineering Group of the Alan Turing Institute, we have been working with Melissa Terras of the University of Edinburgh's College of Arts, Humanities and Social Sciences (CAHSS) and Raquel Alegre of Research IT Services, University College London (UCL), to explore text analysis of humanities data. This work was funded by Scottish Enterprise as part of the Alan Turing Institute-Scottish Enterprise Data Engineering Programme.

The goal of our collaboration was to run text analysis codes developed by UCL on data used by CAHSS, to exercise the data access, transfer and analysis services of the Turing Institute's deployment of a Cray Urika-GX system. The Cray Urika-GX is a high-performance analytics cluster with a pre-integrated stack of popular analytics packages, including Apache Spark, Apache Hadoop and Jupyter notebooks, complemented by frameworks for developing data analytics applications in Python, Scala, R and Java.

We used two data sets of interest to CAHSS and hosted within the University of Edinburgh's DataStore: British Library newspapers data, ~1 TB of digitised newspapers from the 18th to the early 20th century; and British Library books data, ~224 GB of compressed digitised books from the 16th to the 19th centuries. Both data sets are collections of XML documents; however, they are organised differently and conform to different XML schemas, which affects how the data can be queried.
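
To give a flavour of what querying these collections involves, the sketch below extracts the text held in a single XML document. This is illustrative only: the element name and file name are hypothetical, since the newspapers and books collections each conform to their own schema.

```python
# Minimal sketch: pull the text out of one digitised-document XML file.
# The element name "p" and the file name are placeholders; the real
# newspapers and books schemas each use their own element names.
from lxml import etree

def extract_text(xml_path):
    """Return the text held in <p>-like elements of one XML document."""
    tree = etree.parse(xml_path)
    # Match on local names so the sketch works whatever namespace the
    # schema uses; a real query would match elements precisely.
    blocks = tree.xpath("//*[local-name() = 'p']")
    return " ".join("".join(block.itertext()) for block in blocks)

if __name__ == "__main__":
    print(extract_text("example_document.xml")[:200])
```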

To access both data sets from within Urika, we mounted their DataStore directories into our home directories on Urika using SSHFS, then copied the data into Urika's own Lustre file system. We did this because, unlike Urika's login nodes, Urika's compute nodes have no network access and so cannot reach the DataStore via the mount points. Copying the data to Lustre also minimised network transfer during analysis.
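
As a rough illustration, the staging looked something like the sketch below, run from a Urika login node. The host name and paths are placeholders, not the actual DataStore or Lustre locations.

```python
# Rough sketch of the data staging, run on a Urika login node.
# The host name and paths are placeholders, not the real locations.
import subprocess

DATASTORE = "user@datastore.example.ac.uk:/path/to/newspapers"
MOUNT_POINT = "/home/user/datastore-newspapers"   # SSHFS mount in a home directory
LUSTRE_DEST = "/mnt/lustre/user/newspapers"       # destination on Urika's Lustre file system

# Mount the DataStore directory over SSHFS (read-only is enough for analysis).
subprocess.run(["sshfs", "-o", "ro", DATASTORE, MOUNT_POINT], check=True)

# Copy the data onto Lustre so the compute nodes, which have no network
# access, can read it directly.
subprocess.run(["rsync", "-a", MOUNT_POINT + "/", LUSTRE_DEST], check=True)
```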

To exercise Urika's data analytics capabilities, we ran two text analysis codes, one for each collection, both originally developed by UCL with the British Library.

UCL's code for analysing the newspapers data is written in Python and runs queries via the Apache Spark framework. The code was originally designed to extract data held within a UCL deployment of the data management software iRODS and to run queries on a user's local machine or on UCL's high-performance computing (HPC) services. It makes use of a file, constructed using iRODS, that lists the locations of the XML data files. A range of queries are supported, e.g. count the number of articles per year, count the frequencies of a given list of words, or find expressions matching a pattern.
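
The sketch below gives a feel for this style of query. It is not the UCL code itself: the file names, the crude tokenisation and the XML handling are all assumptions, but the shape, a Spark job that reads a list of XML file locations and reduces per-file word counts, is the same.

```python
# Hedged sketch of a Spark word-frequency query over the newspapers data.
# File names, tokenisation and XML handling are illustrative only.
import re
from lxml import etree
from pyspark import SparkContext

WORDS = {"cholera", "tuberculosis"}   # example query terms

def count_words(xml_path):
    """Return (word, count) pairs for the query terms in one XML document."""
    text = " ".join(etree.parse(xml_path).getroot().itertext()).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [(word, tokens.count(word)) for word in WORDS]

sc = SparkContext(appName="word-frequencies")
# One XML file location per line, as in the file-listing approach above.
paths = sc.textFile("newspaper_files.txt")
counts = paths.flatMap(count_words).reduceByKey(lambda a, b: a + b)
print(counts.collect())
```

Each Spark executor parses its own share of the XML files, so the files must be visible to every compute node, which is exactly why the data sits on Lustre.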

UCL's code for analysing the books data is also written in Python and runs queries via mpi4py, a Python wrapper for the Message Passing Interface (MPI) for parallel programming, although work had been started on migrating some of these queries to Spark. This code was also originally designed to use UCL's data management and HPC services. A range of queries are supported, e.g. count the total number of pages across all books or count the frequencies of a given list of words. The code is complemented by a set of Jupyter notebooks to visualise query results and perform further analyses.
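
For comparison, here is a hedged sketch of an mpi4py-style query that counts the total number of pages across all books. Again, the file listing and the "page" element name are assumptions, not the actual schema or the UCL code.

```python
# Hedged sketch of an mpi4py query: count pages across all books.
# The file listing and the "page" element name are placeholders.
from lxml import etree
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

with open("book_files.txt") as listing:
    books = [line.strip() for line in listing if line.strip()]

# Each MPI rank takes a strided share of the book files.
local_pages = 0
for path in books[rank::size]:
    tree = etree.parse(path)
    local_pages += len(tree.xpath("//*[local-name() = 'page']"))

# Sum the per-rank counts onto rank 0, which reports the total.
total = comm.reduce(local_pages, op=MPI.SUM, root=0)
if rank == 0:
    print("Total pages:", total)
```

Such a script would be launched with a command along the lines of mpiexec -n 36 python count_pages.py, using all 36 cores of a single node.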

To run the codes within Urika, we needed to modify them both so that they could run without any dependence on iRODS or UCL's local environment and instead access data located within Lustre. As a result, the modified newspapers code now allows the location of XML documents to be specified using either URLs or absolute file paths, and the modified books code now runs its MPI-based queries via Urika's Apache Mesos resource manager.
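
The change to the newspapers code is conceptually small; the sketch below illustrates the idea (it is not the actual patch).

```python
# Illustrative sketch (not the actual patch): open an XML document given
# either a URL or an absolute file path.
from urllib.parse import urlparse
from urllib.request import urlopen
from lxml import etree

def open_document(location):
    """Parse an XML document from a URL or an absolute file path."""
    if urlparse(location).scheme in ("http", "https", "ftp"):
        with urlopen(location) as handle:
            return etree.parse(handle)
    return etree.parse(location)   # e.g. a file on Urika's Lustre file system
```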

For the books data, Melissa suggested we look at both “diseases” and “normaliser” queries to try to reproduce the results from her Jisc Research Data Spring 2015 project at UCL. “diseases” searches for occurrences of the names of thirteen diseases (e.g. “cholera”, “tuberculosis”) and returns the total number of occurrences of each name. “normaliser” builds a derived data set of counts of books, pages and words per year, which allows us to see how these change over time. Combining the results from both queries, we can examine the extent to which occurrences of the thirteen diseases are affected by increases in the number, and size, of books published over the measurement period.

Each query is run over the data for successive time periods (e.g. 1810-1819, 1820-1829 and so on), outputting a data file with the results for each period. We wrote code to combine these into a single data file holding the aggregated query results. These results are visualised in a Jupyter notebook from the Research Data Spring project, which we modified to extend the number of graphs presented; this notebook can be run using Urika's own Jupyter notebook server. We compared our results to the original results: they were generally consistent, but with some anomalies that we traced to data missing from the books data set held within DataStore. This has been reported back to Melissa.
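
As a hedged illustration of the combination step, the sketch below joins per-period results and normalises disease counts by the number of words published. The file layout and column names are assumptions, not the actual output format of the queries.

```python
# Hedged sketch: combine per-period query results and normalise disease
# counts by the number of words published. File layout and column names
# are assumptions, not the actual query output format.
import glob
import pandas as pd

# "diseases" results: one file per period with columns year, disease, count.
diseases = pd.concat(
    (pd.read_csv(f) for f in glob.glob("diseases_*.csv")), ignore_index=True)

# "normaliser" results: books, pages and words per year.
norm = pd.concat(
    (pd.read_csv(f) for f in glob.glob("normaliser_*.csv")), ignore_index=True)

# Occurrences per million words, so growth in publishing volume is factored out.
merged = diseases.merge(norm[["year", "words"]], on="year")
merged["per_million_words"] = 1e6 * merged["count"] / merged["words"]
merged.to_csv("diseases_normalised.csv", index=False)
```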

Using Spark, we can query data in parallel using up to the full complement of 432 cores (12 nodes) available on the Turing's deployment of Urika. If mpi4py is used, however, we can only use up to 36 cores (one node) in parallel, since we have not yet identified a way to distribute MPI codes across more than one Urika node. Urika is designed with the use of Spark in mind, and Spark is well suited to this form of text analysis, so migrating the mpi4py books queries to Spark would be a good area for future work, combining this with the newspapers code, which already uses Spark and can handle several XML schemas. This would then yield a single code, with a common underlying data model, that could run queries across both the newspapers and books data.

Our updated codes, plus documentation on how to run them, are publicly available on GitHub: newspapers code, books code (Spark version), books code and Jupyter notebook for books code.