Using OpenRefine to create new datasets

28 April 2019

One of the benefits of teaching a Carpentry course is that it can increase or deepen your understanding of a subject. A recent instance for me was in using OpenRefine, a tool that runs locally on your machine (you do not have to export your data to a third party service).

OpenRefine can help you:

• Explore and clean/transform your data. You can reconcile your data with other external data sources, i.e. enrich your data using external data

• Create a new dataset. It does not modify your original data and keeps provenance of all the steps. Depending on the capabilities of your local machine it can deal with data sets that are up to about 100k rows.

Watch the videos on the OpenRefine website for a good overview. If you want to know more, follow the Carpentry OpenRefine for Ecologists lesson. In this example, I am going to show how easy is to generate a new dataset from the EPCC website. Follow along after you have installed OpenRefine on your system.

 

While preparing for the OpenRefine lesson, I set myself the task of producing a table with the contents of the EPCC staff list. I am going to walk you through the steps required to extract this data. You can follow if you install OpenRefine on your system. You should use either Google Chrome or Firefox – it may not work with Microsoft Edge or Internet Explorer.

Once you have started OpenRefine go to the Web Address menu item and add the URL for EPCC's stafflist: http://www.epcc.ed.ac.uk/about/staff.

 

Click on the Next button to import the pertinent HTML. This will show you the HTML for the page. Now, the first person on the list alphabetically happens to be me, Mario Antonioletti. If you scroll down you will see that we can ignore the first 72 lines so – as shown in the image below – you can specify that you want the first 72 lines:

 

There is stuff we want to remove from the end but we will have to do that later. Now Click on  Create Project on the top right. At the top of the page there is a Show which specifies the number of Rows  shown. Click on the 50 to maximise the number or rows shown.

If you look carefully you will see that most of the pieces of information lie on a single row which is nice except the member of staff member and their position are on the same row. We do not want that (I had to go back to this stage to be able to do this but did not want regenerate all the screen shots so you may see something different here). If you look carefully, we could split that information on an HTML token, viz <span class="position">. At the top of the second column, if you click on the down arrow on Column 1 and choose Column 1>Edit cells>Split multi-valued cells..., you will get a popup – for the Separator put <span class="position"> instead of a comma. You will now see that names and job titles are on different lines.

Now there is a lot of HTML on this page we do not care for. In the first column labeled All, if you click on the down button and select Transform you will get a popup. Change the Expression from value to value.parseHtml().htmlText() (I got that expression from a quick Google. I am not that smart.) 

 

Click on OK  and apply it to Column 1. You should now see a list of names with a list of empty rows that we do not want, and a number of lines with View Profile , which we are also not interested in. Let's start by removing the blank lines. This time use the Column 1 menu item. At the top of the second column, labeled Column 1 , click on the down arrow and select Column 1>Facet>Customized Facet>Facet by blank (null or empty string). You will see a new facet pane on your right:

 

Click on the true hyperlink as shown by the arrow above and the view will change to all the blank lines. Now use All>Edit rows>Remove all matching rows to remove all the blank lines. You can now click on the top left cross of the facet window to remove that view. You should now see the data without any spaces.

To remove the View Profile text, select Column 1>Facet>Text Facet. You will get a new Column 1 facet window. Click on the count hyperlink in the facet window to order by the count (the number of times each bit of text is found). You will see this has View Profile at the top. Click on the hyperlink for the View Profile and you will see all the instances of View Profile, again select All>Edit Rows>Remove all matching rows and they will all disappear. You can now remove the facet window by clicking on the cross at the top left of the facet window. We are now one step closer to getting the data we want. 

 

There was some stuff at the bottom of the page that we wanted to remove. If you click on the last link on the top right to see the bottom of your data set and then click on 

 

Select All>Edit rows>Remove all matching rows to remove the bottom of the data that you do not want. Remove the facet window by clicking on its top left cross. You now have a column with all the data that you want. We will now convert it to a tabular format.

We will make an assumption: there are mainly four rows of information per person so if we do Column 1>Transpose>Transpose cells in rows to columns... and say 4. This almost gives us what we want.

 

If you look carefully you will see things begin to go awry with Dr Steven Carlysle-Davies who has no phone specified, which destroys the pattern. Now one of the neat things about OpenRefine is that you can go back to the previous state. If you click on the Undo/Redo tab at the top of the Facet/Filter pane you will see all the modifications you have made to your data. Scroll to the bottom and select the second last line (assuming you have not done any of your own modifications). This will take you back to the single column view of the data. Scroll down until you find Stephen Carlysle-Davies, you may have to click on the next button (in my data set he appears on row 69). If you hover your mouse over his title you will see an edit button. Click on that and change Applications Consultant in Data Science to Applications Consultant in Data Science,,,,Tel: then apply the change. Now do Column 1>Edit cells>Split multi-valued cells and split on ,,,,. Make sure that the cell has been split. Now reapply the Column 1>Transpose>Transpose cells in rows to columns... and say 4 again. You will see that the problem has been fixed for Steven Carlysle-Davies but it recurs later on with (at the time of writing) Johnny Hay. You will have to go back and reapply the same process. Other people such as Adam Carter volunteer additional bits of information, like their Skype ID. If you are a purist you can remove that piece of information by editing the cell or just leave as is.

Once you are happy with the data, click on Export on the top right of the OpenRefine browser window to export the dataset in a variety of different format including the popular Comma-separated value, which you can import into a number of different applications including Excel.

So, this is how to create a new dataset by more or less manually scraping a website, which is certainly better than cutting and pasting the values into a tabular format. OpenRefine is a powerful tool worth becoming familiar with.