TTDATAVIS how to series: How to… merge and tidy datasets with Microsoft Excel

19 January 2015
In this how to we use Microsoft Excel to merge two different datasets. We mainly focus on the VLOOKUP function, but also remove duplicates.

When I was at university, I was working in the Office for Institutional Research. It was a data-heavy job, and I was lucky enough to have a boss that took the time to show me a thing or two about Excel. And one of those things was the VLOOKUP function.

It’s probably one of the most practical things I learned at university!

The VLOOKUP function is a straightforward way to merge data in Excel. I use it daily to combine two different dataset when a simply copy and paste just won’t do. The most common reason for that is different numbers of rows (entries) between the two datasets we’re working to combine. For example one dataset might be a list of people and their job title, the other might be a list of phone calls, with the same person dialled more than once. In that case we could use VLOOKUP to append the list of phone calls with the job title of everyone who is called.

So let’s get to it. In this case, we’re going to walk through a real life example. In 2013, thanks to a Freedom of Information request, the UK government published all of the flight information for the UK Department of International Development for 2011.

Less helpfully, they released it with a bunch of non-standard names of where the flights originated and ended up. What we want to do is to combine the information they have in that data release (i.e. on pounds spent per flight, on airline and on class) with information we already have on where airports are.

Luckily we don’t have to create that list of all airports in the world. A mostly updated list of global airports is available on Data Hub, a site that allows users to share data sets. You can follow along after downloading both these data sets.

Step 1: Get a list of unique departure and destination names using the ‘de-duplicate’ feature

There are thousands of flights by DFID staff in 2011, but only about 300 or so departure or destination locations. In order to merge our data, the first thing we need to find is that list of locations.

I’m just going to start by opening a new sheet in the workbook and copying both columns D and E (‘Departure Airport’ and ‘Destination Airport’) into a single column, one below the other, on the same sheet. It gives us about 17000 entries.

I’m going to select the column and then sort it from A-Z, just to confirm there are a lot of duplicates (this step is optional).

And then, also in the Data tab, I will select the ‘Remove Duplicates’ button, and then approve it to de-duplicate column A only. It notes that there are only 303 unique items after de-duplication.

Step 2: Open global airports data, paste the list of locations and create VLOOKUP fields

Open up the Global Airports data, create a new sheet and copy and paste our list of 303 airport names. It’s not necessary to work from this spreadsheet, but I have to keep things contained within single workbooks.

In cell B1, insert an ‘=’ followed by VLOOKUP. A helpful box that explains the formula might popup:

The VLOOKUP (or vertical lookup) function basically takes a value that you want to lookup, searches within a particular table ‘array’ (a table of data) and returns the value that is in whatever column of the array you want. Its companion function is HLOOKUP, which will search horizontally and look up a number in a given row.

In this case we will lookup the value A1 (which is DFID’s name for an airport) in the table array columns A (column A is our airport name, column B is the airport city, column C is the airport country and column D is the IATA code) through D in the other sheet in our workbook called Global Airports, and we want the three-letter IATA code, which is found in column 4. So our function looks like:

=VLOOKUP(A2,'global_airports.csv'!A:D,4,FALSE)

The ‘FALSE’ after the last comma indicates that we want to find an exact match, and not just a close match. I will double click the square in the bottom-righthand corner of the cell to fill down the entire series.

But I also want to check to see if it matches the city. The VLOOKUP function will only match a value with the left-most column in the table array. So to find the IATA code from the city, our function in cell A3 would be:

=VLOOKUP(A2,'global_airports.csv'!B:D,3,FALSE)

You’ll note that the array now starts from column B and not A, and only looks up the third column, not the fourth.

To fill these two formulae down the series, I select both of them and double-click the square in the bottom-righthand side of the right cell.

 

Step 3: Copy and paste the values

These VLOOKUP values are now dynamic, and if something in the worksheet changes, they might break. So I always select and copy these values and paste special, paste as values.

Step 4: Further tidying

This has got a number of the cells filled, but not all. In order to finish I take the following approaches:

  1. Duplicate the ‘name’ column (one will be associated with the airports dataset, one with the DFID dataset). Try editing names like ‘Baltimore MD’ to simply ‘Baltimore’ in one column and see if the VLOOKUP now finds it.
  2. For some airports we get a value of ‘0’ rather than ‘#NA’. That’s because the cell it has found is blank (rather than not having found the cell), so we can try inserting the IACO code instead.
  3. When in doubt, just Google ‘[place] airport’ and the three-digit code usually comes up on the home screen. Copy and paste that into a spreadsheet.

It’s also worth noting that, if you’re using the city as the lookup value, we need to be careful. Big cities may have more than one airport, and the lookup will pull whichever one it finds first.

Step 5: Repeat VLOOKUPs to add in latitude, longitude, city, country columns and merge back with DFID data

After I got all of the airport codes, I used those as the basis for a VLOOKUP to pull all latitude, longitude, city and country data into our spreadsheet as well.

Finally, I used the VLOOKUP code back in our DFID flights spreadsheet to lookup departure and destination information. I then copied, and pasted as values that information and deleted the formulae.

I have made this final version of this spreadsheet available on Google Sheets for download to check your work.