All you need is one data scientist with a PhD!
That was the closing thought summarising the ‘data capsule’ afternoon at the Cartagena Data Festival. When it was said in the closing plenary, I had something of a ‘violent’ reaction to the statement. Ok, maybe it wasn’t physically violent, but it certainly was visceral.
Why such a reaction? Working with and visualising data isn’t just a science, it’s also an art.
What’s more, it requires many different skills to get right, and prioritising just one or two of these skills over the others is disingenuous at best. And talking about a ‘data scientist’ puts the focus on individuals, when what we need to be talking about is building a comprehensive set of skills – whether they lie in one person or across teams – including research, technology, design and communication.
The importance of data visualisation for interpreting and analysing data
The data festival spent a lot of time highlighting the gaps we have in the available data, especially development-related data. Indeed, while we were there, ODI launched a report suggesting that global estimates of the number of people living in poverty may be off by as much as a quarter because more than 350 million people aren’t included in governmental household surveys. That’s more than the entire population of the United States! This is an important point, but it is only part of the picture.
Another big part of the discussion was the need for a relentless focus on making data useable. There were the expected discussions about strengthening capacity to search out and interpret existing data, coupled with discussions of ‘open data’ and making more data more widely available. And Civicus was there arguing that there needed to be more emphasis on citizen engagement in data collection and use as part of shifting inherent power imbalances.
Visualising data cuts straight to this point. In order for us to make greater use of the existing data – to democratise data – we need to be better at visualising the information they contain.
Visualising allows us to get beyond raw numbers and (misleading?) statistics to interpretation and understanding. This allows not only for better communication of key messages emerging from the data, but also for a wider audience to explore and engage with the data themselves.
What is data ‘science’ anyway?
During the data capsule session, ‘data scientists’ were given silver badges and asked to divide themselves across the groups. When asked, I didn’t volunteer as a ‘data scientist’. To be honest, I wasn’t (and I’m still not) very sure what the term even means – don’t most ‘scientists’ (i.e. those who pursue knowledge) work with, interpret and analyse data (i.e. the ‘givens’, the raw material that constitutes knowledge in the way atoms constitute matter)?
Given who volunteered for that title at the event, my best guess is that ‘data science’ focuses on two skills: statistics and coding. As for me, I have basic stats knowledge, but I would need to re-read the textbook to be able to calculate chi-squared values. As for computer programming, I’m on slightly firmer ground but don’t spend my days immersed in Python (which the same closing-plenary statement later implied was a requirement for the session). All of that is to say that I couldn’t make a visualisation, right?
Wrong! Working together, we created an animated heatmap of homicides in Cartagena by time of day in 2014.
But perhaps that’s because I see the required skill set as slightly different.
Four data visualisation skills
I’ve already mentioned two of the skills needed for data visualisation: statistics and coding. But I’ve also said it’s both an art and a science. So what other skills are needed?
As part of the data visualisation trainings that I do, I usually frame discussions around four different skill groups: research, technology, communication and design. Perhaps this springs from how think tanks are often organised, but let me explain in more detail what each of those areas covers:
- Research: This heading is broader than the others. What I mean by ‘research’ is everything from strong data literacy skills to a strong understanding of the context from which the data spring. In terms of data literacy, this means being able to merge and tidy datasets, as well as knowing what sorts of statistical analyses are appropriate to run on the data at hand. It’s about knowing the difference between unordered (nominal) categorical variables and ordinal variables, for instance. And in terms of context, it’s about having knowledge about the area of study – whether it be social or cultural context, the political environment (as is more typical in development studies) or the physical, biological or chemical processes at play. During the data capsule, it was suggested that this could be achieved by working together in interdisciplinary teams, where a ‘data scientist’ worked with a ‘social scientist’ – or, at the event, we also had local Cartagenans at hand to help us interpret maps of the local area.
- Technology: There are a few distinct elements of technology that might be important when it comes to the process of data visualisation. Firstly, there might be new technologies involved in the collection of data – for example, knowing how to use programmes like Import.io to scrape data from websites, or using new GPS hardware to map the location of things like schools. As for data processing, technology often plays out in terms of knowing how to use software and languages to clean and process data. That might mean everything from programming in R or SPSS, to knowing how to use functions or PivotTables in Excel. And as for creating visualisations? It might mean everything from knowing how to use JavaScript libraries like D3 and Highcharts to create interactive charts, to HTML5 to present the visuals on the web, to backend knowledge about how to query MySQL databases.
- Design: The impact of data visualisations often comes down to the strength of their visual design and user experience design. In terms of visual design, it might be about knowing the appropriate types of visuals for the data or about understanding chart design fundamentals. But it’s also about balance and flow. And it’s about appropriate use of colour, typography and other visual cues… and more! And as for user experience, there are a number of elements to get right, from navigation to information structuring.
- Communication: At one of the training events at the Cartagena Data Festival, we talked about different types of data visualisations: ones designed with a clear message in mind, and others that support users to explore data in more detail and draw their own conclusions. Both of these approaches require clear communication. For the former, it’s about finding and refining messages that are appropriate to target audiences and ensuring that the visuals support them. This is an extremely difficult task and requires practice, especially when there are so many possibilities when it comes to visualising data. The number of visualisations I’ve seen that lack either a clear message or a clear purpose is staggering. As for the latter approach, understanding how to layer information and point users in the ‘right’ direction is also an important communication element. This can also take the shape of developing appropriate user controls and labels.
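To make the ‘research’ point about variable types a little more concrete, here is a minimal sketch in Python (standard library only). The survey answers, the `LEVELS` ordering and the helper names `mode` and `ordinal_median` are all invented for illustration; they are not from the festival or any real dataset. The idea is simply that a ‘most common’ answer makes sense for any categories, while a ‘middle’ answer only makes sense when the categories have an order.

```python
from collections import Counter

# Hypothetical survey answers, invented purely for illustration.
regions = ["north", "south", "south", "east", "south"]      # unordered categories
satisfaction = ["low", "high", "medium", "medium", "high"]  # ordered categories

# Assumed ordering of the ordinal scale: low < medium < high.
LEVELS = ["low", "medium", "high"]


def mode(values):
    """Most frequent value; meaningful for unordered and ordered categories alike."""
    return Counter(values).most_common(1)[0][0]


def ordinal_median(values, levels):
    """Middle answer once values are sorted by their rank in `levels`.

    Only meaningful when the categories have an order. For an even number
    of answers this returns the lower of the two middle values.
    """
    ranked = sorted(values, key=levels.index)
    return ranked[(len(ranked) - 1) // 2]


most_common_region = mode(regions)
typical_satisfaction = ordinal_median(satisfaction, LEVELS)
```

Asking for the ‘median region’ would be meaningless here, because nothing tells us whether ‘north’ comes before or after ‘south’. That is the kind of judgement the research skill set covers, whatever tool ends up doing the calculation.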
The role of expert knowledge
If that sounds like a long list of skills, that’s because it is. And we need to be honest: it’s very rare to find such a diverse skill set in a single person. We also have to recognise that each of these skills is important when it comes to effective data visualisation, and not prioritise one area over another.
Rather, let’s take a different starting point. Let’s assume that individuals have strengths in different areas and find ways to support the areas where they have less capacity. This can be done by putting together interdisciplinary teams, to be sure.
But let’s also recognise that technology has evolved quickly in this area. Just as one no longer needs to know how to code in HTML to create a website because of tools like WordPress or Squarespace, there are tools that make it possible to visualise data without knowing how to code or being an expert designer (though having expert knowledge certainly will help to refine visuals). It might not work with a dataset of 7.5 million survey responses, but that’s not a typical scenario that most individuals and think tanks find themselves in.
If anything, the problem is that there are so many of these tools that it’s impossible to know where to start. That’s why On Think Tanks, as part of the TTDATAVIS competition, have reviewed many of the tools to see where they do well and where they fall down. We will continue to do so, and also invite other external reviews from users.