On 16 November, the Royal Society, in partnership with the Royal Statistical Society, hosted senior professionals from the business, research and policy communities to consider the skills that the UK needs to make the most of the opportunities in big data.
The event highlighted a variety of skills required across the big data supply chain and focussed specifically on the skills gap in data analytics and visualisation in the UK. Speakers and participants discussed a number of barriers companies have encountered in recruiting data scientists, and suggested potential solutions to help build the skills base.
One session looked at the issue from a different angle: what if data science could be automated? Here we provide highlights from this panel session.
What do data scientists do?
Alex Harvey, Head of Research at Ocado Technology and chair of the session, explained that the company employs data scientists and machine learning experts in many of its activities, from optimising warehouse design to maintenance.
Anat Elhalal, Lead Technologist for machine learning and AI at the Digital Catapult, further explained that the work of a data scientist involves many different tasks: analysing the business context, data collection, data ‘wrangling’ (the process of preparing the data so that it is amenable to analysis), the data analysis itself (potentially using machine learning algorithms), visualisation and, finally, applying the findings back in the business context.
How could machine learning help?
Anat explained that preparing data for analysis can take up a lot of time, and that the processes involved include a number of tasks that could potentially be automated. For example, one time-consuming aspect of data wrangling is combining datasets from different data streams and harmonising the way the data is labelled (the so-called metadata). Using an algorithm for this would save time and allow data scientists to focus on analysis, interpretation and application of the findings. It might also help large organisations, some of which reported during the event that they struggle with legacy data and with consolidating all their databases into one ‘data lake’.
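To make the labelling problem concrete, here is a minimal Python sketch of harmonising labels by hand before combining two data streams. It was not presented at the event: the alias table, the field names and the sample records are all hypothetical, and the hand-written table stands in for exactly the kind of mapping an algorithm might one day infer automatically.

```python
# Hypothetical alias table mapping each source's label to one canonical label.
# In practice this mapping is what takes time to build by hand.
CANONICAL_LABELS = {
    "cust_id": "customer_id",
    "CustomerID": "customer_id",
    "amt": "amount",
    "Amount (GBP)": "amount",
}


def harmonise(record):
    """Return a copy of the record with known labels renamed to canonical ones."""
    return {CANONICAL_LABELS.get(key, key): value for key, value in record.items()}


def combine(*streams):
    """Harmonise every record from every stream and return one combined list."""
    return [harmonise(record) for stream in streams for record in stream]


# Two illustrative data streams that describe the same fields differently.
warehouse = [{"cust_id": 1, "amt": 12.5}]
web_orders = [{"CustomerID": 2, "Amount (GBP)": 8.0}]

combined = combine(warehouse, web_orders)
print(combined)
```

Labels that are not in the table pass through unchanged, so anything the mapping does not recognise remains visible for a person to review.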
A further challenge in data ‘wrangling’ is dealing with missing values in datasets. James Geddes, a research software engineer at the Alan Turing Institute, explained that data scientists use many different ad hoc rules to tackle missing values: sometimes they leave the missing values out, and sometimes they replace them with the average of a column. Together with Christian Steinruecken, a researcher in the Computational and Biological Learning group at the University of Cambridge, James explained that understanding the process by which the data was obtained could help create a model to estimate the missing values. This is a difficult problem to solve, and the panel agreed it was an area where machine learning could really help.
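The two ad hoc rules James mentioned can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing entries are represented as None, and the function names and sample column are hypothetical. It does not attempt the harder, model-based approach the panel discussed.

```python
from statistics import mean


def drop_missing(values):
    """Rule 1: leave out the missing entries (represented here as None)."""
    return [value for value in values if value is not None]


def fill_with_mean(values):
    """Rule 2: replace each missing entry with the mean of the observed ones."""
    observed = drop_missing(values)
    if not observed:
        raise ValueError("cannot compute a mean: every value is missing")
    column_mean = mean(observed)
    return [column_mean if value is None else value for value in values]


# Hypothetical column with one missing reading.
readings = [2.0, None, 4.0, 6.0]

dropped = drop_missing(readings)
filled = fill_with_mean(readings)
print(dropped)
print(filled)
```

Neither rule uses any knowledge of why a value is missing, which is why the panel saw room for models that take the data collection process into account.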
Christian introduced the Automatic Statistician, a research project he coordinates. It uses machine learning to provide an explanation of what data means by automatically generating a report with figures and text readable by humans. Such developments could help make machine learning and data science more accessible, and therefore enable more people to make use of it.
How might machines and data scientists work together?
Beyond the automation of human tasks, Christian argued that there was a role for machine learning to augment human capabilities, and to solve tasks that exceed human capacity. He explained that such technology can be used in plenty of ways, for example to optimise energy grids.
However, one issue is interpretability. Anat noted that there may be cases where a program might be able to carry out a function without the person who designed the program knowing exactly how it worked. Further developing systems to explain how machine learning works would help ensure that the output of any automatic data scientist would be as helpful as possible to users, for example by explaining which features of the data were most important, or how a decision was reached.
This echoed what others said during the day about the importance of the visual interface between machine and users, and the need to make the outputs of data science, whether produced by a human or a machine, meaningful and applicable to business metrics. In addition, James suggested that, to make the most of the technology, users themselves ought to be trained in the general skills of forming hypotheses and exploring data, and in allowing themselves to be confused and to think an issue through.
In any case, all panellists agreed that a key question underpinning the development of the technology should be how machines can be most useful to human analysts. As the automation of data science is still very much at an early stage, there is a unique opportunity for human coders to make their digital collaborators as useful as possible.
The Royal Society is undertaking a project about machine learning and its impact on the UK economy and society. To find out more, please visit our project page.