Rich Pugh, co-founder and chief data scientist at Mango Solutions, looks back to find out where the term ‘data science’ originated, what it means today and how it can add value in a data-driven organisation.
Much of our time is spent talking to organisations looking to build a data science capability, or generally looking to use analytics to drive better decision making. As part of this, we’re often asked to present on a range of topics around data science. The two most popular topics are: ‘What is data science?’ and ‘What is a data scientist?’.
Where did the term ‘data science’ come from?
Professor Jeff Wu, the Coca-Cola Chair in Engineering Statistics at Georgia Institute of Technology, popularised the term ‘data science’ during a talk in 1997. Before this, the term ‘statistician’ was widely used instead. Professor Wu felt that the title ‘statistician’ no longer covered the array of work being done by statisticians, and that ‘data scientist’ better encapsulated the multifaceted role.
So, surely defining what a data scientist is and what they do should be a simple task – just bring up an image of Professor Wu, reference his 1997 lecture and ask for questions. However, the definition has evolved since then and, in fact, most data scientists we meet are unfamiliar with Professor Wu.
What does data science mean today?
What ‘data science’ meant originally and what it means today are two very different things. One early definition of a data scientist comes from Josh Wills, current director of data engineering at Slack. Back in 2012, he described a data scientist as follows: “Data Scientist (n): Person who is better at statistics than any software engineer and better at software engineering than any statistician.” This speaks more directly to the data scientist being a ‘merging’ of different skillsets – a mix of ‘statistician’ and ‘software engineer’.
Drew Conway took this concept further with a heavily used Data Science Venn Diagram depicting hacking skills, maths and stats knowledge, and substantive expertise as the primary colours of data.
It is clear that today data science has come to represent a lot more than Professor Wu’s original definition.
To us, data science is the proactive use of data and advanced analytics to drive better decision making.
The four key pillars:
It might be stating the obvious, but we can’t do data science without the data. What’s interesting is that data science is often associated with the extremes of Doug Laney’s famous ‘3 V’s’:
• Volume – the size of data to be analysed, driving data science’s ongoing association with the world of ‘big data’
• Variety – with algorithms focused on analysing a range of structured and unstructured data types (e.g. image, text, video) being developed faster perhaps than the business cases are understood
• Velocity – the speed at which new data is created and speed of decision therefore required, leading to stream analytics and increased usage of machine learning approaches
However, data science is equally applicable to small, rectangular, static datasets.
Generally, analytics can be thought of in four categories:
• Descriptive Analytics: the study of ‘what happened?’ This is largely concerned with the reporting of results and summaries via static or interactive reports (e.g. dashboards) and is more commonly referred to as ‘Business Intelligence’
• Diagnostic Analytics: a study of why something happened. This typically involves feature engineering, model development etc.
• Predictive Analytics: the modelling of what might happen under different circumstances. This is a mechanism for understanding possible outcomes and the certainty (or lack of) with which we can make predictions
• Prescriptive Analytics: the analysis of ‘optimum’ ways to behave in order to ‘minimise’ or ‘maximise’ a desired outcome
As we progress through these categories, the complexity increases, and hopefully the value that is added to the business as well. But this isn’t a list of steps – you could jump straight to predictive or prescriptive analytics without touching on either descriptive or diagnostic.
It’s important to note that data science is focused on advanced analytics. Using the above definitions, this means dealing with everything beyond descriptive analytics.
‘Proactive’ was included to distinguish data science from the more traditional ‘statistical analysis’. In my experience, when I started my career as a statistician in industry, an organisation’s analytic function seemed a largely ‘reactive’ practice. Modern data science needs to be an active part of the business function and look for ways to improve the business.
‘To drive better decision making’
The last part of the definition is possibly the most important part. If we ignore this, then there’s a danger of doing the expensive cool stuff and not actually adding any value. With organisations investing heavily in data science as an industry, we need to deliver – otherwise we may be in a situation where data science as a phrase becomes associated with high-cost initiatives that never truly add value.
We need to be very clear about something: we can use the best tech, leverage the most clever algorithms, and apply them to the cleanest data, but unless we change the way something is done then we’re not adding value. To move the needle with data science, we need to positively impact the way the business does something.
So, what is a data scientist?
Each part of our definition hints at a particular skill that’s needed:
• Data: ability to manipulate data across a number of dimensions (volume, variety, velocity)
• Advanced analytics: understanding of a range of analytic approaches
• Proactive: communication skills that allow us to interact with the business
• Decision making: the ability to turn analytic thinking (e.g. models) into production code so that it can be embedded in systems that deliver insight or action
If data science, as a proactive pursuit, is concerned with meeting a range of business challenges, then a data scientist must understand at least the possibilities of a wide range of analytic approaches.
So… we just need to hire unicorns?
It may sound from all this that you just need to hire people who understand every analytic technique, code in every language, and so on. The fact that unicorns don’t exist leads to a very important part of data science: data science is a team sport.
While we can’t hire people with all the skills required, we can hire data scientists with some of the required skills, and then create a team of complementary skillsets. This way we can create a team that, as a collective, contains all of the skills required for data science, and in doing that, create a solid foundation for driving data-driven digital transformation.