Citizen data science suffers from inflated expectations

The notion of data analytics being democratised for all of us is overblown

Gartner's Hype Cycle is a useful device for examining perceptions of technologies. The analyst firm's framework spans the gamut from "innovation trigger" (original idea) to "plateau of productivity" (works as advertised so get on with it), but the stages in between represent a sort of Pilgrim's Progress of hazards from the "peak of inflated expectations" via the "trough of disillusionment" ("slough of despond" in Bunyan's language) and "slope of enlightenment". If ever a sector was at the peak of inflated expectations it is surely citizen data science, an overhyped, over here example of the sort of thing that happens when the marketers' fantasies run ahead of the engineers' realities. [Google "hype cycle" and "citizen data science".] And it seems that the good people of Gartner agree.

Over the last year, I've spoken to dozens of people about the topic and everything I've heard suggests to me that the idea of data science as a job for the rest of us has been overcooked to ashes. This is not to say that data science is not an important and growing discipline, only that the idea that it is being very broadly democratised is mendacious and fallacious: it remains the domain of skilled individuals with knowledge of statistics, programming, data structures, business domain and more.

In September, speakers at the EARL (Enterprise Applications of the R Language) conference in London queued up to speak to the power of R, but how many people know R? The language is not listed in the latest top 10 of Tiobe's index of popular languages, and Tiobe CEO Paul Jansen recently told ZDNet:

"Almost every professional software engineer has some knowledge in Python but not in R. So, if you want to do some serious stuff in the statistical domain professional software engineers will use Python. R is limited to field experts in the domain of statistical engineering and that is a more restricted set of people."

The EARL speakers in the excellent Data For Good session I attended made compelling cases for the application of data science to tasks such as harnessing open source technologies for local communities (David Baker, Toynbee Hall), improving NHS processes (Mohammed Amin Mohammed), preventing human trafficking through the power of advanced analytics (Sandro Matos, Merkle Aquila) and improving life in Colombia (Amit Kohli, ACDI/VOCA). But these were all people with deep skills, not novices drafted in from departments and given basic training, the MO for the fantastic world of data science.

Nationwide is an example of the zeitgeist. The UK building society has 15 million customers but, as Lee Raybould, chief data officer, said at September's Qlik user conference, "a trusted brand can disappear very, very quickly" these days - ask Thomas Cook. The key to staying fresh and progressive will often lie in data analytics, but the company had a "culture of discrediting information unless it came from the team you happened to originate from … [and] we were data-rich but it was all in siloes". Raybould gained an outside-in view by visiting partners in India and Australia as well as key suppliers such as Apple and Cisco. But he admits that Nationwide had to invest seriously in its people to think about data-enabling its workforce. It emphatically did not attempt to make everyone a data scientist.

Even the biggest data science advocates are cautious. At EARL I spoke briefly to Rich Pugh, chief data scientist at Mango Solutions and a committee member in the data science section of the Royal Statistical Society.

"The idea that data science is for everybody isn't just wrong, it's dangerous," he said. "If you misuse data and don't understand data relationships you can make bad decisions that can have serious results."

Pugh says that data science is comparable to Big Data and AI in the sense that the term has been misused to the point of senselessness and hyped to the skies. "It's hugely important, hugely valuable but it's not some quick win or quick fix. It's a long-term thing and it needs investment," he said.

Mike Capone, CEO of pioneering data analytics company Qlik, agrees. The company is spearheading a campaign to improve data literacy, but he laughs at the idea that "'we're all data scientists now' … You can't just be one without some knowledge of Python, or R or data structures." It's more important, he says, that we develop the base layer of knowledge that will help more of us understand and respect data as consumers of information, even if we can't all be at the sharp end.

It's very likely true that companies that best exploit data will be well positioned to make smarter decisions, market their intelligence and foster healthy, innovative partnerships. But data science for all? It's for the birds.