Data poisoning: how to prevent the threat of biased AI outcomes

Machine learning models are not immune to attack, and data poisoning is an emerging threat. But what is it, and how can organisations avoid it? Spiros Potamitis, Senior Data Scientist in the Global Technology Practice at SAS, explains how this threat arises and outlines prevention methods that can stop an attack in its tracks or prevent it from happening in the first place.


This is a contributed article by Spiros Potamitis, Senior Data Scientist, Global Technology Practice at SAS.

Organisations are increasingly implementing machine learning models to advance their AI systems. As is natural with any new technology, these systems face fresh threats to their function and integrity. One such threat is data poisoning.

Machine learning 101

Before we discuss data poisoning, it’s worth revisiting how machine learning models work. We train these models to make predictions by ‘feeding’ them historical data. In these data, we already know the outcome that we would like to predict in the future and the characteristics that drive it. These data ‘teach’ the model to learn from the past. The model can then use what it has learned to predict the future. As a rule of thumb, the more representative data are available to train the model, the more accurate and stable its predictions will be.
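
As a minimal sketch of this train-then-predict cycle, the Python below fits a straight line to a handful of historical points and uses it to forecast a new one. The data, the variable names and the choice of a simple least-squares line are all assumptions made for illustration; real models are usually far richer.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept on historical data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def predict(model, x):
    """Apply the learned line to a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical history: a driving characteristic (x) and the known outcome (y).
history_x = [1.0, 2.0, 3.0, 4.0, 5.0]
history_y = [3.0, 5.0, 7.0, 9.0, 11.0]

model = fit_line(history_x, history_y)   # the model 'learns' from the past
forecast = predict(model, 6.0)           # and is then asked about the future
```

Everything the model knows comes from `history_x` and `history_y`, which is why the quality of those data matters so much.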

AI systems that include machine learning models are normally developed by experienced data scientists. They thoroughly examine and explore the data, remove outliers and run several sanity and validation checks before, during and after the model development process. This means that, as far as possible, the data used for training genuinely reflect the outcomes that the developers want to achieve.
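
To make the idea of sanity checks and outlier removal concrete, here is a small sketch in Python. The plausible range, the sample readings and the use of a modified z-score based on the median absolute deviation are assumptions chosen for illustration, not a description of any particular team's process.

```python
from statistics import median

def validate_records(values, low, high):
    """Sanity check: drop missing values and anything outside [low, high]."""
    return [v for v in values if v is not None and low <= v <= high]

def remove_outliers(values, threshold=3.5):
    """Drop points whose modified z-score exceeds the threshold.

    The score uses the median absolute deviation (MAD), which is less
    distorted by extreme values than the mean and standard deviation.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]

# Hypothetical raw readings, including a missing value, a sentinel and a spike.
raw = [21.0, 22.5, None, 20.8, 23.1, 250.0, 22.0, -999.0, 21.7]

checked = validate_records(raw, low=-50.0, high=300.0)  # removes None and -999.0
clean = remove_outliers(checked)                        # removes the 250.0 spike
```

Checks like these are cheap to automate, but deciding what counts as a plausible range or a genuine outlier still depends on someone who understands the data.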

Feed with caution: Data poisoning explained

However, what happens when this training process is automated? This rarely happens during development, but there are many occasions when we want models to learn continuously from new operational data: ‘on the job’ learning. At that stage, it would not be difficult for someone to craft ‘misleading’ data that feed directly into AI systems and make them produce faulty predictions.

Consider, for example, the recommendation engines used by Amazon or Netflix. Think how easy it is to change the recommendations you receive by buying something for someone else. Now consider that it is possible to set up bot-based accounts to rate programmes or products millions of times. This will clearly change ratings and recommendations, and ‘poison’ the recommendation engine.
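
A toy illustration of the effect, in Python, is sketched below. It assumes a deliberately naive recommender that ranks items by average rating; the item names and numbers are hypothetical, and real recommendation engines are considerably more sophisticated.

```python
from statistics import mean

def rank_items(ratings):
    """Order item names from highest to lowest average rating."""
    return sorted(ratings, key=lambda item: mean(ratings[item]), reverse=True)

# Hypothetical genuine ratings from real customers.
genuine = {
    "programme_a": [5, 4, 5, 4, 5],
    "programme_b": [2, 1, 2, 3, 2],
}
before = rank_items(genuine)

# Bot accounts flood the weaker programme with five-star ratings.
poisoned = {item: list(scores) for item, scores in genuine.items()}
poisoned["programme_b"] += [5] * 1000
after = rank_items(poisoned)
```

Before the flood, `programme_a` ranks first. Afterwards the fake ratings swamp the five genuine ones and `programme_b` takes the top spot, even though no real customer changed their mind.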

This is known as data poisoning. It is particularly easy if the attackers suspect that they are dealing with a self-learning system, such as a recommendation engine. All they need to do is make their attack subtle enough to pass the automated data checks, which is usually not very hard.

The other issue with data poisoning is that it can be a long, slow process. Hackers can afford to take their time, changing the data by feeding in a few results at a time. Indeed, this is often more effective, because it is harder to detect than a massive influx of data at a single point in time, and significantly harder to undo.
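
The sketch below shows, under simplified assumptions, why a slow attack can slip past an automated check that a sudden one would trip. It imagines a hypothetical check that accepts a new batch only if its mean is within a tolerance of a baseline; the numbers and function names are illustrative only.

```python
def accepts(batch_mean, baseline, tolerance):
    """Automated check: is this batch close enough to the baseline?"""
    return abs(batch_mean - baseline) <= tolerance

def run_stream(batch_means, start, tolerance, rolling):
    """Feed batches through the check and report (final baseline, rejections).

    rolling=True  : the baseline moves to each accepted batch (self-learning).
    rolling=False : the baseline stays at a fixed, trusted reference value.
    """
    baseline, rejected = start, 0
    for batch_mean in batch_means:
        if accepts(batch_mean, baseline, tolerance):
            if rolling:
                baseline = batch_mean
        else:
            rejected += 1
    return baseline, rejected

slow_attack = list(range(101, 121))   # creeps from 101 to 120, one unit per batch
burst_attack = [120] * 20             # jumps straight to 120

slow_rolling = run_stream(slow_attack, start=100, tolerance=2, rolling=True)
burst_rolling = run_stream(burst_attack, start=100, tolerance=2, rolling=True)
slow_fixed = run_stream(slow_attack, start=100, tolerance=2, rolling=False)
```

With a rolling baseline, every batch of the slow attack is accepted and the baseline ends at 120, while every batch of the burst is rejected. Comparing against a fixed, trusted reference catches the slow attack as soon as it moves beyond the tolerance, which is one argument for keeping a known-good snapshot of the data.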

Four steps to preventing data poisoning

Fortunately, there are steps that organisations can take to prevent data poisoning. These include:

  1. Establish an end-to-end ModelOps process to monitor all aspects of model performance and data drift.
  2. For automatic re-training of models, establish a business workflow. This means that your model will have to go through a series of checks and validations by different people in the business before the updated version goes live.
  3. Hire experienced data scientists and analysts. There is a growing tendency to assume that everything technical can be handled by software engineers, especially given the shortage of qualified and experienced data scientists. However, this is not the case. We need experts who really understand AI systems and machine learning algorithms, and who know what to look for when we are dealing with threats like data poisoning.
  4. Use ‘open’ with caution. Open-source data are very appealing because they provide access to more data to enrich existing sources. In principle, this should make it easier to develop more accurate models. However, these data are just that: open. This makes them an easy target for fraudsters and hackers. The recent attack on PyPI, which flooded it with spam packages, shows just how simple this can be.
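
The first two steps can be sketched in a few lines of Python. The example below measures drift with the Population Stability Index (PSI), one common drift metric, and then applies a go-live gate that requires both low drift and sign-off from named reviewers. The bin edges, the 0.25 threshold (a commonly quoted rule of thumb), the role names and the function names are all assumptions made for illustration, not part of any specific ModelOps product.

```python
import math

def bin_shares(values, edges, floor=1e-6):
    """Share of values falling in each bin defined by sorted edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v >= e for e in edges)] += 1
    # A small floor avoids log(0) when a bin is empty.
    return [max(c / len(values), floor) for c in counts]

def psi(expected, actual, edges):
    """Population Stability Index between training data and new data."""
    e = bin_shares(expected, edges)
    a = bin_shares(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def may_go_live(drift, approvals, required_roles, max_drift=0.25):
    """Gate: drift must be low AND every required role must have signed off."""
    return drift <= max_drift and required_roles <= approvals

training = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
live_shifted = [8, 9, 10, 8, 9, 10, 9, 10, 7, 6]   # hypothetical drifted data
edges = [4, 7]

stable_drift = psi(training, training, edges)
shifted_drift = psi(training, live_shifted, edges)

required = {"data_scientist", "risk_owner"}        # hypothetical roles
ok = may_go_live(stable_drift, {"data_scientist", "risk_owner"}, required)
blocked_by_drift = may_go_live(shifted_drift, {"data_scientist", "risk_owner"}, required)
blocked_by_people = may_go_live(stable_drift, {"data_scientist"}, required)
```

The design choice here is that neither condition is sufficient alone: a retrained model with clean-looking metrics still waits for human approval, and approvals cannot override a clear drift signal.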

The significance of human supervision

It is imperative that, for their own safety and that of their customers, organisations adopt these four recommendations to safeguard against data poisoning.

Yet, there is one other key consideration that organisations must acknowledge in their fight against data poisoning: human intervention. Automation is certainly a big player in the future of tech. However, it will be nothing without the trained human eye to supervise the entire process. It’s perceptive humans who will spot the tell-tale signs of data poisoning and pass the knowledge on, beginning to teach machines how to spot it themselves.

Spiros Potamitis is a senior data scientist at SAS, specialising in the development and implementation of advanced analytics solutions across different industries. Having acquired an MSc in Computer Engineering and one in Information Management, Potamitis provides subject matter expertise in the areas of Forecasting, Machine Learning and AI.