Welcome to the world of adversarial machine learning

From startups such as Darktrace, Cylance, and ZoneFox to established giants like FireEye and IBM, there are few companies in the security space today that don’t claim to use either Machine Learning or Artificial Intelligence in some way or another.

And there’s good reason. Once you brush aside the “me too” marketing hype – of which there is no shortage – Machine Learning has the potential to help automate processes, reduce the number of false positives, and generally make life easier for the overworked and often beleaguered security professional.

But as interest in and use of Machine Learning for security purposes increases, so too will awareness among hackers and cyber criminals, which inevitably leads to attempts to counter these technologies any way they can. And for companies looking to deploy their own Machine Learning-based systems for security use, this could lead to problems if they’re not careful.

“I see people taking machine learning techniques that we have been using in image processing and language processing and transferring them directly to the malware or the security domain,” says Professor Giovanni Vigna, CTO and co-founder of security startup Lastline. “And that doesn't work for a number of reasons.”

Vigna co-founded the California-based Lastline in 2011 to focus on offering breach detection and sandboxing technologies. Vigna himself is a Professor in the Department of Computer Science at the University of California, Santa Barbara, and part of the Shellphish group, which won third place at the DARPA Cyber Grand Challenge last year.

“Recognising images or language processing – in those domains, Machine Learning is operating on data that is not actively polluted or actively resisted by an adversary. This is different from only recognising cats. The pictures are not fighting you.”


Adversarial Machine Learning

We’re yet to enter the realm of hackers and cyber criminals deploying super advanced AI to hack our systems. The main offensive capability they’re currently using seems to be deploying chatbots to harvest data.

“I would say those are niche types of activity, because for them the goal is to bypass or to extract as we learn.”

While there’s little evidence of widespread use of AI for actively malicious purposes, Vigna is becoming increasingly concerned by hackers and criminals actively trying to mess with the training data of Machine Learning models in order to craft stealthier malware that avoids detection. Vigna labels this “adversarial machine learning”.

“I'm advocating using machine learning right. Adversarial machine learning is different from just machine learning, and you have to take into account several possibilities; that, for example, the attackers pollute the data that you're learning from; or that an adversary can actually steal your models and what you have learned so far and use it internally to craft something that you will recognise as something benign instead of malicious.”

Because Machine Learning algorithms need massive amounts of data to work, it can be difficult to weed out attempts to pollute learning models with false information. Vigna has seen malware that links back to Facebook simply to try to trick you or your systems into blacklisting the site, annoying users. While this is a fairly innocuous example, it shows that criminals are starting to see how they can influence such models.

“Because Machine Learning is based on statistical models - you have a sea of data and you extract statistical probability from this data - it's particularly prone to be poked,” he says.

While it might be unfeasible to ensure everything is perfect, it’s important to try to filter out any erroneous data polluting your models. In the example of malware linking back to Facebook, website analysis can help: how old is the site, is it reputable, and is it likely to have been compromised?
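The kind of website analysis described above can be sketched as a simple pre-training filter. The field names, thresholds, and scoring logic below are hypothetical illustrations, not Lastline’s actual method:

```python
from datetime import datetime, timezone

# Hypothetical heuristic: before a contacted domain is labelled malicious
# in a training set, sanity-check signals that suggest poisoning bait.
def looks_like_poisoning(domain_record):
    """Return True if a 'malicious' label on this domain is suspect."""
    age_days = (datetime.now(timezone.utc) - domain_record["registered"]).days
    # Long-established, highly ranked sites (e.g. facebook.com) are far
    # more likely to be poisoning bait than genuine malware hosts.
    return age_days > 3650 and domain_record["traffic_rank"] < 1000

sample = {
    "domain": "facebook.com",
    "registered": datetime(1997, 3, 29, tzinfo=timezone.utc),
    "traffic_rank": 3,
}
print(looks_like_poisoning(sample))  # True: exclude from blacklist training
```

A real pipeline would combine many such signals; the point is that suspect labels are quarantined before training, not after.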

“Casting the demons out of the data set is very important, because if you learn the wrong thing then you will do it wrong.”


Using your own Machine Learning models against you

Could hackers soon be using the very models designed to stop them against their creators? Quite possibly. Researchers last year revealed that it was possible to reverse engineer learning models from Machine Learning-as-a-Service offerings and APIs from the likes of BigML, Amazon, Google, and Microsoft almost perfectly, by making repeated queries and analysing how the models arrive at their conclusions. If these services were used in the creation of spam, fraud, or malware detection, criminals could easily avoid being found by testing their samples against these reverse-engineered models.

“That was an important realisation; that it's very difficult to do Machine Learning and keep your model absolutely secret,” warns Vigna.

Keeping a learning model secret is especially difficult if it publicly discloses probabilities. If an image recognition service simply states “I think this is a cat” [or, for watchers of Silicon Valley, a hotdog], there’s a higher chance of keeping the model’s inner workings hidden. However, if a service reports an image as, for example, “75% cat, 10% dog, 10% tiger, 5% flying slug”, it’s much easier to extract the model’s methodology.
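To see why full confidence scores leak so much, consider a toy version of the equation-solving attack the researchers described, applied here to a simple logistic regression (the hidden model and all numbers are invented for illustration). Because the model’s logit is linear in the input, a handful of queries recovers the weights exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
w_secret, b_secret = rng.normal(size=d), 0.5  # the provider's hidden model

def api(x):
    """Black-box API that helpfully returns P(class=1) as a confidence score."""
    return 1.0 / (1.0 + np.exp(-(w_secret @ x + b_secret)))

# logit(p) = w.x + b is linear in x, so d+1 probe queries pin down w and b.
X = np.vstack([np.zeros(d), np.eye(d)])                    # d+1 probe inputs
logits = np.array([np.log(p / (1 - p)) for p in (api(x) for x in X)])
b_stolen = logits[0]              # query at the origin yields b directly
w_stolen = logits[1:] - b_stolen  # unit-vector queries yield each weight

print(np.allclose(w_stolen, w_secret), np.isclose(b_stolen, b_secret))
# prints: True True
```

Had the API returned only the top label (“cat”), the attacker would see a single bit per query and need vastly more probes to map the decision boundary.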

“Eventually I think attackers will do exactly this: start asking specific requests and little by little they will leak out the model, keep a copy to themselves, so without having to submit a sample and disclosing their maliciousness, they can craft something that can bypass the tests, and then spring into action.”


Communication between Security and Data Science

To help protect against this, says Professor Vigna, there has to be more communication between data scientists and security professionals.

“It has to be a back and forth where the data scientist is the one that understands which is the right weapon [i.e. ML model] to kill the dragon that you [as the security professional] are describing, but you have to be very good at describing the dragon because the analysts don't know anything about malware.”

When asked whether this kind of communication and collaboration between security and data scientists could prove difficult, Vigna thinks not; Wall Street has proven with quants that those with talent in complex mathematics can be brought over into new industries, and security should be no different.

“That type of cross pollination can happen.”

But ultimately, companies shouldn’t rely on Machine Learning alone for their security needs, let alone just one ML algorithm.

“Your deployment of machine learning cannot be just Machine Learning, it has to be a lot of different algorithms together to cover each other's blind spots. But also you cannot forfeit traditional approaches because being a statistical approach, Machine Learning is really having a bunch of data and extracting something that predicts the best possible fit with the data. But it will never be perfect.”
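Vigna’s layered approach – several algorithms covering each other’s blind spots alongside traditional techniques – can be sketched as a simple voting ensemble. Every detector, threshold, and field below is a made-up stand-in, not a real product’s logic:

```python
# Hypothetical signature database (traditional, non-statistical approach).
KNOWN_BAD_HASHES = {"e3b0c44298fc1c14"}

def signature_check(sample):
    return sample["sha_prefix"] in KNOWN_BAD_HASHES

def entropy_model(sample):          # stand-in for one statistical detector
    return sample["entropy"] > 7.2  # packed/encrypted payloads score high

def behaviour_model(sample):        # stand-in for a second, independent detector
    return sample["writes_to_system_dir"] and sample["spawns_shell"]

def is_malicious(sample):
    votes = [signature_check(sample), entropy_model(sample), behaviour_model(sample)]
    return sum(votes) >= 2          # majority vote covers any single blind spot

sample = {"sha_prefix": "aabbccdd", "entropy": 7.6,
          "writes_to_system_dir": True, "spawns_shell": True}
print(is_malicious(sample))  # True: two detectors agree despite no signature match
```

A never-before-seen sample with no signature is still caught here because the other detectors vote independently; equally, a sample crafted to fool one statistical model must fool the others and the signature check too.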




Dan Swinhoe

Dan is Senior Staff Writer at IDG Connect. Writes about all manner of tech from driverless cars, AI, and Green IT to Cloudy stuff, security, and IoT. Dislikes autoplay ads/videos and garbage written about 'millennials'.




