Data Mining

Database helped reporters follow the Panama Papers money

Consider Mossack Fonseca and the Panama Papers scandal. Not long ago, if a terabyte and a half of data had landed on any national news desk, no team of investigators could have made sense of it. They'd have been baffled by the sheer weight of information and the thick fog of shapeless data, not to mention the endless false trails laid down by specially created companies and ghost organisations.

As reported on April 4th, it was Australian e-discovery company Nuix that brought the investigation together. Now we can reveal the role of another player in the team, database company Neo4J, which provided the intelligence to make sense of all this information, discover the links within it and give it structure. It did this by taking the original indecipherable database, stripping it down, then rebuilding it as a second database designed to be much clearer. It then helped provide the tools and the training to enable investigators to start their job of following the money trails.

Sounds simple, but a lot of hard work goes into making complex tasks look easy. We asked Rik Van Bruggen, Neo4J’s European regional VP, to explain how it broke 11 million records down and built them up again.


Can we explain one mystery to begin with? Neo4J’s technology is described as a ‘graph database’…

Ah yes, that’s the curse of the IT industry. The word graph, when used in our context, has nothing to do with charts or graphics, but relates to the centuries-old mathematical concept of a structure made up of items and the connections between them. It alludes to the links between all the data, the networks if you like.

The project used the database to structure those 11 million records and provide all the relevant links between them. The unstructured nature of the documents obscured those patterns, and the picture was clouded even further by all the shell companies and false constructs. But we were able to break everything down into its constituent parts and put it all back together in a form that provides much greater clarity.


There are 11 million records, so presumably the process of establishing links must be automated. How do you do that?

First we take all the information from all the various unstructured sources, such as PDFs, emails, Word documents and Excel files, and even other databases. The limitation of classic databases and records is that most people only structure data in tables or lists. That is fine if you’re performing calculations on it. It’s not so good if you’re looking for relationships in a set of information, whether it’s about offshore investment schemes, biomedicine or terror groups. Our aim is to take information from 11 million discrete two-dimensional data sheets and prepare it in a form in which it can be related across multiple dimensions.

Once we have all these unstructured documents (the PDFs, emails and so on) we begin processing them to extract meaning. We use search engines, entity extraction, natural language algorithms and so on. The power of our database is that it doesn’t store all these extracted nuggets of information as isolated objects, but as nodes on a network of information. That’s what we mean by a Graph Network.
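To make the idea of storing extracted facts as linked nodes more concrete, here is a minimal sketch in plain Python. It is not Neo4J's API or the pipeline used on the Panama Papers; the function name build_graph, the relationship labels and every entity name are invented for illustration, and the input list stands in for whatever an entity-extraction step might produce.

```python
from collections import defaultdict

# Hypothetical output of an entity-extraction step, as
# (source entity, relationship, target entity) triples.
# Every name below is invented for illustration.
extracted = [
    ("Person A", "DIRECTOR_OF", "Shell Co 1"),
    ("Shell Co 1", "SHAREHOLDER_OF", "Shell Co 2"),
    ("Shell Co 2", "REGISTERED_AT", "Address X"),
    ("Person B", "DIRECTOR_OF", "Shell Co 2"),
]


def build_graph(triples):
    """Store each entity as a node and each extracted fact as a labelled link."""
    graph = defaultdict(list)
    for source, relationship, target in triples:
        graph[source].append((relationship, target))
        # Also record the reverse direction so links can be followed both ways.
        graph[target].append(("inverse:" + relationship, source))
    return graph


graph = build_graph(extracted)

# Everything directly connected to one company, regardless of which
# document each fact originally came from.
for relationship, other in graph["Shell Co 2"]:
    print(relationship, other)
```

The point of the structure is that a fact found in one document and a fact found in another end up attached to the same node, so the connection between them no longer depends on anyone having read both documents.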


So it’s a bit like putting all the information in a blender, then allowing the natural elements to link together.

All the elements taken from each document are linked and tagged so that the people who subsequently search the database can find the links.

Once we have created this structured network of information, we can give it to the investigating agencies to conduct their searches. They know what they will be looking for.

Our job is to make it easy for them. We give them a simple interface and the tools for digging. In this case we trained the journalists to ask the database all kinds of crazy questions. The key to finding this information is provided by our technology partner, Linkurious, which helps users visualize the results of their searches so that they can see patterns as they emerge.
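As a rough illustration of the kind of question a graph structure makes easy to ask, the sketch below searches for the shortest chain of links between two entities using a breadth-first search. It is a generic textbook technique in plain Python, not the query interface the journalists used; the function name find_trail and all the entities in the sample data are hypothetical.

```python
from collections import deque

# Hypothetical network: each entity maps to the entities it is directly linked to.
links = {
    "Person A": ["Shell Co 1"],
    "Shell Co 1": ["Person A", "Shell Co 2"],
    "Shell Co 2": ["Shell Co 1", "Bank Account Y"],
    "Bank Account Y": ["Shell Co 2"],
    "Person B": [],
}


def find_trail(links, start, goal):
    """Breadth-first search: return the shortest chain of links, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in links.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


trail = find_trail(links, "Person A", "Bank Account Y")
print(" -> ".join(trail))
```

In a table-based layout the same question would need one join per intermediate company, and the number of hops is not known in advance; in a graph it is a single traversal.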


There must be tons of untold stories in those 11 million records. It all depends on the bias of the journalists. Will this database be open for anyone to search?

Initially the database was given to the ICIJ (the International Consortium of Investigative Journalists) and they contacted their trusted sources. There were 100 journalists around the world. Any Icelandic journalist on that list would have concentrated on asking questions about their own prime minister, for example. ICIJ will be very picky about who they give access to for searches.


What’s the future of this technology?

We’re super excited. We’re helping to make it harder to hide information, so we’re making it harder for the bad guys. In future it’s not going to be as easy as taking a flight to Belize every month with a suitcase of cash. More and more companies, like the Financial Times in the UK and its Dutch equivalent, Het Financieele Dagblad, are building their own graph databases for financial investigations.

But we’re also aiming to help uncover information and links in other areas, like biomedicine as well as combatting fraud and terror networks.

It can be a daily grind working in the software industry, but things like this make it all worthwhile. It really gives meaning to what we’re doing.



Related reading:

Nuix CEO interview

Inside the tech behind the Panama Papers


Nick Booth

Nick Booth worked in IT in the UK’s National Health Service, financial services and The Met Police, witnessing at first hand the disruptive effects of new technology. As a journalist and analyst, his mission is to stop history repeating itself.
