Visual analytics: Can machine learning 'see'?

The human brain remains the best video analyser – but computers are starting to catch up

This is a contributed piece by Gadi Lenz, Chief Scientist at AGT International.

Here is an interesting observation: Ask a child to describe what she sees around her and she will immediately tell you something like “I see a tall man talking to a woman in the driveway in front of a yellow house”.

The same task is beyond current computer technology – specifically, feeding a “raw” video clip to a machine and getting back (reasonably quickly) a short textual description of what happens in the clip is currently pretty much impossible. Images and video are rich sources of information: many different objects (with different shapes and colours) in some relationship to each other, in some environment, possibly moving (in the case of video), and so on – there is a reason a picture is worth a thousand words.

Analysing images and video to facilitate automatic insights and the associated decisions is still incredibly difficult (even offline; doing it in real time is much harder). A further complication is that most of the visual content we view is actually a 2D projection of the real (3D) world. Remarkably, humans are really good at these tasks, so one approach could be “Hey, let’s just copy the human visual system (HVS)” – if only it were that simple.

So, what can we do in the area of video analytics or video content analysis? Actually, quite a bit, though not quite as much as you may have seen in some popular movies. Here are some examples:

  • Driven by security and surveillance use cases, many “suspicious” behaviours can be recognised automatically (i.e., with no human in the loop): an object left behind, someone crossing a virtual line, loitering and many others, along with related tasks such as people counting. Similarly, in the vehicular traffic area, behaviours such as a stopped vehicle or someone driving on the hard shoulder can be identified.
  • Some very specific objects can be recognised – faces, vehicles, license plates and probably a few more – though some only under limited conditions: controlled lighting, controlled pose, minimal occlusions, etc.
  • Tracking of specific objects within a single camera’s field of view (tracking across multiple cameras, even when successive cameras overlap, remains very difficult).
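To make the virtual-line item above concrete, here is a minimal sketch – not any vendor’s actual algorithm, and the tracker that produces the centroids is assumed – of the geometry behind a “tripwire” analytic: once you have an object’s centroid position per frame, a crossing shows up as a sign change of the cross product relative to the line.

```python
# Simplified "virtual line crossing" check on a tracked object's centroids.
# Assumes an upstream tracker has already produced per-frame (x, y) positions.

def side_of_line(p, a, b):
    """Cross product sign: which side of the line a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def crossed_line(track, a, b):
    """True if consecutive centroids in `track` switch sides of the line."""
    sides = [side_of_line(p, a, b) for p in track]
    return any(s1 * s2 < 0 for s1, s2 in zip(sides, sides[1:]))

# A centroid track moving left to right across a vertical line at x = 5.
track = [(1, 3), (3, 3), (6, 3), (8, 3)]
print(crossed_line(track, (5, 0), (5, 10)))  # True
```

Real products layer a lot on top of this – robust detection and tracking, direction filters, noise suppression – but the triggering rule itself can be this simple.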

If your interest is in some specific items on this limited list – no problem, you can buy them from numerous vendors. However, if you are looking for a different behaviour or a different object, you will need some computer vision people to develop a new analytic service. That generic object recogniser or the generic “tell me if anything unusual happens in this area” does not exist yet.

But don’t despair – machine learning approaches are starting to appear in some commercial products. Basically, the machine is trained, for example, on video that represents normal vehicular traffic flow and once the learning phase is over, the machine can indicate that something abnormal has happened such as traffic slowdown due to some sort of incident further down the road. By “machine”, by the way, we mean the computer that ingests the video stream and runs the anomaly detection algorithm, which could, in principle, run in the camera itself or very near to it.
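The “learn normal, then flag abnormal” idea can be illustrated with a deliberately simple statistical sketch (commercial systems use far richer models; the class name, speed values and threshold here are all assumptions for illustration): fit the mean and spread of normal traffic speeds during a training phase, then flag any observation that deviates too far.

```python
# Toy anomaly detector: learn "normal" vehicle speeds, flag large deviations.
from statistics import mean, stdev

class TrafficAnomalyDetector:
    def __init__(self, k=3.0):
        self.k = k          # how many standard deviations count as abnormal
        self.mu = None
        self.sigma = None

    def train(self, normal_speeds):
        """Learning phase: fit mean and std of normal traffic speeds (km/h)."""
        self.mu = mean(normal_speeds)
        self.sigma = stdev(normal_speeds)

    def is_abnormal(self, speed):
        """Detection phase: a large deviation suggests an incident."""
        return abs(speed - self.mu) > self.k * self.sigma

det = TrafficAnomalyDetector(k=3.0)
det.train([58, 62, 60, 59, 61, 63, 57, 60])  # typical free-flow speeds
print(det.is_abnormal(61))  # False – within normal flow
print(det.is_abnormal(15))  # True – slowdown, likely an incident ahead
```

The point is the shape of the workflow, not the model: the machine never needs a hand-written rule for each incident type, only enough footage of normal conditions.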

At this point, you are probably saying, “so what about copying the human visual system?” Well, it turns out the HVS is quite complex and we have not yet figured out how all of it works. A lot of progress has been made over the years and a lot of good research is ongoing. One of the exciting developments in this area is Deep Learning (DL), a machine learning approach that does really well on tasks where humans usually beat machines – for example, object recognition in images. DL usually requires a lot of computational resources, but that hurdle is slowly disappearing, and as a result there have been some really exciting results. Google has recently shown an image with a caption that was created automatically – this gets us closer to the goal of automatic video summarisation described at the beginning of this post. Another result comes from Microsoft, where researchers managed to outdo humans on an image classification task.

As an aside, some of the world’s top academics in the area of Deep Learning have joined Google, Facebook and Baidu in the last year or two – that should tell you something.

Let me also make the following point – humans are equipped with visual hardware (eyes) that can see in 3D. A lot of tasks get easier when you also have depth information (e.g., which of two visible objects is in front and which is behind), and there are cameras that can record depth – from stereoscopic cameras to Kinect-like sensors to time-of-flight cameras. In fact, if you have a number of cameras covering the same area from different vantage points, you can do on-the-fly 3D reconstruction. You can then run advanced video analytics on real-time 3D streams to do things that were simply not possible before.
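For the stereoscopic case, the core relationship is worth spelling out: for a calibrated stereo pair, depth Z = f · B / d, where f is the focal length in pixels, B the baseline between the two cameras, and d the disparity (pixel offset) of a matched point. A minimal sketch, with all the numbers below assumed for illustration (finding the matched points is the hard part, which real stereo pipelines handle):

```python
# Depth from stereo disparity: Z = f * B / d for a calibrated camera pair.

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres of a point matched across a calibrated stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Assumed example values: 700 px focal length, 12 cm baseline, 20 px disparity.
z = depth_from_disparity(disparity_px=20, focal_px=700.0, baseline_m=0.12)
print(round(z, 2))  # 4.2 (metres) – nearer objects have larger disparity
```

This is why depth resolution degrades with distance: far-away objects produce tiny disparities, so a one-pixel matching error translates into a large depth error.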

A final thought –

The estimated number of surveillance cameras in the world is about 210 million (obviously not counting consumer cameras, smartphone cameras, etc.), producing an obscene amount of stored video, most of which has never been viewed by anyone and most likely never will be – there is just too much of it. Only advanced video analytics will be able to “watch the video for us”, letting us know when there is something interesting there.