Addressing the tech skills gap could be key to the future of machine learning

Dr. Greg Benson discusses the future of Machine Learning and the skills gap that could be hindering its advancement.

Far from the ‘Dark Fate’ the latest Terminator movie would have you believe, Artificial Intelligence (AI) and its cousin, Machine Learning (ML), are not the precursor technologies of the apocalypse. Instead, these technologies represent an exciting avenue of discovery and innovation that is transforming business and paving the way for more sustainable practices alongside increased business efficiencies.

From delivering instant access to patient information in the healthcare industry to streamlining production in intelligent factories, the application of ML technology has already had a tangible effect on enterprise. And, given adequate time and funding to mature, ML has the potential to become a truly game-changing technology. However, like many disruptive technologies, ML's future is not certain and it faces a number of challenges, not least of which is a lack of highly trained ML practitioners available to train future generations of ML enthusiasts.

To learn more about the current state of ML within the enterprise and talk about the challenges it faces, we spoke with Dr. Greg Benson, Professor of Computer Science at the University of San Francisco and Chief Scientist at SnapLogic. Below is a lightly edited copy of our conversation, in which he highlights the importance of addressing the data science skills gap and discusses the barriers companies have experienced trying to adopt and deploy ML solutions.

What barriers have companies experienced when trying to adopt ML?

In my view, there are three categories of barriers that companies face when trying to adopt or leverage ML. First, simply getting the type of data needed to train ML models can be difficult. In some cases, the data may be available, but because of either technical challenges or bureaucratic hurdles, data scientists cannot get the data that is needed. Alternatively, the data needed for training may not even be collected in the first place. Lots of interesting data is simply not collected and stored for later use. In this case, ML can't start until a process is in place to capture useful data, such as various forms of user activity or related event data. 

Second, we've all heard about the shortage of data science and ML talent. That is, there aren't enough data scientists with adequate ML knowledge to satisfy the demand from industry. I believe this is going to change, as we have seen an increase in the number of Data Science programs, like the Master of Science in Data Science at the University of San Francisco, where I'm a professor. Also, ML is becoming more prominent in undergraduate computer science programs. Our educational systems are responding to help meet the demand.

Third, even with good data sets and qualified data scientists, there is a gap between building useful models and putting them to use in production. Larger tech companies, like Google and Apple, have built custom ML infrastructures to support the development and deployment of models so they can be used in applications, but most smaller companies don't have that kind of infrastructure. We are seeing quite a few platform solutions to this problem. For example, SnapLogic provides a self-service visual interface to put ML models into production.

You mentioned poor data quality and availability as key challenges to ML adoption. Why is this?

Poor data quality is certainly another potential challenge to being successful with ML. The problem stems from the fact that ML algorithms are extremely sensitive to the input data. So, if you have noisy data or data with missing information, this can interfere with an algorithm's ability to provide useful predictions, classifications, or recommendations. In any ML project, a large amount of time is spent not only on finding the right source of input data, but also on cleaning and transforming the data so that it can be fed as quality input to an ML algorithm.
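To make the cleaning step concrete, here is a minimal sketch of one common way to handle missing information before training. It is an illustration only, not a method described in the interview: the field names (`age`, `visits`, `churned`) and the helper `clean_records` are hypothetical, and median imputation is just one of several reasonable choices.

```python
from statistics import median


def clean_records(records, feature_keys, label_key):
    """Drop rows with no label, then fill missing features with the column median.

    Hypothetical helper for illustration; real projects often need
    domain-specific rules for each field.
    """
    # A row without a label cannot be used for supervised training.
    labeled = [r for r in records if r.get(label_key) is not None]

    # Compute the median of each feature from the values that are present.
    medians = {}
    for key in feature_keys:
        observed = [r[key] for r in labeled if r.get(key) is not None]
        medians[key] = median(observed) if observed else 0.0

    # Replace each missing feature value with that feature's median.
    cleaned = []
    for r in labeled:
        row = {
            key: (r[key] if r.get(key) is not None else medians[key])
            for key in feature_keys
        }
        row[label_key] = r[label_key]
        cleaned.append(row)
    return cleaned


# Made-up example data with gaps in both features and labels.
raw = [
    {"age": 34, "visits": 5, "churned": 0},
    {"age": None, "visits": 2, "churned": 1},
    {"age": 51, "visits": None, "churned": 0},
    {"age": 29, "visits": 7, "churned": None},
]
cleaned = clean_records(raw, ["age", "visits"], "churned")
```

In this toy data set the unlabeled fourth row is dropped, and the two remaining gaps are filled from the values that were actually observed. Whether imputing, dropping, or re-collecting data is appropriate depends on the project.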

Given ML's use in facial recognition software and speech recognition, is it plausible to say that the development of ML has been hindered by opposition and wariness to these controversial technologies?

I suppose it is plausible, but that's not what I'm seeing in either industry or academia. There has been a huge increase in the amount of published research in ML and AI over the last 10 years, and businesses are always looking for ways to improve efficiency and customer experience. There are plenty of non-controversial uses of ML that can help businesses better achieve their goals.

The data science skills gap has proved challenging for the technology industry as a whole to overcome. Why do you think the industry is struggling to produce new data scientists?

I mentioned earlier how I believe that our educational systems are responding with new programs and the teaching of ML is becoming more prominent in computer science programs. Larger companies like Google and Facebook have created internal educational programs to train their own employees in ML because they see it as core to their business. It is harder for non-technology focused companies and other enterprises to afford such programs or even get the talent needed to start one in the first place. I think there is a real opportunity for businesses to partner with local universities and colleges to form a mutually beneficial relationship. It is nearly impossible to recreate or simulate in research labs the scale of data and the types of interactions that exist in real businesses. So, in education, students do not get to work with real data until they leave university. If more businesses can partner with colleges to expose students to real data, they will be helping in our overall effort to train more data scientists and computer scientists.

Has the skills shortage restricted the development of ML technologies thus far? And if so, in what ways?

There are two types of efforts in terms of ML technologies. There is the fundamental research being conducted at universities and research labs, and there is the development of practical technologies based on our current knowledge. We need to keep both pipelines going. That is, we need PhDs both to continue fundamental research and to teach our undergraduate and graduate students modern ML methods. A concern here is that the industry is tempting many professors to leave academia, not only to work on interesting problems but also for more money than can be made at a university. If this trend continues, it could negatively affect the pipeline of skilled data scientists and engineers because we will lack the professors who can teach ML. On the development of practical technologies, there seems to be healthy progress both in open source ML libraries and tools and in commercial technologies.

Are there any other emerging technologies that could have a significant impact on the development and adoption of ML in enterprise?

Yes, there are several efforts by start-ups and established companies to address the challenges I mentioned above. Google and Microsoft provide several ML services as APIs, in which they provide pre-built models for tasks such as computer vision and natural language. These can be used without building your own infrastructure and can be incorporated into products and services easily. However, some businesses may not want to send their data to third parties, or the pre-built models may not provide the best results for a specific task. So, there are companies that are building platforms that make it easier to carry out each step of the ML process, from data collection, to feature engineering, to training and testing, and finally to production.

Is it possible to create a machine learning model free from bias? And how can businesses prevent bias in their models?

This is not my core area of expertise, but this is a real problem. It's not that ML algorithms themselves are biased, but rather the training data itself may be skewed in terms of examples toward a particular class or segment of people, either negatively or positively. So, this may be more about training data selection than the ML algorithms themselves. A business should spend time looking at the demographics of the training data and sample the data in such a way that it provides balance for the desired ML goal.
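As a rough illustration of checking and rebalancing training data, the sketch below counts how many examples fall into each group and then downsamples every group to the size of the smallest one. The `segment` field and the helper names are hypothetical, and downsampling is only one option; it discards data, so reweighting or collecting more examples for under-represented groups may be preferable in practice.

```python
import random
from collections import Counter, defaultdict


def group_counts(rows, group_key):
    """Count how many training examples fall into each group."""
    return Counter(row[group_key] for row in rows)


def balanced_sample(rows, group_key, seed=0):
    """Downsample every group to the size of the smallest group.

    Hypothetical helper for illustration; a fixed seed keeps the
    sample reproducible.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group_key]].append(row)

    # The smallest group determines how many examples each group keeps.
    target = min(len(members) for members in by_group.values())

    sample = []
    for members in by_group.values():
        sample.extend(rng.sample(members, target))
    rng.shuffle(sample)
    return sample


# Made-up data: segment "A" is heavily over-represented.
rows = [{"segment": "A", "label": i % 2} for i in range(90)]
rows += [{"segment": "B", "label": i % 2} for i in range(10)]

before = group_counts(rows, "segment")
balanced = balanced_sample(rows, "segment")
after = group_counts(balanced, "segment")
```

Comparing `before` and `after` shows the skew in the original data and confirms that each segment contributes the same number of examples to the balanced sample.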

What are some best practices you would recommend to help data scientists develop their own ML models?

It is important to approach ML development with the mindset of constant refinement. This applies not only in the development phase, when you are trying to figure out which data features will produce the best results and which type of algorithm and corresponding hyperparameters are the most effective, but also after a model has been deployed. Early on, you want to think about how you are going to evaluate the model in practice and how to improve and redeploy models on a regular basis. ML is not a one-shot process.