
Machine learning with Python: An introduction

Machine learning is one of our most important technologies for the future. Self-driving cars, voice-controlled speakers, and face detection software are all built on machine learning technologies and frameworks. As a software developer, you may wonder how this will impact your daily work, including the tools and frameworks you should learn. If you're reading this article, my guess is you've already decided to learn more about machine learning.

In my previous article, "Machine Learning for Java developers," I introduced Java developers to setting up a machine learning algorithm and developing a simple prediction function in Java. While Java's ecosystem includes many tools and frameworks for machine learning, Python has emerged as the most popular language for this field. Figure 1 shows the result of a recent Google Trends query combining the search term "machine learning" with "Python," "Java," and "R." Although this graph is not reliable from a statistical point of view, it does allow us to visualize the popularity of Python for machine learning.


Figure 1. Popularity of machine learning languages (January 2019)

In this article, you'll learn why Python is especially successful for machine learning and other uses involving data science. I'll briefly introduce some of the Python-based tools data scientists and software engineers use for machine learning, and suggest a few ways to integrate Python into your machine learning development process--from mixed environments leveraging a Java backend to Python-based solutions in clouds, containers, and more.

A use case for machine learning

To start, let's revisit the use case from my previous introduction to machine learning. Assume you're working for a large, multinational real estate company, Better Home Inc. To support its agents, the company uses third-party software systems as well as a custom-developed core system. This system is built on top of a massive database containing historical data on sold homes, sale prices, and descriptions of available houses. The database is updated continuously by internal and external sources, and is used to manage sales as well as to estimate the market value of properties for sale.

An agent may enter features such as house size, year of construction, location, and so on to receive the estimated sale price. Internally, this function uses a machine learning model--essentially, a mathematical expression of model parameters--to calculate a prediction. (Please see my previous article for a more detailed explanation of machine learning algorithms and how to develop and use them in Java.)

Listing 1. A machine learning model based on linear regression


double predictPrice(double[] houseFeatures) {
    // mathematical expression (here linear regression)
    double price = this.modelParams[0] * 1 +
                   this.modelParams[1] * houseFeatures[0] +
                   this.modelParams[2] * houseFeatures[1] +
                   ...;
    return price;
}

In Listing 1, a machine learning model is implemented using linear regression, one of the most popular algorithms in machine learning. The algorithm multiplies each model parameter by the corresponding feature value of a given property and sums the results. As is typical in machine learning, a training process determines the parameter values to be used for the model. This approach is called supervised learning.

Supervised learning consists of feeding a system labeled example records, which are then analyzed for correlations. In this case, the system is fed historical house record features that have been labeled with the sale price. The model looks for correlations between features that have some impact on sale price, as well as the weight of these relationships. Model parameters are then adjusted based on the identified correlations and weights. This is how a machine learning model "learns" to estimate the price for a given house.

Listing 2. Training model parameters


void train(double[] houseFeatures, double[] pricesOfSale) {
    // .. find hidden structures and determine the
    // proper model parameters
    this.modelParams = ...
}

Challenges in machine learning

While the code example may appear quite simple, the challenge is to find and train the appropriate algorithm. In contrast to linear regression, which is relatively simple, most algorithms used for machine learning are more complex. Many machine learning algorithms also require additional (hyper)parameters, which demand a deeper understanding of the mathematics behind the algorithm.
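
To make the idea of hyperparameters concrete, here is a minimal sketch using Scikit-learn (introduced later in this article): a ridge regression model whose regularization strength alpha is a hyperparameter, tuned with a simple cross-validated grid search. The parameter values shown are arbitrary examples.

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# alpha is a hyperparameter: it is not learned from the training data,
# but chosen by the developer (here via cross-validated grid search)
search = GridSearchCV(Ridge(), param_grid={'alpha': [0.1, 1.0, 10.0]}, cv=5)
# search.fit(features_list, price_list) would train one model per alpha value
# and keep the best-performing combination in search.best_estimator_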

Another challenge is finding and selecting appropriate training data. Data records have to be collected and understood, and collecting the records is not always easy. In order to build and train a price-prediction model, you must first locate a large number of sold house records. In order to be useful, you need not only the sale price but other features that help define the value of each house. In many cases, this means importing and consolidating from external as well as internal data sources. As an example, you might fetch house characteristics as well as the price of sale from an internal database storing sales transactions. For additional characteristics, you might call external partner APIs that provide information regarding the transport infrastructure or income levels for the given neighborhood.
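
A hedged sketch of what such a consolidation step might look like with pandas, using hypothetical file names for an internal sales export and an external neighborhood feed:

import pandas as pd

# internal source: historical sales transactions (hypothetical export)
sales = pd.read_csv('sales_transactions.csv')
# external source: income levels per neighborhood (hypothetical partner feed)
income = pd.read_csv('neighborhood_income.csv')

# consolidate both sources into a single data set keyed by neighborhood
houses = sales.merge(income, on='Neighborhood', how='left')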


Figure 2. Acquiring and consolidating data

Machine learning as a scientific process

Developing machine learning models is more similar to a scientific process than to traditional computer programming. A scientific process starts with a question, or an observation. For instance, you might observe that senior estate agents at Better Home Inc. are quite good at estimating the market price of a house. By interviewing these agents, you discover that they are able to quickly enumerate the features that determine the market value of a house. Furthermore, they're well versed in market conditions for different cities and regions. From this observation, you theorize that anyone could determine the market price of a house by combining historical sales data with key features of the property. Using this data, you could develop a machine learning model capable of estimating the sale price of a house. This feature would be of value to the company because it would enable inexperienced agents to determine the expected sale price of a new offer.

In order to test your thesis, you will need to acquire and explore the selected data sets. At this point, you are seeking an overview of the data structure. To get this overview, you will likely use tools such as Tableau, KNIME, and Weka, or even simple libraries like Python Data Analysis Library (pandas) or matplotlib. Before attempting to build your machine learning models, you will also need to prepare your data records by handling invalid or missing values. Once you've built your models, you will need to test and validate them in order to know whether your assumptions are true or false. You might, for example, validate whether the Better Home Inc. machine learning model is capable of estimating the proper sale price of a house. In general, data exploration, analysis, cleaning, and validation are the most time-consuming activities of machine learning.
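
For instance, a first exploration and cleaning pass with pandas might look like the following sketch (the column names are assumptions based on the housing data used later in this article):

import pandas as pd

houses = pd.read_csv('houses.csv')

print(houses.describe())        # summary statistics per column
print(houses.isnull().sum())    # number of missing values per column

# simple preparation steps: drop records without a sale price,
# and fill missing lot areas with the median value
houses = houses.dropna(subset=['SalePrice'])
houses['LotArea'] = houses['LotArea'].fillna(houses['LotArea'].median())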

The role of the data scientist

Data scientists are frequently responsible for the major tasks of a machine learning process. Most data scientists have a background in mathematics and statistics, but they are also typically proficient with programming and data modeling skills. Data scientists often have a strong understanding of data-mining techniques, which helps them to understand and select data sources, as well as gaining insight from the data. Careful data analysis helps teams choose the appropriate machine learning algorithms for a given use case.

In contrast to traditional software engineers, including enterprise Java developers, a data scientist is more focused on data and the hidden patterns in data. Data scientists typically develop, train, and process machine learning models using computing environments and data platforms implemented by traditional software engineers.

Python-based tools for machine learning

Understanding the role of data scientists in machine learning helps us understand why Python is the preferred language for this field. Unlike traditional software engineers, most data scientists prefer Python as a programming language. This is because data scientists are generally closer to scientific and research communities, where R and Python are widely used. Moreover, these communities have developed Python-based scientific libraries that make it easier to develop machine learning models. Now there is a growing, Python-based tools ecosystem specifically for machine learning. This ecosystem includes Jupyter Notebook, an interactive web-based Python shell, which is the current, de facto standard in the field of data science.

Jupyter Notebook: A web interface for visualizing data analysis

Jupyter Notebook extends a command-line Python interpreter with a web-based user interface and some enhanced visualization capabilities. It integrates code and its output into a single web document that combines executable code, explanatory text, and visualizations. The inline plotting of the output allows immediate data visualization and iterative development and analysis. A notebook is used to explore data as well as to develop, train, and test machine learning models. As an example, a data scientist working for Better Home Inc. might use a notebook to load and explore available housing data sets, as shown in Figure 3.


Figure 3. Exploring data with Jupyter Notebook

A notebook in Jupyter consists of input cells and output cells. The editable input cells contain common Python code, which is executed by pressing the key combination Ctrl+Enter. In the notebook shown in Figure 3, the second input cell loads a houses.csv file into a pandas dataframe. The dataframe provides utilities to manipulate and visualize data in an intuitive way. The third cell uses the dataframe to plot a histogram of house prices over time.
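
The relevant notebook cells might look roughly like the following sketch (the exact plot in Figure 3 may differ, and the SalePrice column name is an assumption):

import pandas as pd
%matplotlib inline

# second cell: load the house records into a pandas dataframe
houses = pd.read_csv('houses.csv')
houses.head()

# third cell: plot a histogram of the house prices
houses['SalePrice'].hist(bins=50)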

Data scientists use histograms and other charts and visualizations to understand data, and to identify outliers and inconsistencies in the data. Identifying inconsistencies and outliers is important because it allows you to sort through and resolve them in the data preparation process. This process eventually leads to clean data sets, which you can use to develop reliable machine learning models. You use the data sets to identify the features or house properties that are most relevant to the final sale price. These are the features that will define your machine learning model. Most algorithms aren't intelligent enough to automatically extract meaningful features from the full data set, and most algorithms won't work well if there are too many features to be analyzed.
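
One simple way to get a first impression of which features matter is to look at how strongly each numeric column correlates with the sale price. A sketch, assuming the dataframe from Figure 3 and a SalePrice column:

# correlation of each numeric feature with the sale price,
# sorted from strongest to weakest
houses.corr()['SalePrice'].sort_values(ascending=False)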

Scikit-learn: A library of advanced machine learning algorithms

As I explained in my Java-based introduction to machine learning, regression algorithms require numeric input values. For a machine learning model like this one, all of your strings or category values must be converted to numeric values. This conversion is done during feature extraction. One way to extract features is to develop a dedicated function that converts the raw input of house records into a vectorized representation the algorithm can understand.
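
As a simple illustration of such a conversion, a categorical column like MSZoning could be turned into numeric indicator columns with pandas (one-hot encoding); a sketch:

import pandas as pd

# replace the string column 'MSZoning' with one numeric 0/1 column per category
houses = pd.get_dummies(houses, columns=['MSZoning'])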

Below is a simplified extract_core_features() method written in Python. If you are unfamiliar with Python, don't be confused by the self argument. In Python, the first argument of every non-static method definition is always a reference to the current instance of the class. On the caller side, this argument is passed automatically when the method is invoked.

A significant portion of the machine learning code data scientists write is for feature extraction. In the field of natural language processing, for instance, several non-trivial conversion steps are required to transform human text into a vectorized form.

Listing 3. Estimator.py


import pickle
from sklearn.linear_model import LinearRegression
class HousePriceEstimator:
    def __init__(self):
        self.model = LinearRegression()
    def extract_core_features_list(self, house_list):
        features_list = []
        for house in house_list:
            features_list.append(self.extract_core_features(house))
        return features_list
    def extract_core_features(self, house):
        # returns the most relevant features as numeric values
        return [house['MSSubClass'],
                house['LotArea'],
                house['OverallQual'],
                house['OverallCond'],
                int(house['YrSold'])]
    def train(self, house_list, price_list):
        features_list = self.extract_core_features_list(house_list)
        self.model.fit(features_list, price_list)
    def predict_prices(self, house_list):
        features_list = self.extract_core_features_list(house_list)
        return self.model.predict(features_list)  # returns the predicted price_list
    def save(self, model_filename):
        pickle.dump(self.model, open(model_filename, 'wb'))
    @staticmethod
    def load(model_filename):
        estimator = HousePriceEstimator()
        estimator.model = pickle.load(open(model_filename, 'rb'))
        return estimator

The extract_core_features() method is part of the HousePriceEstimator class. This class encapsulates the machine learning model by providing methods to support model training and prediction, as well as saving and restoring the model. Internally, the model uses the Scikit-learn library, which is a collection of advanced machine learning algorithms for Python.

Instantiating a new HousePriceEstimator creates a new LinearRegression model instance. The model is then trained by calling the train() method with a list of houses and their associated sale prices. The model's internal parameter values are adjusted based on the training data. After the training process, HousePriceEstimator is able to estimate house prices by executing the predict_prices() method.
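
Putting it together, a minimal usage sketch of the class might look like this (the house records and prices below are made-up example values):

houses = [{'MSSubClass': 60, 'LotArea': 8450, 'OverallQual': 7, 'OverallCond': 5, 'YrSold': '2008'},
          {'MSSubClass': 20, 'LotArea': 9600, 'OverallQual': 6, 'OverallCond': 8, 'YrSold': '2007'}]
prices = [208500, 181500]

estimator = HousePriceEstimator()
estimator.train(houses, prices)                 # fit the linear regression model
estimator.save('model.p')                       # serialize the trained model to disk

restored = HousePriceEstimator.load('model.p')  # restore the model later, e.g. in a service
print(restored.predict_prices(houses))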

Testing models with Jupyter Notebook

You can use Jupyter Notebook to explore data sets, and you can also use it to develop and test machine learning models. In Figure 4, a scatter plot gives a visual overview of the relationship between real and predicted housing prices. The closer the points are to the diagonal line, the more accurate the predicted prices. In practice, such visualizations are enriched by numeric metrics such as accuracy or precision values.


Figure 4. Validating the quality of a model
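
A minimal matplotlib sketch of how such a validation plot could be produced, assuming a held-out set of test house records (test_houses) and their real sale prices (test_prices):

import matplotlib.pyplot as plt

predicted_prices = estimator.predict_prices(test_houses)

plt.scatter(test_prices, predicted_prices, alpha=0.5)
plt.plot([min(test_prices), max(test_prices)],
         [min(test_prices), max(test_prices)])   # diagonal = perfect predictions
plt.xlabel('real sale price')
plt.ylabel('predicted sale price')
plt.show()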

Machine learning in production

Once you've finalized and trained a model, you will want to deploy it into a production environment. We often use different tools to meet different requirements in the development and production environments. Production environments are driven by requirements such as reliability and scalability, whereas tools and systems in a development environment are more focused on facilitating thinking and experimentation. Jupyter Notebook is mostly used in the development stage of the machine learning lifecycle. A runtime environment is required to process the model in production.

Different lifecycle stages also have different runtime performance requirements. As one example, the process of training a production-ready model typically requires considerable computation power. It isn't uncommon for complex algorithms, such as deep neural networks, to include millions of model parameters, all of which have to be trained. Python's execution speed is significantly slower than C/C++ or Java's, but this does not mean using Python in production is inefficient. Python-based machine learning libraries like Scikit-learn get around performance issues by relying on lower level packages written in C/C++ or Fortran, which are capable of more efficient computations.

Other popular machine learning frameworks and libraries such as TensorFlow are mostly written in highly-optimized C++, but provide Python language bindings.

Cloud solutions for machine learning

Cloud integrations with TensorFlow or Scikit-learn make it possible to run trained models in a cloud environment. For instance, Google's Cloud Machine Learning (ML) Engine offers a web service for handling HTTP prediction requests, which can then be forwarded to a trained machine learning model.

Figure 5 shows a web-based approach to performing an online prediction.


Figure 5. Performing online predictions

Built-in support from Google's Cloud ML Engine is currently restricted to a few popular machine learning libraries. The platform is also connected with other Google cloud services.

Mixing Python and Java for machine learning

Both Java and Python have strengths, so you might consider the advantages of a mixed environment for machine learning.

Jython, a Python interpreter written in Java, is one option for bridging the two programming language ecosystems. Jython enables you to call Python code from Java-based web services within the same JVM. Unfortunately, Jython is severely limited for machine learning: first, the current Jython implementation supports only Python 2.7 and not Python 3.x. Worse, Jython cannot run libraries such as NumPy or SciPy, which depend on native C extensions. That's a real problem, because many Python-based tools for machine learning depend on such libraries.

Instead of running Python code on the JVM, you might consider using Java's ProcessBuilder to execute Python code from Java. Essentially, ProcessBuilder lets you launch and interact with Python processes from a Java-based web service, making it possible to run a trained machine learning model inside the Python process. Implementing this approach in a reliable and performant way requires considerable experience, however. You will need to reduce the overhead of creating a new Python process for each prediction request, and you will need a way to synchronize your Java-code updates with Python-code updates, especially when the interaction protocol between the Java and Python processes changes.

An alternative to ProcessBuilder is using JNI/JNA to call Python from Java, in this case calling the Python code via a C++ bridge. A popular library implementing this approach is JPy. Like ProcessBuilder, a solution based on JNI/JNA also requires special attention to performance and reliability.

Interoperable solutions for Python and Java

Some interoperability approaches allow you to develop your machine learning models with Python and process them using Java-based libraries like Eclipse's Deeplearning4j. Using Deeplearning4j, for instance, you can run Keras-based models, which are written in Python.

Another approach is to use an interoperability standard such as PMML (Predictive Model Markup Language) or ONNX (Open Neural Network Exchange Format). These provide a standard format to represent machine learning models, so that you can share models between machine learning libraries. An advantage of the standards-based approach is that it allows you to switch to another machine learning stack within the production environment.
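
As one hedged illustration (the skl2onnx package is not covered in this article and is an assumption on my part), a trained Scikit-learn model such as the one inside HousePriceEstimator could be exported to the ONNX format roughly like this:

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# describe the model input: a batch of records with 5 numeric features
initial_types = [('features', FloatTensorType([None, 5]))]
onnx_model = convert_sklearn(estimator.model, initial_types=initial_types)

with open('house_price_model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())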

Support for interoperability is limited for feature extraction, however, where one of the first steps is to pre-process the raw input data. In most cases, large parts of the pre-processing code are not covered by interoperability standards or models, so you are forced to reimplement feature extraction code whenever the language environment is switched. Reimplementing that code is not trivial, especially because it is often the major portion of the application code for a machine learning project.

Using Java by itself

Switching between machine learning libraries and programming languages to train and process models poses a variety of restrictions and difficulties, depending on the approach you choose. The simplest approach, by far, is to stay with a single machine learning library and programming language throughout the machine learning process.

As I demonstrated in my previous article, it is entirely possible to use Java for all phases of machine learning. Weka is a strong, Java-based framework that provides graphical interfaces and execution environments, as well as library functions to explore data and build models. Eclipse Deeplearning4j (DL4J) is built for Java or Scala, and it integrates with Hadoop and Apache Spark.

A Java-only approach likely makes the most sense for shops that are strongly committed to the Java stack and able to employ Java-skilled data scientists. Another option is to use Python for both the development and production phases of the machine learning lifecycle.

Building a Python-based web service for machine learning

In this section we'll quickly explore a pure, Python-based solution for machine learning. To start, we build a Python-based web service:

Listing 4. estimator_service.py


from datetime import datetime
from flask import Flask, request, jsonify
from flask_restful import Resource, Api
from estimator import HousePriceEstimator
class HousePriceEstimatorResource(Resource): 
    def __init__(self, estimator):
        self.estimator = estimator
    def post(self):
        house = request.get_json()  # parses the request payload as json
        predicted_price = self.estimator.predict_prices([house])[0]
        # cast the NumPy scalar to a plain float so it can be JSON-serialized
        return jsonify(price=float(predicted_price), date=datetime.now())
estimator = HousePriceEstimator.load('model.p')
estimator_app = Flask(__name__)  # creates the application object used by the WSGI server
api = Api(estimator_app)
# the keyword must match the __init__ parameter name of the resource class
api.add_resource(HousePriceEstimatorResource, '/predict', resource_class_kwargs={'estimator': estimator})

In this example, we've used the Python microframework Flask to implement a RESTful service, which accepts house price prediction requests and forwards them to the estimator. The RESTful extension of Flask provides a Resource base class, which is used to define the handling of HTTP methods for a given resource URL. The HousePriceEstimatorResource class extends the Resource base class and implements the post() method of the /predict endpoint. When the post() method is executed, the received house data record is used to call the HousePriceEstimator's predict_prices() method. The estimated price is then returned along with a prediction timestamp, as shown below.

Listing 5. Curl-based example call


root@2ec2dc33c182:/datascience# curl -H'Content-Type: application/json' -d'{"MSSubClass": 60, "LotArea": 8450, "MSZoning": "RL", "LotShape":"Reg", "Neighborhood":"CollgCr", "OverallQual": 67, "OverallCond": 55, "YrSold": 56}' -XPOST http://127.0.0.1:8099/predict
{
  "date": "Thu, 20 Dec 2018 06:40:50 GMT",
  "price": 3606324.4901064923
}

The Python code after the HousePriceEstimatorResource class definition loads the serialized machine learning model and uses it to build a Flask application instance called estimator_app. We can then run the estimator_service.py service using the WSGI server Gunicorn, as shown below. The server is started with the Python module name (in this case, the Python source filename without the ".py" extension) and the name of the variable containing the app.

Listing 6. Running the Python web service


root@2ec2dc33c182:/datascience# gunicorn --bind 0.0.0.0:8099 estimator_service:estimator_app
[2018-11-11 13:18:05 +0000] [6] [INFO] Starting gunicorn 19.9.0
[2018-11-11 13:18:05 +0000] [6] [INFO] Listening at: http://0.0.0.0:8099 (6)
[2018-11-11 13:18:05 +0000] [6] [INFO] Using worker: sync
[2018-11-11 13:18:05 +0000] [9] [INFO] Booting worker with pid: 9

Similar to the Java Servlet specification, Python's WSGI standard defines an interface between web servers and Python web applications. In contrast to the Java Servlet API, the programming model of a Python web application is still framework specific. As an example, the naming of the post() method is defined by the Flask framework and is not part of the WSGI spec.

The WSGI interface gives you the flexibility to choose the best-fitting WSGI web server for your environment. One option is to use the lightweight, Python-based Gunicorn server for development activities only. In production, you could use something more flexible, like Nginx/uWSGI. To deploy the code on production nodes, you could build a Docker container image that includes the WSGI web server, the application code, and the required libraries and serialized model file. The Docker image would then run as a web-based microservice that serves house price predictions.


Figure 6. Estimator docker container image

In conclusion

As I demonstrated in my previous article, it is entirely possible to use Java for all phases of machine learning: exploration, data preparation, modeling, evaluation, and deployment. Weka is a rich, Java-based framework that provides graphical interfaces and execution environments, and for deep neural network models you could use Deeplearning4j. There is even an integration with Weka that allows you to use models developed with Deeplearning4j in the Weka workbench.

For shops that can be more flexible, there are many advantages to choosing Python. Data scientists are more likely to be familiar with Python and its tools ecosystem, and developers taking on the role of a data scientist will benefit from the many useful and efficient Python-based machine learning tools. Tools and libraries such as Jupyter, pandas, or Scikit-learn will move you through the process of analytics and model development, reducing iteration time from data analysis, data preparation, and model building to deploying your model in production.

Once your model is in production, you have a variety of options for deploying it: you may experiment with a mixed environment that supports a Java backend, move your Python-based application to the cloud, or develop your own RESTful Python-based web service or microservice.
