One of the more exciting of the 50+ APIs offered by Google is the Prediction API. It provides pattern-matching and machine learning capabilities such as recommendation and categorization. The idea is similar to the machine learning capabilities found in other solutions (e.g. Apache Mahout): we train the system with a set of training data, and applications built on the Prediction API can then recommend (“predict”) which products a user might like, categorize spam, and so on.
In this post we go through an example of how to categorize SMS messages as either spam or valuable texts (“ham”).
Using Prediction API
To use the Prediction API, the service needs to be enabled via the Google API console. To upload training data, the Prediction API also requires Google Cloud Storage.
The dataset used in this post is from the UCI Machine Learning Repository, which has 235 datasets publicly available; this post is based on the SMS Spam Collection dataset.
To upload the training data, we first need to create a bucket in Google Cloud Storage. From the Google API console we click on Google Cloud Storage and then on Google Cloud Storage Manager. This opens a web page where we can create new buckets and upload or delete files.
The UCI SMS Spam Collection file is not suitable for the Prediction API as is; it needs to be converted into the following format (both the category, ham or spam, and the SMS text need to be quoted):
"ham","Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat…"
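The conversion can be scripted. Below is a minimal sketch, assuming the raw UCI file holds one message per line as a label, a tab, and the text, and assuming the Prediction API expects comma-separated, quoted values. The function name and the two sample messages are made up for illustration.

```python
import csv
import io


def convert_sms_collection(src, dst):
    """Rewrite 'label<TAB>text' lines as fully quoted CSV rows."""
    # QUOTE_ALL puts double quotes around both the category and the SMS text.
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in src:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split only on the first tab, in case the message itself contains tabs.
        label, text = line.split("\t", 1)
        writer.writerow([label, text])


# Illustrative input (not taken from the real dataset).
sample = io.StringIO("ham\tSee you at noon\nspam\tFree entry, reply now\n")
out = io.StringIO()
convert_sms_collection(sample, out)
print(out.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by file handles opened for reading and writing.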
The Google Prediction API offers a handful of commands that can be invoked via a REST interface. The simplest way to test the Prediction API is to use the Prediction API explorer.
Once the training data is available on Google Cloud Storage, we can start training the machine learning system behind the Prediction API. To begin training our model, we need to run prediction.trainedmodels.insert. All commands require authentication, which is based on the OAuth 2.0 standard.
In the insert menu we need to specify the fields that we want included in the response. In the request body we need to define an id (used as a reference to the model in later commands), a storageDataLocation pointing to the uploaded training data (the Google Cloud Storage path) and the modelType (either regression or classification; for spam filtering it is classification).
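As a rough illustration, the request body for prediction.trainedmodels.insert could be assembled like this. The three field names come from the description above; the bucket path `my-bucket/sms_spam.csv` and the helper name are hypothetical, and the exact spelling of the modelType value the service accepts is an assumption.

```python
import json


def build_insert_body(model_id, storage_path, model_type="classification"):
    """Return the request body for training a new model."""
    return {
        "id": model_id,                       # reference used by later commands
        "storageDataLocation": storage_path,  # Google Cloud Storage path (bucket/object)
        "modelType": model_type,              # "classification" for spam filtering
    }


# Hypothetical bucket and file name; the id is the one used in this post.
body = build_insert_body("bighadoop-00001", "my-bucket/sms_spam.csv")
print(json.dumps(body, indent=2))
```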
The training runs for a while; we can check its status using the prediction.trainedmodels.get command. The status field will be RUNNING and then change to DONE once the training is finished.
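Checking the status by hand gets tedious, so the check can be wrapped in a small polling loop. The sketch below is not tied to any client library: `get_status` is a hypothetical callable that would wrap a prediction.trainedmodels.get request and return the status string, and the interval and retry limits are arbitrary choices.

```python
import time


def wait_until_trained(get_status, interval_seconds=10.0, max_checks=60,
                       sleep=time.sleep):
    """Poll get_status() until it reports DONE, or give up after max_checks."""
    for _ in range(max_checks):
        status = get_status()
        if status == "DONE":
            return status
        if status != "RUNNING":
            # Anything other than RUNNING or DONE is treated as a failure here.
            raise RuntimeError(f"unexpected training status: {status}")
        sleep(interval_seconds)
    raise TimeoutError("training did not finish in time")


# Demo with canned statuses instead of real API calls (no actual waiting).
statuses = iter(["RUNNING", "RUNNING", "DONE"])
print(wait_until_trained(lambda: next(statuses), sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the loop easy to try out without waiting between checks.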
Now we are ready to run our test against the machine learning system, which will classify the given text as spam or ham. The Prediction API command for this action is prediction.trainedmodels.predict. In the id field we refer to the id that we defined for the prediction.trainedmodels.insert command (bighadoop-00001), and we also need to specify the request body: the input will be a csvInstance, followed by the text that we want to have categorized (e.g. “Free entry”).
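For reference, the predict request body might look like the sketch below. The nesting of `csvInstance` under `input` follows the description above; representing the instance as a list with one element per feature column is an assumption, and the helper name is made up.

```python
import json


def build_predict_body(text):
    """Return the request body asking the model to classify one SMS text."""
    # One list element per feature column; our model has a single text column.
    return {"input": {"csvInstance": [text]}}


print(json.dumps(build_predict_body("Free entry")))
```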
The system then returns the category (spam) and the scores (0.822158 for spam, 0.177842 for ham).
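An application would typically pick the highest-scoring category from such a response. The following is a small sketch of that step. The response shape, with the field names `outputLabel` and `outputMulti`, is an assumption about the JSON the service returns; only the two scores are taken from the result above.

```python
def top_label(response):
    """Return (label, score) for the best-scoring category in a predict response."""
    # float() also covers the case where scores arrive as strings.
    scores = {item["label"]: float(item["score"])
              for item in response["outputMulti"]}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Assumed response shape, filled in with the scores reported above.
response = {
    "outputLabel": "spam",
    "outputMulti": [
        {"label": "spam", "score": 0.822158},
        {"label": "ham", "score": 0.177842},
    ],
}
label, score = top_label(response)
print(label, score)
```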
Google Prediction API libraries
Google also offers a featured sample application that includes all the code required to run it on Google App Engine. It is called Try-Prediction and the code is written in Python and also in Java. The application can be tested at http://try-prediction.appspot.com.
For instance, if we enter a quote from Niels Bohr into the Language Detection model, “Prediction is very difficult, especially if it’s about the future.”, it reports that the text is likely to be English (54.4%).
The key part of the Python code is in predict.py:
The Java version of Prediction web application is as follows: