Machine Learning with Python – A Journey

January 30, 2017

This blog series is a walk-through of implementing machine learning applications with Python. In this first post of the series, I present a text classification application using Python’s scikit-learn library.

scikit-learn (sklearn) is an open source machine learning library for Python. Its out-of-the-box support for data manipulation, multiple supervised and unsupervised learning algorithms, and visualization libraries makes it an excellent choice for implementing machine learning applications. In this post we will use these sklearn sub-modules:

  • sklearn.datasets
  • sklearn.feature_extraction
  • sklearn.naive_bayes
  • sklearn.metrics

Let’s start by clearing this up! Why Python?

I didn’t compare the performance of different scientific languages like R, Scala, or MATLAB but the following points pushed me in favor of Python:

  • Open Source: Python is Open Source. There are multiple libraries available for Python with favorable license terms reducing the cost of creating applications with it.
  • Highly versatile: Python can be used in multiple domains. It has optimized libraries for specific application scenarios – scikit-learn for machine learning, pandas for data munging, numpy for data representation, matplotlib for visualization, Django for web application integration.
  • Speed of development: Because Python is an interpreted language, I could prototype quickly, with no compilation step required. This allowed me to experiment with different implementations before picking the right one.
  • Supported by Industry: Various machine learning frameworks and tools provide Python bindings – TensorFlow by Google (https://www.tensorflow.org/), Hadoop Streaming and others.

Implementation Environment

  • I am working on a 64-bit Windows 10 operating system. To set up the Python environment (Python interpreter, sklearn packages), install Anaconda2 (requires 350MB RAM): download and install Anaconda (32-bit or 64-bit, depending on your OS) from the link. Once it is installed, add the following directories to the PATH environment variable:
    1. <Anaconda2 Installation Dir> (for example, C:\Anaconda2) and
    2. <Anaconda2 Installation Dir>\Scripts

I find Anaconda’s package management and deployment tool very convenient. I use the PyCharm IDE (thanks to JetBrains, the community edition is free). You could also use IPython, Spyder or any other suitable editor.

Basics of Text Classification

Let’s quickly go through the basics of supervised learning before we move on to the actual application. The important terms that we will use throughout this post are:

  • Classification: An algorithmic method to assign any given new element of the dataset to one of a priori provided classes (categories).
  • Training set: Set of data used to train the classification model. The training data contains the class/label for each data item.
  • Testing set: Set of data used to verify how well the model performs. Note that it is common practice, when performing a (supervised) machine learning experiment, to hold out part of the available data as a test set.
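
As a minimal sketch of the hold-out idea described above, the snippet below shuffles a small labelled data set and keeps a fraction of it aside for testing. The helper name holdout_split, the toy documents and the 25% test fraction are assumptions made purely for illustration; they are not part of the application built later in this post.

```python
import random

def holdout_split(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle (sample, label) pairs and hold out a fraction for testing."""
    pairs = list(zip(samples, labels))
    random.Random(seed).shuffle(pairs)
    n_test = int(round(len(pairs) * test_fraction))
    test, train = pairs[:n_test], pairs[n_test:]
    return train, test

# Toy data: eight short "documents" with a class label each.
docs = ["doc%d" % i for i in range(8)]
labels = [i % 2 for i in range(8)]

train, test = holdout_split(docs, labels)
print("training items: %d, testing items: %d" % (len(train), len(test)))
```

The model only ever sees the training pairs; the held-out pairs are used once, at the end, to check how well it generalizes.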

A summary of the steps involved in supervised learning would be:

Supervised Learning

Step 1: Dataset Representation – Crux of the entire model.

The dataset in most classification and clustering applications is represented as a multi-dimensional vector space of numeric or categorical data. This feature space consists of samples. Each sample can be represented as a p-dimensional feature vector holding the p attributes of that sample. Thus, the entire dataset of n samples can be viewed as an n × p matrix.

n X p dataset representation

Numeric Python

A quick digression into a package that I use throughout this post. NumPy is Python's numerical library: an extension package for multidimensional arrays, which I use for storing the data set. Machine learning Python enthusiasts swear by it. NumPy provides a vast range of functions for creating and manipulating N-D arrays. It is designed for scientific computation and is specifically optimized for large arrays.

The most common way to create an n-dimensional array with NumPy is to call the numpy.array function, which creates a numpy.ndarray object. A numpy.ndarray object represents a multidimensional, homogeneous array of fixed-size items.

 
import numpy as np
a = np.array([[ 1, 6, 11, 16],[21, 26, 31, 36]])

The array created in the above example is an instance of numpy.ndarray, where a[0] is the first row and a[1] is the second row. In memory, the eight elements are stored one after the other in a single contiguous block; with 4-byte integers (NumPy's default integer type on Windows at the time of writing), this block occupies 8 × 4 = 32 bytes. The following figure shows the logical representation of a 2-d array:

logical representation of a 2-d array

It's always handy to know the common attributes of an ndarray, as shown in the following figure:

Array attributes
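
The attributes can also be inspected directly on an array. The sketch below rebuilds the 2 × 4 array from the earlier example; the dtype is set to np.int32 explicitly (my assumption, so that the byte sizes do not depend on the platform), and the selection of attributes is simply the set I find most useful.

```python
import numpy as np

# dtype is fixed explicitly so that the sizes below do not depend on the platform.
a = np.array([[1, 6, 11, 16], [21, 26, 31, 36]], dtype=np.int32)

print(a.ndim)      # number of dimensions
print(a.shape)     # (rows, columns)
print(a.size)      # total number of elements
print(a.dtype)     # element type
print(a.itemsize)  # bytes per element
print(a.nbytes)    # total bytes = size * itemsize
print(a.strides)   # bytes to step to the next row / next column
```

With 4-byte integers, nbytes comes out as 32, which matches the memory footprint mentioned above.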

The items of a NumPy array can be accessed and assigned in the same way as other Python sequences. I’ll be discussing indexing and slicing, along with other internals of NumPy, in another post.

What data set are we using for this text classification application?

I have utilized the dataset popularly known as “Twenty Newsgroups”. The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups. Categories in this data set are:

Newsgroup Categories

To keep things simple, we will use the built-in data set loader for 20 newsgroups from sklearn using the following code:

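from sklearn.datasets import fetch_20newsgroups

#Categories to load (alphabetical, matching the order of target_names returned by the loader)
categories_name = ['alt.atheism', 'comp.os.ms-windows.misc', 'sci.space']
#Fetching the training data set
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=categories_name)
X = newsgroups_train.data #X contains the training data
y = newsgroups_train.target #y contains the categories of the data
#Fetching the testing data set
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=categories_name)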

Using a standard dataset is a good place to start. Because its expected results are well known, we can tell from the output whether our implementation is doing what it is supposed to do. Then we can implement the same model with our own data.

Step 2: Conversion of text document to a Feature Vector.

We know that classification algorithms accept continuous or categorical input. The challenge in text classification is representing a text document as a feature vector. A commonly used model in text processing is the so-called bag of words (BOW) model. The document is tokenized and each word is treated as a feature.

The various terms used in a BOW representation are explained below:

Term Frequency (TF): This is the frequency of occurrence of each word and is used as a feature for training a classifier. Word order is discarded from a document and single words are treated as features.

Suppose D1 and D2 are two documents in a training dataset, where D1 is: “Each state has its own laws.”, and D2 is: “Every country has its own culture.”. Based on these two documents, the vocabulary could be written as each:1, state:1, has:2, its:2, own:2, laws:1, every:1, country:1, culture:1
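
The vocabulary above can be reproduced with a few lines of standard-library Python. This is only a sketch: the tokenize helper and its letters-only rule are my own simplification, not the tokenizer that sklearn uses.

```python
import re
from collections import Counter

d1 = "Each state has its own laws."
d2 = "Every country has its own culture."

def tokenize(text):
    """Lower-case the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Adding the per-document counters gives the corpus-wide term counts.
vocabulary = Counter(tokenize(d1)) + Counter(tokenize(d2))
print(sorted(vocabulary.items()))
```

The words has, its and own get a count of 2 because they occur in both documents; every other word occurs once.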

In practice, the term frequency is often normalized by dividing the raw term frequency by the document length. This gets us –

TF(t) = (number of times term t appears in a document) / (total number of terms in the document).

For example, consider a document containing 100 words wherein the word day appears 3 times. Normalized term frequency (i.e. TF) for day is then (3/100) = 0.03.
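
The same calculation as a small sketch in code. The function name term_frequency and the made-up 100-word document are assumptions for illustration only.

```python
def term_frequency(term, tokens):
    """Normalized TF: occurrences of `term` divided by the document length."""
    return tokens.count(term) / float(len(tokens))

# A made-up 100-word document in which "day" occurs 3 times.
tokens = ["day"] * 3 + ["filler"] * 97
tf_day = term_frequency("day", tokens)
print(tf_day)
```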

Inverse Document Frequency (IDF): IDF measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms may appear many times yet add little value in terms of uniquely identifying a document. Imagine a dataset of 5 documents that all describe automobiles. In this dataset the word “auto” is likely to appear in all 5 texts, and therefore will not be a unique identifier of any single text.

TF-IDF representation (TF-IDF): The TF-IDF approach assumes that the importance of a word is inversely proportional to how often it occurs across all documents. For instance, suppose we have 10 million documents and the word day appears in one thousand of them. Then the inverse document frequency (i.e., IDF) is calculated as log(10,000,000 / 1,000) = 9.2, using the natural logarithm. Finally, the TF-IDF weight is the product of TF and IDF: 0.03 * 9.2 = 0.276.
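
The arithmetic can be checked with a short sketch. The variable names are mine; the numbers are the ones used in the example above.

```python
import math

n_documents = 10000000      # documents in the corpus
n_containing = 1000         # documents that contain the word "day"
tf = 0.03                   # normalized term frequency of "day" in one document

idf = math.log(n_documents / float(n_containing))  # natural logarithm
tfidf = tf * idf
print("IDF = %.2f, TF-IDF = %.3f" % (idf, tfidf))
```

Note that this is the textbook form of the formula. As far as I am aware, sklearn's TfidfVectorizer by default uses a smoothed variant of IDF and normalizes each document vector, so the values it produces will differ from this hand calculation.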

All the above becomes quite straightforward with the TfidfVectorizer class in sklearn. TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features and returns a sparse matrix containing the feature vectors in the format [n_samples, n_features].


#Import the module
from sklearn.feature_extraction.text import TfidfVectorizer
#Create a TfidfVectorizer instance
vectorizer = TfidfVectorizer(encoding='utf-8', stop_words='english')
#Convert into a tf-idf matrix
vectors = vectorizer.fit_transform(X) #vectors contains the training data that will be required by the model

TfidfVectorizer has many parameters that can be adjusted when creating an instance. The parameter stop_words can be used to remove commonly occurring words that contribute little to the vectorization process. Since the vocabulary in text classification can be huge, the parameter max_features can be used to build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
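
To see the effect of these two parameters, here is a small sketch on the two example sentences used earlier. The limit of three features is an arbitrary choice for illustration; on a real corpus the value would be much larger.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Each state has its own laws.",
        "Every country has its own culture."]

# Default settings: every token of two or more characters becomes a feature.
plain = TfidfVectorizer()
plain_vectors = plain.fit_transform(docs)
print(plain_vectors.shape)

# Drop English stop words and keep at most three features.
trimmed = TfidfVectorizer(stop_words='english', max_features=3)
trimmed_vectors = trimmed.fit_transform(docs)
print(trimmed_vectors.shape)
print(sorted(trimmed.vocabulary_))
```

The first matrix has one column per distinct word, while the second is capped at three columns. The vocabulary_ attribute maps each retained term to its column index.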

Step 3: Selecting a classification algorithm

The sklearn library comes equipped with almost all the popular supervised classification algorithms, such as Naïve Bayes and its variants, Support Vector Machines, K-Nearest Neighbors etc. We are using Multinomial Naïve Bayes as it is suitable for classification with discrete features (e.g., word counts for text classification); although the multinomial model is formulated for integer counts, fractional counts such as TF-IDF values also work in practice. Recall that the Naïve Bayes algorithm works on the maximum a posteriori (MAP) rule: it assigns a document to the class with the highest posterior probability given the document's features. The following code snippet demonstrates creating a Multinomial Naive Bayes model and training it.

#Import the model
from sklearn.naive_bayes import MultinomialNB
#Create a MultinomialNB class instance
clf = MultinomialNB(alpha=.01)
clf.fit(vectors, y) #Train the model on the training data
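
To make the MAP rule concrete, here is a minimal sketch in plain Python. The two classes, the three-word vocabulary and all the probabilities are hand-picked, hypothetical numbers; a trained MultinomialNB estimates such quantities from the training data instead. The score of a class is the log of its prior plus, for each word, the word count times the log of the word's probability in that class.

```python
import math

# Hand-picked, hypothetical numbers for two classes and a three-word vocabulary.
prior = {"space": 0.5, "windows": 0.5}
word_prob = {
    "space":   {"orbit": 0.6, "driver": 0.1, "launch": 0.3},
    "windows": {"orbit": 0.1, "driver": 0.7, "launch": 0.2},
}

def map_class(word_counts):
    """Return the class with the highest log-posterior score, plus all scores."""
    scores = {}
    for label in prior:
        score = math.log(prior[label])
        for word, count in word_counts.items():
            score += count * math.log(word_prob[label][word])
        scores[label] = score
    return max(scores, key=scores.get), scores

label, scores = map_class({"orbit": 2, "launch": 1})
print(label)
```

Working with logarithms turns the product of many small probabilities into a sum, which avoids numerical underflow on long documents.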

The parameters of sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None) are:

  • alpha: an additive smoothing parameter (0 for no smoothing). The smoothing prior alpha >= 0 accounts for features not present in the learning samples and prevents zero probabilities in further computations. Setting alpha = 1 is called Laplace smoothing, while alpha < 1 is called Lidstone smoothing.
  • fit_prior: a Boolean parameter that decides whether to learn class prior probabilities or not.
  • class_prior: an array-like parameter of size (n_classes,) that contains the prior probabilities of the classes.
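
The effect of alpha is easiest to see on a word that never occurred in a class during training. The sketch below applies the usual additive smoothing formula, (count + alpha) / (total + alpha * vocabulary size); the function name and the toy counts are assumptions for illustration.

```python
def smoothed_probability(count, total_count, vocabulary_size, alpha):
    """Additive (Laplace/Lidstone) estimate of P(word | class)."""
    return (count + alpha) / float(total_count + alpha * vocabulary_size)

# Hypothetical class: 10 word occurrences seen in training, vocabulary of 5 words,
# and a word that never occurred in this class (count = 0).
for alpha in (1.0, 0.01):
    p = smoothed_probability(0, 10, 5, alpha)
    print("alpha=%.2f -> P(unseen word | class) = %.5f" % (alpha, p))
```

A smaller alpha, such as the 0.01 used in this post, gives unseen words a much smaller (but still non-zero) probability and stays closer to the raw counts.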

Step 4: Testing the model and assessing its accuracy

The sklearn.metrics module contains functions for assessing prediction error and classification performance. The most common way to assess the model is to predict the categories for the testing data set and compare the predictions with the actual categories; the fraction of correct predictions is the accuracy of the model. The sklearn.metrics module can be used to find the accuracy of the model in the following way:

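from sklearn import metrics
#Vectorize the testing data with the vocabulary learned from the training data
vectors_test = vectorizer.transform(newsgroups_test.data)
newsgroups_target = newsgroups_test.target #actual categories of the testing data
#Predict the categories of the testing data set
pred = clf.predict(vectors_test)
#Code to find the accuracy for the model using the prediction calculated above
print "Accuracy score is ", metrics.accuracy_score(newsgroups_target, pred)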
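
Accuracy itself is simple enough to compute by hand. The sketch below is a plain-Python stand-in that should give the same value as metrics.accuracy_score; the labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))

# Made-up labels for six test documents spread over three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(accuracy(y_true, y_pred))
```

Four of the six predictions match, so the accuracy is 4/6, or roughly 67%.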

The accuracy that we observe using the above code is about 87%, which is a good accuracy to start with. The sklearn.metrics module provides two other interesting measures of model correctness that are often used in statistics: the classification report and the confusion matrix. Both are built from four counts: true positives, false positives, true negatives, and false negatives. The classification report gives the precision and recall of the model for each class/category, and the confusion matrix shows how many documents of each actual category were assigned to each predicted category.

print metrics.classification_report(newsgroups_target, pred, target_names=categories_name)

print metrics.confusion_matrix(newsgroups_target, pred)
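
To show how precision and recall relate to the confusion matrix, here is a small sketch on made-up labels for three classes. It uses metrics.confusion_matrix and then derives the per-class values by hand with NumPy; the labels and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn import metrics

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

diagonal = cm.diagonal().astype(float)   # correctly classified items per class
precision = diagonal / cm.sum(axis=0)    # column sums: items predicted as each class
recall = diagonal / cm.sum(axis=1)       # row sums: items that truly belong to each class
print(precision)
print(recall)
```

For class 1, for example, three documents were predicted as class 1 and two of them were correct, so its precision is 2/3; both true class-1 documents were found, so its recall is 1.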

Summing-Up

Let's put the entire code together.

import numpy as np
from time import time
from sklearn import metrics

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups


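#Selected categories from which to read the data, listed alphabetically because
#fetch_20newsgroups returns target_names in that order and the report below relies on it
categories_name = ['alt.atheism', 'comp.os.ms-windows.misc', 'sci.space']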

def readTrainData():
    # Reading the data from the News dataset
    # categories_name = ['talk.religion.misc','rec.motorcycles', 'soc.religion.christian','comp.os.ms-windows.misc','comp.sys.ibm.pc.hardware','comp.graphics']

    newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'),
                                          categories=categories_name)

    print "The selected news categories are ", list(newsgroups_train.target_names)

    print "The number of training documents ", newsgroups_train.target.shape

    print newsgroups_train.target[:10]

    return newsgroups_train.data, newsgroups_train.target

def readTestData():
    # Reading the data from the News dataset
    # categories_name = ['talk.religion.misc','rec.motorcycles', 'soc.religion.christian','comp.os.ms-windows.misc','comp.sys.ibm.pc.hardware','comp.graphics']

    newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),
                                          categories=categories_name)

    print "The number of testing data ", newsgroups_test.target.shape

    return newsgroups_test.data, newsgroups_test.target



def tfidf_representation(X):
    vectorizer = TfidfVectorizer(encoding='utf-8',stop_words='english')
    vectors = vectorizer.fit_transform(X)
    return (vectorizer, vectors)
  

def apply_classification(vectors, y):
    clf = MultinomialNB(alpha=.01)
    clf.fit(vectors, y)
    return clf
    
def main():

    X, y = readTrainData()
    
    #Reading Test data
    newsgroups_test, newsgroups_target = readTestData()
    
    #Create TF-IDF representation
    t0 = time()
    vectorizer, vectors = tfidf_representation(X)
    duration = time() - t0
    print("TF-IDF representation done in %fs." % duration)
    
    #Create classification model

    print "The feature space size is ", vectors.shape
    t0 = time()
    clf = apply_classification(vectors, y)
    duration = time() - t0
    print("Classifier Fit done in %fs." % duration)
    
    vectors_test = vectorizer.transform(newsgroups_test)
    pred = clf.predict(vectors_test)

    # Code to find the accuracy for the model
    print "Accuracy score is ", metrics.accuracy_score(newsgroups_target, pred)

    print metrics.classification_report(newsgroups_target, pred, target_names=categories_name)

    print metrics.confusion_matrix(newsgroups_target, pred)

if __name__ == "__main__":
    main()