Friday, July 15, 2016

Trip Report: Data Science Summit 2016 @ San Francisco


Earlier this week, I was at the Data Science Summit 2016 in San Francisco. This post is my trip report. The event was organized by Turi, better known as the people behind GraphLab Create. This O'Reilly article provides a quick backstory about the evolution of the company from a group of students and professors at Carnegie Mellon University (CMU).

While the talks spanned a wide variety of subjects, there were a few unifying themes across the conference. They were, in no particular order, Distributed Systems, General Machine Learning (ML), Deep Learning (DL), Natural Language Processing (NLP), Recommendations, Unsupervised Learning, Online Learning, Visualization, Explainability and Graph Theory. I initially thought of classifying the talks along these lines, but then realized that a talk can span multiple categories. So I am going to cover them chronologically, and tag them with the themes that I think they belong to.

Day 1


Keynote 1 - by Pradeep Dubey, Intel Labs

[machine learning], [deep learning]

Pradeep Dubey explains how computing is moving from Inside-Out problems to Outside-In problems. Inside-Out problems are those which we understand analytically, such as a car moving up an incline. Outside-In problems are those for which we can observe the behavior but don't know how it works. An example given was trying to predict a person's social network from their purchasing behavior. Outside-In problems require lots of computing power, and Pradeep goes on to describe all the work being done at Intel to support such large ML/DL workloads on Intel CPUs using the MKL 2017 Beta toolkit (available now). His blog post provides some (quite impressive IMO) benchmark information.

Keynote 2 - by Carlos Guestrin, CEO of Turi and Prof of ML, University of Washington (UoW)

[distributed systems], [machine learning], [deep learning], [unsupervised], [online learning], [visualization], [explainability]

Prof Guestrin runs through the themes of the conference with GraphLab Create demos that illustrate each theme and show off the capabilities of the tool. His list (different from mine) included Data Distribution, Online Learning, Explanations, Automated Feature Engineering and Visualization. He is one of the authors of XGBoost: A Scalable Tree Boosting System, which he hinted could be used for automatic feature engineering. He also showed some very interesting demos of how GraphLab Create explains decisions from DL-based visual recognition systems.

DeepDive: A Dark Data System - by Chris Re

[machine learning], [unsupervised]

Chris Re describes Lattice.IO, a commercial system that automatically extracts entities from text. Initial training is by a process called Data Programming, where the trainer specifies functions that allow the system to detect patterns. The system is then fed a large body of text, and it uses the patterns and the text to teach itself about other related entities and extracts them, giving each extraction a probability of being correct. Humans can then tell it whether it is right or wrong, and the system updates itself. Lattice.IO is based on the DeepDive Project from Stanford University. The HazyResearch/snorkel project has a number of IPython notebooks with examples of data programming.
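
To make the idea concrete, here is a tiny Python sketch of data programming with made-up labeling functions. This is just an illustration of the concept, not the actual Snorkel API - Snorkel learns a weight for each function from the data rather than averaging votes as I do here.

# Toy data programming sketch: labeling functions vote on a candidate
# sentence, and the votes are combined into a probabilistic label.
import re

def lf_causes(sentence):
    # weak signal that a drug-disease relation is being asserted
    return 1 if re.search(r"\bcauses?\b", sentence, re.I) else 0

def lf_no_evidence(sentence):
    # weak signal that the relation is being denied
    return -1 if re.search(r"\bno evidence\b", sentence, re.I) else 0

labeling_functions = [lf_causes, lf_no_evidence]

def probabilistic_label(sentence):
    votes = [lf(sentence) for lf in labeling_functions if lf(sentence) != 0]
    if not votes:
        return 0.5  # all functions abstained, no information
    # map the average vote in [-1, 1] to a probability in [0, 1]
    return (sum(votes) / float(len(votes)) + 1) / 2

print(probabilistic_label("Smoking causes cancer."))        # 1.0
print(probabilistic_label("No evidence that X causes Y."))  # 0.5 (conflict)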

Petuum+ for Big ML: what next after the Parameter Server - by Prof Eric Xing, CMU

[distributed systems], [machine learning]

Prof Xing describes an alternative to the centralized Parameter Server usually found in distributed ML systems. One alternative is to make it more of a peer-to-peer (P2P) system, but the communication overhead between peers becomes quite high. The idea, then, is to factorize the update matrix into its Sufficient Factors, broadcast those to all workers, and reconstruct the update matrix at each worker. Because the Sufficient Factors are much smaller than the actual matrix, communication costs go down and such a P2P system becomes feasible. More information in this paper.
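
The trick is easiest to see for models whose stochastic gradient update is a rank-1 matrix. Here is a small numpy sketch (my own toy example, not Prof Xing's code) showing why shipping the factors is so much cheaper than shipping the update matrix:

# Sufficient factor sketch: for a linear model with squared loss, the SGD
# update for one example is the outer product of two vectors, so peers can
# exchange the vectors (m + n floats) instead of the matrix (m * n floats).
import numpy as np

m, n = 1000, 500                 # output dim x input dim
W = np.zeros((m, n))
x = np.random.randn(n)           # one training example
y = np.random.randn(m)

err = W.dot(x) - y               # (m,) error vector
full_grad = np.outer(err, x)     # (m, n) full update matrix

# A worker receiving only (err, x) can rebuild exactly the same update.
rebuilt = np.outer(err, x)
assert np.allclose(full_grad, rebuilt)
W -= 0.01 * rebuilt

print("floats per update: full=%d, factors=%d" % (m * n, m + n))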

CoCoA: A communication-efficient primal-dual framework for distributed optimization, by Prof Mike Jordan, University of California, Berkeley (UCB)

[distributed systems], [machine learning]

Yet another approach to reducing communication overhead between nodes in distributed systems. Prof Mike Jordan advocates CoCoA, which uses local computation to reduce the communication overhead. As a result, experiments using CoCoA converged to the same solution 25x faster than comparable experiments without it. More information in this paper.

Evolution of the SFrame: Scalable Data Structure for ML - by Sethu Raman, Turi

[distributed systems], [machine learning], [graph theory]

This was something I have been curious about ever since I started using GraphLab Create in the first course of the Coursera ML Specialization (I got a 1-year student license). SFrames allow you to treat a dataset on disk as if it were in memory, so you can work with data the size of your disk rather than the size of your RAM. It is open source and comes with a Python API. Unfortunately, about the only ML toolkit that uses it is GraphLab Create, so if you want to use it with Scikit-Learn, you would have to figure out how to do the vector arithmetic yourself. The graph abstraction over SFrame is SGraph, which is just a pair of SFrames (one for nodes and one for edges).
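
For readers who haven't seen it, here is roughly what working with SFrame and SGraph looks like - this is from my memory of GraphLab Create 1.x, so treat the exact calls as approximate, and the file name is hypothetical:

# SFrame is disk-backed, so this works even if ratings.csv exceeds RAM.
import graphlab as gl

ratings = gl.SFrame.read_csv("ratings.csv")
print(ratings["rating"].mean())   # columnar operations run out-of-core

# SGraph is essentially a pair of SFrames: one for vertices, one for edges.
edges = gl.SFrame({"src": [1, 2], "dst": [2, 3], "weight": [1.0, 0.5]})
g = gl.SGraph().add_edges(edges, src_field="src", dst_field="dst")
print(g.summary())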

Tools for exploring, explaining and evaluating your recommender system - by Dr Chris DuBois, Turi

[recommendations], [explainability]

This was a demo of various features built into GraphLab Create to explain the behavior of a recommender system.

Advancing the Python Data Stack with Apache Arrow - by Wes McKinney, Cloudera

[distributed systems], [machine learning]

This talk was about interoperability between different systems, with Apache Arrow as the data middleware. Apache Arrow defines a common intermediate format, plus converters into and out of different systems' formats. Wes McKinney (creator of Pandas) has collaborated with Hadley Wickham (creator of many R packages) to seamlessly transfer data back and forth between Pandas dataframes and R dataframes. For users of Spark and standalone (non-Spark) Python, Apache Arrow also promises to one day allow reading Parquet files from standalone Python programs.

Using Graphs for improving recommendations, Amit Bhattacharya, Teachers Pay Teachers.

[machine learning], [recommendations], [unsupervised], [graph theory]

This is a very interesting application of graph theory to building a recommendation system. The user population consists of teachers who purchase books from the site, and the recommender's job is to suggest new books to purchase. A graph is built from the available user and purchase data - teachers who purchase the same books are linked by an edge with weight proportional to the number of books they have in common. A few central users in this graph are labelled manually (Elementary School, High School Math, etc), and Label Propagation (functionality built into GraphLab Create) is used to assign each of the other teachers to the most probable cluster. Finally, a user is recommended books that other teachers in that cluster are purchasing.
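
Label propagation itself is simple enough to sketch in a few lines of numpy. GraphLab Create ships its own implementation; this toy version with made-up numbers just illustrates the idea:

# Teachers are nodes, edge weights count shared purchases, and two seed
# nodes carry manual labels; everyone else inherits labels from neighbors.
import numpy as np

W = np.array([[0, 3, 0, 0],     # teacher 0 shares 3 books with teacher 1
              [3, 0, 1, 0],
              [0, 1, 0, 2],
              [0, 0, 2, 0]], dtype=float)
Y = np.zeros((4, 2))
Y[0] = [1, 0]                   # seed: teacher 0 -> "Elementary School"
Y[3] = [0, 1]                   # seed: teacher 3 -> "High School Math"
seeds = [0, 3]

P = W / W.sum(axis=1, keepdims=True)   # row-normalized transition matrix
F = Y.copy()
for _ in range(50):
    F = P.dot(F)                # each node averages its neighbors' labels
    F[seeds] = Y[seeds]         # clamp the manually labelled nodes

print(F.argmax(axis=1))         # most probable cluster for every teacher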

Lessons learned from 2MM machine learning models - Dr Anthony Goldbloom, Kaggle

[machine learning]

Nice overview of general strategies used by Kaggle contestants. Popular algorithms include XGBoost, Random Forests and Deep Learning (CNNs) for image competitions, with RNN/LSTM used to a lesser extent. He also gave a quick glimpse into Kaggle's plans to provide a more rounded metric of a contestant's data science skills as a whole.

Design for X - Amanda Cesari, Concur Labs

[distributed systems], [machine learning]

Amanda Cesari provides an overview of Concur Labs' data science stack, which includes Apache Spark and GraphLab Create. The general approach is to use Spark to reduce large volumes of data to a medium size, then process the result with GraphLab Create. This is quite pragmatic, given that GraphLab Create can handle medium-sized data thanks to SFrames and has more algorithms to choose from than MLlib. She also covers an anomaly detection case study built on this stack.

DSSTNE - A new deep learning framework for large sparse datasets - by Scott Le Grand, Teza Technologies

[deep learning], [recommendations]

Scott Le Grand describes DSSTNE (pronounced Destiny), Amazon's DL framework for handling super-sparse matrices. Amazon's catalogs are very large, and off-the-shelf DL packages could not handle the degree of sparseness they required. The network described looks conceptually like an autoencoder that takes a 1-hot encoding of an item as input and generates embeddings (similarity scores, recommendations) as output. DSSTNE currently supports only fully connected models, with plans to support CNNs and RNNs in the future; in the meantime, Scott suggests using Nervana's Neon for building CNNs and RNNs. Scott has reported that DSSTNE is 15% faster than TensorFlow; more details are in his blog post.
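
One way to picture the network is the conceptual Keras sketch below - this is my own reading of the description, not DSSTNE itself, and the catalog size is a toy number: an autoencoder-like model mapping a sparse purchase vector over the catalog back to scores over the catalog.

from keras.models import Sequential
from keras.layers import Dense, Activation

catalog_size = 10000    # hypothetical; real catalogs are vastly larger
model = Sequential([
    Dense(256, input_shape=(catalog_size,)),  # compress to an embedding
    Activation("relu"),
    Dense(catalog_size),                      # score every catalog item
    Activation("sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam")
# model.fit(purchases, purchases, ...)  # learn to reconstruct purchases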

Developing customer insights at Microsoft Visual Studio - Sai Tulasi Neppali, Microsoft

[recommendations], [unsupervised]

Sai Tulasi Neppali describes how her group uses telemetry data collected from Visual Studio users to categorize them into three classes based on usage. The categorization helps drive email campaigns designed to retain these users in different ways.

Deep Learning and Machine Learning: A view from the trenches - Supratim Banerjee, India Equity Partners

[machine learning], [deep learning], [recommendations], [unsupervised]

The use case here is to make sure trucks in the company's fleet are optimally loaded to maximize revenue per kilometer. Photos of loads of various sizes were taken and image vectors extracted from them (most likely using GraphLab Create's built-in functionality with a CIFAR-10 CNN model). A k-nearest-neighbors job was run on the vectors with k=10, and a graph created with each node connected to its 10 neighbors. PageRank was run on the graph to find the top N important nodes, and these nodes were manually classified as full or empty. A new picture is converted to an image vector, cosine similarity is computed against these N vectors, and the label of the closest vector is assigned to the new image.
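
Here is a rough numpy/networkx sketch of that pipeline as I understood it (random vectors stand in for the real image vectors, so this is illustrative only):

import numpy as np
import networkx as nx

vecs = np.random.randn(100, 4096)   # stand-ins for extracted image vectors
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
sims = vecs.dot(vecs.T)             # cosine similarity matrix

# build the k-nearest-neighbor graph with k=10
G = nx.Graph()
for i in range(len(vecs)):
    for j in np.argsort(-sims[i])[1:11]:    # skip self at position 0
        G.add_edge(i, int(j), weight=float(sims[i, j]))

# PageRank picks the top N nodes, which are then labelled by hand
ranks = nx.pagerank(G, weight="weight")
top_n = sorted(ranks, key=ranks.get, reverse=True)[:5]
labels = dict((n, "full") for n in top_n)   # pretend these are manual labels

# classify a new image by cosine similarity to the labelled nodes
new_vec = np.random.randn(4096)
new_vec /= np.linalg.norm(new_vec)
nearest = max(top_n, key=lambda n: float(vecs[n].dot(new_vec)))
print("predicted label:", labels[nearest])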

The Data Science behind Bot blocking - William Cox, Distil Networks

[machine learning], [recommendations]

A very entertaining and riveting talk about the constant one-upmanship between web bots and the humans defending sites against them. I liked the idea of fingerprinting users based on their activity, which makes it easy to detect anomalous behavior against baselines with similar fingerprints.

Day 2


Natural Language Understanding Pipelines: from keywords and grammar to inference and prediction - by Dr David Talby, Atigeo

[machine learning], [deep learning], [natural language processing]

Dr Talby describes Atigeo's Natural Language Understanding (NLU) pipeline. Their input corpus is the MIMIC dataset, and their technology stack includes Apache Spark, Elasticsearch and UIMA. They started with simple dictionary-based attributes and off-the-shelf NER, but have since created drug-disease knowledge graphs using word2vec embeddings trained on the critical care notes (reduced to bags of UMLS concepts). He provides notebooks at Atigeo/nlp_demo that show some aspects of what they are doing and how.
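
The word2vec-over-concepts step might look something like this with gensim - this is my own hedged sketch, not Atigeo's code, and the concept IDs are purely illustrative:

# Treat each critical care note as a "sentence" of UMLS concept IDs and
# train embeddings over those sequences instead of raw words.
from gensim.models import word2vec

notes_as_concepts = [
    ["C0011849", "C0020538", "C0004057"],   # one note, as concept IDs
    ["C0020538", "C0004057"],               # another note
]
model = word2vec.Word2Vec(notes_as_concepts, size=100, min_count=1)
print(model.most_similar("C0020538"))       # nearby concepts in the corpus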

The power of geospatial graph visualization - by Corey Lanum, Cambridge Intelligence

[visualization]

Nice presentation showing how geospatial graphs (i.e., overlaying graph data on top of maps) can increase understanding of the data.

Exploratory Data Analysis 2.0 - Jock MacKinlay, Tableau

[visualization]

Very nice and detailed demo of Tableau features with 2 example datasets. Demonstrates the power and flexibility of the Tableau tool to develop insights from data.

Staying shallow and lean in a deep learning world - Dr Xavier Amatriain, Quora

[machine learning]

Dr Xavier Amatriain talks about the pitfalls of using DL indiscriminately. He mentions several other algorithms that are as deserving of data scientists' attention but are not being considered because of DL's popularity, such as Factor Methods, Non-parametric Bayesian Models, Online Learning, Reinforcement Learning and Learning to Rank. He also talks about how DL models are hard to explain, and mentions the "Why Should I Trust You?" paper, which lays out a technique for explanation that should be adopted by all ML models.

Matrix Factorization at scale: a comparison of scientific data analytics on Spark and MPI using three case studies - Prof Michael Mahoney, UCB

[distributed systems]

Prof Mahoney describes three matrix factorization techniques (NMF, PCA and CX) implemented both on Spark and with MPI on Cray hardware, and shows how using MPI locally can result in speedups. More details in his paper.

Deep Personalization - by Prof Alex Smola, CMU

[machine learning], [deep learning], [recommendations]

Prof Alex Smola talks about how to capture implicit recommendations that vary with time. Most recommendation systems do not consider how user preferences change over time; he uses survival analysis to model this change. User and time embeddings are fed into an LSTM to produce time-varying recommendations.

How to analyze 500,000h/day of human to human conversation with bleeding edge Deep Learning models - by Yishay Carmiel, Spoken Labs

[distributed systems], [machine learning], [deep learning]

Yishay Carmiel describes what he learned when faced with processing large amounts of conversation data. He describes techniques to reduce processing times in DNNs for audio processing, including frame subsampling, WFST beam search, Deep Autoencoders to reduce the number of features, and binarizing the weights and inputs. His changes resulted in a 35x boost in performance, and he was ultimately able to process the volume within the time and expense budgeted.

The exploit-explore dilemma of music recommendation - by Dr Oscar Celma, Pandora

[recommendations], [graph theory]

Dr Oscar Celma talks about how to balance exploitation (playing songs the user is known to like) against exploration (playing songs the user might like based on past preferences). The decision to switch a given user from an exploit song to an explore song is made using Markov chains over a graph of songs and user preferences. For any major changes to the algorithm, A/B testing is done on a small control group, and the change is rolled out to the general user base only if retention and activity metrics indicate that it was well received.
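
A toy version of the exploit/explore blend as a Markov chain might look like this (all numbers made up; Pandora's actual model is certainly far richer):

import numpy as np

songs = ["liked_A", "liked_B", "new_C", "new_D"]
exploit = np.array([[0.0, 1.0, 0.0, 0.0],   # bounce between known likes
                    [1.0, 0.0, 0.0, 0.0],
                    [0.5, 0.5, 0.0, 0.0],
                    [0.5, 0.5, 0.0, 0.0]])
explore = np.array([[0.0, 0.0, 0.6, 0.4],   # drift toward unheard songs
                    [0.0, 0.0, 0.4, 0.6],
                    [0.0, 0.5, 0.0, 0.5],
                    [0.5, 0.0, 0.5, 0.0]])
alpha = 0.8                                  # exploit weight for this user
P = alpha * exploit + (1 - alpha) * explore  # rows still sum to 1

state, rng = 0, np.random.RandomState(42)
for _ in range(5):
    state = rng.choice(len(songs), p=P[state])
    print(songs[state])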

Understanding cortical principles and building intelligent machines - by Subutai Ahmad, Numenta

[machine learning], [deep learning], [unsupervised], [online learning]

Subutai Ahmad describes Hierarchical Temporal Memory (HTM), a general model of the neocortex. He describes his system as neuroscience applied to streaming analytics, learning the way (he argues) the human brain does. Details of HTM are in the numenta/nupic project. He also describes NAB, a streaming anomaly detection benchmark that detects anomalies in real time with a small amount of initial training and learns adaptively thereafter. The NAB project is available at numenta/NAB.

Product Reviews and NLP analysis and Elasticsearch - Dr Lynn Cherny, Ghostweather R&D

[machine learning], [natural language processing]

Dr Lynn Cherny delivers a nice tutorial on using NLP against the Yelp product review dataset. Much of it is about analyzing the data and loading it into Pandas dataframes, then indexing it into Elasticsearch and querying it from there. She has supporting notebooks at arnicas/nlp_elasticsearch_reviews.

Scalable Learning and Recognition - Prof Ali Farhadi, University of Washington (UoW)/Allen AI

[deep learning], [online learning]

Prof Farhadi describes a number of systems that he and his students have built that attempt to learn visually. The first is Learn EVerything About ANything (LEVAN), which learns by crawling the web for images and looking at image metadata. The second is Visual Knowledge Extraction (VisKE), which learns interactions between entities and is able to answer questions about them; it uses factor graphs to model relationships between concepts and find the most probable explanation. The third is You Only Look Once (YOLO), which focuses on very fast recognition of objects in photographs and is described more fully in this paper. For YOLO, he uses XNOR-Nets, CNNs with binary weights, which resulted in a 30x boost in performance without loss of accuracy; they are described in this paper.

Towards Transparent AI systems: Do Humans and deep networks look at the same regions while answering visual questions? - by Prof Dhruv Batra, Virginia Tech

[deep learning], [explainability]

Prof Batra discusses how to verify that a deep learning vision system is doing what it appears to be doing. This research is an offshoot of the Visual QA (VQA) project. It compares the attention maps of VQA models against human attention, using visualizations (visual occlusion and partial decomposition) and rank-order correlation methods. The work is described in greater detail in this paper.

And here are the talks I wanted to attend but could not, because I was at a parallel session I thought was more interesting. I hope to watch these once Turi publishes the videos; if they are made public (which I hope they will be), I will post the link here.

  • Making data accessible with SQL on everything - Tomer Shiran, Apache Drill
  • Machine Learning for Analyzing complex time series - Prof Emily Fox, UoW
  • MOOCS/Turn 4: what have we learned? - Prof Daphne Koller, Coursera
  • Why did you recommend that? - Delip Rao, Joostware
  • AUC at what cost? - Alex Korbonits, Remitly
  • Next generation image processing - Dr Lukasz Kidzrinski, Deepart
  • Large scale Deep Learning with Tensorflow - Jeff Dean, Google
  • Engineering Open Machine Learning Software - Andreas Mueller, NYU
  • Machine Learning in Production - Dr Yucheng Low, Turi
  • Churn Prediction, Aggregate Features and Visualizations - Dr Srikrishna Sridhar, Turi
  • Active Learning and Human in the loop - Lukasz Biewald, Crowdflower
  • Personalizing image search with feature vectors - Rodrigo Nunes, The Real Self

Overall, I thought it was quite a nice conference. Because it is organized by a for-profit company, there were quite a few talks from employees and interesting user stories from satisfied customers and partners. However, because of the company's roots and connections in academia, there were quite a few talks from highly acclaimed researchers as well. I thought the mix between academic and business-focused talks was as good as it could be. Looking forward to a few months of digging through all the GitHub repositories and paper links I collected at this conference.


Monday, July 11, 2016

NLP (almost) From Scratch - Implementing the POS Network


Recently I was tasked with reading the Natural Language Processing (almost) from Scratch paper by Collobert, et al, and sharing my findings with other members of the San Francisco Deep Learning Enthusiasts meetup group. Unlike other meetups, where you go to listen to experts, this one is more like a study group. Our primary activity is watching and discussing deep learning videos and sharing information about Deep Learning (DL). We (usually) meet every Thursday - if you are interested in bootstrapping your DL skills and are in San Francisco, you should join us.

But in any case, getting back to the paper. Even though it was written in 2011, it is still interesting as one of the first major applications of DL techniques to Natural Language Processing (NLP). Proof that it's interesting to the population at large (as opposed to just us in our meetup) comes from the fact that it was featured on The Morning Paper recently (July 4th 2016).

The field of NLP has certain core tasks upon which other higher-level, more "magical" applications are built. These tasks are part-of-speech (POS) tagging, phrase chunking, named entity recognition (NER) and semantic role labeling (SRL). Each of these tasks has standard datasets and benchmark results that NLP researchers are constantly trying to beat. What sets this paper apart is that the authors avoided any task-specific feature engineering (i.e., creating hand-crafted linguistic features), and yet achieved results that beat the then-current benchmarks in 3 of the 4 tasks.

They did this by using a simple Multi-Layer Perceptron (MLP) network for the first 3 tasks and a Convolutional Neural Network (CNN) for the last one. Input to the MLP was a context window of 5 words (the word ± 2 neighbors). Each word was converted into an embedded vector representation, and these were concatenated to form a context vector. The label was the attribute of the center word (POS or IOB tag). For the SRL task, a word-based context was not sufficient, so they used the entire sentence as input, again converting each word to an embedded vector representation and predicting the word positions of a single verb-predicate pair per sentence.

While reading the paper, I thought it might be interesting to implement one of the models using tools I am familiar with. I chose the simplest model, the one that predicts POS tags. My input corpus is the text of Alice in Wonderland from Project Gutenberg, and I created my own training set by POS tagging it with spaCy. Instead of generating my own embeddings, I used gensim to load word2vec embeddings from the pre-trained Google News model. Finally, I used Keras to build the MLP that consumes the context vectors and generates the POS predictions. The rest of the post describes this work.

POS Tagging with spaCy


I manually removed the header and footer from the text of Alice in Wonderland, leaving just the story text starting at "CHAPTER I" and ending with "happy summer days.". I was originally just going to use NLTK to generate the POS tags, but I had heard good things about spaCy, so I decided to check it out by using it instead. Here is the code that does that.

# Source: src/spacy_postagging.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
from spacy.en import English
import operator
import os

DATA_DIR = "../data"

nlp = English()

falice = open(os.path.join(DATA_DIR, "alice_in_wonderland.txt"), "rb")
content = falice.read()
falice.close()

doc = nlp(content.decode("utf-8"))
sents = []
for sent in doc.sents:
    sents.append(sent)

i = 0
vocab = {}
tags = {}
tagged_sents = []
fout = open(os.path.join(DATA_DIR, "alice_sents_postagged.txt"), "wb")
for sent in sents:
    i += 1
    print("Processing sentence# %d: %s" % (i, sent))
    token_tags = []
    toks = nlp(sent.text.encode("ascii", "ignore").decode("utf-8"))
    for tok in toks:
        token_tags.append((tok.text, tok.pos_))
    clean_token_tags = []
    for tt in token_tags:
        if tt[1] == u"SPACE":
            continue
        clean_token_tags.append(tt)
        vocab[tt[0]] = vocab.get(tt[0], 0) + 1
        tags[tt[1]] = tags.get(tt[1], 0) + 1
    tagged_sents.append(clean_token_tags)
    fout.write("%s\n" % (" ".join(["/".join([tt[0], tt[1]]) 
        for tt in clean_token_tags])))
fout.close()

# replace words which occur 1 or 2 times with UNK in vocab
for word in vocab.keys():
    if vocab[word] < 3:
        vocab["UNK"] = vocab.get("UNK", 0) + 1
        vocab.pop(word, None)
        
# create a lookup dictionary for words
fwords = open(os.path.join(DATA_DIR, "alice_words.txt"), "wb")
vocab_s = sorted(vocab.iteritems(), key=operator.itemgetter(1), reverse=True)
for i, (k, v) in enumerate(vocab_s):
    fwords.write("%d\t%s\t%d\n" % (i, k, v))
fwords.close()

# create a lookup dictionary for POS tags
ftags = open(os.path.join(DATA_DIR, "alice_postags.txt"), "wb")
tags_s = sorted(tags.iteritems(), key=operator.itemgetter(1), reverse=True)
for i, (k, v) in enumerate(tags_s):
    ftags.write("%d\t%s\t%d\n" % (i, k, v))
ftags.close()

# construct 5-grams from sentences
fgrams = open(os.path.join(DATA_DIR, "alice_5grams.txt"), "wb")
for tagged_sent in tagged_sents:
    # lowercase the words
    tagged_sent = [(x[0].lower(), x[1]) for x in tagged_sent]
    # replace with UNK for specific words
    tagged_sent = [(x[0] if vocab.has_key(x[0]) else "UNK", x[1]) 
                         for x in tagged_sent]
    # put pre- and post- padding
    tagged_sent.insert(0, ("PAD", "PAD"))
    tagged_sent.insert(0, ("PAD", "PAD"))
    tagged_sent.append(("PAD", "PAD"))
    tagged_sent.append(("PAD", "PAD"))
    for i in range(len(tagged_sent) - 4):
        sent_gram = tagged_sent[i:i+5]
        # label of middle word, and input words is 5-gram around word
        fgrams.write("%s\t%s\n" % (sent_gram[2][1], 
                                   " ".join([x[0] for x in sent_gram])))
fgrams.close()

The code uses spaCy to generate POS tags inline with the words. For example, this sentence:

She took down a jar from one of the shelves as she passed; it was labelled 
'ORANGE MARMALADE', but to her great disappointment it was empty: she did 
not like to drop the jar for fear of killing somebody, so managed to put it 
into one of the cupboards as she fell past it.

is converted to this format:

She/PRON took/VERB down/PART a/DET jar/NOUN from/ADP one/NUM of/ADP the/DET 
shelves/NOUN as/ADP she/PRON passed/VERB ;/PUNCT it/PRON was/VERB 
labelled/VERB '/PUNCT ORANGE/ADV MARMALADE/PROPN '/PUNCT ,/PUNCT but/CONJ 
to/ADP her/ADJ great/ADJ disappointment/NOUN it/PRON was/VERB empty/ADJ 
:/PUNCT she/PRON did/VERB not/ADV like/VERB to/PART drop/VERB the/DET 
jar/NOUN for/ADP fear/NOUN of/ADP killing/VERB somebody/NOUN ,/PUNCT 
so/ADV managed/VERB to/PART put/VERB it/PRON into/ADP one/NUM of/ADP 
the/DET cupboards/NOUN as/ADP she/PRON fell/VERB past/ADP it/PRON ./PUNCT

The code then uses the intermediate format above to generate word and POS tag frequency tables, which it writes out to files. The word frequencies are used to replace any word that occurs 2 or fewer times in the text with the token UNK (unknown); the POS tag frequencies I will use later. I then run through each sentence, generating 5-grams from the words in the sentence. This creates records like this:

PRON    PAD PAD she took down
VERB    PAD she took down a
PART    she took down a jar
DET     took down a jar from
NOUN    down a jar from one
ADP     a jar from one of
NUM     jar from one of the
ADP     from one of the shelves
DET     one of the shelves as
NOUN    of the shelves as she
ADP     the shelves as she passed
PRON    shelves as she passed ;
...

Looking up word2vec Vectors with gensim


I now use the 5-grams and the associated POS tag for the middle word, and compute word2vec embeddings for each of the words. The word2vec team has released a prebuilt embedding model, trained on about 100B words, that returns a 300-dimensional embedding for a given word; it is available here along with some other models. Gensim provides a nice API to read this model and extract word2vec vectors from it for words in your corpus. In the code below, I use this API to convert the words of each 5-gram into a (1, 1500) vector and each label into a (1, 15) one-hot vector, and write them out to files for the next stage.

# Source: gensim_word2vec.py
# -*- coding: utf-8 -*-
from gensim.models import word2vec
import numpy as np

print("Loading label lookup...")
label_lookup = {}
f_postags = open("../data/alice_postags.txt", "rb")
for line in f_postags:
    lid, ltext, _ = line.strip().split("\t")
    label_lookup[ltext] = int(lid)
f_postags.close()

print("Loading word2vec model...")
w2v = word2vec.Word2Vec.load_word2vec_format(
    "../data/GoogleNews-vectors-negative300.bin.gz", binary=True)
vec_size = 300
vec_pad = np.zeros(vec_size)
vec_unk = np.ones(vec_size)
ngram_size = 5

print("Writing vectors...")
f_data = open("../data/alice_5grams.txt", "rb")
f_X = open("../data/alice_X.csv", "wb")
f_y = open("../data/alice_y.csv", "wb")
nbr_read = 0
for line in f_data:
    nbr_read += 1
    if nbr_read % 1000 == 0:
        print("    Wrote %d vectors..." % (nbr_read))
    label, ngram = line.strip().split("\t")
    lid = label_lookup[label]
    word_vecs = np.zeros((ngram_size, vec_size))
    for i, word in enumerate(ngram.split(" ")):
        if word == "PAD":
            word_vecs[i] = vec_pad
        elif word == "UNK":
            word_vecs[i] = vec_unk
        else:
            try:
                word_vecs[i] = w2v[word]
            except KeyError:
                word_vecs[i] = vec_unk
    ngram_vec = np.reshape(word_vecs, (ngram_size * vec_size))
    f_X.write("%s\n" % (",".join(["%.5f" % (x) for x in ngram_vec.tolist()])))
    label_vec = np.zeros(len(label_lookup))
    label_vec[lid] = 1
    f_y.write("%s\n" % (",".join(["%d" % (x) for x in label_vec.tolist()])))
print("Wrote %d vectors" % (nbr_read))    
f_X.close()
f_y.close()

In order to include words at the sentence edges in the 5-grams, I used PAD tokens, which obviously don't have an associated word2vec vector; for these, I assigned a vector of all zeros. Similarly, for the UNK words I assign a vector of all ones. I also assign to UNK any word that I am unable to find in the word2vec model - given the size of the word2vec model's training set, such words should be few and by definition rare, so they are similar to the original UNK words (2 or fewer occurrences in the corpus). Output of this stage is two files: X.csv, where each line is a comma-separated list of 1500 numbers representing the input vector for one record, and y.csv, which contains the 1-hot encoding of the POS tag label for the center word. There are 34,459 rows of input.

Training DL Model with Keras


Finally, I train an MLP network. The original paper specifies 300 hidden units, but their input vector shape (50,) was also much smaller than my (1500,), so I use larger hidden layers (768 and 512 units, as in the code below). They also used hard tanh as their non-linearity, whereas I used ReLU. I trained the model on 70% of the training data and validated against the remaining 30%. Running the training for 50 epochs produced a model with a best validation loss of 0.68 and a training loss of 0.96. (If you are curious, as I was, why validation loss is less than training loss, a good explanation can be found in the Keras FAQ.) Here is the code for training the model.

# Source: src/keras_postagging.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
from keras.callbacks import ModelCheckpoint
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import Adamax
from keras.regularizers import l2
import matplotlib.pyplot as plt
import numpy as np
import os

DATA_DIR = "../data"

np.random.seed(42)

# read data
X = np.loadtxt(os.path.join(DATA_DIR, "alice_X.csv"), delimiter=",")
y = np.loadtxt(os.path.join(DATA_DIR, "alice_y.csv"), delimiter=",")

# set up model
model = Sequential([
    # first hidden layer (takes the 1500-dim context vector as input)
    Dense(768, input_shape=(1500,), W_regularizer=l2(0.001)),
    Activation("relu"),
    Dropout(0.2),
    # second hidden layer
    Dense(512, W_regularizer=l2(0.001)),
    Activation("relu"),
    Dropout(0.2),
    # output layer
    Dense(15),
    Activation("softmax")
])

adamax = Adamax(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
model.compile(loss="categorical_crossentropy", optimizer=adamax)

# save model structure
model_struct = model.to_json()
fmod_struct = open(os.path.join(DATA_DIR, "alice_pos_model.json"), "wb")
fmod_struct.write(model_struct)
fmod_struct.close()

# train model
checkpoint = ModelCheckpoint(os.path.join(DATA_DIR, "checkpoints",
    "alice_pos_weights.{epoch:02d}-{val_loss:.2f}.hdf5"), 
    monitor="val_loss", save_best_only=True, mode="min")
hist = model.fit(X, y, batch_size=128, nb_epoch=50, shuffle=True,
                 validation_split=0.3, callbacks=[checkpoint])

# plot losses
train_loss = hist.history["loss"]
val_loss = hist.history["val_loss"]
plt.plot(range(len(train_loss)), train_loss, color="red", label="Train Loss")
plt.plot(range(len(train_loss)), val_loss, color="blue", label="Val Loss")          
plt.xlabel("epochs")
plt.ylabel("loss")
plt.legend(loc="best")
plt.show()

I chose the Adamax optimizer because it gave the best results. I also played around a bit with the other hyperparameters, such as different non-linearities, hidden layer sizes, number of layers, etc. The chart below shows how the training and validation losses change over time. I also use the ModelCheckpoint callback to capture the weights for the model with the lowest validation loss, and save the model structure into a JSON file.


Making predictions


I now take the ngrams corresponding to our example sentence, along with their corresponding vectors. I then load the model structure and weights and (this is very important) recompile the model. I can then call model.predict to get back a (15,) vector, from which I extract the highest-scoring POS tag. The code for the prediction is shown below:

# Source: pos_predict.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
from keras.models import model_from_json
from keras.optimizers import Adamax
import numpy as np
import os

DATA_DIR = "../data"

# deserialize model
fmods = open(os.path.join(DATA_DIR, "alice_pos_model.json"), "rb")
model_json = fmods.read()
fmods.close()
model = model_from_json(model_json)
model.load_weights(os.path.join(DATA_DIR, "alice_pos_weights.30-0.68.hdf5"))
adamax = Adamax(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
model.compile(loss="categorical_crossentropy", optimizer=adamax)

# label lookup
label_dict = {}
fpos = open(os.path.join(DATA_DIR, "alice_postags.txt"), "rb")
for line in fpos:
    lid, ltxt, _ = line.strip().split("\t")
    label_dict[int(lid)] = ltxt
fpos.close()

# read ngrams into array
ngram_labels = []
fngrams = open(os.path.join(DATA_DIR, "alice_5grams_pred.txt"), "rb")
for line in fngrams:
    label, ngram = line.strip().split("\t")
    ngram_labels.append((ngram, label))
fngrams.close()

# read word+context vectors and predict from model
fpred = open(os.path.join(DATA_DIR, "alice_test_pred.txt"), "wb")
fvec = open(os.path.join(DATA_DIR, "alice_X_pred.csv"), "rb")
lno = 0
for line in fvec:
    X = np.array([float(x) for x in line.strip().split(",")]).reshape(1, 1500)
    y_ = np.argmax(model.predict(X))
    nl = ngram_labels[lno]
    fpred.write("%s\t%s\t%s\n" % (nl[0], nl[1], label_dict[y_]))
    lno += 1
fvec.close()    
fpred.close()

A partial output for my test sentence is shown below. The first column is the word ngram, the second is the true label (as computed by spaCy) and the third is the predicted label (as computed by my Keras MLP). As you can see, even given the relatively small training corpus, the results seem quite good. Of course, POS tagging is a relatively simple task, so I should probably not read too much into these results.

PAD PAD she took down    PRON   PRON
PAD she took down a      VERB   VERB
she took down a UNK      PART   ADV
took down a UNK from     DET    PUNCT
down a UNK from one      NOUN   PUNCT
a UNK from one of        ADP    ADP
UNK from one of the      NUM    NUM
from one of the UNK      ADP    ADP
one of the UNK as        DET    DET
of the UNK as she        NOUN   PROPN
the UNK as she passed    ADP    ADP
UNK as she passed ;      PRON   PRON
as she passed ; it       VERB   VERB
...

And that's all I have for this week. Doing this ended up being a lot of fun, thanks to awesome libraries like spaCy, gensim and Keras. Hope you enjoyed reading it too.

Sunday, July 03, 2016

thinkstats-examples - my answers to Think Stats exercises


Josh Wills famously described a data scientist as someone who is better at statistics than any software engineer and better at software engineering than any statistician. My background is in software engineering, so I am always looking for ways to get better at statistics. Recently I was watching some PyCon videos on YouTube, and came across Prof Allen B Downey's Bayesian Statistics Made Simple talk from PyCon 2015.

I found the approach quite unique - instead of proving theorems, he creates programs that simulate the setup using random data, then uses the results to provide an intuition about the behavior the theorem describes. The talk was about Bayesian statistics, which he covers in detail in his book Think Bayes. He also mentioned another of his books, Think Stats, which is aimed at someone who is more programmer and less statistician. Unfortunately, even with the computational approach, I didn't fully understand his talk, so I decided to fix that by working my way through the two books. This post describes the notebooks I created as a result of working through Think Stats.

The notebooks have been uploaded to this Github repository, which contains the following Jupyter (aka IPython) notebooks.


The examples in the book build up, chapter by chapter, a library of functions written in pure Python. Later functions call earlier ones, and their usage is almost like a Domain Specific Language (DSL). Since I have been using the Scientific Python stack (numpy, scipy, matplotlib, pandas, etc) for a while now, I decided to skip the DSL and use those libraries instead, as in the toy example below. Although there were times I wished I hadn't done so, I think overall it was the right choice for me, since it allows me to apply the concepts directly to my own projects without having to go through the DSL. Of course, YMMV.
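
As a trivial example of the kind of substitution I made: where the book's DSL builds a Pmf object, pandas gets there directly (toy data below):

import pandas as pd

values = pd.Series([1, 2, 2, 3, 5])
pmf = values.value_counts(normalize=True).sort_index()
print(pmf)   # the probability mass function, as a pandas Series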

One other thing this mini-project has helped me with is becoming really good at writing LaTeX in Markdown :-). I started out using the online LaTeX equation editor and copy-pasting the LaTeX into my notebook, but somewhere around Chapter 4 I developed the ability to write the equations directly into the notebook. I think writing the equations this way makes them much more readable, so acquiring this skill was a nice side effect.
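
For instance, an equation like the normal distribution PDF, which comes up throughout the book, renders nicely from a Markdown cell with just:

$$ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}}
   \exp\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right) $$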

The one caveat is that at least some of the answers are very likely to be incorrect. While I have tried to ensure that they are correct to the best of my ability, I am not an expert by any stretch of the imagination, and there were quite a few times when I found the material in the book pretty hard going. If you do find an error, please create an issue telling me why I am wrong (and preferably providing a correct answer); I will update the example and give you credit.

That's all I have for today, hope you find the examples useful. At some point in the (hopefully near) future, I plan on doing something similar for the Think Bayes book as well. For those of you in the US, have a great 4th of July!