Saturday, October 29, 2016

Predicting Student Alcohol Consumption with XGBoost


I've been hearing good things about the XGBoost algorithm. A colleague mentioned it to me early this year when I was describing how I used Random Forests to do some classification task. More recently, Dr Anthony Goldbloom from Kaggle spoke at the Data Science Summit and mentioned that it is one of the top 3 algorithms used in Kaggle competitions.

I've been meaning to check it out by applying it to a problem, but so far none of the problems I had fit the XGBoost use case. I even bought (and read) the excellent XGBoost with Python EBook by Dr Jason Brownlee, hoping to speed up my uptake by cutting out the exploration time. Finally, I decided to bite the bullet and try it out with some random data, just to see it work.

For a dataset, I went looking at the UCI Machine Learning Repository for something interesting that hadn't been done to death already. I came across the Student Alcohol Consumption dataset made available by Fabio Pagnotta and Hossain Mohammad Amran of the University of Camerino. The data contains various attributes about Portuguese high school students enrolled in two courses (Math and Portuguese). Pagnotta and Amran used Business Intelligence and Data Mining tools (specifically KNIME) to predict alcohol consumption among these students. They also used the model to report on the top attributes that are predictive of high alcohol consumption. You can read more in their paper Using Data Mining to Predict Secondary School Student Alcohol Consumption.

In this post, I describe how I used XGBoost to do the same thing, and how my results compared with those of the original authors. XGBoost exposes an API that is compatible with Scikit-Learn's (at least at the level I was using it), so there are no real surprises in the code. You can mix and match Scikit-Learn components with XGBoost and they all work seamlessly.
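For example, you can drop an XGBClassifier straight into a standard Scikit-Learn Pipeline and cross-validation loop. Here is a minimal sketch (not part of the project code; the data is random and purely illustrative):

# Minimal sketch (illustrative only): XGBClassifier used as a drop-in
# Scikit-Learn estimator inside a Pipeline, scored with cross-validation.
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
import numpy as np

X = np.random.rand(100, 5)             # random stand-in features
y = np.random.randint(0, 2, 100)       # random stand-in binary labels

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", XGBClassifier()),
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print("Mean CV accuracy: {:.3f}".format(scores.mean()))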

The first step is to get the data into a form that our XGBoost classifier (or any classifier, for that matter) can consume. The data is described here. As you can see, there are quite a few categorical (Mjob, Fjob, guardian, etc.) and nominal (school, sex, Medu, etc.) variables that need to be converted. Also, the data is split into two files, one for the Math class and one for the Portuguese (language) class, so they need to be merged as well.

In addition, the target variable (alcohol consumption) is provided as two 5-category variables, workday alcohol consumption (Dalc) and weekend alcohol consumption (Walc). The paper combines these into a new variable, a weighted weekly average, and then thresholds it to yield a binary high/low alcohol consumption target variable. I follow the paper's approach here as well.
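As a quick sanity check of that weighting (this mirrors what the preprocessing code below does; the numbers here are made up), consider a student with Dalc=2 and Walc=4:

# Illustrative only: the workday/weekend weighting used to derive the target.
dalc, walc = 2, 4                     # workday and weekend consumption (1-5)
alc = (5 * dalc + 2 * walc) / 7.0     # weighted weekly average = 18/7 ~ 2.57
is_drinker = 0 if alc < 3 else 1      # below the threshold of 3 -> low (0)
print("{:.3f} {:d}".format(alc, is_drinker))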

The code below preprocesses the input files, replacing the nominal and categorical variables with equivalent numeric and one-hot encoded variables. It also merges the two files into a single one, and derives the value of the new binary target variable as described above.

# Source: src/student-alcohol/preprocess.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
import operator
import os
import re

DATA_DIR = "../../data/student-alcohol"
DATA_FILES = ["student-por.csv", "student-mat.csv"]

SUBJ_DICT = { "por": 0, "mat": 1 }
SCHOOL_DICT = { "GP": 0, "MS": 1 }
SEX_DICT = { "F": 0, "M": 1 }
ADDR_DICT = { "U" : 0, "R": 1 }
FAMSIZE_DICT = { "LE3": 0, "GT3": 1 }
PSTAT_DICT = { "T": 0, "A": 1 }
JOB_DICT = { "teacher": [1, 0, 0, 0, 0], 
             "health": [0, 1, 0, 0, 0],
             "services": [0, 0, 1, 0, 0],
             "at_home": [0, 0, 0, 1, 0],
             "other": [0, 0, 0, 0, 1] }
REASON_DICT = { "home": [1, 0, 0, 0],
                "reputation": [0, 1, 0, 0],
                "course": [0, 0, 1, 0],
                "other": [0, 0, 0, 1] }
GUARDIAN_DICT = { "mother": [1, 0, 0],
                  "father": [0, 1, 0],
                  "other": [0, 0, 1] }
YORN_DICT = { "yes": 0, "no": 1 }

# return the option names of a one-hot dictionary, ordered by the
# position of the 1 in each option's encoding
def expand_options(colvalues):
    options = sorted([(k, v.index(1)) for k, v in colvalues.items()],
                      key=operator.itemgetter(1))
    return [k for k, v in options]
    
# build the output column names, expanding one-hot encoded columns and
# replacing Dalc/Walc with the derived binary "alcohol" target
def get_output_cols(colnames):
    ocolnames = []
    ocolnames.append("subject")
    for colname in colnames:
        if colname in ["Mjob", "Fjob"]:
            for option in expand_options(JOB_DICT):
                ocolnames.append(":".join([colname, option]))
        elif colname == "reason":
            for option in expand_options(REASON_DICT):
                ocolnames.append(":".join([colname, option]))
        elif colname == "guardian":
            for option in expand_options(GUARDIAN_DICT):
                ocolnames.append(":".join([colname, option]))
        elif colname in ["Dalc", "Walc"]:
            continue
        else:
            ocolnames.append(colname)
    ocolnames.append("alcohol")
    return ocolnames        
        
# convert a single input row into numeric columns and compute the
# binary alcohol consumption target
def preprocess_data(cols, colnames, subj):
    pcols = []
    alc = 0.0
    pcols.append(str(SUBJ_DICT[subj]))
    for i, col in enumerate(cols):
        if colnames[i] == "school":
            pcols.append(str(SCHOOL_DICT[col]))
        elif colnames[i] == "sex":
            pcols.append(str(SEX_DICT[col]))
        elif colnames[i] == "age":
            pcols.append(col)
        elif colnames[i] == "address":
            pcols.append(str(ADDR_DICT[col]))
        elif colnames[i] == "famsize":
            pcols.append(str(FAMSIZE_DICT[col]))
        elif colnames[i] == "Pstatus":
            pcols.append(str(PSTAT_DICT[col]))
        elif colnames[i] in ["Medu", "Fedu"]:
            pcols.append(col)
        elif colnames[i] in ["Mjob", "Fjob"]:
            for v in JOB_DICT[col]:
                pcols.append(str(v))
        elif colnames[i] == "reason":
            for v in REASON_DICT[col]:
                pcols.append(str(v))
        elif colnames[i] == "guardian":
            for v in GUARDIAN_DICT[col]:
                pcols.append(str(v))
        elif colnames[i] in ["traveltime", "studytime", "failures"]:
            pcols.append(col)
        elif colnames[i] in ["schoolsup", "famsup", "paid", 
                             "activities", "nursery", "higher",
                             "internet", "romantic"]:
            pcols.append(str(YORN_DICT[col]))
        elif colnames[i] in ["famrel", "freetime", "goout",
                             "health", "absences", "G1", "G2", "G3"]:
            pcols.append(col)
        elif colnames[i] == "Dalc":
            alc += 5 * int(col)
        elif colnames[i] == "Walc":
            alc += 2 * int(col)
    alc /= 7
    is_drinker = 0 if alc < 3 else 1
    pcols.append(str(is_drinker))
    return ";".join(pcols)

colnames = []
fout = open(os.path.join(DATA_DIR, "merged-data.csv"), "wb")
for data_file in DATA_FILES:
    subj = data_file.split(".")[0].split("-")[1]
    fdat = open(os.path.join(DATA_DIR, data_file), "rb")
    for line in fdat:
        line = line.strip()
        if line.startswith("school;"):
            if len(colnames) == 0:
                colnames = line.split(";")
            continue
        cols = [re.sub("\"", "", x) for x in line.split(";")]
        pline = preprocess_data(cols, colnames, subj)
        fout.write("{:s}\n".format(pline))
    fdat.close()
fout.close()

fcolnames = open(os.path.join(DATA_DIR, "merged-colnames.txt"), "wb")
output_colnames = get_output_cols(colnames)
for ocolname in output_colnames:
    fcolnames.write("{:s}\n".format(ocolname))
fcolnames.close()

Next we build an XGBoost classifier using 70% of this data for training, and evaluate it on the remaining 30%. As I mentioned before, the XGBoost API is compatible with Scikit-Learn's, so you will find the code below surprisingly boring.

# Source: src/student-alcohol/train-classifier.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
from sklearn.cross_validation import train_test_split
from sklearn.metrics import *
from xgboost import XGBClassifier

import cPickle as pickle
import numpy as np
import os

DATA_DIR = "../../data/student-alcohol"

dataset = np.loadtxt(os.path.join(DATA_DIR, "merged-data.csv"), 
                     delimiter=";")
X = dataset[:, 0:-1]
y = dataset[:, -1]

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, 
                                                random_state=42)

clf = XGBClassifier()
clf.fit(Xtrain, ytrain)

y_ = clf.predict(Xtest)

print("Accuracy: {:.3f}".format(accuracy_score(ytest, y_)))
print()
print("Confusion Matrix")
print(confusion_matrix(ytest, y_))
print()
print("Classification Report")
print(classification_report(ytest, y_))

with open(os.path.join(DATA_DIR, "model.pkl"), "wb") as fclf:
    pickle.dump(clf, fclf)

The accuracy reported on the test set is approximately 91%. As you can see from the confusion matrix (and the classification report), the classifier does a better job predicting low alcohol consumption students than high alcohol consumption students.

Accuracy: 0.908

Confusion Matrix
[[275   5]
 [ 24  10]]

Classification Report
             precision    recall  f1-score   support

        0.0       0.92      0.98      0.95       280
        1.0       0.67      0.29      0.41        34

avg / total       0.89      0.91      0.89       314

I did try 10-fold cross validation and grid search to find better hyperparameters, but did not have much success; a sketch of what that search looked like is shown below. I suspect that these numbers can be improved by upsampling the minority class to make the dataset more balanced, but the results looked good enough for my current purpose, so I moved on to looking at what else the model was telling us.
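Here is roughly what that grid search looked like (a sketch only; the parameter values shown are illustrative and not the exact grid I used):

# Sketch of the hyperparameter search (illustrative parameter values only).
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from xgboost import XGBClassifier
import numpy as np
import os

DATA_DIR = "../../data/student-alcohol"

dataset = np.loadtxt(os.path.join(DATA_DIR, "merged-data.csv"), delimiter=";")
X, y = dataset[:, 0:-1], dataset[:, -1]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3,
                                                random_state=42)
param_grid = {
    "max_depth": [3, 5, 7],
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.3],
}
grid = GridSearchCV(XGBClassifier(), param_grid, cv=10, scoring="accuracy")
grid.fit(Xtrain, ytrain)
print(grid.best_params_)
print("Best CV accuracy: {:.3f}".format(grid.best_score_))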

First I looked at the features that the model considered most important. Then I looked at the features individually to see which values corresponded with high vs low alcohol consumption. The code to generate that information is shown below.

# src/student-alcohol/top-features.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function

import numpy as np
import cPickle as pickle
import matplotlib.pyplot as plt

import os

DATA_DIR = "../../data/student-alcohol"

with open(os.path.join(DATA_DIR, "model.pkl"), "rb") as fclf:
    clf = pickle.load(fclf)

important_features = clf.feature_importances_

colnames = []
fcolnames = open(os.path.join(DATA_DIR, "merged-colnames.txt"), "rb")
for line in fcolnames:
    colnames.append(line.strip())
fcolnames.close()
colnames = colnames[0:-1]

# feature importances
plt.figure(figsize=(20, 10))
plt.barh(range(len(important_features)), important_features)
plt.xlabel("importance")
plt.ylabel("features")
plt.yticks(np.arange(len(colnames))+0.35, colnames)
plt.show()

# list of top features
print("Top features")
top_features = np.argsort(important_features)[::-1][0:15]
for i in range(15):
    idx = top_features[i]
    print("\t{:.3f}\t{:s}".format(important_features[idx], colnames[idx]))

# distribution of top features with output
dataset = np.loadtxt(os.path.join(DATA_DIR, "merged-data.csv"), delimiter=";")
X = dataset[:, 0:-1]
y = dataset[:, -1]

colors = ["lightgray", "r"]
fig, axes = plt.subplots(5, 3, figsize=(20, 10))
axes = np.ravel(axes)
for i in range(15):
    idx = top_features[i]
    xvals = X[:, idx]
    xvals_nalc = xvals[np.where(y == 0)[0]]
    xvals_alc = xvals[np.where(y == 1)[0]]
    num_xvals = np.unique(xvals).shape[0]
    if num_xvals <= 2:
        nbins = 2
    elif num_xvals <= 5:
        nbins = 5
    else:
        nbins = 10
    axes[i].hist([xvals_nalc, xvals_alc], bins=nbins, normed=False, 
                 histtype="bar", stacked=True, color=colors)
    axes[i].set_title(colnames[idx])
    axes[i].set_xticks([])
    axes[i].set_yticks([])
plt.xticks([])
plt.yticks([])
plt.tight_layout()
plt.show()

The bar chart below shows all the features and their relative importances. Click on the chart to expand it if you want to read the labels (the feature names) along the Y-axis.


And the list below shows the top 15 features.

Top features
        0.103   absences
        0.074   G1
        0.064   age
        0.054   sex
        0.052   goout
        0.052   famrel
        0.049   health
        0.042   traveltime
        0.039   studytime
        0.037   freetime
        0.034   G3
        0.034   Fjob:services
        0.032   famsize
        0.027   address
        0.025   Medu

Finally, the composite chart below shows the incidence of high alcohol consumption (red) and low alcohol consumption (gray) across various values for the top 15 features.


While my list matches some of the features listed in the paper as indicative of high alcohol consumption, there are also marked differences. Below, I list each of my features with a few words on why it might make sense, and indicate where the feature is also reported in the original paper.

The first observation is that most students in this population don't seem to have high alcohol consumption.

  1. absences - this is by far the biggest indicator of high alcohol consumption. Short absences tend to be good indicators of high alcohol use. In the paper, absence was treated as the target variable for a separate experiment and was most likely not even considered for this model.
  2. G1 (scores for first period) - low to medium scores in the first period seem to be indicative of high alcohol use, possibly because of the effects of hangover?
  3. age (high) - there seem to be more high alcohol consumers among older students in the 16-19 age group. The dataset has students aged 15-22, but the number of students older than 19 seems quite small.
  4. sex (male) - there seems to be more high alcohol consumers among boys than girls. This agrees with the paper (this is their #1 feature).
  5. goout (high) - there seem to be more high alcohol consumers among students who go out more often. This is also observed in the paper (it is their #2 feature).
  6. famrel (high) - somewhat counter-intuitively, more high alcohol consumers are observed to come from families where the quality of relationships is good to excellent.
  7. health (good) - high alcohol consumers are found more among students that are in good health. This is also observed in the paper (#6).
  8. traveltime (low) - more high alcohol consumers are found among students who live close to campus compared to those that live further away. Perhaps this gives them a sense of complacence when they are out drinking late at night. While this feature is observed in the paper as well, they note that high travel times are indicative of high alcohol consumption.
  9. studytime (low) - more people who study for less time tend to be high alcohol consumers. Perhaps this may be a consequence rather than a cause?
  10. freetime (high) - more people with higher free time tend to be high alcohol consumers as well. This looks similar to the observation about study times, but in reverse.
  11. G3 (high) - yet another counter-intuitive observation, people with high alcohol consumption tend to do better in the 3rd period (presumably after their hangovers have passed). Maybe high alcohol users tend to be innately good students with higher confidence in their abilities (hence the drinking), which shows up in these higher scores?
  12. Fjob:services (not) - according to this observation, more high alcohol users are found among students whose father is not in the civil service. Perhaps this is because fathers in the civil service are well-connected and they (or their connections) can swing by for surprise visits? The paper makes a similar observation about the father working (#13), although not specifically about the father being in the civil service.
  13. famsize (small) - more high alcohol users come from small families than large ones. This is also observed in the paper (#9).
  14. address (urban) - more high alcohol users come from urban families than rural ones.
  15. Medu (extreme) - a higher number of high alcohol users is observed among students whose mothers have either very little or a lot of education. This is similar to the observation in the paper, where they note this effect when the mother's education is low (#5).

And that's all I have for today. While this exercise got me working with XGBoost, the API is so similar to Scikit-Learn's that it was all quite painless. More than learning how to use XGBoost, I had fun analyzing the model. It's probably because school (although not high school for me) and alcohol are things most of us can relate to. In any case, I hope you enjoyed reading about it.


Sunday, October 09, 2016

Deep Learning Models for Question Answering with Keras


Last week, I was at a (company internal) workshop on Question Answering (Q+A), organized by our Search Guild, of which I am a member. The word "guild" sounds vaguely medieval, but it's basically a group of employees who share a common interest in Search technologies. As so often happens in large companies, groups tend to be somewhat silo-ized, and one group might not know much about what another one is doing, so the objective of the Search Guild is to bring groups together and promote knowledge sharing. To that end, the Search Guild organizes monthly presentations (with internal speakers as well as industry experts from outside the company) delivered via Webex (we are a distributed company with offices in at least 5 continents). It also provides forums for members to share information via blog posts, mailing lists, etc. As part of this effort, and given the importance of Q+A to Search, this year we organized our very first workshop on Q+A, held in Philadelphia on October 5 and 6.

What was unique about this workshop for me was that I was an organizer, speaker and attendee all at once. As a speaker, there is obviously significant additional work involved in building and delivering your presentation. As an organizer, however, you truly get an appreciation of how much work goes into making an event successful. Many thanks to my fellow organizers for all the work they did, and apologies to the participants (if any of them are reading this) for any mistakes we made (we made quite a few; next time we should definitely use more checklists, and remote two-way participation is very hard).

The talks at the workshop were organized into 4 primary themes. The first group of 3 talks (one of which was mine) dealt with approaches designed against external benchmarks, and was a bit more "researchy" than the others. The second group of 3 talks dealt with question complexity and how people are tackling it in their various projects. The third group of 4 talks looked at strategies used by engines that were already in production or QA, and the fourth group had 3 talks around different approaches to introducing Q+A into our Clinical search engine. In addition, there were several short talks and demos, mostly around Clinical. The most interest and activity in Q+A is around our Legal and Clinical search engines, followed by search engine products built around Life Sciences, Material Science and Chemistry. Attendance-wise, we had around 25 in-person participants and 15 remote, and 3 of the 13 talks were delivered remotely from our Amsterdam and Frankfurt offices.

My own experience with Question Answering is fairly minimal, mainly attempts to build functionality over search without trying too hard to understand the question implicit in the query. So it was definitely a great learning experience for me, to hear from people who had thought about their respective domains at length and come up with some pretty innovative solutions. As expected, some of the approaches described were similar to what I had used before, but they were used as part of a broader array of techniques, so there was something to learn for me there as well.

In this post, I will briefly describe my presentation and point you to the slides and code. My talk was about a hobby project that my co-presenter Abhishek Sharma and I started a couple of months ago, hoping to deepen our understanding of how Deep Learning could be applied to Question Answering. We are both part of the Deep Learning Enthusiasts Meetup (he is the organizer), and he came up with the idea while we were watching Richard Socher's Deep Learning for Natural Language Processing (CS224d) lectures. The project involves implementing a bunch of Deep Learning models to predict the correct choice for multiple-choice 8th grade Science questions. The data came from the Allen AI Science Challenge on Kaggle.

You can find the slides for the talk here. All the code can be found in this github repository. The code is written in Python using the awesome Keras library. I also used gensim to generate and load external embeddings, and NLTK and SpaCy for some simple NLP functionality. The README.md is fairly detailed (with many illustrations originally built for the slides), so I am not going to repeat that material here.

I looked at the "question with four candidate answers one of which is correct" as a classification problem with 1 positive and 3 negative examples per question. All my models produce a binary (correct/incorrect) response given a question and answer pair. Once the best model (in terms of accuracy of correct/incorrect predictions) is identified, I then run it on all four (question, answer) pairs and select the one with the best score. To do this, I needed to be able to serialize each model after training and deserialize it in the final prediction script. This is where I ran into problems I described in Keras Issue 3927.
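To make that concrete, here is a sketch of how a single multiple-choice question gets expanded into four binary examples (illustrative only; the actual logic lives in kaggle.get_question_answer_pairs and may differ in detail):

# Illustrative sketch (not the actual kaggle.py code): expand one
# multiple-choice question into four (question, answer, label) pairs.
question = "Which gas do plants absorb during photosynthesis?"  # made-up example
choices = ["oxygen", "carbon dioxide", "nitrogen", "helium"]
correct = "B"                                                   # i.e. choices[1]

pairs = []
for i, choice in enumerate(choices):
    label = 1 if chr(ord("A") + i) == correct else 0
    pairs.append((question, choice, label))

# pairs now holds 1 positive and 3 negative examples for this question
for q, a, label in pairs:
    print("{:d} {:s}".format(label, a))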

To make a long story short, if you re-use an input with the Sequential model, the weights get mis-aligned somehow and cannot be loaded back into the model. I noticed it after I upgraded to the latest version of Keras from a much older version because of some extra layer types I wanted to use. The workaround for the newer version seems to be to use the Functional API. Unfortunately I wasn't able to do the code rewrite and rerun by my presentation deadline, although luckily for me, I did have a usable model for one of my earlier (weaker) classifiers that I saved using the earlier version.

So in the rest of this post, I will describe the architecture and code for my strongest model, an LSTM-QA model with Attention (inspired by the paper LSTM-based Deep Learning Models for Non-factoid Answer Selection by Tan, dos Santos, Xiang and Zhou), and using a custom embedding generated from approximately 500k Studystack Flashcards, followed by the code for finding the best answer. In other words, the last mile of my solution.

This is what the network looks like:


And here is the code for the network.

# Source: qa-lstm-fem-attn.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
from keras.callbacks import ModelCheckpoint
from keras.layers import Input, Dense, Dropout, Reshape, Flatten, merge
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.models import Model
from sklearn.cross_validation import train_test_split
import os
import sys

import kaggle

DATA_DIR = "../data/comp_data"
MODEL_DIR = "../data/models"
WORD2VEC_BIN = "studystack.bin"
WORD2VEC_EMBED_SIZE = 300

QA_TRAIN_FILE = "8thGr-NDMC-Train.csv"
QA_TEST_FILE = "8thGr-NDMC-Test.csv"

QA_EMBED_SIZE = 64
BATCH_SIZE = 128
NBR_EPOCHS = 20

## extract data
print("Loading and formatting data...")
qapairs = kaggle.get_question_answer_pairs(
    os.path.join(DATA_DIR, QA_TRAIN_FILE))
question_maxlen = max([len(qapair[0]) for qapair in qapairs])
answer_maxlen = max([len(qapair[1]) for qapair in qapairs])

# Even though we don't use the test set for classification, we still need
# to consider any additional vocabulary words from it for when we use the
# model for prediction (against the test set).
tqapairs = kaggle.get_question_answer_pairs(
    os.path.join(DATA_DIR, QA_TEST_FILE), is_test=True)    
tq_maxlen = max([len(qapair[0]) for qapair in tqapairs])
ta_maxlen = max([len(qapair[1]) for qapair in tqapairs])

seq_maxlen = max([question_maxlen, answer_maxlen, tq_maxlen, ta_maxlen])

word2idx = kaggle.build_vocab([], qapairs, tqapairs)
vocab_size = len(word2idx) + 1 # include mask character 0

Xq, Xa, Y = kaggle.vectorize_qapairs(qapairs, word2idx, seq_maxlen)
Xqtrain, Xqtest, Xatrain, Xatest, Ytrain, Ytest = \
    train_test_split(Xq, Xa, Y, test_size=0.3, random_state=42)
print(Xqtrain.shape, Xqtest.shape, Xatrain.shape, Xatest.shape, 
      Ytrain.shape, Ytest.shape)

# get embeddings from word2vec
print("Loading Word2Vec model and generating embedding matrix...")
embedding_weights = kaggle.get_weights_word2vec(word2idx,
    os.path.join(DATA_DIR, WORD2VEC_BIN), is_custom=True)
        
print("Building model...")

# question encoder output: (None, seq_maxlen, QA_EMBED_SIZE)
qin = Input(shape=(seq_maxlen,), dtype="int32")
qenc = Embedding(input_dim=vocab_size,
                 output_dim=WORD2VEC_EMBED_SIZE,
                 input_length=seq_maxlen,
                 weights=[embedding_weights])(qin)
qenc = LSTM(QA_EMBED_SIZE, return_sequences=True)(qenc)
qenc = Dropout(0.3)(qenc)

# answer encoder output: (None, seq_maxlen, QA_EMBED_SIZE)
ain = Input(shape=(seq_maxlen,), dtype="int32")
aenc = Embedding(input_dim=vocab_size,
                 output_dim=WORD2VEC_EMBED_SIZE,
                 input_length=seq_maxlen,
                 weights=[embedding_weights])(ain)
aenc = LSTM(QA_EMBED_SIZE, return_sequences=True)(aenc)
aenc = Dropout(0.3)(aenc)

# attention model
attn = merge([qenc, aenc], mode="dot", dot_axes=[1, 1])
attn = Flatten()(attn)
attn = Dense(seq_maxlen * QA_EMBED_SIZE)(attn)
attn = Reshape((seq_maxlen, QA_EMBED_SIZE))(attn)

qenc_attn = merge([qenc, attn], mode="sum")
qenc_attn = Flatten()(qenc_attn)

output = Dense(2, activation="softmax")(qenc_attn)

model = Model(input=[qin, ain], output=[output])

print("Compiling model...")
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

print("Training...")
best_model_filename = os.path.join(MODEL_DIR, 
    kaggle.get_model_filename(sys.argv[0], "best"))
checkpoint = ModelCheckpoint(filepath=best_model_filename,
                             verbose=1, save_best_only=True)
model.fit([Xqtrain, Xatrain], [Ytrain], batch_size=BATCH_SIZE,
          nb_epoch=NBR_EPOCHS, validation_split=0.1,
          callbacks=[checkpoint])

print("Evaluation...")
loss, acc = model.evaluate([Xqtest, Xatest], [Ytest], batch_size=BATCH_SIZE)
print("Test loss/accuracy final model = %.4f, %.4f" % (loss, acc))

final_model_filename = os.path.join(MODEL_DIR, 
    kaggle.get_model_filename(sys.argv[0], "final"))
json_model_filename = os.path.join(MODEL_DIR,
    kaggle.get_model_filename(sys.argv[0], "json"))
kaggle.save_model(model, json_model_filename, final_model_filename)

best_model = kaggle.load_model(json_model_filename, best_model_filename)
best_model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
loss, acc = best_model.evaluate([Xqtest, Xatest], [Ytest], batch_size=BATCH_SIZE)
print("Test loss/accuracy best model = %.4f, %.4f" % (loss, acc))

The code above represents questions and answers as arrays of indexes into the word dictionary built from the words in the questions and answers. The embedding weights are initialized from a word2vec model trained on our corpus of StudyStack flashcards. Attention is modeled as a dot product between the question and answer sequences that come out of the LSTMs. Finally, the attention output and the question encoding are merged (element-wise sum), flattened, and sent into a Dense layer, which outputs one of two values (correct/incorrect).
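For reference, an embedding matrix like this can be built from a gensim word2vec model along the following lines (a sketch of the idea only; the actual kaggle.get_weights_word2vec helper may be implemented differently):

# Sketch (may differ from kaggle.get_weights_word2vec): build an embedding
# matrix with one word2vec vector per vocabulary word for the Embedding layer.
from gensim.models import KeyedVectors
import numpy as np

def build_embedding_weights(word2idx, w2v_path, embed_size=300):
    w2v = KeyedVectors.load_word2vec_format(w2v_path, binary=True)
    # row 0 is reserved for the mask/padding index
    weights = np.zeros((len(word2idx) + 1, embed_size))
    for word, idx in word2idx.items():
        if word in w2v:
            weights[idx] = w2v[word]
        # words missing from the word2vec vocabulary stay all-zero
    return weights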

The next step takes the saved model (final one) and runs each question in the test set and its four choices as a single batch, and predicts the correct answer as the one which has the highest score. The output is written to a CSV file in the format required for submission to Kaggle.

# src/predict_testfile.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
from keras.preprocessing.sequence import pad_sequences
import nltk
import numpy as np
import os

import kaggle

DATA_DIR = "../data/comp_data"
TRAIN_FILE = "8thGr-NDMC-Train.csv"
TEST_FILE = "8thGr-NDMC-Test.csv"
SUBMIT_FILE = "submission.csv"

MODEL_DIR = "../data/models"
MODEL_JSON = "qa-lstm-fem-attn.json"
MODEL_WEIGHTS = "qa-lstm-fem-attn-final.h5"
LSTM_SEQLEN = 196 # seq_maxlen from original model

print("Loading model..")
model = kaggle.load_model(os.path.join(MODEL_DIR, MODEL_JSON),
                          os.path.join(MODEL_DIR, MODEL_WEIGHTS))
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

print("Loading vocabulary...")
qapairs = kaggle.get_question_answer_pairs(os.path.join(DATA_DIR, TRAIN_FILE))
tqapairs = kaggle.get_question_answer_pairs(os.path.join(DATA_DIR, TEST_FILE), 
                                            is_test=True)
word2idx = kaggle.build_vocab([], qapairs, tqapairs)
vocab_size = len(word2idx) + 1 # include mask character 0

ftest = open(os.path.join(DATA_DIR, TEST_FILE), "rb")
fsub = open(os.path.join(DATA_DIR, SUBMIT_FILE), "wb")
fsub.write("id,correctAnswer\n")
line_nbr = 0
for line in ftest:
    line = line.strip().decode("utf8").encode("ascii", "ignore")
    if line.startswith("#"):
        continue
    if line_nbr % 10 == 0:
        print("Processed %d questions..." % (line_nbr))
    cols = line.split("\t")
    qid = cols[0]
    question = cols[1]
    answers = cols[2:]
    # create batch of question
    qword_ids = [word2idx[qword] for qword in nltk.word_tokenize(question)]
    Xq, Xa = [], []
    for answer in answers:
        Xq.append(qword_ids)
        Xa.append([word2idx[aword] for aword in nltk.word_tokenize(answer)])
    Xq = pad_sequences(Xq, maxlen=LSTM_SEQLEN)
    Xa = pad_sequences(Xa, maxlen=LSTM_SEQLEN)
    Y = model.predict([Xq, Xa])
    probs = np.exp(1.0 - (Y[:, 1] - Y[:, 0]))
    correct_answer = chr(ord('A') + np.argmax(probs))
    fsub.write("%s,%s\n" % (qid, correct_answer))
    line_nbr += 1
print("Processed %d questions..." % (line_nbr))
fsub.close()
ftest.close()

Here is the output for a single question which I had referenced in the presentation slides. The chart shows the distribution of scores across the four answers (normalized to add up to 1).
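Normalizing the four scores so they add up to 1 is just a matter of dividing each one by their sum, for example (made-up scores, purely illustrative; not necessarily how the chart was produced):

# Illustrative only: normalize per-answer scores so they sum to 1 for display.
import numpy as np
scores = np.array([0.8, 2.1, 0.5, 0.3])  # made-up raw scores for choices A-D
probs = scores / scores.sum()
print(probs, probs.sum())                # proportions that add up to 1.0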


I did try to run my classifier on the entire test set and produce a submission file for Kaggle, just to see where I stood. Since the winning entry's accuracy on the actual four-way task was approximately 59%, it is unlikely that the 70%+ accuracy numbers my classifiers achieve on the binary correct/incorrect task will carry over to the final task. I had signed up for the competition with the intention of participating but got sidetracked, so I had the original datasets of approximately 8,000 training and 8,000 test questions. Unfortunately, the final rankings were computed off another test set of approximately 200k questions supplied later in the competition, so I didn't have those.

That's all I have for today. As someone mentioned to me after the workshop, these sorts of events are very energizing, and I certainly learned a lot from it. The deadline also pushed me to complete my hobby project, so I got to learn quite a bit about building more complex Keras models. Hopefully, this will enable me to build even more complex models going forward.