Monday, July 11, 2016

NLP (almost) From Scratch - Implementing the POS Network


Recently I was tasked with reading the Natural Language Processing (almost) from Scratch paper by Collobert et al. and sharing my findings with other members of the San Francisco Deep Learning Enthusiasts meetup group. Unlike other meetups, where you go to listen to experts, this one is more like a study group. Our primary activity is watching and discussing deep learning videos and sharing information about Deep Learning (DL). We (usually) meet every Thursday - if you are interested in bootstrapping your DL skills and are in San Francisco, you should join us.

But in any case, getting back to the paper. Even though it was written in 2011, it is still interesting as one of the first major applications of DL techniques to Natural Language Processing (NLP). Proof that it's interesting to the population at large (as opposed to just us in our meetup) comes from the fact that it was featured on The Morning Paper recently (July 4, 2016).

The field of NLP has certain core tasks upon which other higher-level, more "magical" applications are built. These tasks are part-of-speech (POS) tagging, phrase chunking, named entity recognition (NER) and semantic role labeling (SRL). Each of these tasks has standard datasets and benchmark results that NLP researchers are constantly trying to beat. What sets this paper apart is that the authors avoided any task-specific feature engineering (i.e., creating hand-crafted linguistic features), and yet achieved results that beat the then-current benchmarks in 3 of the 4 tasks.

They did this by using a simple Multi-Layer Perceptron (MLP) network for the first 3 tasks and a Convolutional Neural Network (CNN) for the last one. Input to the MLP was a context window of 5 words (the word plus its 2 neighbors on either side). Each word was converted into an embedded vector representation, and these vectors were concatenated to form a context vector. The label was the attribute of the center word (its POS or IOB tag). For the SRL task, a word-based context was not sufficient, so they used the entire sentence as input, again converting each word to an embedded vector representation, and predicted the word positions of a single verb-predicate pair per sentence.
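To make the window idea concrete, here is a minimal sketch of how a 5-word context window can be turned into a single input vector by concatenating per-word embeddings. The toy vocabulary and random embedding table below are placeholders of my own; only the 50-dimensional embedding size is taken from the paper.

# illustrative sketch, not code from the paper
import numpy as np

EMBED_SIZE = 50          # the paper uses 50-dimensional word embeddings
WINDOW = 5               # center word +/- 2 neighbors

# hypothetical embedding lookup: word -> (EMBED_SIZE,) vector
rng = np.random.RandomState(42)
vocab = ["PAD", "the", "cat", "sat", "on", "mat"]
embeddings = {w: rng.randn(EMBED_SIZE) for w in vocab}

def window_vector(words, center):
    """Concatenate the embeddings of the 5-word window around words[center]."""
    padded = ["PAD", "PAD"] + words + ["PAD", "PAD"]
    window = padded[center:center + WINDOW]   # padding shifts the center by 2
    return np.concatenate([embeddings[w] for w in window])

x = window_vector(["the", "cat", "sat", "on", "the", "mat"], center=1)
print(x.shape)  # (250,) -- the input for predicting the tag of "cat"

The MLP then sees one such concatenated vector per word position in the sentence.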

While reading the paper, I thought it might be interesting to implement one of the models using tools I am familiar with. I chose the simplest model, the one that predicts POS tags. My input corpus is the text of Alice in Wonderland from Project Gutenberg. I created my own training set by POS tagging it with spaCy. Instead of generating my own embeddings to build the word vectors from, I used gensim to load word2vec embeddings from the pre-trained Google News model. Finally, I used Keras to build the MLP that consumes the context vectors and generates the POS predictions. The rest of the post describes this work.

POS Tagging with spaCy


I manually removed the header and footer from the text of Alice in Wonderland, leaving just the story text starting at "CHAPTER I" and ending with "happy summer days.". I was originally just going to use NLTK to generate the POS tags, but I had heard good things about spaCy, so I decided to check it out by using it instead. Here is the code that does that.

# Source: src/spacy_postagging.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
from spacy.en import English
import operator
import os

DATA_DIR = "../data"

nlp = English()

falice = open(os.path.join(DATA_DIR, "alice_in_wonderland.txt"), "rb")
content = falice.read()
falice.close()

doc = nlp(content.decode("utf-8"))
sents = []
for sent in doc.sents:
    sents.append(sent)

i = 0
vocab = {}
tags = {}
tagged_sents = []
fout = open(os.path.join(DATA_DIR, "alice_sents_postagged.txt"), "wb")
for sent in sents:
    i += 1
    print("Processing sentence# %d: %s" % (i, sent))
    token_tags = []
    toks = nlp(sent.text.encode("ascii", "ignore").decode("utf-8"))
    for tok in toks:
        token_tags.append((tok.text, tok.pos_))
    clean_token_tags = []
    for tt in token_tags:
        if tt[1] == u"SPACE":
            continue
        clean_token_tags.append(tt)
        vocab[tt[0]] = vocab.get(tt[0], 0) + 1
        tags[tt[1]] = tags.get(tt[1], 0) + 1
    tagged_sents.append(clean_token_tags)
    fout.write("%s\n" % (" ".join(["/".join([tt[0], tt[1]]) 
        for tt in clean_token_tags])))
fout.close()

# replace words which occur 1 or 2 times with UNK in vocab
for word in vocab.keys():
    if vocab[word] < 3:
        vocab["UNK"] = vocab.get("UNK", 0) + 1
        vocab.pop(word, None)
        
# create a lookup dictionary for words
fwords = open(os.path.join(DATA_DIR, "alice_words.txt"), "wb")
vocab_s = sorted(vocab.iteritems(), key=operator.itemgetter(1), reverse=True)
for i, (k, v) in enumerate(vocab_s):
    fwords.write("%d\t%s\t%d\n" % (i, k, v))
fwords.close()

# create a lookup dictionary for POS tags
ftags = open(os.path.join(DATA_DIR, "alice_postags.txt"), "wb")
tags_s = sorted(tags.iteritems(), key=operator.itemgetter(1), reverse=True)
for i, (k, v) in enumerate(tags_s):
    ftags.write("%d\t%s\t%d\n" % (i, k, v))
ftags.close()

# construct 5-grams from sentences
fgrams = open(os.path.join(DATA_DIR, "alice_5grams.txt"), "wb")
for tagged_sent in tagged_sents:
    sent_grams = []
    gram_labels = []
    # lowercase the words
    tagged_sent = [(x[0].lower(), x[1]) for x in tagged_sent]
    # replace with UNK for specific words
    tagged_sent = [(x[0] if vocab.has_key(x[0]) else "UNK", x[1]) 
                         for x in tagged_sent]
    # put pre- and post- padding
    tagged_sent.insert(0, ("PAD", "PAD"))
    tagged_sent.insert(0, ("PAD", "PAD"))
    tagged_sent.append(("PAD", "PAD"))
    tagged_sent.append(("PAD", "PAD"))
    for i in range(len(tagged_sent) - 4):
        sent_gram = tagged_sent[i:i+5]
        # label of middle word, and input words is 5-gram around word
        fgrams.write("%s\t%s\n" % (sent_gram[2][1], 
                                   " ".join([x[0] for x in sent_gram])))
fgrams.close()

The code uses spaCy to generate POS tags inline with the words. For example, this sentence:

She took down a jar from one of the shelves as she passed; it was labelled 
'ORANGE MARMALADE', but to her great disappointment it was empty: she did 
not like to drop the jar for fear of killing somebody, so managed to put it 
into one of the cupboards as she fell past it.

is converted to this format:

She/PRON took/VERB down/PART a/DET jar/NOUN from/ADP one/NUM of/ADP the/DET 
shelves/NOUN as/ADP she/PRON passed/VERB ;/PUNCT it/PRON was/VERB 
labelled/VERB '/PUNCT ORANGE/ADV MARMALADE/PROPN '/PUNCT ,/PUNCT but/CONJ 
to/ADP her/ADJ great/ADJ disappointment/NOUN it/PRON was/VERB empty/ADJ 
:/PUNCT she/PRON did/VERB not/ADV like/VERB to/PART drop/VERB the/DET 
jar/NOUN for/ADP fear/NOUN of/ADP killing/VERB somebody/NOUN ,/PUNCT 
so/ADV managed/VERB to/PART put/VERB it/PRON into/ADP one/NUM of/ADP 
the/DET cupboards/NOUN as/ADP she/PRON fell/VERB past/ADP it/PRON ./PUNCT

The code then uses the intermediate format above to generate word and POS tag frequencies, which it writes out to lookup files. The word frequencies are used to replace any word that occurs 2 or fewer times in the text with the token UNK (unknown). I also calculate POS tag frequencies, which I will use later. I then run through each sentence, padding it at both ends and generating 5-grams from its words. This creates records like this:

PRON    PAD PAD she took down
VERB    PAD she took down a
PART    she took down a jar
DET     took down a jar from
NOUN    down a jar from one
ADP     a jar from one of
NUM     jar from one of the
ADP     from one of the shelves
DET     one of the shelves as
NOUN    of the shelves as she
ADP     the shelves as she passed
PRON    shelves as she passed ;
...

Looking up word2vec Vectors with gensim


I now take the 5-grams and the associated POS tag for the middle word, and look up word2vec embeddings for each of the words. The word2vec team has released a prebuilt embedding model that was trained on 100B words and returns a 300-dimensional embedding given a word. It is available here along with some other models. Gensim provides a nice API to read this model and extract word2vec vectors from it for words in your corpus. In the code below, I use this API to convert the words of each 5-gram into a (1, 1500) vector and its label into a (1, 15) one-hot vector, and write them out to files for the next stage.

# Source: gensim_word2vec.py
# -*- coding: utf-8 -*-
from gensim.models import word2vec
import numpy as np

print("Loading label lookup...")
label_lookup = {}
f_postags = open("../data/alice_postags.txt", "rb")
for line in f_postags:
    lid, ltext, _ = line.strip().split("\t")
    label_lookup[ltext] = int(lid)
f_postags.close()

print("Loading word2vec model...")
w2v = word2vec.Word2Vec.load_word2vec_format(
    "../data/GoogleNews-vectors-negative300.bin.gz", binary=True)
vec_size = 300
vec_pad = np.zeros(vec_size)
vec_unk = np.ones(vec_size)
ngram_size = 5

print("Writing vectors...")
f_data = open("../data/alice_5grams.txt", "rb")
f_X = open("../data/alice_X.csv", "wb")
f_y = open("../data/alice_y.csv", "wb")
nbr_read = 0
for line in f_data:
    nbr_read += 1
    if nbr_read % 1000 == 0:
        print("    Wrote %d vectors..." % (nbr_read))
    label, ngram = line.strip().split("\t")
    lid = label_lookup[label]
    word_vecs = np.zeros((ngram_size, vec_size))
    for i, word in enumerate(ngram.split(" ")):
        if word == "PAD":
            word_vecs[i] = vec_pad
        elif word == "UNK":
            word_vecs[i] = vec_unk
        else:
            try:
                word_vecs[i] = w2v[word]
            except KeyError:
                word_vecs[i] = vec_unk
    ngram_vec = np.reshape(word_vecs, (ngram_size * vec_size))
    f_X.write("%s\n" % (",".join(["%.5f" % (x) for x in ngram_vec.tolist()])))
    label_vec = np.zeros(len(label_lookup))
    label_vec[lid] = 1
    f_y.write("%s\n" % (",".join(["%d" % (x) for x in label_vec.tolist()])))
print("Wrote %d vectors" % (nbr_read))    
f_X.close()
f_y.close()

In order to include words at the sentence edges in our 5-grams, I used PAD tokens, which obviously don't have an associated word2vec vector. For these, I assigned a vector of all zeros. Similarly, for the UNK words I assigned a vector of all ones. I also assign UNK vectors to any words that I am unable to find in the word2vec model - given the size of the word2vec model's training set, such words should be few and by definition rare, so they are similar to the original UNK words (2 or fewer occurrences in the corpus). The output of this stage is two files - alice_X.csv and alice_y.csv. Each line of alice_X.csv is a comma-separated list of 1500 numbers representing the input vector for one record, and each line of alice_y.csv is the one-hot encoding of the POS tag label for the center word. There are 34,459 rows of input.
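As a quick sanity check (not part of the original pipeline), the two CSV files can be loaded with numpy to confirm that the shapes line up; the paths below assume the same ../data layout used in the rest of the code.

# illustrative sanity check, assuming the ../data layout used above
import numpy as np

X = np.loadtxt("../data/alice_X.csv", delimiter=",")
y = np.loadtxt("../data/alice_y.csv", delimiter=",")

print(X.shape)  # expected: (34459, 1500) -- 5 words x 300 dims per row
print(y.shape)  # expected: (34459, 15)   -- one-hot POS tag of the center word
assert X.shape[0] == y.shape[0], "X and y must have the same number of rows"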

Training DL Model with Keras


Finally, I train an MLP network on these context vectors. The original paper specifies 300 hidden units, but their input vectors were built from 50-dimensional embeddings and so were much smaller than my (1500,) inputs, so I used two fully-connected layers of 768 and 512 units. They also used hard tanh as their non-linearity, whereas I used ReLU. I trained the model on 70% of the data and validated against the remaining 30%. Running the training for 50 epochs produced a model with a best validation loss of 0.68 and a training loss of 0.96. (If you are curious, as I was, why the validation loss is lower than the training loss, a good explanation can be found in the Keras FAQ.) Here is the code for training the model.

# Source: src/keras_postagging.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
from keras.callbacks import ModelCheckpoint
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import Adamax
from keras.regularizers import l2
import matplotlib.pyplot as plt
import numpy as np
import os

DATA_DIR = "../data"

np.random.seed(42)

# read data
X = np.loadtxt(os.path.join(DATA_DIR, "alice_X.csv"), delimiter=",")
y = np.loadtxt(os.path.join(DATA_DIR, "alice_y.csv"), delimiter=",")

# set up model
model = Sequential([
    # input layer
    Dense(768, input_shape=(1500,), W_regularizer=l2(0.001)),
    Activation("relu"),
    Dropout(0.2),
    # hidden layer
    Dense(512, W_regularizer=l2(0.001)),
    Activation("relu"),
    Dropout(0.2),
    # output layer
    Dense(15),
    Activation("softmax")
])

adamax = Adamax(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
model.compile(loss="categorical_crossentropy", optimizer=adamax)

# save model structure
model_struct = model.to_json()
fmod_struct = open(os.path.join(DATA_DIR, "alice_pos_model.json"), "wb")
fmod_struct.write(model_struct)
fmod_struct.close()

# train model
checkpoint = ModelCheckpoint(os.path.join(DATA_DIR, "checkpoints",
    "alice_pos_weights.{epoch:02d}-{val_loss:.2f}.hdf5"), 
    monitor="val_loss", save_best_only=True, mode="min")
hist = model.fit(X, y, batch_size=128, nb_epoch=50, shuffle=True,
                 validation_split=0.3, callbacks=[checkpoint])

# plot losses
train_loss = hist.history["loss"]
val_loss = hist.history["val_loss"]
plt.plot(range(len(train_loss)), train_loss, color="red", label="Train Loss")
plt.plot(range(len(train_loss)), val_loss, color="blue", label="Val Loss")          
plt.xlabel("epochs")
plt.ylabel("loss")
plt.legend(loc="best")
plt.show()

I chose the Adamax optimizer because it gave the best results among the optimizers I tried. I also played around a bit with the other hyperparameters, such as different non-linearities, hidden layer sizes, number of layers, etc. The chart produced by the plotting code above shows how the training and validation losses change over the epochs. I also use the ModelCheckpoint callback to capture the weights of the model with the lowest validation loss, and save the model structure to a JSON file.
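If you also want Keras to report classification accuracy alongside the loss, the compile call can be extended with the metrics argument. This is a minor variation on the training code above (a sketch, assuming the same model, adamax, X, y and checkpoint objects are already defined), not something I ran for the results reported here.

# variation on the compile/fit calls above: also track per-window tag accuracy
model.compile(loss="categorical_crossentropy", optimizer=adamax,
              metrics=["accuracy"])
hist = model.fit(X, y, batch_size=128, nb_epoch=50, shuffle=True,
                 validation_split=0.3, callbacks=[checkpoint])
# hist.history now also contains "acc" and "val_acc" curves
print("best validation accuracy: %.4f" % max(hist.history["val_acc"]))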


Making Predictions


I now take the 5-grams corresponding to our example sentence, along with their corresponding context vectors. I then load the model structure and weights and (this is very important) recompile the model. I can then call model.predict to get back a (15,) vector of tag probabilities, from which I extract the highest-scoring POS tag. The code for doing the prediction is shown below:

# Source: pos_predict.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
from keras.models import model_from_json
from keras.optimizers import Adamax
import numpy as np
import os

DATA_DIR = "../data"

# deserialize model
fmods = open(os.path.join(DATA_DIR, "alice_pos_model.json"), "rb")
model_json = fmods.read()
fmods.close()
model = model_from_json(model_json)
model.load_weights(os.path.join(DATA_DIR, "alice_pos_weights.30-0.68.hdf5"))
adamax = Adamax(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
model.compile(loss="categorical_crossentropy", optimizer=adamax)

# label lookup
label_dict = {}
fpos = open(os.path.join(DATA_DIR, "alice_postags.txt"), "rb")
for line in fpos:
    lid, ltxt, _ = line.strip().split("\t")
    label_dict[int(lid)] = ltxt
fpos.close()

# read ngrams into array
ngram_labels = []
fngrams = open(os.path.join(DATA_DIR, "alice_5grams_pred.txt"), "rb")
for line in fngrams:
    label, ngram = line.strip().split("\t")
    ngram_labels.append((ngram, label))
fngrams.close()

# read word+context vectors and predict from model
fpred = open(os.path.join(DATA_DIR, "alice_test_pred.txt"), "wb")
fvec = open(os.path.join(DATA_DIR, "alice_X_pred.csv"), "rb")
lno = 0
for line in fvec:
    X = np.array([float(x) for x in line.strip().split(",")]).reshape(1, 1500)
    y_ = np.argmax(model.predict(X))
    nl = ngram_labels[lno]
    fpred.write("%s\t%s\t%s\n" % (nl[0], nl[1], label_dict[y_]))
    lno += 1
fvec.close()    
fpred.close()

A partial output for my test sentence is shown below. The first column is the word ngram, the second is the true label (as computed using spaCy) and the third column is the predicted label (as computed by my Keras MLP). As you can see, even given the relatively small training corpus, the results seem quite good (a quick way to turn this file into an overall accuracy number is sketched after the sample output). Of course, POS tagging is a relatively simple task, so I should probably not read too much into these results.

PAD PAD she took down    PRON   PRON
PAD she took down a      VERB   VERB
she took down a UNK      PART   ADV
took down a UNK from     DET    PUNCT
down a UNK from one      NOUN   PUNCT
a UNK from one of        ADP    ADP
UNK from one of the      NUM    NUM
from one of the UNK      ADP    ADP
one of the UNK as        DET    DET
of the UNK as she        NOUN   PROPN
the UNK as she passed    ADP    ADP
UNK as she passed ;      PRON   PRON
as she passed ; it       VERB   VERB
...
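The post does not report an aggregate number, but since each line of alice_test_pred.txt written by pos_predict.py holds the ngram, the spaCy tag and the predicted tag separated by tabs, an overall accuracy is easy to tally. Here is a minimal sketch under that assumption.

# illustrative sketch: tally tag accuracy from the prediction file
# (each line: ngram \t spaCy tag \t predicted tag)
correct, total = 0, 0
with open("../data/alice_test_pred.txt") as fpred:
    for line in fpred:
        ngram, true_tag, pred_tag = line.rstrip("\n").split("\t")
        correct += int(true_tag == pred_tag)
        total += 1
print("accuracy: %.4f (%d/%d)" % (correct / float(total), correct, total))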

And that's all I have for this week. Doing this ended up being a lot of fun, thanks to awesome libraries like spaCy, gensim and Keras. Hope you enjoyed reading it too.

6 comments (moderated to prevent spam):

Rafat said...

Hi, I wanted to try your code, but while running keras_postagging.py I am facing an error like:
Traceback (most recent call last):
File "D:\pycode\project similarity\keras_postagging.py", line 51, in
validation_split=0.3, callbacks=[checkpoint])
File "C:\Users\Rafat\Anaconda2\lib\site-packages\keras\models.py", line 429, in fit
sample_weight=sample_weight)
File "C:\Users\Rafat\Anaconda2\lib\site-packages\keras\engine\training.py", line 1036, in fit
batch_size=batch_size)
File "C:\Users\Rafat\Anaconda2\lib\site-packages\keras\engine\training.py", line 967, in _standardize_user_data
exception_prefix='model target')
File "C:\Users\Rafat\Anaconda2\lib\site-packages\keras\engine\training.py", line 108, in standardize_input_data
str(array.shape))
Exception: Error when checking model target: expected activation_3 to have shape (None, 15) but got array with shape (34477L, 14L)

I have searched a lot on this issue but did not find any solution. It is occurring on the model.fit() operation.

Sujit Pal said...

Sorry about the delay in responding, Blogger seems to have turned off my notifications (or maybe I did by mistake), just seeing this now... one possibility is that you have only 14 labels for some reason, so your output y vector has only 14 columns instead of 15. In that case, just change the Dense(15) call in line 32 of keras_postagging.py to Dense(14) and see if that fixes this.

Rafat said...

Thanks for the reply, it worked fine. Now while working with pos_predict, how can I get alice_5grams_pred.txt, alice_test_pred.txt and alice_X_pred.csv?
I was wondering if I can use this model, converted to an LSTM model, to find question-question similarities. If not, can you please guide me?

Sujit Pal said...

Hi Rafat, good to know. The _pred.* files are just the 30% split of the original data files that are used for testing. I don't know if this model can be used to test question-question similarity. There is a model I learned about recently that can be used in an unsupervised way to find similarities between sentences in general - it involves generating a parse tree for each of the sentences, computing a similarity matrix between the embeddings for each word and the nodes in the parse tree, passing it through a dynamic pooling layer to reduce it to a standard size, and computing a measure of similarity from the reduced matrix. I plan on implementing this at some point, but here is a link to the paper on this page.

Anonymous said...

May I ask what the POS tag frequencies were used for?

Sujit Pal said...

I looked at the code and I can't see a reason for holding on to the POS frequencies, so whatever I needed them for doesn't seem to be covered by the blog post. It's been an eventful few months, unfortunately, and I really don't remember if I did something beyond what I wrote about in the blog. Most likely the statement about doing something with them later is irrelevant, at least in the context of this blog post.