Wednesday, November 15, 2017

Observations of a Keras developer learning Pytorch

In terms of toolkits, my Deep Learning (DL) journey started with using Caffe pre-trained models for transfer learning. This was followed by a brief dalliance with Tensorflow (TF), first as a vehicle for doing the exercises on the Udacity Deep Learning course, then retraining some existing TF models on our own data. Then I came across Keras, and like many others, absolutely fell in love with the simplicity, elegance and power of its object-oriented layer centric API. Most of the deep learning work I have done over the past couple of years has been with Keras, and it has, much like Larry Wall intended with Perl, made easy things easy and hard things possible for me.

A few months ago (7 according to github), I caught the polyglot bug (but with respect to deep learning toolkits, hence polyDLot), and decided to test the waters by implementing MNIST handwritten digit classification with multiple different toolkits. One of these toolkits was Pytorch, which promises imperative programming and dynamic computation graphs. Although the MNIST tasks did not exploit Pytorch's ability to build dynamic computation graphs, I thought it was quite unique in its positioning. In terms of verbosity, it is somewhere in between Keras and TF, but in terms of flexibility and power, it seems to offer all the benefits of a low level framework like TF with a simpler and more intuitive interface.

Lately, I have been hearing a lot about Pytorch, such as the release of the Allen NLP library, an open-source Natural Language Processing (NLP) research library from Allen AI that was built on top of Pytorch. Also Jeremy Howard, the brain behind the MOOCs whose goal is to make neural nets uncool again, has written about his reasons for introducing Pytorch into the course. Part-1 of this course was based on Keras, while Part-2 is based on a combination of TF and Pytorch. There are a few other mentions as well, but you get the idea. By the way, Part-2 of the course is now open to the public in case you were waiting for that.

My own interests are around applying DL to NLP, and I have been hitting a few Keras limits of my own for some things lately. Nothing insurmountable at the moment that a custom layer cannot fix, but I figured that it might be worth exploring Pytorch a bit more, especially to get familiar with recurrent models. So that's what I did, and this post describes the experience.

I used the book Long Short Term Memory Networks with Python by Jason Brownlee as my source of toy examples to implement in Pytorch. Jason Brownlee runs the Machine Learning Mastery site and is the author of multiple books on Machine Learning (ML) and DL. He also sends out a regular newsletter with practical tips on ML/DL. Like his other DL books, the book provides Keras code for each of the toy examples.

I built Pytorch implementations for six toy networks, each in its own Jupyter notebook. Each notebook begins with a brief problem description, loosely extracted from the book. I have tried to make it descriptive, but if it is insufficient, please look at the code or read the description in its original. Also if you are looking for the Keras implementation and information beyond just the basic description for these toy examples, I would recommend purchasing the book. In addition to these examples, the book has good advice on the things to watch out for when building recurrent networks. I (obviously) bought a copy, and I think it was definitely worth the price. Here are the examples:

  • 06-echo-sequence-prediction.ipynb - the network is fed a fixed-size sequence of random integers, and trained to predict the integer at a specific (but unknown to the network) index in the input.
  • 07-damped-sine-wave-prediction.ipynb - the network is fed fixed-size sequences of points on damped sine waves of varying amplitudes and periodicity, and trained to predict the value for an unknown damped sine wave at the next time step given a sequence of previous values.
  • 08-moving-square-video-prediction.ipynb - a combined CNN-LSTM network that takes a sequence of images representing the movement of a point from one end of a square to another, and predicts the direction of the movement for a new sequence of images.
  • 09-addition-prediction.ipynb - an encoder-decoder network to solve addition problems represented as a sequence of digits joined by the plus sign. Output is the stringified value of the sum.
  • 10-cumsum-prediction.ipynb - a network that takes a sequence of random values between 0 and 1, and outputs 0 or 1 depending on whether the cumulative sum of the values seen so far is below or above a specific (but unknown to the network) threshold value.
  • 11-shape-generation.ipynb - a network trained on a sequence of real-valued (x, y) coordinate pairs representing a rectangle. The trained network is then used to generate polygon shapes that (should) look like rectangles.

And finally, here come the observations I promised in the title of this post. These examples explore Pytorch's capabilities better than the MNIST examples did, but they still don't exploit its ability to create dynamic computation graphs.

  • Models are classes - in Keras, you manipulate pre-built layer classes like Lego blocks using either the Sequential or Functional API. In Pytorch, you set up your network as a class which extends torch.nn.Module. Pytorch provides you layers as building blocks similar to Keras, but you typically reference them in the class's __init__() method and define the flow in its forward() method. Because you have access to all of Python's features as opposed to simple function calls, this can result in much more expressive flows.
  • TimeDistributed and RepeatVector are missing - these two components are used in Keras to declare a transformation and distribute it over time, or to replicate a vector to feed into an LSTM. Neither component exists in Pytorch because they can be easily implemented using code.
  • Less insulation from component internals - the Keras API hides a lot of the messy details from the casual user. Components have sensible defaults, so you can start simple and tweak more and more parameters as you gain experience. On the other hand, the TF API gives you complete control (and arguably more than enough rope to hang yourself), forcing you to think of all parameters at the level of matrix multiplication. While Pytorch does not go that far, it does require you to understand in general what is going on inside each component. For example, its LSTM module allows for multiple layers, and a Bidirectional LSTM (achieved by setting the parameter bidirectional=True) is internally represented as a stack of 2 LSTMs - you are required to know this so you can set the dimensions of the hidden state (h) signal correctly. Another example is the need to explicitly specify the output sizes after convolution for CNN layers.
  • Fitting a model is a multi-step process - fitting a model in Pytorch consists of zeroing out the gradients at the start of each training batch, running the batch forward through the model, computing the loss, running the gradient backward (loss.backward()), and making the weight update (optimizer.step()). I don't know if this process varies enough to justify having these steps split out. At least in my case, the training loop is practically identical across all my examples.
  • Torch tensors interop with Numpy variables - Most Keras developers never have to worry about TF/Theano and Numpy interop, at least not unless they start using the backend API. Once they do, though, they have to understand the whole concept of TF sessions in order to interoperate between TF tensors and Numpy variables. Pytorch interop is actually much simpler: there are just two operations, one to switch a Torch tensor (a Variable object) to Numpy, and another one to go in the opposite direction.
  • GPU/CPU mode not transparent - both Keras and TF transparently use the GPU if it exists. For Pytorch, you have to explicitly check for this every time you move between torch tensors and numpy variables. This clutters up the code and can be a bit error prone if you move back and forth between CPU (for development) and GPU (for deployment) environments. Although I suppose we could build wrapper functions and use them instead.
  • Channel first always for images - TF (and by extension Keras) offers the user a choice of representing an image as (N, C, H, W) or (N, H, W, C), or channel-first or channel-last format (here N = batch size, C = number of channels, H = image height, and W = image width). Pytorch is always channel first. I mention it here because I spent some time trying to figure out why my NHWC format tensors weren't working with my network class.
  • Batch first is optional for RNN input - Unlike Keras and TF, where inputs to RNNs are in (N, T, F), Pytorch requires input as (T, N, F) by default (here N = batch size, T = number of timesteps, F = number of features). However, you can switch over to the more familiar (N, T, F) format by setting the batch_first=True parameter. This simplifies some of the code for batch manipulation during training.
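Several of the points above (models as classes, batch_first, the hidden state shape of a bidirectional LSTM, and Numpy interop with an explicit device) can be seen together in a small sketch. This is not one of the notebook models: the class name SeqClassifier, the layer sizes, and the helpers to_tensor and to_numpy are made up for illustration, and the code assumes a recent Pytorch release (0.4 or later), where tensors no longer have to be wrapped in a Variable.

```python
import numpy as np
import torch
import torch.nn as nn

# Pick the device once instead of checking for a GPU at every conversion.
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def to_tensor(array):
    """Numpy array (or nested list) -> float32 torch tensor on DEVICE."""
    return torch.from_numpy(np.asarray(array, dtype=np.float32)).to(DEVICE)


def to_numpy(tensor):
    """Torch tensor (CPU or GPU) -> Numpy array."""
    return tensor.detach().cpu().numpy()


class SeqClassifier(nn.Module):
    """Hypothetical bidirectional LSTM classifier over (N, T, F) input."""

    def __init__(self, num_features, hidden_size, num_classes, num_layers=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # batch_first=True lets us feed (N, T, F), as in Keras
        self.lstm = nn.LSTM(num_features, hidden_size, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # forward and backward outputs are concatenated, hence 2 * hidden_size
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, x):
        batch_size = x.size(0)
        # hidden and cell state are (num_layers * num_directions, N, H);
        # batch_first does not apply to them
        state_shape = (self.num_layers * 2, batch_size, self.hidden_size)
        h0 = torch.zeros(state_shape, device=x.device)
        c0 = torch.zeros(state_shape, device=x.device)
        out, (h_n, c_n) = self.lstm(x, (h0, c0))   # out: (N, T, 2 * H)
        # use the last timestep only (a simplification for this sketch)
        return self.fc(out[:, -1, :])


model = SeqClassifier(num_features=3, hidden_size=8, num_classes=4).to(DEVICE)
X = np.random.rand(5, 10, 3)             # (N, T, F) batch built in Numpy
logits = to_numpy(model(to_tensor(X)))   # back to Numpy for inspection
print(logits.shape)                      # (5, 4)
```

Keeping the device choice inside the two helpers is one way to avoid scattering GPU checks through the rest of the code.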

A side effect of the more complex network definition is that I have almost standardized on a debugging strategy that I previously only used for Keras custom layers. The idea is that you send a random input signal of the required dimensions into the network and verify that the network returns a tensor of the required dimensions. Very often, this will expose dimensional inaccuracies inside the network, saving you some debugging grief during training.
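A minimal sketch of this sanity check, using a made-up CNN: TinyCNN, its layer sizes, and the helpers conv_out_size and check_output_shape are illustrative assumptions, not code from the notebooks. It also shows the channel-first layout and the by-hand computation of the size after convolution mentioned earlier.

```python
import torch
import torch.nn as nn


def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


class TinyCNN(nn.Module):
    """Hypothetical CNN for square (N, C, H, W) images."""

    def __init__(self, in_channels, image_size, num_classes):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 8, kernel_size=3)
        self.pool = nn.MaxPool2d(kernel_size=2)   # stride defaults to 2
        # the flattened size has to be worked out by hand
        side = conv_out_size(image_size, kernel_size=3)        # after conv
        side = conv_out_size(side, kernel_size=2, stride=2)    # after pool
        self.fc = nn.Linear(8 * side * side, num_classes)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        x = x.view(x.size(0), -1)      # flatten to (N, 8 * side * side)
        return self.fc(x)


def check_output_shape(model, input_shape, expected_shape):
    """Push random input through the model and compare output shapes."""
    model.eval()
    with torch.no_grad():
        output = model(torch.rand(*input_shape))
    actual_shape = tuple(output.shape)
    assert actual_shape == tuple(expected_shape), \
        "expected {}, got {}".format(tuple(expected_shape), actual_shape)
    return actual_shape


net = TinyCNN(in_channels=1, image_size=28, num_classes=10)
print(check_output_shape(net, (4, 1, 28, 28), (4, 10)))   # channel-first input
```

If the size passed to the final linear layer is wrong, the check fails inside forward() with a shape mismatch error before any training has started.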

The other thing I wanted to note is that I deliberately used the epoch/batch style training that I have grown used to with Keras, even though it meant slightly more code. The style I have seen in Pytorch examples is to do a flat number of iterations instead. Now that I think about this some more, this may be a non-issue, since the training loop appears to be common enough so it can be factored out.
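As a rough sketch of what such a factored-out epoch/batch loop might look like (the function name fit, its arguments, and the toy regression data are assumptions for illustration, written against a recent Pytorch release):

```python
import torch
import torch.nn as nn


def fit(model, loss_fn, optimizer, X, y, num_epochs=5, batch_size=32):
    """Epoch/batch style training loop; returns the mean loss per epoch."""
    history = []
    num_samples = X.size(0)
    for epoch in range(num_epochs):
        model.train()
        permutation = torch.randperm(num_samples)   # reshuffle every epoch
        epoch_loss = 0.0
        for start in range(0, num_samples, batch_size):
            idx = permutation[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]
            optimizer.zero_grad()                # 1. reset gradients
            y_pred = model(X_batch)              # 2. forward pass
            loss = loss_fn(y_pred, y_batch)      # 3. compute the loss
            loss.backward()                      # 4. backpropagate
            optimizer.step()                     # 5. update the weights
            epoch_loss += loss.item() * X_batch.size(0)
        history.append(epoch_loss / num_samples)
    return history


# Toy data: learn y = 2x + 1 with a single linear layer
torch.manual_seed(42)
X = torch.rand(256, 1)
y = 2 * X + 1
model = nn.Linear(1, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
history = fit(model, nn.MSELoss(), optimizer, X, y, num_epochs=20)
print("first epoch loss {:.4f}, last epoch loss {:.4f}".format(
    history[0], history[-1]))
```

The returned list plays roughly the same role as the loss values in a Keras History object, so it can be plotted the same way.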

And that is all I had for today. I hope you find my observations useful if and when you, as a Keras or TF developer, decide to pick up Pytorch as well.

Saturday, October 28, 2017

Debugging Keras Networks

Last week a colleague and I were trying to figure out why his network would crash with a NaN (Not a Number) error some 20 or so epochs into training. Lately I have also become more interested in tuning neural networks, so this was a good opportunity for me to suggest fixes based on reasoning about the network. The network itself was built with Keras, like all the other networks our team has built from scratch so far, although we have adapted some third party networks written in Caffe and Tensorflow as well.

Now Keras is great for fast development because of its high level API. It results in very expressive code that reads like how you would actually visualize the network in your head or on a piece of paper. Also, because Keras automates away so many things and provides reasonable default values for many of its parameters, there are fewer things programmers can make mistakes about. For example, this awesome post on How to unit test machine learning code is based on Tensorflow, and while some of the cases mentioned are possible in Keras, they are much less likely.

However, while it is very easy to go from design to code in Keras, it is actually a little harder to work with, compared to say Tensorflow or Pytorch, when things go wrong and you have to figure out what. Fortunately, Keras does offer some tools and hooks that allow you to do this. In this post I talk about some of these that we (re-)discovered for ourselves last week. If you have favorites that I haven't included, please let me know in the comments.

The example I will use throughout this post is a simple fully connected network that I built to recognize MNIST images. The code to train and evaluate this network can be found here. The code to define and compile it is as follows:

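from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(512, activation="relu", input_shape=(784,)))
model.add(Dropout(0.2))  # Dropout layers appear in the summary; the rate is assumed
model.add(Dense(256, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(10, activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])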

The first class of issues I have seen has to do with sizing the intermediate tensors in the network. Keras only asks that you provide the dimensions of the input tensor(s), and it figures out the rest of the tensor dimensions automatically. The flip side of this convenience is that programmers may not realize what the dimensions are, and may make design errors based on this lack of understanding. Keras provides a model.summary() function that prints the output dimensions for each layer. I have found this very useful to get a better intuition about a network.


Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 512)               401920    
dropout_3 (Dropout)          (None, 512)               0         
dense_5 (Dense)              (None, 256)               131328    
dropout_4 (Dropout)          (None, 256)               0         
dense_6 (Dense)              (None, 10)                2570      
Total params: 535,818
Trainable params: 535,818
Non-trainable params: 0

If you need more granular information about the intermediate tensors, take a look at the About Keras layers page. You can get input shapes as well using some code like this:

for layer in model.layers:
    print(layer.name, layer.input.shape, layer.output.shape)

dense_1 (?, 784) (?, 512)
dropout_1 (?, 512) (?, 512)
dense_2 (?, 512) (?, 256)
dropout_2 (?, 256) (?, 256)
dense_3 (?, 256) (?, 10)

Another built-in diagnostic tool that I have been ignoring a bit so far is Tensorboard. Tensorboard was originally developed as part of the Tensorflow ecosystem, and allows Tensorflow developers to log certain things into a Tensorboard log file, which can later be used to visualize these logs graphically. The Keras project provides a way to write to Tensorboard using its TensorBoard callback. I learned to extract loss and other metrics from the output of model.fit() and plot them with matplotlib before the TensorBoard callback was popular, and have continued to use the approach mostly due to inertia. But the TensorBoard callback provides not only these plots, but also the distributions of the weights, biases and gradients. In case of networks where Embeddings and Images are involved, Tensorboard provides visualizations for them as well.

To invoke the Tensorboard callback, it needs to be defined and then declared in the callbacks list in the model.fit() call.

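from keras.callbacks import TensorBoard

# any further TensorBoard options are not shown here
tensorboard = TensorBoard(log_dir=TENSORBOARD_LOGS_DIR)
history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE,
                    callbacks=[..., tensorboard, ...])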

Here is the kind of visualization you can expect on Tensorboard. The best resources I have found on interpreting these visualizations are Dandelion Mané's talk at Tensorflow Developers Summit 2017 and the Tensorboard documentation on Histograms.

As nice as the Tensorboard callback is, it may not work for you all the time. For one thing, it appears that it doesn't work with fit_generator. You may also want to log values which are not meant to be logged with the Tensorboard callback. You can do that by writing your own callback in Keras.

Here is a callback that will capture the L2 norm, mean and standard deviation for each weight tensor in the network for each epoch and at the end of training, dump these values out to screen.

from keras import backend as K
from keras.callbacks import Callback
import numpy as np
def calc_stats(W):
    return np.linalg.norm(W, 2), np.mean(W), np.std(W)

class MyDebugWeights(Callback):
    def __init__(self):
        super(MyDebugWeights, self).__init__()
        self.weights = []
        self.tf_session = K.get_session()
    def on_epoch_end(self, epoch, logs=None):
        for layer in self.model.layers:
            name = layer.name
            for i, w in enumerate(layer.weights):
                w_value = w.eval(session=self.tf_session)
                w_norm, w_mean, w_std = calc_stats(np.reshape(w_value, -1))
                self.weights.append((epoch, "{:s}/W_{:d}".format(name, i), 
                                     w_norm, w_mean, w_std))
    def on_train_end(self, logs=None):
        for e, k, n, m, s in self.weights:
            print("{:3d} {:20s} {:7.3f} {:7.3f} {:7.3f}".format(e, k, n, m, s))

The on_epoch_end and on_train_end are basically event handlers which fire off when the epoch has ended and when training has ended respectively. The Callback interface defines 6 such events, for the beginning and end of batch, epoch and training. See the Keras callbacks documentation for a list and some more examples.

You could use the callback above to train for a small number of epochs and observe how these attributes of the weight tensors change. At some point, I would like to write these values to disk and then read them and chart them maybe using something like Pandas, but my Pandas-fu is not strong enough for that at this time. Here is the output after 2 epochs of training.

Train on 54000 samples, validate on 6000 samples
Epoch 1/2
54000/54000 [==============================] - 4s - loss: 0.2830 - acc: 0.9146 - val_loss: 0.0979 - val_acc: 0.9718
Epoch 2/2
54000/54000 [==============================] - 3s - loss: 0.1118 - acc: 0.9663 - val_loss: 0.0758 - val_acc: 0.9773
  0 dense_1/W_0           28.236  -0.002   0.045
  0 dense_1/W_1            0.283   0.003   0.012
  0 dense_2/W_0           20.631   0.002   0.057
  0 dense_2/W_1            0.205   0.008   0.010
  0 dense_3/W_0            4.962  -0.005   0.098
  0 dense_3/W_1            0.023  -0.001   0.007
  1 dense_1/W_0           30.455  -0.003   0.048
  1 dense_1/W_1            0.358   0.003   0.016
  1 dense_2/W_0           21.989   0.002   0.061
  1 dense_2/W_1            0.273   0.010   0.014
  1 dense_3/W_0            5.282  -0.008   0.104
  1 dense_3/W_1            0.040  -0.002   0.013
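The idea of writing these values to disk and reading them back for charting can be sketched with just the standard library, without Pandas. This is an illustration rather than part of the original code: it takes tuples shaped like the entries in self.weights (a few rows are copied from the output above), and the helper names save_stats and load_norms_by_weight, as well as the file name, are made up.

```python
import csv
import os
import tempfile
from collections import defaultdict

FIELDS = ["epoch", "name", "norm", "mean", "std"]


def save_stats(rows, path):
    """Write (epoch, name, norm, mean, std) tuples to a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        writer.writerows(rows)


def load_norms_by_weight(path):
    """Read the CSV back as {weight name: [norm at epoch 0, epoch 1, ...]}."""
    series = defaultdict(list)
    with open(path, newline="") as f:
        records = sorted(csv.DictReader(f), key=lambda r: int(r["epoch"]))
    for record in records:
        series[record["name"]].append(float(record["norm"]))
    return dict(series)


# A few rows copied from the callback output above
rows = [
    (0, "dense_1/W_0", 28.236, -0.002, 0.045),
    (0, "dense_3/W_0", 4.962, -0.005, 0.098),
    (1, "dense_1/W_0", 30.455, -0.003, 0.048),
    (1, "dense_3/W_0", 5.282, -0.008, 0.104),
]
path = os.path.join(tempfile.mkdtemp(), "weight_stats.csv")
save_stats(rows, path)
norms = load_norms_by_weight(path)
print(norms["dense_1/W_0"])   # [28.236, 30.455]
```

Each list in norms can then be handed to a plotting call such as matplotlib's plt.plot() to see how the norm of a weight tensor evolves over epochs.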

Another thing we can do is to look at the attributes of the outputs at each layer. I initially tried to build this as another callback, but ran into some problems, then decided on this standalone implementation which can be called after every few epochs of training to see if anything has changed. This is adapted from the Keras FAQ.

def get_outputs(inputs, model):
    layer_01_fn = K.function([model.layers[0].input, K.learning_phase()],
                             [model.layers[1].output])
    layer_23_fn = K.function([model.layers[2].input, K.learning_phase()],
                             [model.layers[3].output])
    layer_44_fn = K.function([model.layers[4].input, K.learning_phase()],
                             [model.layers[4].output])
    layer_1_out = layer_01_fn([inputs, 1])[0]
    layer_3_out = layer_23_fn([layer_1_out, 1])[0]
    layer_4_out = layer_44_fn([layer_3_out, 1])[0]
    return layer_1_out, layer_3_out, layer_4_out

out_1, out_3, out_4 = get_outputs(Xtest[0:10], model)
print("out_1", calc_stats(out_1))
print("out_3", calc_stats(out_3))
print("out_4", calc_stats(out_4))

I suspect we can make this more generic by looking up the model.layers data structure, but since it is kind of hard to forecast every kind of model you will build and because you will be doing this once per model, a quick and dirty implementation like the above may be preferable to something nicer. As before, we can rerun this every couple of epochs and get back the L2 norm, mean and standard deviation of the output tensors at each layer, as shown below.

out_1 (15.320195, 0.15846619, 0.36553052)
out_3 (31.983685, 0.52617866, 0.82984859)
out_4 (1.4138139, 0.1, 0.29160777)
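One detail worth keeping in mind when reading these numbers: calc_stats, as defined earlier, calls np.linalg.norm(W, 2), which for a 1-D array is the usual L2 norm but for a 2-D array is the largest singular value. The weights callback flattens each tensor first, while the outputs here are passed in as 2-D arrays, so the two sets of norms are not directly comparable. A small sketch of the difference, using a made-up matrix:

```python
import numpy as np

W = np.array([[3.0, 0.0],
              [0.0, 4.0]])

# 2-D input with ord=2: largest singular value (spectral norm), about 4.0
spectral_norm = np.linalg.norm(W, 2)
# 1-D input with ord=2: square root of the sum of squares, 5.0
flat_l2_norm = np.linalg.norm(np.reshape(W, -1), 2)
# 2-D input with the default ord: Frobenius norm, same value as the flat L2 norm
frobenius_norm = np.linalg.norm(W)

print(spectral_norm, flat_l2_norm, frobenius_norm)
```

Flattening with np.reshape(W, -1) before calling calc_stats, as the callback does, would make the output and gradient norms consistent with the weight norms.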

Finally, we also wanted to figure out what the gradients looked like. The code for this is adapted heavily from Edward Banner's comment in Keras Issue 2226. Like the code for visualizing the outputs, this code also needs to be run after training for a few epochs and compared with the previous values of L2 norm, mean and standard deviation for the gradients at different layers in the network.

def get_gradients(inputs, labels, model):
    opt = model.optimizer
    loss = model.total_loss
    weights = model.weights
    grads = opt.get_gradients(loss, weights)
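    # inputs: data, per-sample weights, targets (labels), learning phase flag
    grad_fn = K.function(inputs=[model.inputs[0],
                                 model.sample_weights[0],
                                 model.targets[0],
                                 K.learning_phase()],
                         outputs=grads)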
    grad_values = grad_fn([inputs, np.ones(len(inputs)), labels, 1])
    return grad_values

gradients = get_gradients(Xtest[0:10], Ytest[0:10], model)
for i in range(len(gradients)):
    print("grad_{:d}".format(i), calc_stats(gradients[i]))

As before, the output below shows the L2 norm, mean and standard deviation of the gradients at each layer. As with the output tensors, we train the network for 2 epochs, then run this block of code. As you can guess, this sort of debugging works really well with an interactive development environment such as Jupyter Notebooks.

grad_0 (1.7725379, 1.1711028e-05, 0.0028093776)
grad_1 (0.17403033, 3.4195516e-05, 0.0076910509)
grad_2 (1.2508092, -7.3888972e-05, 0.003460743)
grad_3 (0.12154519, -0.00047613602, 0.0075816377)
grad_4 (1.5319482, 4.8748915e-11, 0.030318365)
grad_5 (0.10286356, -4.6566129e-11, 0.032528315)

That's all I had for today. The example network I have used here is quite simple, but these same ideas and tools can be used to debug more complex networks as well. These tools were built based on discussions between my colleague and me last week, and the code is available here. I am sure many of you have your own favorite tools and tricks. If so, and you are okay with sharing, please let us know in the comments.

Saturday, September 30, 2017

Serving Keras models using Tensorflow Serving

One of the reasons I have been optimistic about the addition of Keras as an API to Tensorflow is the possibility of using Tensorflow Serving (TF Serving), described by its creators as a flexible, high performance serving system for machine learning models, designed for production environments. There are also some instances of TF Serving being used in production outside Google, as described in Large Scale deployment of TF Serving at Zendesk. In the past I have built custom microservices that wrapped my machine learning models, which could then be consumed by client code in a language agnostic manner. But this is a repetitive task one has to do at some point for each new model being deployed, so the promise of a generic application into which I could just drop my trained model and have it be immediately available for use was too good to pass up, and I decided to check out TF Serving. In this post, I describe my experience, which will hopefully be helpful.

I installed TF Serving from source on my Linux Ubuntu 16.04 based notebook following the instructions on the TF Serving Installation page. This requires you to download the bazel build tool and install the grpc Python module. Compiling takes a while but is uneventful if you have all the prerequisites (listed in the instructions) set up correctly. Once done, the executables are available in the bazel-bin subdirectory under the TF Serving project root.

My initial thought was to create my model using Tensorflow and the embedded Keras API, so that the model would be serialized into the Tensorflow format rather than the HDF5 format that Keras uses. However, it turns out that TF Serving uses yet another format to serialize and export trained models, so you have to convert to it from either format. Hence there is no advantage to the hybrid Keras/TF approach over the pure Keras approach.

In fact, the hybrid Keras/TF approach has the problem of having to explicitly specify the learning_phase. Certain layers such as Dropout and BatchNormalization function differently during training and testing. Keras calls the fit() and predict() functions respectively during training and testing, so it is able to differentiate the necessary behaviors. Tensorflow, however, calls session.run() for both training and testing, so the learning_phase parameter needs to be supplied as an additional boolean placeholder tensor during this call for it to differentiate between the two steps.

I was able to build and train a hybrid CNN Keras/TF model to predict MNIST digits using the Keras API embedded in TF, and save it in a format that TF Serving recognizes and is able to serve up through gRPC, but I was unable to consume the service successfully to do predictions. The error message indicates that the model expects an additional input parameter, which I suspect is the learning_phase. Another issue is that it forces me to input both image and label, an artefact of how I built the model to begin with. The labels need to be passed in because we are computing training accuracy. I didn't end up refactoring this code because I found a way to serve native Keras models directly using TF Serving, which I describe below. For completeness, the links below point to notebooks to build and train the hybrid CNN Keras/TF model, to serialize the resulting TF model to a form suitable for TF Serving, and the client code to consume the service offered by TF Serving.

In case you want to investigate this approach further, there are two open source projects that attempt to build on top of TF Serving. They are keras-serving and Amir Abdi's keras-to-tensorflow. Both start from native Keras models and convert them to TF graphs, so not exactly identical, but their code may give you ideas on how to get around the issues I described above.

Next I tried using a native Keras FCN model that was trained using an existing notebook. For what it is worth, this approach finds support in Francois Chollet's Keras as a simplified interface to TF (slightly outdated) blog post, as well as his Integrating Keras and Tensorflow: the Keras workflow, expanded presentation at the TF Dev Summit 2017. In addition, there are articles such as Exporting deep learning models from Keras to TF Serving which also advocate this approach.

I was able to adapt some code from TF Serving Issue # 310, specifically the suggestions from @tspthomas, in order to read the trained Keras model in HDF5 format, and save it to a format usable by TF Serving. The code to consume the service was adapted from a combination of the example in the TF Serving distribution, plus some online sources. Links for the two notebooks are shown below.

TF Serving allows asynchronous mode operation where requests do not have to wait until the model does the prediction, as well as batched prediction payloads, where the client can send a batch of records for prediction at a time. However, I was only able to make it work synchronously and with one test record at a time. I feel that examples and better documentation would go a long way to increasing the usability (and production use outside Google) of this tool.

Also, as I learn more about TF, I am beginning to question the logic of the Keras move to tf.contrib.keras. Although, to give credit where it is due, my own effort to learn more TF is driven in large part by this move. TF already has a Layers API which is very similar to the Keras abstraction. More in line with the TF way of doing things, these layers have explicit parameters which can be set to indicate the learning phase instead of a magic learning phase that is handled internally. Also, it appears that pure TF and pure Keras models are both handled well with TF Serving, so I don't see a need for a hybrid model anymore.

Overall, TF Serving appears to be powerful, at least for Keras and TF models. At some point, I hope the TF Serving team decides to make it more accessible to casual users by providing better instructions and examples, or possibly higher level APIs. But until then, I think I will continue with my custom microservices approach.

Thursday, September 14, 2017

EMNLP 2017: Trip Report

Last week, I was at EMNLP 2017 in Copenhagen. EMNLP is short for Empirical Methods in Natural Language Processing, and is one of the conferences of The Association for Computational Linguistics (ACL) that brings together NLP professionals from academia and industry to talk about their research and present their findings to each other. The conference itself ran for 3 days - Saturday September 9 to Monday September 11 - but it was preceded by two days of tutorials and workshops, which I also attended. This is my trip report.


My main takeaway from the conference is that the NLP community is still heavily invested in deep learning. My frame of reference is NAACL 2015, the last ACL conference I attended, where the majority of the papers were about word embeddings and their applications. Papers this year continue to use word embeddings. But in addition, there are many other kinds of embeddings, such as character and subword embeddings to represent word morphology, and phrase embeddings that marry the capabilities of NLP parsers to represent sentence structure. Both offer improvements over the Bag of Words approach or even combining word vectors through Bidirectional LSTMs to produce sentence (or higher abstraction) vectors.

In addition to the Bidirectional LSTM approach, many novel architectures were presented, including CRF-LSTMs, Graph LSTMs, and CNN-LSTMs. These modifications exploit the structure of natural language by providing extra information about phrase structure, emphasizing nearby words, or taking advantage of hierarchy imposed by the application (such as comment threads). Google's efforts with Machine Translation gave us the seq2seq model, but since then its encoder-decoder architecture with optional attention has been adapted for many other NLP tasks that involve sequence inputs and outputs. In addition, Google briefly talked about their Transformer architecture, which is likely to become more important in coming years. Other interesting ideas are the use of adversarial techniques and joint learning to improve the accuracy of difficult tasks.

One other important trend I saw was the broader adoption of Reinforcement Learning (RL) techniques. I mostly think of RL in the context of game playing AIs, which implies that there is a physics engine somewhere to provide automated reinforcement during training. In the context of NLP, this physics engine seems to be a search engine, optionally coupled with domain dependent retrieval rules. Applications taking advantage of RL seem to be mostly related to Learning to Rank (L2R), as far as I could see.

Finally, there were a few papers using more traditional techniques, such as the use of probabilistic graphical models or other Bayesian techniques, or based on clustering and topic modeling. In keeping with the focus on deep learning, almost all of them use word (and optionally character) embeddings to augment their feature set.

Structurally, the conference was organized into three parallel tracks, organized around themes such as Syntax, Semantics, Information Extraction, Machine Translation, Machine Learning, Language Generation, Discourse and Summarization, Multilingual NLP, Language Grounding, Multimodal NLP, Linguistic Theory, Computational Social Science, Sentiment Analysis, Dialog, and NLP Applications. In addition, there were 7 tutorials and 14 workshops held on the first 2 days, perhaps based on the premise that an attendee would either be a newbie or an expert, so you would find something to occupy your day. You could do a maximum of 3 tutorials (0.5 day per tutorial) or 2 workshops (1 day per workshop) if you attended the first 2 days.

There were also tons of poster sessions throughout the three days of the conference, and some of the ideas in these posters were really cool. One thing I found a bit annoying was that the posters would be up for a limited time and they would get changed after each session. This meant that you either missed a few talks if you wanted to do justice to the posters, or tried to take in as many posters as you could during the coffee and lunch breaks. I chose to do the latter, except one time when a speaker failed to show up.

What follows is a brief description of the talks I attended, which probably falls into the TL;DR category unless you want my personal take on the talks. Links to all papers presented at EMNLP can be found here (you might need ACL membership in the future, but they seem to be readily available now). In addition, the entire event was live streamed and the recordings are here. I am guessing that the recordings of the individual talks will eventually make it to a Youtube channel once the editing process is completed. I will update the post with the links once that happens. If you find the Youtube videos first, please let me know in the comments and I will update.

Tutorials and Workshops

Tutorial: Acquisition, Representation and Usage of Concept Hierarchies - by Marius Pasca (abstract)
A brief but very representative overview of techniques used to extract and represent entities in IS-A relationships, and various techniques for using these concepts in search applications. I could identify a few techniques I knew about, but there were quite a few I did not, so it was very useful for me.

Tutorial: Graph based text representations: Boosting text mining, NLP and information retrieval with graphs - by Fragkiskos D Malliaros and Michalis Vazirgiannis (abstract)
Very comprehensive coverage of graph techniques for NLP, using graph of words for information retrieval, text summarization using k-core decomposition, using graph based document representations for clustering, subgraph extraction and frequent subgraph mining techniques. Again, the benefit to me was the breadth of coverage.

Tutorial: Memory Augmented Neural Networks (MANN) for Natural Language Processing - by Caglar Gulcehre and Sarath Chandar (abstract)
Despite the success of LSTMs for solving NLP problems, there are still some complex tasks that need the ability to store and retrieve information on demand from an external store because they need to look at information that is too far in the past (or future) for an LSTM's hidden vector to provide. The resulting architecture is the Neural Turing Machine (NTM), and this tutorial discusses NTMs in quite a bit of depth.

Workshop: evaluating vector space representations in NLP
I attended part of this on the second day. Highlights of the workshop for me were the talks by Yejin Choi from University of Washington, Jacob Uszkoreit from Google and Kyunghyun Cho (of GRU fame) of New York University. Yejin spoke about the need for extracting tactile information from the physical world and using it in reasoning. Jacob talked at length about the Transformer Architecture in connection with Machine Translation (and Language Understanding), and Cho spoke about using character models.

Conference Day #1

Keynote: Physical Simulation, Learning and Language - by Nando de Freitas
Nando de Freitas spoke about the need to build systems that can learn to learn from the environment like a general AI, and described a framework that allows researchers to simulate a physical world at faster than real time, that has led to many improvements in robotics. He argued for the need for something similar in the area of language research.

Monolingual Phrase Alignment on Parse Forests - by Yuki Arase and Jun'ichi Tsujii
Presenter described a tree-based method to detect and align phrases using paraphrase statements. In the process they have developed a gold dataset of parse trees and phrase alignments that they offer to fellow researchers.

Heterogeneous Supervision for Relation Extraction: A representation learning approach - by Liyuan Liu, Xiang Ren, Qi Zhu, Shi Zhi, Huan Gui, Heng Ji and Jiawei Han.
Presenter described a method to learn relation extraction using domain heuristics and knowledge bases. The resulting supervision is quite noisy, and the conflicts are resolved using reliability ranking of sources and context embeddings, much like word disambiguation using word vectors.

Mimicking word embeddings using subword RNNs - by Yuval Pinter, Robert Guthrie, Jacob Eisenstein
Presenter described results from their system MIMICK against other subword embedding systems such as Char2Tag. MIMICK can generate word embeddings for Out of Vocabulary (OOV) words from their subword (character) units, much like word vectors are used to generate sentence vectors using the BiLSTM approach. Code for MIMICK can be found at the link.

Entity Linking for Queries by Searching Wikipedia Sentences - by Chuanqi Tan, Furu Wei, Pengjie Ren, Weifeng Lv and Ming Zhou
Extracting entities from queries can make disambiguation easier. System uses a search index to retrieve sentences containing the query terms and does entity extraction on the resulting sentences to find entities in the query. For word disambiguation, the presenters used the system supWSD (supWSD code), which is a supervised WSD system, and provides a toolkit and trained models.

End to end neural coreference resolution - by Kenton Lee, Luheng He, and Luke Zettlemoyer
Presenter describes an end-to-end system (e2e-coref) similar to Question Answering (QA) systems, where a document is broken up into spans using standard parsing techniques. All spans are treated as mention spans, and a network is used to detect similar mentions and cluster them. Code for the e2e-coref system can be found at this link.

Neural Machine Translation with word prediction - by Rongxiang Weng, Shujian Huang, Zaixiang Zheng, Xin-Yu Dai, and Jiajun Chen
The presenter suggests a change to the standard seq2seq model used for machine translation, to also include all previous predictions at each stage in the decoder sequence, and use the top K words as the vocabulary. They have found that it improves translation performance.

Affinity preserving random walk for multi-document summarization - by Kexiang Wang, Tianyu Liu, Zhifang Sui, and Baobao Chang
Output of MDS is a short text that summarizes all the documents in the MD collection. Presenter describes a graph based method that collects the entities from all documents in the collection, and then executes a random walk similar to Pagerank. Once the process converges, the important entities of the graph can be converted to the MDS.

Google's multilingual Neural Machine Translation System: Enabling Zero Shot Translation - by Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viegas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean
Google's NMT model is well known. Presenter describes results of various experiments, including creating a combined multi-language model and creating models for CJK and European language groups, and how they perform across different groups of languages. It turns out that multi-language NMT results in higher performance. Also, an NMT model trained on one language family is more effective (for certain language families) on its own family than on others.

DeepPath: A reinforcement learning method for knowledge graph reasoning - by Wenhan Xiong, Thien Hoang, and William Yang Wang
Presenter describes their DeepPath system, which uses Reinforcement Learning (RL) and Knowledge Graph embeddings to learn to find the most promising relation in a KG to extend the path. Code for DeepPath is available here.

Task Oriented Query Reformulation with Reinforcement Learning - by Rodrigo Nogueira and Kyunghyun Cho
Presenter describes a RL based Neural Network (NN) that reformulates complex user queries to maximize the number of relevant documents returned. The reward function used is the document recall. Code for the Query Reformulator is available here.
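To make the reward concrete, here is a minimal sketch of document recall as it might be computed for one query. The function name recall_reward and the document ids are made up for illustration; this is not taken from the Query Reformulator code.

```python
def recall_reward(retrieved_ids, relevant_ids):
    """Fraction of the relevant documents that appear among the retrieved ones."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # no relevant documents known, so there is nothing to reward
    hits = len(relevant & set(retrieved_ids))
    return hits / len(relevant)


# Hypothetical example: the reformulated query retrieves more of the relevant set.
relevant_docs = ["d1", "d2", "d3", "d4"]
reward_original = recall_reward(["d1", "d7"], relevant_docs)
reward_reformulated = recall_reward(["d1", "d2", "d3", "d9"], relevant_docs)
```

An RL agent trained against this signal would be pushed toward reformulations that pull in more of the relevant documents, regardless of where they rank.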

Sentence Simplification with Deep Reinforcement Learning - by Xingxing Zhang and Mirella Lapata
Presenter describes a RL based DL system for sentence simplification called DRESS (Deep REinforcement Sentence Simplification). Reward function used is the SARI metric which rewards similarity, simplicity and correct grammar. Code for DRESS is available here.

Learning how to active learn: A Deep Reinforcement Learning Approach - by Meng Fang, Yuan Li and Trevor Cohn.
Presenter describes their system which uses RL to learn a policy to do Named Entity Recognition (NER) in one language, and apply the same policy to do NER in another language. The policy learned is based on labeling functions developed against a small dataset in the original language. Code for the RL system is here.

Conference Day #2

Keynote: Towards more universal language technology: unsupervised learning from speech - by Sharon Goldwater
Sharon makes a case for unsupervised and semi-supervised learning and describes her work on unsupervised learning in the area of speech. Results are not very good but the task is very hard. Some of her ideas may be directly transferable to language, but she makes the case that the NLP community should also invest effort in unsupervised techniques going forward.

A structured learning approach to temporal relation extraction - by Qiang Ning, Zhili Feng and Dan Roth.
Presenter describes the difficulty with manually annotating temporal relations in text, and proposes a graph based approach with verbs connected by candidate temporal relation edges, computing pairwise KL divergence between the nodes and comparing to KL divergence between two entities with uniform distribution.

Importance sampling for unbiased on-demand evaluation of knowledge base population - by Arun Chaganty, Ashwin Paranjpe, Percy Liang and Christopher D Manning
Presenter discusses how NER system evaluation is inherently biased in that it penalizes new findings from new NER systems. He proposes a way to sample from the predictions of the new NER system and verify that these findings are valid using crowdsourcing. Resulting approach is cheaper than naive crowdsourcing and removes bias in evaluation. Project code is here, and here is the Online Demo.

PACRR: A position aware neural IR model for relevance matching - by Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo.
Presenter describes a NN based model that mimics the relevance formula in a search engine. Since the input embeddings are word vector based, the intuition is that the resulting model will be better at capturing semantic similarity. Good results are already available for unigram matching; this work explores the effect of position of context words on relevance matching. Code for PACRR is here.

Globally Normalized Reader - by Jonathan Raiman and John Miller.
Presenter describes their GNR system that does QA using iterative search instead of the typical bidirectional attention mechanism. Results are back propagated through beam search, and the approach is found to produce better results against the SQuAD dataset. The team has also used a novel data augmentation method for their training, and they offer the dataset as well to interested researchers. Code for the Globally Normalized Reader can be found here.

Encoding sentences for graph convolutional networks for semantic role labeling - by Diego Marcheggiani and Ivan Titov.
Presenter describes a Graph Convolutional Network (GCN) for modeling syntax dependency graphs, and their use as sentence encoders for Semantic Role Labeling (SRL) applications. They note that GCNs are complementary to LSTMs, and stacking them together results in improved results in identifying predicate-argument structures in a sentence, compared to the previous state of the art LSTM-based SRL model. Code for the NN based SRL system is here.

Neural Semantic Parsing with Type Constraints for Semi-structured tables - by Jayant Krishnamurthy, Pradeep Dasigi and Matt Gardner.
Presenter describes their model which learns how to answer compositional questions on semi-structured Wikipedia tables. Input is the natural language question and output is a well-typed logical form for navigating and looking up the answer. Dataset used is the Wikitable Questions.

Joint Concept Learning and Semantic Parsing from Natural Language Explanations - by Shashank Srivastava, Igor Labutov, and Tom Mitchell.
Presenter describes their system that uses certain features of text explanations to identify concepts. For example, the presence of "bank account number" in an explanation about phishing. Label functions are generated from these texts and used to identify a concept.

Opinion Recommendation using a Neural Model - by Zhongqing Wang and Yue Zhang.
Presenter describes their system that jointly generates a custom review score and a review for a given user, given his other reviews and scores. Task is novel, hence a new name Opinion Recommendation. Inputs are 3 NNs which model the reviews about the product, the user, and the user neighborhood (other users). These are concatenated using multi-hop attention (which seems to be iterative dot products) and form the input to another NN that outputs the score and the generated review using a standard encoder-decoder architecture.

Accurate Supervised and Semi-supervised machine reading for long documents - by Daniel Hewlett, Llion Jones and Alexandre Lacoste.
Presenter describes a standard QA network, the novel bit is that documents are split into equal sized parts (best results found with chunk size of 30 words) and encoded using RNNs in parallel. The network then attends over these separate encodings and reduces them to a single encoding, which is then decoded into an answer using a sequence decoder.
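The chunking step is simple enough to sketch. The helper below is hypothetical and assumes the document has already been tokenized into words; the 30-word default comes from the talk, and the last chunk is simply left shorter rather than padded (padding would be needed before batching into an RNN).

```python
def split_into_chunks(tokens, chunk_size=30):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# A made-up 75-word document yields two full chunks and one partial chunk.
document = ["word%d" % i for i in range(75)]
chunks = split_into_chunks(document)
chunk_lengths = [len(chunk) for chunk in chunks]
```

Each chunk can then be encoded independently, which is what allows the encoding to run in parallel.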

Adversarial Examples for Evaluating Reading Comprehension Systems - by Robin Jia and Percy Liang.
Presenter describes how adding extra information to documents in a QA scenario can lead to a QA system giving the wrong answer. This is similar to the adversarial examples used in vision. They then propose an evaluation scheme for QA systems using this idea to measure if the QA system is demonstrating true language understanding versus just learning how to do pattern matching.

Joint modeling of Topics, Citations and Topical Authority in Academic Corpora - by Jooyeon Kim, Dongwoo Kim and Alice Oh.
Presenter introduces Latent Topical Authority Indexing (LTAI) which they show is a better way to expose topic signals from papers and authors than current techniques. LTAI can be used to find an expert on a topic, or to compare topical authority among multiple authors. The model used is a Probabilistic Graphical Model (PGM) which uses Expectation Maximization (EM) to compute the LTAI metric.

Identifying semantic intentions from revisions in wikipedia - by Diyi Yang, Aaron Halfaker, Robert Kraut, and Eduard Hovy.
Presenter talks about a 13 category taxonomy of semantic intention behind Wikipedia edits, and describes a classifier that can predict the intention given the user's edit history. This also opens up avenues for research into behavior of Wikipedia editors.

Conference Day #3

Keynote: Processing the language of policing for Improving Police-Community Relations. - by Dan Jurafsky
Dan Jurafsky speaks about his recent research into how the language policemen use exhibits a racial bias. Data for the research comes from 1 month of video footage from body cameras worn by Oakland PD officers. He also gave a brief update on his ongoing research into food and sociology. The theme of the talk was the need for NLP to do cross-disciplinary research so it can have a greater impact.

Part of Speech tagging for Twitter with Adversarial Neural Networks - by Tao Gui, Qi Zhang, Haoran Huang, Minlong Peng and Xuanjing Huang
Presenter describes how combining POS tags from a high resource corpus such as WSJ, as well as character embeddings from both WSJ and Twitter, enables learning of POS tags on Twitter using an adversarial discriminator setup.

Learning Generic Sentence Representation using Convolutional Neural Networks - by Zhe Gan, Yunchen Pu, Ricardo Henao, Chunyuan Li, Xiaodong He, and Lawrence Carin
Presenter proposes a new encoder-decoder approach to learn distributed sentence/paragraph representations using a CNN-LSTM or hierarchical CNN-LSTM network instead of using LSTMs for both encoder and decoder as is done currently. He presents empirical evidence showing that performance is as good or better than using LSTM for both encoder and decoder.

Conversation Modeling on Reddit using a Graph Structured LSTM - by Victoria Zayats and Mari Ostendorf
Presenter describes her project to capture keywords/topics for popular vs unpopular Reddit comments (objective is to find what makes some comments popular vs not popular for a given subreddit). Since Reddit comments are hierarchical, a Graph LSTM is used, which builds the hidden component of the input from both the parent comment as well as the previous comment. Learned a nice method of quantization by selecting the median of each quantile of the score distribution as the threshold.

Learning what to read: Focused Machine Reading - by Enrique Noriega-Atala, Marco A Valenzuela-Escárcega, Clayton Morrison, and Mihai Surdeanu
Presenter describes project to capture statements of the form "A related-to B given context C" on the Pubmed OpenAccess (OA) dataset. A pair of entities is chosen and queries are fired against the corpus to find all possible entity pairs. RL is used to score the best path between the two specified entities, which results in approximately 40% of the path lookups needed by exhaustive search.

DOC: Deep Open Classification of text documents - by Lei Shu, Hu Xu, and Bing Liu
This talk is unique in that it makes the open world assumption: instead of a document being classified into 1 of N classes, the document can also be not one of the N classes, as well as belong to more than one of N classes. Approach is to create one-vs-rest classifiers for each class, and then softmax across their scores to find the classes. Thresholding to detect the class(es) to assign to each document involves fitting a Gaussian to a histogram of scores for positive labels for each class, and then considering the mean + a multiple of the standard deviation as the threshold.

Exploiting Cross Sentence Context for Neural Machine Translation - by Longyue Wang, Zhaopeng Tu, Andy Way and Qun Liu
Presenter describes a novel idea of computing the context (3 previous sentences to current sentence being translated) and using it as additional input to the encoder, decoder, or even for attention during decoding. Experiments show that the additional context results in better scores on their test data. Code for the project is available here.

Cross Lingual Transfer learning for POS Tagging without cross lingual resources - by Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, and Eric Fosler-Lussier
Yet another example of adversarial learning in the NLP space, where the POS tags for the resource rich language are used to train a discriminator which is then used to train a generator to generate POS tags for the resource poor language. Code for the tagger is available here.

A Simple Regularization based Algorithm for learning Cross-Domain Word Embeddings - by Wei Yang, Wei Lu and Vincent Zheng.
Presenter describes building a graph using entities from a given domain as the nodes, and the edges weighted using the cosine distance between their vector representations. Then an iterative algorithm such as Pagerank is run until convergence. Cross domain word embeddings are learned by running analogies between selected words in one domain and single words in the other.
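The graph-plus-random-walk idea can be sketched with NumPy as below. This is only an illustration of the general technique, not the presenters' algorithm: it assumes the edge weights are cosine similarities (clipped at zero so they can act as transition weights), uses a standard damping factor of 0.85, and runs power iteration until the scores stop changing. The function names and the random vectors are invented for the example.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of a (n, d) matrix."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def pagerank(weights, damping=0.85, tol=1e-8, max_iter=200):
    """Power iteration over a weighted graph; returns one score per node."""
    n = weights.shape[0]
    w = np.clip(weights, 0.0, None).astype(float)  # keep non-negative edges only
    np.fill_diagonal(w, 0.0)                       # no self loops
    col_sums = w.sum(axis=0)
    has_edges = col_sums > 0
    # Column-stochastic transitions; nodes without edges jump uniformly.
    transition = np.where(has_edges, w / np.where(has_edges, col_sums, 1.0), 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1.0 - damping) / n + damping * (transition @ rank)
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank


# Hypothetical entity vectors: 5 entities with 8-dimensional embeddings.
entity_vectors = np.random.default_rng(0).normal(size=(5, 8))
scores = pagerank(cosine_similarity_matrix(entity_vectors))
```

The resulting scores sum to 1, so they can be read as the share of time a random walker spends on each entity.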

Best Paper: Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps - by Tobias Falke and Iryna Gurevych.
Presenter describes a system that extracts entities from one or more documents, sets them up in a graph based on cosine distance between their word vectors, and finds the most important entities. These entities are then sent to human experts in a crowdsourcing arrangement who construct the summaries manually.

The other 3 papers in the best paper category were around the language learned when two automated agents engage in dialog, predicting depression and suicide risk in online forums, and how to correct for dataset bias for machine learning models.


Presentations were predominantly from academia, which is kind of expected, since most of the papers tend to push the envelope of the state of the art, something academia tends to do. Among US universities, Stanford and Carnegie Mellon are well known for their NLP, so as expected, there were quite a few presentations from there. University of Washington also had quite a few good submissions. I also saw a lot of presentations from Chinese universities, so it looks like NLP and Deep Learning are quite popular in China. From industry, I saw two presentations from Google and one from Baidu.

That's pretty much all I have for this week. Thanks to an introduction from a colleague, I made a few awesome friends among other attendees who were local to Copenhagen. Thanks to their help, I got to eat authentic Italian pizza and spicy Indian food in Copenhagen at an area right next to the conference, but which I am pretty sure I wouldn't have found on my own :-).

Monday, August 21, 2017

Improving Sentence Similarity Predictions using Attention and Regression

If you have been following my last two posts, you will know that I've been trying (unsuccessfully so far) to prove to myself that the addition of an attention layer does indeed make a network better at predicting similarity between a pair of inputs. I have had good results with various self attention mechanisms for a document classification system, but I just couldn't replicate a similar success with a similarity network.

Upon visualizing the intermediate outputs of the network, I observed that the attention layer seemed to be corrupting the data - so instead of highlighting specific portions of the input as might be expected, the attention layer output actually seemed to be more uniform than the input. This led me to suspect that either my model or the attention mechanism was somehow flawed or ill-suited for the problem I was trying to solve.

My similarity network was trying to treat the similarity problem as a classification problem, i.e., it would predict one of 6 discrete "similarity classes". However, the training data provided the similarities as continuous floating point numbers between 0 and 5. The examples I had seen before for similar Siamese network architectures (such as this example from the Keras distribution) typically minimize a continuous function such as Contrastive Loss. So I decided to change my network to a regression network, more in keeping with the data provided and examples I had seen. This new network would learn to predict a similarity score by minimizing the Mean Squared Error (MSE) between label and prediction. The optimizer used was RMSProp.

With the classification network, the validation loss (categorical cross-entropy) on even the baseline (no attention) network kept oscillating and led me to believe that the network was overfitting on the training data. The learning curve from the baseline regression network (w/o Attn) looked much better, so the change certainly appears to be a step in the right direction. The network was evaluated using Root Mean Square Error (RMSE) between label and prediction on a held-out test set. In addition, I also computed the Pearson correlation and Spearman (rank) correlation coefficients between label and predicted values in the test set.
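For reference, the three evaluation metrics can be computed with plain NumPy as in the sketch below. This is not the code from the notebooks (scipy.stats also provides pearsonr and spearmanr); the function names and the toy labels and predictions are made up. Spearman correlation is computed as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Square Error between labels and predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def average_ranks(values):
    """1-based ranks, where tied values receive the mean of their ranks."""
    values = np.asarray(values, float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy similarity labels (0 to 5) and hypothetical model predictions.
labels = [0.0, 1.0, 2.5, 4.0, 5.0]
preds = [0.5, 0.8, 2.0, 4.5, 4.4]
test_rmse = rmse(labels, preds)
test_pearson = pearson(labels, preds)
test_spearman = spearman(labels, preds)
```

RMSE measures how far off the predicted scores are, while the two correlations measure whether the predictions move with the labels, linearly (Pearson) or in rank order (Spearman).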

In addition, I decided to experiment with some different Attention implementations I found on the Tensorflow Neural Machine Translation (NMT) page - the additive style proposed by Bahdanau, and the multiplicative style proposed by Luong. The equations here are in the context of NMT, so I modified the equations a bit for my use case. In addition, I found that the attention style I was using from the Parikh paper is called the dot product style, so I included that too below with similar notation, for comparison. Note that the difference in "style" pertains only to how the alignment matrix α is computed, as shown below.

The alignment matrix is combined with the input signal to form the context vector, and the context vector is concatenated with the input signal and weighted with a learned weight and passed through a tanh layer.
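As a rough NumPy sketch of the three scoring styles and the combination step described above: the score functions follow the standard dot, multiplicative (Luong) and additive (Bahdanau) forms, adapted to a pair of sentence matrices a and b. The weight names (w, w1, w2, v, w_c), the shapes and the random inputs are assumptions made for illustration, and this is not the code of the AttentionMMA or AttentionMMM layers.

```python
import numpy as np


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def dot_scores(a, b):
    return a @ b.T                                 # (m, n): a_i . b_j


def multiplicative_scores(a, b, w):
    return a @ w @ b.T                             # Luong style: a_i^T W b_j


def additive_scores(a, b, w1, w2, v):
    # Bahdanau style: v^T tanh(W1 a_i + W2 b_j) for every pair (i, j)
    hidden = np.tanh((a @ w1)[:, None, :] + (b @ w2)[None, :, :])  # (m, n, k)
    return hidden @ v                              # (m, n)


def attend(a, b, scores, w_c):
    alpha = softmax(scores, axis=-1)               # alignment: rows sum to 1
    context = alpha @ b                            # (m, d) weighted sum of b
    combined = np.concatenate([context, a], axis=-1)  # (m, 2d)
    return alpha, np.tanh(combined @ w_c)          # learned weight, then tanh


rng = np.random.default_rng(0)
m, n, d, k = 4, 5, 3, 6                            # toy sentence lengths and sizes
a, b = rng.normal(size=(m, d)), rng.normal(size=(n, d))
scores = additive_scores(a, b, rng.normal(size=(d, k)),
                         rng.normal(size=(d, k)), rng.normal(size=k))
alpha, attended = attend(a, b, scores, rng.normal(size=(2 * d, d)))
```

Swapping additive_scores for dot_scores or multiplicative_scores changes only how alpha is computed; the context and tanh steps stay the same, which matches the note that the styles differ only in the alignment matrix.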

One other thing to note is that unlike my original attention implementation, the alignment matrix in these equations is formed out of the raw inputs rather than the ones scaled through a tanh layer. I did try using scaled inputs with the dot style attention (MM-dot(s)) - this was my original attention layer without any change - but the results weren't as good as dot style attention without scaling (MM-dot).

For the two new attention styles, I added two new custom Keras Layers AttentionMMA for the additive (Bahdanau) style, and AttentionMMM for the multiplicative (Luong) style. These are called from the model with additive attention (MM-add) and model with multiplicative attention (MM-mult) notebooks respectively. The RMSE, Pearson and Spearman correlation coefficients for each of these models, each trained for 10 epochs, are summarized in the chart below.

As you can see, the dot style attention doesn't seem to do too well against the baseline, regardless of whether the input signal is scaled or not. However, both the additive and multiplicative attention styles result in a significantly lower RMSE and higher correlation coefficients than the baseline, with additive attention giving the best results.

That's all I have for today. I hope you found it interesting. There are many variations among Attention mechanisms, and I was happy to find two that worked well with my similarity network.

Saturday, August 12, 2017

Visualizing Intermediate Outputs of a Similarity Network

In my last post, I described an experiment where the addition of a self attention layer helped a network do better at the task of document classification. However, attention didn't seem to help for another experiment where I was trying to predict sentence similarity. I figured it might be useful to visualize the outputs of the network at each stage, in order to see where exactly it was failing. This post describes that work. The visualizations did give me pointers to what was happening, and I tried some of these ideas out, but so far I haven't been able to get a network with attention to perform better than a network without it at the similarity task.

The diagram below illustrates the structure of the network whose outputs I was trying to visualize. The network is built to predict the similarity between two sentences on a 6 point scale. The training data comes from the Semantic Similarity Task Dataset for 2012, and consists of sentence pairs and associated similarity scores (floating point numbers) between 0 and 5. For this experiment, I quantize the labels into 6 different similarity classes, and attempt to predict that value. Word vectors are looked up from pretrained GloVe embeddings for each word in the sentence pair, then the sequence of word vectors is sent through a Bidirectional LSTM to produce an encoded sentence matrix for each sentence in the pair. The sentence matrices are then sent through an attention layer, which first creates an alignment matrix between the two sentence matrices, then uses the alignment matrix to determine how much to weight each part of the two sentences when producing the output vector. The output vector is then fed into a Fully Connected network to do the final prediction.
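One simple way to quantize the labels is sketched below. It assumes rounding to the nearest integer (halves rounded up), which may not be exactly what the notebook does, followed by one-hot encoding for a categorical cross-entropy loss. The function names are invented for the example.

```python
import numpy as np


def quantize_scores(scores, num_classes=6):
    """Map float similarity scores in [0, 5] to integer classes 0..5."""
    scores = np.asarray(scores, dtype=float)
    classes = np.floor(scores + 0.5).astype(int)   # round half up (assumption)
    return np.clip(classes, 0, num_classes - 1)


def one_hot(classes, num_classes=6):
    """One row per label, with a 1 in the column of its class."""
    return np.eye(num_classes, dtype=int)[classes]


labels = quantize_scores([5.0, 2.4, 1.25, 0.0, 3.5])
targets = one_hot(labels)
```

Note how much information is lost: scores of 2.4 and 1.6 would land in the same class, which is one argument for treating the task as regression instead.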

I wanted to visualize the outputs at each stage of the network to see how they differed at each stage. So I first selected three sentence pairs with label similarity values approximately equidistant along the label range. For each sentence pair, I computed (a) the similarity matrix for the input (one-hot) vector sequences of the two sentences, (b) their word vector sequences after embedding, (c) the sentence matrices after encoding, (d) the alignment between the two sentence matrices, and (e) the similarity matrix between the aligned sentences. Each of these matrices is represented as a heat map for visualization. In addition, (f) I also used the alignment between the two embeddings to compute the weighted sentence matrix to see if that made any difference.
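As an illustration of step (a), the sketch below builds the one-hot similarity matrix for the first sentence pair shown later in this post. The tokenization (lowercasing, period as its own token) and the helper names are assumptions; the matrix is just the product of the two one-hot sequences, so an entry is 1 exactly when the two words are identical.

```python
import numpy as np


def one_hot_sequence(tokens, vocab):
    """Encode a token list as a (len(tokens), len(vocab)) one-hot matrix."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(tokens), len(vocab)))
    for row, token in enumerate(tokens):
        matrix[row, index[token]] = 1.0
    return matrix


def similarity_matrix(left, right):
    """Dot product of every left row with every right row."""
    return left @ right.T


left_tokens = "a man is riding a bicycle .".split()
right_tokens = "a man is riding a bike .".split()
vocab = sorted(set(left_tokens) | set(right_tokens))
sim = similarity_matrix(one_hot_sequence(left_tokens, vocab),
                        one_hot_sequence(right_tokens, vocab))
```

The same similarity_matrix function applies unchanged to the embedded or encoded sequences, where the entries become graded rather than 0 or 1.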

Each heatmap also has a crude measure of "similarity" that divides the sum of the diagonal elements by the sum of all the elements.
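In NumPy terms, the measure is just the trace divided by the total, as in this small sketch (the function name is made up, and the zero-sum guard is an added assumption to avoid dividing by zero):

```python
import numpy as np


def diagonal_ratio(matrix):
    """Sum of the diagonal elements divided by the sum of all elements."""
    matrix = np.asarray(matrix, dtype=float)
    total = matrix.sum()
    if total == 0:
        return 0.0  # guard against an all-zero heatmap
    return float(np.trace(matrix) / total)


ratio_identity = diagonal_ratio(np.eye(4))        # all mass on the diagonal
ratio_uniform = diagonal_ratio(np.ones((4, 4)))   # mass spread evenly
```

A perfectly diagonal heatmap scores 1.0 and a uniform one scores 1/n, so the measure rewards word-by-word alignment in the same positions and penalizes matches that are off the diagonal.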

The sequence of heatmaps below shows the outputs for a network trained for 10 epochs with a training accuracy of 0.8, validation accuracy of 0.7 and test accuracy of 0.4. The sentence pair that generated these outputs is as follows:

Left: A man is riding a bicycle.
Right: A man is riding a bike.
Score: 5.0

Next, we consider a slightly less similar (according to the score label) sentence pair as follows:

Left: A woman is playing the flute.
Right: A man is playing the flute.
Score: 2.4

Finally, we consider a pair of sentences which are even more dissimilar.

Left: A man is cutting a potato.
Right: A woman is cutting a tomato.
Score: 1.25

In all cases, the heatmap for the input is self-explanatory, since common words are down the diagonal. The output of the embedding step also kind of makes sense, since bicycle and bike in the first case, man and woman in the second and third cases, and potato and tomato in the third case show a non-zero resemblance. In all cases, the resulting sentence matrix (output of the encoding step) results in a blurry blob indicating the similarity between the two sentences in the pair. I did expect the alignments to be more meaningful - in all 3 cases above, there doesn't seem to be a meaningful pattern. Since the attention output is dependent on the alignment, there is no meaningful pattern there either.

Computing the alignment against the embedding output and weighting the encoding output to produce the attention output results in slightly more meaningful patterns. For example, in all 3 cases, the terminating period seems to be unimportant. Strangely, common words seem to hold less importance than I would have expected. Sadly, though, my crude measure of similarity does not match up with the labels, regardless of which pair of outputs I use for my alignment.

Here is the notebook that renders these visualizations, and here is the notebook to build the pre-trained model on which the visualization is based. I used a combination of model.predict() to generate outputs of sub-networks, as well as extracting the trained weights from the model, and applying numpy operations to get results.

That's all I have for today, hope you found it interesting.