Monday, August 21, 2017

Improving Sentence Similarity Predictions using Attention and Regression

If you have been following my last two posts, you will know that I've been trying (unsuccessfully so far) to prove to myself that the addition of an attention layer does indeed make a network better at predicting similarity between a pair of inputs. I have had good results with various self attention mechanisms for a document classification system, but I just couldn't replicate that success with a similarity network.

Upon visualizing the intermediate outputs of the network, I observed that the attention layer seemed to be corrupting the data - so instead of highlighting specific portions of the input as might be expected, the attention layer output actually seemed to be more uniform than the input. This led me to suspect that either my model or the attention mechanism was somehow flawed or ill-suited for the problem I was trying to solve.

My similarity network was trying to treat the similarity problem as a classification problem, i.e., it would predict one of 6 discrete "similarity classes". However, the training data provided the similarities as continuous floating point numbers between 0 and 5. The examples I had seen before for similar Siamese network architectures (such as this example from the Keras distribution) typically minimize a continuous function such as contrastive loss. So I decided to change my network to a regression network, more in keeping with the data provided and the examples I had seen. This new network would learn to predict a similarity score by minimizing the Mean Squared Error (MSE) between label and prediction. The optimizer used was RMSProp.

With the classification network, the validation loss (categorical cross-entropy) on even the baseline (no attention) network kept oscillating, which led me to believe that the network was overfitting on the training data. The learning curve from the baseline regression network (w/o Attn) looked much better, so the change certainly appears to be a step in the right direction. The network was evaluated using Root Mean Square Error (RMSE) between label and prediction on a held-out test set. I also computed the Pearson correlation and Spearman (rank) correlation coefficients between label and predicted values in the test set.
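For reference, here is a minimal NumPy sketch of the three evaluation measures. This is not the code from my notebooks; the function names and the sample labels and predictions are made up for illustration, and the Spearman coefficient is computed as the Pearson correlation of the average ranks.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical labels and predictions, for illustration only.
labels = np.array([5.0, 2.4, 1.25, 3.8, 0.5])
preds = np.array([4.6, 2.9, 1.0, 3.1, 0.9])

print("RMSE:", rmse(labels, preds))
print("Pearson:", pearson(labels, preds))
print("Spearman:", spearman(labels, preds))
```

Lower RMSE is better, while higher correlation is better. The Spearman coefficient only looks at the ordering of the predictions, so it can be high even when the predicted scores are off by a consistent amount.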

I also decided to experiment with some different Attention implementations I found on the Tensorflow Neural Machine Translation (NMT) page - the additive style proposed by Bahdanau, and the multiplicative style proposed by Luong. The equations there are in the context of NMT, so I modified them a bit for my use case. I also found that the attention style I was using from the Parikh paper is called the dot product style, so I included that too below with similar notation, for comparison. Note that the difference in "style" pertains only to how the alignment matrix α is computed, as shown below.

The alignment matrix is combined with the input signal to form the context vector, and the context vector is concatenated with the input signal and weighted with a learned weight and passed through a tanh layer.
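To make the three styles concrete, here is a rough NumPy sketch of how each one could score a pair of sentence matrices, followed by the shared step that builds the context vector and the tanh output. It follows the general forms given on the NMT page (dot: a·b, multiplicative: a·W·b, additive: v·tanh(W1·a + W2·b)) rather than my exact Keras layers. All function names, shapes and the random weights standing in for learned parameters are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def score_dot(a, b):
    """Dot product style. a: (m, d), b: (n, d) -> scores: (m, n)."""
    return a @ b.T

def score_multiplicative(a, b, W):
    """Multiplicative (Luong) style, with W of shape (d, d)."""
    return a @ W @ b.T

def score_additive(a, b, W1, W2, v):
    """Additive (Bahdanau) style, with W1, W2 of shape (d, k) and v of shape (k,)."""
    hidden = np.tanh((a @ W1)[:, None, :] + (b @ W2)[None, :, :])  # (m, n, k)
    return hidden @ v  # (m, n)

def attend(a, b, scores, Wc):
    """Turn scores into an alignment matrix, then into the attention output."""
    alpha = softmax(scores, axis=-1)                  # each row sums to 1 over positions in b
    context = alpha @ b                               # (m, d) weighted sum of rows of b
    combined = np.concatenate([context, a], axis=-1)  # (m, 2d) context joined with input
    return alpha, np.tanh(combined @ Wc)              # Wc: (2d, d_out)

# Random stand-ins for two encoded sentences and the learned weights.
rng = np.random.default_rng(0)
m, n, d, k = 4, 5, 8, 6
a, b = rng.normal(size=(m, d)), rng.normal(size=(n, d))
W = rng.normal(size=(d, d))
W1, W2, v = rng.normal(size=(d, k)), rng.normal(size=(d, k)), rng.normal(size=k)
Wc = rng.normal(size=(2 * d, d))

alpha_dot, out_dot = attend(a, b, score_dot(a, b), Wc)
alpha_mult, out_mult = attend(a, b, score_multiplicative(a, b, W), Wc)
alpha_add, out_add = attend(a, b, score_additive(a, b, W1, W2, v), Wc)
print(alpha_add.shape, out_add.shape)
```

Only the scoring function changes between styles; the softmax, context vector and tanh steps are identical, which is why swapping styles amounts to swapping one function.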

One other thing to note is that unlike my original attention implementation, the alignment matrix in these equations is formed out of the raw inputs rather than the ones scaled through a tanh layer. I did try using scaled inputs with the dot style attention (MM-dot(s)) - this was my original attention layer without any change - but the results weren't as good as dot style attention without scaling (MM-dot).

For the two new attention styles, I added two new custom Keras layers: AttentionMMA for the additive (Bahdanau) style, and AttentionMMM for the multiplicative (Luong) style. These are called from the model with additive attention (MM-add) and model with multiplicative attention (MM-mult) notebooks respectively. The RMSE, Pearson and Spearman correlation coefficients for each of these models, each trained for 10 epochs, are summarized in the chart below.

As you can see, the dot style attention doesn't seem to do too well against the baseline, regardless of whether the input signal is scaled or not. However, both the additive and multiplicative attention styles result in a significantly lower RMSE and higher correlation coefficients than the baseline, with additive attention giving the best results.

That's all I have for today. I hope you found it interesting. There are many variations among Attention mechanisms, and I was happy to find two that worked well with my similarity network.

Saturday, August 12, 2017

Visualizing Intermediate Outputs of a Similarity Network

In my last post, I described an experiment where the addition of a self attention layer helped a network do better at the task of document classification. However, attention didn't seem to help for another experiment where I was trying to predict sentence similarity. I figured it might be useful to visualize the outputs of the network at each stage, in order to see where exactly it was failing. This post describes that work. The visualizations did give me pointers to what was happening, and I tried some of these ideas out, but so far I haven't been able to get a network with attention to perform better than a network without it at the similarity task.

The diagram below illustrates the structure of the network whose outputs I was trying to visualize. The network is built to predict the similarity between two sentences on a 6 point scale. The training data comes from the Semantic Similarity Task Dataset for 2012, and consists of sentence pairs and an associated similarity score (a floating point number) between 0 and 5. For this experiment, I quantize the labels into 6 different similarity classes, and attempt to predict that class. Word vectors are looked up from pretrained GloVe embeddings for each word in the two sentences of the pair, then the sequence of word vectors is sent through a Bidirectional LSTM to produce an encoded sentence matrix for each sentence in the pair. The sentence matrices are then sent through an attention layer, which first creates an alignment matrix between the two sentence matrices, then uses the alignment matrix to determine how much to weight each part of the two sentences when producing the output vector. The output vector is then fed into a Fully Connected network to do the final prediction.
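As an illustration of the quantization step, here is one way the continuous scores could be mapped to 6 classes. The post does not spell out the exact scheme, so rounding to the nearest integer (with halves rounded up) is an assumption, as are the function names.

```python
import numpy as np

def quantize_scores(scores, num_classes=6):
    """Map scores in [0, 5] to integer classes 0..5 by rounding to nearest.

    Assumption: nearest-integer rounding with halves rounded up; other
    binning schemes are equally plausible.
    """
    scores = np.asarray(scores, dtype=float)
    classes = np.floor(scores + 0.5).astype(int)
    return np.clip(classes, 0, num_classes - 1)

def one_hot(classes, num_classes=6):
    """One-hot encode class ids for use with categorical cross-entropy."""
    return np.eye(num_classes)[classes]

classes = quantize_scores([5.0, 2.4, 1.25])
print(classes)           # [5 2 1]
print(one_hot(classes))  # one row per score, 6 columns
```

Note how 2.4 and 1.25 land only one class apart even though the raw scores differ by more than a point, which is the kind of information a classification target throws away.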

I wanted to visualize the outputs at each stage of the network to see how they differed from stage to stage. So I first selected three sentence pairs with label similarity values approximately equidistant along the label range. For each sentence pair, I computed (a) the similarity matrix for the input (one-hot) vector sequences of the two sentences, (b) the similarity matrix for their word vector sequences after embedding, (c) the similarity matrix for the sentence matrices after encoding, (d) the alignment between the two sentence matrices, and (e) the similarity matrix between the aligned sentences. Each of these matrices is represented as a heat map for visualization. In addition, (f) I also used the alignment between the two embeddings to compute the weighted sentence matrix to see if that made any difference.

Each heatmap also has a crude measure of "similarity" that divides the sum of the diagonal elements by the sum of all the elements.
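Here is a small NumPy sketch of that measure, applied to the one-hot stage of a sentence pair. The diagonal ratio itself is as described above; the dot product similarity, the lowercasing and whitespace tokenization, and the function names are assumptions made for the example.

```python
import numpy as np

def one_hot_sequence(tokens, vocab):
    """Encode a token list as a (len(tokens), len(vocab)) one-hot matrix."""
    index = {word: i for i, word in enumerate(vocab)}
    mat = np.zeros((len(tokens), len(vocab)))
    for row, tok in enumerate(tokens):
        mat[row, index[tok]] = 1.0
    return mat

def similarity_matrix(left, right):
    """Dot product similarity between every left row and every right row."""
    return left @ right.T

def diagonal_ratio(matrix):
    """Sum of the diagonal elements divided by the sum of all elements."""
    total = matrix.sum()
    return float(np.trace(matrix) / total) if total != 0 else 0.0

left_tokens = "a man is riding a bicycle .".split()
right_tokens = "a man is riding a bike .".split()
vocab = sorted(set(left_tokens) | set(right_tokens))

sim = similarity_matrix(one_hot_sequence(left_tokens, vocab),
                        one_hot_sequence(right_tokens, vocab))
print(diagonal_ratio(sim))  # 0.75
```

Six of the seven aligned positions match, and the repeated "a" contributes two off-diagonal matches, giving 6 / 8 = 0.75. The measure rewards word-for-word positional overlap, so it is crude by design and will penalize paraphrases that reorder words.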

The sequence of heatmaps below shows the outputs for a network trained for 10 epochs with a training accuracy of 0.8, validation accuracy of 0.7 and test accuracy of 0.4. The sentence pair that generated these outputs is as follows:

Left: A man is riding a bicycle.
Right: A man is riding a bike.
Score: 5.0

Next, we consider a slightly less similar (according to the score label) sentence pair as follows:

Left: A woman is playing the flute.
Right: A man is playing the flute.
Score: 2.4

Finally, we consider a pair of sentences which are even more dissimilar.

Left: A man is cutting a potato.
Right: A woman is cutting a tomato.
Score: 1.25

In all cases, the heatmap for the input is self-explanatory, since common words are down the diagonal. The output of the embedding step also kind of makes sense, since bicycle and bike in the first case, man and woman in the second and third cases, and potato and tomato in the third case show a non-zero resemblance. In all cases, the resulting sentence matrix (output of the encoding step) results in a blurry blob indicating the similarity between the two sentences in the pair. I did expect the alignments to be more meaningful - in all 3 cases above, there doesn't seem to be a meaningful pattern. Since the attention output is dependent on the alignment, there is no meaningful pattern there either.

Computing the alignment against the embedding output, and then using it to weight the encoding output to produce the attention output, results in slightly more meaningful patterns. For example, in all 3 cases, the terminating period seems to be unimportant. Strangely, common words seem to hold less importance than I would have expected. Sadly, though, my crude measure of similarity does not match up with the labels, regardless of which pair of outputs I use for my alignment.

Here is the notebook that renders these visualizations, and here is the notebook to build the pre-trained model on which the visualization is based. I used a combination of model.predict() to generate outputs of sub-networks, as well as extracting the trained weights from the model, and applying numpy operations to get results.

That's all I have for today, hope you found it interesting.