Introduction
This week’s work continues the training of the NMT model with embeddings, to see whether the embeddings really help as train_steps grows larger. (Last week I only finished the 20,000-step training and saw a promising trend, but then reached the GPU usage limit.)
Another task is to use a BERT model to obtain the sentence representation in place of the Universal Sentence Encoder, and to compare these two sentence-embedding methods.
Embedding vs. Non-embedding Training Results
Figure: BLEU score of the NMT model with embeddings
Figure: BLEU score of the NMT model without embeddings
It is encouraging to see that embeddings can help the NMT model improve its learning performance.
BERT
The goal here is to represent a variable-length sentence as a fixed-length vector so that we can calculate a similarity score between two sentences. We have already tried the Universal Sentence Encoder in the past few weeks, and this week I tried another option: BERT.
A pre-trained BERT model can be used directly to produce the sentence vector. In my experiment, I used the model with 12 layers and 768-dimensional hidden vectors.
To extract the sentence vector, there is a trick: since the BERT model adds the special [CLS] and [SEP] tokens around the input, the fixed-length vector is obtained by pooling the token vectors of a hidden layer, and the layer used is the second-to-last one rather than the last.
The reason for taking the second-to-last layer is that the last layer is too close to the target functions of pre-training (i.e., the masked language model and next sentence prediction) and may therefore be biased toward those targets, while the earlier layers may not have learned enough about the sentence.
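As a rough illustration of this pooling strategy (I obtained the vectors through bert-as-service; the HuggingFace transformers code, the model name, and the mean pooling below are my own assumptions for the sketch, not the exact setup behind the numbers):

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint: a 12-layer BERT base model with 768-dimensional hidden states.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def sentence_vector(sentence: str) -> torch.Tensor:
    """Mean-pool the second-to-last hidden layer into a fixed 768-d sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt")   # [CLS] and [SEP] are added here
    with torch.no_grad():
        outputs = model(**inputs)
    second_to_last = outputs.hidden_states[-2]          # shape: [1, seq_len, 768]
    return second_to_last.mean(dim=1).squeeze(0)        # shape: [768]
```

bert-as-service exposes the same idea through its pooling options; see [1] for how the served model pools the hidden states.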
Concerning the
BERT vs USE
The experiment is done on my Mac Pro, with a 4-core Intel Core i5 CPU; the timings below are the seconds taken to compute the embeddings for each sentence pair.
Pair 1: "who is the spouse of < a > ?" vs. "who is the wife of < a > ?"
    BERT similarity: 0.99015945 (24.93 s)
    USE similarity: 0.9243937 (51.10 s)
    Edit distance: 3

Pair 2: "What is the height of ?" vs. "Where does the height of come from?"
    BERT similarity: 0.9815554 (24.92 s)
    USE similarity: 0.85236233 (48.35 s)
    Edit distance: 4

Pair 3: "She hates me." vs. "She loves me."
    BERT similarity: 0.9860806 (23.96 s)
    BERT(-1) similarity (last layer): 0.9647469
    USE similarity: 0.678635 (49.70 s)
    Edit distance: 1
We can see that the cosine similarities calculated from the BERT representations are not robust (always > 0.9) compared to USE. The only advantage is that BERT extracts the sentence vector much faster than USE, thanks to the optimizations of the bert-as-service project [1].
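For completeness, here is a sketch of how such a pairwise comparison can be put together, reusing the sentence_vector helper from the sketch above and loading USE from TF Hub. The TF Hub URL, the timing code, and the word-level edit distance are illustrative assumptions, not the exact script behind the table:

```python
import time
import numpy as np
import tensorflow_hub as hub

# Universal Sentence Encoder from TF Hub (assumed module version).
use_embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def cosine_similarity(u, v):
    u, v = np.asarray(u, dtype=np.float32), np.asarray(v, dtype=np.float32)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def edit_distance(a, b):
    """Word-level Levenshtein distance (one possible way to count the edits above)."""
    a, b = a.split(), b.split()
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (wa != wb))
    return dp[-1]

def compare(s1, s2):
    # sentence_vector is the BERT helper sketched earlier in this report.
    t0 = time.time()
    bert_sim = cosine_similarity(sentence_vector(s1), sentence_vector(s2))
    bert_time = time.time() - t0

    t0 = time.time()
    u1, u2 = use_embed([s1, s2]).numpy()   # two 512-d USE vectors
    use_time = time.time() - t0
    use_sim = cosine_similarity(u1, u2)

    return {"BERT": (bert_sim, bert_time),
            "USE": (use_sim, use_time),
            "edit distance": edit_distance(s1, s2)}

print(compare("She hates me.", "She loves me."))
```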
The reason for not using
Another thing to point out is why the cosine similarity of BERT sentence vectors is always higher than 0.9. I found it suggested that a representation which is decent for a downstream task is not necessarily meaningful in terms of cosine distance, since cosine distance treats the space as linear and weights all dimensions equally. As a result, if we want to use cosine distance anyway, we should focus on the rank rather than the absolute value.
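A tiny sketch of what "use the rank, not the absolute value" means in practice (the helper name and setup are just for illustration):

```python
import numpy as np

def rank_by_similarity(query_vec, candidate_vecs):
    """Order candidate indices from most to least similar to the query.
    Only the ordering is used; the raw cosine values are ignored."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))
```

Ranking this way sidesteps the calibration problem, although it does not change the fact that non-fine-tuned BERT vectors separate sentences less sharply than USE.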
For more explanation of the BERT pre-trained model, please refer to Hanxiao’s documentation [1].
Conclusion
As we can see, NMT with embeddings really helps the learning of translation. In addition, generating sentence vectors with a non-fine-tuned BERT model is faster but much weaker than the Universal Sentence Encoder on the sentence-similarity task, so I think we should continue using the USE model while finding a way to alleviate the time consumption (even though I only ran it on the CPU, about 50 seconds for one pair of sentence embeddings is a bit too long).
I have also checked whether the fine-tuned T5 model could directly give us a sequence embedding, but found that it is well encapsulated and only certain functions can be accessed.
Reference
[1] Hanxiao’s documentation for bert-as-service:
https://bert-as-service.readthedocs.io/en/latest/section/faq.html#how-do-you-get-the-fixed-representation-did-you-do-pooling-or-something