Introduction
This week’s work continues the training of the NMT model with embeddings, to see whether the embeddings really help as train_steps grows larger. (Last week I only finished the 20,000-step training and saw a promising trend, but then reached the GPU usage limit.)
Another task is to use a BERT model to obtain the sentence representation in place of the Universal Sentence Encoder, and to compare these two sentence-embedding methods.
Embedding vs. Non-embedding Training Results
Figure: BLEU score of the NMT model with embeddings
Figure: BLEU score of the NMT model without embeddings
It is encouraging to see that embeddings can help the NMT model improve its learning performance.
BERT
The goal here is to represent a variable-length sentence as a fixed-length vector so that we can calculate a similarity score between two sentences. We have already tried the Universal Sentence Encoder in the past few weeks, and this week I tried another option: BERT.
A pre-trained BERT model can be used directly to produce the sentence vector. In my experiment, I used the model with 12 layers and 768-dimensional hidden vectors.
To extract the sentence vector, there is a trick: since the BERT model adds the special [CLS] and [SEP] tokens around the input, the fixed-length vector is obtained by pooling the token vectors of a hidden layer, and the layer used is the second-to-last one rather than the last.
The reason for taking the second-to-last layer is that the last layer is too close to the target functions of pre-training (i.e., the masked language model and next sentence prediction) and may therefore be biased toward those targets, while the earlier layers may not have learned enough about the sentence.
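As a rough illustration of this pooling strategy (I obtained the vectors through bert-as-service; the HuggingFace transformers code, the model name, and the mean pooling below are my own assumptions for the sketch, not the exact setup behind the numbers):

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint: a 12-layer BERT base model with 768-dimensional hidden states.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def sentence_vector(sentence: str) -> torch.Tensor:
    """Mean-pool the second-to-last hidden layer into a fixed 768-d sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt")   # [CLS] and [SEP] are added here
    with torch.no_grad():
        outputs = model(**inputs)
    second_to_last = outputs.hidden_states[-2]          # shape: [1, seq_len, 768]
    return second_to_last.mean(dim=1).squeeze(0)        # shape: [768]
```

bert-as-service exposes the same idea through its pooling options; see [1] for how the served model pools the hidden states.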
Concerning the
BERT vs USE
The experiment is done on my Mac Pro, with a 4-core Intel Core i5 CPU; the timings below are the seconds taken to compute the embeddings for each sentence pair.
Pair 1: "who is the spouse of < a > ?" vs. "who is the wife of < a > ?"
    BERT similarity: 0.99015945 (24.93 s)
    USE similarity: 0.9243937 (51.10 s)
    Edit distance: 3

Pair 2: "What is the height of ?" vs. "Where does the height of come from?"
    BERT similarity: 0.9815554 (24.92 s)
    USE similarity: 0.85236233 (48.35 s)
    Edit distance: 4

Pair 3: "She hates me." vs. "She loves me."
    BERT similarity: 0.9860806 (23.96 s)
    BERT(-1) similarity (last layer): 0.9647469
    USE similarity: 0.678635 (49.70 s)
    Edit distance: 1
We can see that the cosine similarities calculated from the BERT representations are not robust (always > 0.9) compared to USE. The only advantage is that BERT extracts the sentence vector much faster than USE, thanks to the optimizations of the bert-as-service project [1].
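For completeness, here is a sketch of how such a pairwise comparison can be put together, reusing the sentence_vector helper from the sketch above and loading USE from TF Hub. The TF Hub URL, the timing code, and the word-level edit distance are illustrative assumptions, not the exact script behind the table:

```python
import time
import numpy as np
import tensorflow_hub as hub

# Universal Sentence Encoder from TF Hub (assumed module version).
use_embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def cosine_similarity(u, v):
    u, v = np.asarray(u, dtype=np.float32), np.asarray(v, dtype=np.float32)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def edit_distance(a, b):
    """Word-level Levenshtein distance (one possible way to count the edits above)."""
    a, b = a.split(), b.split()
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (wa != wb))
    return dp[-1]

def compare(s1, s2):
    # sentence_vector is the BERT helper sketched earlier in this report.
    t0 = time.time()
    bert_sim = cosine_similarity(sentence_vector(s1), sentence_vector(s2))
    bert_time = time.time() - t0

    t0 = time.time()
    u1, u2 = use_embed([s1, s2]).numpy()   # two 512-d USE vectors
    use_time = time.time() - t0
    use_sim = cosine_similarity(u1, u2)

    return {"BERT": (bert_sim, bert_time),
            "USE": (use_sim, use_time),
            "edit distance": edit_distance(s1, s2)}

print(compare("She hates me.", "She loves me."))
```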
The reason for not using
Another thing to point out is why the cosine similarity of BERT sentence vectors is always higher than 0.9. I found it suggested that a representation which is decent for a downstream task is not necessarily meaningful in terms of cosine distance, since cosine distance treats the space as linear and weights all dimensions equally. As a result, if we want to use cosine distance anyway, we should focus on the rank rather than the absolute value.
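A tiny sketch of what "use the rank, not the absolute value" means in practice (the helper name and setup are just for illustration):

```python
import numpy as np

def rank_by_similarity(query_vec, candidate_vecs):
    """Order candidate indices from most to least similar to the query.
    Only the ordering is used; the raw cosine values are ignored."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))
```

Ranking this way sidesteps the calibration problem, although it does not change the fact that non-fine-tuned BERT vectors separate sentences less sharply than USE.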
For more explanation of the BERT pre-trained model, please refer to Hanxiao’s documentation [1].
Conclusion
As we can see, NMT with embeddings really helps the learning of translation. In addition, generating sentence vectors with a non-fine-tuned BERT model is faster but much weaker than the Universal Sentence Encoder on the sentence-similarity task, so I think we should continue using the USE model while finding a way to alleviate the time consumption (even though I only ran it on the CPU, about 50 seconds for one pair of sentence embeddings is a bit too long).
I have also checked whether the fine-tuned T5 model could directly give us a sequence embedding, but found that it is well encapsulated and only certain functions can be accessed.
Reference
[1] Hanxiao’s documentation for bert-as-service:
https://bert-as-service.readthedocs.io/en/latest/section/faq.html#how-do-you-get-the-fixed-representation-did-you-do-pooling-or-something