Evaluation of Paraphraser
A brief summary of the results:
| Model | Embeddings | Paraphraser | Accuracy |
|---|---|---|---|
| 512 units, 2-layer LSTM | 50d GloVe | No | 0.365 |
| 512 units, 2-layer LSTM | 50d GloVe | Yes | 0.483 |
| 512 units, 2-layer LSTM | 300d GloVe | Yes | 0.495 |
Evaluation with lowercasing and separated punctuation
Last Saturday, Tommaso pointed out that the test set should also be lowercased and tokenized before being sent to the NMT model. This preprocessing is illustrated by the examples below:
1. What is the time zone of Salt Lake City?
   -> what is the time zone of salt lake city ?
2. What is the highest mountain in Germany?
   -> what is the highest mountain in germany ?
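A minimal Python sketch of this preprocessing (my own illustration, not necessarily the exact tokenizer used in the pipeline):

```python
import re

def preprocess(question):
    """Lowercase a question and separate punctuation into its own tokens."""
    question = question.lower()
    # Insert spaces around common punctuation so each mark becomes a token.
    question = re.sub(r"([?!.,;:])", r" \1 ", question)
    return " ".join(question.split())

print(preprocess("What is the time zone of Salt Lake City?"))
# -> what is the time zone of salt lake city ?
```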
The effect is immediate: with the same trained model, the raw and the preprocessed form of each question yield different SPARQL queries:
1. select var_x where brack_open dbr_Kusugal dbo_timeZone var_x brack_close
   -> select var_x where brack_open dbr_Fish_Lake_Valley dbo_timeZone var_x brack_close
2. select var_x where brack_open dbr_Saint-Aubin-Routot dbo_highestMountain var_x brack_close
   -> select var_x where brack_open dbr_Prague_10 dbo_lowestMountain var_x brack_close
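For readability, here is a small sketch of how these encoded tokens map back to SPARQL syntax (the mapping is inferred from the token names; the actual NSpM decoder may differ):

```python
def decode(encoded):
    """Turn NSpM-encoded tokens back into (approximate) SPARQL syntax."""
    simple = {"var_x": "?x", "brack_open": "{", "brack_close": "}"}
    out = []
    for tok in encoded.split():
        if tok in simple:
            out.append(simple[tok])
        elif tok.startswith(("dbr_", "dbo_")):
            # dbr_Kusugal -> dbr:Kusugal, dbo_timeZone -> dbo:timeZone
            out.append(tok[:3] + ":" + tok[4:])
        else:
            out.append(tok)
    return " ".join(out)

print(decode("select var_x where brack_open dbr_Kusugal dbo_timeZone var_x brack_close"))
# -> select ?x where { dbr:Kusugal dbo:timeZone ?x }
```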
For the first question, both queries correctly match the property timeZone, but the second one captures more information about the entity Salt Lake City: it matches an entity with "lake" in its name (dbr_Fish_Lake_Valley), while the first one does not capture any entity information.
This should be kept in mind in future research: even small differences in punctuation and casing can change the model's output considerably, because tokens map strictly to vocabulary entries, and a single modification affects the final prediction.
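To make the vocabulary point concrete, here is a toy illustration (the vocabulary below is hypothetical, not the model's actual one): without preprocessing, capitalized words and fused punctuation fall outside the vocabulary.

```python
# Hypothetical toy vocabulary built from lowercased, punctuation-separated text.
vocab = {"what", "is", "the", "time", "zone", "of", "salt", "lake", "city", "?"}

for token in "What is the time zone of Salt Lake City?".split():
    print(token, "->", token if token in vocab else "<unk>")
# "What", "Salt", "Lake" and the fused "City?" all map to <unk>,
# so the encoder sees far less information than with the preprocessed input.
```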
Comparison with the original NSpM model without Paraphraser
As the QALD-9 test set is not suitable for this evaluation (zero or very low F1-score), Tommaso proposed using the test set split from the NSpM-generated data itself. I trained a model with 512 units, a 2-layer LSTM, and 50-dimensional GloVe embeddings, using the same ontology terms as the model I trained last week but without the Paraphraser; the BLEU and accuracy curves are shown below:
512 units, 2-layer LSTM, 50d GloVe, without Paraphraser: BLEU score curve
512 units, 2-layer LSTM, 50d GloVe, without Paraphraser: accuracy curve
The final accuracy on the test set is 4317/11816 = 0.365.
For a fair comparison, I used a model with the same meta-parameters, this time with the Paraphraser. The accuracy on the same test set is 5711/11816 = 0.483.
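For reference, the accuracy here is simply the fraction of test queries whose predicted SPARQL matches the gold query (exact string match is my assumption of the criterion):

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference query."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

# e.g. 5711 exact matches out of 11816 test queries:
# 5711 / 11816 = 0.483
```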
Comparison between different dimensions of GloVe embeddings
I used only 50-dimensional GloVe embeddings in the last weeks' training; Tommaso proposed that more dimensions could be helpful. As the largest available pre-trained GloVe model has 300 dimensions, I compared 300d GloVe with 50d GloVe, using the same train/validation/test split of the data during training:
512 units, 2-layer LSTM, 50d GloVe, with Paraphraser: BLEU score curve
512 units, 2-layer LSTM, 50d GloVe, with Paraphraser: accuracy curve
512 units, 2-layer LSTM, 300d GloVe, with Paraphraser: BLEU score curve
512 units, 2-layer LSTM, 300d GloVe, with Paraphraser: accuracy curve
We can see that training with 300d GloVe is a little faster than with 50d GloVe: it reaches a BLEU score of 90 about 10k steps earlier than the 50d model. The final results are close, with 300d GloVe slightly higher.
I also tested the 300d GloVe model on the test set used above; it gives an accuracy of 5844/11816 = 0.495, again slightly higher than 50d GloVe.
num_units vs. Embedding Dimension
As far as I understand from searching the Internet, the documentation, and the NMT source code, the input size is not necessarily equal to num_units. The code (code1, code2) shows that the input size equals num_units only when no embedding files are given; if pre-trained embeddings are used, the input size equals the embedding dimension (here 50 or 300).
num_units in fact controls the dimension of the hidden state in each LSTM cell, and the mismatch between the input size and the hidden-state size is handled simply by a weight matrix. This explains why increasing num_units while keeping the same embedding dimension can also improve the performance of the NMT model.
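A minimal NumPy sketch of one LSTM step (my own illustration, not the NMT code itself) makes this explicit: the gate weight matrix has shape (input_size + num_units, 4 * num_units), so it absorbs any difference between the embedding dimension and the hidden size.

```python
import numpy as np

embedding_dim = 300  # input size: dimension of the pre-trained GloVe vectors
num_units = 512      # hidden-state size of each LSTM cell

# One weight matrix for the concatenated [input, forget, cell, output] gates.
W = np.random.randn(embedding_dim + num_units, 4 * num_units) * 0.01
b = np.zeros(4 * num_units)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    """One step: a 300-d input still produces a 512-d hidden state."""
    z = np.concatenate([x_t, h_prev]) @ W + b
    i, f, g, o = np.split(z, 4)
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h_t = sigmoid(o) * np.tanh(c_t)
    return h_t, c_t

h, c = np.zeros(num_units), np.zeros(num_units)
h, c = lstm_step(np.random.randn(embedding_dim), h, c)
print(h.shape)  # (512,): the hidden size is num_units, independent of the input size
```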
Conclusion
We can see that the final accuracy of NSpM with the Paraphraser increases by 0.118 (0.365 -> 0.483) compared with the model without the Paraphraser, so we can say that the Paraphraser helps the model with questions that, like those in QALD-9, do not differ too much.
The mis-linked entities could be addressed by introducing more examples per entity, but the number of examples is limited by both the DBpedia Knowledge Base itself and the EXAMPLES_PER_TEMPLATE variable. Since we cannot change the Knowledge Base, we can only increase EXAMPLES_PER_TEMPLATE, but this does not guarantee that the number of examples increases for every entity. This should be discussed at the next meeting.