Evaluation of Paraphraser
A brief summary of the results:
| Model | Embeddings | Paraphraser | Accuracy |
|---|---|---|---|
| 512 units, 2 layer LSTM | 50d GloVe | No | 0.365 |
| 512 units, 2 layer LSTM | 50d GloVe | Yes | 0.483 |
| 512 units, 2 layer LSTM | 300d GloVe | Yes | 0.495 |
Evaluation with lowercasing and separated punctuation
Last Saturday, Tommaso pointed out that the test set should also be lower-cased and tokenized before being sent to the NMT model. This kind of preparation can be illustrated by the examples below:
1. What is the time zone of Salt Lake City?
   -> what is the time zone of salt lake city ?
2. What is the highest mountain in Germany?
   -> what is the highest mountain in germany ?
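A minimal sketch of this preprocessing (an illustration of the idea, not necessarily the exact script used in the pipeline) could look like this:

```python
import re

def preprocess(question: str) -> str:
    """Lower-case the question and separate punctuation into standalone tokens."""
    question = question.lower()
    # put spaces around punctuation marks so they become separate tokens
    question = re.sub(r"([?!.,;:])", r" \1 ", question)
    # collapse any repeated whitespace
    return " ".join(question.split())

print(preprocess("What is the time zone of Salt Lake City?"))
# -> what is the time zone of salt lake city ?
```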
The effect is immediate: fed with the raw and the preprocessed questions, the same trained model gives two different SPARQL queries:
1. select var_x where brack_open dbr_Kusugal dbo_timeZone var_x brack_close
   -> select var_x where brack_open dbr_Fish_Lake_Valley dbo_timeZone var_x brack_close
2. select var_x where brack_open dbr_Saint-Aubin-Routot dbo_highestMountain var_x brack_close
   -> select var_x where brack_open dbr_Prague_10 dbo_lowestMountain var_x brack_close
We can see that both of them correctly match the property timeZone, and the second one captures more information about the entity Salt Lake City: it matches an entity with "Lake" in its name, while the first one does not capture any of this information.
This point should be kept in mind in future research: even small differences in punctuation and casing can change the model's behaviour considerably, because tokens strictly correspond to vocabulary entries, and a single modification can affect the final prediction.
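For reference, the encoded queries above can be decoded back into real SPARQL by reversing the token encoding. A rough sketch (not the actual NSpM interpreter, and covering only the tokens that appear above):

```python
# Map the encoded tokens back to SPARQL syntax; dbr_/dbo_ are prefixes for
# DBpedia resources and ontology properties respectively.
TOKEN_MAP = {"brack_open": "{", "brack_close": "}", "var_x": "?x"}
PREFIX_MAP = {"dbr_": "dbr:", "dbo_": "dbo:"}

def decode(encoded: str) -> str:
    out = []
    for tok in encoded.split():
        if tok in TOKEN_MAP:
            tok = TOKEN_MAP[tok]
        else:
            for pre, repl in PREFIX_MAP.items():
                if tok.startswith(pre):
                    tok = repl + tok[len(pre):]
                    break
        out.append(tok)
    return " ".join(out)

print(decode("select var_x where brack_open dbr_Fish_Lake_Valley dbo_timeZone var_x brack_close"))
# -> select ?x where { dbr:Fish_Lake_Valley dbo:timeZone ?x }
```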
Comparison with the original NSpM model without Paraphraser
As the qald-9 test set is not suitable for this evaluation (0 or very low F1-score), Tommaso proposed to use the test set split from the NSpM data itself. I trained a model with 512 units, a 2-layer LSTM and 50-dimensional GloVe embeddings on the same ontology terms as the model I trained last week, but without the Paraphraser. The BLEU/accuracy curves are shown below:
512 units, 2 layer LSTM, 50d GloVe, without Paraphraser BLEU score curve
512 units, 2 layer LSTM, 50d GloVe, without Paraphraser accuracy curve
The final accuracy on the test set is 4317/11816, which gives 0.365.
For a fair comparison, I used the model with the same hyperparameters, this time with the Paraphraser. The accuracy on the same test set is 5711/11816, which gives 0.483.
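As a side note, these accuracy figures presumably correspond to the fraction of test questions whose generated query matches the reference query; a sketch under that assumption (the actual evaluation script may use a different matching criterion):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predicted queries that exactly match the reference query."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

# e.g. 5711 matching queries out of 11816 test questions -> ~0.483
```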
Comparison between different dimensions of GloVe embeddings
I used only 50-dimensional GloVe embeddings in the last weeks' training, and Tommaso suggested that more dimensions could be helpful. As the largest available pre-trained GloVe model has 300 dimensions, I compared 300d GloVe with 50d GloVe, using the same train/validation/test split of the data during training:

512 units, 2 layer LSTM, 50d GloVe, with Paraphraser BLEU score curve
512 units, 2 layer LSTM, 50d GloVe, with Paraphraser accuracy curve

512 units, 2 layer LSTM, 300d GloVe, with Paraphraser BLEU score curve
512 units, 2 layer LSTM, 300d GloVe, with Paraphraser accuracy curve
We can see that training with 300d GloVe is a little faster than with 50d GloVe: it reaches a BLEU score of 90 about 10k steps earlier than the 50d model. The final results are close, with 300d GloVe slightly higher.
I also tested the 300d GloVe model on the test set used above; it gives an accuracy of 5844/11816 = 0.495, which is also a bit higher than 50d GloVe.
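For completeness, using a d-dimensional pre-trained GloVe model means building the encoder's embedding matrix from the corresponding vector file. A minimal sketch (the file names below are the standard GloVe releases, not necessarily the exact files used here):

```python
import numpy as np

def load_glove_matrix(glove_path: str, vocab: list, dim: int) -> np.ndarray:
    """Build a [vocab_size, dim] embedding matrix from a GloVe text file;
    words missing from GloVe keep a small random vector."""
    vectors = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    matrix = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    for i, word in enumerate(vocab):
        if word in vectors:
            matrix[i] = vectors[word]
    return matrix

# 50d vs 300d simply swaps the file and the dimension, e.g.
# load_glove_matrix("glove.6B.50d.txt", vocab, 50) vs
# load_glove_matrix("glove.6B.300d.txt", vocab, 300)
```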
Num_unit vs Dimension of Embeddings
From what I understand after searching the Internet and reading the documentation and the NMT code, the input size is not necessarily equal to num_unit. The code (code1, code2) shows that the input size is set equal to num_unit only when no embedding file is given; if pre-trained embeddings are used, the input size equals the embedding dimension (here 50 or 300).
The num_unit parameter in fact controls the dimension of the hidden state in each LSTM cell, and the mismatch between the input size and the hidden state size is simply handled by a weight matrix.
This explains why increasing num_unit while keeping the same embedding dimension can also improve the performance of NMT.
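A minimal sketch (plain Keras, not the NMT codebase itself) illustrating that the embedding dimension and num_unit are independent: the LSTM's input-to-hidden weight matrix absorbs the size difference.

```python
import tensorflow as tf

vocab_size, embed_dim, num_units = 10000, 50, 512  # 50-d embeddings, 512-unit hidden state

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),        # could be initialised with GloVe
    tf.keras.layers.LSTM(num_units, return_sequences=True),  # hidden state is 512-d
    tf.keras.layers.LSTM(num_units),                         # 2-layer LSTM as in the experiments
])
model.build(input_shape=(None, 20))

# Input-to-hidden kernel of the first LSTM layer: shape (embed_dim, 4 * num_units),
# i.e. (50, 2048) -- this weight matrix resolves the 50-vs-512 size mismatch.
print(model.layers[1].cell.kernel.shape)
```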
Conclusion
We can see that the final accuracy of NSpM with the Paraphraser increases by 0.118 (0.365 -> 0.483) compared with the model without the Paraphraser, so we can say that the Paraphraser helps the model with questions that do not differ too much, like those in qald-9.
The mis-linked entities could be addressed by introducing more examples per entity, but the number of examples is limited by the DBpedia Knowledge Base itself and by the EXAMPLES_PER_TEMPLATE variable. We cannot change the Knowledge Base, so we can only increase EXAMPLES_PER_TEMPLATE, but this does not guarantee that the number of examples increases for every entity. This should be discussed in the next meeting.