GSOC-Week-Twelve

2020-08-26

Introduction

This week, I first trained another model on only the original questions, using the 300d GloVe embeddings. After that, I worked mainly on the “one-command pipeline”.

300-dimension GloVe without Paraphraser

Figure 1. BLEU score 71.89

Figure 2. Accuracy 0.385

Compared with the 50d GloVe model, which reached an accuracy of 0.365, we can see that larger embeddings do help performance (0.365 -> 0.385), but not as much as the Paraphraser does (0.365 -> 0.483).
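To make the comparison concrete, here is a small sketch that computes the absolute and relative gains from the three accuracy values quoted above. The variable and function names are my own and only serve the illustration.

```python
# Accuracy values quoted in the text above.
baseline_50d = 0.365       # 50d GloVe, original questions only
glove_300d = 0.385         # 300d GloVe, original questions only
with_paraphraser = 0.483   # model trained with the Paraphraser


def gain(new, old):
    """Return (absolute gain, relative gain in percent) of new over old."""
    absolute = new - old
    relative = 100.0 * absolute / old
    return round(absolute, 3), round(relative, 1)


print(gain(glove_300d, baseline_50d))        # (0.02, 5.5)
print(gain(with_paraphraser, baseline_50d))  # (0.118, 32.3)
```

In other words, moving to 300d embeddings gives roughly a 5.5% relative improvement, while the Paraphraser gives roughly 32.3%.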

The complete pipeline

The complete pipeline can be divided into four principal steps, some of which have detailed sub-steps, as listed below:

1. Templator: Generation of templates
2. Paraphraser: Batch-Paraphrasing

2.1 Download BERT-Classifier
2.2 Launch Paraphraser

3. Generator: Generation of corpus and data sets

3.1 Generate data.en/sparql
3.2 Generate vocab (simple tokenizing and normalization)
3.3 Generate GloVe embeddings

3.3.1 Download GloVe 300d pretrained model
3.3.2 Fine-tune the en embeddings and train the sparql embeddings

4. Learner: NMT training

4.1 Split into train/dev/test
4.2 Training with embedding

Each part of the code has already been tested separately on my MacBook, but I noticed that some shell commands behave slightly differently between GNU/Linux and macOS, for example the options of the sed command. Since other differences or issues may show up on a Linux server, I will also try running the pipeline shell script in a Linux environment.
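A well-known instance of the sed difference is in-place editing: GNU sed accepts `-i` on its own, while the BSD sed shipped with macOS expects a backup suffix after `-i` (an empty string if no backup is wanted). The sketch below builds the right argument list for each case. The function name `sed_inplace_command` is hypothetical, and it assumes that `sed` on macOS is the system BSD sed rather than a separately installed GNU sed.

```python
import platform
from typing import List, Optional


def sed_inplace_command(expression: str, path: str,
                        system: Optional[str] = None) -> List[str]:
    """Build a `sed -i` argument list that suits the given platform.

    `system` defaults to platform.system(), which is "Darwin" on macOS
    and "Linux" on GNU/Linux.
    """
    system = system or platform.system()
    if system == "Darwin":
        # BSD sed: -i must be followed by a backup suffix ("" = no backup).
        return ["sed", "-i", "", expression, path]
    # GNU sed: -i takes no separate argument.
    return ["sed", "-i", expression, path]


print(sed_inplace_command("s/foo/bar/", "data.en", system="Darwin"))
print(sed_inplace_command("s/foo/bar/", "data.en", system="Linux"))
```

The resulting list could then be passed to `subprocess.run(cmd, check=True)`. Doing the substitution directly in Python would avoid the platform difference altogether.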

Conclusion

The work is nearly finished. The next step will be to run the complete pipeline on all the ontology classes of DBpedia.