GSOC-Week-Ten

2020-08-14

Work done

Vocabulary issue?

This week I first tried to fix the vocabulary issue that I mentioned in the last two weekly reports. The inconsistency can be explained as follows:

When building the vocabulary for the NMT model, our build_vocab.py module not only tokenizes the corpus but also applies other specific operations, such as:

“RKDartists” -> “RK” and “Dartists”
“EntrezGene” -> “Entrez” and “Gene”
“high-courier” -> “high” and “courier”
“i̇̇smet” -> “̇smet” (note that the “i” here is a special character carrying a combining diacritic)

These modifications can cause many “unk” tokens to appear in the input sequences, because the affected input tokens no longer match any token in the corpus vocabulary.
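The mismatch can be illustrated with a minimal sketch. The split rule below (break at case boundaries and at hyphens) is only an assumed stand-in that reproduces the first three examples above; it is not the actual code of build_vocab.py, and the names `vocab_side_split`, `build_split_vocab` and `encode` are hypothetical.

```python
import re

UNK = "<unk>"

def vocab_side_split(token):
    # Assumed stand-in for the extra splitting done while building the vocabulary:
    # "RKDartists" -> ["RK", "Dartists"], "high-courier" -> ["high", "courier"].
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", token)

def build_split_vocab(corpus_tokens):
    # The vocabulary only ever sees the split pieces, never the whole token.
    vocab = set()
    for token in corpus_tokens:
        vocab.update(vocab_side_split(token))
    return vocab

def encode(tokens, vocab):
    # The input side keeps whole tokens, so the lookup fails and yields UNK.
    return [t if t in vocab else UNK for t in tokens]

corpus = ["RKDartists", "EntrezGene", "high-courier"]
vocab = build_split_vocab(corpus)
encoded = encode(corpus, vocab)
```

In this sketch every one of the three input tokens is encoded as the unknown token, even though all of them occur in the corpus.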

As a result, I skipped the build_vocab module and built the corpus vocabulary purely from tokenization plus a little normalization that strips punctuation, for example turning “st.” into “st”.
But I then found that the vocabulary may not be the problem, or at least not the only one.
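A vocabulary built only from tokenization and light normalization could look roughly like the sketch below. The exact normalization rules I used are not spelled out here, so the set of stripped characters, the example sentences and the function names are all assumptions for illustration.

```python
from collections import Counter

# Assumption: only trailing punctuation is stripped, so "st." becomes "st"
# while SPARQL-style variables such as "?x" are left untouched.
TRAILING = ".,;:!?"

def tokenize(text):
    tokens = (raw.rstrip(TRAILING) for raw in text.split())
    return [t for t in tokens if t]

def build_simple_vocab(sentences, min_count=1):
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    return {tok for tok, n in counts.items() if n >= min_count}

# Made-up sentences, not taken from the real corpus.
sentences = [
    "Who is listed in RKDartists ?",
    "Where is st. Louis?",
]
simple_vocab = build_simple_vocab(sentences)
```

Because the same `tokenize` function serves both vocabulary construction and input encoding, a term such as “RKDartists” stays in one piece on both sides.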

Size of the model!

Whether I used the self-created vocabulary or the build_vocab vocabulary when trying to reproduce the work of week seven, the performance of the model on the test set was never acceptable (BLEU score ~ 60 and accuracy < 20):

Figure1: BLEU score of a 256-unit 2-layer LSTM model

Figure2: Accuracy of a 256-unit 2-layer LSTM model

The only difference from week seven is that I used a paraphrased template set without any SPARQL operators. I then used a larger model and the problem was solved, almost magically:

Figure3: BLEU score of a 512-unit 2-layer LSTM model

Figure4: Accuracy of a 512-unit 2-layer LSTM model

Figure5: Accuracy of a 512-unit 2-layer LSTM model for fewer ontology terms

Figure6: BLEU score of a 512-unit 2-layer LSTM model for fewer ontology terms

In the end, I doubled the size of the model, moving to a 512-unit 2-layer LSTM model with the self-created vocabulary. I will run another experiment with this size and the build_vocab module to see whether the vocabulary affects anything during training.
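To get a feeling for what doubling the units means in terms of capacity, here is a rough parameter count for one stack of LSTM layers. It uses the textbook LSTM formulation with a single bias vector per gate and assumes that the input embedding size equals the number of units; both are assumptions, and real frameworks may count slightly differently (PyTorch, for instance, keeps two bias vectors per layer). Embeddings, attention and the output projection are not included.

```python
def lstm_layer_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights and one bias vector.
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def stacked_lstm_params(input_size, hidden_size, num_layers):
    total = 0
    size = input_size
    for _ in range(num_layers):
        total += lstm_layer_params(size, hidden_size)
        size = hidden_size  # deeper layers read the previous layer's output
    return total

small = stacked_lstm_params(256, 256, 2)  # 256-unit, 2-layer stack
large = stacked_lstm_params(512, 512, 2)  # 512-unit, 2-layer stack
ratio = large / small
```

Under these assumptions, doubling the number of units roughly quadruples the number of recurrent parameters, so the 512-unit model is considerably larger than "twice the size" might suggest.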

Other work

Last week Tommaso proposed that the training set should be shuffled before fine-tuning the BERT classifier, so I used DataLoader to randomly split the training set into batches.
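The idea of shuffled batching can be sketched in plain Python as below. This is not my actual training code: the helper `shuffled_batches` and the toy data are hypothetical. If the DataLoader in question is PyTorch's `torch.utils.data.DataLoader` (an assumption), passing `shuffle=True` gives this behaviour and reshuffles at every epoch.

```python
import random

def shuffled_batches(examples, batch_size, seed=None):
    # Shuffle the indices once, then cut them into consecutive batches.
    order = list(range(len(examples)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]

# Toy (text, label) pairs standing in for the classifier's training set.
training_set = [(f"paraphrase {i}", i % 2) for i in range(10)]
batches = list(shuffled_batches(training_set, batch_size=4, seed=0))
```

Every example still appears exactly once per pass; only the order, and therefore the composition of each batch, changes.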

I have kept contributing to the annotation work and expect to have completed 1000 annotations by the end of this week.

I have also modified the pipeline to make better use of the BERT classifier when evaluating paraphrases, and changed the separator from “;” to “\t”, since there can be “,” in the natural-language questions and “;” in the SPARQL queries.
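The reason for the tab separator can be seen in a small sketch. The question and query below are made up for illustration and are not taken from the real template files.

```python
import csv
import io

# Made-up example pair; both strings are illustrative only.
question = "Who wrote Dune, and who published it?"
query = "SELECT ?a ?p WHERE { dbr:Dune dbo:author ?a ; dbo:publisher ?p }"

# Joining with ";" is ambiguous because the query itself contains a ";".
semicolon_fields = ";".join([question, query]).split(";")

# With a tab delimiter, each row is read back as exactly two fields.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerow([question, query])
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter="\t"))
```

Here the semicolon-joined line splits into three fields instead of two, while the tab-separated row round-trips unchanged.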

Work to be done

The final F1-score on the qald-9-train test set is still zero, even with the well-trained NMT model with paraphrases. I might need to find other ways to demonstrate the effectiveness of the Paraphraser, perhaps with another test set. I have already started working on the qald-9-test dataset: I have extracted the ontology terms, built template sets with paraphrases, and am now training the model.
It could also be interesting to use other criteria for including paraphrases. At the moment I only introduce two paraphrases, whereas my mentor Tommaso proposed last week to introduce all the correct and good paraphrases (those with labels 0 and 1) and to remove the original one.
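For reference, an answer-set F1 of the kind commonly used for question answering over knowledge graphs can be sketched as follows. The official QALD evaluation script may treat edge cases (such as empty answer sets) differently, so the conventions and function names below are assumptions.

```python
def precision_recall_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        # Assumed convention: an expected empty answer that is matched counts as perfect.
        return 1.0, 1.0, 1.0
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def macro_f1(pairs):
    # pairs: one (predicted_answers, gold_answers) tuple per question.
    scores = [precision_recall_f1(p, g)[2] for p, g in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition a macro F1 of zero means that, for every question, the answers returned by the generated query share nothing with the gold answers.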
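The proposed selection rule could be sketched as below. The data layout (a list of `(text, label)` pairs), the example sentences and the use of label 2 for a rejected paraphrase are assumptions made for illustration; only labels 0 and 1 as the accepted ones come from the proposal above.

```python
ACCEPTED_LABELS = {0, 1}  # the "correct" and "good" paraphrases

def select_paraphrases(original, candidates, keep_original=False):
    # Keep every accepted paraphrase; the original question is dropped by default.
    selected = [text for text, label in candidates if label in ACCEPTED_LABELS]
    if keep_original:
        selected.insert(0, original)
    return selected

# Made-up example; label 2 is a hypothetical "rejected" label.
original = "What is the birth place of <A>?"
candidates = [
    ("Where was <A> born?", 0),
    ("In which place was <A> born?", 1),
    ("What is the birthday of <A>?", 2),
]
selected = select_paraphrases(original, candidates)
```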