Introduction
This week, I first ran a training of 100k steps with embeddings, and it seems the model could still be improved a little; 120k steps might be more appropriate.
I mainly studied other paraphrase generation methods, especially the evaluation metrics they use, but unfortunately none of them can be applied to our Paraphraser.
So I created a metric to evaluate the paraphrasing myself, which combines cosine similarity and two edit distances.
The remaining training
Figure: BLEU score up to 100k training steps
Assessment of Paraphrasing
As our Paraphraser produces several candidate paraphrased questions, an evaluation metric is needed to pick the best one.
There are three criteria in paraphrase generation: semantic adequacy, fluency, and dissimilarity. I checked several methods that could be used to evaluate paraphrase generation automatically: iBLEU, ParaMetric, PEM, and PINC.
Some of them are meant to be used together with the BLEU score, others can be used on their own, but unfortunately none of them can be applied to our Paraphraser, because they either need manual alignments or require reference sequences to compute the metric. As a result, I tried to build a “metric” for paraphrase evaluation on my own.
Self-created Metric
This score is based on three components: cosine similarity, the edit distance of the sentences, and the edit distance of their POS-tag sequences.
We need to ensure that the cosine similarity is greater than 0.7, so that the paraphrased question does not drift too far from the original one.
Unfortunately, this cosine-similarity threshold is not perfect. For example, the similarity between “What is the death cause of <A> ?” and “Why did <A> die ?” is only 0.67, while the similarity between “What is the number of abbreviation of <A> ?” and “What are numbers for <A> ?” is 0.83. This shortcoming could be addressed once we have a better sentence representation than the Universal Sentence Encoder (e.g. a fine-tuned BERT model).
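To make the check concrete, below is a minimal sketch of the cosine-similarity filter, assuming the Universal Sentence Encoder is loaded from TensorFlow Hub; the function names and the way the 0.7 threshold is applied are illustrative, not the exact implementation.

```python
# Minimal sketch of the cosine-similarity check, assuming the
# Universal Sentence Encoder from TensorFlow Hub.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the USE embeddings of two sentences."""
    va, vb = embed([a, b]).numpy()
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

def semantically_close(original: str, candidate: str, threshold: float = 0.7) -> bool:
    """Keep a candidate only if it stays close enough to the original question."""
    return cosine_similarity(original, candidate) > threshold

# The pair from the text scores only 0.67 and would be rejected,
# even though it is a valid paraphrase.
print(semantically_close("What is the death cause of <A> ?", "Why did <A> die ?"))
```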
List of POS tags
I used the NLTK library to tokenize the sequences and do POS tagging.
Part of the list of POS tags in NLTK:
Of this list I only make use of NNP, to ensure that the paraphrased questions do not generate any proper noun other than XYZ; see the sketch after the list below.
- WP - Wh-pronoun - what
- VB - Verb, base form
- VBD - Verb, past tense
- VBG - Verb, gerund or present participle
- VBN - Verb, past participle
- VBP - Verb, non-3rd person singular present
- VBZ - Verb, 3rd person singular present
- NN - Noun, singular or mass
- NNS - Noun, plural
- NNP - Proper noun, singular
- NNPS - Proper noun, plural
- IN - Preposition or subordinating conjunction - in
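The sketch below shows how this proper-noun check could look with nltk.pos_tag; the helper name is illustrative, and “XYZ” stands for the entity placeholder mentioned above.

```python
# Minimal sketch of the proper-noun (NNP/NNPS) check with NLTK;
# "XYZ" is the entity placeholder used in the templates.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def keeps_only_placeholder(question: str, placeholder: str = "XYZ") -> bool:
    """Reject candidates that introduce a proper noun other than the placeholder."""
    tagged = nltk.pos_tag(nltk.word_tokenize(question))
    return all(not tag.startswith("NNP") or token == placeholder
               for token, tag in tagged)

print(keeps_only_placeholder("Where was XYZ born ?"))      # True
print(keeps_only_placeholder("Did XYZ live in Paris ?"))   # False: "Paris" is NNP
```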
A Universal Part-of-Speech Tagset
The POS tagset above is too fine-grained for calculating the distance, so I used the Universal POS tagset instead, which is more general; a sketch of the distance calculation follows the list below.
- ADJ - adjective
- ADP - adposition
- ADV - adverb
- CONJ - conjunction
- DET - determiner, article
- NOUN - noun
- PRT - particle
- PRON - pronoun
- VERB - verb
- . - punctuation marks
- X - other
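The distance components could look like the sketch below: the POS sequence is obtained with nltk.pos_tag(..., tagset="universal"), and NLTK's edit_distance is applied both to the token sequences and to the tag sequences. The equal weighting of the two distances is an assumption for illustration, not the exact combination used in the metric.

```python
# Sketch of the edit-distance components, using NLTK's universal tagset
# mapping; the equal weighting in dissimilarity_score is a hypothetical choice.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("universal_tagset", quiet=True)

def universal_pos_sequence(question: str) -> list:
    """Map the question to its sequence of universal POS tags."""
    tokens = nltk.word_tokenize(question)
    return [tag for _, tag in nltk.pos_tag(tokens, tagset="universal")]

def dissimilarity_score(original: str, candidate: str) -> int:
    """Higher means the candidate differs more from the original,
    both in surface form and in syntactic structure."""
    word_dist = nltk.edit_distance(nltk.word_tokenize(original),
                                   nltk.word_tokenize(candidate))
    pos_dist = nltk.edit_distance(universal_pos_sequence(original),
                                  universal_pos_sequence(candidate))
    return word_dist + pos_dist  # hypothetical equal weighting

print(dissimilarity_score("What is the death cause of <A> ?",
                          "Why did <A> die ?"))
```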
A Similar Idea from the Literature
KQA Pro [1], a recently published benchmark for complex KGQA, proposes an approach similar to ours, in that it tries to improve the diversity of questions generated from templates. The difference is that they use crowd-sourcing on AMT, while we try to do it automatically and directly on the original templates. Their paraphrasing process also removes questions that confuse the workers.
This human effort could solve many of the problems I mentioned in our case, such as the semantic dissimilarity problem and the fluency of the language, but at the same time it is costly, not only in money but also in time and manual effort.
Apart from the similar idea of paraphrasing, the paper points out that existing test datasets and benchmarks do not expose the inference process, and it introduces the notions of function and program to represent that process explicitly. However, I don’t think this inference process is important for a translation-based KGQA model.
Conclusion
I should try to paraphrase the template set in the next few days, because each single paraphrasing run takes about 2-3 minutes on Colab with a GPU (generate candidates -> extract sentence representations -> calculate metric scores -> ranking). With a template set of 1000, it takes at least 2000 minutes (about 33 hours).
References
[1] Jiaxin Shi, Shulin Cao, et al. (2020). KQA Pro: A Large Diagnostic Dataset for Complex Question Answering over Knowledge Base.