Introduction
This week, I first ran a training of 100k steps with embeddings, and it seems the model could still be improved a little; 120k steps might be more appropriate.
I mainly studied other paraphrase generation methods, especially the evaluation metrics they use, but unfortunately none of them can be applied to our Paraphraser.
So I created a metric to evaluate the paraphrasing myself, which combines cosine similarity and two edit distances.
The remaining training
Figure: BLEU score up to 100k training steps
Assessment of Paraphrasing
As our Paraphraser produces several candidate paraphrased questions, an evaluation metric is needed to pick the best one.
There are three criteria in paraphrase generation: semantic adequacy, fluency, and dissimilarity. I checked several methods that could be used to evaluate paraphrase generation automatically: iBLEU, ParaMetric, PEM, and PINC.
Some of them are meant to be used together with the BLEU score, others can be used on their own, but unfortunately none of them can be applied to our Paraphraser, because they either need manual alignments or require reference sequences to compute the metric. As a result, I tried to build a “metric” for paraphrase evaluation on my own.
Self-created Metric
This score is based on three components: cosine similarity, the edit distance of the sentences, and the edit distance of their POS-tag sequences.
We need to ensure that the cosine similarity is greater than 0.7, so that the paraphrased question does not drift too far from the original one.
Unfortunately, this cosine-similarity threshold is not perfect. For example, the similarity between “What is the death cause of <A> ?” and “Why did <A> die ?” is only 0.67, while the similarity between “What is the number of abbreviation of <A> ?” and “What are numbers for <A> ?” is 0.83. This shortcoming could be addressed once we have a better sentence representation than the Universal Sentence Encoder (e.g. a fine-tuned BERT model).
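To make the check concrete, below is a minimal sketch of the cosine-similarity filter, assuming the Universal Sentence Encoder is loaded from TensorFlow Hub; the function names and the way the 0.7 threshold is applied are illustrative, not the exact implementation.

```python
# Minimal sketch of the cosine-similarity check, assuming the
# Universal Sentence Encoder from TensorFlow Hub.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the USE embeddings of two sentences."""
    va, vb = embed([a, b]).numpy()
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

def semantically_close(original: str, candidate: str, threshold: float = 0.7) -> bool:
    """Keep a candidate only if it stays close enough to the original question."""
    return cosine_similarity(original, candidate) > threshold

# The pair from the text scores only 0.67 and would be rejected,
# even though it is a valid paraphrase.
print(semantically_close("What is the death cause of <A> ?", "Why did <A> die ?"))
```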
List of POS tags
I used the NLTK library to tokenize the sequences and do POS tagging.
Part of the list of POS tags in NLTK:
Of this list I only make use of NNP, to ensure that the paraphrased questions do not generate any proper noun other than XYZ; see the sketch after the list below.
- WP - Wh-pronoun - what
- VB - Verb, base form
- VBD - Verb, past tense
- VBG - Verb, gerund or present participle
- VBN - Verb, past participle
- VBP - Verb, non-3rd person singular present
- VBZ - Verb, 3rd person singular present
- NN - Noun, singular or mass
- NNS - Noun, plural
- NNP - Proper noun, singular
- NNPS - Proper noun, plural
- IN - Preposition or subordinating conjunction - in
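The sketch below shows how this proper-noun check could look with nltk.pos_tag; the helper name is illustrative, and “XYZ” stands for the entity placeholder mentioned above.

```python
# Minimal sketch of the proper-noun (NNP/NNPS) check with NLTK;
# "XYZ" is the entity placeholder used in the templates.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def keeps_only_placeholder(question: str, placeholder: str = "XYZ") -> bool:
    """Reject candidates that introduce a proper noun other than the placeholder."""
    tagged = nltk.pos_tag(nltk.word_tokenize(question))
    return all(not tag.startswith("NNP") or token == placeholder
               for token, tag in tagged)

print(keeps_only_placeholder("Where was XYZ born ?"))      # True
print(keeps_only_placeholder("Did XYZ live in Paris ?"))   # False: "Paris" is NNP
```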
A Universal Part-of-Speech Tagset
The POS tagset above is too fine-grained for calculating the distance, so I used the Universal POS tagset instead, which is more general; a sketch of the distance calculation follows the list below.
- ADJ - adjective
- ADP - adposition
- ADV - adverb
- CONJ - conjunction
- DET - determiner, article
- NOUN - noun
- PRT - particle
- PRON - pronoun
- VERB - verb
- . - punctuation marks
- X - other
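The distance components could look like the sketch below: the POS sequence is obtained with nltk.pos_tag(..., tagset="universal"), and NLTK's edit_distance is applied both to the token sequences and to the tag sequences. The equal weighting of the two distances is an assumption for illustration, not the exact combination used in the metric.

```python
# Sketch of the edit-distance components, using NLTK's universal tagset
# mapping; the equal weighting in dissimilarity_score is a hypothetical choice.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("universal_tagset", quiet=True)

def universal_pos_sequence(question: str) -> list:
    """Map the question to its sequence of universal POS tags."""
    tokens = nltk.word_tokenize(question)
    return [tag for _, tag in nltk.pos_tag(tokens, tagset="universal")]

def dissimilarity_score(original: str, candidate: str) -> int:
    """Higher means the candidate differs more from the original,
    both in surface form and in syntactic structure."""
    word_dist = nltk.edit_distance(nltk.word_tokenize(original),
                                   nltk.word_tokenize(candidate))
    pos_dist = nltk.edit_distance(universal_pos_sequence(original),
                                  universal_pos_sequence(candidate))
    return word_dist + pos_dist  # hypothetical equal weighting

print(dissimilarity_score("What is the death cause of <A> ?",
                          "Why did <A> die ?"))
```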
A Similar Idea from the Literature
KQA Pro [1], a recently published benchmark for complex KGQA, proposes an approach similar to ours, in that it tries to improve the diversity of questions generated from templates. The difference is that they use crowd-sourcing on AMT, while we try to do it automatically and directly on the original templates. Their paraphrasing process also removes questions that confuse the workers.
This human effort could solve many of the problems I mentioned in our case, such as the semantic dissimilarity problem and the fluency of the language, but at the same time it is costly, not only in money but also in time and manual effort.
Apart from the similar idea of paraphrasing, the paper points out that existing test datasets and benchmarks do not expose the inference process, and it introduces the notions of function and program to represent that process explicitly. However, I don’t think this inference process is important for a translation-based KGQA model.
Conclusion
I should try to paraphrase the template set in the next few days, because each single paraphrasing run takes about 2-3 minutes on Colab with a GPU (generate candidates -> extract sentence representations -> calculate metric scores -> ranking). With a template set of 1000, it takes at least 2000 minutes (about 33 hours).
References
[1] Jiaxin Shi, Shulin Cao, et al. (2020). KQA Pro: A Large Diagnostic Dataset for Complex Question Answering over Knowledge Base.