Introduction
This week, I’ve finally started my coding work (commits and merges to the community).
Based on Anand’s work, I’ve created my pipeline and added a Paraphraser to it; this work is briefly introduced in the first part.
I’ve also summarized some of the available resources that correspond to our Template Expansion idea, which are presented in the second part of the blog.
Creation of the pipeline
The whole pipeline is based on Anand’s pipeline, which gets resources from DBpedia and generates the original template set.
My pipeline adds a Paraphraser before the templates are stored. This Paraphraser could be one of several Paraphrase Generation models or even other approaches; for instance, we use a pre-trained T5 model.
The pipeline then evaluates the generated paraphrases (candidates) and determines whether to expand, replace, or abandon the original template.
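One way this three-way decision could look in code, as a minimal sketch: the threshold values and the idea of a separate fluency score are illustrative assumptions, not taken from the actual pipeline.

```python
def decide(similarity: float, cand_fluency: float, orig_fluency: float,
           low: float = 0.5, high: float = 0.95) -> str:
    """Map a candidate paraphrase's scores to a pipeline action.

    similarity   -- semantic similarity between candidate and original
    cand_fluency -- hypothetical quality score of the candidate
    orig_fluency -- hypothetical quality score of the original template
    """
    if similarity < low:
        # Meaning has drifted too far from the original template.
        return "abandon"
    if similarity >= high:
        # Near-duplicate: keep only the better-phrased of the two.
        return "replace" if cand_fluency > orig_fluency else "abandon"
    # Same meaning, genuinely new wording: add it alongside the original.
    return "expand"
```

With such a rule, only candidates in a middle similarity band grow the template set, while near-duplicates and off-topic outputs are filtered.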
Paraphraser
After the creation of templates and the elimination of never-asked queries, the questions are passed to the Paraphraser.
paraphrase_questions.py: This aims to paraphrase the question template and return several possible candidates with their scores (potentially textual similarity, POS tagging, etc.). The main pipeline then selects templates with a strategy and adds them to the template set.
textual_similarity.py: This aims to calculate the similarity scores between the candidates and the original question template.
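As a rough illustration of the kind of score textual_similarity.py could compute, here is a plain token-level cosine similarity. This is a simplified stand-in, not the actual script, which may well work on embeddings rather than raw tokens.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words vectors of two sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

For example, `cosine_similarity("what is your name ?", "what is your name ?")` is 1.0, while two sentences with no shared tokens score 0.0.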
To test the Paraphraser:
python paraphrase_questions.py --sentence "what is your name ?"
Other semantic similarity measures:
Besides the cosine similarity and the edit distance (Levenshtein distance) mentioned in last week’s blog, a more advanced semantic similarity could also be used.
The paper “Universal Sentence Encoder” [1] proposes a similarity based on angular distance, which the authors found performs better on average than raw cosine similarity.
I’ve also run some tests and found that this advanced semantic similarity seems more reasonable when the two sentences are similar; however, when the two sentences are semantically completely different, it still gives a relatively high score: 0.6 (advanced) vs 0.3 (cosine).
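The angular-distance similarity is straightforward to sketch: map the cosine similarity through arccos and rescale it to [0, 1]. The sketch below (my own minimal implementation, not the USE reference code) also shows why totally different sentences still score high: a cosine of 0.3 maps to roughly 0.6 angular similarity, exactly the pattern observed above.

```python
import math

def cosine(u, v):
    """Raw cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def angular_similarity(u, v):
    """1 - arccos(cos_sim)/pi: the angular-distance-based similarity."""
    c = max(-1.0, min(1.0, cosine(u, v)))  # clamp against float error
    return 1.0 - math.acos(c) / math.pi
```

Because arccos is much flatter than the identity near zero, moderate-to-low cosine values get compressed toward the middle of the [0, 1] range, which inflates scores for unrelated sentence pairs.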
Insights on available resources
| Resource | Input | Output | Main idea |
|---|---|---|---|
| NSpM Generator | Template set | NL2Sparql data set | Replacing the placeholders with the concerned labels |
| Anand’s pipeline | Ontology name (from DBpedia) | Template set | Rule-based |
| Stuart’s pipeline | Entity name (from wiki pages) | Template set | POS tagging + rule-based |
| Paraphraser | NL template | Paraphrased NL template | Paraphrase Generation model |
| Universal Sentence Encoder | A sequence of tokens | Sentence embedding | Transformer encoder + element-wise sum of word representations / DAN |
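To make the Generator row concrete, here is a hypothetical sketch of placeholder replacement; the `<A>` placeholder syntax and the helper name are illustrative, and the real NSpM Generator handles far more cases.

```python
def instantiate(nl_template, sparql_template, entities):
    """Fill a template's placeholder with entity labels/URIs to
    produce NL-SPARQL training pairs.

    entities maps a DBpedia label (used in the NL question)
    to its resource identifier (used in the SPARQL query).
    """
    pairs = []
    for label, uri in entities.items():
        pairs.append((nl_template.replace("<A>", label),
                      sparql_template.replace("<A>", uri)))
    return pairs
```

For instance, `instantiate("who is the mayor of <A> ?", "SELECT ?x WHERE { <A> dbo:mayor ?x }", {"Berlin": "dbr:Berlin"})` yields one NL question paired with one SPARQL query, and the data set grows with each extra entity.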
Our current pipeline is composed as follows:
Anand’s pipeline —> Paraphraser+USE —> NSpM Generator
Except for the NSpM Generator, the other parts (Anand’s pipeline, the Paraphraser, USE) could all be replaced by, or combined with, other approaches.
For example, if we need simpler templates, we could make use of Stuart’s pipeline to determine the proper entity names (like the top N ranked entities in a specific ontology). In this case, we may not need the Paraphraser, as the template set is already expanded.
As another example, USE could be replaced by the pre-trained hidden layers of BERT, as USE seems to ignore word order in a sentence. Bert-as-service is available, for instance, and there are many variants of BERT that give better sentence embeddings.
With regard to the NSpM pipeline: the Learner, which uses an LSTM-based model, could be replaced by a Transformer model, as in Stuart’s work, or by the CNN model mentioned in my previous blog.
References
[1] Daniel Cer, Yinfei Yang, Sheng-yi Kong et al. (2018). Universal Sentence Encoder