Introduction
The community bonding period (4th May to 31st May) has formally ended, and the coding period begins.
In this report, I will present some of my thoughts from this period, as well as my decision on the direction of the next three months' research and development.
About Neural-QA and NSpM[1]
Our neural QA system is principally based on the NSpM project, which uses question-query templates and an LSTM-based NMT model.
The manual creation of templates aims to generate a training dataset with the least human effort, in contrast to approaches that require more extensive manual annotation, such as LC-QuAD. The training data, which are pairs of NL questions and SPARQL queries, are then learned by the NMT model as a machine translation problem.
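To make the template idea concrete, here is a minimal sketch of how a single template can be instantiated into NL-SPARQL training pairs. The template format, placeholder names, and entity list below are hypothetical illustrations, not NSpM's actual code:

```python
# Hypothetical template: an NL question pattern paired with a SPARQL pattern.
TEMPLATE = {
    "question": "where is <A> located",
    "query": "SELECT ?x WHERE { <A_uri> dbo:location ?x }",
}

# Hypothetical entity list: (label, URI) pairs, e.g. drawn from DBpedia.
ENTITIES = [
    ("Eiffel Tower", "dbr:Eiffel_Tower"),
    ("Statue of Liberty", "dbr:Statue_of_Liberty"),
]

def instantiate(template, entities):
    """Fill the placeholder with each entity to produce NL-SPARQL training pairs."""
    pairs = []
    for label, uri in entities:
        question = template["question"].replace("<A>", label)
        query = template["query"].replace("<A_uri>", uri)
        pairs.append((question, query))
    return pairs

pairs = instantiate(TEMPLATE, ENTITIES)
print(pairs[0])
# ('where is Eiffel Tower located', 'SELECT ?x WHERE { dbr:Eiffel_Tower dbo:location ?x }')
```

Each resulting pair is one training example for the NMT model, which is why a handful of templates plus a large entity list can yield a sizeable dataset with little human effort.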
Direction of research and development
I have gone through several papers, and I have finally decided to work on the question coverage problem using NL generation (and possibly other methods). In case we cannot have a meeting these days, I am going to list some of my arguments and doubts here:
I define our NSpM system as a restricted-domain QA system, as we are training and testing on relatively restricted datasets, such as Monument and Person. The compositionality problem here is different from that of an open-domain QA system (such as Stuart's work and the papers referenced on his GitHub page), and I think our main problems are the composition of templates and the coverage of utterances. The former has already been solved by Anand, so I will rather be working on the latter.
As our template-query pairs are manually created, the NL questions generated from a given template are syntactically isomorphic (they share the same syntax) [2]. I think the level of utterance coverage could be improved by adding more syntactic variability through NL generation (the main idea is already discussed in my proposal), which could also improve the semantic variability. The main concern is that this process would cost a lot of time if applied to all the templates, even in a restricted-domain dataset. Here are two possible ideas which could more or less alleviate this time-consumption problem:
- Apply NL generation only to simple templates, and let them be composed so that the variability "spreads" into compositional templates
- Apply NL generation only to top-ranked templates
The first idea makes better sense to my mentor Tommaso and to me.
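The first idea can be sketched as follows. The template names, paraphrase lists, and the `compose` connective below are all hypothetical; in practice the paraphrases would come from an NL generation step, and composition would follow Anand's mechanism rather than simple string joining:

```python
# Hypothetical paraphrase table: NL generation is applied only to the
# simple templates, producing several syntactic variants of each.
simple_paraphrases = {
    "<A> location": ["where is <A>", "where is <A> located", "in which place is <A>"],
    "<A> height": ["how tall is <A>", "what is the height of <A>"],
}

def compose(t1, t2, connective="and"):
    """Compose two simple templates into a compound one: every combination of
    the parts' paraphrases becomes a paraphrase of the compound template."""
    return [f"{p1} {connective} {p2}"
            for p1 in simple_paraphrases[t1]
            for p2 in simple_paraphrases[t2]]

compound = compose("<A> location", "<A> height")
print(len(compound))  # 3 * 2 = 6 compound paraphrases from 5 simple ones
```

The point of the sketch is the multiplicative effect: paraphrasing only the simple templates still multiplies the variability of every compound template built from them, so the expensive NL generation step stays bounded by the number of simple templates.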
As for the model itself, while keeping the RNN-LSTM model, an NMT model based on CNNs is an alternative option, as it has been shown to outperform RNN-based models on NL-to-SPARQL tasks [3].
References
[1] Tommaso Soru, Edgard Marx, André Valdestilhas, Diego Esteves, Diego Moussallem, Gustavo Publio (2018). Neural Machine Translation for Query Construction and Composition.
[2] Abdalghani Abujabal et al. (2018). Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases.
[3] Xiaoyu Yin et al. (2019). Neural Machine Translating from Natural Language to SPARQL.