Introduction
GSoC 2020 is coming to an end, and I'm very glad to have spent the last four months with the DBpedia community, in particular with my mentors Tommaso and Anand.
DBpedia is one of the biggest knowledge graphs, so making good use of it is both important and interesting. Since extracting information from such a knowledge base requires writing SPARQL queries, it is worthwhile to provide users with friendlier interfaces that spare them the trouble of learning a formal query language. A Knowledge Graph Question Answering (KGQA) system is exactly such a solution.
My work principally focused on the question-coverage problem of such a system: understanding more of the questions asked by users and parsing them into correct SPARQL queries.
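To make the barrier concrete: even a simple factual question corresponds to a formal query against the DBpedia endpoint. Below is a minimal sketch, using the SPARQLWrapper library (not part of NSpM itself), of the kind of query a KGQA system has to produce for "Who is the mayor of Paris?":

```python
# Minimal example: the SPARQL query behind "Who is the mayor of Paris?",
# sent to the public DBpedia endpoint with SPARQLWrapper.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?mayor WHERE { dbr:Paris dbo:mayor ?mayor }
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["mayor"]["value"])
```

A KGQA system aims to generate such queries automatically from the natural-language question.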
A quick conclusion
During the last three months of coding, I have been working on DBpedia's QA system, NSpM, trying to optimize certain existing modules and to add a new module, the Paraphraser.
If you are interested in the code of the whole project, here is the Source Code.
My own contributions to the project's code base can be found in A Single Pull Request.
Here is a quick summary of the stages of the last four months; if you are interested in my work and want to know more about the details and my journey, please read my blog posts.
Stage Zero — 4th May ~ 31st May
GSOC-Bonding-Period: https://baiblanc.github.io/2020/06/01/GSOC-Bonding-period/
The first month of GSoC is a bonding period with the DBpedia community. I principally worked on researching state-of-the-art solutions and models in the relevant field and on making my initial proposal more feasible and deliverable.
Stage One — 1st June ~ 3rd July
GSOC-Week-One: https://baiblanc.github.io/2020/06/07/GSOC-Week-One/
GSOC-Week-Two: https://baiblanc.github.io/2020/06/12/GSOC-Week-Two/
GSOC-Week-Three: https://baiblanc.github.io/2020/06/23/GSOC-Week-Three/
GSOC-Week-Four: https://baiblanc.github.io/2020/06/28/GSOC-Week-Four/
These four weeks of work focused on testing the different available resources, including the existing modules of our project and other open-source models, and on building an initial pipeline with the Paraphraser, together with a benchmark environment to evaluate the first results.
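To make the role of the Paraphraser concrete, here is a deliberately simplified, hypothetical sketch of its input/output contract: one question template goes in, several differently worded templates come out, all mapped to the same SPARQL template. The real module generates paraphrases with a learned model; the hand-written patterns below are for illustration only.

```python
# Hypothetical sketch of the Paraphraser's contract: one question template
# in, several rewordings out, all sharing the same SPARQL template.
# The patterns are made up; the real module uses a learned paraphrasing model.
PARAPHRASE_PATTERNS = {
    "who is the mayor of <A>": [
        "who serves as mayor of <A>",
        "name the mayor of <A>",
        "which person is the mayor of <A>",
    ],
}

def paraphrase(template):
    """Return the original template plus all known paraphrases of it."""
    return [template] + PARAPHRASE_PATTERNS.get(template, [])

for question in paraphrase("who is the mayor of <A>"):
    print(question)
```

Expanding each template into many surface forms is what lets the trained model recognize more of the ways a user may phrase the same question.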
Stage Two — 4th July ~ 31st July
GSOC-Week-Five: https://baiblanc.github.io/2020/07/07/GSOC-Week-Five/
GSOC-Week-Six: https://baiblanc.github.io/2020/07/13/GSOC-Week-Six/
GSOC-Week-Seven: https://baiblanc.github.io/2020/07/17/GSOC-Week-Seven/
GSOC-Week-Eight: https://baiblanc.github.io/2020/07/28/GSOC-Week-Eight/
During the second month of the coding period, I worked on optimizing different aspects of the existing NSpM modules, such as the creation of the initial templates, GloVe embeddings and proper vocabulary building. I also worked on more details of the Paraphraser and solved problems found during the first stage of development.
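As a quick illustration of the preprocessing mentioned above, here is a minimal sketch (not NSpM's actual code) of building a vocabulary from training questions and loading the pre-trained GloVe vectors for exactly those words; the GloVe file name refers to the standard download and is an assumption here.

```python
# Minimal sketch: build a token vocabulary, then keep only the GloVe vectors
# for words that actually occur in it. GloVe files store one word per line,
# followed by its space-separated vector components.
import numpy as np

def build_vocab(sentences):
    vocab = set()
    for sentence in sentences:
        vocab.update(sentence.lower().split())
    return vocab

def load_glove(path, vocab):
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in vocab:
                embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings

vocab = build_vocab(["who is the mayor of paris"])
# embeddings = load_glove("glove.6B.100d.txt", vocab)  # path is an assumption
```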
Stage Three — 1st August ~ 31st August
GSOC-Week-Nine: https://baiblanc.github.io/2020/08/02/GSOC-Week-Nine/
GSOC-Week-Ten: https://baiblanc.github.io/2020/08/13/GSOC-Week-Ten/
GSOC-Week-Eleven: https://baiblanc.github.io/2020/08/18/GSOC-Week-Eleven/
GSOC-Week-Twelve: https://baiblanc.github.io/2020/08/26/GSOC-Week-Twelve/
I worked on the final training and built a simple grid search over different hyperparameters of the new NSpM model, which produced the final results and the comparison with the baseline. A detailed presentation can be found in my Week-Eleven blog post. I also worked on a one-command pipeline that automates all the modules, from the creation of the templates to the training of the final NMT model.
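For readers unfamiliar with the technique, a grid search simply trains one model per combination of hyperparameter values and keeps the best. Here is a minimal sketch; train_and_evaluate is a hypothetical helper standing in for one NSpM training run, and the parameter names and ranges are examples, not the exact grid used in the project.

```python
# Minimal grid search: try every combination of hyperparameters and keep the
# combination with the best validation score. train_and_evaluate is a
# hypothetical stand-in for training and scoring one NSpM model.
from itertools import product

GRID = {
    "num_units": [128, 256],
    "num_layers": [2, 3],
    "dropout": [0.2, 0.5],
}

def grid_search(train_and_evaluate):
    best_score, best_params = float("-inf"), None
    keys = list(GRID)
    for values in product(*(GRID[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```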
Looking to the future
The Paraphraser and the final pipeline focused only on simple questions with a single Basic Graph Pattern (BGP). In the future, it would also be interesting to tackle more complex questions, such as compositional questions and questions that involve more complex SPARQL expressions (SPARQL operators, for example).
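To illustrate the gap, here are two hand-written example queries against the DBpedia ontology: the first has the single-BGP shape the current pipeline covers, while the second needs an aggregation operator and is the kind of target left for future work.

```python
# Covered today: one triple pattern ("Who is the mayor of Paris?").
simple_query = """
SELECT ?mayor WHERE { dbr:Paris dbo:mayor ?mayor }
"""

# Future work: requires the COUNT operator
# ("How many films did Steven Spielberg direct?").
complex_query = """
SELECT (COUNT(?film) AS ?n) WHERE { ?film dbo:director dbr:Steven_Spielberg }
"""
```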
It would also be interesting to change the Learner model: we are still using an LSTM-based model, which has not been state of the art for a while. Transformer- or CNN-based models could be a good choice, and introducing an attention mechanism will become necessary as the questions get longer in the next step.
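For reference, the core building block of the Transformer is scaled dot-product attention, softmax(QKᵀ/√d_k)V. Here is a minimal numpy sketch, with arbitrary shapes and values:

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

# Example: 4 query positions attending over 6 key/value positions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
print(attention(Q, K, V).shape)  # (4, 8)
```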
Another bottleneck of the pipeline is running time: to cover all the top-level ontology classes, the Templater part can take more than 6 hours (depending on the network connection). This could be optimized in the future.
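Since most of that time is presumably spent waiting on the SPARQL endpoint, one possible direction (sketched here under that assumption, with fetch_templates_for_class as a hypothetical stand-in for the real per-class work) would be to issue the per-class queries concurrently:

```python
# Possible speed-up: run the per-class endpoint queries concurrently instead
# of one by one. fetch_templates_for_class is a hypothetical placeholder.
from concurrent.futures import ThreadPoolExecutor

def fetch_templates_for_class(ontology_class):
    ...  # query the endpoint and build the templates for this class

def fetch_all_templates(classes, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_templates_for_class, classes))
```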