Introduction
Last week, we discussed the failed attempts to set up the baseline with multiple ontology terms. Since we still consider it necessary to cover as many natural-language question utterances as possible, it makes sense to handle the remaining issues well before starting a time-consuming training run.
Word embeddings are the issue we decided to tackle this week.
Embedding of Input and Output
The NMT model supports pre-trained word embeddings, but only if we use them on both sides, i.e. we need embeddings for both the natural-language questions and the SPARQL queries.
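To make the "both sides" requirement concrete, here is a minimal sketch assuming a PyTorch-style encoder/decoder (the framework we actually use may wire this differently); the vocabularies and random matrices are placeholders for the real ones built from GloVe later on:

```python
import numpy as np
import torch
import torch.nn as nn

EMB_DIM = 300  # placeholder dimension, matching the GloVe vectors loaded later

# Hypothetical vocabularies; the real ones come from our training data.
nl_vocab = ["<unk>", "<s>", "</s>", "which", "river", "crosses", "paris"]
sparql_vocab = ["<unk>", "<s>", "</s>", "SELECT", "WHERE", "{", "}", "dbo:crosses"]

def placeholder_matrix(vocab, dim):
    # Stand-in for a matrix filled with pre-trained vectors (see the next sections).
    return np.random.uniform(-0.1, 0.1, (len(vocab), dim)).astype("float32")

src_matrix = placeholder_matrix(nl_vocab, EMB_DIM)      # natural-language side
tgt_matrix = placeholder_matrix(sparql_vocab, EMB_DIM)  # SPARQL side

# Encoder and decoder each get their own embedding lookup, both initialised
# from pre-trained weights and kept trainable (freeze=False).
encoder_embedding = nn.Embedding.from_pretrained(torch.from_numpy(src_matrix), freeze=False)
decoder_embedding = nn.Embedding.from_pretrained(torch.from_numpy(tgt_matrix), freeze=False)
```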
GloVe embedding
GloVe provides pre-trained embeddings that we can simply use as the input of the NMT model.
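The official release distributes the vectors as plain text files, one word followed by its vector per line, so loading them is straightforward; a minimal sketch, assuming the glove.6B.300d.txt file from the GloVe download page:

```python
import numpy as np

def load_glove(path):
    """Read a GloVe .txt file (one word followed by its vector per line) into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

glove = load_glove("glove.6B.300d.txt")  # 400k words, 300-dimensional vectors
print(glove["river"].shape)  # (300,)
```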
Unlike other pre-trained embedding models that are based on a language model, GloVe is trained on a co-occurrence matrix with the following objective:
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where:
X: the co-occurrence matrix
X_{ij}: the number of times that word i and word j appear together within a context window, counted over the whole corpus
w_i, \tilde{w}_j: the two word vectors
b_i, \tilde{b}_j: the bias terms
f: a weighting function that down-weights very frequent co-occurrences
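To make the objective concrete, here is a small numpy sketch that evaluates it on a toy co-occurrence matrix (the weighting uses the x_max = 100, alpha = 0.75 values from the GloVe paper; the random vectors are placeholders):

```python
import numpy as np

def glove_loss(X, W, W_tilde, b, b_tilde, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe objective for a dense co-occurrence matrix X."""
    loss = 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            if X[i, j] == 0:
                continue  # f(0) = 0, so zero counts contribute nothing
            weight = min((X[i, j] / x_max) ** alpha, 1.0)  # the weighting function f
            diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
            loss += weight * diff ** 2
    return loss

# Toy example: 4-word vocabulary, 5-dimensional vectors.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(4, 4)).astype(float)
W, W_tilde = rng.normal(size=(4, 5)), rng.normal(size=(4, 5))
b, b_tilde = rng.normal(size=4), rng.normal(size=4)
print(glove_loss(X, W, W_tilde, b, b_tilde))
```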
Fine-tuning the GloVe model
In our case, the data set's vocabulary contains a large number of words that are not present in the pre-trained model, which may cause a serious out-of-vocabulary (OOV) issue. Fortunately, GloVe embeddings can be fine-tuned, which is also why we did not choose Word2Vec. Here, I followed the article to fine-tune the GloVe embeddings.
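One Python package that supports this kind of fine-tuning is mittens, which warm-starts training from an existing GloVe embedding; whether or not it is exactly what the article uses, a rough sketch of the step could look like this (the vocabulary, counts, and 50-dimensional vectors below are toy placeholders):

```python
import numpy as np
from mittens import Mittens  # pip install mittens

# Toy placeholders: in practice the vocabulary and co-occurrence counts come
# from our question/SPARQL data set, and the pre-trained vectors from GloVe.
vocab = ["river", "crosses", "dbo:crosses", "dbr:Paris"]
cooccurrence = np.array([[0., 4., 1., 2.],
                         [4., 0., 3., 1.],
                         [1., 3., 0., 5.],
                         [2., 1., 5., 0.]])
pretrained = {"river": np.random.rand(50), "crosses": np.random.rand(50)}

# Words already covered by GloVe keep their vectors as a starting point;
# OOV tokens such as "dbo:crosses" are learned from our co-occurrence counts.
model = Mittens(n=50, max_iter=1000)
finetuned_matrix = model.fit(cooccurrence, vocab=vocab,
                             initial_embedding_dict=pretrained)
finetuned = dict(zip(vocab, finetuned_matrix))  # word -> fine-tuned vector
print(finetuned_matrix.shape)  # (4, 50)
```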
Once fine-tuned, the new embeddings are compatible with the existing ones, so it makes sense to use a mixture of the two sets.
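A simple way to mix them is to prefer the fine-tuned vector when it exists and fall back to the original GloVe one; a minimal sketch, assuming both sets are word-to-vector dicts (the helper name and the random fallback are my own illustration):

```python
import numpy as np

def build_embedding_matrix(vocab, finetuned, pretrained, dim=300):
    """Prefer fine-tuned vectors, fall back to the original GloVe vectors,
    and initialise any remaining words at random."""
    matrix = np.random.uniform(-0.1, 0.1, (len(vocab), dim)).astype("float32")
    for idx, word in enumerate(vocab):
        if word in finetuned:
            matrix[idx] = finetuned[word]
        elif word in pretrained:
            matrix[idx] = pretrained[word]
    return matrix
```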
Improvement of model training
After some adjustments, I succeeded in running the NMT model with our fine-tuned GloVe embeddings. The figure below shows the two BLEU-score curves over the training steps:
The blue curve shows the progress of the model without pre-trained embeddings; the red one shows our model with the word embeddings.
It is a little surprising to see that the embeddings do not accelerate NMT training: the model falls behind during the first 17,500 steps, but it shows the potential to keep improving as training continues.
Conclusion
We are using fixed word embeddings because the NMT model does not support contextualized embeddings such as BERT or ELMo, but I think they would be a good choice in the future if we change the base of our translation model.