Why the Paraphraser could improve performance
We have seen performance improvements on a DBpedia-based data set and a Yado-based data set thanks to the Paraphraser. But why does it improve accuracy, with the same training steps and the same model (Bi-LSTM), compared with a data set without the Paraphraser?
Hypothesis 1
It increases the number of templates for each relation and entity (tripling the template size), so each entity and relation has more examples to learn from.
Experiment 1
Copy the templates two more times, so that the template set is the same size as the one produced with the Paraphraser.
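Concretely, the duplication takes only a few lines of Python; a minimal sketch, assuming the templates live in a plain-text file with one template per line (file names are hypothetical):

```python
# Triple the template file so its size matches the paraphrased set.
# File names are hypothetical; adapt them to the actual project layout.
with open("templates.csv") as f:
    templates = f.readlines()

with open("templates_tripled.csv", "w") as f:
    f.writelines(templates * 3)  # original + two extra copies
```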
Accuracy results:
Figure 1. Accuracy of the Original data set
Figure 2. Accuracy of the Paraphrased data set
We can see that the Original (size-tripled) data set also reaches a similar accuracy score (>70), except that it may require more training steps. The slower training is explained later.
However, when I tested the Paraphrased model on the original test set, the accuracy was only 42. This result is explained in the next paragraph.
Other statistics
There are 587,936 samples in each of the two data sets (the Paraphrased and the Tripled one), and nearly 103,000 entities appear in each, which gives approximately 5–6 samples per entity on average (587,936 / 103,000 ≈ 5.7); this helps ensure the final translation performance.
However, among the 103,000 entities in the original data set, about 54,000 do not appear in the Paraphrased data set. The two data sets therefore cover different sets of entities, which explains the low accuracy of the Paraphrased model on the original test set. A line of code in the Generator shows why this difference exists; see the sketch below.
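The actual line is not reproduced here, but the effect is easy to illustrate: if the Generator fills each template with a bounded random sample of entities, two different template sets end up instantiated with different entity subsets. A minimal sketch of that behaviour (function and placeholder names are hypothetical, not the Generator's real API):

```python
import random

# Hypothetical sketch: each template is filled with a bounded random
# sample of entities, so different template sets end up covering
# different subsets of the full entity list.
def instantiate(templates, entities, per_template=100):
    samples = []
    for template in templates:
        for entity in random.sample(entities, per_template):
            samples.append(template.replace("<A>", entity))
    return samples
```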
Conclusion
Since we have such a large set of entities and each training run only picks a subset of them, we should encourage the Generator to pick a much larger subset, in order to cover as many entities as possible. To do that, we should slightly change the priority code: giving entities with 1–5 usages the highest priority, and entities with 0 usages the second highest, could work better (see the sketch below), since the average number of samples per entity is about 5; this could ensure that the final accuracy reaches 81, which is satisfactory. We should run an experiment if necessary.
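A hedged sketch of that priority change (the Generator's real code is not reproduced here; the function name and the exact scoring scheme are assumptions):

```python
def priority(usage_count):
    """Hypothetical priority for picking an entity during generation;
    a higher value means the entity is picked first."""
    if 1 <= usage_count <= 5:
        return 2  # highest: keep entities near the ~5-samples-per-entity average
    if usage_count == 0:
        return 1  # second: bring previously unseen entities into the training set
    return 0      # lowest: avoid over-sampling already frequent entities

# Entities would then be sorted by this priority before sampling, e.g.:
# entities.sort(key=lambda e: priority(usages[e]), reverse=True)
```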
Hypothesis 2
The second hypothesis is the basic idea of the Paraphraser: it could help match the relation/predicate by adding variation to the original templates.
Experiment 2
As accuracy is no longer a proper evaluation metric, and the original test set is not a good choice either because of the different entity sets, I chose qald-9-test as the test set and counted the number of matched relations. To be fair, I only picked single Basic Graph Pattern (BGP) questions whose relation exists in our data set.
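Relation matches can be counted by extracting the predicate from each predicted query and comparing it with the gold one; a minimal sketch, assuming both queries are single-triple BGPs in the decoded form shown later in this document:

```python
import re

# Extracts the predicate from a single-triple BGP in the decoded form
# used here, e.g. "SELECT ?uri WHERE { dbr:Socrates dbo:influencedBy ?uri }".
PREDICATE = re.compile(r"\{\s*\S+\s+(\S+)\s+\S+\s*\}")

def relation_matched(predicted, gold):
    p, g = PREDICATE.search(predicted), PREDICATE.search(gold)
    return p is not None and g is not None and p.group(1) == g.group(1)

gold = "SELECT DISTINCT ?uri WHERE { ?uri dbo:influencedBy dbr:Socrates }"
pred = "SELECT ?uri WHERE { dbr:Socrates dbo:influencedBy ?uri }"
print(relation_matched(pred, gold))  # True: both use dbo:influencedBy
```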
Results 1
Among the 30 simple BGP questions in qald-9-test whose relation appears in the data set, the Paraphrased model matched 9 relations correctly, whereas the Original model predicted only 4 correctly. We can say that the Paraphraser roughly doubles the final performance at the level of relation matching.
Results 2
On the larger qald-9-train set, among about 100 samples, the Paraphrased model matched 40 relations correctly, whereas the Original model matched only 10 of them. This is a huge improvement.
Other statistics
The synonym step in the paraphrasing procedure is necessary for performance, because we cannot count entirely on the embeddings to cover synonymous expressions automatically.
I checked the cosine similarity of synonymous expressions in our GloVe.6B.200d pre-trained embeddings and found:
“husband” and “spouse”: 0.6033827158132323
“wife” and “spouse”: 0.5476363820234046
“writer” and “author”: 0.778833813581239
“profession” and “occupation”: 0.25730224462480544
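These numbers can be reproduced with a short script; a minimal sketch, assuming the standard glove.6B.200d.txt text format (one word per line, followed by its 200 float components):

```python
import numpy as np

def load_glove(path, vocab):
    """Load only the needed vectors from a GloVe text file."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in vocab:
                vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vecs = load_glove("glove.6B.200d.txt",
                  {"husband", "spouse", "wife", "writer", "author",
                   "profession", "occupation"})
print(cosine(vecs["profession"], vecs["occupation"]))  # ~0.26
```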
These kinds of “not similar” synonym pairs have caused many mismatched relations. The embeddings would need to be better to cover synonyms on their own; otherwise, the synonyms should be handled by the Paraphraser.
Hypothesis 3
The Paraphraser could also help with the matching of entities.
Experiment 2
This is the same experiment as the previous one.
Result
Intuitively, this assumption seems wrong, because the Paraphraser only paraphrases templates; it does not change any entities, so entity matching should be the same. In fact, I did not believe this hypothesis either at first, but here are some results:
The prefixes are removed and some expressions are decoded to simplify the comparison.
Natural Language question: “Who was influenced by Socrates?”
Ground Truth Query: “SELECT DISTINCT ?uri WHERE { ?uri dbo:influencedBy dbr:Socrates }”
Original model: “SELECT ?uri WHERE { dbr:James_Allen_(Virginia_delegate) dbo:foundingYear ?uri }”
Paraphrased model: “SELECT ?uri WHERE { dbr:Socrates dbo:influencedBy ?uri }”
We can see that the Paraphrased model predicts both the relation and the entity correctly, except for the subject/object order, because our SPARQL query templates all follow the same order; the Original model predicts neither of them.
Natural Language question: “How many awards has Bertrand Russell?”
Ground Truth Query: “SELECT (COUNT(?Awards) AS ?Counter) WHERE { dbr:Bertrand_Russell dbp:awards ?Awards }”
Original model: “SELECT ?uri WHERE { dbr:KTPX-TV dbo:subregion ?Awards }”
Paraphrased model: “SELECT ?uri WHERE { dbr:Claude_Bertrand_(neurosurgeon) dbo:award ?Awards }”
We can see that the relation nearly matches (dbo:award vs. dbp:awards), and the Paraphrased model catches part of the entity name (Bertrand), whereas the Original model does not catch it at all.
Conclusion
The Paraphraser does help with the matching of entities, which is somewhat counterintuitive.
A further hypothesis is that the Paraphraser can help with entity matching only if the relation is correctly, or almost correctly, matched. This would be an interesting subject, but it is harder to prove, because we have such a large set of entities.
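One way to approach it without covering the whole entity space would be to condition the entity-match rate on whether the relation was matched; a rough sketch (the per-question match extraction is assumed to happen upstream):

```python
from collections import Counter

def conditional_entity_accuracy(results):
    """results: iterable of (relation_matched, entity_matched) boolean
    pairs, one per test question (hypothetical upstream extraction)."""
    counts = Counter((rel_ok, ent_ok) for rel_ok, ent_ok in results)
    for rel_ok in (True, False):
        total = counts[(rel_ok, True)] + counts[(rel_ok, False)]
        if total:
            rate = counts[(rel_ok, True)] / total
            print(f"relation matched={rel_ok}: entity match rate {rate:.2f}")
```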