Introduction
This week, the code of the Paraphraser has been optimised by Tommaso and the process has become much faster. So instead of working on Sentence-BERT, I have completed a number of small tasks.
Basic Templates
We suspected that some question templates whose queries involve SPARQL operators were not created properly, so we decided to leave those operators out at first and generate only basic templates.
Basic question form
| | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Beginning of Question | When is the | Who is the | What is the | Where is the |
| Beginning of Query | select ?x | select ?x | select ?x | select ?x |
| Ending of Query | } | } | } | } |
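To make the pairing concrete, here is a minimal sketch of how such a template pair could be instantiated; the `<A>`/`<B>` placeholder convention and the helper below are mine for illustration, not the actual NSpM template format:

```python
# Illustrative sketch: a basic template pairs an NL pattern with a query
# pattern, and both sides are filled with the same slot values.
nl_template = "who is the <A> of <B>?"
query_template = "select ?x where { <B> dbo:<A> ?x }"

def instantiate(nl, query, a, b):
    """Fill the <A> and <B> slots on both the NL and the SPARQL side."""
    fill = lambda s: s.replace("<A>", a).replace("<B>", b)
    return fill(nl), fill(query)

question, sparql = instantiate(nl_template, query_template, "author", "dbr:Dracula")
print(question)  # who is the author of dbr:Dracula?
print(sparql)    # select ?x where { dbr:Dracula dbo:author ?x }
```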
SPARQL Operator COUNT
COUNT in simple questions
NL sentence | SPARQL query |
---|---|
1. how many airlines are there? | select (count(*) as ?c) { ?x a dbo:Airline } |
2. how many destinations does “Airline” have? | select (count(*) as ?c) { “Airline” dbo:destination ?x } |
Non-question sentences
NL sentence | SPARQL query |
---|---|
List all airlines based in Germany. | select ?x { ?x a dbo:Airline ; dbo:headquarter dbr:Germany } |
List all airlines whose destination is China. | select ?x { ?x a dbo:Airline ; dbo:destination dbr:China } OR select ?x { ?x dbo:destination dbr:China } |
COUNT in compositional questions
NL sentence | SPARQL query |
---|---|
how many airlines are based in Germany? | select (count(*) as ?c) { ?x a dbo:Airline ; dbo:headquarter dbr:Germany } |
how many airlines have China as their destination? | select (count(*) as ?c) { ?x a dbo:Airline ; dbo:destination dbr:China } |
Other SPARQL Operators
1. Highest rank — ORDER BY
NL sentence | SPARQL query | Answer |
---|---|---|
1. which place has the highest areaRank? | SELECT ?uri WHERE { ?uri dbp:areaRank 1 } | A list of 73 URIs |
1. which place of India has the highest areaRank? | SELECT ?uri WHERE { ?uri dbp:areaRank 1 ; dbo:country dbr:India } | A list of 31 URIs (weird!) |
2. which place has the highest areaRank? | SELECT ?uri WHERE { ?uri dbp:areaRank ?rank } ORDER BY ASC(?rank) LIMIT 1 | http://dbpedia.org/resource/Phnom_Penh (?rank = dbr:Administrative_divisions_of_Cambodia; not an xsd:integer) |
2. which place of India has the highest areaRank? | SELECT ?uri WHERE { ?uri dbp:areaRank ?rank ; dbo:country dbr:India } ORDER BY ASC(?rank) LIMIT 1 | http://dbpedia.org/resource/Tavanur |
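The Phnom_Penh answer shows why this is fragile: `dbp:areaRank` values are not always typed numbers. One possible guard, sketched here with SPARQLWrapper rather than anything our generator currently does, is to filter on the literal's datatype before ordering:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch: keep only integer-typed ranks, so resources such as
# dbr:Administrative_divisions_of_Cambodia cannot win the ordering.
query = """
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?uri WHERE {
  ?uri dbp:areaRank ?rank
  FILTER ( datatype(?rank) = xsd:integer )
}
ORDER BY ASC(?rank) LIMIT 1
"""

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["uri"]["value"])
```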
2. Greatest amount — ORDER BY
NL sentence | SPARQL query | Answer |
---|---|---|
1. which place has the greatest amount of populationTotal? | SELECT ?uri WHERE { ?uri dbp:populationTotal ?population } ORDER BY DESC(?population) LIMIT 1 | http://dbpedia.org/resource/Giza_Governorate |
1. which place of Portugal has the greatest amount of populationTotal? | SELECT ?uri WHERE { ?uri dbp:populationTotal ?population ; dbo:country dbr:Portugal } ORDER BY DESC(?population) LIMIT 1 | http://dbpedia.org/resource/Algueirão–Mem_Martins |
3. Comparative — FILTER
NL sentence | SPARQL query | Answer |
---|---|---|
1. which people / who / list all people whose height is more than 2 metres? | SELECT ?uri WHERE { ?uri dbo:height ?n FILTER ( ?n > 2.0 ) } | A list of 16245 URIs |
1. list all basketball players that are taller than 2 metres. | SELECT ?uri WHERE { ?uri a dbo:BasketballPlayer ; dbo:height ?n FILTER ( ?n > 2.0 ) } | A list of 4629 URIs |
We can see that the COUNT operator is relatively simple, because it involves only a single variable. And based on the non-question sentence “List …”, the SPARQL query is also simple to transfer: just change `select ?x` to `select (count(*) as ?c)`.
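A minimal sketch of this rewrite (the helper name `to_count_query` is mine, not part of NSpM):

```python
def to_count_query(query: str) -> str:
    """Turn a basic listing query into its COUNT counterpart.

    Assumes the query starts with the literal prefix 'select ?x',
    as the basic templates above do.
    """
    return query.replace("select ?x", "select (count(*) as ?c)", 1)

listing = "select ?x { ?x a dbo:Airline ; dbo:headquarter dbr:Germany }"
print(to_count_query(listing))
# select (count(*) as ?c) { ?x a dbo:Airline ; dbo:headquarter dbr:Germany }
```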
So far, we have properly constructed two structures of natural language questions: one counts the entities themselves and the other counts an attribute of an entity.
For other operators like ORDER BY and FILTER, we should first identify the orderable attributes, i.e. those whose values are numeric literals (xsd:integer and xsd:double).
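One possible way to find them is to ask the endpoint which properties of a class carry numeric literals; this is only a sketch of the idea, not code we already run:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch: list properties of dbo:Place whose values are numeric literals,
# since only those can safely be used in ORDER BY and FILTER templates.
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?p WHERE {
  ?s a dbo:Place ; ?p ?o .
  FILTER ( datatype(?o) = xsd:integer || datatype(?o) = xsd:double )
} LIMIT 50
"""

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["p"]["value"])
```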
Remote Server
This week, I’ve also got the support of computation from our community. Although it’s just a CPU server, it has 96 GB of RAM and at least, it could train or fine-tune the GloVe model as the co-occurrence matrix would consume a lot of memory. In plus, the download speed is not so fast (~50kb/s).
Issue of vocabulary building
Again, I’ve encountered the issue of inconsistency between the vocabulary and embeddings.
Examples of inconsistency between `vocab` and `embed`:

`build_vocab.py`:
“RKDartists” -> “RK” and “Dartists”
“EntrezGene” -> “Entrez” and “Gene”

`Glove_finetune.py`:
“RKDartists” -> “rkdartists”
“EntrezGene” -> “entrezgene”
The file `vocab.en` is generated by `build_vocab.py` using `learn.preprocessing.VocabularyProcessor` from TensorFlow’s contrib module, whereas the embeddings are trained with GloVe using simple tokenisation and normalisation. As `build_vocab.py` is part of NSpM, I didn’t change it last time and just removed from `vocab.en` the tokens that don’t appear in the embeddings. But this time it seems to affect the training: the BLEU score only reaches 55 at around 40,000 steps (as of writing). So I decided to use my own vocabulary instead of the one from `build_vocab.py`, and we will see whether that helps with the BLEU score.
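For reference, the default tokenizer of `VocabularyProcessor` splits tokens on case boundaries, which explains the splits above. Here is a small demo of the two behaviours; the regex is the one I recall from the TF 1.x contrib source, so treat it as an assumption and check your installed version:

```python
import re

# Assumed tokenizer regex of tf.contrib's VocabularyProcessor (from the
# TF 1.x source, quoted from memory - verify against your installation).
TOKENIZER_RE = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+",
                          re.UNICODE)

def vocab_tokens(text):
    """How build_vocab.py sees a token: split on case boundaries."""
    return TOKENIZER_RE.findall(text)

def glove_tokens(text):
    """How the GloVe side sees the same token: just lowercase it."""
    return [text.lower()]

for word in ["RKDartists", "EntrezGene"]:
    print(word, vocab_tokens(word), glove_tokens(word))
# RKDartists ['RK', 'Dartists'] ['rkdartists']
# EntrezGene ['Entrez', 'Gene'] ['entrezgene']
```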
Figure: BLEU score up to 100k training steps.
The precision of the translation is 0.895 (25355/28316), measured with `diff -U 0 test.sparql output_test | grep ^@ | wc -l`. Precision here means that 25355 out of the 28316 queries in the test set are translated correctly.
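An equivalent check in Python; note that `diff -U 0` counts hunks, so adjacent wrong lines are merged into one, which makes this per-line count a close but not identical measure:

```python
# Count exact line-level matches between the reference queries and the
# model output, using the same files as the diff command above.
with open("test.sparql") as ref, open("output_test") as out:
    pairs = list(zip(ref, out))

correct = sum(r.strip() == o.strip() for r, o in pairs)
print(f"precision: {correct}/{len(pairs)} = {correct / len(pairs):.3f}")
```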
We can see that the BLEU score is much lower than last time, while the precision remains acceptable. The low BLEU score could be due to the inconsistency described above, and I’m training another, consistent model so that we can see the difference.
Conclusion
This week I’ve finished many small tasks. I was going to try to fine-tune a Sentence-BERT model with the sentence representation and similarity problems, but as we have a much faster USE model and we are going to annotate our own corpus, I just stopped to handle other small issues mentioned above.