Introduction
This week, the code of the Paraphraser has been optimised by Tommaso and the process has become much faster. So instead of working on Sentence-BERT, I have completed a number of small tasks.
Basic Templates
We suspected that some question templates whose queries involve SPARQL operators were not created properly, so we decided to leave those operators out at first and generate only basic templates.
Basic question form
| | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Beginning of Question | When is the | Who is the | What is the | Where is the |
| Beginning of Query | select ?x | select ?x | select ?x | select ?x |
| Ending of Query | } | } | } | } |
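To make the pairing concrete, here is a minimal sketch of how such a template pair could be instantiated; the `<A>`/`<B>` placeholder convention and the helper below are mine for illustration, not the actual NSpM template format:

```python
# Illustrative sketch: a basic template pairs an NL pattern with a query
# pattern, and both sides are filled with the same slot values.
nl_template = "who is the <A> of <B>?"
query_template = "select ?x where { <B> dbo:<A> ?x }"

def instantiate(nl, query, a, b):
    """Fill the <A> and <B> slots on both the NL and the SPARQL side."""
    fill = lambda s: s.replace("<A>", a).replace("<B>", b)
    return fill(nl), fill(query)

question, sparql = instantiate(nl_template, query_template, "author", "dbr:Dracula")
print(question)  # who is the author of dbr:Dracula?
print(sparql)    # select ?x where { dbr:Dracula dbo:author ?x }
```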
SPARQL Operator COUNT
COUNT in simple questions
NL sentence | SPARQL query |
---|---|
1. how many airlines are there? | select (count(*) as ?c) { ?x a dbo:Airline } |
2. how many destinations does “Airline” have? | select (count(*) as ?c) { “Airline” dbo:destination ?x } |
Non-question sentences
NL sentence | SPARQL query |
---|---|
List all airlines based in Germany. | select ?x { ?x a dbo:Airline ; dbo:headquarter dbr:Germany } |
List all airlines whose destination is China. | select ?x { ?x a dbo:Airline ; dbo:destination dbr:China } OR select ?x { ?x dbo:destination dbr:China } |
COUNT in compositional questions
NL sentence | SPARQL query |
---|---|
how many airlines are based in Germany? | select (count(*) as ?c) { ?x a dbo:Airline ; dbo:headquarter dbr:Germany } |
how many airlines have China as their destination? | select (count(*) as ?c) { ?x a dbo:Airline ; dbo:destination dbr:China } |
Other SPARQL Operators
1. Highest rank — ORDER BY
NL sentence | SPARQL query | Answer |
---|---|---|
1. which place has the highest areaRank? | SELECT ?uri WHERE { ?uri dbp:areaRank 1 } | A list of 73 URIs |
1. which place of India has the highest areaRank? | SELECT ?uri WHERE { ?uri dbp:areaRank 1 ; dbo:country dbr:India } | A list of 31 URIs (weird!) |
2. which place has the highest areaRank? | SELECT ?uri WHERE { ?uri dbp:areaRank ?rank } ORDER BY ASC(?rank) LIMIT 1 | http://dbpedia.org/resource/Phnom_Penh (?rank = dbr:Administrative_divisions_of_Cambodia; not an xsd:integer) |
2. which place of India has the highest areaRank? | SELECT ?uri WHERE { ?uri dbp:areaRank ?rank ; dbo:country dbr:India } ORDER BY ASC(?rank) LIMIT 1 | http://dbpedia.org/resource/Tavanur |
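The Phnom_Penh answer shows why this is fragile: `dbp:areaRank` values are not always typed numbers. One possible guard, sketched here with SPARQLWrapper rather than anything our generator currently does, is to filter on the literal's datatype before ordering:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch: keep only integer-typed ranks, so resources such as
# dbr:Administrative_divisions_of_Cambodia cannot win the ordering.
query = """
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?uri WHERE {
  ?uri dbp:areaRank ?rank
  FILTER ( datatype(?rank) = xsd:integer )
}
ORDER BY ASC(?rank) LIMIT 1
"""

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["uri"]["value"])
```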
2. Greatest amount — ORDER BY
NL sentence | SPARQL query | Answer |
---|---|---|
1. which place has the greatest amount of populationTotal? | SELECT ?uri WHERE { ?uri dbp:populationTotal ?population } ORDER BY DESC(?population) LIMIT 1 | http://dbpedia.org/resource/Giza_Governorate |
1. which place of Portugal has the greatest amount of populationTotal? | SELECT ?uri WHERE { ?uri dbp:populationTotal ?population ; dbo:country dbr:Portugal } ORDER BY DESC(?population) LIMIT 1 | http://dbpedia.org/resource/Algueirão–Mem_Martins |
3. Comparative — FILTER
NL sentence | SPARQL query | Answer |
---|---|---|
1. which people / who / list all people whose height is more than 2 metres? | SELECT ?uri WHERE { ?uri dbo:height ?n FILTER ( ?n > 2.0 ) } | A list of 16245 URIs |
1. list all basketball players that are taller than 2 metres. | SELECT ?uri WHERE { ?uri a dbo:BasketballPlayer ; dbo:height ?n FILTER ( ?n > 2.0 ) } | A list of 4629 URIs |
We can see that the COUNT operator is relatively simple, because it involves only a single variable. And based on the non-question sentence “List …”, the SPARQL query is also simple to transfer: just change `select ?x` to `select (count(*) as ?c)`.
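A minimal sketch of this rewrite (the helper name `to_count_query` is mine, not part of NSpM):

```python
def to_count_query(query: str) -> str:
    """Turn a basic listing query into its COUNT counterpart.

    Assumes the query starts with the literal prefix 'select ?x',
    as the basic templates above do.
    """
    return query.replace("select ?x", "select (count(*) as ?c)", 1)

listing = "select ?x { ?x a dbo:Airline ; dbo:headquarter dbr:Germany }"
print(to_count_query(listing))
# select (count(*) as ?c) { ?x a dbo:Airline ; dbo:headquarter dbr:Germany }
```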
So far, we have properly constructed two structures of natural language questions: one counts the entities themselves and the other counts an attribute of an entity.
For other operators like ORDER BY and FILTER, we should first identify the orderable attributes, i.e. those whose values are numeric literals (xsd:integer and xsd:double).
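One possible way to find them is to ask the endpoint which properties of a class carry numeric literals; this is only a sketch of the idea, not code we already run:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch: list properties of dbo:Place whose values are numeric literals,
# since only those can safely be used in ORDER BY and FILTER templates.
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?p WHERE {
  ?s a dbo:Place ; ?p ?o .
  FILTER ( datatype(?o) = xsd:integer || datatype(?o) = xsd:double )
} LIMIT 50
"""

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["p"]["value"])
```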
Remote Server
This week, I’ve also got the support of computation from our community. Although it’s just a CPU server, it has 96 GB of RAM and at least, it could train or fine-tune the GloVe model as the co-occurrence matrix would consume a lot of memory. In plus, the download speed is not so fast (~50kb/s).
Issue of vocabulary building
Again, I’ve encountered the issue of inconsistency between the vocabulary and embeddings.
Examples of inconsistency between `vocab` and `embed`:

`build_vocab.py`:
“RKDartists” -> “RK” and “Dartists”
“EntrezGene” -> “Entrez” and “Gene”

`Glove_finetune.py`:
“RKDartists” -> “rkdartists”
“EntrezGene” -> “entrezgene”
The file `vocab.en` is generated by `build_vocab.py` using `learn.preprocessing.VocabularyProcessor` from TensorFlow’s contrib module, whereas the embeddings are trained with GloVe using simple tokenisation and normalisation. As `build_vocab.py` is part of NSpM, I didn’t change it last time and just removed from `vocab.en` the tokens that don’t appear in the embeddings. But this time it seems to affect the training: the BLEU score only reaches 55 at around 40,000 steps (as of writing). So I decided to use my own vocabulary instead of the one from `build_vocab.py`, and we will see whether that helps with the BLEU score.
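For reference, the default tokenizer of `VocabularyProcessor` splits tokens on case boundaries, which explains the splits above. Here is a small demo of the two behaviours; the regex is the one I recall from the TF 1.x contrib source, so treat it as an assumption and check your installed version:

```python
import re

# Assumed tokenizer regex of tf.contrib's VocabularyProcessor (from the
# TF 1.x source, quoted from memory - verify against your installation).
TOKENIZER_RE = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+",
                          re.UNICODE)

def vocab_tokens(text):
    """How build_vocab.py sees a token: split on case boundaries."""
    return TOKENIZER_RE.findall(text)

def glove_tokens(text):
    """How the GloVe side sees the same token: just lowercase it."""
    return [text.lower()]

for word in ["RKDartists", "EntrezGene"]:
    print(word, vocab_tokens(word), glove_tokens(word))
# RKDartists ['RK', 'Dartists'] ['rkdartists']
# EntrezGene ['Entrez', 'Gene'] ['entrezgene']
```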
Figure: BLEU score up to 100k training steps.
The precision of the translation is 0.895 (25355/28316), measured with `diff -U 0 test.sparql output_test | grep ^@ | wc -l`. Precision here means that 25355 out of the 28316 queries in the test set are translated correctly.
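An equivalent check in Python; note that `diff -U 0` counts hunks, so adjacent wrong lines are merged into one, which makes this per-line count a close but not identical measure:

```python
# Count exact line-level matches between the reference queries and the
# model output, using the same files as the diff command above.
with open("test.sparql") as ref, open("output_test") as out:
    pairs = list(zip(ref, out))

correct = sum(r.strip() == o.strip() for r, o in pairs)
print(f"precision: {correct}/{len(pairs)} = {correct / len(pairs):.3f}")
```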
We can see that the BLEU score is much lower than last time, while the precision remains acceptable. The low BLEU score could be due to the inconsistency described above, and I’m training another, consistent model so that we can see the difference.
Conclusion
This week I’ve finished many small tasks. I was going to try to fine-tune a Sentence-BERT model with the sentence representation and similarity problems, but as we have a much faster USE model and we are going to annotate our own corpus, I just stopped to handle other small issues mentioned above.