This week, I have principally worked on annotation and on the BERT classifier that will replace or reinforce the Paraphraser’s evaluation score.
We have shared a spreadsheet where we annotate the paraphrases with three labels:
- -1: a bad paraphrase that does not keep the same meaning;
- 0: a paraphrase that keeps the same meaning but does not change the structure (only lexical modifications);
- 1: a good paraphrase that keeps the same meaning while also changing the structure.
There are 9,500 paraphrases in total. We do not expect to annotate them all, but hopefully we can annotate some for each ontology term and collect enough samples of good paraphrases.
The annotation work is in fact the preparation of the data set for BERT fine-tuning. I mainly followed this guide, which builds a binary BERT sequence classifier with PyTorch.
Below I list some of the difficulties and modifications I ran into while reproducing the guide, starting with the most obvious one: we face a three-class classification problem, so the final output layer needs three units.
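As a rough sketch of that change: the fine-tuning head is just a dense layer on top of BERT’s pooled [CLS] vector, now with three output units. The shapes below are illustrative stand-ins; the real head lives inside the pre-trained model.

```python
import numpy as np

# Sketch: the classification head maps BERT's 768-dim pooled [CLS]
# vector to 3 logits, one per label (-1, 0, 1).
rng = np.random.default_rng(0)
hidden_size, num_labels = 768, 3
W = rng.normal(size=(hidden_size, num_labels))
b = np.zeros(num_labels)

pooled = rng.normal(size=(4, hidden_size))  # stand-in for BERT's output
logits = pooled @ W + b
print(logits.shape)  # (4, 3)
```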
The BERT model has two constraints:
- All sentences must be padded or truncated to a single, fixed length.
- The maximum sentence length is 512 tokens.
The first constraint calls for the BERT tokenizer, which automatically appends the special [PAD] token and marks those positions as inert, so that they do not participate in classification.
The second constraint does not affect us, because our question templates are never longer than 64 tokens, even with WordPiece embeddings (BERT uses WordPiece to handle the OOV problem).
Other work that needs to be done:
- Add the special tokens [CLS] and [SEP] to the start and end of each sentence.
- Explicitly differentiate real tokens from padding tokens with the “attention mask”.
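Put together, the preprocessing amounts to something like this hand-rolled sketch. The `encode` helper is a simplified stand-in for what the BERT tokenizer does for real; only the special-token ids (101, 102, 0 in bert-base) are actual BERT values.

```python
# Simplified stand-in for BERT tokenization: truncate, add special
# tokens, pad to a fixed length, and build the attention mask.
CLS, SEP, PAD = 101, 102, 0   # BERT's [CLS], [SEP], [PAD] token ids
MAX_LEN = 64                  # our questions never exceed this

def encode(token_ids, max_len=MAX_LEN):
    ids = [CLS] + token_ids[: max_len - 2] + [SEP]   # truncate + specials
    mask = [1] * len(ids)                            # 1 = real token
    pad = max_len - len(ids)
    return ids + [PAD] * pad, mask + [0] * pad       # 0 = padding, ignored

ids, mask = encode([2043, 2003, 1996, 4182])  # placeholder token ids
print(len(ids), sum(mask))  # 64 6
```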
As of writing this post, the proportion of the three labels [-1, 0, 1] is 1 : 1.5 : 6, which shows the class imbalance of our data set.
I didn’t pay much attention to this at first, but then found that the classifier only predicted -1 and 0 on my test-set samples.
Then I came up with two ideas to address this problem:
- The first follows the idea of over- and under-sampling: we could manually add some good paraphrases to each group of candidates and remove some not-so-good ones.
- The second is to add class weights to the loss function.
I didn’t choose the first one, because I would like to see how well the Paraphraser can do with the least human intelligence and intervention. To use a customised cross-entropy loss, I created a new model class that inherits from the pre-trained BERT classifier and overrides the loss function.
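A minimal sketch of the weighted loss, assuming class weights inversely proportional to the observed 1 : 1.5 : 6 label frequencies (the exact weighting scheme and the subclass wiring are simplified here):

```python
import torch
import torch.nn as nn

# Weights inversely proportional to the 1 : 1.5 : 6 label frequencies
# (labels shifted to 0, 1, 2), so rare classes count more in the loss.
counts = torch.tensor([1.0, 1.5, 6.0])
weights = counts.sum() / (len(counts) * counts)

# In the real model this loss replaces the default one inside the
# subclassed BERT classifier's forward pass.
loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 3)           # stand-in for the classifier output
labels = torch.randint(0, 3, (8,))   # shifted gold labels
loss = loss_fn(logits, labels)
```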
In the end, the classifier gives acceptable results:
- When is the birth date of “A” ? / When is the birthday of “A” ? → 0
- When is the birth date of “A” ? / When was “A” born ? → 1
- When is the birth date of “A” ? / Where does “A” come from ? → -1
- When is the birth date of “A” ? / What is the birth name of “A” ? → 1
These sentences never appeared in the data set; except for the last one, the classifier has done a great job.
So I decided to use the classifier to reinforce the evaluation score until we have enough annotated samples to train a good classifier.
Now the scoring system works as follows:
The predicted label (-1, 0, 1) is added directly to the previous score.
During training an error occurred, “Status: device-side assert triggered”, caused by the label -1: negative labels are forbidden in such a model. So before training on the data set, I add 1 to all the labels and subtract 1 from the predictions.
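The shift is trivial but easy to apply in the wrong direction, so here it is spelled out:

```python
# CUDA loss kernels expect class indices in 0..num_labels-1, so the
# annotation labels {-1, 0, 1} are shifted up before training and
# shifted back down after prediction.
def to_model_label(ann):   # -1 / 0 / 1  ->  0 / 1 / 2
    return ann + 1

def to_annotation(pred):   # 0 / 1 / 2  ->  -1 / 0 / 1
    return pred - 1

print([to_model_label(x) for x in [-1, 0, 1]])  # [0, 1, 2]
```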
A pre-trained BERT model is really helpful when we only have limited samples for this classification task (about 1,000 samples). It reaches a macro-recall of 0.8, and hopefully, as more annotations are done, the classifier will work even better.
Next week, I will start training the NMT model with paraphrases, but before that, I will also tackle the inconsistency issue and add SPARQL operators to the template set.