GSoc2021-DBpedia-NeuralExtraction

The goals of this week are to 1) continue applying the classifiers from the Week 6 blog; 2) measure the performance of different classifiers.

Bootstrapping - version 1

Data passing procedures (continued)

First, it is necessary to understand the data passing procedure of bootstrapping. Please see the following figure:

Fig 1: The data passing procedure

In Fig 1, each row represents one iteration: the trained models (in gray boxes) and the corresponding datasets (in blue boxes) on which each model is evaluated. In Steps 1 and 2, the labeling model produces integrated predictions for the positive examples. In Step 3, the predicted positives form the augmented train set, which is used in Step 4 to learn a classifier, e.g. an LSTM (discussed in the Week 5 blog). As shown in the red box, we will focus only on the last two steps in this blog. A minimal sketch of the loop follows.
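For concreteness, here is a small Python sketch of the bootstrapping loop described above, assuming a labeling model with a per-example `score` method and a classifier factory; all names and interfaces are illustrative assumptions, not the actual implementation.

```python
def bootstrap(labeling_model, make_classifier, labeled, unlabeled,
              threshold=0.8, n_iterations=5):
    """labeled: list of (example, label) pairs; unlabeled: list of examples.
    Both arguments and the threshold are illustrative placeholders."""
    augmented = list(labeled)
    classifier = None
    for _ in range(n_iterations):
        # Steps 1-2: the labeling model scores every unlabeled example.
        scored = [(x, labeling_model.score(x)) for x in unlabeled]
        # Step 3: high-confidence predictions become new positive examples
        # and join the augmented train set.
        augmented += [(x, 1) for x, s in scored if s >= threshold]
        unlabeled = [x for x, s in scored if s < threshold]
        # Step 4: learn a classifier (e.g. the LSTM) on the augmented set.
        classifier = make_classifier()
        classifier.fit([x for x, _ in augmented], [y for _, y in augmented])
    return classifier, augmented
```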

Augmented train set

Since we augment the dataset with high-confidence predicted positive examples, it is interesting to examine how the number of positive examples increases across iterations.

Fig 2: Increase in the number of positive examples across iterations

In Fig 2, we plot the delta (newly added) number of positive examples against the iteration count on the X-axis. Iteration 1 corresponds to the source data without bootstrapping. We notice large increases at iteration 2 for confidence thresholds 0.7 and 0.8. For a confidence threshold of 0.9, however, there is no improvement across iterations. We can conclude that confidence thresholds of 0.7 and 0.8 are more effective for tagging positive examples to augment the dataset. The sketch below shows one way such counts can be computed.
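As an illustration, the following hypothetical helper counts how many new positives each confidence threshold tags per iteration, given the labeling model's scores; the function name and input layout are assumptions, not part of the actual pipeline.

```python
import numpy as np

def delta_positives(scores_per_iteration, thresholds=(0.7, 0.8, 0.9)):
    """scores_per_iteration: one array of labeling-model confidence
    scores per bootstrapping iteration (illustrative input layout)."""
    deltas = {t: [] for t in thresholds}
    for t in thresholds:
        previous = 0
        for scores in scores_per_iteration:
            # Cumulative count of examples at or above the threshold.
            total = int(np.sum(np.asarray(scores) >= t))
            # Newly tagged positives in this iteration.
            deltas[t].append(total - previous)
            previous = total
    return deltas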

Classifiers

We trained the LSTM on the augmented train set and evaluated it with four metrics:

Fig 3: LSTM classifier performance

Unfortunately, most of the accuracy and precision values are around or below 0.5, which for a binary classifier is essentially no better than random guessing.
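For reference, here is a minimal sketch of the training and evaluation step, assuming padded integer sequences and binary labels. The architecture and hyperparameters are illustrative, not the exact Week 5 model; of the four metrics, only accuracy and precision appear in the text, so recall and F1 here are an assumed completion.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def train_and_evaluate_lstm(X_train, y_train, X_test, y_test,
                            vocab_size=10000, embed_dim=64, hidden_dim=64):
    # A small binary LSTM classifier over padded integer sequences.
    model = Sequential([
        Embedding(vocab_size, embed_dim),
        LSTM(hidden_dim),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(X_train, y_train, epochs=5, batch_size=32, verbose=0)

    # Threshold the sigmoid outputs at 0.5 to obtain hard labels.
    y_pred = (model.predict(X_test) >= 0.5).astype(int).ravel()
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
```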

In addition, we also trained other classifiers, e.g. Logistic Regression and SVM, for comparison.

Fig 4: Comparison of the three classifiers

From Fig 4, we observe that Logistic Regression and SVM show fluctuating performance, yet still outperform the LSTM.
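A minimal sketch of this comparison with scikit-learn, assuming fixed-length feature vectors (e.g. averaged embeddings or TF-IDF) rather than the raw sequences fed to the LSTM; the feature choice and hyperparameters are assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score

def compare_baselines(X_train, y_train, X_test, y_test):
    # Fit both baselines on the same augmented train set and
    # report the metrics highlighted in the text.
    results = {}
    for name, clf in [("LogisticRegression", LogisticRegression(max_iter=1000)),
                      ("SVM", LinearSVC())]:
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred),
        }
    return results
```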

In brief, we suspect that the LSTM has not converged well, which calls for further exploration of hyperparameter tuning. Meanwhile, it would also be interesting to examine Logistic Regression in more detail.