GSoc2021-DBpedia-NeuralExtraction logo GSoc2021-DBpedia-NeuralExtraction

The target of this week is to 1) prepare the candidate pairs of each sentence before applying labeling functions (debugging operation for the week 3’s blog); 2) augment the train set by Label Model, which merges the labeling functions together in the Snorkel Framework.

Candidates pairs preparation

The applied environment of labeling functions is based on separated sentences, which includes a pair of Noun Phrases(NPs) as candidates for each sentence. Then the labeling functions will apply to judge whether this pair of candidate holds the causative relation according to the context of this sentence.

However, it is difficult to choose the candidates pairs. For example, in one sentence, we could find multiple NPs which bring together many candidates pairs. It turns to be important for us to select the most interesting candidate pairs. We resort to two main approaches: 1) identify entities of a sentence; 2) define the rules of candidate pairs selection.

Entities identification of a sentence

Let me introduce the tools first. The spaCy is an open-source software library for NLP, it has a pipeline Spacy Entity Linker. This pipeline operates by matching potential candidates from each sentence (subject, object, prepositional phrase, compounds, etc.) to aliases from Wikidata. Furthermore, it allows to easily find the category behind each entity (e.g. “banana” is type “food” OR “Microsoft” is type “company”). It is worthy to mention that the knowledge bases that they used was derived from this dataset: https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data.

For example, we have a sentence:

“bazin’s praise of the film went beyond film theory and reflected his own philosophy towards life itself”.

From this pipeline, we can get information of this sentence:

  NPs Wikidata_label Wikidata_ID Description
0 praise praise 1208425 positive evaluations made by a person of another’s products, performances, or attributes
1 bazin Bazin 21501536 family name
2 film film 11424 sequence of images that give the impression of movement
3 film theory film theory 28793 academic discipline that aims to explore the essence of the cinema and provides conceptual frameworks for understanding film’s relationship to reality, the other arts, individual viewers, and society
4 philosophy philosophy 5891 intellectual and/or logical study of general and fundamental problems
5 life life 3 matter capable of extracting energy from the environment for replication

In next section, i will take about our rules to extract the interesting pairs among those multiple NPs.

Candidates pairs selection from entities

From the last table, we observe that there are 6 NPs tagged as wikidata label. It happens that, sometimes, the corresponding wikidata label has different string as that of original NP. Because this method for choosing an entity given different possible matches is max-prior. It achieves around 70% accuracy on predicting the correct entities behind link descriptions on wikipedia.

Due to the limitation of this method, the first rule of candidate extraction is to ensure that the corresponding wikidata label has the same string as that of original NP. Also we believe that, the longer the NPs, the more precise information the NPs express. Hence, after satisfying the first rule, in the second rule, we select the top-2 longest NPs as the candidate pairs.

For instance, the candidate pair for previous sentence is:

film theory ; philosophy

Finally, we succeeded to extract 410,304 pairs from the 5,038 not-empty wikipages and prepare for the labeling functions.

Train set augmentation

We consider the previous dataset as the whole dataset and manage to divide them into three sub-datasets:

The goal of this step is to judge whether the candidate pairs in Train Set are cause-effect pairs. In fact, we could evaluate them directly by some experts. The limitation lies on the time-consuming and expensive human-efforts.

The general solution here is 1) extract the seed cause-effect pairs; 2) evaluate the Dev Set and Test Set by using the seed pairs; 3) define the labeling functions ;4) train a Label Model based on Dev Set and apply Label Model to judge the candidate pairs in Train Set. More precisely, in this section, we will focus on the mentioned 4-th step.

Label model for train set

from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, np.array(Y_dev), n_epochs=5000, log_freq=500, seed=12345)

In previous code, cardinality indicates the number of classes, the binary label has 2 classes; L_train means the label tagged by the labeling functions; Y_dev means the real label comparing to the seed cause-effect pairs.

Then we try to evaluate this Label Model:

from snorkel.analysis import metric_score
from snorkel.utils import probs_to_preds

probs_dev = label_model.predict_proba(L_dev)
preds_dev = probs_to_preds(probs_dev)

# change the results only containing <0 and 1>
Y_dev_01 = np.array([x if x==1 else 0 for x in Y_dev])

print(f"Label model f1 score: {metric_score(Y_dev_01, preds_dev, probs=probs_dev, metric='f1')}")
print(f"Label model roc-auc: {metric_score(Y_dev_01, preds_dev, probs=probs_dev, metric='roc_auc')}")

The result shows that:

Label model f1 score: 0.0; Label model roc-auc: 0.5325569400037615

The low F1 score is due to that, the positive samples are too little: 230 over 81830 in preds_dev and 260 over 81800 in Y_dev_01.

The augmented train set

We have partitioned 60% dataset into the train set. However, not all of the train set are interesting for those labeling functions. Here, we try to filter out training data points which did not receive a label from any Labeling Functions. Finally, we discovered that 27,211 sentences of train set are tagged as positive, instead of only hundreds of sentences before augmentation.

from snorkel.labeling import filter_unlabeled_dataframe

probs_train = label_model.predict_proba(L_train)
df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(X=df_train, y=probs_train, L=L_train)