The target of this week is to 1) prepare the candidate pairs of each sentence before applying labeling functions (debugging operation for the week 3’s blog); 2) augment the train set by Label Model, which merges the labeling functions together in the Snorkel Framework.
Candidates pairs preparation
The applied environment of labeling functions is based on separated sentences, which includes a pair of Noun Phrases(NPs) as candidates for each sentence. Then the labeling functions will apply to judge whether this pair of candidate holds the causative relation according to the context of this sentence.
However, it is difficult to choose the candidates pairs. For example, in one sentence, we could find multiple NPs which bring together many candidates pairs. It turns to be important for us to select the most interesting candidate pairs. We resort to two main approaches: 1) identify entities of a sentence; 2) define the rules of candidate pairs selection.
Entities identification of a sentence
Let me introduce the tools first. The spaCy is an open-source software library for NLP, it has a pipeline Spacy Entity Linker. This pipeline operates by matching potential candidates from each sentence (subject, object, prepositional phrase, compounds, etc.) to aliases from Wikidata. Furthermore, it allows to easily find the category behind each entity (e.g. “banana” is type “food” OR “Microsoft” is type “company”). It is worthy to mention that the knowledge bases that they used was derived from this dataset: https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data.
For example, we have a sentence:
“bazin’s praise of the film went beyond film theory and reflected his own philosophy towards life itself”.
From this pipeline, we can get information of this sentence:
NPs | Wikidata_label | Wikidata_ID | Description | |
---|---|---|---|---|
0 | praise | praise | 1208425 | positive evaluations made by a person of another’s products, performances, or attributes |
1 | bazin | Bazin | 21501536 | family name |
2 | film | film | 11424 | sequence of images that give the impression of movement |
3 | film theory | film theory | 28793 | academic discipline that aims to explore the essence of the cinema and provides conceptual frameworks for understanding film’s relationship to reality, the other arts, individual viewers, and society |
4 | philosophy | philosophy | 5891 | intellectual and/or logical study of general and fundamental problems |
5 | life | life | 3 | matter capable of extracting energy from the environment for replication |
In next section, i will take about our rules to extract the interesting pairs among those multiple NPs.
Candidates pairs selection from entities
From the last table, we observe that there are 6 NPs tagged as wikidata label. It happens that, sometimes, the corresponding wikidata label has different string as that of original NP. Because this method for choosing an entity given different possible matches is max-prior. It achieves around 70% accuracy on predicting the correct entities behind link descriptions on wikipedia.
Due to the limitation of this method, the first rule of candidate extraction is to ensure that the corresponding wikidata label has the same string as that of original NP. Also we believe that, the longer the NPs, the more precise information the NPs express. Hence, after satisfying the first rule, in the second rule, we select the top-2 longest NPs as the candidate pairs.
For instance, the candidate pair for previous sentence is:
film theory ; philosophy
Finally, we succeeded to extract 410,304 pairs from the 5,038 not-empty wikipages and prepare for the labeling functions.
Train set augmentation
We consider the previous dataset as the whole dataset and manage to divide them into three sub-datasets:
- Train set (60%) (our targets)
- Dev set (20%)
- Test set (20%)
The goal of this step is to judge whether the candidate pairs in Train Set are cause-effect pairs. In fact, we could evaluate them directly by some experts. The limitation lies on the time-consuming and expensive human-efforts.
The general solution here is 1) extract the seed cause-effect pairs; 2) evaluate the Dev Set and Test Set by using the seed pairs; 3) define the labeling functions ;4) train a Label Model based on Dev Set and apply Label Model to judge the candidate pairs in Train Set. More precisely, in this section, we will focus on the mentioned 4-th step.
Label model for train set
from snorkel.labeling.model import LabelModel
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, np.array(Y_dev), n_epochs=5000, log_freq=500, seed=12345)
In previous code, cardinality indicates the number of classes, the binary label has 2 classes; L_train means the label tagged by the labeling functions; Y_dev means the real label comparing to the seed cause-effect pairs.
Then we try to evaluate this Label Model:
from snorkel.analysis import metric_score
from snorkel.utils import probs_to_preds
probs_dev = label_model.predict_proba(L_dev)
preds_dev = probs_to_preds(probs_dev)
# change the results only containing <0 and 1>
Y_dev_01 = np.array([x if x==1 else 0 for x in Y_dev])
print(f"Label model f1 score: {metric_score(Y_dev_01, preds_dev, probs=probs_dev, metric='f1')}")
print(f"Label model roc-auc: {metric_score(Y_dev_01, preds_dev, probs=probs_dev, metric='roc_auc')}")
The result shows that:
Label model f1 score: 0.0; Label model roc-auc: 0.5325569400037615
The low F1 score is due to that, the positive samples are too little: 230 over 81830 in preds_dev and 260 over 81800 in Y_dev_01.
The augmented train set
We have partitioned 60% dataset into the train set. However, not all of the train set are interesting for those labeling functions. Here, we try to filter out training data points which did not receive a label from any Labeling Functions. Finally, we discovered that 27,211 sentences of train set are tagged as positive, instead of only hundreds of sentences before augmentation.
from snorkel.labeling import filter_unlabeled_dataframe
probs_train = label_model.predict_proba(L_train)
df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(X=df_train, y=probs_train, L=L_train)