The targets of this week are to 1) prepare the new datasets of cause-effect relations, including a training set, a development set and a test set; 2) define the linguistic patterns and the distant supervision approach used to identify cause-effect pairs in text; and 3) evaluate the performance of the labeling functions over these datasets.
Prepare the new datasets of cause-effect relations
As we discussed in the Week 2 blog, we can acquire cause-effect pairs from two sources, i.e. SemEval2010 and WikiData. We use these two sources in different phases:
- Cause-effect pairs from SemEval2010: these pairs serve as seed pairs for selecting the text used to build the new datasets.
- Cause-effect pairs from WikiData: these pairs provide the extra knowledge that supports the distant supervision approach.
Source datasets
The source dataset is the original Wikimedia dump, which provides a large enough corpus to build the new datasets. Wikimedia provides public dumps of the wikis’ content and of related data (the Wikimedia dump mirrors can be found here; the available download tools are listed here).
In this blog, we use the Wikimedia dump enwiki-20210601-pages-articles-multistream.xml.bz2. Owing to computation limits, we only use the corpus under folder AA of this dump, which contains around 5,000 wiki pages whose content exceeds 500 characters.
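As a rough illustration of this preparation step, the sketch below collects the pages under folder AA, assuming the dump has already been extracted with WikiExtractor in JSON mode (one JSON object per line with a "text" field); the paths and the MIN_CHARS constant are placeholders, not the exact script we used.
# Sketch only: gather pages under folder AA of the extracted dump, keeping
# pages whose content exceeds 500 characters (paths are placeholders).
import glob
import json

MIN_CHARS = 500
pages = []
for path in sorted(glob.glob("text/AA/wiki_*")):   # WikiExtractor output files
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            if len(doc["text"]) > MIN_CHARS:
                pages.append(doc["text"])
print(f"Collected {len(pages)} pages")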
New datasets
For the purpose of experiments, we intend to prepare three datasets:
- Training set (df_train) includes the individual sentences.
- Development set (df_dev) includes the individual sentences plus a binary label per sentence, indicating whether both elements of a seed pair appear together in the sentence (used for development).
- Testing set (df_test) includes the individual sentences plus a binary label per sentence, indicating whether both elements of a seed pair appear together in the sentence (used for testing).
Here are some examples extracted from df_dev:
cause-effect pairs | sentence |
---|---|
[‘work’, ‘composer’] | italian composer sylvano bussotti, whose composing career spans from the 1930s to the first decade of the 21st century, wrote a solo work for bass in 1983 entitled “naked angel face per contrabbasso”. |
[‘work’, ‘composer’] | in 2004 italian double bassist and composer stefano scodanibbio made a double bass arrangement of luciano berio’s 2002 solo cello work “sequenza xiv” with the new title “sequenza xivb”. |
[‘drum’, ‘sound’] | this is a vigorous version of pizzicato where the strings are “slapped” against the fingerboard between the main notes of the bass line, producing a snare drum-like percussive sound. |
[‘shot’, ‘gun’] | during an interview with “nme” magazine, he shot and killed a squirrel with a pellet gun to prevent any further damage to his electrical system in the attic at the location the interview was held. |
[‘individuals’, ‘society’] | similarly, durkheim opines that in societies with more organic solidarity, the diversity of occupations is greater, and individuals depend on each other more, resulting in greater benefits to society as a whole. |
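As a sketch of how such a set could be assembled (assuming seed_pairs holds the SemEval2010 cause-effect pairs and pages holds the wiki texts collected above; the sentence splitting and the column names are simplifications of our actual preprocessing):
# Sketch only: label a sentence POSITIVE when both elements of a seed pair
# occur in it, NEGATIVE otherwise (`seed_pairs` and `pages` are assumed above).
import re
import pandas as pd

POSITIVE, NEGATIVE = 1, 0
rows = []
for page in pages:
    for sentence in re.split(r"(?<=[.!?])\s+", page.lower()):
        matched = [pair for pair in seed_pairs
                   if pair[0] in sentence and pair[1] in sentence]
        rows.append({"cause-effect pairs": list(matched[0]) if matched else None,
                     "sentence": sentence,
                     "label": POSITIVE if matched else NEGATIVE})
df_dev = pd.DataFrame(rows)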
Apply the labeling functions
Snorkel is a system for programmatically building and managing training datasets without manual labeling. It currently exposes several key programmatic operations; we will concentrate on labeling data, e.g. using heuristic linguistic rules or distant supervision techniques.
Define labeling functions
Labeling functions (LFs) are one of the core operators for building and managing training datasets programmatically in Snorkel. The basic idea is simple: a labeling function is a function that outputs a label for some subset of the training dataset; this style of labeling is often referred to as weak supervision.
Our goal for LF development is to create a high-quality set of training labels for our unlabeled dataset, not to label everything or to directly create a model for inference using the LFs.
LFs with linguistic rules
1) Apply the labeling function that checks for causal connectives appearing between the mentions (i.e. the cause-effect pairs in sentences).
# import the pre-defined functions to pre-process the data
from preprocessors import get_NPs_text, get_text_between, get_left_tokens, get_right_tokens
# invoke the labeling_function
from snorkel.labeling import labeling_function
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1
# Check for the `causal connectives` words appearing between the mentions
causalConnectives = {'for this reason', 'with the result that', 'because of', 'thanks to', 'due to',
                     'because', 'as', 'since', 'so', 'so that',
                     'the reason was that', 'is due to'}
# define the labeling function <lf_causalConnectives>
@labeling_function(resources=dict(causalConnectives=causalConnectives), pre=[get_text_between])
def lf_causalConnectives(x, causalConnectives):
    if len(causalConnectives.intersection(set(x.text_between))) > 0:
        return POSITIVE
    else:
        return ABSTAIN
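For reference, the preprocessors module imported above is our own helper code. A rough sketch of what get_text_between could look like as a Snorkel preprocessor is shown below; this is a hypothetical implementation that assumes each data point has a "sentence" column and a "cause-effect pairs" column as in the df_dev examples, and the actual module may differ.
# Hypothetical sketch of the `get_text_between` preprocessor (not the actual code).
from snorkel.preprocess import preprocessor

@preprocessor(memoize=True)
def get_text_between(x):
    ele1, ele2 = x["cause-effect pairs"]
    tokens = x["sentence"].split()
    if ele1 in tokens and ele2 in tokens:
        lo, hi = sorted((tokens.index(ele1), tokens.index(ele2)))
        x.text_between = tokens[lo + 1:hi]   # tokens between the two mentions
    else:
        x.text_between = []
    return x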
2) Apply the labeling function that checks for causative verbs appearing between the mentions (i.e. the cause-effect pairs in sentences).
# Check for the `causative verbs` words appearing between the mentions
causationVerbs = {'cause', 'lead to', 'bring about', 'generate', 'make', 'force', 'allow',
                  'kill', 'melt', 'dry', 'break', 'drop',
                  'poison', 'hang', 'punch', 'clean',
                  'blacken', 'sweeten', 'thicken', 'nullify', 'liquefy', 'verify',
                  'feed', 'die', 'eat'}
@labeling_function(resources=dict(causationVerbs=causationVerbs), pre=[get_text_between])
def lf_causationVerbs(x, causationVerbs):
    return POSITIVE if len(causationVerbs.intersection(set(x.text_between))) > 0 else ABSTAIN
3) Apply the labeling functions that check for causation adverbs appearing to the left or to the right of the mentions (i.e. the cause-effect pairs in sentences).
causationAdverbs = {'audibly', 'visibly',
                    'manifestly', 'publicly', 'conspicuously',
                    'successfully', 'plausibly', 'conveniently', 'amusingly', 'pleasantly',
                    'irrevocably', 'rudely',
                    'obediently', 'gratefully', 'consequently', 'painfully',
                    'mechanically', 'magically'}
# Check for the `causation adverbs` appearing in the left of mentions
@labeling_function(resources=dict(causationAdverbs=causationAdverbs), pre=[get_left_tokens])
def lf_causationAdverbs_left(x, causationAdverbs):
    if len(set(causationAdverbs).intersection(set(x.ele1_left_tokens))) > 0:
        return POSITIVE
    elif len(set(causationAdverbs).intersection(set(x.ele2_left_tokens))) > 0:
        return POSITIVE
    else:
        return ABSTAIN
# Check for the `causation adverbs` appearing in the right of mentions
@labeling_function(resources=dict(causationAdverbs=causationAdverbs), pre=[get_right_tokens])
def lf_causationAdverbs_right(x, causationAdverbs):
    if len(set(causationAdverbs).intersection(set(x.ele1_right_tokens))) > 0:
        return POSITIVE
    elif len(set(causationAdverbs).intersection(set(x.ele2_right_tokens))) > 0:
        return POSITIVE
    else:
        return ABSTAIN
LFs with distant supervision
In addition to labeling functions that encode pattern-matching heuristics, we can also write labeling functions that distantly supervise the training datasets. Here, we load a list of known cause-effect pairs and check whether the pair mentioned in a sentence matches one of these known pairs.
WikiData defines an interesting cause-effect relation: has_cause (P828). We use a SPARQL query to extract the tuples linked by this relation from the WikiData Query Service, with the following query:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
SELECT ?cause ?cause_label ?effect ?effect_label WHERE {
  ?cause wdt:P828 ?effect .
  OPTIONAL {
    ?cause rdfs:label ?cause_label FILTER (lang(?cause_label) = "en") .
    ?effect rdfs:label ?effect_label FILTER (lang(?effect_label) = "en") .
  }
}
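As a minimal sketch (not the exact preparation we used), the query can be run against the public WikiData Query Service endpoint with requests and the resulting labels collected into the known_CauseEffect_pairs set used below; WDQS pre-defines the standard prefixes, so the abbreviated query suffices here.
# Sketch only: fetch (cause_label, effect_label) pairs from WDQS and build the
# knowledge base for distant supervision.
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
query = """
SELECT ?cause ?cause_label ?effect ?effect_label WHERE {
  ?cause wdt:P828 ?effect .
  OPTIONAL {
    ?cause rdfs:label ?cause_label FILTER (lang(?cause_label) = "en") .
    ?effect rdfs:label ?effect_label FILTER (lang(?effect_label) = "en") .
  }
}
"""
response = requests.get(WDQS_ENDPOINT,
                        params={"query": query, "format": "json"},
                        headers={"User-Agent": "cause-effect-lf-demo/0.1"})
bindings = response.json()["results"]["bindings"]
known_CauseEffect_pairs = {
    (b["cause_label"]["value"], b["effect_label"]["value"])
    for b in bindings
    if "cause_label" in b and "effect_label" in b
}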
We’ll use a preprocessed snapshot of these pairs as our knowledge base for all labeling function development, namely known_CauseEffect_pairs. Then we can set up the distant supervision labeling function, which checks whether a cause-effect pair from WikiData appears in a sentence.
# Cause-effects from WikiData --> known_CauseEffect_pairs in this code
# Check whether the **Cause-effects from WikiData** appearing in sentences
@labeling_function(resources=dict(known_CauseEffect_pairs=known_CauseEffect_pairs), pre=[get_NPs_text])
def lf_distant_supervision(x, known_CauseEffect_pairs):
    p1, p2 = x.NPs
    if (p1, p2) in known_CauseEffect_pairs:
        return POSITIVE
    else:
        return ABSTAIN
Evaluate the labeling functions
After we have defined these labeling functions, we can apply them to the data:
from snorkel.labeling import PandasLFApplier
lfs = [
    lf_causalConnectives,
    lf_causationVerbs,
    lf_causationAdverbs_left,
    lf_causationAdverbs_right,
    lf_distant_supervision,
]
applier = PandasLFApplier(lfs)
L_dev = applier.apply(df_dev)
L_train = applier.apply(df_train)
Then we can evaluate the performance of the labeling functions using a set of summary statistics (a sketch of how to compute them follows the list below). The following descriptions of the summary statistics for multiple LFs are copied from this tutorial page:
- Polarity: The set of unique labels this LF outputs (excluding abstains)
- Coverage: The fraction of the dataset the LF labels
- Overlaps: The fraction of the dataset where this LF and at least one other LF label
- Conflicts: The fraction of the dataset where this LF and at least one other LF label and disagree
- Correct: The number of data points this LF labels correctly (if gold labels are provided)
- Incorrect: The number of data points this LF labels incorrectly (if gold labels are provided)
- Empirical Accuracy: The empirical accuracy of this LF (if gold labels are provided)
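A sketch of computing this summary with Snorkel's LFAnalysis; here Y_dev is assumed to come from a label column of df_dev, which is an assumption about our own dataset schema.
from snorkel.labeling import LFAnalysis

# Summary statistics on the development set, using its gold labels.
Y_dev = df_dev["label"].values   # assumed column name for the gold labels
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)

# On the unlabeled training set we can still inspect coverage, overlaps and conflicts.
LFAnalysis(L=L_train, lfs=lfs).lf_summary()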
Our preliminary experiment results are listed here (they still need further debugging):
LFs | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
---|---|---|---|---|---|---|---|---|
lf_causalConnectives | 0 | [1] | 0.0510498 | 0.000411692 | 0 | 90 | 34 | 0.725806 |
lf_causationVerbs | 1 | [1] | 0.00946892 | 0.000411692 | 0 | 16 | 7 | 0.695652 |
lf_causationAdverbs_left | 2 | [1] | 0.000823384 | 0 | 0 | 2 | 0 | 1 |
lf_causationAdverbs_right | 3 | [1] | 0.000411692 | 0 | 0 | 1 | 0 | 1 |
lf_distant_supervision | 4 | [] | 0 | 0 | 0 | 0 | 0 | 0 |