# Cheatsheet

This describes the measures listed by `nel list-measures`.
| Measure | Key | Filter | Aggregator |
|---|---|---|---|
| **Mention evaluation measures** | | | |
| strong_mention_match | span | NA | sets |
| strong_typed_mention_match | span,type | NA | sets |
| strong_linked_mention_match | span | is_linked | sets |
| **Linking evaluation measures** | | | |
| strong_link_match | span,kbid | is_linked | sets |
| strong_nil_match | span | is_nil | sets |
| strong_all_match | span,kbid | NA | sets |
| strong_typed_link_match | span,type,kbid | is_linked | sets |
| strong_typed_nil_match | span,type | is_nil | sets |
| strong_typed_all_match | span,type,kbid | NA | sets |
| **Document-level tagging evaluation** | | | |
| entity_match | docid,kbid | is_linked | sets |
| **Clustering evaluation measures** | | | |
| muc | span | NA | muc |
| b_cubed | span | NA | b_cubed |
| b_cubed_plus | span,kbid | NA | b_cubed |
| entity_ceaf | span | NA | entity_ceaf |
| mention_ceaf | span | NA | mention_ceaf |
| pairwise | span | NA | pairwise |
By default, measures are aggregated over the corpus as a whole. Using the `--by-doc` and/or `--by-type` flags to `nel evaluate` will instead aggregate measures per document or entity type, and then report per-doc/type and overall (micro- and macro-averaged) performance. Note that the micro-average does not equate to whole-corpus aggregation for coreference aggregates, but represents clustering performance disregarding cross-document coreference.
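To illustrate the difference between the two averages, here is a minimal sketch with made-up per-document counts (this is illustrative, not neleval's implementation):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical (tp, fp, fn) counts for two documents
docs = [(8, 2, 0), (1, 0, 9)]

# Micro-average: pool counts across documents, then score once.
tp, fp, fn = (sum(c) for c in zip(*docs))
micro = prf(tp, fp, fn)

# Macro-average: score each document, then average the scores.
per_doc = [prf(*c) for c in docs]
macro = tuple(sum(x) / len(per_doc) for x in zip(*per_doc))

print(micro)
print(macro)
```

The micro-average is dominated by the larger document's counts, while the macro-average weights every document equally regardless of size.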
The key defines how system output is matched against the gold standard.

| Key | Description |
|---|---|
| docid | Document identifier must be the same |
| start | Start offset must be the same |
| end | End offset must be the same |
| span | Shorthand for (docid, start, end) |
| type | Entity type must be the same |
| kbid | KB identifier must be the same, or must both be NIL |
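As a sketch of how a key drives matching, the following projects annotations onto key fields and compares the resulting tuple sets, roughly as the `sets` aggregator does. The annotation dicts and values are illustrative, not neleval's data format:

```python
def key_tuples(annotations, key):
    """Project each annotation onto the key fields;
    'span' expands to (docid, start, end)."""
    fields = []
    for k in key:
        fields.extend(["docid", "start", "end"] if k == "span" else [k])
    return {tuple(a[f] for f in fields) for a in annotations}

gold = [{"docid": "d1", "start": 0, "end": 5, "kbid": "Q1"},
        {"docid": "d1", "start": 10, "end": 15, "kbid": "Q2"}]
system = [{"docid": "d1", "start": 0, "end": 5, "kbid": "Q1"},
          {"docid": "d1", "start": 10, "end": 15, "kbid": "Q9"}]

# Key span,kbid: both the span and the KB identifier must agree.
g = key_tuples(gold, ["span", "kbid"])
s = key_tuples(system, ["span", "kbid"])
tp, fp, fn = len(g & s), len(s - g), len(g - s)
print(tp, fp, fn)  # 1 1 1
```

Under the weaker `span` key the second mention would count as a true positive; adding `kbid` to the key turns the mislinked mention into both a false positive and a false negative.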
The filter defines which mentions are removed before precision, recall and F-score are calculated.

| Filter | Description |
|---|---|
| is_linked | Only keep mentions that are resolved to known KB identifiers |
| is_nil | Only keep mentions that are not resolved to known KB identifiers |
| is_first | Only keep the first mention in a document with a given KB/NIL identifier |

Note that the `is_first` filter is intended to provide clustering evaluation similar to the `entity_match` evaluation of linking performance.
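The three filters can be sketched as follows on illustrative data (neleval's own annotation objects differ; here a NIL mention is one whose `kbid` is `None`):

```python
def is_linked(m):
    """Mention resolved to a known KB identifier."""
    return m["kbid"] is not None

def is_nil(m):
    """Mention not resolved to a known KB identifier."""
    return m["kbid"] is None

def is_first(mentions):
    """Keep only the first mention of each (docid, kbid) pair."""
    seen, kept = set(), []
    for m in sorted(mentions, key=lambda m: (m["docid"], m["start"])):
        ident = (m["docid"], m["kbid"])
        if ident not in seen:
            seen.add(ident)
            kept.append(m)
    return kept

mentions = [
    {"docid": "d1", "start": 0, "kbid": "Q1"},
    {"docid": "d1", "start": 7, "kbid": "Q1"},   # repeat of Q1 in d1
    {"docid": "d1", "start": 20, "kbid": None},  # NIL mention
]
print([m["start"] for m in filter(is_linked, mentions)])  # [0, 7]
print([m["start"] for m in is_first(mentions)])           # [0, 20]
```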
The aggregator defines how corpus-level scores are computed from individual instances.

| Aggregator | Description |
|---|---|
| **Mention, linking and tagging evaluation** | |
| sets | Take the unique set of tuples as defined by the key across the gold and system data, then micro-average document-level tp, fp and fn counts |
| overlap-{max,sum}{max,sum} | For tasks in which the gold and system must produce non-overlapping annotations, these scores account for partial overlap between gold and system mentions, as defined for the LoReHLT evaluation |
| **Clustering evaluation** | |
| muc | Count the total number of edits required to translate from the gold to the system clustering |
| b_cubed | Assess the proportion of each mention's cluster that is shared between the gold and system clusterings |
| entity_ceaf | Calculate the optimal one-to-one alignment between system and gold clusters based on the Dice coefficient, and report the total aligned score relative to aligning each cluster with itself |
| mention_ceaf | Calculate the optimal one-to-one alignment between system and gold clusters based on the number of overlapping mentions, and report the total aligned score relative to aligning each cluster with itself |
| pairwise | The proportion of truly co-clustered mention pairs that are predicted as such, etc., as used in computing BLANC |
| pairwise_negative | The proportion of truly non-co-clustered mention pairs that are predicted as such, etc., as used in computing BLANC |
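As one concrete example, the pairwise aggregator can be sketched on toy clusterings, where a clustering maps each mention to a cluster identifier (illustrative data, not neleval's implementation):

```python
from itertools import combinations

def coclustered_pairs(clustering):
    """All unordered mention pairs assigned to the same cluster."""
    return {frozenset((a, b))
            for a, b in combinations(clustering, 2)
            if clustering[a] == clustering[b]}

# Gold puts all three mentions in one entity; the system splits off m3.
gold = {"m1": "e1", "m2": "e1", "m3": "e1"}
system = {"m1": "e1", "m2": "e1", "m3": "e2"}

g, s = coclustered_pairs(gold), coclustered_pairs(system)
precision = len(g & s) / len(s) if s else 0.0
recall = len(g & s) / len(g) if g else 0.0
print(precision, recall)  # 1.0 0.3333333333333333
```

Every pair the system co-clusters is correct (precision 1.0), but it recovers only one of the three gold pairs (recall 1/3). `pairwise_negative` applies the same idea to pairs that are *not* co-clustered.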