This article does not cover Entity Linking, Relation Extraction, or Event Extraction in depth, since they were already discussed in the NLP notes. We focus on two tasks: General is-a Relation Extraction and Terminology/Term Extraction.

General is-a Relation Extraction

An is-a relation is the semantic relationship between a more specific term (the hyponym) and a more general term (the hypernym). For example, "dog" is a hyponym of "animal", and "animal" is a hypernym of "dog".

Property: transitivity (if A is-a B and B is-a C, then A is-a C).
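Transitivity means that new is-a pairs can be inferred from known ones. The following is a minimal sketch, assuming pairs are stored as (hyponym, hypernym) tuples; the function name `transitive_closure` and the sample pairs are illustrative, not from the original notes.

```python
def transitive_closure(pairs):
    """Return all is-a pairs implied by transitivity.

    `pairs` is an iterable of (hyponym, hypernym) tuples.
    """
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Snapshot the set so we can add to it while scanning.
        for a, b in list(closure):
            for c, d in list(closure):
                # a is-a b and b is-a d  =>  a is-a d
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical example data.
known = {("poodle", "dog"), ("dog", "mammal"), ("mammal", "animal")}
inferred = transitive_closure(known)
print(sorted(inferred - known))
```

This brute-force loop is fine for small sets; for a large taxonomy a graph traversal per node would scale better.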

Pattern-based Methods:

ID Pattern
1 NP such as {NP,}* {(or | and)} NP
2 NP {,} (including | especially) {NP,}* {(or | and)} NP
3 NP {, NP}* {,} (and | or) other NP
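Patterns of this kind are commonly called Hearst patterns. As a rough sketch of how pattern 1 ("such as") could be applied, the code below uses a regular expression. It assumes, for simplicity, that every noun phrase is a single alphabetic word; a real system would use a chunker or parser to find NP boundaries. The names `NP`, `SUCH_AS`, and `extract_such_as` are illustrative.

```python
import re

# Assumption: a noun phrase is one alphabetic word that is not "and"/"or".
NP = r"(?!(?:and|or)\b)[A-Za-z]+"

# Pattern 1: NP such as {NP,}* {(or | and)} NP
SUCH_AS = re.compile(
    rf"\b({NP}) such as ({NP}(?:, {NP})*(?:,? (?:and|or) {NP})?)"
)


def extract_such_as(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        # Split the enumerated list on commas and on the final "and"/"or".
        hyponyms = re.split(r",? (?:and|or) |, ", match.group(2))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs


print(extract_such_as("He likes fruits such as apples, bananas, and oranges."))
```

Each extracted pair reads as "hyponym is-a hypernym". Patterns 2 and 3 could be handled the same way with their own expressions.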

Terminology/Term Extraction

Statistical approaches

Termhood: measures how relevant a term is to a domain.

TF-IDF (measures how important a word is to a document)

TF (term frequency) = Number of occurrences of the word in a document / Total number of words in the document

IDF (inverse document frequency) = log(Number of documents in the corpus / (Number of documents containing the word + 1)). IDF reflects how distinctive a word is and reduces the influence of very common words such as "a", "an", and "the".

TF-IDF=TF*IDF
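The definitions above can be sketched directly in Python. This is a minimal illustration, assuming each document is already tokenized into a list of words and that the "+ 1" belongs in the denominator of IDF; the function names and the toy corpus are made up for the example.

```python
import math
from collections import Counter


def tf(word, doc):
    """Occurrences of `word` in `doc` divided by the length of `doc`."""
    return Counter(doc)[word] / len(doc)


def idf(word, corpus):
    """log(N / (df + 1)), where df is the number of documents containing `word`."""
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (df + 1))


def tf_idf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)


# Hypothetical toy corpus of tokenized documents.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "ontology", "defines", "the", "term"],
    ["a", "bird", "flew"],
]
doc = corpus[2]
print(tf_idf("the", doc, corpus))       # frequent everywhere, low score
print(tf_idf("ontology", doc, corpus))  # rare in the corpus, higher score
```

Note that with this "+ 1" variant, a word that occurs in every document gets a slightly negative IDF; libraries often use other smoothing schemes to avoid that.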

Unithood: measures the correlation between two variables X and Y.

MI (Mutual information)

It describes how much the uncertainty of X is reduced once Y is known.
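In the standard formulation, I(X;Y) = sum over x, y of p(x,y) * log( p(x,y) / (p(x) p(y)) ), which equals H(X) - H(X|Y). The sketch below computes it from a joint probability table, assuming the table is given as a dict mapping (x, y) to p(x, y) and using base-2 logarithms (bits); the function name and sample distributions are illustrative.

```python
import math


def mutual_information(joint):
    """Mutual information in bits from a dict {(x, y): p(x, y)}."""
    px, py = {}, {}
    # Marginalize the joint distribution to get p(x) and p(y).
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0  # terms with zero probability contribute nothing
    )


# Independent variables: knowing Y tells us nothing about X.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Fully dependent variables: knowing Y determines X.
dependent = {(0, 0): 0.5, (1, 1): 0.5}

print(mutual_information(independent))
print(mutual_information(dependent))
```

For term extraction, X and Y would typically be the events "word x occurs" and "word y occurs", with probabilities estimated from corpus counts.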

PMI