Fact Extraction

<subject, predicate, object>

subject: an entity

predicate: property/relation

object: an entity or a literal value

  1. from Infobox

  2. from categories

    YAGO category maps

    | Number | Regular Expression | Target Relation | Notes |
    | :---: | :---: | :---: | :---: |
    | 1 | `([0-9]{3,4}) births` | bornOnDate | 3 to 4 digits, each in 0-9 |
    | 2 | `.* established in ([0-9]{3,4})` | establishedOnDate | any text may come first |
    | 3 | `([0-9]{3,4}) books\|novels` | writtenOnDate | followed by either of the two words |
    | 4 | `[A-Za-z]+ (.*) winners` | hasWonPrize | one word of upper- or lower-case letters |

    The parenthesized group captures the extracted value, giving the triple <article entity, target relation, captured value>.

    If a category matches more than one map, apply every matching map and check each resulting fact for correctness.
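
The category maps can be sketched in a few lines of Python. This is only an illustration: the patterns are adapted from the maps above (the spacing and the grouping of `books|novels` are assumptions), and `Fact`, `CATEGORY_MAPS`, `extract_facts`, and the sample category names are hypothetical, not part of YAGO.

```python
import re
from typing import List, NamedTuple


class Fact(NamedTuple):
    subject: str    # the article entity
    predicate: str  # the target relation
    obj: str        # the text captured by the parenthesized group


# (pattern, target relation) pairs; spacing and grouping are assumptions.
CATEGORY_MAPS = [
    (re.compile(r"([0-9]{3,4}) births"), "bornOnDate"),
    (re.compile(r".* established in ([0-9]{3,4})"), "establishedOnDate"),
    (re.compile(r"([0-9]{3,4}) (?:books|novels)"), "writtenOnDate"),
    (re.compile(r"[A-Za-z]+ (.*) winners"), "hasWonPrize"),
]


def extract_facts(entity: str, categories: List[str]) -> List[Fact]:
    """Apply every map to every category and collect candidate facts."""
    facts = []
    for category in categories:
        for pattern, relation in CATEGORY_MAPS:
            match = pattern.fullmatch(category)
            if match:
                # A category may match several maps; keep every candidate
                # so that each one can be checked for correctness later.
                facts.append(Fact(entity, relation, match.group(1)))
    return facts


facts = extract_facts(
    "Albert Einstein",
    ["1879 births", "Nobel Prize in Physics winners"],
)
```

Here `facts` holds one `bornOnDate` candidate with object `1879` and one `hasWonPrize` candidate; the correctness check itself is left out of the sketch.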

Type Inference

from Infoboxes

  1. Names of infobox templates are extracted as entity types (DBpedia).
  2. Each infobox also provides a set of property-value pairs {<p_1,v_1>,...,<p_k,v_k>}.
  3. For a pair <p_i,v_i>, if p_i belongs to the concept set C and v_i belongs to the instance set I, then p_i is the type of v_i.
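
A minimal sketch of these three steps, assuming the infobox has already been parsed into a template name and a list of (property, value) pairs. The function names, the "Infobox " prefix handling, and the sample data are illustrative assumptions.

```python
from typing import Iterable, List, Set, Tuple


def type_from_template(template_name: str) -> str:
    """Use the template name (minus an assumed 'Infobox ' prefix) as the entity type."""
    prefix = "Infobox "
    if template_name.startswith(prefix):
        return template_name[len(prefix):]
    return template_name


def types_from_pairs(
    pairs: Iterable[Tuple[str, str]],
    concepts: Set[str],
    instances: Set[str],
) -> List[Tuple[str, str]]:
    """Return (instance, type) for each pair whose property is a concept
    and whose value is a known instance."""
    return [(v, p) for p, v in pairs if p in concepts and v in instances]


# Hypothetical sample data.
pairs = [("director", "Christopher Nolan"), ("runtime", "148 minutes")]
concepts = {"director", "film"}
instances = {"Christopher Nolan", "Inception"}

entity_type = type_from_template("Infobox film")
value_types = types_from_pairs(pairs, concepts, instances)
```

In this example only the `director` pair qualifies, because `runtime` is not in the concept set and `148 minutes` is a literal rather than an instance.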

from Categories

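  1. YAGO parses each category name, such as "Naturalized citizens of Germany", into a pre-modifier (Naturalized), a head (citizens), and a post-modifier (Germany).
  2. If the head is not an acronym and is plural, it becomes a candidate class.
  3. If the candidate class contains any word from a given list, it is filtered out.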
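
The three rules can be approximated as follows. This is a rough sketch, not YAGO's actual parser: splitting at the first preposition, testing plurality with a trailing "s", and the contents of `PREPOSITIONS` and `STOP_WORDS` are all simplifying assumptions.

```python
from typing import Optional, Tuple

PREPOSITIONS = {"of", "in", "from", "by", "for", "at", "on"}  # assumed list
STOP_WORDS = {"stubs", "articles", "pages"}                   # hypothetical filter list


def parse_category(name: str) -> Tuple[str, str, str]:
    """Split a category name into (pre-modifier, head, post-modifier)."""
    words = name.split()
    # Everything before the first preposition is the noun group.
    split_at = next(
        (i for i, w in enumerate(words) if w.lower() in PREPOSITIONS),
        len(words),
    )
    left, right = words[:split_at], words[split_at + 1:]
    if not left:
        return "", "", " ".join(right)
    return " ".join(left[:-1]), left[-1], " ".join(right)


def candidate_class(name: str) -> Optional[str]:
    """Return the head as a candidate class, or None if a rule rejects it."""
    pre, head, _post = parse_category(name)
    if not head or head.isupper():           # empty head or acronym
        return None
    if not head.lower().endswith("s"):       # crude plural test
        return None
    words = (pre + " " + head).lower().split()
    if any(w in STOP_WORDS for w in words):  # filter on given words
        return None
    return head
```

For example, `candidate_class("Naturalized citizens of Germany")` returns `"citizens"`, while `candidate_class("Physics stubs")` returns `None` because of the filter list. A real system would use a proper noun-group parser and a morphological plural check.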

from text

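  1. Type extraction. Extract each entity's type from the first sentence of its article (usually with many extraction rules), obtaining the words that denote a type.

  2. Type disambiguation. Match the entity against the extracted type words and assign each type word a specific concept.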
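
To illustrate the type extraction step only, here is a sketch with a single hand-written rule ("X is/was a ..."). Real systems use many rules; the regular expressions and function names below are assumptions made for this example, and disambiguation is not covered.

```python
import re
from typing import Optional

# One illustrative rule: a copula followed by an article and a noun phrase.
_COPULA = re.compile(r"\b(?:is|was|are|were)\s+(?:a|an|the)\s+([^.,;]+)", re.IGNORECASE)
# Words assumed to mark the end of the type phrase.
_PHRASE_END = re.compile(r"\s+(?:who|that|which|of|in|from|and)\b", re.IGNORECASE)


def extract_type_phrase(first_sentence: str) -> Optional[str]:
    """Return the phrase after 'is a/an/the', cut at the first clause marker."""
    match = _COPULA.search(first_sentence)
    if match is None:
        return None
    return _PHRASE_END.split(match.group(1))[0].strip()


def type_word(phrase: str) -> str:
    """Take the last word of the phrase as the type-denoting word."""
    return phrase.split()[-1]


sentence = (
    "Albert Einstein was a German-born theoretical physicist "
    "who developed the theory of relativity."
)
phrase = extract_type_phrase(sentence)
word = type_word(phrase)
```

With this sentence, `phrase` is `"German-born theoretical physicist"` and `word` is `"physicist"`. The disambiguation step would then map `physicist` to a specific concept in the target taxonomy.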

Taxonomy Induction

A taxonomy is a directed acyclic graph consisting of is-a relations between entities (conceptual entities and individual entities).

Taxonomy induction therefore amounts to removing the not-is-a relations.

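  1. Extract all of the content from Wikipedia.

  2. Pre-Cleaning. Remove certain given words.

  3. Syntax-based Method. Keep a pair when both names end in the same word (Computer Scientists is-a Scientists) and drop it when they do not (Crime comics not-is-a Crime).

  4. Connectivity-based Method. Article titles usually match category names, and relations at a lower level of the hierarchy can be judged from the level above.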

  5. Lexico-Syntactic based Method.

    | is-a relation patterns | not-is-a relation patterns |
    | --- | --- |
    | `NP2,? (such as\|like\|, especially) NP* NP1` | `NP2's NP1` |
    | `such NP2 as NP* NP1` | `NP1 in NP2` |
    | `NP1 NP* (and\|or\|, like) other NP2` | `NP2 with NP1` |
    | `NP1, one of det_pl NP2` (e.g. det_pl = the) | `NP2 contain(s\|ed\|ing) NP1` |
    | `NP1, det_sg NP2 rel_pron` (e.g. det_sg = a) | `NP1 of NP2` |
    | `NP2 like NP* NP1` | `NP1 are? used in NP2` |
    | including | `NP2 ha(s\|ve\|d) NP1` |
    | | consist of |
    | | is also known as |

    Example: (plants, grass) is an is-a pair (grass is-a plant).

  6. Inference-based Method. The is-a relation is transitive, so further is-a links can be inferred from existing ones.
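
Three of the steps above (syntax-based, lexico-syntactic, inference-based) can be sketched together. Everything here is a simplified illustration under stated assumptions: the head is taken to be the last word, noun phrases are single words, only the "such as" pattern is implemented, and all function names are hypothetical.

```python
import re
from typing import Optional, Set, Tuple

Edge = Tuple[str, str]  # (sub, super), read as "sub is-a super"


def same_head(sub: str, sup: str) -> bool:
    """Syntax-based check: both names end in the same word."""
    return sub.split()[-1].lower() == sup.split()[-1].lower()


# Lexico-syntactic check for one pattern only: "NP2 such as NP1",
# with each noun phrase reduced to a single word (an assumption).
_SUCH_AS = re.compile(r"\b(\w+),? such as (\w+)")


def such_as_pair(text: str) -> Optional[Edge]:
    """Return (NP1, NP2), meaning NP1 is-a NP2, if the pattern occurs."""
    match = _SUCH_AS.search(text)
    if match is None:
        return None
    return match.group(2), match.group(1)


def transitive_closure(edges: Set[Edge]) -> Set[Edge]:
    """Inference-based step: add (a, c) whenever (a, b) and (b, c) hold."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


edges = {("Computer Scientists", "Scientists"), ("Scientists", "People")}
inferred = transitive_closure(edges) - edges
```

Here `same_head("Computer Scientists", "Scientists")` is true and `same_head("Crime comics", "Crime")` is false, `such_as_pair("plants such as grass")` yields `("grass", "plants")`, and `inferred` contains the single new edge `("Computer Scientists", "People")`.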