Knowledge Engineering Ⅴ | KG Construction from Semi-structured data
Fact Extraction
<subject, predicate, object>
subject: an entity
predicate: property/relation
object: an entity or a literal value
from Infobox
from categories
YAGO category maps
| Number | Regular Expression | Target Relation | Note |
| :---: | :---: | :---: | :---: |
| 1 | `([0-9]{3,4}) births` | bornOnDate | `[0-9]{3,4}` matches 3 to 4 digits |
| 2 | `.* established in ([0-9]{3,4})` | establishedOnDate | Any content may appear before "established" |
| 3 | `([0-9]{3,4}) books\|novels` | writtenOnDate | Followed by either of the two words |
| 4 | `[A-Za-z]+ (.*) winners` | hasWonPrize | `[A-Za-z]+` matches one word in letters of either case |

The content matched inside `()` is what gets extracted.
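Each match yields a fact of the form <article entity, target relation, content captured by ()>.
If a category name matches more than one map, every matching map is applied and each resulting fact is checked for correctness.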
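
The category maps above can be sketched with Python's `re` module. This is a minimal illustration, not YAGO's actual implementation: the function name `extract_facts` and the sample entities are hypothetical, `.*` is assumed wherever the table intends a run of arbitrary characters, and the alternation in row 3 is wrapped in a non-capturing group so that only the year is captured.

```python
import re

# Category maps from the table. Assumptions: ".*" stands for "any content",
# and "books|novels" is grouped so that the year is the only captured value.
CATEGORY_MAPS = [
    (re.compile(r"([0-9]{3,4}) births"), "bornOnDate"),
    (re.compile(r".* established in ([0-9]{3,4})"), "establishedOnDate"),
    (re.compile(r"([0-9]{3,4}) (?:books|novels)"), "writtenOnDate"),
    (re.compile(r"[A-Za-z]+ (.*) winners"), "hasWonPrize"),
]


def extract_facts(entity, categories):
    """Return <entity, relation, captured value> triples for every matching map."""
    facts = []
    for category in categories:
        # Try every map: a category may match more than one of them.
        for pattern, relation in CATEGORY_MAPS:
            match = pattern.fullmatch(category)
            if match:
                facts.append((entity, relation, match.group(1)))
    return facts


print(extract_facts("Albert Einstein", ["1879 births", "German physicists"]))
# [('Albert Einstein', 'bornOnDate', '1879')]
```

Every map is tried against every category, so a category that satisfies several maps produces several candidate facts, which can then be checked separately.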
Type Inference
from Infoboxes
- Names of infobox templates are extracted as entity types (DBpedia).
- An infobox provides property-value pairs {<p_1, v_1>, ..., <p_k, v_k>}.
- If a property p_i belongs to the concept set C and its value v_i belongs to the instance set I, then p_i is the type of v_i.
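
The infobox rule can be sketched as a simple filter over property-value pairs. The function name `infer_types` and the sample infobox, concept set, and instance set below are hypothetical and only serve as an illustration.

```python
def infer_types(infobox, concepts, instances):
    """Return (instance, type) pairs derived from infobox property-value pairs.

    A property p is taken as the type of its value v only when p is a known
    concept and v is a known instance.
    """
    return [(v, p) for p, v in infobox if p in concepts and v in instances]


# Hypothetical data for illustration only.
infobox = [("director", "Steven Spielberg"), ("running time", "127 minutes")]
concepts = {"director"}
instances = {"Steven Spielberg"}

print(infer_types(infobox, concepts, instances))
# [('Steven Spielberg', 'director')]
```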
from Categories
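- YAGO parses each category name, such as "Naturalized citizens of Germany", into a pre-modifier (Naturalized), a head (citizens), and a post-modifier (Germany).
- If the head is plural and not an acronym, it becomes a candidate class.
- If a candidate class contains any word from a given list, it is filtered out.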
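
The two filtering rules on the head can be sketched as follows, assuming the category name has already been parsed and the head is available as a string. The plural and acronym checks are deliberately naive stand-ins (a trailing "s" and all-uppercase spelling), not YAGO's actual linguistic tools, and the `blocked_words` values are hypothetical placeholders.

```python
def is_candidate_class(head, blocked_words):
    """Decide whether a category head should be kept as a class.

    Naive assumptions: an acronym is an all-uppercase string, and a plural
    is a word ending in "s" but not in "ss".
    """
    lowered = head.lower()
    is_acronym = head.isupper()
    is_plural = lowered.endswith("s") and not lowered.endswith("ss")
    if is_acronym or not is_plural:
        return False
    # Filter out candidates that contain any word from the given list.
    return not any(word in blocked_words for word in lowered.split())


blocked = {"stubs"}  # hypothetical placeholder list
print(is_candidate_class("citizens", blocked))  # True
print(is_candidate_class("Germany", blocked))   # False
```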
from text
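Type extraction: extract the type of each entity from the first sentence of its article (usually with many extraction rules), yielding words that denote types.
Type disambiguation: match each entity with its types by linking every extracted type word to a specific concept.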
Taxonomy Induction
A taxonomy is a directed acyclic graph consisting of is-a relations between entities (conceptual entities and individual entities).
Taxonomy induction therefore amounts to removing the relations that are not is-a.
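Start by extracting all Wikipedia content.
Pre-Cleaning. Remove entries that match a given list of words.
Syntax-based Method. Keep a pair when both names end in the same head word (Computer Scientists is-a Scientists); drop it when they do not (Crime comics not-is-a Crime).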
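
The syntax-based check can be sketched as a comparison of head words. Taking the last whitespace-separated token as the head is a simplifying assumption, and the function names `head` and `syntax_based_is_a` are hypothetical.

```python
def head(name):
    """Return the last token of a name, used here as a rough lexical head."""
    return name.lower().split()[-1]


def syntax_based_is_a(sub, sup):
    """Keep the pair as is-a only when both names share the same head word."""
    return head(sub) == head(sup)


print(syntax_based_is_a("Computer Scientists", "Scientists"))  # True
print(syntax_based_is_a("Crime comics", "Crime"))              # False
```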
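Connectivity-based Method. Article titles are usually also category names, so the relations at a lower level of the hierarchy can be judged from the level above it.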
Lexico-Syntactic based Method.
| is-a relation patterns | not-is-a relation patterns |
| --- | --- |
| `NP2,? (such as\|like\|, especially) NP* NP1` | `NP2's NP1` |
| `such NP2 as NP* NP1` | `NP1 in NP2` |
| `NP1 NP* (and\|or\|, like) other NP2` | `NP2 with NP1` |
| `NP1, one of det_pl NP2` (det_pl: e.g. the) | `NP2 contain(s\|ed\|ing) NP1` |
| `NP1, det_sg NP2 rel_pron` (det_sg: e.g. a) | `NP1 of NP2` |
| `NP2 like NP* NP1` | `NP1 are? used in NP2` |
| `including` | `NP2 ha(s\|ve\|d) NP1` |
| | `consist of` |
| | `is also known as` |

Example: (plants, grass) is an is-a pair.
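
A few of the patterns above can be sketched as regular expressions instantiated for a concrete pair (NP1, NP2). This is a simplified illustration under several assumptions: only a subset of the patterns is covered, any intervening noun phrases are ignored, both noun phrases are matched as literal strings, and the function name `pattern_counts` is hypothetical. How the two counts are combined into a decision is left open here.

```python
import re


def pattern_counts(np1, np2, text):
    """Count is-a and not-is-a pattern matches for the pair (np1, np2) in text."""
    a, b = re.escape(np1), re.escape(np2)
    is_a_patterns = [
        rf"{b}(?:,? such as|,? like|, especially) {a}",
        rf"such {b} as {a}",
        rf"{a}(?: and| or|, like) other {b}",
    ]
    not_is_a_patterns = [
        rf"{b}'s {a}",
        rf"{a} in {b}",
        rf"{b} with {a}",
        rf"{b} contain(?:s|ed|ing) {a}",
        rf"{a} of {b}",
        rf"{b} ha(?:s|ve|d) {a}",
    ]

    def count(patterns):
        # Case-insensitive count of all non-overlapping matches.
        return sum(len(re.findall(p, text, flags=re.IGNORECASE)) for p in patterns)

    return count(is_a_patterns), count(not_is_a_patterns)


text = "Plants such as grass need light; grass and other plants need water."
print(pattern_counts("grass", "plants", text))  # (2, 0)
```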
Inference-based Method. The is-a relation is transitive: if A is-a B and B is-a C, then A is-a C.
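
Transitivity can be used to derive additional is-a pairs from known ones. The sketch below computes the transitive closure of a set of (child, parent) pairs with a simple fixed-point loop; the function name `transitive_closure` and the sample pairs are hypothetical, and the loop favors clarity over efficiency.

```python
def transitive_closure(pairs):
    """Return all is-a pairs implied by transitivity over the given pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a is-a b and b is-a d implies a is-a d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


known = {("Computer Scientists", "Scientists"), ("Scientists", "People")}
print(("Computer Scientists", "People") in transitive_closure(known))  # True
```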