Knowledge Engineering Ⅳ | KG Construction from Structured data
Extracting knowledge from heterogeneous data sources to form a knowledge graph
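Basics of Relational Databases
In tables we encounter different representations of the same thing, the same representation used for different things, and information that is recorded more than once. We can extract some of the information in a table to form new tables.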
Database terms:
- A database is a collection of data
- Data is organized into one or more tables
- Each row is a record
- Each column is a field
- Set a data type for each field: Text, Number, Date/time, Currency, Yes/No
example:
RECORD | NAME | ROLE | TOWN | AGE |
---|---|---|---|---|
record 1 | Peter | farmer | Oxford | 18 |
record 2 | Mary | weaver | Winchester | 33 |
record 3 | Seth | drover | Bristol | 21 |
Joins between tables: Primary Key
- choose at least one field that only contains unique values
- relate two tables by primary keys and foreign keys
example:
Authorship |
---|
ID (Primary Key) |
Author (Foreign Key, references PersonID in the Person table below) |
Publication |
Person |
---|
PersonID (Primary Key) |
FirstName |
Date of birth |
Notes |
Database design workflow:
- Choose fields
- Give each field a data type
- Arrange the fields into tables
- Set primary key fields
- Draw relationships between tables
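The workflow above can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the table and field names follow the Authorship/Person example, while the column name `DateOfBirth` and all inserted rows are made-up assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled

# Fields, data types, primary keys, and the relationship between the two tables.
conn.executescript("""
CREATE TABLE Person (
    PersonID    INTEGER PRIMARY KEY,
    FirstName   TEXT,
    DateOfBirth TEXT,   -- stored as an ISO date string
    Notes       TEXT
);
CREATE TABLE Authorship (
    ID          INTEGER PRIMARY KEY,
    Author      INTEGER REFERENCES Person(PersonID),  -- foreign key
    Publication TEXT
);
""")

# Hypothetical sample rows, for illustration only.
conn.executemany(
    "INSERT INTO Person VALUES (?, ?, ?, ?)",
    [(1, "Peter", "1990-05-01", None), (2, "Mary", "1985-11-23", None)],
)
conn.executemany(
    "INSERT INTO Authorship VALUES (?, ?, ?)",
    [(1, 1, "Paper A"), (2, 2, "Paper B"), (3, 1, "Paper C")],
)

# Join the two tables through the foreign key / primary key pair.
rows = conn.execute("""
    SELECT a.Publication, p.FirstName
    FROM Authorship AS a
    JOIN Person AS p ON a.Author = p.PersonID
    ORDER BY a.ID
""").fetchall()
print(rows)
```

Storing the author as a key rather than a name means each person is recorded once, which avoids the duplicate and ambiguous entries mentioned at the start of this section.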
RDB2RDF: Direct Mapping & R2RML
map the content of Relational Databases to RDF
Direct Mapping
default automatic mapping of relational data to RDF
Map
- Base IRI for the whole graph/dataset, e.g. @base <http://foo.example/DB/>
- Table to class (one class per table)
- Table column to property, e.g. <People#ID>, <People#fname>
- Row with primary key to resource, e.g. <People/ID=7>
- Cell to literal value
- In addition, cell to IRI if there is a foreign key constraint, e.g. <People#ref-addr>
example:
People
PK | | → Addresses(ID) |
---|---|---|
ID | fname | addr |
7 | Bob | 18 |
8 | Sue | NULL |
Addresses
PK | | |
---|---|---|
ID | City | State |
18 | Cambridge | MA |
@base <http://foo.example/DB/>
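To make the mapping rules concrete, here is a minimal Python sketch that applies them to the People and Addresses rows of this example. It is a simplification, not a conforming Direct Mapping implementation: IRIs are written relative to the base, percent-encoding and datatypes are ignored, and the helper names `literal` and `direct_map` as well as the column spelling `fname` are assumptions.

```python
def literal(value):
    # Integers are written bare, everything else as a quoted string (simplified).
    return str(value) if isinstance(value, int) else f'"{value}"'


def direct_map(table, pk, rows, foreign_keys=None):
    """Return (subject, predicate, object) triples for one table.

    foreign_keys maps a column name to (target_table, target_pk).
    """
    foreign_keys = foreign_keys or {}
    triples = []
    for row in rows:
        subject = f"<{table}/{pk}={row[pk]}>"                 # row -> resource
        triples.append((subject, "rdf:type", f"<{table}>"))   # table -> class
        for column, value in row.items():
            if value is None:
                continue                                      # NULL cells yield no triple
            # column -> property, cell -> literal
            triples.append((subject, f"<{table}#{column}>", literal(value)))
            if column in foreign_keys:
                # foreign key -> extra reference triple pointing at the target row
                target_table, target_pk = foreign_keys[column]
                triples.append((
                    subject,
                    f"<{table}#ref-{column}>",
                    f"<{target_table}/{target_pk}={value}>",
                ))
    return triples


people = [
    {"ID": 7, "fname": "Bob", "addr": 18},
    {"ID": 8, "fname": "Sue", "addr": None},
]
addresses = [{"ID": 18, "City": "Cambridge", "State": "MA"}]

triples = direct_map("People", "ID", people, {"addr": ("Addresses", "ID")})
triples += direct_map("Addresses", "ID", addresses)

for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

Note that Sue's NULL address produces neither a literal triple nor a reference triple.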
R2RML
customizable language to map relational data to RDF
A triples map has three parts:
- the input logical table
- a subject map
- several predicate-object maps
example:
Student
ID | sname | major |
---|---|---|
001 | Wang | 211 |
002 | Wu | 201 |
Major
ID | mname | address |
---|---|---|
211 | AI | CS_building |
<http://data.example.com/student/001> rdf:type ex:Student ;
<#TripleMap1>
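The three parts of a triples map can be imitated in plain Python to show how they interact. This is not R2RML syntax and not an R2RML processor: the dictionary layout, the function name `apply_triples_map`, and the predicate `ex:name` are illustrative assumptions. Only the subject IRI pattern and the class `ex:Student` come from the example.

```python
# Rows of the Student logical table from the example.
student_rows = [
    {"ID": "001", "sname": "Wang", "major": "211"},
    {"ID": "002", "sname": "Wu", "major": "201"},
]

# A dictionary standing in for a triples map (hypothetical layout).
triples_map = {
    "logical_table": "Student",
    "subject_template": "http://data.example.com/student/{ID}",
    "subject_class": "ex:Student",
    # (predicate, column) pairs; ex:name is an assumed predicate.
    "predicate_object_maps": [("ex:name", "sname")],
}


def apply_triples_map(tmap, rows):
    """Generate triples for every row of the logical table."""
    triples = []
    for row in rows:
        # Subject map: fill the template with values from the row.
        subject = "<" + tmap["subject_template"].format(**row) + ">"
        triples.append((subject, "rdf:type", tmap["subject_class"]))
        # Predicate-object maps: one triple per mapped, non-NULL column.
        for predicate, column in tmap["predicate_object_maps"]:
            if row[column] is not None:
                triples.append((subject, predicate, f'"{row[column]}"'))
    return triples


student_triples = apply_triples_map(triples_map, student_rows)
for s, p, o in student_triples:
    print(s, p, o, ".")
```

Unlike Direct Mapping, the IRI pattern, the class, and the predicates are chosen by the mapping author rather than derived from table and column names.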
Triple Extraction from Relational Web Tables
A relational table describes a set of entities in the core column(s) along with their attributes in the remaining columns.
Entity linking: Mention to Entity
Mapping the string mentions in table cells to their referent entities in a given knowledge base.
Candidate Generation
Dictionary-based method
Collect all hyperlinked text (anchor text) from web pages and rank the linked entities by how often each occurs.
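A rough sketch of such a dictionary, assuming the anchor texts and their link targets have already been harvested. The `(anchor text, target entity)` pairs below are invented for illustration, and `dictionary_candidates` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Hypothetical (anchor text, link target) pairs harvested from web pages.
anchors = [
    ("Cambridge", "Cambridge,_England"),
    ("Cambridge", "Cambridge,_England"),
    ("Cambridge", "Cambridge,_Massachusetts"),
    ("cambridge", "Cambridge,_England"),
    ("Oxford", "Oxford"),
]

# mention text -> Counter of entities it was linked to
dictionary = defaultdict(Counter)
for text, target in anchors:
    dictionary[text.lower()][target] += 1


def dictionary_candidates(mention, k=2):
    """Return up to k entities, most frequently linked first."""
    counts = dictionary.get(mention.lower(), Counter())
    return [entity for entity, _ in counts.most_common(k)]


print(dictionary_candidates("Cambridge"))
```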
String similarity based method
Compute the similarity between mentions and entities.
Levenshtein distance
The minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
example: "wtx" → "yty", LD = 2; "wtx" → "yt", LD = 2
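The distance can be computed with the standard dynamic-programming recurrence. The sketch below keeps only two rows of the table at a time; the function name `levenshtein` is just a convenient label.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein("wtx", "yty"))  # 2
print(levenshtein("wtx", "yt"))   # 2
```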
Jaccard similarity
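If A and B are sentences, split them into words; if they are single words, split them into character n-grams. The similarity is the size of the intersection of the two token sets divided by the size of their union.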
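A small sketch of both tokenisations. The choice of n = 2 (bigrams) and the helper names are assumptions made for the example.

```python
def word_tokens(sentence):
    """Split a sentence into a set of lower-cased words."""
    return set(sentence.lower().split())


def char_ngrams(word, n=2):
    """Split a single word into its set of character n-grams."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(set_a, set_b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Sentences: compare word sets.
print(jaccard(word_tokens("new york city"), word_tokens("york city")))
# Single words: compare character bigram sets.
print(jaccard(char_ngrams("night"), char_ngrams("nacht")))
```

In the first call the sets share two of three distinct words; in the second, the bigram sets share only `ht` out of seven distinct bigrams.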
Synonym-based method
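A combination of the previous two methods, e.g. using WordNet.
Using the methods above, select the K entities with the highest similarity as the candidate set for each mention.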
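Selecting the top-K candidates might look like the sketch below. It assumes character-bigram Jaccard as the similarity measure, which is only one possible choice; the entity list and the misspelled mention are invented.

```python
import heapq


def bigrams(text):
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def similarity(mention, entity):
    """Character-bigram Jaccard similarity (one possible choice of measure)."""
    a, b = bigrams(mention), bigrams(entity)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def top_k_candidates(mention, entities, k=2):
    """Return the k entities most similar to the mention, best first."""
    scored = [(similarity(mention, entity), entity) for entity in entities]
    return [entity for _, entity in heapq.nlargest(k, scored)]


# Hypothetical knowledge-base entity names.
entities = ["Cambridge", "Cambridge, Massachusetts", "Camden", "Oxford"]
candidates = top_k_candidates("Cambrige", entities, k=2)
print(candidates)
```

The candidate set only narrows the search; picking the right entity among the candidates is left to the disambiguation step.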
Entity Disambiguation
Local Disambiguation
Disambiguate using only the semantic information of the given mention and the target entity, without considering the entities of the other mentions in the table.
Given a name mention m (with context c and name s), find the entity e.
|M| is the total number of name mentions and N is the total number of entities.
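As a toy illustration of the local setting, the sketch below scores each candidate entity by word overlap between the mention's context c and a short entity description, and ignores every other mention in the table. The overlap score, the descriptions, and the function name are assumptions; the notes do not prescribe a particular scoring function.

```python
def local_disambiguate(context, candidate_descriptions):
    """Pick the candidate whose description shares the most words with the context."""
    context_words = set(context.lower().split())

    def score(entity):
        description_words = set(candidate_descriptions[entity].lower().split())
        return len(context_words & description_words)

    return max(candidate_descriptions, key=score)


# Hypothetical candidates for the mention "Cambridge", with short descriptions.
candidate_descriptions = {
    "Cambridge,_England": "city in England on the River Cam",
    "Cambridge,_Massachusetts": "city in Massachusetts home of Harvard and MIT",
}

# Context c: other words found around the mention, e.g. in the same row.
chosen = local_disambiguate("Cambridge MA Harvard University", candidate_descriptions)
print(chosen)
```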
Global Disambiguation
Consider the entities of all mentions in the table jointly.
A Graph Model
Represent the mentions and entities as a graph.
Compute probabilities from the similarities.
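Column Typing: Column to Class
Because the cells in one table column share the same type, knowing the type of just one or two mentions is enough to infer the type of every entity in that column.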
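One simple way to realise this idea is a majority vote over the types of the cells that were successfully linked. The sketch below assumes a lookup from entity to its classes; the `entity_types` data and the function name `column_type` are hypothetical.

```python
from collections import Counter

# Hypothetical knowledge-base lookup: entity -> set of classes.
entity_types = {
    "Cambridge": {"City"},
    "Oxford": {"City", "University"},
    "Bristol": {"City"},
}


def column_type(linked_entities):
    """Return the class shared by most linked cells of a column, or None."""
    votes = Counter()
    for entity in linked_entities:
        # Cells that could not be linked (None or unknown) cast no vote.
        votes.update(entity_types.get(entity, ()))
    return votes.most_common(1)[0][0] if votes else None


# One column after entity linking; None marks a cell that could not be linked.
column = ["Cambridge", "Oxford", None, "Bristol"]
print(column_type(column))
```

The inferred class can then be assigned to the unlinked cells of the same column as well.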
Relation Extraction: Semantic Association between Columns to Relation
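Given the relation that holds between some of the mention pairs in two columns (e.g. locatedAt), infer that every row's pair in those two columns has that relation.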
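A minimal sketch of this inference, assuming a set of known facts to vote with. The facts, the `min_support` threshold, and the function name are illustrative assumptions; `locatedAt` is the example relation from the text.

```python
from collections import Counter

# Hypothetical known facts: (subject, object) -> relations in the knowledge base.
kb_facts = {
    ("Cambridge", "Massachusetts"): {"locatedAt"},
    ("Boston", "Massachusetts"): {"locatedAt"},
}


def column_relation(pairs, min_support=0.5):
    """Return the relation supported by at least min_support of the row pairs."""
    votes = Counter()
    for subject, obj in pairs:
        votes.update(kb_facts.get((subject, obj), ()))
    if not votes:
        return None
    relation, count = votes.most_common(1)[0]
    return relation if count / len(pairs) >= min_support else None


# Entity pairs taken row by row from two table columns.
pairs = [
    ("Cambridge", "Massachusetts"),
    ("Boston", "Massachusetts"),
    ("Austin", "Texas"),  # not in the knowledge base yet
]
relation = column_relation(pairs)

# Extend the relation to every row, yielding a new triple for the unknown pair.
new_triples = [(s, relation, o) for s, o in pairs if (s, o) not in kb_facts]
print(relation, new_triples)
```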