Knowledge Engineering Ⅳ | KG Construction from Structured data

Extracting knowledge from heterogeneous data sources to form a knowledge graph

Basics of Relational Databases

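In tables we run into different representations of the same thing, the same representation used for different things, and repeated records of the same information. We can extract some of the information in a table to form new tables.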

Database terms:

A database is a collection of data
Data is organized into one or more tables
Each row is a record
Each column is a field
Set a data type for each field: Text, Number, Date/time, Currency, Yes/No

example:

	NAME	ROLE	TOWN	AGE
record 1	Peter	farmer	Oxford	18
record 2	Mary	weaver	Winchester	33
record 3	Seth	drover	Bristol	21

Joins between tables: Primary Key

choose at least one field that only contains unique values

relate two tables by primary keys and foreign keys

example:

Authorship
ID (Primary Key)
Author (foreign key, references PersonID in the Person table below)
Publication

Person
PersonID (Primary Key)
FirstName
Date of birth
Notes
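
The Authorship/Person example can be tried out with Python's built-in sqlite3 module. This is only a minimal sketch: the column types, the sample rows and the publication titles are assumptions made up for illustration, not part of the original example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Person (
    PersonID    INTEGER PRIMARY KEY,
    FirstName   TEXT,
    DateOfBirth TEXT,
    Notes       TEXT
);
CREATE TABLE Authorship (
    ID          INTEGER PRIMARY KEY,
    Author      INTEGER REFERENCES Person(PersonID),  -- foreign key
    Publication TEXT
);
""")

# Illustrative rows (invented for this sketch).
conn.executemany(
    "INSERT INTO Person (PersonID, FirstName) VALUES (?, ?)",
    [(1, "Peter"), (2, "Mary")],
)
conn.executemany(
    "INSERT INTO Authorship (ID, Author, Publication) VALUES (?, ?, ?)",
    [(10, 1, "Paper A"), (11, 2, "Paper B"), (12, 1, "Paper C")],
)

# Join the two tables through the foreign key / primary key pair.
rows = conn.execute("""
    SELECT Person.FirstName, Authorship.Publication
    FROM Authorship
    JOIN Person ON Authorship.Author = Person.PersonID
    ORDER BY Authorship.ID
""").fetchall()

# A row that points at a non-existent person violates the foreign key.
try:
    conn.execute("INSERT INTO Authorship VALUES (13, 99, 'Paper D')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

The join follows Authorship.Author to Person.PersonID, so each publication is paired with its author's first name without storing the name twice.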

Database design workflow:

Choose fields
give each field a data type
arrange the fields into tables
set primary key fields
draw relationship between tables

RDB2RDF: Direct Mapping & R2RML

map the content of Relational Databases to RDF

Direct Mapping

default automatic mapping of relational data to RDF

Map

Base IRI for the whole graph/dataset, e.g. @base <http://foo.example/DB/>

table to class (one class per table), e.g. <People>

column to property, e.g. <People#ID>, <People#fname>

row with primary key to resource, e.g. <People/ID=7>

cell to literal value

in addition, cell to IRI if there is a foreign key constraint, e.g. <People#ref-addr>

example:

People

PK		-> Addresses(ID)
ID	fname	addr
7	Bob	18
8	Sue	NULL

Addresses

PK
ID	City	State
18	Cambridge	Ma

@base <http://foo.example/DB/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<People/ID=7> rdf:type <People> ;
              <People#ID> "7" ;
              <People#fname> "Bob" ;
              <People#addr> "18" ;
              <People#ref-addr> <Addresses/ID=18> .
<People/ID=8> rdf:type <People> ;
              <People#ID> "8" ;
              <People#fname> "Sue" .
<Addresses/ID=18> rdf:type <Addresses> ;
                  <Addresses#ID> "18" ;
                  <Addresses#City> "Cambridge" ;
                  <Addresses#State> "Ma" .
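
The mapping rules above can be sketched in a few lines of Python. This is a simplified illustration, not a conforming Direct Mapping implementation: the function name `direct_mapping` is hypothetical, it assumes single-column primary and foreign keys, it does no IRI percent-encoding, and it emits every cell as a plain string literal.

```python
def direct_mapping(table, pk, rows, foreign_keys=None):
    """Return (subject, predicate, object) strings for the rows of one table.

    foreign_keys maps a column name to (referenced_table, referenced_column).
    """
    foreign_keys = foreign_keys or {}
    triples = []
    for row in rows:
        # Row with primary key -> resource IRI
        subject = f"<{table}/{pk}={row[pk]}>"
        # Table -> class
        triples.append((subject, "rdf:type", f"<{table}>"))
        for column, value in row.items():
            if value is None:
                continue  # NULL cells produce no triple
            # Column -> property, cell -> literal
            triples.append((subject, f"<{table}#{column}>", f'"{value}"'))
            if column in foreign_keys:
                # Foreign key cell -> additional triple pointing at the referenced row
                ref_table, ref_column = foreign_keys[column]
                triples.append((
                    subject,
                    f"<{table}#ref-{column}>",
                    f"<{ref_table}/{ref_column}={value}>",
                ))
    return triples


people = [
    {"ID": 7, "fname": "Bob", "addr": 18},
    {"ID": 8, "fname": "Sue", "addr": None},
]
addresses = [{"ID": 18, "City": "Cambridge", "State": "Ma"}]

triples = (
    direct_mapping("People", "ID", people, {"addr": ("Addresses", "ID")})
    + direct_mapping("Addresses", "ID", addresses)
)
for s, p, o in triples:
    print(s, p, o, ".")
```

Sue's NULL address yields neither a literal triple nor a reference triple, which is why the second People resource has fewer properties than the first.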

R2RML

customizable language to map relational data to RDF

A triples map has three parts:

the input logical table

a subject map

zero or more predicate-object maps

example:

Student

ID	sname	major
001	Wang	211
002	Wu	201

Major

ID	mname	address
211	AI	CS_building

triples

<http://data.example.com/student/001> rdf:type ex:Student ;
                                      ex:id "001" ;
                                      ex:name "Wang" ;
                                      ex:major <http://data.example.com/major/211> .
<http://data.example.com/student/002> rdf:type ex:Student ;
                                      ex:id "002" ;
                                      ex:name "Wu" .
<http://data.example.com/major/211> rdf:type ex:Major ;
                                    ex:name "AI" .

@prefix rr: <http://www.w3.org/ns/r2rml#> .

<#TriplesMap1>
    rr:logicalTable [ rr:tableName "Student" ];
    rr:subjectMap [
        rr:template "http://data.example.com/student/{ID}";
        rr:class ex:Student;
    ];
    rr:predicateObjectMap [
        rr:predicate ex:id;
        rr:objectMap [ rr:column "ID" ];
    ];
    rr:predicateObjectMap [
        rr:predicate ex:name;
        rr:objectMap [ rr:column "sname" ];
    ];
    rr:predicateObjectMap [
        rr:predicate ex:major;
        rr:objectMap [
            rr:parentTriplesMap <#TriplesMap2>;
            # rr:child and rr:parent take column names (table headers)
            rr:joinCondition [
                rr:child "major";
                rr:parent "ID";
            ];
        ];
    ].

<#TriplesMap2>
    rr:logicalTable [ rr:tableName "Major" ];
    rr:subjectMap [
        rr:template "http://data.example.com/major/{ID}";
        rr:class ex:Major;
    ];
    rr:predicateObjectMap [
        rr:predicate ex:name;
        rr:objectMap [ rr:column "mname" ];
    ].
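
To see what an R2RML processor does with this mapping, the sketch below evaluates it by hand in Python. It is not an R2RML engine: the helper `fill_template` is hypothetical, the mapping is hard-coded instead of parsed, and the tables are plain lists of dictionaries.

```python
import re

def fill_template(template, row):
    """Replace each {column} placeholder with the value from the row."""
    return re.sub(r"\{(\w+)\}", lambda m: str(row[m.group(1)]), template)

students = [
    {"ID": "001", "sname": "Wang", "major": "211"},
    {"ID": "002", "sname": "Wu", "major": "201"},
]
majors = [{"ID": "211", "mname": "AI", "address": "CS_building"}]

STUDENT_TEMPLATE = "http://data.example.com/student/{ID}"
MAJOR_TEMPLATE = "http://data.example.com/major/{ID}"

triples = []

# Triples map 1: logical table Student
for s in students:
    subject = fill_template(STUDENT_TEMPLATE, s)   # subject map (rr:template)
    triples.append((subject, "rdf:type", "ex:Student"))  # rr:class
    triples.append((subject, "ex:id", s["ID"]))          # rr:column "ID"
    triples.append((subject, "ex:name", s["sname"]))     # rr:column "sname"
    # Referencing object map: join child column "major" with parent column "ID"
    for m in majors:
        if s["major"] == m["ID"]:
            triples.append((subject, "ex:major", fill_template(MAJOR_TEMPLATE, m)))

# Triples map 2: logical table Major
for m in majors:
    subject = fill_template(MAJOR_TEMPLATE, m)
    triples.append((subject, "rdf:type", "ex:Major"))
    triples.append((subject, "ex:name", m["mname"]))

for t in triples:
    print(t)
```

Student 002 gets no ex:major triple because no row in Major has ID 201, so the join condition matches nothing. The address column is ignored because no predicate-object map mentions it.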

Triple Extraction from Relational Web Tables

A relational table describes a set of entities in the core column(s) along with their attributes in the remaining columns.

Entity linking: Mention to Entity

Mapping the string mentions in table cells to their referent entities in a given knowledge base.

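Candidate Generation
- Dictionary-based method
  
  Collect all hyperlinked text (anchor text) from web pages and rank the linked entities by how often each one occurs.
- String similarity based method
  
  Compute the similarity between mentions and entities.
  - Levenshtein distance
    
    the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
    
    example: "wtx" -> "yty", LD=2; "wtx" -> "yt", LD=2
  - Jaccard similarity
    $J(A,B)=\frac{\mid A\cap B\mid}{\mid A\cup B\mid}$
    If A and B are sentences, split them into words; if they are words, split them into character n-grams.
- Synonym-based method
  
  A combination of the previous two methods, e.g. WordNet

Using the methods above, select the K entities with the highest similarity as the candidate set for the mention.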
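
The two string similarity measures and the top-K selection can be sketched as follows. The function names and the sample entity list are illustrative assumptions. Ranking by Levenshtein distance alone is just one simple way to pick candidates.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def char_ngrams(word, n=2):
    """Set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def top_k_candidates(mention, entity_names, k=3):
    """Return the k entity names closest to the mention by edit distance."""
    ranked = sorted(
        entity_names,
        key=lambda name: levenshtein(mention.lower(), name.lower()),
    )
    return ranked[:k]


print(levenshtein("wtx", "yty"))  # 2
print(levenshtein("wtx", "yt"))   # 2
# Sentences are split into words, words into character bigrams.
print(jaccard("the cat sat".split(), "the cat ran".split()))  # 0.5
print(jaccard(char_ngrams("night"), char_ngrams("nacht")))
print(top_k_candidates("Cambrige", ["Oxford", "Cambridge", "Bristol"], k=1))
```
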
Entity Disambiguation
- Local Disambiguation
  
  Disambiguate using only the semantic information of the given mention and the target entity, without considering the entities of the other mentions in the table.
  
  Given a name mention m (with context c and name s), find the entity e:
  
  $e=\arg\max\limits_{e}P(e)P(s|e)P(c|e)$
  
  $P(e)=\frac{count(e)+1}{N+|M|}$
  
  where |M| is the total name mention size and N is the number of all entities.
  
  $P(s|e)=\frac{count(e,s)}{\sum_s count(e,s)}$
  
  $P(c|e)=P_e(t_1)P_e(t_2)\cdots P_e(t_n)$
  
  $P_e(t)=\lambda P_{eML}(t)+(1-\lambda)P_g(t)$
  
  $P_{eML}(t)=\frac{Count_e(t)}{\sum\limits_t Count_e(t)}$
- Global Disambiguation
  
  Consider the entities of all mentions in the table jointly.
  
  A Graph Model
  
  Draw the mentions and entities as a graph.
  
  Compute probabilities based on similarity.
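
The local disambiguation score P(e)P(s|e)P(c|e) can be worked through on toy numbers. Everything concrete below is an assumption made up for illustration: the two entity identifiers, all counts, the value of lambda, and the choice to build the general model P_g by pooling the per-entity context counts.

```python
from collections import Counter

# Toy statistics, invented purely for illustration.
# NAME_COUNTS[e][s] = count(e, s): how often name s refers to entity e.
NAME_COUNTS = {
    "Jordan_(basketball_player)": Counter({"Jordan": 8, "Michael Jordan": 12}),
    "Jordan_(professor)": Counter({"Jordan": 1, "Michael Jordan": 3}),
}
# CONTEXT_COUNTS[e][t] = Count_e(t): how often term t occurs in the context of e.
CONTEXT_COUNTS = {
    "Jordan_(basketball_player)": Counter({"nba": 5, "bulls": 3, "player": 2}),
    "Jordan_(professor)": Counter({"learning": 5, "berkeley": 3, "professor": 2}),
}
# General model P_g, here assumed to be the pooled counts over all entities.
GENERAL_COUNTS = sum(CONTEXT_COUNTS.values(), Counter())
LAMBDA = 0.8  # assumed interpolation weight


def p_entity(e):
    """P(e) = (count(e) + 1) / (N + |M|)."""
    total_mentions = sum(sum(c.values()) for c in NAME_COUNTS.values())  # |M|
    n_entities = len(NAME_COUNTS)                                        # N
    return (sum(NAME_COUNTS[e].values()) + 1) / (n_entities + total_mentions)


def p_name_given_entity(s, e):
    """P(s|e) = count(e, s) / sum over s of count(e, s)."""
    return NAME_COUNTS[e][s] / sum(NAME_COUNTS[e].values())


def p_context_given_entity(context, e):
    """P(c|e) = product over context terms of the smoothed P_e(t)."""
    entity_total = sum(CONTEXT_COUNTS[e].values())
    general_total = sum(GENERAL_COUNTS.values())
    prob = 1.0
    for t in context:
        p_ml = CONTEXT_COUNTS[e][t] / entity_total   # P_eML(t)
        p_g = GENERAL_COUNTS[t] / general_total      # P_g(t)
        prob *= LAMBDA * p_ml + (1 - LAMBDA) * p_g   # P_e(t)
    return prob


def link(s, context):
    """Return the entity maximising P(e) P(s|e) P(c|e)."""
    return max(
        NAME_COUNTS,
        key=lambda e: p_entity(e)
        * p_name_given_entity(s, e)
        * p_context_given_entity(context, e),
    )


print(link("Jordan", ["nba", "bulls"]))
print(link("Jordan", ["professor", "berkeley"]))
```

In this toy setup the basketball player has the higher prior P(e) and the higher P(s|e) for the name "Jordan", yet the context term factor is strong enough to flip the decision when the surrounding terms are "professor" and "berkeley".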

Column Typing: column to Class

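Because the cells in one table column share the same type, knowing the types of just one or two mentions is enough to infer the class for all entities in that column.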
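
One simple way to turn this idea into code is a majority vote over the types of the cells that were linked successfully. The sketch below assumes this voting scheme. The function name `type_column` and the type assignments are hypothetical.

```python
from collections import Counter

def type_column(linked_cells, entity_types):
    """Pick the most frequent class among the linked entities of one column.

    linked_cells: entity per cell, or None where entity linking failed.
    entity_types: mapping from entity to the set of its classes.
    """
    votes = Counter()
    for entity in linked_cells:
        if entity is None:
            continue  # unlinked cells do not vote
        votes.update(entity_types.get(entity, ()))
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical knowledge base types for the TOWN column of the first example table.
entity_types = {
    "Oxford": {"City", "Place"},
    "Bristol": {"City"},
}
# The second cell ("Winchester") is assumed to have failed entity linking.
town_column = ["Oxford", None, "Bristol"]

column_class = type_column(town_column, entity_types)
print(column_class)
```

Even though one cell could not be linked, the two linked cells agree on a class, and that class is then assigned to the whole column.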

Relation Extraction: Semantic Association between Columns to Relation