Extracting knowledge from heterogeneous data sources to form a knowledge graph

Basics of Relational Databases

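In tables, the same thing may appear under different representations, the same representation may refer to different things, and some information may be recorded more than once. We can extract some of the information in a table to form new tables.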

Database terms:

  • A database is a collection of data
  • Data is organized into one or more tables
  • Each row is a record
  • Each column is a field
  • Set a data type for each field: Text, Number, Date/time, Currency, Yes/No

example:

            NAME   ROLE    TOWN        AGE
  record 1  Peter  farmer  Oxford      18
  record 2  Mary   weaver  Winchester  33
  record 3  Seth   drover  Bristol     21
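
The terms above can be tried out with Python's built-in sqlite3 module. This is only a minimal sketch: the table name `people` and the choice of SQLite column types (TEXT, INTEGER) are assumptions made for illustration, not part of the original example.

```python
import sqlite3

def build_people_db():
    """Create an in-memory database holding the three example records."""
    conn = sqlite3.connect(":memory:")
    # One table; each column is a field with a declared data type.
    conn.execute(
        "CREATE TABLE people (name TEXT, role TEXT, town TEXT, age INTEGER)"
    )
    # Each tuple is one record (row).
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?)",
        [
            ("Peter", "farmer", "Oxford", 18),
            ("Mary", "weaver", "Winchester", 33),
            ("Seth", "drover", "Bristol", 21),
        ],
    )
    return conn

conn = build_people_db()
rows = conn.execute(
    "SELECT name, town FROM people WHERE age > 20 ORDER BY name"
).fetchall()
print(rows)  # [('Mary', 'Winchester'), ('Seth', 'Bristol')]
```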

Joins between tables: Primary Key

  • Choose a field (or a combination of fields) that contains only unique values as the primary key
  • Relate two tables through primary keys and foreign keys

example:

Authorship
  ID (primary key)
  Author (foreign key, references PersonID in the Person table below)
  Publication

Person
  PersonID (primary key)
  FirstName
  Date of birth
  Notes
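
A possible way to express these two tables and their join, again with sqlite3. The sample rows ("Alice", "Paper A", and so on) and the column name `DateOfBirth` are hypothetical placeholders; only the table layout comes from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Person (
    PersonID    INTEGER PRIMARY KEY,
    FirstName   TEXT,
    DateOfBirth TEXT,
    Notes       TEXT
);
CREATE TABLE Authorship (
    ID          INTEGER PRIMARY KEY,
    Author      INTEGER REFERENCES Person(PersonID),
    Publication TEXT
);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO Person VALUES (1, 'Alice', '1990-01-01', NULL)")
conn.execute("INSERT INTO Person VALUES (2, 'Bob', '1985-06-15', NULL)")
conn.execute("INSERT INTO Authorship VALUES (10, 1, 'Paper A')")
conn.execute("INSERT INTO Authorship VALUES (11, 2, 'Paper B')")

# Join the two tables: foreign key Authorship.Author = primary key Person.PersonID.
joined = conn.execute(
    "SELECT a.Publication, p.FirstName "
    "FROM Authorship AS a JOIN Person AS p ON a.Author = p.PersonID "
    "ORDER BY a.ID"
).fetchall()
print(joined)  # [('Paper A', 'Alice'), ('Paper B', 'Bob')]

# A row pointing at a non-existent person violates the foreign key.
try:
    conn.execute("INSERT INTO Authorship VALUES (12, 99, 'Paper C')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The last step shows why the foreign key matters: the database itself refuses an Authorship row whose Author has no matching Person.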

Database design workflow:

  1. Choose fields
  2. Give each field a data type
  3. Arrange the fields into tables
  4. Set primary key fields
  5. Draw the relationships between tables

RDB2RDF: Direct Mapping & R2RML

map the content of Relational Databases to RDF

Direct Mapping

default automatic mapping of relational data to RDF

Map

  • Base IRI for the whole graph/dataset, e.g. @base <http://foo.example/DB/>
  • table to class (one class per table), e.g. <People>
  • column to property, e.g. <People#ID>, <People#fname>
  • row with primary key to resource, e.g. <People/ID=7>
  • cell to literal value
  • in addition, cell to URI
    • if there is a foreign key constraint, e.g. <People#ref-addr>

example:

People

  PK           -> Addresses(ID)
  ID   fname   addr
  7    Bob     18
  8    Sue     NULL

Addresses

  PK
  ID   City       State
  18   Cambridge  Ma
The direct mapping graph for these two tables:

@base <http://foo.example/DB/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<People/ID=7> rdf:type <People> ;
    <People#ID> "7" ;
    <People#fname> "Bob" ;
    <People#addr> "18" ;
    <People#ref-addr> <Addresses/ID=18> .
<People/ID=8> rdf:type <People> ;
    <People#ID> "8" ;
    <People#fname> "Sue" .
<Addresses/ID=18> rdf:type <Addresses> ;
    <Addresses#ID> "18" ;
    <Addresses#City> "Cambridge" ;
    <Addresses#State> "Ma" .
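
The mapping rules above can be sketched in a few lines of Python. This is a simplified illustration, not a conforming Direct Mapping implementation: the function name `direct_map` is hypothetical, all literals are emitted as plain strings as in the listing above, and IRI percent-encoding and composite keys are ignored.

```python
def direct_map(table, pk, rows, foreign_keys=None):
    """Return triples (subject, predicate, object) for one table.

    foreign_keys maps a column name to the (table, column) it references.
    """
    foreign_keys = foreign_keys or {}
    triples = []
    for row in rows:
        # Row with primary key -> resource IRI.
        subject = f"<{table}/{pk}={row[pk]}>"
        # Table -> class.
        triples.append((subject, "rdf:type", f"<{table}>"))
        for column, value in row.items():
            if value is None:
                continue  # NULL cells produce no triple
            # Column -> property, cell -> literal.
            triples.append((subject, f"<{table}#{column}>", f'"{value}"'))
            if column in foreign_keys:
                # Foreign key cell -> additional IRI-valued triple.
                ref_table, ref_column = foreign_keys[column]
                triples.append((
                    subject,
                    f"<{table}#ref-{column}>",
                    f"<{ref_table}/{ref_column}={value}>",
                ))
    return triples

people = [
    {"ID": 7, "fname": "Bob", "addr": 18},
    {"ID": 8, "fname": "Sue", "addr": None},
]
triples = direct_map("People", "ID", people, {"addr": ("Addresses", "ID")})
for s, p, o in triples:
    print(s, p, o, ".")
```

Running it on the People rows yields five triples for Bob and three for Sue, since Sue's NULL addr contributes neither a literal nor a reference.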

R2RML

customizable language to map relational data to RDF

A triples map has three parts:

  • the input logical table
  • a subject map
  • zero or more predicate-object maps

example:

Student

  ID    sname   major
  001   Wang    211
  002   Wu      201

Major

  ID    mname   address
  211   AI      CS_building

Target triples:
<http://data.example.com/student/001> rdf:type ex:Student ;
ex:id "001" ;
ex:name "Wang" ;
ex:major <http://data.example.com/major/211> .
<http://data.example.com/student/002> rdf:type ex:Student ;
ex:id "002" ;
ex:name "Wu" .
<http://data.example.com/major/211> rdf:type ex:Major ;
ex:name "AI" .

The R2RML mapping that produces these triples:

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/ns#> .   # assumed namespace; the notes do not define ex:

<#TriplesMap1>
    rr:logicalTable [ rr:tableName "Student" ] ;
    rr:subjectMap [
        rr:template "http://data.example.com/student/{ID}" ;
        rr:class ex:Student ;
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:id ;
        rr:objectMap [ rr:column "ID" ] ;
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [ rr:column "sname" ] ;
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:major ;
        rr:objectMap [
            rr:parentTriplesMap <#TriplesMap2> ;
            rr:joinCondition [
                rr:child "major" ;   # column header in the Student table
                rr:parent "ID" ;     # column header in the Major table
            ] ;
        ] ;
    ] .

<#TriplesMap2>
    rr:logicalTable [ rr:tableName "Major" ] ;
    rr:subjectMap [
        rr:template "http://data.example.com/major/{ID}" ;
        rr:class ex:Major ;
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [ rr:column "mname" ] ;
    ] .
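
To see what the two triples maps do, the sketch below hand-codes their effect in Python. It is not a general R2RML processor: the helper names `expand` and `student_triples` are hypothetical, and the templates and join condition are simply copied from the mapping above.

```python
import re

students = [
    {"ID": "001", "sname": "Wang", "major": "211"},
    {"ID": "002", "sname": "Wu", "major": "201"},
]
majors = [{"ID": "211", "mname": "AI", "address": "CS_building"}]

def expand(template, row):
    """Fill {column} placeholders of an rr:template-style string from a row."""
    return re.sub(r"\{(\w+)\}", lambda m: str(row[m.group(1)]), template)

def student_triples(students, majors):
    """Hand-coded equivalent of the two triples maps for this example."""
    triples = []
    for s in students:
        subj = expand("http://data.example.com/student/{ID}", s)
        triples.append((subj, "rdf:type", "ex:Student"))  # rr:class
        triples.append((subj, "ex:id", s["ID"]))          # rr:column "ID"
        triples.append((subj, "ex:name", s["sname"]))     # rr:column "sname"
        # Join condition: child column "major" equals parent column "ID".
        for m in majors:
            if s["major"] == m["ID"]:
                obj = expand("http://data.example.com/major/{ID}", m)
                triples.append((subj, "ex:major", obj))
    for m in majors:
        subj = expand("http://data.example.com/major/{ID}", m)
        triples.append((subj, "rdf:type", "ex:Major"))
        triples.append((subj, "ex:name", m["mname"]))
    return triples

result = student_triples(students, majors)
for t in result:
    print(t)
```

Student 002 gets no ex:major triple because major 201 has no matching row in the Major table, which matches the target triples listed earlier.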

Triple Extraction from Relational Web Tables

A relational table describes a set of entities in the core column(s) along with their attributes in the remaining columns.

Entity linking: Mention to Entity

Map the string mentions in table cells to their referent entities in a given knowledge base.

  1. Candidate Generation

    • Dictionary-based method

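      Collect all hyperlinked text on web pages and rank it by number of occurrences.

    • String similarity based method

      Compute the similarity between mentions and entities.

      • Levenshtein distance

        The minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

        example: "wtx" -> "yty", LD = 2; "wtx" -> "yt", LD = 2

      • Jaccard similarity

        If A and B are sentences, split them into words; if they are single words, split them into character n-grams. The similarity is then |A ∩ B| / |A ∪ B| over the resulting sets.

    • Synonym-based method

      A combination of the previous two methods, e.g. WordNet

    Using the methods above, select the K entities with the highest similarity as the candidate set for the mention.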
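
The string-similarity measures and the top-K selection can be sketched as follows. The function names, the default bigram size n=2, and the entity names in the usage example are assumptions for illustration; a real system would score against the labels of an actual knowledge base.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def char_ngrams(word, n=2):
    """Set of character n-grams of a single word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b, n=2):
    """|A ∩ B| / |A ∪ B| over words (sentences) or character n-grams (words)."""
    if " " in a or " " in b:
        set_a, set_b = set(a.split()), set(b.split())
    else:
        set_a, set_b = char_ngrams(a, n), char_ngrams(b, n)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def top_k_candidates(mention, entity_names, k=3):
    """Return the k entity names most similar to the mention (Jaccard on bigrams)."""
    scored = [(jaccard(mention.lower(), name.lower()), name) for name in entity_names]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]

print(levenshtein("wtx", "yty"))  # 2
print(levenshtein("wtx", "yt"))   # 2
print(top_k_candidates("Cambrige", ["Cambridge", "Oxford", "Bristol", "Cambria"], k=2))
```

Jaccard is used for the ranking here only because it already yields a score between 0 and 1; Levenshtein distance could be used instead after normalising by string length.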

  2. Entity Disambiguation

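    • Local Disambiguation

      Disambiguate using only the semantic information of the given mention and the target entity, without considering the entities of the other mentions in the table.

      Given a name mention m (with context c and name s), find the entity e.

      |M| is the total number of name mentions, and N is the number of all entities.

    • Global Disambiguation

      Consider the entities of all mentions in the table.

      A Graph Model

      Draw the mentions and entities as a graph.

      Compute probabilities based on similarity.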
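
One very simple way to illustrate the difference between local and global disambiguation is shown below. This is not the graph model from these notes, whose exact formulation is not given here; it is a brute-force stand-in that scores every joint assignment by mention-entity similarity plus entity-entity relatedness. All entity identifiers and scores are made-up toy values.

```python
from itertools import product

def global_disambiguate(candidates, local_score, relatedness):
    """Pick one entity per mention, maximising local plus pairwise coherence scores.

    candidates:  mention -> list of candidate entities
    local_score: (mention, entity) -> similarity score
    relatedness: frozenset({entity1, entity2}) -> coherence score
    """
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    # Enumerate every joint assignment (only feasible for tiny tables).
    for assignment in product(*(candidates[m] for m in mentions)):
        score = sum(local_score[(m, e)] for m, e in zip(mentions, assignment))
        score += sum(
            relatedness.get(frozenset((e1, e2)), 0.0)
            for i, e1 in enumerate(assignment)
            for e2 in assignment[i + 1:]
        )
        if score > best_score:
            best, best_score = dict(zip(mentions, assignment)), score
    return best

# Toy data (hypothetical identifiers and scores).
candidates = {"Cambridge": ["Cambridge_UK", "Cambridge_MA"], "MIT": ["MIT_university"]}
local_score = {
    ("Cambridge", "Cambridge_UK"): 0.6,
    ("Cambridge", "Cambridge_MA"): 0.4,
    ("MIT", "MIT_university"): 1.0,
}
relatedness = {frozenset(("Cambridge_MA", "MIT_university")): 0.5}

chosen = global_disambiguate(candidates, local_score, relatedness)
print(chosen)  # {'Cambridge': 'Cambridge_MA', 'MIT': 'MIT_university'}
```

Taken alone, the mention "Cambridge" would be linked to Cambridge_UK (0.6 > 0.4); once the other mention in the same table is taken into account, Cambridge_MA wins (1.9 > 1.6).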

Column Typing: column to Class

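Because all cells in a table column share the same type, knowing the type of just one or two mentions is enough to generalize and assign that type to every entity in the column.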
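
A minimal sketch of this idea, assuming a majority vote over the cells that have already been linked. The function name `type_column` and the class labels are hypothetical.

```python
from collections import Counter

def type_column(cells, entity_types):
    """Majority vote over the classes of the cells that were linked to an entity."""
    votes = Counter(entity_types[c] for c in cells if c in entity_types)
    return votes.most_common(1)[0][0] if votes else None

# Only two of the three mentions have a known class; the column still gets typed.
town_column = ["Oxford", "Winchester", "Bristol"]
known_types = {"Oxford": "City", "Bristol": "City"}
column_class = type_column(town_column, known_types)
print(column_class)  # City
```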

Relation Extraction: Semantic Association between Columns to Relation

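From the relation that holds between some of the mention pairs in two columns (e.g. locatedAt), infer that all pairs in those two columns have that relation.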
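
The same voting idea can be sketched for relations between two columns. The function name `infer_relation`, the placeholder entity names, and the single known fact are assumptions for illustration only.

```python
from collections import Counter

def infer_relation(column_a, column_b, known_facts):
    """Vote on the relation using the row pairs already present in the knowledge
    base, then propagate the winning relation to every row."""
    pairs = list(zip(column_a, column_b))
    votes = Counter(known_facts[p] for p in pairs if p in known_facts)
    if not votes:
        return None, []
    relation = votes.most_common(1)[0][0]
    return relation, [(a, relation, b) for a, b in pairs]

universities = ["UnivA", "UnivB", "UnivC"]
cities = ["CityX", "CityY", "CityZ"]
known_facts = {("UnivA", "CityX"): "locatedAt"}  # one pair is already known

relation, new_triples = infer_relation(universities, cities, known_facts)
print(relation)     # locatedAt
print(new_triples)
```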