CRL: A Rule Language
for Table Analysis and Interpretation*
in Unstructured Tabular Data Integration
Alexey Shigarov, shigarov@icc.ru
Matrosov Institute for System Dynamics and Control Theory of SB RAS
17th International Conference on
Data Analytics and Management in Data Intensive Domains
Obninsk, Russia
October 13-16, 2015
* This work was financially supported by the Russian Foundation for Basic Research (Grant No. 15-37-20042)
and the Council for grants of the President of the Russian Federation (Scholarship No. SP-3387.2013.5)
Unstructured vs Structured
Unstructured Tabular Data
• Arbitrary tables in ASCII text, spreadsheets, PDF documents, web pages
• For humans to understand
• No explicit semantics
• We can read, write, and edit
Structured Data
• Relational databases
• For computers to understand
• Formal data model (semantics)
• We can query (SQL) and analyse (DM, OLAP)
Easy way: from structured data to unstructured tabular data
Hard way: from unstructured tabular data back to structured data
Hard Way Back to Structured Data World
Table Detection*
Table Recognition*
Table Analysis*
Table Interpretation*
ASCII-text
Untagged PDF Documents
Image Documents
Spreadsheets
Web Pages
Word Documents
OCR
Databases
Canonical Forms
XML
ETL
* Hurst M. Layout and language: Challenges for table understanding on the web //
Proc. 1st Int. Workshop on Web Document Analysis. 2001. pp. 27-30
Our Purpose
Globally
• to automate unstructured tabular data integration
  (arbitrary tables in spreadsheets → databases)
Currently
• to automate table analysis and interpretation
  (arbitrary tables in spreadsheets → tables in canonical form)
OK, Initially We Have an Arbitrary Tagged Table
We know
• structure (rows, columns, cells)
• style settings (fonts, colors, alignments, etc.)
• textual content
All We Need Is To Recover Semantics
Relationships like
entry-label, label-label, label-category*
* Our terminology is inspired by
X. Wang's abstract table model
[Wang X. Tabular Abstraction, Editing,
and Formatting, PhD Thesis. 1996]
When We Know Semantics We Can Generate a Canonical Table
It can be loaded into a database by ETL tools
Challenges on the Hard Way Back
• Too many layouts to create a table
  • Anyone can invent a new one
• Messy data
  • No guarantee that your tabular data are clean and standardized
• Natural language
  • Table understanding requires knowledge
Our Idea
When
• A table creator (e.g. a company, a government agency, ad-hoc software)
  uses a set of rules for table generation
• Tables generated by the same set of rules have similar structure, style, and content
Then
• We can define a set of rules for table analysis and interpretation
• We can use a rule engine to execute these rules
Table Analysis and Interpretation Rules
• Rules can be expressed in
  • Drools Rule Language* (DRL)
    General-purpose language for expressing production rules in the Drools* rule engine
  • Cells Rule Language (CRL)
    Our domain-specific language for expressing table analysis and interpretation rules
• Rules can be executed with the Drools* rule engine
* http://drools.org
CRL Rules
Rules map known table data to unknown ones
rule
when
Left hand side defines conditions using available facts
(cells, categories)
then
Right hand side defines actions to recover unknown semantics
(entries, labels, categories, entry-label, label-label, label-category)
end
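The when/then structure above can be mimicked in a few lines of Python. This is only an illustrative sketch, not CRL or Drools: the `Cell` class, its fields, and the `run_rule` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Cell:
    """A hypothetical cell fact: position, text, and a role to be recovered."""
    row: int
    col: int
    text: str
    role: str = "unknown"  # the unknown semantics a rule tries to recover


def run_rule(cells: List[Cell],
             when: Callable[[Cell], bool],
             then: Callable[[Cell], None]) -> int:
    """Fire `then` on every cell fact that satisfies `when`."""
    fired = 0
    for cell in cells:
        if when(cell):
            then(cell)
            fired += 1
    return fired


def mark_as_label(cell: Cell) -> None:
    """Action: record that the cell holds a label."""
    cell.role = "label"


cells = [Cell(1, 1, ""), Cell(1, 2, "a"), Cell(2, 1, "b"), Cell(2, 2, "1")]

# when: a non-blank cell in the first row or first column
# then: treat it as a label
fired = run_rule(
    cells,
    when=lambda c: (c.row == 1 or c.col == 1) and c.text != "",
    then=mark_as_label,
)
```

A real rule engine such as Drools matches combinations of facts and re-evaluates rules as facts change; the loop here only shows the condition/action split.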
CRL: Left Hand Side
factType $variable : Java boolean expressions
cell $cell : constraints
entry $entry : constraints
label $label : constraints
category $category : constraints
CRL: Right Hand Side
[Figure: merged cells and split cells]
Cell splitting
To split a cell spanning n tiles into n cells
split $cell
Cell merging
To merge two cells into one
merge $cell1 -> $cell2
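As a rough illustration of cell splitting, the sketch below breaks a merged cell into one cell per tile. It assumes that `cl`, `rt`, `cr`, `rb` (the names used in the CRL conditions on later slides) denote the left column, top row, right column, and bottom row of a cell, and that each resulting cell copies the text of the merged cell; both points are assumptions, not statements from the CRL specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cell:
    cl: int    # left column (assumed meaning)
    rt: int    # top row (assumed meaning)
    cr: int    # right column (assumed meaning)
    rb: int    # bottom row (assumed meaning)
    text: str


def split(cell: Cell) -> List[Cell]:
    """Split a merged cell into one single-tile cell per covered position."""
    return [
        Cell(c, r, c, r, cell.text)  # each tile copies the text (assumption)
        for r in range(cell.rt, cell.rb + 1)
        for c in range(cell.cl, cell.cr + 1)
    ]


merged = Cell(cl=2, rt=1, cr=3, rb=1, text="a")
tiles = split(merged)
```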
CRL: Right Hand Side
Cell marking
set mark @mark -> $cell
where @mark is a word with @ starting character
Using marks in conditions
cell $cell : mark == @mark, constraints
Short form
cell@mark $cell : constraints
CRL: Right Hand Side
Entry creating
Using a cell value
new entry $cell
Using a specified value
new entry value -> $cell
Label creating
Using a cell value
new label $cell
Using a specified value
new label value -> $cell
CRL: Right Hand Side
Label categorizing
To associate a label with a category
set category $category -> $label
Trying to find or create a category with a specified name
set category category_name -> $label
CRL: Right Hand Side
Label associating
set parent label $label1 -> $label2
• Labels can be organized in a tree
• We can build hierarchical categories
• We can build compound label values like label1|label2|…|labelN
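A compound label value of the form label1|label2|…|labelN can be obtained by walking the parent links from a label up to the root of its tree. The sketch below shows one way to do that; the `Label` class and `compound_value` function are hypothetical and are not taken from the prototype.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Label:
    value: str
    parent: Optional["Label"] = None  # set by "set parent label" in CRL terms


def compound_value(label: Label, sep: str = "|") -> str:
    """Join the values on the path from the root label down to `label`."""
    parts = []
    node: Optional[Label] = label
    while node is not None:
        parts.append(node.value)
        node = node.parent
    return sep.join(reversed(parts))


a1 = Label("a1")
a11 = Label("a11", parent=a1)
```

Here `compound_value(a11)` yields `a1|a11`, and a root label such as `a1` is returned unchanged.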
CRL: Right Hand Side
Label grouping
group $label1 -> $label2
• A label group constitutes an anonymous category
• We can divide labels into categories without knowing categories
• We can entirely categorize a label group
CRL: Right Hand Side
Entry associating
To associate an entry with a label
add label $label -> $entry
Trying to find or create a label in the category with specified value
add label label_value from $category -> $entry
Trying to find or create a category with specified name
add label label_value from category_name -> $entry
Canonical Form Generation
Source table
         A
         a1          a2
B        a11   a12   a21   a22
b1       1     2     3     4
b2       5     6     7     8
Recovered semantics
<entries>={1,2,3,4,5,6,7,8}
<labels>={a1,a11,a12,a2,a21,a22,b1,b2}
<categories>={A,B}
<entry-label pairs>={(1,a11),(1,b1),(2,a12),(2,b1),(3,a21),(3,b1),(4,a22),(4,b1),
(5,a11),(5,b2),(6,a12),(6,b2),(7,a21),(7,b2),(8,a22),(8,b2)}
<label-label pairs>={(a11,a1),(a12,a1),(a21,a2),(a22,a2)}
<label-category pairs>={(a1,A),(a11,A),(a12,A),(a2,A),(a21,A),(a22,A),(b1,B),(b2,B)}
Canonical table
DATA   A          B
1      a1 | a11   b1
2      a1 | a12   b1
3      a2 | a21   b1
4      a2 | a22   b1
5      a1 | a11   b2
6      a1 | a12   b2
7      a2 | a21   b2
8      a2 | a22   b2
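The step from the recovered sets to the canonical table can be sketched as follows. The pairs are copied from this slide; the function names and the dictionary-based representation are assumptions made for illustration and do not reflect how the prototype stores its data.

```python
# Pairs taken from the slide
entry_label = [
    (1, "a11"), (1, "b1"), (2, "a12"), (2, "b1"), (3, "a21"), (3, "b1"),
    (4, "a22"), (4, "b1"), (5, "a11"), (5, "b2"), (6, "a12"), (6, "b2"),
    (7, "a21"), (7, "b2"), (8, "a22"), (8, "b2"),
]
parent_of = {"a11": "a1", "a12": "a1", "a21": "a2", "a22": "a2"}
category_of = {
    "a1": "A", "a11": "A", "a12": "A", "a2": "A", "a21": "A", "a22": "A",
    "b1": "B", "b2": "B",
}


def compound(label):
    """Build 'root | ... | label' by following label-label pairs upwards."""
    parts = [label]
    while parts[-1] in parent_of:
        parts.append(parent_of[parts[-1]])
    return " | ".join(reversed(parts))


def canonical_rows(pairs, categories):
    """One row per entry: (entry, label in category 1, label in category 2, ...)."""
    by_entry = {}
    for entry, label in pairs:
        by_entry.setdefault(entry, {})[category_of[label]] = compound(label)
    return [
        (entry, *(labels.get(c, "") for c in categories))
        for entry, labels in sorted(by_entry.items())
    ]


rows = canonical_rows(entry_label, ["A", "B"])
```

Each row corresponds to one record of the canonical table, so `rows[0]` is `(1, "a1 | a11", "b1")`.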
Applying CRL: Critical Cells*
[Figure: example table whose cells contain the labels a to l and the entries 1 to 7]
* Nagy G. Learning the Characteristics of Critical
Cells from Web Tables // In Proc. of the 21st Int.
Conf. on Pattern Recognition, Tsukuba, Japan,
IEEE Comp. Soc., 2012, pp. 1554-1557
when
cell $cc : cl==1, rt==1, blank
cell $ec : cl>$cc.cr, rt>$cc.rb
then
new entry $ec
-> <entries> = {1,2,3,4,5,6,7}
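The critical-cell rule can be read as: find the blank cell in the top-left corner, then treat every cell to the right of and below it as an entry. A minimal Python sketch of that reading follows, assuming `cl`, `rt`, `cr`, `rb` mean left column, top row, right column, and bottom row; the small table used here is made up and is not the table from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cell:
    cl: int         # left column (assumed meaning)
    rt: int         # top row (assumed meaning)
    cr: int         # right column (assumed meaning)
    rb: int         # bottom row (assumed meaning)
    text: str = ""


def find_entries(cells: List[Cell]) -> List[str]:
    """Texts of cells lying right of and below a blank top-left corner cell."""
    entries = []
    for cc in cells:
        if cc.cl == 1 and cc.rt == 1 and cc.text.strip() == "":
            for ec in cells:
                if ec.cl > cc.cr and ec.rt > cc.rb:
                    entries.append(ec.text)
    return entries


# Hypothetical table: a blank corner spanning rows 1-2, two header rows, two data rows
cells = [
    Cell(1, 1, 1, 2, ""), Cell(2, 1, 3, 1, "a"),
    Cell(2, 2, 2, 2, "c"), Cell(3, 2, 3, 2, "d"),
    Cell(1, 3, 1, 3, "h"), Cell(2, 3, 2, 3, "1"), Cell(3, 3, 3, 3, "2"),
    Cell(1, 4, 1, 4, "i"), Cell(2, 4, 2, 4, "3"), Cell(3, 4, 3, 4, "4"),
]
entries = find_entries(cells)
```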
Applying CRL: Label Groups
[Figure: the example table from the previous slide]
when
cell $cc : cl == 1, rt == 1, blank
cell $clc : cl > $cc.cr, rb <= $cc.rb
then
set mark @ColLabel -> $clc
new label $clc
when
cell@ColLabel $c1
cell@ColLabel $c2 : rt == $c1.rt
then
group $c1.label -> $c2.label
-> <labels>={a,b,c,d,e,f,g,...}
-> <groups>={{a,b},{c,d,e},{f,g},...}
Applying CRL: Row Label Hierarchies
when
cell $c1 : cl==1, $l1 : label
cell $c2 : cl==1, rt>$c1.rt,
indent==$c1.indent+2, $l2 : label
no cells : cl==1, rt>$c1.rt,
rt<$c2.rt, indent==$c1.indent
then
set parent label $c1.label -> $c2.label
-> <label-label pairs> = {(c1,c),(c11,c1),(c12,c1),(c2,c),(c21,c2),(d1,d),(d11,d1)}
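The indentation rule can be traced with a short Python sketch. It assumes the leftmost column holds the labels c, c1, c11, c12, c2, c21, d, d1, d11 from top to bottom with an indent step of 2; this layout is inferred from the resulting pairs and is not shown in the transcript. The function name `parent_pairs` is hypothetical.

```python
from typing import List, Tuple


def parent_pairs(rows: List[Tuple[str, int]], step: int = 2) -> List[Tuple[str, str]]:
    """rows: (label, indent) of leftmost-column cells, top to bottom.

    Returns (child, parent) pairs. A cell is the parent of a lower cell when
    the lower cell is indented by exactly `step` more and no cell between
    them has the same indent as the candidate parent.
    """
    pairs = []
    for j, (child, child_indent) in enumerate(rows):
        for i in range(j):
            parent, parent_indent = rows[i]
            if child_indent != parent_indent + step:
                continue
            # the "no cells" condition of the rule
            blocked = any(indent == parent_indent for _, indent in rows[i + 1:j])
            if not blocked:
                pairs.append((child, parent))
    return pairs


rows = [("c", 0), ("c1", 2), ("c11", 4), ("c12", 4), ("c2", 2),
        ("c21", 4), ("d", 0), ("d1", 2), ("d11", 4)]
pairs = parent_pairs(rows)
```

With this input the function produces the same seven label-label pairs as listed on the slide.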
Applying CRL: YAML*-Specified Categories
Category YAML specification
# category YEAR
name: Year
description: years from 1982 to 2015
constraints:
- "198[2-9]"
- "200[1-9]"
- "201[0-5]"
when
category $c : name == "Year"
label $l : $c.canHaveLabel(value)
then
set category $c -> $l
Category YAML specification
# category COUNTRY_CODE
name: CountryCode
description: ISO 3166 2-letter country codes
labels:
- AD
- AE
- ...
- ZW
when
category $c : name == "CountryCode"
label $l : $c.hasLabel(value)
then
set category $c -> $l
* http://yaml.org
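The two checks used in these rules, `canHaveLabel` and `hasLabel`, could behave roughly as in the sketch below. This is an assumption about their semantics (a full regular-expression match against the constraints, and membership in the enumerated labels); the `Category` class here is hypothetical, and the category data is written directly in Python rather than loaded from YAML. Only the three country codes visible on the slide are used.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Category:
    name: str
    constraints: List[str] = field(default_factory=list)  # regex patterns
    labels: List[str] = field(default_factory=list)       # enumerated values

    def can_have_label(self, value: str) -> bool:
        """True if the value fully matches at least one constraint pattern."""
        return any(re.fullmatch(p, value) for p in self.constraints)

    def has_label(self, value: str) -> bool:
        """True if the value is one of the enumerated labels."""
        return value in self.labels


year = Category("Year", constraints=["198[2-9]", "200[1-9]", "201[0-5]"])
country = Category("CountryCode", labels=["AD", "AE", "ZW"])
```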
Applying CRL: Category Names
when
cell $cc : cl == 1, rt == 1
cell $c : mark == "@ColLabel"
then
set category token($cc, 0) -> $c.label
A
B
a1 a2 a3
b1 1 2 3
b2 4 5 6
-> <categories> = {A,...}
-> <labels> = {a1,a2,a3,...}
-> <label-category pairs> = {(a1,A),(a2,A),(a3,A),...}
Applying CRL: Multi-Valued Cells
Bilingual Tables
           α 阿爾法   β 公測
γ 伽馬     1 一       2 二
δ 三角洲   3 三       4 四
when
cell $c : cl==1 || rt==1, !blank
then
new label token($c, 0) -> $c
new label token($c, 1) -> $c
Key=Value Cells
C1      C2      C3
a = 1   b = 2   c = 3
d = 4   e = 5   f = 6
g = 7   h = 8   i = 9
when
cell $c : rt>1
then
new label left($c, '=') -> $c
new entry right($c, '=') -> $c
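The helper functions used in these two rules can be approximated in Python as shown below. The names `token`, `left`, and `right` mirror the CRL functions, but their exact behaviour is assumed here: tokens are taken to be whitespace-separated, and `left`/`right` split on the first occurrence of the separator and trim spaces.

```python
from typing import Tuple


def token(text: str, index: int) -> str:
    """Return the whitespace-separated token at `index` (assumed tokenization)."""
    return text.split()[index]


def left(text: str, sep: str) -> str:
    """Part of the text before the first separator, trimmed."""
    return text.split(sep, 1)[0].strip()


def right(text: str, sep: str) -> str:
    """Part of the text after the first separator, trimmed."""
    return text.split(sep, 1)[1].strip()


def split_key_value(text: str) -> Tuple[str, str]:
    """Turn a 'key = value' cell into a (label, entry) pair."""
    return left(text, "="), right(text, "=")


bilingual = "α 阿爾法"
pair = split_key_value("a = 1")
```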
Applying CRL: Footnotes
     a           b
     c     d     c     d
e    1*    2**   3     4
f    5     6     7     8
g    9     10    11    12
* x
** y
when
cell $footer : onLastRow, $notes : text
entry $e : cell.text matches ".+\\*+",
$ref : extract(cell.text, "\\*+")
then
add label between($notes, $ref, '\n')
from "footnotes" -> $e
-> <labels>={x,y,...}
-> <categories>={"footnotes",...}
-> <entry-label pairs>={(1,x),(2,y),...}
-> <label-category pairs>={(x,"footnotes"), (y,"footnotes"),...}
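The footnote rule links an entry that ends with asterisks to the footer note introduced by the same asterisks. A possible Python rendering of that idea follows; the function name and the assumption that the footer holds one note per line are illustrative choices, not details confirmed by the slide.

```python
import re
from typing import Dict, List, Tuple


def footnote_labels(entries: List[str], notes: str) -> List[Tuple[str, str]]:
    """Pair each starred entry with the text of its footnote.

    entries: cell texts such as "1*" or "2**"
    notes:   footer text with one note per line, e.g. "* x" (assumed layout)
    """
    note_by_ref: Dict[str, str] = {}
    for line in notes.splitlines():
        m = re.fullmatch(r"(\*+)\s*(.+)", line.strip())
        if m:
            note_by_ref[m.group(1)] = m.group(2)

    pairs = []
    for text in entries:
        m = re.fullmatch(r"(.+?)(\*+)", text)  # value followed by reference marks
        if m and m.group(2) in note_by_ref:
            pairs.append((m.group(1), note_by_ref[m.group(2)]))
    return pairs


pairs = footnote_labels(["1*", "2**", "3", "4"], "* x\n** y")
```

Entries without a reference mark, such as `3` and `4`, produce no pair.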
Applying CRL: Colored Tables
when
cell $lc : style.bgColor == "#4f81bd"
cell $ec : style.bgColor == null, rt >= $lc.rt, cl > $lc.cr
no cells : style.bgColor == "#4f81bd", cl > $lc.cr, cr < $ec.cl
then
add label $lc.label -> $ec.entry
[Figure: example colored table containing labels l1 to l8 and entries e1 to e9]
Prototype of Spreadsheet Data Extraction
and Transformation System
Experimental Evaluation
Our purpose is to evaluate the recovery of entries, labels,
entry-label and label-label relationships
Dataset
• We use the TANGO dataset (http://tango.byu.edu/data), which
  • is a part of the TANGO (Table ANalysis for Generating Ontologies) project
    (http://tango.byu.edu)
  • is intended for testing table interpretation methods
  • has 200 arbitrary tables in spreadsheet format, collected from 10 statistical sites in 2009
Experimental Evaluation
[Figure: Layouts of TANGO Tables]
• Table regions: category name cells, row label cells, column label cells, entry cells
• Layout types: multi-row hierarchical, multi-row plain, one-row plain,
  one-column hierarchical, one-column plain, multi-column plain,
  multi-column & multi-row, multi-column & one-row, one-column & one-row
• Percentages shown in the figure: 47.5%, 47%, 5.5%, 100%, 94.5%, 5.5%, 65.5%, 26%, 8.5%
We develop two sets of CRL rules to define two table types
• TANGO-200: all tables
• TANGO-SUB: without tables having hierarchical layout in the leftmost column
Experimental Evaluation
Measures
• Recall
  • a table is processed successfully when all entries, labels, entry-label pairs, and label-label pairs that
    are implicitly contained in its source form are explicitly included in its canonical form
• Precision
  • a table is processed successfully when all entries, labels, entry-label pairs, and label-label pairs that
    are explicitly included in its canonical form are implicitly contained in its source form
Process
• Two experts independently compare the source tables with their automatically generated canonical forms
• They judge whether each table is processed successfully in terms of recall and precision
• When they disagree on a table, the final decision is made by a third expert
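The two measures are defined per table and amount to set inclusion in opposite directions. In the experiment these judgments are made by human experts; the sketch below only formalizes the definitions, and all function names and the sample items are hypothetical.

```python
from typing import Hashable, Iterable, List


def recall_ok(source: Iterable[Hashable], canonical: Iterable[Hashable]) -> bool:
    """Everything contained in the source form appears in the canonical form."""
    return set(source) <= set(canonical)


def precision_ok(source: Iterable[Hashable], canonical: Iterable[Hashable]) -> bool:
    """Everything in the canonical form is really contained in the source form."""
    return set(canonical) <= set(source)


def final_decision(expert1: bool, expert2: bool, expert3: bool) -> bool:
    """Two experts decide; a third one resolves a disagreement."""
    return expert1 if expert1 == expert2 else expert3


def success_rate(decisions: List[bool]) -> float:
    """Share of tables judged as processed successfully."""
    return sum(decisions) / len(decisions)


# Hypothetical items of one table: entries, labels, and entry-label pairs
source = {"1", "a", "b", ("1", "a"), ("1", "b")}
canonical = {"1", "a", ("1", "a")}  # label b and the pair (1, b) were missed
```

For this sample table recall fails, because items are missing, while precision holds, because nothing spurious was produced.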
Experimental Evaluation
Results
Rule Set / Table Type   TANGO-200   TANGO-SUB
Tables                  200         105
Cells                   22757       10893
Rules                   16          13
Recall                  87%         95%
Precision               89%         95%
For TANGO-200
• 33 tables are processed with errors
• 85% of errors occur in the leftmost column with one-column hierarchical layout
• Two main causes:
  1) ambiguity among style characteristics
  2) hierarchical relationships expressed by natural language only
Comparison with Others
Methods and Tools for Table Analysis and Interpretation
Fixed types of tables (1-5)
• Knowledge-based methods
  Douglas, 1995; Tijerino, 2005; Embley, 2005; WangJ, 2012
  • Domain ontologies
  • Taxonomies like ProBase, FreeBase
• Domain-independent methods
  Gatterbauer, 2007; Pivk, 2005, 2006, 2007; Kim, 2008; Chen&Cafarella, 2013, 2014; Embley, 2014; Nagy, 2014
  • Spatial, style, and textual data
  • Several typical table types
Programmable table types
• We are here! 2014, 2015
  • Rule language (CRL, DRL)
  • Relative cell addressing
  • Fixed target schema
  • Spatial, style, and textual data
• Hung, 2011
  • Spreadsheet-like formula mapping language (TranSheet)
  • Absolute cell addressing
  • Programmable target schema
  • Spatial and textual data
Conclusions
• Our methodology is aimed mainly at unstructured tabular data integration
• We expect it to be useful when data from a large number of tables
  belonging to a few table types are required for populating a database
• One set of rules can be suitable for processing a wide range of arbitrary tables
  with high accuracy
• The experiment demonstrates that narrowing a table type can simplify the
  rules and increase recall and precision in table canonicalization
Further Work
• Table Layouts
to develop techniques for widely used table features,
e.g. for recovering a row label hierarchy in the leftmost column
• Messy Tabular Data
to incorporate data cleansing techniques into table understanding
• Natural Language
to add knowledge, global taxonomies (e.g. FreeBase, DBpedia)
and domain ontologies
Supplementary Materials
CRL language specification
Examples of CRL rules
All details of our experiment
http://cells.icc.ru/pub/crl
Source code of our prototype
licensed under Apache License 2.0
https://github.com/shigarov/cells-ssdc
Thanks!
This presentation is available on SlideShare.net
http://www.slideshare.net/shig
Alexey Shigarov
shigarov@icc.ru
http://cells.icc.ru
