Scaling the (evolving) web data –at low cost-

Javier D. Fernández
QuWeDa 2017: Querying the Web of Data
Kosice, 29/05/2017

A good recipe for a WS Keynote
Ingredients (for approx. 20 persons)
• A motivated speaker
• Some knowledge in the area
• An engaged audience
• Slides (number at your convenience)
Method
• Present yourself
• Set the context, give an overall picture of the area
• Touch some of the topics of the event
• Focus the discussion - Sell your work
• Devise future developments in the area
• Mix everything with jokes

About me:
• Since 2015 @WU, Inst. for Information Business
• Research interests: Semantic Web, Open Data, Big (Semantic) Data Management, Databases, Data Compression, Privacy and Security
• https://www.wu.ac.at/en/infobiz/team/fernandez/
[Map of places and collaborators]
• Madrid: Óscar Corcho
• Valladolid: Pablo de la Fuente, Miguel A. Martínez-Prieto
• Santiago: Claudio Gutiérrez
• Rome: Maurizio Lenzerini
• Vienna: Axel Polleres


The Web of Data Ecosystem
• First, we had better know what we can offer…
• What is the Semantic Web/Web of Data/Linked Data?
• Who are we? What have we done so far?
• What haven't we done so far?
[Figure: Linked Data, Semantic Web, Open Data, Big Data]

(Big Semantic Data: Linked Data vs. Big Data)
• Overlaps:
  • LD as a whole is big (38B-150B triples)
  • No rigid (e.g., relational) data model
  • Big Data technologies (e.g., Hadoop) are used to handle LD
  • LD can represent knowledge extracted from big unstructured data (especially to deal with variety)
• Key differences:
  • Individual linked data sets are typically not "big" per se (e.g., the English DBpedia dump (zip) is currently < 5 GB)
  • LD is structured and has a single data model (RDF); "big data lakes" are typically neither
  • Big data is based on distributed data infrastructures within an organization (e.g., Hadoop clusters); LD creates a decentralized, globally distributed data infrastructure

Let’s study the community…
Survey practitioner needs, technological challenges, and open research questions on the use of Linked Data
• Austrian FFG ICT of the Future project (exploratory study)
• Consortium: IDC Austria, Technical University of Vienna, Vienna University of Economics and Business, Semantic Web Company
• Project ended in Dec 2016: https://www.linked-data.at/
[Figure: Standards*, Requirements, Literature research*]
* Special kudos to Sabrina Kirrane and Axel Polleres for the community analysis

Interviews
• 23 interviews:
  • Domains
    • Consulting, Engineering, Environment, Finance and Insurance, Government, Healthcare, ICT, IT, Media, Pharmaceutical, Professional Services, Real Estate, Research, Startup, Tourism, Transports & Logistics
  • Roles
    • Business Intelligence, CEO, Chief Engineer, Data and Systems Architect, Data Scientist, Director Information Management, Enterprise Architect, Founder, General Secretary, Governance, Risk & Compliance Manager, Head of Communications and Media, Head of Development, Head of HR, Head of R&D, Innovation Manager, Information Architect, IT Project Manager, Management, Managing Director, Marketing Analyst, Principal System Analyst, Project Coordinator, Researcher, Technical Specialist

Technologies in need…
• Analytics
• Computational linguistics & NLP
• Concept tagging & annotation
• Data integration
• Data management
• Dynamic data / streaming
• Extraction, data mining, text mining, entity extraction
• Logic, formal languages & reasoning
• Human-Computer Interaction & visualization
• Knowledge representation
• Machine learning
• Ontology/thesaurus/taxonomy management
• Quality & Provenance
• Recommendation
• Robustness, scalability, optimization and performance
• Searching, browsing & exploration
• Security and privacy
• System engineering
We ended up with most areas of the SW

Standards Toolbox (incl. W3C member submissions)


What can we offer?
Community Analysis
• Monitoring SW community major venues (2006-2015):
  • ISWC (since 2006), ESWC (since 2006), SEMANTiCS (since 2007), JWS (since 2006), SWJ (since 2010)
• 3 seminal papers:
[Timeline: 2001 to 2016]

Topic Categorisation
Interestingly, the same “empty” topics in standards

Semantic Web/Linked Data over time…
Subtopics:
Expressing Meaning
Knowledge Representation
Ontologies
Agents

Knowledge Representation & Reasoning

Semantic Web/Linked Data over time…
Early adopters:
MITRE
Chevron
British Telecom
Boeing
Ordnance Survey
Eli Lilly
Pfizer
Agfa
Food and Drug Administration
National Institutes of Health
Software adopters/products:
Oracle
Adobe
Altova
OpenLink
TopQuadrant
Software AG
Aduna Software
Protégé
SAPHIRE

LD Adopters - Companies
[Bar chart: conference sponsors that appear in papers, 2006-2015. Axes: occurrences (0 to 1600) per company. Companies: Google, Oracle, Yahoo, SAP, IEEE Intelligent Systems, Franz, Bing, Expert System, IBM Research, PoolParty]

To whom can we sell our technology?

Semantic Web/Linked Data over time…
The authors claim that “early research has transitioned into these larger, more applied systems, today’s Semantic Web research is changing: It builds on the earlier foundations but it has generated a more diverse set of pursuits”.

Big Semantic Data and applied systems

Other topics of the QuWeDa workshop

Motivation
• Publication, exchange and consumption of large RDF datasets
  • Most RDF formats (N3, RDF/XML, Turtle) are text serializations, designed for human readability (not for machines)
  • Verbose = high costs to write/exchange/parse
  • A basic offline search = (decompress) + index the file + search
• Lightweight binary RDF (HDT)
  • Highly compact serialization of RDF
  • Allows fast RDF retrieval in compressed space (without prior decompression)
  • Includes internal indexes to solve basic queries with a small (3%) memory footprint
  • Very fast on basic queries (triple patterns): 1.5x faster than Virtuoso and RDF-3X
  • Complex queries (joins) on the same scale as current solutions (Virtuoso, RDF-3X)
[Figure: DBpedia, 431 M triples, ~63 GB. NT + gzip: 5 GB. HDT: 6.6 GB. HDT + gzip: 2.7 GB]
rdfhdt.org
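To make the idea of a compact, directly queryable representation more concrete, here is a minimal sketch of dictionary encoding plus triple-pattern matching. It is not the HDT format: the class name `TinyEncodedGraph` and its methods are hypothetical, and the lookup is a plain linear scan, whereas HDT relies on compressed indexes.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]


class TinyEncodedGraph:
    """Toy dictionary-encoded triple store (illustrative only, not HDT)."""

    def __init__(self, triples: List[Triple]) -> None:
        # Dictionary: each distinct term is stored once and mapped to an integer ID.
        terms = sorted({term for triple in triples for term in triple})
        self.term_to_id: Dict[str, int] = {t: i for i, t in enumerate(terms)}
        self.id_to_term: List[str] = terms
        # Triples: sorted tuples of IDs instead of repeated long strings.
        self.ids = sorted(
            {tuple(self.term_to_id[t] for t in triple) for triple in triples}
        )

    def match(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples matching a pattern; None acts as a variable."""
        pattern: List[Optional[int]] = []
        for term in (s, p, o):
            if term is None:
                pattern.append(None)
            elif term not in self.term_to_id:
                return  # a term absent from the dictionary cannot match anything
            else:
                pattern.append(self.term_to_id[term])
        # Linear scan over ID triples (a real store would use an index here).
        for ids in self.ids:
            if all(want is None or want == got for want, got in zip(pattern, ids)):
                yield tuple(self.id_to_term[i] for i in ids)
```

The saving comes from storing each long IRI or literal once in the dictionary; the triples themselves become small integers that can be sorted and compressed further.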

The real motivation
[Image: http://www.kunsan.af.mil/News/Article/413995/serving-the-masses/]
“Oh man, I’m hungry and I don’t even know if I will like whatever you are cooking”

The real motivation
[Same image and speech bubble, now labelled: consume]

Applications
• Compress and share ready-to-consume RDF datasets
• Transfer large data between servers
• Embedded systems & phones
• Fast (low-cost) SPARQL query engine
  • Via LDF
  • HDT-Jena
  • HDT-ClioPatria

But what about Web-scale queries?
• E.g. retrieve all entities in LOD with the label “Tim Berners-Lee”
• Options:
  • Crawl and index LOD locally (-no-)
  • Follow-your-nose (where should I start?)
  • Federated querying (as good as the endpoints you query)
  • Use LOD Laundromat as a “good approximation” (still querying 650K datasets)
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?x WHERE {
  ?x rdfs:label "Tim Berners-Lee"
}
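The cost of answering this query over many separate dumps can be sketched as a brute-force scan. This is an illustration only: the function name `subjects_with_label`, the in-memory `datasets` mapping and the simplified line pattern are assumptions, and only plain literals without language tag or datatype are matched, which mirrors the plain literal in the query.

```python
import re
from typing import Dict, Iterable, Set

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Simplified N-Triples line: <subject> <predicate> "plain literal" .
# Escapes, blank nodes, language tags and datatypes are deliberately not handled.
LINE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+"([^"]*)"\s*\.$')


def subjects_with_label(
    datasets: Dict[str, Iterable[str]], label: str
) -> Dict[str, Set[str]]:
    """Map each matching subject IRI to the names of the datasets containing it."""
    hits: Dict[str, Set[str]] = {}
    for name, lines in datasets.items():
        # Every dataset has to be read in full: this is the cost that a single
        # pre-built index avoids.
        for line in lines:
            m = LINE.match(line.strip())
            if m and m.group(2) == RDFS_LABEL and m.group(3) == label:
                hits.setdefault(m.group(1), set()).add(name)
    return hits
```

With 650K datasets the loop touches every line of every dump for each query, which is why the options above either restrict where to look or rely on an index built once.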

LOD Laundromat
[Diagram: Linked Open Data → LOD Laundromat → Dataset 1 … Dataset 650K, each as N-Triples (zip), plus a SPARQL endpoint (metadata)]

LOD-a-lot
- flashforward -

But one could be really hungry
[Image: https://hwy55burgers.wordpress.com/tag/food-challenge/]
LOD-a-lot

LOD-a-lot
[Diagram: Linked Open Data → LOD Laundromat → Dataset 1 … Dataset 650K, each as N-Triples (zip), plus a SPARQL endpoint (metadata); the datasets feed into LOD-a-lot (28B triples)]
Kudos: Javier D. Fernández, Wouter Beek, Miguel A. Martínez-Prieto, and Mario Arias

LOD-a-lot (some numbers)
Disk size:
• HDT: 304 GB
• HDT-FoQ (additional indexes): 133 GB
Memory footprint (to query):
• 15.7 GB of RAM (3% of the size)
• 144 seconds loading time
• 8 cores (2.6 GHz), 32 GB RAM, SATA HDD on Ubuntu 14.04.5 LTS
LDF page resolution in milliseconds.
305€
(LOD-a-lot creation took 64 h and 170 GB RAM; HDT-FoQ took 8 h and 250 GB RAM)
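As a quick sanity check of these figures, the ratios can be recomputed directly. This is a back-of-the-envelope sketch that assumes decimal gigabytes and the 28B triples reported for LOD-a-lot; the variable names are illustrative.

```python
GB = 10**9  # assumption: sizes on the slide are decimal gigabytes

hdt_bytes = 304 * GB        # HDT file
foq_index_bytes = 133 * GB  # HDT-FoQ additional indexes
ram_bytes = 15.7 * GB       # memory footprint when querying
triples = 28 * 10**9        # 28B triples

total_disk_bytes = hdt_bytes + foq_index_bytes

# Memory footprint relative to the data on disk.
ram_share_of_total = ram_bytes / total_disk_bytes
ram_share_of_hdt = ram_bytes / hdt_bytes

# Average space per triple, indexes included.
bytes_per_triple = total_disk_bytes / triples

# One-off build effort: HDT creation plus HDT-FoQ index generation.
total_build_hours = 64 + 8

print(f"RAM / (HDT + indexes): {ram_share_of_total:.1%}")
print(f"RAM / HDT only:        {ram_share_of_hdt:.1%}")
print(f"Bytes per triple:      {bytes_per_triple:.1f}")
print(f"Build time (hours):    {total_build_hours}")
```

Under these assumptions the footprint is a few percent of the on-disk size, in line with the roughly 3% quoted above, and each triple costs on the order of 16 bytes including indexes.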

LOD-a-lot
https://datahub.io/dataset/lod-a-lot

LOD-a-lot (some use cases)
• Query resolution at Web scale
• Evaluation and benchmarking
  • No excuse
• RDF metrics and analytics
[Figure: subjects, predicates, objects]

1) Linked Open/Closed Data
“Deep Semantic Web”
[Figure: Linked Open Data Cloud (with DBpedia) and Linked Closed Data Cloud, with graphs G1a, G2a, G3a, G4a, G1b, G2b, G3b, G1c, G2c]

 A) Exchange: Encryption + HDT (hdtcrypt)

 B) A secure LD Endpoint
ESWC’17, THU 16:30-17:00
Self-Enforcing Access Control for Encrypted RDF
Javier D. Fernández, Sabrina Kirrane, Axel Polleres and
Simon Steyskal

2) RDF evolution at Scale
[Chart from Andreas Harth, Stream Reasoning in Mixed Reality Applications, Stream Reasoning Workshop 2015. Axes: number of sources (10^0 to 10^6) and update rate (year, month, week, day, hour, minute, second). Plotted: DBpedia, BTC, Dyldo, Internet of Things, Virtual/Augmented Reality, LOD-a-lot. Annotation: versions?]

2) RDF evolution at Scale
One of the fundamental problems in the Web of Data. In the last few years:
• Research projects: Managing the Evolution and Preservation of the Data Web (FP7), Preserving Linked Data (FP7)
• Archives
• Tools
• Benchmarking: BEnchmark of RDF ARchives
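One common way to keep versions of an RDF dataset is to store the first version in full and only the changes afterwards. The sketch below illustrates that change-based strategy; it is not taken from the talk or from any of the projects listed, and the function names `diff`, `build_archive` and `materialize` are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]
Delta = Tuple[FrozenSet[Triple], FrozenSet[Triple]]  # (added, deleted)


def diff(old: Set[Triple], new: Set[Triple]) -> Delta:
    """Return the triples added and deleted when going from old to new."""
    return frozenset(new - old), frozenset(old - new)


def build_archive(versions: List[Set[Triple]]) -> Tuple[Set[Triple], List[Delta]]:
    """Keep version 0 in full and only the changes for every later version."""
    base = set(versions[0])
    deltas = [diff(prev, cur) for prev, cur in zip(versions, versions[1:])]
    return base, deltas


def materialize(base: Set[Triple], deltas: List[Delta], version: int) -> Set[Triple]:
    """Rebuild the given version by replaying the first `version` deltas."""
    graph = set(base)
    for added, deleted in deltas[:version]:
        graph = (graph - deleted) | added
    return graph
```

The trade-off is the usual one: deltas save space when versions overlap heavily, but retrieving a late version means replaying every change before it.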

3) Ontology-based Data Management
Use case: DBpedia & SPARQL Update to maintain Wikipedia?
Use mappings to update infoboxes and track pages that need updating.

3) Ontology-based Data Management
Our approach to OBDM over curated sources
1. Ensure consistency in all cases; automatically resolve updates on a best-effort basis.
2. Learn from existing data and from principled belief revision semantics.
   • E.g.: many football players with only one foaf:name in English DBpedia have both the name and full_name infobox properties set.
3. Record, extract and apply best / typical practices.
[Figure: the infobox properties name and full_name both map to foaf:name]
A minimal-change insert translation would only update one infobox property.
ESWC’17, TUE 12:00-12:30: Updating Wikipedia via DBpedia Mappings and SPARQL. Albin Ahmeti, Javier D. Fernández, Axel Polleres and Vadim Savenkov
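The tension between a minimal-change translation and learning typical practice from existing data can be illustrated with a toy example. This is not the method of the paper: the `MAPPINGS` table only reuses the two infobox properties named above, and `translate_insert` with its `threshold` parameter is a hypothetical heuristic.

```python
from typing import Dict, List

# Infobox property -> RDF predicate (only the two properties from the example).
MAPPINGS: Dict[str, str] = {"name": "foaf:name", "full_name": "foaf:name"}


def translate_insert(
    predicate: str,
    value: str,
    infoboxes: List[Dict[str, str]],
    threshold: float = 0.5,  # hypothetical cut-off for "typical practice"
) -> Dict[str, str]:
    """Translate an inserted RDF value into infobox property updates."""
    props = [p for p, pred in MAPPINGS.items() if pred == predicate]
    if not props:
        return {}  # no mapping, nothing to update
    # Learn from existing data: how often are all mapped properties set together?
    relevant = [box for box in infoboxes if any(p in box for p in props)]
    all_set = sum(1 for box in relevant if all(p in box for p in props))
    if relevant and all_set / len(relevant) > threshold:
        return {p: value for p in props}  # follow the typical practice
    return {props[0]: value}  # minimal change: update a single property
```

With no prior data the sketch falls back to updating a single property; once most existing infoboxes set both properties, it updates both.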

Dept. of Information Systems & Operations
Institute for Information Business
Welthandelsplatz 1, 1020 Vienna, Austria
Dr. Javier D. Fernández
T +43-1-313 36-5241
F +43-1-313 36-739
jfernand@wu.ac.at
www.ai.wu.ac.at
Thanks!
• Big (Semantic) Data
• Versions
• Evolving Data
• Encryption
• Compression
rdfhdt.org
