Standardizing for Open Data

(1)
Standardizing for Open Data
Ivan Herman, W3C
Open Data Week
Marseille, France, June 26, 2013

Slides at: http://www.w3.org/2013/Talks/0626-Marseille-IH/

(2)
Data is everywhere on the Web!

• Public, private, behind enterprise firewalls
• Ranges from informal to highly curated
• Ranges from machine readable to human readable
  • HTML tables, Twitter feeds, local vocabularies, spreadsheets, …
• Expressed in diverse models
  • tree, graph, table, …
• Serialized in many ways
  • XML, CSV, RDF, PDF, HTML tables, microdata, …

(8)
W3C’s standardization focus was, traditionally, on Web-scale integration of data

• Some basic principles:
  • use of URIs everywhere (to uniquely identify things)
  • relate resources among one another (to connect things on the Web)
  • discover new relationships through inferences
• This is what the Semantic Web technologies are all about
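
The three principles above can be illustrated with a very small sketch. It is not a real RDF library: triples are plain Python tuples, the `example.org` URIs and the `describes` property are made up for illustration, and the only inference rule applied is the RDFS subclass rule (if `x` has type `C` and `C` is a subclass of `D`, then `x` has type `D`).

```python
# Hypothetical resources; example.org is reserved for documentation use.
EX = "http://example.org/"
# These two URIs are the standard RDF and RDFS terms.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Principle 1 and 2: things are named by URIs and related to one another.
triples = {
    (EX + "marseille", RDF_TYPE, EX + "City"),
    (EX + "City", RDFS_SUBCLASS, EX + "Place"),
    (EX + "Place", RDFS_SUBCLASS, EX + "Thing"),
    (EX + "dataset1", EX + "describes", EX + "marseille"),
}


def infer_types(graph):
    """Return the graph plus every type triple implied by rdfs:subClassOf."""
    inferred = set(graph)
    changed = True
    while changed:  # repeat until no new triple can be derived
        changed = False
        subclass_pairs = [(s, o) for s, p, o in inferred if p == RDFS_SUBCLASS]
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for sub, sup in subclass_pairs:
                new_triple = (s, RDF_TYPE, sup)
                if o == sub and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


# Principle 3: new relationships are discovered, not stated by hand.
closure = infer_types(triples)
new_facts = closure - triples
```

Here `new_facts` holds the two derived statements: that the hypothetical `marseille` resource is also a `Place` and a `Thing`. A real application would more likely rely on an RDF toolkit and a reasoner than on hand-written loops.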

(9)
We have a number of standards

[Diagram: URI; RDF 1.1; SPARQL 1.1; JSON-LD, Turtle, RDFa, RDF/XML]

RDF: data model, links, basic assertions; different serializations
SPARQL: querying data

A fairly stable set of technologies by now!

(10)
We have a number of standards

[Diagram: URI; RDF 1.1; RDFS 1.1; SPARQL 1.1; OWL 2; RDB2RDF; JSON-LD, Turtle, RDFa, RDF/XML]

RDF: data model, links, basic assertions; different serializations
SPARQL: querying data
RDFS: simple vocabularies
OWL: complex vocabularies, ontologies
RDB2RDF: databases to RDF

A fairly stable set of technologies by now!

(11)
We have Linked Data principles

(12)
Integration is done in different ways

• Very roughly:
  • data is accessed directly as RDF and turned into something useful
    • relies on data being “preprocessed” and published as RDF
  • data is collected from different sources, integrated internally
    • using, say, a triple store

(15)
However…

• There is a price to pay: a relatively heavy ecosystem
  • many developers shy away from using RDF and related tools
• Not all applications need this!
  • data may be used directly, no need for integration concerns
  • the emphasis may be on easy production and manipulation of data with simple tools

l  the
emphasis
may
be
on
easy
production
and

manipulation
of
data
with
simple
tools

(16)
Typical situation on the Web

• Data published in CSV, JSON, XML
• An application uses only 1-2 datasets; integration done by direct programming is straightforward
  • e.g., in a Web Application
• Data is often very large; direct manipulation is more efficient
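
As a rough illustration of "integration by direct programming", the sketch below joins one CSV dataset with one JSON dataset using only the Python standard library. The inline data, the column names (`region_code`, `count`) and the function `join_datasets` are hypothetical stand-ins for whatever two files an application would actually download.

```python
import csv
import io
import json

# Hypothetical inline data standing in for two published files.
CSV_TEXT = """region_code,count
R1,1200
R2,830
R9,15
"""
JSON_TEXT = '{"R1": {"name": "North"}, "R2": {"name": "South"}}'


def join_datasets(csv_text, json_text):
    """Attach the region name from the JSON lookup to each CSV row."""
    regions = json.loads(json_text)
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "region": regions[row["region_code"]]["name"],
            "count": int(row["count"]),
        }
        for row in rows
        if row["region_code"] in regions  # skip rows with no match
    ]


result = join_datasets(CSV_TEXT, JSON_TEXT)
```

With only two sources, a dictionary lookup like this is usually all the "integration" that is needed; the row for `R9` is simply dropped because the JSON file knows nothing about it.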

(17)
Non-RDF Data

• In some settings, that data can be converted into RDF
• But, in many cases, it is not done
  • e.g., CSV data is way too big
  • RDF tooling may not be adequate for the task at hand
  • integration is not a major issue

(19)
What that application does…

• Gets the data published by NHS
• Processes the data (e.g., through Hadoop)
• Integrates the result of the analysis with geographical data

I.e., the raw data is used without integration

(20)
The reality of data on the Web…

• It is still a fairly messy space out there :-(
  • many different formats are used
  • data is difficult to find
  • published data are messy, erroneous
  • tools are complex, unfinished…

(21)
How do developers perceive this?

‘When transportation agencies consider data integration, one pervasive notion is that the analysis of existing information needs and infrastructure, much less the organization of data into viable channels for integration, requires a monumental initial commitment of resources and staff. Resource-scarce agencies identify this perceived major upfront overhaul as "unachievable" and "disruptive."’

-- Data Integration Primer: Challenges to Data Integration, US Dept. of Transportation

(22)
One may look at the problem through different goggles

• Two alternatives come to the fore:
  1. provide tools, environments, etc., to help outsiders to publish Linked Data (in RDF) easily
     • a typical example is the Datalift project
  2. forget about RDF, Linked Data, etc., and concentrate on the raw data instead

(24)
But religions and cultures can coexist… :-)

(25)
Open Data on the Web Workshop

• Had a successful workshop in London, in April:
  • around 100 participants
  • coming from different horizons: publishers and users of Linked Data, CSV, PDF, …

(26)
We also talked to our “stakeholders”

• Member organizations and companies
• Open Data Institute, Open Knowledge Foundation, Schema.org
• …

(27)
Some takeaways

• The Semantic Web community needs stability of the technology
  • do not add yet another technology block :-)
  • existing technologies should be maintained

(28)
Some takeaways

• Look at the more general space, too
  • importance of metadata
  • deal with non-RDF data formats
  • best practices are necessary to raise the quality of published data

(29)
We need to meet app developers where they are!

(30)
Metadata is of major importance

• Metadata describes the characteristics of the dataset
  • structure, datatypes used
  • access rights, licenses
  • provenance, authorship
  • etc.
• Vocabularies are also key for Linked Data

(31)
Vocabulary Management Action

• Standard vocabularies are necessary to describe data
  • there are already some initiatives: W3C’s data cube, data catalog, PROV, schema.org, DCMI, …
• At the moment, it is a fairly chaotic world…
  • many, possibly overlapping vocabularies
  • difficult to locate the one that is needed
  • vocabularies may not be properly managed, maintained, versioned, or kept persistent…

(32)
W3C’s plan:

• Provide a space whereby
  • communities can develop
  • host vocabularies at W3C if requested
  • annotate vocabularies with a proper set of metadata terms
  • establish a vocabulary directory
• The exact structure is still being discussed: http://www.w3.org/2013/04/vocabs/

(34)
CSV on the Web

• Planned work areas:
  • metadata vocabulary to describe CSV data
    • structure, reference to access rights, annotations, etc.
  • methods to find the metadata
    • part of an HTTP header, special rows and columns, packaging formats…
  • mapping content to RDF, JSON, XML
• Possibly at a later phase:
  • API standards to access CSV data
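
To make the planned work areas more concrete, here is a minimal sketch of what a metadata description for a CSV file might enable. The metadata layout (`base`, `key`, `license`, `columns`, `datatype`, `property`) is invented for this example and is not a W3C vocabulary; the `example.org` URIs and the sample data are likewise hypothetical. The sketch uses the description to map the same CSV content to typed JSON and to N-Triples-style RDF lines.

```python
import csv
import io
import json
from urllib.parse import quote

# Hypothetical CSV content.
CSV_TEXT = "name,count\nalpha,3\nbeta,5\n"

# Hypothetical metadata: structure, a license reference, and mapping hints.
METADATA = {
    "base": "http://example.org/items/",
    "key": "name",  # column used to build each row's URI
    "license": "http://example.org/license",
    "columns": {
        "name": {"datatype": "string"},
        "count": {
            "datatype": "integer",
            "property": "http://example.org/vocab#count",
        },
    },
}

CASTS = {"string": str, "integer": int}
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def typed_rows(csv_text, metadata):
    """Yield each CSV row as a dict whose values follow the declared datatypes."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield {
            col: CASTS[spec["datatype"]](row[col])
            for col, spec in metadata["columns"].items()
        }


def to_json(csv_text, metadata):
    """Map the CSV content to a JSON array of typed objects."""
    return json.dumps(list(typed_rows(csv_text, metadata)))


def to_ntriples(csv_text, metadata):
    """Map columns that declare a 'property' to N-Triples-style lines."""
    lines = []
    for row in typed_rows(csv_text, metadata):
        subject = metadata["base"] + quote(str(row[metadata["key"]]))
        for col, spec in metadata["columns"].items():
            if "property" not in spec:
                continue
            if spec["datatype"] == "integer":
                literal = '"%d"^^<%s>' % (row[col], XSD_INTEGER)
            else:
                # Simplified escaping: backslashes and quotes only.
                text = row[col].replace("\\", "\\\\").replace('"', '\\"')
                literal = '"%s"' % text
            lines.append("<%s> <%s> %s ." % (subject, spec["property"], literal))
    return lines


json_output = to_json(CSV_TEXT, METADATA)
rdf_output = to_ntriples(CSV_TEXT, METADATA)
```

The point of the sketch is the separation of concerns: the CSV file stays untouched, and the separate description carries the structure, datatypes and mapping hints that a generic tool would need.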

(36)
Open Data Best Practices

• Document best practices for data publishers
  • management of persistence, versioning, URI design
  • use of core vocabularies (provenance, access control, ownership, annotations, …)
  • business models
• Specialized Metadata vocabularies
  • quality description (quality of the data, update frequencies, correction policies, etc.)
  • description of data access APIs
  • …

(37)
Summary

• Data on the Web has many different facets
• We have concentrated on the integration aspects in the past years
• We have to take a more general view, look at other types of data published on the Web

(38)
In future…

• We should look at other formats, not only CSV
  • MARC, GIS, ABIF, …
• Better outreach to data publishing communities and organizations
  • WF, RDA, ODI, OKFN, …
