(1)
Standardizing for Open Data
Ivan Herman, W3C
Open Data Week
Marseille, France, June 26 2013
Slides at: http://www.w3.org/2013/Talks/0626-Marseille-IH/
(2)
Data is everywhere on the Web!
• Public, private, behind enterprise firewalls
• Ranges from informal to highly curated
• Ranges from machine readable to human readable
    • HTML tables, Twitter feeds, local vocabularies, spreadsheets, …
• Expressed in diverse models
    • tree, graph, table, …
• Serialized in many ways
    • XML, CSV, RDF, PDF, HTML tables, microdata, …
  
(3)
(4)
(5)
(6)
(7)
(8)
W3C’s standardization focus was, traditionally, on Web-scale integration of data
• Some basic principles:
    • use of URIs everywhere (to uniquely identify things)
    • relate resources to one another (to connect things on the Web)
    • discover new relationships through inferences
• This is what the Semantic Web technologies are all about
	
  
(9)
We have a number of standards
[Diagram: RDF 1.1, SPARQL 1.1, URI; JSON-LD, Turtle, RDFa, RDF/XML]
RDF: data model, links, basic assertions; different serializations
SPARQL: querying data
A fairly stable set of technologies by now!
  
(10)
We have a number of standards
[Diagram: RDB2RDF, RDF 1.1, RDFS 1.1, SPARQL 1.1, OWL 2, URI; JSON-LD, Turtle, RDFa, RDF/XML]
RDF: data model, links, basic assertions; different serializations
SPARQL: querying data
RDFS: simple vocabularies
OWL: complex vocabularies, ontologies
RDB2RDF: databases to RDF
A fairly stable set of technologies by now!
  
(11)
We have Linked Data principles
  
(12)
Integration is done in different ways
• Very roughly:
    • data is accessed directly as RDF and turned into something useful
        • relies on data being “preprocessed” and published as RDF
    • data is collected from different sources, integrated internally
        • using, say, a triple store
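
As a rough illustration of the second approach, the Python sketch below collects triples from two sources and integrates them internally. A Python set stands in for a triple store, and every URI and value is made up for the example; a real deployment would typically use an actual triple store and SPARQL.

```python
# Minimal sketch of "collect from different sources, integrate internally".
# A Python set stands in for a triple store; all URIs and data are hypothetical.
EX = "http://example.org/"

source_a = {(EX + "hospital1", EX + "locatedIn", EX + "marseille")}
source_b = {(EX + "marseille", EX + "population", "850000")}

# Integration is a set union because both sources use the same URI for the city.
store = source_a | source_b


def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def population_of_hospital_city(store, hospital):
    """Follow hospital -> city -> population across the merged sources."""
    results = []
    for _, _, city in match(store, s=hospital, p=EX + "locatedIn"):
        for _, _, pop in match(store, s=city, p=EX + "population"):
            results.append(pop)
    return results
```

The query in `population_of_hospital_city` only works on the merged store, since each source holds one half of the path; the shared city URI is what makes the union meaningful.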
  
(13)
(15)
However…
• There is a price to pay: a relatively heavy ecosystem
    • many developers shy away from using RDF and related tools
• Not all applications need this!
    • data may be used directly, no need for integration concerns
    • the emphasis may be on easy production and manipulation of data with simple tools
  
(16)
Typical situation on the Web
• Data published in CSV, JSON, XML
• An application uses only 1-2 datasets, integration done by direct programming is straightforward
    • e.g., in a Web Application
• Data is often very large, direct manipulation is more efficient
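
A minimal sketch of what "integration done by direct programming" can look like with only the Python standard library. The column names, region codes, and values are invented sample data, and in practice the two texts would be downloaded files.

```python
import csv
import io
import json

# Hypothetical sample data; in practice these would be downloaded files.
CSV_TEXT = "region,prescriptions\nA,120\nB,80\nA,30\n"
JSON_TEXT = '{"A": "North", "B": "South"}'


def total_by_region(csv_text, names_json):
    """Sum the 'prescriptions' column per region and attach readable names."""
    names = json.loads(names_json)
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Fall back to the raw region code when no readable name is known.
        label = names.get(row["region"], row["region"])
        totals[label] = totals.get(label, 0) + int(row["prescriptions"])
    return totals


totals = total_by_region(CSV_TEXT, JSON_TEXT)
```

With one or two datasets, the join is a dictionary lookup written by hand; no shared data model or vocabulary is involved.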
  
(17)
Non-RDF Data
• In some settings that data can be converted into RDF
• But, in many cases, it is not done
    • e.g., CSV data is way too big
    • RDF tooling may not be adequate for the task at hand
    • integration is not a major issue
  
(18)
(19)
What that application does…
• Gets the data published by NHS
• Processes the data (e.g., through Hadoop)
• Integrates the result of the analysis with geographical data
I.e., the raw data is used without integration
  
(20)
The reality of data on the Web…
• It is still a fairly messy space out there :-(
    • many different formats are used
    • data is difficult to find
    • published data are messy, erroneous,
    • tools are complex, unfinished…
  	
  
(21)
How do developers perceive this?
‘When transportation agencies consider data integration, one pervasive notion is that the analysis of existing information needs and infrastructure, much less the organization of data into viable channels for integration, requires a monumental initial commitment of resources and staff. Resource-scarce agencies identify this perceived major upfront overhaul as "unachievable" and "disruptive."’
    -- Data Integration Primer: Challenges to Data Integration, US Dept. of Transportation
  
	
  
(22)
One may look at the problem through different goggles
• Two alternatives come to the fore:
    1. provide tools, environments, etc., to help outsiders to publish Linked Data (in RDF) easily
        • a typical example is the Datalift project
    2. forget about RDF, Linked Data, etc., and concentrate on the raw data instead
  
(24)
But religions and cultures can coexist… :-)
  
(25)
Open Data on the Web Workshop
• Had a successful workshop in London, in April:
    • around 100 participants
    • coming from different horizons: publishers and users of Linked Data, CSV, PDF, …
  
	
  
(26)
We also talked to our “stakeholders”
• Member organizations and companies
• Open Data Institute, Open Knowledge Foundation, Schema.org
• …
  
(27)
Some takeaways
• The Semantic Web community needs stability of the technology
    • do not add yet another technology block :-)
    • existing technologies should be maintained
  
(28)
Some takeaways
• Look at the more general space, too
    • importance of metadata
    • deal with non-RDF data formats
    • best practices are necessary to raise the quality of published data
  
(29)
We need to meet app developers where they are!
  
(30)
Metadata is of major importance
• Metadata describes the characteristics of the dataset
    • structure, datatypes used
    • access rights, licenses
    • provenance, authorship
    • etc.
• Vocabularies are also key for Linked Data
  
(31)
Vocabulary Management Action
• Standard vocabularies are necessary to describe data
    • there are already some initiatives: W3C’s data cube, data catalog, PROV, schema.org, DCMI, …
• At the moment, it is a fairly chaotic world…
    • many, possibly overlapping vocabularies
    • difficult to locate the one that is needed
    • vocabularies may not be properly managed, maintained, versioned, made persistent…
  
(32)
W3C’s plan:
• Provide a space whereby
    • communities can develop
    • host vocabularies at W3C if requested
    • annotate vocabularies with a proper set of metadata terms
    • establish a vocabulary directory
• The exact structure is still being discussed: http://www.w3.org/2013/04/vocabs/
  
(34)
CSV on the Web
• Planned work areas:
    • metadata vocabulary to describe CSV data
        • structure, reference to access rights, annotations, etc.
    • methods to find the metadata
        • part of an HTTP header, special rows and columns, packaging formats…
    • mapping content to RDF, JSON, XML
• Possibly at a later phase:
    • API standards to access CSV data
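
To make the idea of a metadata description for CSV more concrete, here is a hedged Python sketch. The metadata document, its key names (`license`, `columns`, `name`, `datatype`), and the sample data are all invented for this illustration and are not taken from any W3C specification; the slide only lists planned work areas.

```python
import csv
import io
import json

# Hypothetical metadata document; the key names are invented for this sketch
# and are not taken from any W3C specification.
METADATA = json.loads("""
{
  "license": "http://example.org/license",
  "columns": [
    {"name": "city", "datatype": "string"},
    {"name": "population", "datatype": "integer"}
  ]
}
""")

# Made-up sample data matching the description above.
CSV_TEXT = "city,population\nMarseille,850000\n"

# Mapping from declared datatype names to Python conversion functions.
CASTS = {"string": str, "integer": int, "number": float}


def read_with_metadata(csv_text, metadata):
    """Parse CSV rows and convert each cell using the declared datatype."""
    columns = metadata["columns"]
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        rows.append({c["name"]: CASTS[c["datatype"]](raw[c["name"]]) for c in columns})
    return rows


rows = read_with_metadata(CSV_TEXT, METADATA)
```

Once the cells are typed, mapping the content to JSON is a single `json.dumps(rows)` call; a mapping to RDF or XML would need further conventions that this sketch does not attempt.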
  
(36)
Open Data Best Practices
• Document best practices for data publishers
    • management of persistence, versioning, URI design
    • use of core vocabularies (provenance, access control, ownership, annotations, …)
    • business models
• Specialized Metadata vocabularies
    • quality description (quality of the data, update frequencies, correction policies, etc.)
    • description of data access APIs
    • …
  
(37)
Summary
• Data on the Web has many different facets
• We have concentrated on the integration aspects in the past years
• We have to take a more general view, look at other types of data published on the Web
  
	
  
	
  
(38)
In future…
• We should look at other formats, not only CSV
    • MARC, GIS, ABIF, …
• Better outreach to data publishing communities and organizations
    • WF, RDA, ODI, OKFN, …
Enjoy the event!
  
