Ordering the chaos: 
creating websites using 
imperfect data 
Andrew Stretton 
Oxford University Web SIG November 2014
Who am I, what is ChemBio Hub? 
• Andrew Stretton – Data Architect and Developer 
github.com/strets123 
@strets123 
linkedin (google me) 
• Chembio Hub 
http://chembiohub.ox.ac.uk (feel free to link to us!) 
@oxchembiohub 
github.com/thesgc
Chembio Hub exists to 
support research at the 
interface of chemistry and 
biology 
by enabling sharing of reagents, expertise 
and data across 20+ departments
Who are we trying to connect and how? 
User 1: 
Scientist at Oxford 
User 2: 
Potential collaborator 
Could be in industry 
or anywhere in academia 
Stored and curated by ChemBio Hub 
Unpublished 
results 
Negative Data 
Methods 
Equipment 
Reagents 
? Not sure yet 
Areas of 
expertise 
Questions 
and answers 
Contacts 
Publications 
Held on other sites or social networks 
Organised/linked to by ChemBio Hub
All of these parts require tagging entities in text. 
How can we do it cheaply and sustainably? 
What sorts of messy data are we working with? 
• Full text from procedures, biographies, web sites 
• Raw CSV/ Excel formats from multiple machines 
or departmental processes 
• “Standard” XML and JSON formats from various 
sources that do not map perfectly to our 
application 
• Multiple external databases to submit data to
How do most of our users like their web-based tools? 
• Simple search 
• Flexible data management 
• Comprehensive, overlapping tagging 
• Clear progress, seamless experience
What do we sometimes give them? 
• Incomplete or many-to-one tagging 
• Hyperlinks instead of the right information 
from the other site 
• Dumb search 
• Inflexible schemas 
• Lack of linking between datasets
What strategies do we have to deal with messy data? 
• Create more helpful data management apps 
• Fill in gaps in tagging by using search engines 
• Consider creating databases of flat files 
• Create map reduce / database views for schema normalisation and data analysis 
• Web crawling - not as hard or messy as it used to be
Let’s look at filling in gaps in tagging by using search engines first; happy to discuss the other areas later…
How do we fill in gaps on un-tagged 
data? 
Let’s do an experiment… 
github.com/strets123/web-sig-2014/
Elasticsearch - information extraction on-the-fly 
• Take a dataset of 18801 companies 
– ~50% are tagged 
– > 80% have some text data 
[Bar chart: share of companies (0% to 100%) that have tags, a description, an overview, or an overview or description] 
Source data: http://jsonstudio.com/resources/ 
github.com/strets123/web-sig-2014/
Use the “significant terms” feature… 
• Which description/overview words are most strongly linked to each tag? 
travel: [travel, travelers, trip, trips, hotels, flights, traveler, travellers, airline, hotel] 
education: [students, teachers, learning, education, student, educational] 
music: [music, artists, musicians, songs, labels, playlists, bands, song, artist, fans] 
realestate: [estate, real, agents, property, listings] 
search engine optimization: [seo, optimization, engine, ppc, marketing, search, click, pay] 
jobs: [job, jobs, employers, career] 
onlinemarketing: [marketing, seo, agency, optimization] 
projectmanagement: [project, projects, task, collaboration, teams, management]
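A question like this one maps onto a significant_terms aggregation. The sketch below only builds the request body as a Python dict and does not contact a server. The field names tag_list and overview and the helper name significant_terms_body are assumptions made for illustration, not names taken from the talk's repository.

```python
import json


def significant_terms_body(tag, tag_field="tag_list", text_field="overview", size=10):
    """Build a request body asking which words in `text_field` are unusually
    common among documents carrying `tag`, compared with the whole index.
    Field names are hypothetical and depend on how the data was indexed."""
    return {
        "size": 0,  # only the aggregation is wanted, not the matching documents
        "query": {"term": {tag_field: tag}},
        "aggs": {
            "linked_words": {
                "significant_terms": {"field": text_field, "size": size}
            }
        },
    }


body = significant_terms_body("music")
print(json.dumps(body, indent=2))
```

The body would be sent to the index's search endpoint. On recent Elasticsearch versions an analysed text field may need the significant_text aggregation instead.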
Now let’s test these queries 
• Which companies have no tag but are most 
likely to need tagging with “music”… 
uPlaya 
Description uPlaya provides independent or unsigned musicians with immediate 
feedback on their music…. 
Category games_video 
Tags - 
Webceleb 
Description Webceleb is music marketplace and community where musicians 
and fans engage and profit from discovering, purchasing and 
downloading the latest independent music.…. 
Category games_video 
Tags -
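One way to phrase the test above as a search is to require that the tag field is absent and that the text matches the words linked to the tag. This is a sketch under assumptions: the field names (tag_list, description, overview) and the helper name are hypothetical, and untagged records are assumed to lack the tag field entirely rather than hold an empty string.

```python
def untagged_candidates_body(terms, tag_field="tag_list",
                             text_fields=("description", "overview")):
    """Build a request body that finds documents without any tag whose text
    matches the given words; the relevance score ranks the best candidates."""
    return {
        "query": {
            "bool": {
                # Only documents where the (hypothetical) tag field is missing.
                "must_not": [{"exists": {"field": tag_field}}],
                # Score by how well the text matches the significant words.
                "must": [
                    {
                        "multi_match": {
                            "query": " ".join(terms),
                            "fields": list(text_fields),
                        }
                    }
                ],
            }
        }
    }


body = untagged_candidates_body(["music", "artists", "musicians"])
print(body)
```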
But what if we have 
NO TAGS?
A process to extract tags from text… 
• Index data 
• Assign resources (e.g. Amazon spot instance for large dataset) 
• List word counts with the least frequent first 
• Exclude lowest counts 
• Aggregate the significant terms for each word 
• Filter words that have a lot of high scoring significant terms
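The steps above can be imitated in a few lines of plain Python to show the idea without a search cluster. This is a toy sketch, not the talk's implementation. The scoring function is modelled on the JLH heuristic (absolute change multiplied by relative change), which is, as far as I know, the default for significant terms. The thresholds and the sample documents are made up.

```python
from collections import Counter


def jlh(fg_count, fg_total, bg_count, bg_total):
    """Score how over-represented a term is in the foreground set."""
    fg = fg_count / fg_total
    bg = bg_count / bg_total
    return (fg - bg) * (fg / bg) if fg > bg else 0.0


def candidate_tags(docs, min_count=2, min_score=0.5, min_terms=2):
    tokenised = [set(d.lower().split()) for d in docs]
    doc_freq = Counter(w for words in tokenised for w in words)
    # Least frequent first, excluding the lowest counts.
    words = sorted((w for w, c in doc_freq.items() if c >= min_count),
                   key=lambda w: (doc_freq[w], w))
    result = {}
    for word in words:
        # Foreground: documents containing the word; background: all documents.
        foreground = [t for t in tokenised if word in t]
        fg_freq = Counter(w for t in foreground for w in t)
        scored = [
            (jlh(fg_freq[w], len(foreground), doc_freq[w], len(tokenised)), w)
            for w in fg_freq
            if w != word and doc_freq[w] >= min_count
        ]
        strong = [w for score, w in sorted(scored, reverse=True) if score >= min_score]
        # Keep only words that attract several high scoring terms.
        if len(strong) >= min_terms:
            result[word] = strong
    return result


# Made-up descriptions, for illustration only.
docs = [
    "indie music labels artists",
    "music artists labels streaming",
    "cloud computing infrastructure hosting",
    "cloud infrastructure computing storage",
    "travel hotels flights",
    "travel flights booking",
]
tags = candidate_tags(docs)
print(tags)
```

In this toy run "travel" is dropped because only one other word ("flights") scores highly alongside it, which is the filtering behaviour the last step describes.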
What does this give us? 
athletes: [athletes, coaches, athlete, coach, sports, fans] 
avatars: [avatars, avatar, multiplayer, virtual, casual, 3d, games, chat, create, game] 
clouds: [clouds, cloud, hybrid, computing, private, deploy, public, infrastructure] 
dashboards: [dashboards, bi, reports, analytics, reporting, self, analysis, intelligence, features] 
dial: [dial, calling, calls, voip, number, call, voice, phone] 
exercise: [exercise, sleep, nutrition, fitness, weight, healthy, health] 
indie: [indie, labels, artists, music] 
logos: [logos, branding, flash, design] 
pci: [pci, dss, hipaa, compliance, sensitive, compliant] 
portland: [portland, oregon, inc, founded] 
ringtones: [ringtones, ringtone, personalization, games] 
traders: [traders, forex, trader, trading, quotes, stock, trade] 
yellow: [yellow, pages, directory, local] 
abc: [abc, cnn, nbc, television] 
argentina: [argentina, buenos, aires, chile, uruguay, colombia, brazil, mexico, latin] 
aviation: [aviation, aircraft, aerospace, defense, transportation] 
airline: [airline, fares, airlines, flights, flight, travel, tickets, hotel, air]
What else can we do with this? 
• Filter words that have a lot of high scoring significant terms 
• De-duplicate where large overlaps exist 
• Assign levels of tags in order of frequency 
• Use to categorise new data on the fly using percolate 
• Curate manually 
• Generate a sidebar menu 
• Use elasticsearch phrase suggester to create phrase tags 
github.com/strets123/web-sig-2014/
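De-duplication of overlapping tags could be approached with a simple set-overlap measure. The sketch below uses Jaccard similarity with an arbitrary threshold; both choices are assumptions rather than the method used in the talk. The "flights" entry is a hypothetical near-duplicate added for illustration, while the "airline" and "aviation" lists come from the earlier slide.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def deduplicate(tag_terms, threshold=0.5):
    """Keep one tag from each group whose term lists overlap heavily.
    Tags are visited in the order given; later near-duplicates are dropped."""
    kept = {}
    for tag, terms in tag_terms.items():
        if all(jaccard(terms, other) < threshold for other in kept.values()):
            kept[tag] = terms
    return kept


tag_terms = {
    "airline": ["airline", "fares", "airlines", "flights", "flight",
                "travel", "tickets", "hotel", "air"],
    # Hypothetical near-duplicate of "airline".
    "flights": ["flights", "flight", "airline", "airlines", "fares", "tickets"],
    "aviation": ["aviation", "aircraft", "aerospace", "defense", "transportation"],
}
kept = deduplicate(tag_terms)
print(list(kept))
```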
Advantages over direct curation / supervised learning: 
• Simplicity and pragmatism 
• Applicable to novel domains 
– e.g. Chemical Biology 
• Auto-generated tags choose more appropriate 
word combinations than manual curators 
• No need for complex data formats like RDF 
• Data from many sources can be mixed 
– e.g. categories from other universities’ sites…
Where might this technology lead? 
• How about a tag-based file system? 
• How about an implicit social network? 
• Elasticsearch is really easy to scale… 
• Which websites, filesystems and datasets do 
you need to categorise? 
– Do you really need RDF ontologies, curators etc. or 
can you just do something simple?
Summary 
• We now have many options to categorise and 
tidy up messy data 
• Managing variations on schemas takes a lot of 
resources – leave it to the data owners if you 
can! 
• When it comes to tagging… 
– Perfection is in the eye of the beholder 
– Sustainability is really important
Thanks 
• Thanks to the Research 
informatics team at the NDM 
Structural Genomics 
Consortium 
– Paul Barrett 
– Karen Porter 
– Michael O’Hagan 
– Brian Marsden 
– David Damerell 
– Sefa Garsot 
– Anthony Bradley 
• Thanks to the InfoDev team 
at IT services for answering 
my endless questions about 
webauth 
• Funders: 
– John Fell Fund 
– NDM Strategic 
– Wellcome Trust 
– Higher Education Funding 
Council 
• To everyone here for listening
Any Questions? 
• Andrew Stretton 
github.com/strets123 
@strets123 
linkedin (google me) 
• Chembio Hub 
http://chembiohub.ox.ac.uk 
@oxchembiohub 
github.com/thesgc 
Simple example categorisation 
code available here in python 
github.com/strets123/web-sig-2014/
Appendix of other messy 
data techniques
How do we make it easy to 
add spreadsheet data to a 
system?
Working with flat files 
• Sometimes a flat file is the right schema for a 
dataset 
– User defined formats 
– Different types of research 
– Only some of the fields are relevant when 
comparing experiments 
– Data is not in memory unless needed 
• Pandas and HDF allow SQL-like queries on flat 
files
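As a small illustration of SQL-like filtering over a flat file, the sketch below reads a CSV into pandas and filters it with DataFrame.query. The column names and values are made up. To stay runnable without extra dependencies it keeps the data in memory rather than in HDF.

```python
import io

import pandas as pd

# Made-up plate-reader export, for illustration only.
CSV = """well,compound,concentration_um,signal
A1,cmpd-1,10,0.91
A2,cmpd-1,1,0.45
B1,cmpd-2,10,0.12
B2,cmpd-2,1,0.08
"""

frame = pd.read_csv(io.StringIO(CSV))

# Roughly: SELECT well, signal FROM frame
#          WHERE concentration_um = 10 AND signal > 0.5
hits = frame.query("concentration_um == 10 and signal > 0.5")[["well", "signal"]]
print(hits)
```

With PyTables installed, the same frame could probably be written with frame.to_hdf(..., format="table", data_columns=True) and filtered on disk through the where argument of pd.read_hdf, so that rows are only loaded when needed.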
Helpful data management 
• Data Wrangler 
– https://player.vimeo.com/video/19185801 
• Raw 
– http://raw.densitydesign.org 
• Take these as inspiration for our tool for re-shaping 
biochemistry data
Simplifying web crawling 
• Modern web crawling patterns use class 
selectors instead of XPath 
– Less likelihood of change 
• Content can be crawled using a backend web 
browser 
– Dynamic JavaScript elements are included 
• Using a website’s data for classification is 
more acceptable than wholesale reproduction
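To illustrate selecting by class rather than by document position, here is a minimal sketch using only the standard library's html.parser. The class name "bio" and the sample HTML are hypothetical. A real crawler would more likely use a dedicated selector library, and this sketch does not handle void elements such as an unclosed br inside a match.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # nested tag inside a match
        elif self.css_class in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


# Made-up page fragment, for illustration only.
HTML = (
    '<div class="profile"><h2 class="name">Dr A. Example</h2>'
    '<p class="bio summary">Works on <b>kinase</b> inhibitors.</p></div>'
)

parser = ClassTextExtractor("bio")
parser.feed(HTML)
print(parser.results)
```

The selector keeps working if the paragraph moves elsewhere in the page, which is the "less likelihood of change" point: a position-based XPath would break when the layout is rearranged.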
Managing multiple JSON schemas with views 
PostgreSQL – also supported by Rails/ActiveRecord 
Couchbase
Why views over JSON can be useful 
• Expose only required fields from e.g. RDF 
• Input format may change but we don’t want 
crawler to break 
• Required fields may change 
• Versions are easy to support if format 
normalisation is in the database layer 
• Storage is cheap 
• View code is executed only once
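The idea of normalising formats in the database layer can be sketched with a view over stored JSON. The slides name PostgreSQL and Couchbase; this example swaps in SQLite's JSON functions so that it runs without a server, and assumes the local SQLite build includes them. The table, view and field names, and the two record versions, are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_records (id INTEGER PRIMARY KEY, body TEXT)")

# Two hypothetical versions of a crawled record: v2 renamed "title" to "name".
records = [
    {"version": 1, "title": "Kinase assay", "extra": {"ignored": True}},
    {"version": 2, "name": "Binding assay", "owner": "lab-a"},
]
conn.executemany(
    "INSERT INTO raw_records (body) VALUES (?)",
    [(json.dumps(r),) for r in records],
)

# The view exposes only the field the application needs and hides the rename,
# so the raw input can keep changing without breaking consumers.
conn.execute(
    """
    CREATE VIEW records AS
    SELECT id,
           COALESCE(json_extract(body, '$.name'),
                    json_extract(body, '$.title')) AS name
    FROM raw_records
    """
)

rows = conn.execute("SELECT id, name FROM records ORDER BY id").fetchall()
print(rows)
```

Supporting a third input version would then mean editing the view, while the stored documents and the crawler stay untouched.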


Editor's Notes

  • #7: Real world data is not: 
      – Perfectly tagged 
      – In one place 
      – In one format 
      – In one technology stack 
    Spreadsheet processes don’t just disappear when you build a tool