Ordering the chaos: Creating websites with imperfect data

Andrew Stretton
Oxford University Web SIG November 2014

Who am I, what is ChemBio Hub?
• Andrew Stretton – Data Architect and Developer
github.com/strets123
@strets123
LinkedIn (google me)
• ChemBio Hub
http://chembiohub.ox.ac.uk (feel free to link to us!)
@oxchembiohub
github.com/thesgc

ChemBio Hub exists to support research at the interface of chemistry and biology
by enabling sharing of reagents, expertise and data across 20+ departments

Who are we trying to connect and how?
User 1: Scientist at Oxford
User 2: Potential collaborator (could be in industry or anywhere in academia)
Stored and curated by ChemBio Hub
Unpublished results
Negative data
Methods
Equipment
Reagents
? Not sure yet
Areas of expertise
Questions and answers
Contacts
Publications
Held on other sites or social networks
Organised/linked to by ChemBio Hub

Who are we trying to connect and how?
All of these parts require tagging entities in text. How can we do it cheaply and sustainably?

What sorts of messy data are we working with?
• Full text from procedures, biographies, web sites
• Raw CSV/Excel formats from multiple machines or departmental processes
• “Standard” XML and JSON formats from various sources that do not map perfectly to our application
• Multiple external databases to submit data to

How do most of our users like their web-based tools?
• Simple search
• Flexible data management
• Comprehensive, overlapping tagging
• Clear progress, seamless experience

What do we sometimes give them?
• Incomplete or many-to-one tagging
• Hyperlinks instead of the right information from the other site
• Dumb search
• Inflexible schemas
• Lack of linking between datasets

What strategies do we have to deal with messy data?
• Create more helpful data management apps
• Fill in gaps in tagging by using search engines
• Consider creating databases of flat files
• Create map reduce / database views for schema normalisation and data analysis
• Web crawling - not as hard or messy as it used to be

Let’s look at filling in gaps in tagging with search engines first; happy to discuss other areas later…

How do we fill in gaps on un-tagged data?
Let’s do an experiment…
github.com/strets123/web-sig-2014/

Elasticsearch - information extraction on-the-fly
• Take a dataset of 18801 companies
– ~50% tagged
– >80% have some text data
[Bar chart, 0% to 100% of companies: Tags; Description; Overview; Overview or description]
Source data: http://jsonstudio.com/resources/ github.com/strets123/web-sig-2014/

Use the “significant terms” feature…
• Which description/overview words are most strongly linked to each tag?
travel: [travel, travelers, trip, trips, hotels, flights, traveler, travellers, airline, hotel]
education: [students, teachers, learning, education, student, educational]
music: [music, artists, musicians, songs, labels, playlists, bands, song, artist, fans]
realestate: [estate, real, agents, property, listings]
search engine optimization: [seo, optimization, engine, ppc, marketing, search, click, pay]
jobs: [job, jobs, employers, career]
onlinemarketing: [marketing, seo, agency, optimization]
projectmanagement: [project, projects, task, collaboration, teams, management]
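
The question on this slide maps onto an Elasticsearch significant_terms aggregation. Below is a minimal sketch of such a request body. The field names tag_list and description are hypothetical (the slides do not show the index mapping), and the code only builds the JSON, it does not contact a server.

```python
import json


def significant_terms_request(tag, tag_field="tag_list",
                              text_field="description", size=10):
    """Build a search body asking which text terms stand out for one tag.

    tag_field and text_field are assumed names, not taken from the talk.
    """
    return {
        "size": 0,  # only the aggregation is wanted, not the matching documents
        "query": {"term": {tag_field: tag}},
        "aggs": {
            "keywords": {
                "significant_terms": {"field": text_field, "size": size}
            }
        },
    }


body = significant_terms_request("music")
print(json.dumps(body, indent=2))
```

On recent Elasticsearch versions an analysed text field may need the significant_text aggregation instead, so treat the aggregation name as something to check against the version in use.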

Now let’s test these queries
• Which companies have no tag but are most
likely to need tagging with “music”…
uPlaya
Description: uPlaya provides independent or unsigned musicians with immediate feedback on their music…
Category: games_video
Tags: -
Webceleb
Description: Webceleb is music marketplace and community where musicians and fans engage and profit from discovering, purchasing and downloading the latest independent music…
Category: games_video
Tags: -
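
A query of this kind could look roughly like the sketch below: match the significant terms for a tag against the text fields, while excluding documents that already carry any tag. The field names tag_list, description and overview are assumptions, and the body is only built here, not sent to a cluster.

```python
def untagged_candidates_request(terms, tag_field="tag_list",
                                text_fields=("description", "overview"),
                                size=10):
    """Build a search body for untagged documents that mention the given terms.

    All field names are hypothetical placeholders for the real mapping.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Rank documents by how well their text matches the terms.
                "must": [{
                    "multi_match": {
                        "query": " ".join(terms),
                        "fields": list(text_fields),
                    }
                }],
                # Keep only documents with no value in the tag field.
                "must_not": [{"exists": {"field": tag_field}}],
            }
        },
    }


music_terms = ["music", "artists", "musicians", "songs", "labels"]
request = untagged_candidates_request(music_terms)
```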

A process to extract tags from text…
• Index data
• Assign resources (e.g. Amazon spot instance for large dataset)
• List word counts with the least frequent first
• Exclude lowest counts
• Aggregate the significant terms for each word
• Filter words that have a lot of high scoring significant terms
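
The counting and filtering steps can be sketched without a search engine, as a rough in-memory illustration. The score used here follows the shape of Elasticsearch's default JLH heuristic (absolute change times relative change in frequency) and is a stand-in rather than the talk's actual code; the thresholds min_count, min_score and min_terms are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lower-case and keep purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


def jlh(fg_count, fg_total, bg_count, bg_total):
    """JLH-style score: (fg% - bg%) * (fg% / bg%), or 0 if not over-represented."""
    fg = fg_count / fg_total
    bg = bg_count / bg_total
    if fg <= bg:
        return 0.0
    return (fg - bg) * (fg / bg)


def candidate_tags(docs, min_count=2, min_score=0.5, min_terms=2):
    doc_words = [set(tokenize(d)) for d in docs]
    doc_freq = Counter(w for words in doc_words for w in words)
    # Word counts with the least frequent first, excluding the lowest counts.
    ordered = sorted((w for w, c in doc_freq.items() if c >= min_count),
                     key=lambda w: (doc_freq[w], w))
    result = {}
    for word in ordered:
        # Foreground set: documents containing the word.
        fg_docs = [words for words in doc_words if word in words]
        fg_freq = Counter(w for words in fg_docs for w in words)
        scored = {other: jlh(fg_freq[other], len(fg_docs),
                             doc_freq[other], len(doc_words))
                  for other in fg_freq if other != word}
        strong = sorted((o for o, s in scored.items() if s >= min_score),
                        key=lambda o: (-scored[o], o))
        # Keep words that have several high scoring significant terms.
        if len(strong) >= min_terms:
            result[word] = strong
    return result


docs = [
    "indie music labels and artists",
    "music for indie artists and labels",
    "cloud computing infrastructure",
    "hybrid cloud computing infrastructure",
    "local directory pages",
]
tags = candidate_tags(docs)
```

On a toy corpus like this, stop words such as "and" also score highly, which is one reason a large dataset and a manual curation pass still matter.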

What does this give us?
athletes: [athletes, coaches, athlete, coach, sports, fans]
avatars: [avatars, avatar, multiplayer, virtual, casual, 3d, games, chat, create, game]
clouds: [clouds, cloud, hybrid, computing, private, deploy, public, infrastructure]
dashboards: [dashboards, bi, reports, analytics, reporting, self, analysis, intelligence, features]
dial: [dial, calling, calls, voip, number, call, voice, phone]
exercise: [exercise, sleep, nutrition, fitness, weight, healthy, health]
indie: [indie, labels, artists, music]
logos: [logos, branding, flash, design]
pci: [pci, dss, hipaa, compliance, sensitive, compliant]
portland: [portland, oregon, inc, founded]
ringtones: [ringtones, ringtone, personalization, games]
traders: [traders, forex, trader, trading, quotes, stock, trade]
yellow: [yellow, pages, directory, local]
abc: [abc, cnn, nbc, television]
argentina: [argentina, buenos, aires, chile, uruguay, colombia, brazil, mexico, latin]
aviation: [aviation, aircraft, aerospace, defense, transportation]
airline: [airline, fares, airlines, flights, flight, travel, tickets, hotel, air]

What else can we do with this?
• Filter words that have a lot of high scoring significant terms
• De-duplicate where large overlaps exist
• Assign levels of tags in order of frequency
• Use to categorise new data on the fly using percolate
• Curate manually
• Generate a sidebar menu
• Use Elasticsearch phrase suggester to create phrase tags
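
The de-duplication step could be approached by merging candidate tags whose term lists overlap heavily. This is a minimal sketch using Jaccard overlap; the 0.5 threshold and the "coaches" entry are hypothetical, while the "athletes" and "indie" lists come from the earlier slide.

```python
def jaccard(a, b):
    """Share of terms common to both lists, between 0 and 1."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_overlapping(tag_terms, threshold=0.5):
    """Fold each tag into the first earlier tag it overlaps with strongly."""
    merged = {}
    for tag, terms in tag_terms.items():
        for kept in merged:
            if jaccard(terms, merged[kept]) >= threshold:
                merged[kept] = sorted(set(merged[kept]) | set(terms))
                break
        else:
            # No strong overlap with any kept tag: keep it as its own tag.
            merged[tag] = sorted(set(terms))
    return merged


tag_terms = {
    "athletes": ["athletes", "coaches", "athlete", "coach", "sports", "fans"],
    # Hypothetical near-duplicate, added for illustration only.
    "coaches": ["coaches", "athletes", "coach", "athlete", "sports", "training"],
    "indie": ["indie", "labels", "artists", "music"],
}
merged = merge_overlapping(tag_terms)
```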

Advantages over direct curation / supervised learning:
• Simplicity and pragmatism
• Applicable to novel domains
– e.g. Chemical Biology
• Auto-generated tags choose more appropriate word combinations than manual curators
• No need for complex data formats like RDF
• Data from many sources can be mixed
– e.g. categories from other universities’ sites…

Where might this technology lead?
• How about a tag-based file system?
• How about an implicit social network?
• Elasticsearch is really easy to scale…
• Which websites, filesystems and datasets do you need to categorise?
– Do you really need RDF ontologies, curators etc. or can you just do something simple?

Summary
• We now have many options to categorise and tidy up messy data
• Managing variations on schemas takes a lot of resources – leave it to the data owners if you can!
• When it comes to tagging…
– Perfection is in the eye of the beholder
– Sustainability is really important

Thanks
• Thanks to the Research informatics team at the NDM Structural Genomics Consortium
– Paul Barrett
– Karen Porter
– Michael O’Hagan
– Brian Marsden
– David Damerell
– Sefa Garsot
– Anthony Bradley
• Thanks to the InfoDev team at IT services for answering my endless questions about webauth
• Funders:
– John Fell Fund
– NDM Strategic
– Wellcome Trust
– Higher Education Funding Council
• To everyone here for listening

Any Questions?
• Andrew Stretton
github.com/strets123
@strets123
LinkedIn (google me)
• ChemBio Hub
http://chembiohub.ox.ac.uk
@oxchembiohub
github.com/thesgc
Simple example categorisation code available here in Python

Appendix of other messy data techniques

How do we make it easy to add spreadsheet data to a system?

Working with flat files
• Sometimes a flat file is the right schema for a dataset
– User defined formats
– Different types of research
– Only some of the fields are relevant when comparing experiments
– Data is not in memory unless needed
• Pandas and HDF allow SQL-like queries on flat files
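
As a rough illustration of SQL-like querying over a flat file, the sketch below loads a small CSV with pandas and filters it with DataFrame.query. The column names and values are made up for the example and the pandas package is assumed to be installed; an HDF5 store is left out to keep the sketch self-contained.

```python
import io

import pandas as pd

# Hypothetical assay results, standing in for a user-defined flat file.
CSV_TEXT = """compound,assay,ic50_nm,operator
A1,kinase,12.5,ab
A2,kinase,340.0,cd
A3,binding,8.1,ab
"""

frame = pd.read_csv(io.StringIO(CSV_TEXT))

# Roughly: SELECT compound, ic50_nm WHERE assay = 'kinase' AND ic50_nm < 100
potent = frame.query("assay == 'kinase' and ic50_nm < 100")[["compound", "ic50_nm"]]
print(potent)
```

For data held in HDF5, pandas offers HDFStore.select with a where expression on table-format stores, which needs the optional PyTables package and reads only the matching rows.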

Helpful data management
• Data Wrangler
– https://player.vimeo.com/video/19185801
• Raw
– http://raw.densitydesign.org
• Take these as inspiration for our tool for re-shaping biochemistry data

Simplifying web crawling
• Modern web crawling patterns use class selectors instead of XPath
– Less likelihood of change
• Content can be crawled using a backend web browser
– Dynamic JavaScript elements are included
• Using a website’s data for classification is more acceptable than wholesale reproduction
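
To show what selecting by class rather than by position looks like, here is a minimal standard-library sketch that collects the text of elements carrying a given class. The HTML snippet and the class name "bio" are hypothetical, and a real crawler would more likely use a dedicated library with CSS selector support.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0       # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.css_class in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


HTML = ('<div class="profile other"><h2 class="name">Dr A. Example</h2>'
        '<p class="bio">Works on <b>kinases</b>.</p></div>')

parser = ClassTextExtractor("bio")
parser.feed(HTML)
parser.close()
print(parser.results)
```

The selector depends only on the class name, so wrapping the paragraph in an extra container or reordering the page would not break it, whereas a positional XPath would.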

Managing multiple JSON schemas with views
PostgreSQL – also supported by Rails/ActiveRecord
Couchbase

Why views over JSON can be useful
• Expose only required fields from e.g. RDF
• Input format may change but we don’t want crawler to break
• Required fields may change
• Versions are easy to support if format normalisation is in the database layer
• Storage is cheap
• View code is executed only once
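
The idea of a view over JSON can be sketched in plain Python as a function that maps each stored document, whatever its input format, to the few fields the site needs. This is a stand-in for the mapping logic only, not PostgreSQL or Couchbase syntax, and both document formats and all field names are hypothetical.

```python
def person_view(doc):
    """Map one stored JSON document to the fields the website needs."""
    if "fullName" in doc:
        # Hypothetical newer input format.
        name = doc["fullName"]
    else:
        # Hypothetical older input format with separate name parts.
        parts = (doc.get("first"), doc.get("last"))
        name = " ".join(p for p in parts if p)
    department = doc.get("department") or doc.get("dept")
    return {"name": name, "department": department}


stored = [
    {"first": "Ada", "last": "Example", "dept": "Chemistry", "extra": 1},
    {"fullName": "Bo Sample", "department": "Biochemistry", "orcid": None},
]
rows = [person_view(doc) for doc in stored]
```

Because the raw documents are stored unchanged, supporting a new input format or a new required field means changing the view function rather than the crawler or the stored data.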
