Beautifying Data in the Real World

Instructor: Professor Lothar Piepmeyer

Group 5:
Toan Do - An Du
Vinh Nguyen - Tan Tran

How big is the data on the Internet?

2004: Internet data exceeds 1 EB for the first time
2005: Eric Schmidt estimates the Internet at 5 million
terabytes (~5 EB)
Cisco forecasts that in 2015 the size of the
Internet will reach nearly 1,000 EB
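
The 2005 estimate can be checked with a quick unit conversion. This is a minimal sketch that assumes decimal (SI) prefixes, 1 TB = 10^12 bytes and 1 EB = 10^18 bytes; the function name is illustrative only.

```python
TB = 10**12  # bytes per terabyte (assumption: decimal SI prefix)
EB = 10**18  # bytes per exabyte (assumption: decimal SI prefix)


def terabytes_to_exabytes(terabytes: float) -> float:
    """Convert a size given in terabytes to exabytes."""
    return terabytes * TB / EB


# 5 million terabytes, expressed in exabytes
print(terabytes_to_exabytes(5_000_000))
```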

How big is it?
Source: http://www.wisegeek.com/how-big-is-the-internet.htm
http://techland.time.com/

If 1 byte = 0.5mm

Source: http://blog.fliptop.com/how-much-data-is-on-the-internet/

Content

Introduction
Open Notebook Science approach
Curating and presenting the data
Beautifying the data
Data Visualization & Building a portal from
open data and free services
Demonstration

Data on the internet

Source: http://news.bbc.co.uk/2/hi/technology/8562801.stm

Problems of data in the real world
(Scientific)

Noisy source of data
The barrier of data presentation
OCR version
Text version
Human-readable
Machine readable
…
How to verify the data?

Open Notebook Science

Purpose: record the full raw data of scientific research
and make it available online
Benefits:
obtain detailed descriptions of procedures
improve the communication of science
speed up progress
reduce time lost due to the repetition of failed
experiments
…

Crowdsourcing

A distributed problem-solving and
production model

Source: http://r18ultrachair.com/

Validating crowdsourced data

In keeping with ONS, all detailed data are
recorded
Doubtful data are also kept and marked
for

Unique Identifiers for Chemical
Entities

Standardize data

Facilitate the integration with other data sets

Consider 3 possibilities
• CAS Registry Number
• InChI
• SMILES

CAS Registry Number

• Proprietary

• Cannot be converted to a chemical structure

• Depends on an external organization to issue numbers

For example, the CAS number of water is 7732-18-5: the
checksum 5 is calculated as (8×1 + 1×2 + 2×3 + 3×4 + 7×5 +
7×6) = 105; 105 mod 10 = 5
http://en.wikipedia.org/wiki/CAS_registry_number
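
The checksum rule in the water example can be sketched in a few lines of Python. The function names are illustrative, and the format check is a simplifying assumption (three hyphen-separated digit groups, the last being a single digit).

```python
def cas_check_digit(body: str) -> int:
    """Check digit for the first two CAS groups, e.g. '7732-18' or '773218'."""
    digits = [int(ch) for ch in body if ch.isdigit()]
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * digit
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10


def is_valid_cas(cas: str) -> bool:
    """Return True if the last group matches the computed check digit."""
    parts = cas.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    if len(parts[2]) != 1:
        return False
    return cas_check_digit(parts[0] + parts[1]) == int(parts[2])


print(cas_check_digit("7732-18"))  # water, as in the example above
print(is_valid_cas("7732-18-5"))
```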

InChI
• IUPAC International Chemical Identifier
• Freely usable and non-proprietary
• Does not have to be assigned by an organization
• Can be computed from structural information
• Human-readable (with practice)

http://en.wikipedia.org/wiki/Inchi

SMILES

• Simplified molecular-input
line-entry system

• More human-readable than
InChI

• Can be converted to InChI

http://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system

Analysis Options

Access to live data
Get Summary
Complex Statistical representations of
models
Mark doubtful data for later
consideration

Google Docs API

Allows developers to create, retrieve, update, and
delete Google Docs files and collections
Also provides advanced features such as resource
archives, Optical Character Recognition,
translation, and revision history
Useful for storing data in the cloud, managing
resources, and converting document formats

https://developers.google.com/google-apps/documents-list/

Google Visualization API

Chart Library: JavaScript classes
Data Table: JavaScript DataTable class
Data Source: Chart Tools Datasource protocol

https://developers.google.com/chart/interactive/docs/index
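
As a rough illustration of the Data Table component, the sketch below builds the JSON literal layout that the JavaScript DataTable class accepts (a `cols` list plus a `rows` list of cells). The helper name and the sample rows are made up for illustration and do not come from the slides.

```python
import json


def to_datatable(columns, rows):
    """Build a dict in the DataTable JSON literal layout (cols + rows of cells)."""
    def column_type(value):
        # bool is tested first because bool is a subclass of int in Python.
        if isinstance(value, bool):
            return "boolean"
        if isinstance(value, (int, float)):
            return "number"
        return "string"

    # Column types are inferred from the first row (an assumption of this sketch).
    first = rows[0] if rows else [""] * len(columns)
    cols = [{"id": name, "label": name, "type": column_type(value)}
            for name, value in zip(columns, first)]
    return {"cols": cols,
            "rows": [{"c": [{"v": value} for value in row]} for row in rows]}


# Hypothetical sample data, for illustration only.
table = to_datatable(["Identifier", "Count"], [["CAS", 3], ["InChI", 5]])
print(json.dumps(table, indent=2))
```

The resulting JSON could then be passed to a chart on the JavaScript side; that step is outside this sketch.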

https://google-developers.appspot.com/chart/interactive/docs/gallery

RESTful Web Service

• Representational State Transfer: a simpler alternative to Web services
based on SOAP and the Web Services Description Language (WSDL)
• Principles:
  • Use HTTP methods explicitly.
  • Be stateless.
  • Expose directory structure-like URIs.
  • Transfer XML, JavaScript Object Notation (JSON), or both.

http://www.ibm.com/developerworks/webservices/library/ws-restful/
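
The four principles can be sketched without any web framework. In this minimal, hypothetical example the `/compounds/<name>` URI scheme, the `handle` function, and the in-memory `store` are all assumptions made for illustration. Each call carries everything needed to process it, so no per-client session state is kept between requests.

```python
import json


def handle(method, uri, store, body=None):
    """Process one request and return (status_code, response_body)."""
    parts = [p for p in uri.split("/") if p]
    # Directory structure-like URIs: /compounds/<name> addresses one resource.
    if len(parts) != 2 or parts[0] != "compounds":
        return 404, json.dumps({"error": "not found"})
    name = parts[1]

    # HTTP methods are used explicitly: GET reads, PUT writes, DELETE removes.
    if method == "GET":
        if name not in store:
            return 404, json.dumps({"error": "not found"})
        return 200, json.dumps(store[name])
    if method == "PUT":
        created = name not in store
        store[name] = json.loads(body)  # representation transferred as JSON
        return (201 if created else 200), json.dumps(store[name])
    if method == "DELETE":
        if name not in store:
            return 404, json.dumps({"error": "not found"})
        del store[name]
        return 204, ""
    return 405, json.dumps({"error": "method not allowed"})


store = {}
print(handle("PUT", "/compounds/water", store, '{"cas": "7732-18-5"}'))
print(handle("GET", "/compounds/water", store))
```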

Compare REST and SOAP

Who's using REST?
All of Yahoo's web services use REST, including Flickr.
The del.icio.us API uses it, as do pubsub, bloglines, and
technorati. Both eBay and Amazon have web services for
both REST and SOAP.
Who's using SOAP?
Google seems to be consistent in implementing their
web services to use SOAP, with the exception of
Blogger, which uses XML-RPC. You will find SOAP web
services in lots of enterprise software as well.
http://www.petefreitag.com/item/431.cfm

Compare REST and SOAP

REST SOAP
Lightweight - not a Easy to consume -
lot of extra xml sometimes
markup Rigid - type
Human Readable checking, adheres to
Results a contract
Easy to build - no Development tools
toolkits required

An Effort to Aggregate Data from
Multiple Sources

Introducing ChemSpider
An online lookup engine for chemists
http://www.chemspider.com
40 million substances
Multiple data sources
A "link farm" to other sources

What is "wrong" with
wikipedia.com?

Not "wrong":

• Very informative for human beings

This little guy is left behind

Not machine-readable

Semantic Web

Describing things in a way that computer
applications can understand
"The Beatles was a band from Liverpool"
Describes the relationships between things (like A
is a part of B and Y is a member of Z) and
the properties of things (like size, weight, age, and
price)
"... will make all the data in the world look like
one huge database" – Tim Berners-Lee
http://www.w3schools.com/web/web_semantic.asp

Resource Description Framework

Is a language to describe resources on
the web
Component of the Semantic Web
Data is self-describing
Triples: "subject", "predicate", and "value"
URIs are used to denote resources

RDF

Graph Database
Nodes
Edges

Well-suited for Knowledge Representation
Beautified Data => Knowledge

RDF Example

<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:cd="http://www.recshop.fake/cd#">
  <rdf:Description
      rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
    <cd:artist>Bob Dylan</cd:artist>
    <cd:country>USA</cd:country>
    <cd:company>Columbia</cd:company>
    <cd:price>10.90</cd:price>
    <cd:year>1985</cd:year>
  </rdf:Description>
</rdf:RDF>
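
To make the triple structure visible, the sketch below reads this RDF/XML with Python's standard `xml.etree.ElementTree` and lists each (subject, predicate, value) triple. It is not a full RDF parser: it assumes the standard RDF namespace URI and only handles flat `rdf:Description` elements with literal values, as in the example. The function name is illustrative.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:cd="http://www.recshop.fake/cd#">
  <rdf:Description
      rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
    <cd:artist>Bob Dylan</cd:artist>
    <cd:country>USA</cd:country>
    <cd:company>Columbia</cd:company>
    <cd:price>10.90</cd:price>
    <cd:year>1985</cd:year>
  </rdf:Description>
</rdf:RDF>
"""


def extract_triples(xml_text):
    """Return (subject, predicate, value) tuples from flat rdf:Description blocks."""
    root = ET.fromstring(xml_text)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree tags look like "{namespace}localname";
            # joining both parts gives the predicate URI.
            namespace, local_name = prop.tag[1:].split("}")
            triples.append((subject, namespace + local_name, prop.text))
    return triples


for triple in extract_triples(RDF_XML):
    print(triple)
```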

Semantic Web Example: DBPedia

"Old School" Wikipedia:
• http://en.wikipedia.org/wiki/Porsche_Panamera

DBpedia Entries

• http://dbpedia.org/page/Porsche_Panamera
• http://dbpedia.org/page/Chromium_carbide

Query Language: SPARQL (sparkle)

Query Language for RDF
Graph Traversal
Matching the triples
Example:
Data:
<http://example.org/book/book1> <http://purl.org/dc/elements/1.1/title> "SPARQL Tutorial" .

Query:
SELECT ?title
WHERE { <http://example.org/book/book1> <http://purl.org/dc/elements/1.1/title> ?title . }

Query Result: title "SPARQL Tutorial"
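
The idea of "matching the triples" can be sketched in plain Python. This is a toy matcher for a single triple pattern, not a SPARQL engine; the `match` function and the convention that terms starting with `?` are variables are assumptions of the sketch.

```python
def match(pattern, triples):
    """Yield variable bindings for one triple pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value.
                if binding.get(term, value) != value:
                    break
                binding[term] = value
            elif term != value:
                break
        else:
            yield binding


data = [
    ("http://example.org/book/book1",
     "http://purl.org/dc/elements/1.1/title",
     "SPARQL Tutorial"),
]

query = ("http://example.org/book/book1",
         "http://purl.org/dc/elements/1.1/title",
         "?title")

print(list(match(query, data)))
```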

To Infinity and Beyond

• DB2 and Oracle are ready for this train

• Object Database
Versant OODBMS, anybody?

• Machine-Readable Data
Will it become self-aware?

“Data Finds Data” and Semantic Data
Model – A Hypothesis

Non-Obvious Relationship Awareness

[Diagram, built up over several slides, with the nodes LÂM, BẢO, LÂM's iPhone, BẢO's SS Galaxy, and TheGioiDiDong.com]

Connection detected!
- Could Bao have met Lam at Thegioididong?
- Could they have discussed their world domination
scheme during the meeting there?
- ???
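
One way to picture how such a non-obvious connection could be detected is a path search over a graph of entities. The sketch below is an illustration only: the edges (person to phone, phone to store) are assumptions read off the diagram labels, and breadth-first search is a stand-in technique, not necessarily the method the presenters had in mind.

```python
from collections import deque


def find_connection(edges, start, goal):
    """Return the shortest chain of entities linking start to goal, or None."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical edges, assumed from the diagram labels.
edges = [
    ("Lam", "Lam's iPhone"),
    ("Bao", "Bao's SS Galaxy"),
    ("Lam's iPhone", "TheGioiDiDong.com"),
    ("Bao's SS Galaxy", "TheGioiDiDong.com"),
]

print(find_connection(edges, "Lam", "Bao"))
```

Without the two store edges the search returns `None`, which mirrors the earlier slides where no connection is visible yet.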

• Data Visualization

• Building a portal from open data and
free services

Visualization of Data

A scan of the top million web
sites (per Alexa traffic data) was
performed in early 2010

Source: http://nmap.org/favicon/

Second Life
Second Life is a 3D world where everyone you see is a real person and
every place you visit is built by people just like you.

SL: The Opportunity for "Edutainment"

iSchool Teaching: Quizzes and Lectures

Classrooms with PowerPoint
Research Center
Drexel Island on Second Life

3-D Environments

http://3rdrockgrid.com/
http://www.secondlife.com/

http://www.craft-world.org

http://www.osgrid.org/

http://youralternativelife.com/

Visualization To Suggest New
Experiments

Building A Portal From Open Data And
Free Services

• Freely hosted wiki service
• Google Spreadsheet
• Google Docs API / JavaScript
• Visualization services / analysis services (2D, 3D)
• RDF / Semantic Web / Web services
• Cost: free or fit for the purpose

Key To Success

Model
+ Transparency
Information

Data

Records

Demonstration
• Google Docs
• Second Life

References

O'Reilly, Beautiful Data, Chapter 16:
Beautifying Data in the Real World
http://techland.time.com/2011/06/01/how-big-is-the-internet-spoiler-not-as-big-as-itll-be-in-2015/
http://drexelisland.wikispaces.com/
SMILES to 3D, Second Life,
http://www.youtube.com/watch?v=tOfhuoRbnCg&feature=player_embedded
