Open Data Summit Presentation by Joe Olson

Joe Olson
Data Architect
Smart Chicago Collaborative
27 Mar 2014
joe.olson@cct.org
(All the cool buzzwords in one place!)
Social Media,
Cloud Computing,
Machine Learning,
Open Source, and
Big Data Analytics

Social Media - Twitter
• What can we learn from Twitter?
• 400 million tweets per day
source: http://articles.washingtonpost.com/2013-03-21/business/37889387_1_tweets-jack-dorsey-twitter
• 218 million users
source: http://techcrunch.com/2013/10/03/bweeting/
• Excellent source of sentiment
• Excellent source of big data
• Prototyping
• Modeling natural language
• Resume padding

• How do we get at the data?
• Twitter provided APIs:
• https://dev.twitter.com/docs
• Streaming
• Set up a real-time data stream (JSON) based on keywords
• REST (v1.1)
• Make REST requests, and get results
• Possible parameters:
• Geospatial bounding box
• By time
• By user, hashtag, retweets etc
• Fire hose
• Big $$$. Big data
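The keyword and bounding-box parameters listed above can be illustrated offline. The following is a minimal Python sketch that applies the same kind of filters to tweets that have already been received as JSON; it does not call the Twitter API. The function names and sample tweets are hypothetical, and it assumes location arrives in a GeoJSON-style "coordinates" field ordered as [longitude, latitude].

```python
import json

def matches_keywords(tweet, keywords):
    """True if any keyword appears in the tweet text (case-insensitive)."""
    text = tweet.get("text", "").lower()
    return any(k.lower() in text for k in keywords)

def in_bounding_box(tweet, west, south, east, north):
    """True if the tweet carries a point inside the box.

    Returns False when no location is attached, which is the common case.
    """
    point = tweet.get("coordinates")
    if not point:
        return False
    lon, lat = point["coordinates"]  # assumed order: [longitude, latitude]
    return west <= lon <= east and south <= lat <= north

# Hypothetical sample lines standing in for a live stream
raw_lines = [
    '{"text": "Flu :(", "coordinates": null}',
    '{"text": "Got the flu in Chicago", '
    '"coordinates": {"type": "Point", "coordinates": [-87.63, 41.88]}}',
    '{"text": "Nice weather", '
    '"coordinates": {"type": "Point", "coordinates": [-87.63, 41.88]}}',
]

tweets = [json.loads(line) for line in raw_lines]
flu_tweets = [t for t in tweets if matches_keywords(t, ["flu"])]
# Rough box around Chicago, chosen only for illustration
chicago_flu = [t for t in flu_tweets if in_bounding_box(t, -88.0, 41.6, -87.5, 42.1)]
```

Filtering by keyword first and by location second mirrors the point made in the slides: most tweets carry no coordinates, so a bounding box alone discards nearly everything.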

• Information & Obstacles
• Who
• What
• At best: plain English (!)
• Worse: other languages (Spanish, Arabic, Portuguese...)
• Worst: “textspeak” and symbols such as :-0, UTF-8 characters, etc.
• Absolute worst: a combination of all of them
• Where
• 1-2% with latitude / longitude
• Geocode
• When

JSON Tweet example (excerpt):
{
  "created_at": "Sun Oct 27 13:57:40 +0000 2013",
  "id": 394462908261740540,
  "text": "Flu :(",
  "source": "<a href=\"http://twitmania.com\" rel=\"nofollow\">TwitMania™</a>",
  "user": {
    "id": 594141140,
    "name": "Yultiana Farida N",
    "screen_name": "yultiana",
    "followers_count": 231,
    "friends_count": 252,
    "created_at": "Tue May 29 23:58:25 +0000 2012",
    "statuses_count": 2397
  },
  "geo": null
}
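To connect the example tweet to the who / what / where / when breakdown, here is a minimal Python sketch that parses a trimmed copy of it. The `summarize` helper is hypothetical, and the trimmed JSON keeps only the fields the sketch needs.

```python
import json
from datetime import datetime

# Trimmed copy of the example tweet, wrapped in braces so it parses as JSON
raw = """{
  "created_at": "Sun Oct 27 13:57:40 +0000 2013",
  "id": 394462908261740540,
  "text": "Flu :(",
  "user": {"id": 594141140, "screen_name": "yultiana", "followers_count": 231},
  "geo": null
}"""

# Twitter's timestamp layout: weekday, month, day, time, UTC offset, year
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def summarize(tweet):
    """Pull out who / what / where / when from a parsed tweet dict."""
    return {
        "who": tweet["user"]["screen_name"],
        "what": tweet["text"],
        "where": tweet.get("geo"),  # None when no location is attached
        "when": datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT),
    }

summary = summarize(json.loads(raw))
```

The "where" value comes back as `None` here, which is the typical case and the reason geocoding shows up later as an enrichment step.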

Cloud Computing
• What does cloud computing bring to the table?
• Amazon’s EC2:
• Commoditized hardware
• Low cost
• Only charged for resources you use
• No long term commitments
• Scalable
• "Throwaway" mentality
**IF** you play by their rules!

Cloud Computing – AWS
• Tools
• Virtual Machines
• # of Processors, RAM, OS, disk capacity and I/O – all configurable
• Price range: $0.02/hr - $4.60/hr
• Licensed OSes cost 50% more than Linux OSes
• Archive Storage
• S3 / Glacier
• Work Queues
• SQS
• Data Stores
• DynamoDB (key-value store), Redshift (analytics store)
• Virtual Networking
• Routers, VPN gateways, access control lists, etc
• APIs
• Command line
• HTTPS REST
• Native programming languages (Python, bash, PHP, Java etc.)
Ideal for rapid prototyping / proofs of concept

Cloud Computing – AWS
• APIs
• Basic
• Start an instance (and start billing)
• Stop an instance (stop billing)
• Insert item into queue
• Remove item from queue
• Write to backup store
• Ultra advanced
• Reserved vs. on demand vs. spot instances
• Price can drop as much as 80% due to market demand
• Instance can disappear at any time
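To put the "as much as 80%" figure in concrete terms, here is a small Python sketch of the arithmetic. The $0.06/hr rate is the "small" instance rate quoted in the cost slides of this deck; the 80% discount is the best case mentioned above, not a guaranteed price, and the function names are hypothetical.

```python
def daily_cost(hourly_rate, hours=24):
    """Cost of running one instance for the given number of hours."""
    return hourly_rate * hours

def spot_rate(on_demand_rate, discount):
    """Hourly rate after a fractional discount (0.80 means 80% off)."""
    return on_demand_rate * (1.0 - discount)

on_demand = 0.06                             # $/hr for a "small" instance
best_case_spot = spot_rate(on_demand, 0.80)  # best case from the slide
savings_per_day = daily_cost(on_demand) - daily_cost(best_case_spot)

print(f"on demand: ${daily_cost(on_demand):.2f}/day")
print(f"best-case spot: ${daily_cost(best_case_spot):.3f}/day")
```

The saving only applies to work that tolerates interruption, since a spot instance can disappear at any time.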

Big Data Analytics
• Can we skirt the “big data” problem by distilling millions and
millions of “noise” tweets down into a smaller, more useful
data set?
• Enrich in real time, rather than on archived data, and avoid the
overhead of map/reduce?
• Possible Enrichment of raw data:
• Classification – separate tweets into “relevant” and “irrelevant”
• Geocoding – improve on the 1-2% ?
• Aggregation -> map/reduce
• Map function -> Reduce function -> Output
• AWS – Elastic Map Reduce
• Clustering
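The map -> reduce -> output flow named above can be shown on a single machine. This is a minimal Python sketch, not Elastic MapReduce itself: the sample tweets are invented, and counting tweets per "lang" field is just one possible aggregation, chosen for illustration.

```python
from collections import defaultdict

def map_tweet(tweet):
    """Map step: emit (key, 1) pairs; here the key is the tweet's language tag."""
    yield (tweet.get("lang", "unknown"), 1)

def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical sample data
tweets = [
    {"text": "Flu :(", "lang": "en"},
    {"text": "Tengo gripe", "lang": "es"},
    {"text": "So sick today", "lang": "en"},
    {"text": ":-0"},  # no language tag attached
]

pairs = [pair for tweet in tweets for pair in map_tweet(tweet)]
counts = reduce_counts(pairs)
```

In a real map/reduce job the map calls run in parallel across machines and the framework groups pairs by key before the reduce step; the logic per tweet stays the same.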

Machine Learning
• Classification: relevant, or irrelevant?
• Human trained model
• Once model is established, bounce new data off it for
classification
• Validation of model
• Accuracy =
(Total # of classifications – Mismatches between machine and human) / Total # of classifications
• Crowdsourcing – AWS Mechanical Turk
• Improve model by feeding disagreements back into the model
• Our best text classification model to date: accuracy in the low 90% range
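The accuracy measure and the feedback loop can be sketched in a few lines of Python. The labels and texts below are invented for illustration, and the function names are hypothetical; the actual model described in this deck was built in R.

```python
def accuracy(machine_labels, human_labels):
    """(total classifications - machine/human mismatches) / total classifications."""
    if len(machine_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    total = len(human_labels)
    if total == 0:
        raise ValueError("need at least one classification")
    mismatches = sum(1 for m, h in zip(machine_labels, human_labels) if m != h)
    return (total - mismatches) / total

def disagreements(texts, machine_labels, human_labels):
    """Texts where the machine and the human differ, paired with the human label.

    These are the candidates to feed back into training.
    """
    return [(t, h) for t, m, h in zip(texts, machine_labels, human_labels) if m != h]

# Hypothetical validation sample
texts = ["got food poisoning", "nice lunch", "paper on food poisoning",
         "sick after tacos", "new menu"]
machine = ["relevant", "irrelevant", "relevant", "relevant", "irrelevant"]
human = ["relevant", "irrelevant", "irrelevant", "relevant", "irrelevant"]

score = accuracy(machine, human)
to_retrain = disagreements(texts, machine, human)
```

Here one of five labels disagrees, so the score is 0.8, and the single disagreement is returned with the human label attached so it can be added to the training set.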

Open Source
• Friendly to the commoditized computing paradigm
• Don’t have to worry about licensing issues
• Contributes to the “throwaway” discipline
• Don’t have to re-invent the wheel (collaboration)
• Solutions applicable to all parts of the architecture
• Acquire data: Node.js – non blocking
• Analyze data: R – statistical engine
• Store and query data: MongoDB (document store) or Riak
(key-value database)

Architecture
• We know Twitter is providing a mountain of data from all parts
of the world
• We know Amazon is providing a framework of low-cost,
on-demand, no-commitment computing
• Open source is providing a rich tool set
• Goals:
• Architect with cost in mind!
• Enrichment - Real time and after-the-fact enrichment (open data)
• Scalable
• Decoupled
• Service based
• Rapid development
• Prove the concepts

Architecture - Acquire
• Acquire the data from Twitter
• If classifying in real time:
• Store then classify?
• Classify then store?
• Tools
• Twitter streaming API
• Keywords
• Node.js
• Several different packages to interface with Twitter APIs
• Amazon
• EC2
• SQS (?) Extremely useful, but drives the cost up

Architecture - Analyze
• Classification interface
• Service based – HTTP REST
• Push or pull?
• Push – classifiers listen on port 80
• Pull – classifier starts pulling from an established work queue
• Both highly scalable and flexible with respect to cost.
• Stateless
• R
• Human trained machine learning packages available
• Cloud friendly – no licenses
• Automatable – from installation through configuration to execution
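The pull model can be sketched with an in-memory queue. This Python example is illustrative only: `queue.Queue` stands in for a real work queue such as SQS, and the keyword rule in `classify` is a hypothetical placeholder for the trained R model.

```python
import queue

def classify(text):
    """Stateless stand-in for the trained model: a keyword rule, purely illustrative."""
    return "relevant" if "food poisoning" in text.lower() else "irrelevant"

def pull_worker(work_queue, results):
    """Pull model: take items off the queue until it is empty.

    Each item is classified independently, so any number of workers
    could drain the same queue.
    """
    while True:
        try:
            text = work_queue.get_nowait()
        except queue.Empty:
            break
        results.append((text, classify(text)))
        work_queue.task_done()

work = queue.Queue()
for text in ["I got food poisoning last night", "Great pizza on Halsted"]:
    work.put(text)

results = []
pull_worker(work, results)
```

Because `classify` keeps no state between calls, scaling up is a matter of starting more workers, and scaling down is a matter of stopping them.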

Architecture - Store
• Store JSON as an object (document store) or normalize (relational
database)?
• Relational databases
• disk I/O intensive – not cloud friendly
• allow complex indexing
• Easy to get a business intelligence front end on them
• Requires a schema / ETL
• Key-value / document stores
• Designed to be scalable – don’t need fast disks
• Indexing is not nearly as flexible as RDBMS
• More difficult to front a UI – no “drag and drop” tools
• No schema / ETL needed.
• Not as mature
• MongoDB / Riak
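The trade-off between storing the JSON whole and normalizing it can be shown side by side. This Python sketch uses an in-memory SQLite database for both styles purely for convenience; it is a stand-in, not MongoDB, Riak, or a production RDBMS, and the table names are hypothetical.

```python
import json
import sqlite3

tweet = {
    "id": 394462908261740540,
    "text": "Flu :(",
    "user": {"id": 594141140, "screen_name": "yultiana"},
    "geo": None,
}

conn = sqlite3.connect(":memory:")

# Document style: keep the whole JSON object, keyed by tweet id.
# No schema design or ETL step is needed.
conn.execute("CREATE TABLE tweet_docs (id INTEGER PRIMARY KEY, doc TEXT)")
conn.execute("INSERT INTO tweet_docs VALUES (?, ?)", (tweet["id"], json.dumps(tweet)))

# Relational style: a schema plus an ETL step that flattens the nested user object.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, screen_name TEXT)")
conn.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, "
    "user_id INTEGER REFERENCES users(id), text TEXT)"
)
conn.execute(
    "INSERT INTO users VALUES (?, ?)",
    (tweet["user"]["id"], tweet["user"]["screen_name"]),
)
conn.execute(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    (tweet["id"], tweet["user"]["id"], tweet["text"]),
)

# The relational form supports joins and indexes directly...
row = conn.execute(
    "SELECT u.screen_name, t.text FROM tweets t JOIN users u ON u.id = t.user_id"
).fetchone()

# ...while the stored document is parsed back before its fields are used.
doc = json.loads(conn.execute("SELECT doc FROM tweet_docs").fetchone()[0])
```

The document insert is one statement with no schema decisions; the relational insert needs two tables and a flattening step, but pays that back at query time.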

Architecture – Presentation
• Least need for cloud friendly scalability here?
• Options
• Licensed BI software – Tableau, Endeca, Jaspersoft, Pentaho
• Open source BI software – SpagoBI
• Roll your own - PHP, Ruby, Visual Basic, Javascript, etc
• Connect to an existing system instead?

Costs – Real Time Classification
• Number of tweets collected per day: 1,000,000 (comfortable; about 0.25% of all tweets)
• Machine used on EC2 to acquire (node.js): micro
• $0.02/hr * 24 hrs = $0.48/day
• Machine used on EC2 to classify (R): small (x2)
• $0.06/hr * 24 hrs = $1.44/day * 2 = $2.88/day
• Machine used on EC2 to store (MongoDB): large
• $0.24/hr * 24 hrs = $5.76/day
• Machine used on EC2 for GUI (Apache): small
• $0.06/hr * 24 hrs = $1.44/day
• Total: $0.48 + $2.88 + $5.76 + $1.44 = $10.56/day
• $10.56 / 1,000,000 tweets = $0.00001056 per tweet (about 0.001 cents)
Can add more zeros if you relax real-time classification (spot instances)
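The cost arithmetic above can be reproduced with the units made explicit. This Python sketch uses only the hourly rates and machine counts from the slide; the variable names are hypothetical.

```python
HOURS_PER_DAY = 24
TWEETS_PER_DAY = 1_000_000

# (role, hourly rate in dollars, number of machines), as listed on the slide
machines = [
    ("acquire (node.js), micro", 0.02, 1),
    ("classify (R), small", 0.06, 2),
    ("store (MongoDB), large", 0.24, 1),
    ("GUI (Apache), small", 0.06, 1),
]

daily_total = sum(rate * HOURS_PER_DAY * count for _, rate, count in machines)
dollars_per_tweet = daily_total / TWEETS_PER_DAY
cents_per_tweet = dollars_per_tweet * 100

print(f"daily total: ${daily_total:.2f}")
print(f"per tweet: ${dollars_per_tweet:.8f} ({cents_per_tweet:.6f} cents)")
```

The daily total comes to $10.56, which works out to roughly one thousandth of a cent per tweet.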

Costs - Archive
• Size of average tweet: 2.5 KB
• Cost to archive:
• S3: $0.095 per GB per month
• Roughly $0.0000002 per tweet per month
• Glacier: $0.01 per GB per month
• Roughly $0.00000002 per tweet per month
• Compression will add even more zeros, but will require more
computing power, and mean more latency for post collection
data analysis. Can be automated.
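The per-tweet archive figures follow from the average tweet size and the per-GB price. This Python sketch shows the arithmetic, assuming prices are in dollars per GB per month and decimal units (1 GB = 1,000,000 KB); the 30-day month and the function name are also assumptions made for illustration.

```python
TWEET_SIZE_KB = 2.5
KB_PER_GB = 1_000_000  # decimal units, an assumption

def monthly_cost_per_tweet(price_per_gb_month):
    """Dollars per month to keep one average-sized tweet archived."""
    return TWEET_SIZE_KB / KB_PER_GB * price_per_gb_month

s3_per_tweet = monthly_cost_per_tweet(0.095)
glacier_per_tweet = monthly_cost_per_tweet(0.01)

# Storage added by one 30-day month of collection at 1,000,000 tweets per day
month_gb = 1_000_000 * 30 * TWEET_SIZE_KB / KB_PER_GB
s3_month_cost = month_gb * 0.095

print(f"S3: ${s3_per_tweet:.10f} per tweet per month")
print(f"Glacier: ${glacier_per_tweet:.10f} per tweet per month")
print(f"one month of tweets: {month_gb:.0f} GB, ${s3_month_cost:.2f}/month on S3")
```

Under these assumptions a month of collection adds about 75 GB, so uncompressed archiving stays in the single-digit dollars per month on S3.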

Use Cases
• Foodborne Chicago (http://foodborne.smartchicagoapps.org/)
• Public-private partnership with City of Chicago Dept. of Public Health
and Smart Chicago Collaborative
• Reach out to city residents on Twitter tweeting about food poisoning
symptoms, in an attempt to get them to log information in the City’s
311 database (via the Open311 API)
• Once in the 311 database, it follows established City workflows, and
becomes actionable
• Numbers (1 year):
• 2,390 tweets classified as related to food poisoning
• 282 tweets responded to
• 205 reports submitted
• 145 inspections
• Real time classification examples:
• “Ugh! I got food poisoning from the McDonald’s on Halstead!”
http://184.73.52.31/cgi-bin/R/fp_classifier?text=Ugh!%20I%20got%20food%20poisoning%20from%20McDonalds%20on%20Halstead
• “U of Chicago releases a new paper on the effects of food poisoning”
http://184.73.52.31/cgi-bin/R/fp_classifier?text=U%20of%20Chicago%20releases%20new%20paper%20on%20the%20effects%20of%20food%20poisoning
• Video: http://www.youtube.com/watch?v=RNf9XQ_25Yw&feature=youtu.be
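The classifier examples above are plain HTTP GET requests with the tweet text percent-encoded in a `text` query parameter. This Python sketch only builds and decodes such a URL; it does not send a request, and the endpoint from the slide may no longer be live. The `classifier_url` helper is hypothetical.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Endpoint as shown in the examples above; may no longer be reachable
CLASSIFIER_ENDPOINT = "http://184.73.52.31/cgi-bin/R/fp_classifier"

def classifier_url(text):
    """Build the GET URL for one piece of text, encoding spaces as %20."""
    return f"{CLASSIFIER_ENDPOINT}?text={quote(text, safe='!')}"

tweet_text = "Ugh! I got food poisoning from McDonalds on Halstead"
url = classifier_url(tweet_text)

# Round trip: decoding the query string recovers the original text
decoded = parse_qs(urlsplit(url).query)["text"][0]
```

Keeping the interface to a single URL with one parameter is what makes the classifier easy to call from any language or tool in the pipeline.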

Use Cases
• Disease Tracker
• Large scale attempt to track disease occurrences in the United
States.
• Sponsored by the Dept. of HHS
• Approximately 1 million tweets a day (cold, flu) classified in real
time
• EC2 scalable instances
• Geolocation
• Cost to run for 6 months: $850

Future Directions
• Turnkey service
• Can all this functionality be abstracted down to a pushbutton
service?
• Open data
• Can you advertise the data collected, how you enriched it, and
allow others to come along and enrich it as well?
• General purpose bridge between Twitter and issue tracking
databases
• Big industry problem

GitHub Sources
• Tweet Collector
• https://github.com/smartchicago/TweetCollector
• Classifier Code
• https://github.com/corynissen/foodborne_classifier
