Joe Olson
Data Architect
Smart Chicago Collaborative
27 Mar 2014
joe.olson@cct.org
(All the cool buzzwords in one place!)
Social Media,
Cloud Computing,
Machine Learning,
Open Source, and
Big Data Analytics
Social Media - Twitter
• What can we learn from Twitter?
• 400 million tweets per day
source: http://articles.washingtonpost.com/2013-03-21/business/37889387_1_tweets-jack-dorsey-twitter
• 218 million users
source: http://techcrunch.com/2013/10/03/bweeting/
• Excellent source of sentiment
• Excellent source of big data
• Prototyping
• Modeling natural language
• Resume padding
Social Media - Twitter
• How do we get at the data?
• Twitter provided APIs:
• https://dev.twitter.com/docs
• Streaming
• Set up a real time data stream (json) based on keywords
• REST (v1.1)
• Make REST requests, and get results
• Possible parameters:
• Geospatial bounding box
• By time
• By user, hashtag, retweets etc
• Fire hose
• Big $$$. Big data
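The geospatial bounding-box parameter above can also be applied on the client side once tweets arrive. The following is a minimal sketch, not Twitter's own API: the `BoundingBox` type, the `in_bounding_box` helper, and the approximate Chicago corner coordinates are all assumptions made for illustration.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """South-west and north-east corners, in decimal degrees."""
    sw_lat: float
    sw_lon: float
    ne_lat: float
    ne_lon: float


def in_bounding_box(lat: float, lon: float, box: BoundingBox) -> bool:
    """Return True if the point lies inside the box (edges included)."""
    return box.sw_lat <= lat <= box.ne_lat and box.sw_lon <= lon <= box.ne_lon


# Rough box around Chicago; the corner values are approximate assumptions.
CHICAGO = BoundingBox(sw_lat=41.64, sw_lon=-87.94, ne_lat=42.02, ne_lon=-87.52)

print(in_bounding_box(41.88, -87.63, CHICAGO))  # a point in downtown Chicago
print(in_bounding_box(40.71, -74.01, CHICAGO))  # a point in New York City
```

This simple comparison does not handle boxes that cross the 180th meridian, which is acceptable for a city-sized area.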
Social Media - Twitter
• Information & Obstacles
• Who
• What
• At best: Plain English (!)
• Worse: (Spanish or Arabic or Portuguese...)
• Worst: “Textspeak” symbols :-0, UTF8 chars, etc.
• Absolute Worst: combination of all of them
• Where
• 1-2% with latitude / longitude
• Geocode
• When
Social Media - Twitter
JSON Tweet example:
• "created_at":"Sun Oct 27 13:57:40 +0000 2013",
• "id":394462908261740540,
• "text":"Flu :(",
• "source":"<a href=\"http://twitmania.com\" rel=\"nofollow\">TwitMania™</a>",
• "user":{
• "id":594141140,
• "name":"Yultiana Farida N",
• "screen_name":"yultiana",
• "followers_count":231,
• "friends_count":252,
• "created_at":"Tue May 29 23:58:25 +0000 2012",
• "statuses_count":2397,
• },
• "geo":null,
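As a sketch of how the who / what / where / when fields could be pulled out of a tweet like this one, the example below parses a completed version of the fragment. The field values are copied from the slide; the `summarize` function and the shape of its output are assumptions for illustration only.

```python
import json
from datetime import datetime

# A closed, valid version of the fragment shown on the slide.
RAW_TWEET = """
{
  "created_at": "Sun Oct 27 13:57:40 +0000 2013",
  "id": 394462908261740540,
  "text": "Flu :(",
  "user": {
    "id": 594141140,
    "screen_name": "yultiana",
    "followers_count": 231
  },
  "geo": null
}
"""


def summarize(raw: str) -> dict:
    """Extract who / what / where / when from one tweet's JSON text."""
    tweet = json.loads(raw)
    # Twitter's timestamp layout: weekday, month, day, time, UTC offset, year.
    when = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "who": tweet["user"]["screen_name"],
        "what": tweet["text"],
        "where": tweet["geo"],  # None for the large majority of tweets
        "when": when.isoformat(),
    }


print(summarize(RAW_TWEET))
```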
Cloud Computing
• What does cloud computing bring to the table?
• Amazon’s EC2:
• Commoditized hardware
• Low cost
• Only charged for resources you use
• No long term commitments
• Scalable
• "Throwaway" mentality
**IF** you play by their rules!
Cloud Computing – AWS
• Tools
• Virtual Machines
• # of Processors, RAM, OS, disk capacity and I/O – all configurable
• Price range: $.02/hr - $4.60/hr
• Licensed OSes cost 50% more than Linux OSes
• Archive Storage
• S3 / Glacier
• Work Queues
• SQS
• Data Stores
• DynamoDB (key-value store), Redshift (analysis store)
• Virtual Networking
• Routers, VPN gateways, access control lists, etc
• APIs
• Command line
• HTTPS REST
• Native programming languages (Python, bash, PHP, Java etc.)
Ideal for rapid prototyping / proof of concepts
Cloud Computing – AWS
• APIs
• Basic
• Start an instance (and start billing)
• Stop an instance (stop billing)
• Insert item into queue
• Remove item from queue
• Write to backup store
• Ultra advanced
• Reserved vs. on demand vs. spot instances
• Price can drop as much as 80% due to market demand
• Instance can disappear at any time
Big Data Analytics
• Can we skirt the “big data” problem by distilling the tweets
down from millions and millions of “noise” tweets into a more
desirable data set?
• Enrich in real time, rather than on archived data, and avoid the
overhead of map/reduce?
• Possible Enrichment of raw data:
• Classification – separate tweets into “relevant” and “irrelevant”
• Geocoding – improve on the 1-2% ?
• Aggregation –> map reduce
• Mapping -> Reduce Function -> Output
• AWS – Elastic Map Reduce
• Clustering
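The Mapping -> Reduce Function -> Output flow can be sketched on a single machine, without Elastic Map Reduce. In this illustrative example the keyword list, the sample tweets, and the function names are all assumptions; it only shows the shape of the aggregation step.

```python
from collections import Counter
from functools import reduce

KEYWORDS = ("flu", "cold")  # hypothetical keywords of interest


def map_tweet(text: str) -> Counter:
    """Map step: emit a count of 1 for each keyword present in one tweet."""
    words = text.lower().split()
    return Counter(k for k in KEYWORDS if k in words)


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right


tweets = ["Flu :(", "I have a cold and the flu", "Nice weather today"]
totals = reduce(reduce_counts, map(map_tweet, tweets), Counter())

print(dict(totals))
```

Because the reduce step only adds partial counts, the map step can run on as many machines as needed and the results can be merged in any order.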
Machine Learning
• Classification: relevant, or irrelevant?
• Human trained model
• Once model is established, bounce new data off it for
classification
• Validation of model
• Accuracy = (Total # of classifications – Mismatches between machine / human) / Total # of classifications
• Crowdsourcing – AWS Mechanical Turk
• Improve model by feeding disagreements back into the model
• Our best text classification model to date: low 90%
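The accuracy measure above can be computed directly from paired machine and human labels. This is a minimal sketch; the `accuracy` function name and the sample labels are made up for illustration and are not results from the project.

```python
from typing import Sequence


def accuracy(machine: Sequence[bool], human: Sequence[bool]) -> float:
    """(total classifications - machine/human mismatches) / total classifications."""
    if len(machine) != len(human):
        raise ValueError("machine and human label lists must be the same length")
    if not machine:
        raise ValueError("at least one classification is required")
    total = len(machine)
    mismatches = sum(m != h for m, h in zip(machine, human))
    return (total - mismatches) / total


# Hypothetical labels: True means "relevant", False means "irrelevant".
machine_labels = [True, True, False, False, True]
human_labels = [True, False, False, False, True]

print(accuracy(machine_labels, human_labels))
```

The tweets where the two lists disagree are the ones worth feeding back into the training set.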
Open Source
• Friendly to the commoditized computing paradigm
• Don’t have to worry about licensing issues
• Contributes to the “throwaway” discipline
• Don’t have to re-invent the wheel (collaboration)
• Solutions applicable to all parts of the architecture
• Acquire data: Node.js – non blocking
• Analyze data: R – statistical engine
• Store and query data: MongoDB (document store) or Riak (key-value
database)
Architecture
• We know Twitter is providing a mountain of data from all parts
of the world
• We know Amazon is providing a framework of low cost,
on-demand, no commitment computing
• Open source is providing a rich tool set
• Goals:
• Architect with cost in mind!
• Enrichment - Real time and after-the-fact enrichment (open data)
• Scalable
• Decoupled
• Service based
• Rapid development
• Prove the concepts
Architecture - Acquire
• Acquire the data from Twitter
• If classifying in real time:
• Store then classify?
• Classify then store?
• Tools
• Twitter streaming API
• Keywords
• Node.js
• Several different packages to interface with Twitter APIs
• Amazon
• EC2
• SQS (?) Extremely useful, but drives the cost up
Architecture - Analyze
• Classification interface
• Service based – HTTP REST
• Push or pull?
• Push – classifiers listen on port 80
• Pull – classifier starts pulling from an established work queue
• Both highly scalable and flexible with respect to cost.
• Stateless
• R
• Human trained machine learning packages available
• Cloud friendly – no licenses
• Automatable – from install, configuration, execution
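The pull model can be sketched as a stateless worker that drains a work queue. In this example Python's in-process `queue.Queue` stands in for a hosted queue such as SQS, and `classify` is a naive keyword rule standing in for the trained R model; both substitutions are assumptions made to keep the sketch self-contained.

```python
import queue
from typing import List, Tuple


def classify(text: str) -> str:
    """Placeholder classifier: a keyword rule, NOT the trained model."""
    return "relevant" if "food poisoning" in text.lower() else "irrelevant"


def drain(work: "queue.Queue[str]") -> List[Tuple[str, str]]:
    """Pull items until the queue is empty, classifying each one.

    The worker keeps no state between items, so any number of identical
    workers could pull from the same queue.
    """
    results = []
    while True:
        try:
            text = work.get_nowait()
        except queue.Empty:
            break
        results.append((text, classify(text)))
        work.task_done()
    return results


work_queue: "queue.Queue[str]" = queue.Queue()
work_queue.put("Ugh! I got food poisoning last night")
work_queue.put("Great lunch downtown today")

for text, label in drain(work_queue):
    print(label, "-", text)
```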
Architecture - Store
• Store JSON as an object (document store) or normalize (relational
database)?
• Relational databases
• disk I/O intensive – not cloud friendly
• allow complex indexing
• Easy to get a business intelligence front end on them
• Requires a schema / ETL
• Key-value document stores
• Designed to be scalable – doesn’t need fast disks
• Indexing is not nearly as flexible as RDBMS
• More difficult to front a UI – no “drag and drop” tools
• No schema / ETL needed.
• Not as mature
• MongoDB / Riak
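The trade-off between storing the JSON whole and normalizing it can be illustrated with one tweet stored both ways. In this sketch an in-memory SQLite database stands in for both kinds of store (it is not MongoDB or Riak), and the table and column names are assumptions.

```python
import json
import sqlite3

tweet = {
    "created_at": "Sun Oct 27 13:57:40 +0000 2013",
    "id": 394462908261740540,
    "text": "Flu :(",
    "user": {"id": 594141140, "screen_name": "yultiana"},
    "geo": None,
}

conn = sqlite3.connect(":memory:")

# Option 1: document style. Store the JSON as-is; no schema or ETL step.
conn.execute("CREATE TABLE tweet_docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "INSERT INTO tweet_docs VALUES (?, ?)", (tweet["id"], json.dumps(tweet))
)

# Option 2: relational style. Flatten nested fields into typed columns (the ETL).
conn.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, created_at TEXT, "
    "text TEXT, user_id INTEGER, screen_name TEXT)"
)
conn.execute(
    "INSERT INTO tweets VALUES (?, ?, ?, ?, ?)",
    (tweet["id"], tweet["created_at"], tweet["text"],
     tweet["user"]["id"], tweet["user"]["screen_name"]),
)

# The relational copy can be filtered on any column directly.
flat_row = conn.execute(
    "SELECT screen_name FROM tweets WHERE text = ?", ("Flu :(",)
).fetchone()

# In this stand-in, the document copy is decoded before nested fields are read.
doc_body = conn.execute("SELECT body FROM tweet_docs").fetchone()[0]
doc_user = json.loads(doc_body)["user"]["screen_name"]

print(flat_row[0], doc_user)
```

The document route accepts any tweet shape without changes, while the relational route needs the schema and the flattening code to be kept in step with the data.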
Architecture – Presentation
• Least need for cloud friendly scalability here?
• Options
• Licensed BI software – Tableau, Endeca, Jaspersoft, Pentaho
• Open source BI software – SpagoBI
• Roll your own - PHP, Ruby, Visual Basic, Javascript, etc
• Connect to an existing system instead?
Costs – Real Time Classification
• Number of tweets collected per day: 1,000,000 (comfortable - .25%)
• Machine used on EC2 to acquire (node.js): micro
• $.02/hr * 24 hrs = $.48/day
• Machine used on EC2 to classify (R): small (x2)
• $.06/hr * 24 hrs = $1.44/day*2 = $2.88/day
• Machine used on EC2 to store (MongoDB): large
• $.24/hr * 24 hrs = $5.76 /day
• Machine used on EC2 for GUI (Apache): small
• $.06/hr * 24 hrs = $1.44/day
• Total: $0.48 + $2.88 + $5.76 + $1.44 = $10.56/day
• $10.56 / 1,000,000 tweets = $0.00001056 per tweet (about 0.001 cents)
Can add more zeros if you relax real-time classification (spot instances)
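The daily cost arithmetic above can be checked with a few lines of code. The hourly rates and machine counts are taken from the slide; the variable names are illustrative. Note that dividing a dollar total by the tweet count gives a per-tweet figure in dollars.

```python
from decimal import Decimal

HOURS_PER_DAY = 24
TWEETS_PER_DAY = 1_000_000

# (role, hourly rate in dollars, number of machines), as listed on the slide.
machines = [
    ("acquire (node.js), micro", Decimal("0.02"), 1),
    ("classify (R), small", Decimal("0.06"), 2),
    ("store (MongoDB), large", Decimal("0.24"), 1),
    ("GUI (Apache), small", Decimal("0.06"), 1),
]

daily_cost = sum(rate * HOURS_PER_DAY * count for _, rate, count in machines)
cost_per_tweet = daily_cost / TWEETS_PER_DAY

print(f"daily cost:     ${daily_cost}")
print(f"cost per tweet: ${cost_per_tweet:f}")
```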
Costs - Archive
• Size of average tweet: 2.5 KB
• Cost to archive:
• S3: $.095 per GB/month
• about $0.0000002 per tweet per month
• Glacier: $.01 per GB/month
• about $0.00000002 per tweet per month
• Compression will add even more zeros, but will require more
computing power, and mean more latency for post collection
data analysis. Can be automated.
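The per-tweet archive figures follow from the average tweet size and the per-GB monthly price. The sketch below reproduces them under two assumptions: the prices are in US dollars per GB per month, and 1 GB is counted as 1,000,000 KB. The function name is illustrative.

```python
from decimal import Decimal

TWEET_SIZE_KB = Decimal("2.5")
KB_PER_GB = Decimal(1_000_000)  # assumption: decimal gigabytes


def archive_cost_per_tweet(price_per_gb_month: Decimal) -> Decimal:
    """Monthly storage cost of one average-sized tweet, in dollars."""
    return TWEET_SIZE_KB / KB_PER_GB * price_per_gb_month


s3_cost = archive_cost_per_tweet(Decimal("0.095"))
glacier_cost = archive_cost_per_tweet(Decimal("0.01"))

print(f"S3:      ${s3_cost:f} per tweet per month")
print(f"Glacier: ${glacier_cost:f} per tweet per month")
```

The exact values sit slightly above the slide's figures, which appear to be truncated to one significant digit.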
Use Cases
• Foodborne Chicago (http://foodborne.smartchicagoapps.org/)
• Public-private partnership with City of Chicago Dept. of Public Health
and Smart Chicago Collaborative
• Reach out to city residents on Twitter tweeting about food poisoning
symptoms, in an attempt to get them to log information in the City’s
311 database (via the Open311 API)
• Once in the 311 database, it follows established City workflows, and
becomes actionable
• Numbers (1 year):
• 2,390 tweets classified as related to food poisoning
• 282 tweets responded to
• 205 reports submitted
• 145 inspections
• Real time classification examples:
• “Ugh! I got food poisoning from the McDonald’s on Halstead!”
http://184.73.52.31/cgi-bin/R/fp_classifier?text=Ugh!%20I%20got%20food%20poisoning%20from%20McDonalds%20on%20Halstead
• “U of Chicago releases a new paper on the effects of food poisoning”
http://184.73.52.31/cgi-bin/R/fp_classifier?text=U%20of%20Chicago%20releases%20new%20paper%20on%20the%20effects%20of%20food%20poisoning
• Video: http://www.youtube.com/watch?v=RNf9XQ_25Yw&feature=youtu.be
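The classifier URLs above are plain HTTP GET requests with the tweet text percent-encoded into a `text` query parameter. The sketch below only builds such a URL and does not call the service, which may no longer be running. The endpoint is the one shown on the slide; the `classifier_url` helper is an assumption for illustration.

```python
from urllib.parse import quote, urlencode

# Endpoint as shown on the slide; it may no longer be reachable.
CLASSIFIER_ENDPOINT = "http://184.73.52.31/cgi-bin/R/fp_classifier"


def classifier_url(text: str) -> str:
    """Build the GET URL for one tweet, encoding spaces as %20."""
    # quote_via=quote gives %20 for spaces; safe="!" leaves "!" unescaped,
    # matching the form of the URLs on the slide.
    query = urlencode({"text": text}, quote_via=quote, safe="!")
    return f"{CLASSIFIER_ENDPOINT}?{query}"


print(classifier_url("Ugh! I got food poisoning from McDonalds on Halstead"))
```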
Use Cases
• Disease Tracker
• Large scale attempt to track disease occurrences in the United
States.
• Sponsored by the Dept. of HHS
• Approximately 1 million tweets a day (cold, flu) classified in real
time
• EC2 scalable instances
• Geolocation
• Cost to run for 6 months: $850
Future Directions
• Turnkey service
• Can all this functionality be abstracted down to a pushbutton
service?
• Open data
• Can you advertise the data collected, how you enriched it, and
allow others to come along and enrich it as well?
• General purpose bridge between Twitter and issue tracking
databases
• Big industry problem
Github Sources
• Tweet Collector
• https://github.com/smartchicago/TweetCollector
• Classifier Code
• https://github.com/corynissen/foodborne_classifier

Open Data Summit Presentation by Joe Olson
