Daniel Burseth 
Co-president MIT Big Data Explorers 
dburseth@mit.edu 
@dmbnyc 
Github: dburseth
 Acronyms abound 
 Tremendous complexity 
 Use building blocks not code
 This is easy 
EPPM of 10 requires 500 professionals
MIT Big Data Explorers - presentation by Daniel Burseth
 http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?emc=eta1&_r=0 
Data preparation and cleansing: 
• Missing 
• Duplicative 
• Conventions (dates, time, 
geographies) 
• Spacing 
• Can we measure data 
cleanliness? 
• What’s our Pareto point?
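One possible way to put a number on "data cleanliness" is to count the share of rows affected by each problem listed above. The sketch below is an illustration only: the function name `cleanliness_report`, the three ratios, and the tiny tab-separated sample are assumptions, not part of the original talk.

```python
import csv
import io


def cleanliness_report(rows):
    """Return the fraction of rows that are missing data, duplicated, or badly spaced."""
    total = len(rows)
    if total == 0:
        return {"missing": 0.0, "duplicate": 0.0, "spacing": 0.0}

    # "Missing": at least one field is empty after trimming whitespace.
    missing = sum(
        1 for r in rows if any((v or "").strip() == "" for v in r.values())
    )

    # "Duplicate": an identical row was already seen earlier in the file.
    seen = set()
    duplicate = 0
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            duplicate += 1
        seen.add(key)

    # "Spacing": leading, trailing, or repeated whitespace inside a field.
    def bad_spacing(value):
        return value != " ".join(value.split())

    spacing = sum(
        1 for r in rows if any(bad_spacing(v or "") for v in r.values())
    )

    return {
        "missing": missing / total,
        "duplicate": duplicate / total,
        "spacing": spacing / total,
    }


# Hypothetical sample: three listings, one duplicated, one with a blank price
# and a trailing space in the title.
SAMPLE = (
    "title\tprice\tcity\n"
    "2BR near MIT\t$2400\tCambridge\n"
    "2BR near MIT\t$2400\tCambridge\n"
    "Studio \t\tBoston\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
report = cleanliness_report(rows)
```

Tracking these ratios before and after each cleaning pass is one rough way to see where extra effort stops paying off.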
 AWS -> EC2 
 Launch instance: ami-c6b61fae (US-EAST) 
 Instance type m3.medium 
 Connect 
 You should see some software on the desktop
 Scrape all of Craigslist’s Boston apartment listings using WebHarvy 
 Examine, clean, and prepare the data set using OpenRefine 
 Map our data and apply filters using Tableau 
……all without writing a single line of code.
 A hyper-intelligent utility to scrape website 
data. 
 SysNucleus, makers of USBTrace 
 Heavy duty alternatives: Scrapy (scrapy.org), 
Beautiful Soup
HTTP://SHOUTKEY.COM/WIRE 
1. Start Config 
2. Click on Hungry Mother – 
capture text 
3. Click on Hungry Mother – 
capture URL 
4. Click on Kendall 
Square/MIT – capture text 
5. Click on last review – 
capture text 
CLEAR 
1. Mine -> Scrape a list of 
similar links 
2. Click on Hungry Mother
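For readers who do want to see what "capture text" and "capture URL" amount to in code, here is a minimal sketch using only Python's standard-library `html.parser`. It is not how WebHarvy works internally, and it is not Scrapy or Beautiful Soup; the HTML snippet, the `biz-name` class, and the `LinkCapture` name are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCapture(HTMLParser):
    """Collect (text, url) pairs for every <a> tag carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []
        self._capturing = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._capturing = True          # "capture URL"
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._capturing:                 # "capture text"
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._capturing:
            self.links.append(("".join(self._text).strip(), self._href))
            self._capturing = False


# Hypothetical page fragment; real listing pages have different markup.
PAGE = """
<ul>
  <li><a class="biz-name" href="/biz/hungry-mother">Hungry Mother</a></li>
  <li><a class="biz-name" href="/biz/example-cafe">Example Cafe</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""

parser = LinkCapture("biz-name")
parser.feed(PAGE)
links = parser.links
```

Selecting by a shared class is a rough stand-in for "Scrape a list of similar links": one example link identifies the pattern, and every link matching it is collected.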
 Let’s start collecting 
information in the first sub-page.
 Edit Clear 
 Navigate into a sub-page 
 Start Config 
 Set as Next Page Link
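"Set as Next Page Link" tells the scraper how to move from one results page to the following one. A rough code equivalent, under the assumption that each page exposes its items and an optional next-page URL, might look like the sketch below. The `fetch` callable, the `FAKE_SITE` dictionary, and the page layout are hypothetical stand-ins so the example runs without network access.

```python
def scrape_all_pages(fetch, first_url, max_pages=100):
    """Follow 'next page' links from first_url, collecting items from each page."""
    items = []
    visited = set()
    url = first_url
    # Stop when there is no next link, when a link loops back, or at the page cap.
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page = fetch(url)
        items.extend(page["items"])
        url = page.get("next")
    return items


# In-memory stand-in for a paginated site.
FAKE_SITE = {
    "/page1": {"items": ["listing A", "listing B"], "next": "/page2"},
    "/page2": {"items": ["listing C"], "next": "/page3"},
    "/page3": {"items": ["listing D"], "next": None},
}

all_items = scrape_all_pages(FAKE_SITE.__getitem__, "/page1")
```

The visited set and the page cap guard against sites whose last page links back to an earlier one.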
 Scheduler 
 Input keywords 
 Pause Inject (word of caution: scraping often violates TOS. Potentially not viable 
for apps, commercial purposes!) 
 TRY VISITING CRAIGSLIST IN AWS BTW!! 
 Proxy 
 Database export
 Download Craigslist Boston from http://shoutkey.com/glorify 
 Look at our data: open Boston Dirty.csv (20k rows of mess!) 
 Time to CLEAN: Launch GOOGLE-REFINE.EXE 
 Within MOZILLA, navigate to http://127.0.0.1:3333/ 
 Create Project -> This Computer -> Browse 
 Parse by tab 
 Create Project
1. First, sort your column. 
2. Then, invoke "Re-order rows permanently" in the "Sort" dropdown menu that appears on top of 
the middle of the data table. 
3. Then invoke Edit cells and Blank down on the Title column. 
4. Then on that column, invoke menu Facet > Custom facets and Facet by blank. 
5. Select true in that facet, and invoke Remove matching rows in the left most "all" dropdown 
menu. 
6. Remove the facet.
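The six steps above remove rows whose Title repeats a neighbouring row once the data is sorted. The same idea can be sketched in a few lines of Python; the function name and the sample rows are illustrative assumptions, and the sketch keeps the first row of each duplicate group, as "Blank down" does.

```python
def dedupe_by_column(rows, column):
    """Sort rows by a column and keep only the first row for each distinct value."""
    # Steps 1-2: sort, and treat the sorted order as the permanent order.
    ordered = sorted(rows, key=lambda r: r[column])

    kept = []
    previous = object()  # sentinel that equals no real cell value
    for row in ordered:
        value = row[column]
        # Steps 3-5: a value equal to the one above it would be "blanked down"
        # and then removed by the blank facet.
        if value == previous:
            continue
        kept.append(row)
        previous = value
    return kept


# Hypothetical listings: the same title posted twice with different prices.
rows = [
    {"Title": "Sunny 2BR", "Price": "2400"},
    {"Title": "Cozy studio", "Price": "1500"},
    {"Title": "Sunny 2BR", "Price": "2450"},
]
deduped = dedupe_by_column(rows, "Title")
```

Because Python's sort is stable, the surviving row in each group is the one that appeared first in the original file.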
 Then run the “To Number” transform again
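The slides leading up to this step were screenshots and did not survive in the text, so the following is only a guess at the kind of cleanup involved: stripping currency symbols and thousands separators so that a price cell can be converted to a number. The function name and the sample cells are assumptions.

```python
import re


def to_number(cell):
    """Convert a price-like string to a float, or return None if it has no usable number."""
    # Drop everything except digits and the decimal point ("$2,400" -> "2400").
    cleaned = re.sub(r"[^0-9.]", "", cell)
    if cleaned == "":
        return None
    try:
        return float(cleaned)
    except ValueError:
        # Leftovers such as "1.2.3" are not valid numbers.
        return None


prices = [to_number(c) for c in ["$2,400", " $1,950 ", "call", "1.2.3"]]
```

Cells that come back as `None` are the ones worth inspecting by hand before running the conversion again.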
 Increment the radius to 7 
and make judgment calls 
along the way. 
 Change the Distance 
Function and do the same 
thing
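To make "radius" and "distance function" concrete, here is a simplified sketch that groups spellings whose edit distance to a cluster's first member is within a given radius. It assumes Levenshtein distance as the distance function and uses a greedy grouping, which is much cruder than Refine's own nearest-neighbour clustering; the place names are made-up examples.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cluster(values, radius):
    """Greedily group values within `radius` edits of a cluster's first member."""
    clusters = []
    for value in values:
        for members in clusters:
            if levenshtein(value.lower(), members[0].lower()) <= radius:
                members.append(value)
                break
        else:
            clusters.append([value])
    return clusters


names = ["Cambridge", "Cambrige", "cambridge", "Somerville", "Sommerville", "Boston"]
clusters = cluster(names, radius=2)
```

A larger radius merges more spellings but also raises the odds of merging genuinely different places, which is why each proposed merge still needs a judgment call.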
 Looks like we have SOME really expensive real 
estate. Data errors????
 Boston Clean.csv
 Load Boston clean.csv 
 “Go to Worksheet”
 Great “semantic” example. Tableau understands that this text translates to a 
lat/long
 Look on the map in the lower right corner 
 Let’s “Filter Data”
 Under “Measures”, drag “Price” onto size in “Marks” 
 Change sum(Price) to avg(Price) 
 Drag Price into Filters, change it to max(Price), and select “At Most” 
 Right click on the filter and show “Quick Filter” 
 Drag “City” onto “Label” 
 Menu Map -> Map Options 
 Click on a node for info and drill down potential
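The Tableau steps above boil down to two aggregations per city: an average price that drives the mark size, and a maximum price that the "At Most" filter tests. A rough Python equivalent follows; the function name, the threshold, and the sample listings are assumptions, and reading the filter as "drop any city whose maximum price exceeds the threshold" is one interpretation of how the aggregate filter behaves.

```python
from collections import defaultdict


def average_price_by_city(rows, at_most):
    """Average price per city, keeping only cities whose maximum price is <= at_most."""
    prices = defaultdict(list)
    for city, price in rows:
        prices[city].append(price)

    return {
        city: sum(values) / len(values)   # avg(Price), used for mark size
        for city, values in prices.items()
        if max(values) <= at_most         # max(Price) "At Most" filter
    }


# Hypothetical listings; the 250000 entry mimics a data-entry error.
listings = [
    ("Cambridge", 2400.0),
    ("Cambridge", 2600.0),
    ("Boston", 3000.0),
    ("Boston", 250000.0),
]
averages = average_price_by_city(listings, at_most=10000.0)
```

With this reading, a single errant price hides the whole city, which is a quick way to spot where the remaining data errors live.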
1. Explored various webpage structures and scraped them 
2. Exported the data to Refine 
3. Parsed columns to extract critical price and location information 
4. Used clustering algorithms to merge related geographies 
5. Applied filters to identify errant prices 
6. Exported the data to Tableau 
7. Completed a real cursory mapping visualization
 Please come talk to me



Editor's Notes

  • #6: http://datacleaner.org/ Certain algorithms. This aspect has certainly lagged technology.