Data, data, data: Session III
I. What do we mean when we talk about data and where does data come from?
II. What is data science?
III. Where do you find the data you need?
IV. Science Module: Niche Modelling
What do we mean when we talk about data? Session III
We ask many, many questions about the world around us.
Answering those questions accurately requires a body of data and a set of tools for performing analyses that will lead us toward an answer.
“The goal is to transform data into information, and information into insight.” Carly Fiorina
 
Data are the raw facts about our world
 
 
Thomas Nylen & Andrew Fountain (PSU), NASA, NSF
A lot of this data is available for you to use
Where does the data come from?
Government: http://www.data.gov/, http://data.gov.uk/, http://www.census.gov/
Scientific research: http://www.ncbi.nlm.nih.gov/genbank/, http://www.gbif.org/, http://earthengine.googlelabs.com
Semi-automated and large scale collections: http://eospso.gsfc.nasa.gov/, http://www.airquality.co.uk/autoinfo.php, http://www.statistics.gov.uk/
For profit: http://www.flickr.com/, http://www.google.com/trends, http://www.facebook.com/data
Citizens: http://www.wikipedia.org/, http://protectedplanet.net/, http://www.openstreetmap.org/
Direct data
Photo by solarnu on Flickr
Indirect data
Source: Google Flu Trends (http://www.google.org/flutrends)
There has been an almost incomprehensible growth in digital data
 
5 Megabytes - a high resolution photo
5 Gigabytes - just more than a common DVD stores
1 Terabyte - size of a common home computer hard drive
15 Petabytes - data produced at CERN each year
5 Exabytes ~ every word spoken by humans
1.2 Zettabytes ~ the Digital Universe in 2010 (1,180,591,620,717,411,303,424 bytes, or 2^70)
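The byte count quoted for the zettabyte can be checked with a little arithmetic. This is a minimal sketch, assuming binary prefixes (each unit is 1024 times the previous one); the `UNITS` list and the `to_bytes` helper are illustrative names, not something from the slides.

```python
# Binary prefixes: each step up the list is a factor of 1024 (2**10).
UNITS = ["bytes", "kilobytes", "megabytes", "gigabytes",
         "terabytes", "petabytes", "exabytes", "zettabytes"]


def to_bytes(amount, unit):
    """Convert an amount in the given unit to bytes, using powers of 1024."""
    return amount * 1024 ** UNITS.index(unit)


one_zettabyte = to_bytes(1, "zettabytes")
print(f"{one_zettabyte:,}")        # the long figure quoted on the slide
print(one_zettabyte == 2 ** 70)    # the same number written as a power of two
print(one_zettabyte / 10 ** 21)    # the same amount in decimal zettabytes
```

The last line shows that 2^70 bytes is a little under 1.2 × 10^21 bytes, which is consistent with the rounded "1.2 Zettabytes" figure.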
Okay, so there is a lot of data!
Is all this data free to use?
No
What you can do with data is largely dictated by license and copyright
Because of this, many organizations have begun advocating and practicing 'open data'
What is open data?
Free and unrestricted access to data
http://www.youtube.com/watch?v=3YcZ3Zqk0a8
Kepler Data
'...survey a portion of our region of the Milky Way galaxy to discover dozens of Earth-size planets in or near the habitable zone and determine how many of the billions of stars in our galaxy have such planets...'
Launch a telescope into space for orbit around the sun
Collect data from thousands of stars
Spend 3 months trying to detect the presence of planets in orbit around these stars
RELEASE THE DATA FOR USE BY ANYONE!!!
Eric.Nielsen.Photos on Flickr
 
Different but related ideas
Open Government: data.gov, data.gov.uk
Open Access: plos.org
Open Source: Apache, Firefox
What is data science? Part II
Data analysis is a body of methods that help to describe facts, detect patterns, develop explanations, and test hypotheses. It is used in all of the sciences. It is used in business, in administration, and in policy. Levine, 1997,  Introduction to Data Analysis: The Rules of Evidence
“The goal is to transform data into information, and information into insight.” Carly Fiorina
Data science is a set of skills practiced often, but not exclusively, by scientists
The availability of data on the internet is making data analysis accessible to anyone
http://www.youtube.com/watch?v=PnpGIgzNBJo&feature=player_embedded
Part III Where do you find the data you need?
My community has a particular set of data that we rely on very often
 
 
 
 
During our class, to find the data you need...
Search First
Open government data
Large published datasets such as Wikipedia, Flickr, or public FusionTables
Natural sciences resources
Weather, atmosphere, and geographic datasets
See the growing list of datasets our class will uncover: https://github.com/andrewxhill/DMID/wiki/Datasets
Now that you have found it, how do you get it?
Maybe you can download it directly from the website. Pay attention to the formats you find; some will be easier for you to use than others.
Sometimes APIs (application programming interfaces) are available. These are probably for people with a bit more programming experience, but if you need the data, ask the instructors and we might be able to help.
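To give a feel for what using an API involves, here is a minimal sketch in Python. The endpoint `https://api.example.org/v1/occurrences`, the `country` and `limit` parameters, and the `results` key are all hypothetical, made up for illustration; every real API documents its own addresses and response layout.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.org/v1/occurrences"


def build_request_url(base_url, **params):
    """Attach query parameters to an API base URL."""
    return f"{base_url}?{urlencode(params)}"


def parse_response(body):
    """Decode a JSON response body and return the list stored under 'results'."""
    return json.loads(body)["results"]


url = build_request_url(BASE_URL, country="BOL", limit=2)
print(url)

# A made-up response body standing in for what a server might send back.
sample_body = (
    '{"results": ['
    '{"species": "A", "country": "BOL"}, '
    '{"species": "B", "country": "BOL"}]}'
)
records = parse_response(sample_body)
for record in records:
    print(record["species"], record["country"])
```

With a real service, the body would come from fetching the URL (for example with `urllib.request.urlopen`) rather than from a string written by hand.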
 
Scraping: Generally the hardest method, as it means programmatically pulling data from sources that were not necessarily designed to have data pulled from them.
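As a rough illustration of scraping, the sketch below pulls the cells out of an HTML table using only Python's standard library. The page content is an inline, made-up example rather than a real website, and `TableCellParser` is an illustrative name; real pages are usually messier, which is part of why scraping is hard.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:          # tolerate a cell that appears outside any row
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Made-up page written for people to read, not for programs.
SAMPLE_PAGE = """
<html><body>
<h1>Example</h1>
<table>
  <tr><td>Bolivia</td><td>2</td></tr>
  <tr><td>Guyana</td><td>1</td></tr>
</table>
</body></html>
"""

parser = TableCellParser()
parser.feed(SAMPLE_PAGE)
print(parser.rows)
```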
 
Linked data: This is an evolution of both scraping and APIs, where many web resources are now designed to be both human readable and programmatically navigable.
 
Remember that all of these data sources have different formats and potential sources of error. We will have a full session on data preparation, cleaning, and analysis.
Remember that data comes in many shapes and sizes; be aware of what you find.
General formats: png, jpg, xls, doc
Categorical and storage: csv, sql
Geographic: shp, tif, asc
XLS VS CSV
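One reason CSV is convenient outside of a spreadsheet program is that a few lines of code can read it. A minimal sketch, assuming a file with a header row; the column names `country` and `value` and the inline text standing in for a downloaded file are illustrative assumptions.

```python
import csv
import io

# Inline CSV text standing in for a downloaded file.
CSV_TEXT = "country,value\nAfghanistan,3\nBolivia,2\nGuyana,1\nPalau,4\n"


def read_rows(text):
    """Parse CSV text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(CSV_TEXT)
total = sum(int(row["value"]) for row in rows)   # CSV values arrive as strings
print(rows[0]["country"], total)
```

Reading an XLS file the same way would need an extra library, since the format is not plain text.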
How can we join datasets together?
By country name:
Afghanistan  3
Bolivia      2
Guyana       1
Palau        4
By country code:
GUY  1.03
BOL  1.34
PLW  0
AFG  19.03
By country name:
Afghanistan  3
Bolivia      2
Guyana       1
Palau        4
By country code:
GUY  1.03
BOL  1.34
PLW  0.20
AFG  1.93
Lookup (name to code):
Afghanistan  AFG
Bolivia      BOL
Guyana       GUY
Palau        PLW
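The join on this slide can be sketched in a few lines of Python: one dataset is keyed by country name, the other by country code, and a lookup table connects the two. This is only an illustration; the variable names and the `join_on_lookup` helper are made up, and it assumes Palau maps to PLW, its ISO code.

```python
# Dataset keyed by country name.
by_name = {"Afghanistan": 3, "Bolivia": 2, "Guyana": 1, "Palau": 4}

# Dataset keyed by three-letter country code.
by_code = {"GUY": 1.03, "BOL": 1.34, "PLW": 0.20, "AFG": 1.93}

# Lookup table connecting the two keys (assumes Palau -> PLW).
name_to_code = {
    "Afghanistan": "AFG",
    "Bolivia": "BOL",
    "Guyana": "GUY",
    "Palau": "PLW",
}


def join_on_lookup(by_name, by_code, name_to_code):
    """Return (name, code, value_by_name, value_by_code) for rows found in both datasets."""
    joined = []
    for name, first_value in by_name.items():
        code = name_to_code.get(name)
        if code is None or code not in by_code:
            continue  # skip rows that cannot be matched
        joined.append((name, code, first_value, by_code[code]))
    return joined


joined = join_on_lookup(by_name, by_code, name_to_code)
for row in joined:
    print(row)
```

Rows with no match are dropped here; in practice it is worth listing them too, since a name that fails to match often points to a spelling or coding error in one of the sources.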
If you start working with data, really interesting things can appear.
 
 
  GBIF.org
from Eric Fisher on Flickr



Editor's Notes

  • #5: Humans are naturally curious about our world. In addition, we have social, economic, and personal motivations to understand how and why the world around us changes.
  • #7: CEO, HP
  • #8: 1993: David Vaughan, British Antarctic Survey, predicted breaking in 30 years. 2008: he conceded that his estimates had been too conservative.
  • #12: Automated weather station; Lake Vida, Antarctica; 19 meters of ice; 2,500 years
  • #13: Automated weather station; Lake Vida, Antarctica; 19 meters of ice; 2,500 years
  • #22: GenBank
  • #35: Not everyone has the means to take this data and study it in a sophisticated analysis. But a lot of people are interested in space, astronomy, and our universe. So how could Kepler ensure that these people could help them and have fun?
  • #60: You will not likely encounter this anytime over the next couple of weeks, but it is best to be aware of it.
  • #64: XLS might be easier for you to navigate: open it in Excel, sort columns, search for what you want. But CSV will almost always be easier to use anyplace other than Excel: smaller, compact, and easily parsable.
  • #65: This is not the linked data.
  • #66: ISO: International Organization for Standardization