Designing and Scoping a Data Science Project
Data Science for Beginners, Session 1
About these Sessions
Session Format
Session:
• One topic
• Learn 4-6 concepts related to that topic
• Try apps or code related to that topic
Before each session:
• Install required tools (see the ‘tool installs’ instructions sheet)
• Do background reading
Session Topics
People
• Designing a data science project
• Communicating results
Tools
• Python basics
• Enterprise data tools
Getting Data
• Acquiring data
• Cleaning and exploring data
Special data types
• Handling text data
• Handling geospatial data
• Handling big data
Learning from data
• Predicting values from data
• Learning relationships from data
• Learning classes from data
Sessions Timeline
1. Scoping a data science project
2. Python basics
3. Acquiring data
4. Communicating results
5. Cleaning and exploring data
6. Predicting values from data
7. Handling text data
8. Handling geospatial data
9. Learning relationships from data
10. Enterprise data tools
11. Learning classes from data
12. Handling big data
Session 1: your 5-7 things
• What is data science?
• Data science is a process
• What’s a data scientist?
• Data science competitions
• Writing a problem statement
What is Data Science?
Defining Data Science
“A data scientist… excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge.”
“The analysis of data using the scientific method”
“A data scientist is an individual, organization or application that performs statistical analysis, data mining and retrieval processes on a large amount of data to identify trends, figures and other relevant information.”
Understanding through Data
Data Science is a Process
• Ask an interesting question
• Get the data
• Explore the data
• Model the data
• Communicate and visualize your results
Ask an interesting question
Write hypotheses that can be explored
ā— Do people have more phones than toilets?
ā— How is Ebola spreading?
ā— Is using wood fires sustainable in rural Tanzania?
ā— Can we feed 9 billion people?
Make them simple, actionable, incremental
Get the data
• Data files (CSV, Excel, JSON, XML...)
• Databases (SQLite, MySQL, Oracle, PostgreSQL...)
• APIs
• Report tables (tables on websites, in PDF reports...)
• Text (reports and other documents…)
• Maps and GIS data (OpenStreetMap, shapefiles, NASA earth images...)
• Images (satellite images, drone footage, pictures, videos…)
Most data is small, but…
Reformat the data
Explore the data
Model the Data
Communicate results
What’s a Data Scientist?
The Data Science Venn Diagram
How do you become a data scientist?
Learning and Practice
ā— Kaggle - online datascience competitions
ā— Driven Data - social good datascience competitions
ā— Innocentive - some datascience challenges
ā— CrowdAnalytix - business datascience competitions
Should you become a data scientist?
• Not necessarily. There are lots of data science students desperate for good problems to work on.
• You might want to become someone who can work with data scientists
• Which means learning how to specify data problems well
Problem examples:
Data Science Competitions
Who Does What
• Ask an interesting question
• Get the data
• Explore the data
• Model the data
• Communicate and visualize your results
Problem Owner
Competitor
?
DrivenData
Kaggle
DataKind
Example project: Pump It Up
Tanzania wells:
“Your goal is to predict the operating condition of a waterpoint for each record in the dataset”
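To make "predict ... for each record" concrete, here is a minimal baseline sketch: it predicts the most common status seen in the training records for every test record. The records, field names, and status labels are toy values invented for illustration and are not taken from the competition data; a real entry would replace the baseline with a model that uses the features.

```python
from collections import Counter

# Toy records, invented for illustration; the real dataset differs.
TRAIN = [
    {"id": 1, "pump_age_years": 2, "status": "functional"},
    {"id": 2, "pump_age_years": 15, "status": "non functional"},
    {"id": 3, "pump_age_years": 4, "status": "functional"},
]
TEST = [{"id": 4, "pump_age_years": 9}, {"id": 5, "pump_age_years": 1}]


def majority_label(records, target="status"):
    """Return the most frequent value of the target field."""
    return Counter(r[target] for r in records).most_common(1)[0][0]


def predict_all(train, test, target="status"):
    """Predict the majority training label for every test record."""
    label = majority_label(train, target)
    return {r["id"]: label for r in test}


if __name__ == "__main__":
    print(predict_all(TRAIN, TEST))
```

A baseline like this ignores the features entirely, which is the point: any model worth submitting should beat it.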
Example project: Cervical cancer
DrivenData competition guidelines
Impact: “… clear win for the organisation in terms of effective planning, resources saved or people served… good story around how they generate social impact…”
Challenge: “… challenging enough for a rich competition…”
Feasibility: “… the right kind of data to answer the question at hand… does it have enough signal to be useful?...”
Privacy: “… can answer this question while protecting the privacy of individuals in the dataset and the operational privacy of an organisation…”
Writing a Problem Statement
Design your project
Context: who needs this work, and what are they doing it for?
Needs: what are you trying to fix?
Vision: what do you expect your final result to look like?
Outcome: how do you get your results to the people who need them? What
happens next?
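One way to keep the four headings above in front of you while drafting is to treat them as a fill-in template. The sketch below is only an illustration: the class name, method name, and example text are hypothetical, and the check does nothing more than report which sections are still empty.

```python
from dataclasses import dataclass, fields


@dataclass
class ProblemStatement:
    """Four-part project design: context, needs, vision, outcome."""
    context: str = ""   # who needs this work, and what for?
    needs: str = ""     # what are you trying to fix?
    vision: str = ""    # what should the final result look like?
    outcome: str = ""   # how do results reach people, and what happens next?

    def missing_sections(self):
        """Return the names of sections that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


if __name__ == "__main__":
    # Hypothetical draft with two sections filled in.
    draft = ProblemStatement(
        context="District water engineers planning pump repairs",
        needs="Repairs are scheduled without knowing which pumps have failed",
    )
    print(draft.missing_sections())
```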
Design your questions
Is the question concrete enough?
Can you translate the question into an experiment?
Is it actionable?
What actions will be taken given the answer?
What data is needed to do the analysis?
Data Science Ethics
Data Risk and Ethics
You’re responsible for your data outputs
Could your outputs increase risk to anyone?
How will you respect privacy and security?
Data Risk
Risk: “The probability of something happening multiplied by the resulting cost or benefit if it does”
Risk of: physical, legal, reputational, privacy harm
Likelihood (e.g. low, medium, high)
Risk to: data subjects, collectors, processors, releasers, users
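The definition above is a simple product, so it can be worked through in a few lines. In this sketch the mapping from "low", "medium", "high" to probabilities, the cost units, and the example risks are all assumptions chosen for illustration; a real assessment would agree these values with the people affected.

```python
# Hypothetical mapping from likelihood labels to probabilities.
LIKELIHOOD = {"low": 0.1, "medium": 0.5, "high": 0.9}


def risk_score(likelihood, cost):
    """Risk = probability of the event multiplied by its cost."""
    return LIKELIHOOD[likelihood] * cost


def rank_risks(risks):
    """Sort risks from highest to lowest score."""
    return sorted(
        risks,
        key=lambda r: risk_score(r["likelihood"], r["cost"]),
        reverse=True,
    )


if __name__ == "__main__":
    # Invented example risks; cost is in arbitrary units.
    risks = [
        {"name": "reputational", "likelihood": "medium", "cost": 4},
        {"name": "privacy", "likelihood": "high", "cost": 8},
        {"name": "legal", "likelihood": "low", "cost": 50},
    ]
    for risk in rank_risks(risks):
        print(risk["name"], risk_score(risk["likelihood"], risk["cost"]))
```

Ranking by the product shows why a low-likelihood risk with a large cost can still outrank a more likely but cheaper one.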
PII: Personally Identifiable Information
“Personally identifiable information (PII) is any data that could potentially identify a specific individual. Any information that can be used to distinguish one person from another and can be used for de-anonymizing anonymous data can be considered PII.”
PII Red Flags
Names, addresses, phone numbers
Locations: lat/long, GIS traces, locality (e.g. home + work as an identifier)
Members of small populations
Untranslated text
Codes (e.g. “41”)
Slang terms
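A first pass over a dataset's column names can catch some of the red flags above before any data is shared. The sketch below is a rough heuristic under stated assumptions: the keyword lists and function name are hypothetical and far from exhaustive, and it cannot detect small populations, untranslated text, codes, or slang, which still need a human reviewer.

```python
import re

# Hypothetical keyword sets for a few red flags; not exhaustive.
RED_FLAGS = {
    "name": {"name", "surname"},
    "address": {"address", "street"},
    "phone": {"phone", "mobile"},
    "location": {"lat", "latitude", "lon", "long", "longitude", "gps"},
}


def flag_columns(columns):
    """Map each suspicious column name to the red flags it matches."""
    flagged = {}
    for column in columns:
        # Split names like "Full_Name" or "phone-number" into lowercase words.
        tokens = set(re.split(r"[^a-z0-9]+", column.lower()))
        hits = [flag for flag, words in RED_FLAGS.items() if tokens & words]
        if hits:
            flagged[column] = hits
    return flagged


if __name__ == "__main__":
    print(flag_columns(["Full_Name", "latitude", "population", "phone-number"]))
```

Matching whole words rather than substrings avoids flagging a column such as "population" just because it contains "lat".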
Exercises
3-minute exercise: Ask interesting questions
Either your own questions:
Questions that data might help with
Stories you want to tell with data
Datasets you’d like to explore
Or pick an existing question:
ā— Competition questions: Kaggle, DrivenData
ā— A data science project that interested you
3-minute exercise: Get the data
Pick one of your questions
List the ideal data you need to answer it
List the data that’s (probably) available
Think about what you’ll do if the data you need isn’t available
What compromises could you make?
Where would you look for more data?
Are there proxies (other datasets that tell you something about your question)?
3-minute exercise: Design your communications
List the types of people you’d want to show your results to
How do you want them to change the world? Can they take actions, change opinions, etc.?
Describe the types of outputs that might be persuasive to them - visuals, text,
numbers, stories, art… be as wild with this as you want
Things to do before next week
See file Tool Install Instructions
• Make friends with the terminal window
• Install IPython
• Install Git
