Pittsburgh Data Jam 2016
Bringing Big Data Education and Awareness to
Pittsburgh High School Students
February 26, 2016
Introductions
Saman Haqqi - President - Pittsburgh Dataworks
 saman.haqqi@pghdataworks.org
Brian Macdonald – Data Scientist – Oracle Corporation
 brian.macdonald@oracle.com
Pitt Science Outreach
 Margaret Farrell mef85@pitt.edu
 Laura Marshall LJM82@pitt.edu
 Jenny Lundahl jal225@pitt.edu
 Jackie Choffo jac335@pitt.edu
 Kyle Wiche KAW196@pitt.edu
 Chris Davis CJD81@pitt.edu
Mentors
 Each team will be assigned a mentor
Can ask questions via email at any time
 Copy everyone on your team
 Copy your teacher
Pitt Science Outreach students
 Send email to all
Have a regular scheduled call with your mentor
 Don’t wait until right before presentations.
Data Analysis Workshop
Today’s Goals
Identifying relevant variables
Depicting them graphically
Doing the analysis
Drawing conclusions
Making recommendations
What technology will you use?
Lots of tools are available
Keep it simple at the beginning
Use Excel
Tableau is also available
Many Others
 R, SAS, Cognos, Oracle Business Intelligence, Google Apps,
Matlab, Python, Spotfire, QlikView
Data Analysis Process
A standard repeatable process to guide data analysis.
Used formally and informally
 If you do analysis, you will do these steps.
Used for Big Data or not so Big Data
Becomes second nature as you do more analysis.
Is not about using a cool data analysis tool
 Although they are extremely helpful.
The Data Analysis Process
Define your Problem
Identify Data
Plan your Analysis
 Explore Data
 Prepare Data
 Model Data
Tell A Story
Make Recommendations
Determine What’s Next
Today’s Focus
In practice it looks like this
https://cyberitgs.wikispaces.com/Sandbox+Yerlan
Basic Steps for Analysis
Data Exploration
Data Preparation
Build Models
Data Exploration
Exploratory Data Analysis (EDA)
 Goal is to get an understanding of what data you have
What are your variables?
Basic Statistics
Graph Data
Look for missing values
Look for outliers
Will this data help you answer your question?
Basic Statistics
Goal is to get a basic understanding of your data
 Mean (Average)
• Sum of values/Count of values
 Median
• Midpoint of the sorted values
 Maximum, Minimum (Range)
 Standard Deviation (σ) & Variance (σ^2)
• How spread out the values are compared to the mean
 Quartiles
• Cut points that split the sorted data into four equal-sized groups
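The statistics listed above can also be computed outside Excel. The following is a minimal sketch using Python's standard `statistics` module; the `temps` values are made up for illustration and are not workshop data.

```python
import statistics

# Hypothetical sample: daily high temperatures in degrees F (not workshop data).
temps = [61, 64, 58, 70, 75, 64, 90]

mean = statistics.mean(temps)                  # sum of values / count of values
median = statistics.median(temps)              # midpoint of the sorted values
value_range = max(temps) - min(temps)          # maximum minus minimum
stdev = statistics.pstdev(temps)               # population standard deviation
variance = statistics.pvariance(temps)         # population variance (stdev squared)
q1, q2, q3 = statistics.quantiles(temps, n=4)  # the three quartile cut points

print("mean:", round(mean, 2), "median:", median, "range:", value_range)
print("stdev:", round(stdev, 2), "variance:", round(variance, 2))
print("quartiles:", q1, q2, q3)
```

Quartile conventions differ between tools, so the cut points printed here may not match a spreadsheet's exactly on small samples.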
Demo - Statistics in Excel
Graphing Data
Helps visualize patterns in the data
Especially with large data sets.
 https://www.mapbox.com/labs/twitter-gnip/locals/#12/40.4620/-80.0151
Spot exceptions
Use the best graph for the data types
Help tell your story
Demo - Graphing in Excel
Missing Values
Can have large impact on basic statistics
Count # of missing values of every variable (column)
Important to understand why data is missing
 Data entry
 Wasn’t collected
 Isn’t relevant
Should you use the variable?
Should you fill in missing values?
 Use mean, median, max, min, 0.
 You need to determine best method
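As a rough illustration of counting and filling missing values, here is a small Python sketch. The `ages` column and the `fill_missing` helper are hypothetical names invented for this example; which fill value is appropriate depends on why the data is missing.

```python
import statistics

# Hypothetical column with missing entries recorded as None.
ages = [34, None, 29, 41, None, 38]

# Count the missing values in this variable (column).
missing_count = sum(1 for v in ages if v is None)
present = [v for v in ages if v is not None]

def fill_missing(values, fill_value):
    """Return a copy of values with each None replaced by fill_value."""
    return [fill_value if v is None else v for v in values]

filled_mean = fill_missing(ages, statistics.mean(present))
filled_median = fill_missing(ages, statistics.median(present))
filled_zero = fill_missing(ages, 0)

print("missing:", missing_count)
print("mean after mean fill:", statistics.mean(filled_mean))
print("mean after zero fill:", round(statistics.mean(filled_zero), 2))
```

Filling with 0 pulls the average down sharply here, which is one way a fill choice can distort basic statistics.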
Outliers
Outliers are values at the extreme
Much larger or smaller than most of your data
May have many causes
 Data Entry Error
 Instrument Malfunction
 Real Exceptional data
Is 140º F an outlier?
Some are easy to spot within a single variable
Some are only found with multiple variables
Outliers
Need to decide how to treat Outliers
 Is the variable ok to use? Do you question the validity of the
data?
 Remove them from your data set?
 Keep them as is?
 Change the value (e.g. make it less extreme)
 Infer the real meaning
• A -90º F temperature in Miami is likely 90º F
Make sure you understand implications
Document your decision making
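One common rule of thumb for flagging outliers within a single variable is the 1.5 x IQR fence. The slides do not prescribe this rule; it is an assumption made for this sketch, and the `readings` values are invented.

```python
import statistics

# Hypothetical temperature readings in degrees F; 140 and -90 look suspicious.
readings = [88, 91, 85, 90, 140, 87, -90, 89, 92, 86]

# Interquartile range (IQR) = distance between the first and third quartiles.
q1, _, q3 = statistics.quantiles(readings, n=4)
iqr = q3 - q1

# Assumed rule of thumb: anything beyond 1.5 * IQR from the quartiles is flagged.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in readings if v < low or v > high]
kept = [v for v in readings if low <= v <= high]

print("fences:", low, high)
print("flagged:", outliers)
print("mean with outliers:", statistics.mean(readings))
print("mean without outliers:", statistics.mean(kept))
```

Flagging is only the first step; whether to remove, keep, or correct each flagged value is still a decision to make and document.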
Demo – Missing Values &
Outlier Detection in Excel
One Last Thought on Exploring Data
You must be observant
Count the Number of F’s in the following sentence.
 You will have 15 Seconds
FINISHED FILES ARE THE RE-
SULT OF YEARS OF SCIENTIF-
IC STUDY COMBINED WITH
THE EXPERIENCE OF YEARS.
Leave your assumptions at the door!
FINISHED FILES ARE THE RE-
SULT OF YEARS OF SCIENTIF-
IC STUDY COMBINED WITH
THE EXPERIENCE OF YEARS.
Exploration Exercise
Using Excel
Sort
Filter
Summarize
Create Crosstabs
Charting
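The same exploration steps (sort, filter, summarize, crosstab) can be sketched in plain Python. The records and column names below are made up for illustration only.

```python
from collections import Counter

# Hypothetical records; the column names are assumptions for this example.
rows = [
    {"school": "North", "grade": 9, "score": 82},
    {"school": "South", "grade": 10, "score": 91},
    {"school": "North", "grade": 10, "score": 75},
    {"school": "South", "grade": 9, "score": 88},
    {"school": "North", "grade": 9, "score": 95},
]

# Sort: highest score first.
by_score = sorted(rows, key=lambda r: r["score"], reverse=True)

# Filter: keep only one school.
north = [r for r in rows if r["school"] == "North"]

# Summarize: average score per school.
totals, counts = Counter(), Counter()
for r in rows:
    totals[r["school"]] += r["score"]
    counts[r["school"]] += 1
averages = {school: totals[school] / counts[school] for school in totals}

# Crosstab: number of rows for each (school, grade) combination.
crosstab = Counter((r["school"], r["grade"]) for r in rows)

print(by_score[0])
print(len(north), averages)
print(dict(crosstab))
```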
Basic Steps for Analysis
Data Exploration
Data Preparation
Build Models
Data Preparation
 This step will fix any issues you found during data exploration
 Fix missing values
 Remove bad data
 Create new variables
 Add/Subtract/Multiply/Divide multiple variables
 Ratios
 Binning
 Other functions like Square Root or Exponents
Anything else you feel appropriate
 Have fun and experiment. You cannot hurt the data.
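A short sketch of creating new variables (a ratio, a bin, and a square root) in Python. The `houses` records, the field names, and the bin thresholds are all hypothetical choices for illustration.

```python
import math

# Hypothetical house records; field names are assumptions for this example.
houses = [
    {"sale_price": 150000, "living_area": 1200},
    {"sale_price": 320000, "living_area": 2400},
    {"sale_price": 95000, "living_area": 800},
]

def price_bin(price):
    """Binning: map a numeric price to a coarse label (thresholds are arbitrary)."""
    if price < 100000:
        return "low"
    if price < 250000:
        return "medium"
    return "high"

for h in houses:
    h["price_per_area"] = h["sale_price"] / h["living_area"]  # ratio of two variables
    h["price_bin"] = price_bin(h["sale_price"])               # binned version of price
    h["sqrt_area"] = math.sqrt(h["living_area"])              # another function of a variable

for h in houses:
    print(h)
```

The original columns are left untouched, so experimenting with new variables does not damage the source data.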
Demo – Data Preparation
Preparation Exercise
Using Excel
Merge data
New Calculations
Fix Missing Data
Fix Outliers
Basic Steps for Analysis
Data Exploration
Data Preparation
Build Models
Explaining Insights
How do you know what you
see is valid?
And not due to chance?
Correlation
http://musicthatmakesyoudumb.virgil.gr/
Correlation
The degree to which two or more attributes or measurements on the
same group of elements show a tendency to vary together
Positive when values increase together
Negative when one value decreases as the other increases
http://www.mathsisfun.com/data/correlation.html
What can you tell me about this graph?
[Chart: scatter plot of Ice Cream Consumption/Capita (y-axis) against Drownings (x-axis), with a linear trend line]
Does Ice Cream Consumption Cause
Drowning?
Obviously not
Correlation does not imply Causation
 One may cause the other, but correlation only describes how
they vary together.
 There may be other reasons, e.g. hot temperatures
Be very cautious with Causation
 There are tests to determine causation
How do I know if variables are correlated
R = Correlation Coefficient
 Values between -1 & 1
 Positive Correlation > 0 - As one variable increases, the other
increases
 Perfect Correlation = 1
 Negative Correlation < 0 - As one variable increases, the other
decreases
 Perfect Negative Correlation = -1
 0 = No correlation
 Can be shown with a trend line
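To make the correlation coefficient concrete, here is a minimal Python sketch of the Pearson formula. The function name `pearson_r` and the sample lists are invented for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient R for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how each varies on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)

# Hypothetical data chosen to show the three cases.
x = [1, 2, 3, 4, 5]
rises = [2, 4, 6, 8, 10]    # increases with x
falls = [10, 8, 6, 4, 2]    # decreases as x increases
mixed = [3, 1, 4, 1, 3]     # no linear trend

print(round(pearson_r(x, rises), 3))
print(round(pearson_r(x, falls), 3))
print(round(pearson_r(x, mixed), 3))
```

In Excel, the `CORREL` function computes the same coefficient for two ranges.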
Understanding R and R^2
How do I know if variables are correlated
R^2 = Coefficient of Determination
 Tells how well one variable predicts the other variable
 Values between 0 & 1
 If R^2 = 0.850, 85% of the total variation in y can be explained
by the linear relationship between x and y
 R^2 is more commonly used
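The "variation explained" reading of the coefficient of determination can be sketched as follows: fit a least-squares trend line, then compare the leftover variation to the total variation in y. The helper names and sample data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys):
    """Share of the total variation in y explained by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    unexplained = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - unexplained / total

# Hypothetical, nearly linear data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print(round(slope, 3), round(intercept, 3))
print(round(r_squared(xs, ys), 4))
```

For a simple linear fit like this one, the result equals the square of the correlation coefficient R.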
Some Terminology
Independent Variable
 These are the variables that you modify
 In trend equation they are the X values
Dependent Variable
 These values depend on the values of the Independent
variables.
 In trend equation they are the Y values
y = 0.0045x + 691.18
y is Living Area
x is Sale Price
Slope = 0.0045, Intercept = 691.18
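Once a trend equation is known, predicting a value is just plugging in x. This sketch uses the slope and intercept from the equation above; the function name and the example sale price are hypothetical.

```python
# Slope and intercept taken from the trend equation y = 0.0045x + 691.18.
SLOPE = 0.0045
INTERCEPT = 691.18

def predict_living_area(sale_price):
    """Predict y (Living Area) from x (Sale Price) with the trend equation."""
    return SLOPE * sale_price + INTERCEPT

# Hypothetical sale price used only to show the calculation.
example_price = 200000
print(round(predict_living_area(example_price), 2))
```

Predictions are most trustworthy for x values inside the range of the data used to fit the line.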
Demo – Modeling Data
Modeling Exercise
Using Excel
Create scatter plot
Show Coefficient of determination
Create a formula to predict a value
What did the Data Tell You
Did it support your initial question?
 What conclusions can you make?
 Make sure they are fact based
 Check your bias
What is your story?
 Is it compelling?
• Does x influence y?
 Can it support actions to be taken?
 If not, is there still some benefit?
What did the Data Tell You
What recommendations will you make?
 Will you stand behind them?
 If not, why not?
 Can they really be implemented?
 What is the value of implementing the recommendation?
What new questions would you ask?
 To clarify your analysis?
 To expand on your analysis?
 Can better questions be asked?
And the most important Item
Have
Questions?
Always ask questions!!!!
Timing
Introductions – 10 Minutes
Overview/Data exploration Lecture – 35 Minutes
Exploration Hands-on – 30 Minutes
Data Prep Lecture – 20 Minutes
Data Prep Hands-on – 25 Minutes
Data Modeling Lecture – 20 Minutes
Data Modeling Hands-on – 30 Minutes
Questions/Wrap Up – 10 Minutes
Total 3:00

2016 Pittsburgh Data Jam Student Workshop
