Why are we here?
Statistics:
The science of collecting, manipulating, and analyzing empirical
data.
•Statistics enables us to use environmental data to follow
the scientific method.
The scientific method: Example
Step one: Make an observation (cool, doesn't need stats)
Ex: When Mt. Pinatubo erupted in 1991, much less sunlight was available for
plants...
•20 Mt of sulfur dioxide ⟶ increased stratospheric sulfate aerosols
⟶ decreased sunlight reaching Earth's surface
The scientific method: Example
Step two: Ask a question (also doesn't need stats)
Ex: I wonder if these aerosols could decrease crop yields?
The scientific method: Example
Step three: Form a hypothesis / testable explanation (sounding somewhat
stats-like, but still no prerequisites here)
Ex: Less sunlight from Mt. Pinatubo's aerosols led to lower crop yields.
The scientific method: Example
Steps four/five: Analyze data & test the hypothesis (NEED STATISTICS!!)
Ex: Assemble spatio-temporal dataset, run multivariate linear regression,
test statistical significance of regression parameters
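A minimal Python sketch of this step (the course itself works in R). It is deliberately simpler than the analysis named above: one predictor instead of a multivariate regression, and a permutation test swapped in as the significance test. The data are synthetic and the names `sunlight`, `crop_yield`, `ols_slope`, and `permutation_p_value` are hypothetical, chosen only for illustration.

```python
import random

random.seed(5)  # fixed seed so the sketch is reproducible

# Synthetic data, invented for illustration: a sunlight index (x) and a
# crop-yield index (y) generated with a true slope of 0.5 plus noise.
sunlight = [random.uniform(0, 10) for _ in range(80)]
crop_yield = [2.0 + 0.5 * x + random.gauss(0, 1) for x in sunlight]

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def permutation_p_value(x, y, n_permutations=2000):
    """Share of shuffled datasets whose |slope| is at least the observed |slope|."""
    observed = abs(ols_slope(x, y))
    shuffled = y[:]  # copy, so the caller's data stay untouched
    extreme = 0
    for _ in range(n_permutations):
        random.shuffle(shuffled)  # breaks any real link between x and y
        if abs(ols_slope(x, shuffled)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)

slope = ols_slope(sunlight, crop_yield)
p_value = permutation_p_value(sunlight, crop_yield)
print(f"estimated slope: {slope:.3f}, permutation p-value: {p_value:.4f}")
```

A small p-value says that a slope this large would rarely appear if yield were unrelated to sunlight.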
The scientific method: Example
Step six: Report conclusions & what they mean for your question
Ex: Run simulation to consider the implications for solar radiation
management (a form of geo-engineering)
Statistics and the scientific method
•Statistics will enable us to test hypotheses, analyze data,
and draw conclusions about the world from the process
•Otherwise, we'd be stuck at observing and forming
hypotheses
•... and we'd have a lot of unanswered empirical questions!
This course
This course is:
•Designed to build your fundamental statistical toolkit
•Designed to teach you to apply statistics in R
•Designed to show you key univariate, multivariate and modeling
methods that come up frequently in environmental data science
•Designed to show you key spatio-temporal methods that come
up in environmental data science (if time permits)
•Still new! ⟶ We are actively adjusting based on student
feedback. Your input will help shape the curriculum!
Sample versus population
Consider a potential research question:
•What is the average mercury content in swordfish in the Atlantic
Ocean?
Some definitions
•Population: The entire group of individuals we want to learn about
•Ex: All swordfish in the Atlantic Ocean
•Census: A data collection including all individuals in the population
•Ex: Collect mercury data for every single swordfish in the Atlantic
Ocean (hard and 💰)
•Sample: A subset of the target population for which we actually
have data
• Ex: 60 tagged swordfish from a government survey in the
Atlantic Ocean
Parameters and statistics
Usually we are interested in a numerical summary of
the population (e.g., mean, slope, intercept, variance)
•Parameter: A numerical summary of the population
• Ex: average mercury content in swordfish in the Atlantic Ocean
•Statistic: A numerical summary of the sample
• Ex: average mercury content of the 60 swordfish collected in a
government survey in the Atlantic Ocean
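A minimal Python sketch of the distinction (the course itself works in R). The mercury values are simulated, not real measurements; the population size, the log-normal distribution, and the seed are assumptions made only for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: simulated mercury concentrations (ppm) for
# 10,000 swordfish. These are made-up values, not real measurements.
population = [random.lognormvariate(0.0, 0.3) for _ in range(10_000)]

# Parameter: a numerical summary of the whole population.
# In real studies we almost never get to compute this directly.
population_mean = statistics.mean(population)

# Statistic: the same summary computed from a sample of 60 fish.
sample = random.sample(population, k=60)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.3f}")
print(f"statistic (sample mean):     {sample_mean:.3f}")
```

The two numbers are close but not equal. Rerunning with a different seed changes the statistic, while the parameter stays fixed for a given population.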
Parameters and statistics
•We use statistics (from a sample) in hopes of learning about
parameters (from the population)
•This means that every time you do "statistics", you should be
thinking...
• What is the population of interest?
• What is my sample?
• How are they different?
Not all samples are created equal
From IMS: Suppose we want to estimate time to graduation for
Duke undergraduates in the last five years using a sample of recent
students.
•Q: Who is the population?
Suppose we take a random sample (i.e., every individual in the
population has the same probability of being selected)
Not all samples are created equal
Suppose we ask a nutrition major to pick a few of her friends for
the sample.
•What might go wrong here?
Asked to pick a sample of graduates, a
nutrition major might inadvertently pick
a disproportionate number of
graduates from health-related majors.
Four (random) sampling strategies
Nearly all statistical methods are based on assumptions of
randomness. If data are not collected randomly from the
population, estimates are likely to be biased.
Strategy 1: Simple random sampling
•As simple as it sounds!
•To consider for later classes: What problems might arise if your
(simple random) sample is small?
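One way to start exploring that question is a minimal Python sketch (the course itself works in R). The time-to-graduation values are invented, and `simple_random_sample` and `spread_of_sample_means` are hypothetical helper names.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: simulated time to graduation (years) for
# 5,000 students. Values are invented for illustration only.
population = [random.choice([3.5, 4.0, 4.0, 4.0, 4.5, 5.0]) for _ in range(5_000)]

def simple_random_sample(units, n):
    """Draw n units without replacement; every unit is equally likely."""
    return random.sample(units, k=n)

def spread_of_sample_means(units, n, repeats=500):
    """Standard deviation of the sample mean across repeated samples of size n."""
    means = [statistics.mean(simple_random_sample(units, n)) for _ in range(repeats)]
    return statistics.stdev(means)

small = spread_of_sample_means(population, n=5)
large = spread_of_sample_means(population, n=100)
print(f"spread of the sample mean with n=5:   {small:.3f}")
print(f"spread of the sample mean with n=100: {large:.3f}")
```

Estimates from the small samples bounce around much more from one draw to the next, even though both sets of samples are perfectly random.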
Four (random) sampling strategies
Strategy 2: Stratified sampling
•Analyzing the sample to estimate population parameters is more
complex (but still possible)
•Helpful when individuals within a stratum are quite similar
•Used often as a method to reduce "noise" in your data (we'll
discuss this later)
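A minimal Python sketch of stratified sampling (the course itself works in R). The student records are invented, majors stand in as the strata, and drawing the same number from every stratum is an assumption; `stratified_sample` is a hypothetical helper name.

```python
import random
from collections import defaultdict

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical population: (student_id, major) pairs. Majors are the strata.
majors = ["biology", "economics", "nutrition", "physics"]
population = [(i, random.choice(majors)) for i in range(2_000)]

def stratified_sample(units, stratum_of, n_per_stratum):
    """Group units into strata, then take a simple random sample in each one."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, k=n_per_stratum))
    return sample

sample = stratified_sample(population, stratum_of=lambda u: u[1], n_per_stratum=25)
print(len(sample), "students sampled,", len(majors), "strata")
```

Because every stratum contributes 25 students regardless of its size, a population-level estimate has to weight each stratum by its share of the population. That weighting is the extra analysis complexity.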
Four (random) sampling strategies
Strategy 3: Cluster random sampling
•Helpful when individuals within a cluster are quite different from
one another
•Used often when costs of data collection are high per cluster (e.g.,
Demographic and Health Surveys)
•Also more complex to estimate population parameters
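A minimal Python sketch of cluster sampling (the course itself works in R). The villages and households are invented stand-ins for clusters and individuals, and `cluster_sample` is a hypothetical helper name.

```python
import random

random.seed(11)  # fixed seed so the sketch is reproducible

# Hypothetical population: 40 villages (clusters) with 30 households each.
clusters = {
    f"village_{v}": [f"v{v}_household_{h}" for h in range(30)]
    for v in range(40)
}

def cluster_sample(clusters, n_clusters):
    """Randomly pick whole clusters, then keep every unit inside them."""
    chosen = random.sample(sorted(clusters), k=n_clusters)
    return {name: clusters[name] for name in chosen}

sample = cluster_sample(clusters, n_clusters=5)
for village, households in sample.items():
    print(village, len(households), "households")
```

Field teams only need to travel to 5 villages instead of 40, which is where the cost saving comes from.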
Four (random) sampling strategies
Strategy 4: Multi-stage sampling
•Very similar to cluster sampling, but instead of keeping every
individual in each selected cluster, take a random sample within each one
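A minimal Python sketch of a two-stage version (the course itself works in R). The villages and households are invented, and `multistage_sample` is a hypothetical helper name.

```python
import random

random.seed(13)  # fixed seed so the sketch is reproducible

# Hypothetical population: 40 villages (clusters) with 30 households each.
clusters = {
    f"village_{v}": [f"v{v}_household_{h}" for h in range(30)]
    for v in range(40)
}

def multistage_sample(clusters, n_clusters, n_per_cluster):
    """Stage 1: randomly pick clusters. Stage 2: randomly pick units in each."""
    chosen = random.sample(sorted(clusters), k=n_clusters)
    return {
        name: random.sample(clusters[name], k=n_per_cluster)
        for name in chosen
    }

sample = multistage_sample(clusters, n_clusters=5, n_per_cluster=6)
for village, households in sample.items():
    print(village, households)
```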
Study design
When we conduct statistical analyses, where do our samples come
from?
•Experimental studies
• Sample is collected to fit the study's needs
•Observational studies
• Sample already exists; design your study to make the best use of
available data
Study design
Q: Does sunscreen lower risk of skin cancer?
Study design
Q: Does sunscreen lower risk of skin cancer?
•Experiment
•Randomly assign 50% of participants to the sunscreen "treatment";
require the other 50% to wear no sunscreen
•Follow individuals for 20 years, compare cancer outcomes
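A minimal Python sketch of the random-assignment step (the course itself works in R). The participant IDs are invented, and `randomize` is a hypothetical helper name.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical study participants, identified by made-up IDs.
participants = [f"participant_{i}" for i in range(200)]

def randomize(units):
    """Shuffle a copy of the units and split it in half: treatment, control."""
    shuffled = units[:]
    random.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

treatment, control = randomize(participants)
print(len(treatment), "assigned to sunscreen;", len(control), "assigned to no sunscreen")
```

Because chance alone decides who lands in which group, the two groups should be similar on average in every other respect, including sun exposure.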
Study design
Q: Does sunscreen lower risk of skin cancer?
•Observational study
•Collect data on sunscreen use, skin cancer, and sun exposure
•Compare cancer rates for individuals with different sunscreen use habits
Types of variables
A variable is a representation of something we care about in a
population (e.g., nitrate concentration of groundwater).
Numerical variables
Object class numeric in R
•Can take on a wide range of possible values
•Makes sense to add, subtract, multiply, etc.
•Examples:
•Height of the tree canopy across the Amazon
•Length of Atlantic swordfish
•Daily average temperature
Types of variables
Numerical variables
Discrete numerical variables take on only a limited set of values,
often counts (e.g., population)
Continuous numerical variables can take on infinitely many values
within a range (e.g., arsenic concentration in groundwater)
Types of variables
Categorical variables
Object class factor in R
•Values correspond to one of a fixed number of categories
•Possible values are called levels
•Examples:
•Land use type
•Species of tree
•Age group (e.g., <15, 15-64, 65+) (watch out! continuous
numerical data can often be stored as a categorical variable!)
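A minimal Python sketch of how a numerical variable ends up stored as a categorical one (the course itself works in R, where this would produce a factor). The ages are invented, and `age_group` is a hypothetical helper name.

```python
def age_group(age):
    """Map a numerical age (years) to the categories <15, 15-64, 65+."""
    if age < 15:
        return "<15"
    if age < 65:
        return "15-64"
    return "65+"

# Made-up ages, for illustration only.
ages = [3, 14.9, 15, 42, 64, 65, 80]
groups = [age_group(a) for a in ages]
print(groups)  # ['<15', '<15', '15-64', '15-64', '15-64', '65+', '65+']
```

Once the ages are binned, arithmetic such as taking a mean no longer makes sense, and the exact ages cannot be recovered from the groups.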
Types of variables
Categorical variables
Nominal variables are unordered descriptions
Ordinal variables are categories with a natural ordering
Binary variables take on only two values, often coded 0 and 1
Things to do
1- Install R
https://www.r-project.org/
2- Install RStudio
https://posit.co/download/rstudio-desktop/
3- Access the course material on OLAT
https://olat.vcrp.de/url/RepositoryEntry/4703814428
Code: ETX3

intro to statistics and data analysis.pptx
