STATA Software Training
July 2022
AHRI
MNTD DEPARTMENT
EMAGEN PROJECT
Outline
Introduction to STATA
STATA Environment
Entering Data
Data import
Exploring Data
Descriptive Analysis
Data Management
What is STATA?
❖ It is a multi-purpose statistical package to help you
explore, summarize and analyze datasets.
A complete, integrated statistical package that provides data
analysis, data management, and graphics.
Easy to use but very powerful data analysis software
package
Most operations can be accomplished either via the
pull-down menus or directly via typed commands.
The official website is https://www.stata.com
STATA: ADVANTAGES
• Command syntax is very compact, saving time
• Syntax is consistent across commands, so easier to learn
• Competitive with other software regarding variety of statistical
tools
• Excellent documentation
• Exceptionally strong support for
• Econometric models and methods
• Complex survey data analysis tools
• But limited to one dataset in memory at a time (Stata 16 and later can hold additional datasets in frames)
• Otherwise, you must open another instance of Stata to open another dataset
The Stata Work Environment
Stata’s work environment includes windows,
menus and Stata-specific task bar.
Windows: The Stata windows give you all the
key information about the data file you are
using, recent commands, and the results
of those commands.
1. Result Window
Opens automatically
Shows recent commands, output, error messages and help
Keeps only about 300 – 600 lines of the most recent output.
Use the log command to store all outputs in a file.
Text is color coded.
2. Command Window
It opens automatically
Used to enter a command
3. Variable Window
It opens automatically
Keeps a list of current variables
Left-click to use a variable from the list
4. Review Window
Keeps a list of all commands typed in a Stata session
Right-click the window to export everything into a ‘do’ file so that
you can run the exact same commands later.
Left-click to reuse the command
5. Data Editor
A spreadsheet used to edit and enter data
It needs to be opened
6. Do-File Editor
A do-file is a text (also called batch) file with a series of
commands to be executed in order by Stata.
It is used to write or edit a Stata program
Also great for composing, revising, and saving Stata
commands.
Stata reads and executes whatever commands it
contains.
To use a do-file:
– Click on Do-File Editor.
– Enter commands.
– Save file with .do extension.
To execute a do-file:
– Via command: do pathoffile/filename.do
– Via drop-down menu: File → Do …
7. Viewer
Used to get help on command syntax with complete
list of options and examples
It opens automatically
When we open Stata for the first time, it looks like the
following.
Menus
Stata displays 8 drop-down menus across the top of the
outer window, from left to right:
File
➢ Open: open a Stata data file (use)
➢ Save/Save as: save the Stata data in memory to disk
➢ Do: execute a do-file
➢ Filename: copy a filename to the command line
➢ Print : print log or graph
Edit
➢ Copy/Paste: copy text among the Command, Results,
and Log windows
➢ Copy Table : copy table from Results window to
another file
➢ Table copy options: what to do with table lines in
Copy Table
Data: for Stata data management
Graphics: for various kinds of statistical graphs
Statistics: build and run Stata commands from menus
User: menus for user-supplied Stata commands
(download from Internet)
Window: bring a Stata window to the front
Help: Stata command syntax and keyword searches
Stata Tool Bar
• The buttons on the button bar are from left to right (equivalent
command is in bold):
• Open a Stata data file: use
• Save the Stata data in memory to disk: save
• Print a log or graph
• Open a log, or suspend/close an open log: log
• Open a new viewer
Check STATA is Up-to date
• If an update is needed, either connect your PC to the internet
and let Stata update itself:
• Use Help → Check for Updates
• Or
• Use the files on the CD/DVD or flash drive that are under the STATA
resources.
STATA FILES:
*.dta: the “input file”. This is the Stata data file.
*.do: the program file that acts upon the “input file”.
This is a text file containing a list of Stata
commands. Save your program as a text file with the
“.do” extension.
Then, to run the program, type do filename at the
Stata prompt.
*.gph: the file extension used to save Stata graphs.
*.log: the “output” file. This file echoes
whatever appears on screen in ASCII text
format.
You ask Stata to echo a session by typing “log
using filename.log” (or “log using filename, text”).
If you give neither the .log extension nor the text
option, Stata writes a formatted filename.smcl file instead.
When you are done, type “log close”.
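As a minimal sketch of the logging workflow described above (the file name session1.log is a hypothetical choice, and the auto dataset is the example data shipped with Stata), a session could be echoed to a plain-text log like this:

```stata
* A sketch only: session1.log is a hypothetical file name
capture log close                         // close any log that is already open
log using "session1.log", text replace    // plain-text log in the working directory
sysuse auto, clear                        // example dataset shipped with Stata
summarize mpg
log close                                 // stop logging
```

The text option asks for a plain-text log, and replace overwrites an existing file of the same name.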
Variable Naming Conventions
1. Variable names can be between 1 & 32
characters
2. Variable names start with a letter or an
underscore (cannot begin with a number).
3. Variable names are case sensitive
E.g., income and INCOME are different.
4. Variable names must contain no spaces.
5. Command name must be in lower case.
6. With large data sets in Stata 11 or earlier, it may be
necessary to increase the memory limit from the
default of 1 megabyte (note that there must be
no data in memory at this stage). Stata 12 and later
manage memory automatically and ignore this setting.
Example: set memory 100m (100 megabytes)
Stata Data Types
Each element of data is said to be either type string or
numeric. Strings are stored as str#, for instance,
str1, str2, str3, ..., str244.
Numbers are stored as byte, int, long, float, or
double, with the default being float. byte, int,
and long are said to be of "integer" type.
Display format: format varlist %fmt. Example: %9.2f,
%8.0g, %-18s.
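To make storage types and display formats concrete, here is a small sketch using the auto dataset shipped with Stata; the choice of variables and of the %9.3f format is illustrative only:

```stata
sysuse auto, clear
describe make price gear_ratio      // shows storage type and display format
format gear_ratio %9.3f             // change how the values are displayed
list make price gear_ratio in 1/3
```

Changing the display format affects only how values are shown, not how they are stored.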
Menus vs. Commands
pull-down menus:- Allows user to get results
without needing to know syntax.
Alternatively, command syntax allows user to
reproduce results easily.
Some important commands are:
Data management
• use: loads a Stata data set into memory
• generate: generates a new variable from other
variable(s)
• gr matrix: scatterplot matrices
DO-FILES
doedit open do-file editor
• Stata do-files are text files where users
can store and run their commands for
reuse, rather than retyping the
commands into the Command window
➢ Do-files are Scripts of Commands
• Reproducibility
• Easier debugging and changing
commands
• It is recommended to always use a do-file
when working in Stata
• The file extension .do is used for do-files
OPENING THE DO-FILE EDITOR
• Use the command doedit to open
the do-file editor
• Or click on the pencil and paper icon
on the toolbar
• The do-file editor is a text file editor specialized for Stata
SYNTAX HIGHLIGHTING
• The do-file editor colors Stata
commands blue
• Comments, which are not executed,
are usually preceded by * and are
colored green
• Words in quotes (file names, string
values) are colored red
RUNNING COMMANDS FROM THE DO-FILE
• To run a command from the do-file,
highlight part or all of the command, and
then hit Ctrl-D (Mac: Shift+Cmd+D) or
the “Execute(do)” icon, the rightmost
icon on the do-file editor toolbar
• Multiple commands can be selected and
executed
WORKING DIRECTORY
• At the bottom left of the Stata window is
the address of the working directory.
• Stata will load from and save files to here,
unless another directory is specified
• Use the command cd to change the
working directory
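A short sketch of checking and changing the working directory follows; the folder in the commented-out cd line is hypothetical and must be edited before use:

```stata
pwd                              // show the current working directory
* cd "C:\path\to\project"        // hypothetical folder: edit it, then remove the leading *
display "`c(pwd)'"               // c(pwd) holds the same directory as a string
```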
Entering Data
• Many Options:
✓Manually enter data into the Stata Data Editor.
✓Copy data into the Data Editor from another
source (ex.: Excel).
✓Importing an ASCII (text) file.
Manually Input Data
• Open the Data Editor by:
Clicking on Data Editor icon (4th from
right on tool menu bar, looks like a data
file).
Via command: edit
Can enter numbers or text (appears red).
To define variable names:
Note: variables are automatically named
var1, var2, …
Double-click on top of the column to view/edit
“Variable Properties” and change the
name.
Via command: rename oldvarname
newvarname
E.g. rename var1 id
Inputting from the keyboard
input age weight
8 11
9 12
8 10
9 11
10 15
end
• Stata allows data to be entered directly through the keyboard
with the input command, even when another dataset is already in
memory. This can be useful to add data that may not be used in the
ensuing statistical analysis, such as graphing data.
To use input:
• variable names follow input
• the keyword end terminates data entry
• number of rows does not need to be the same as data in memory
Notes on data entry
• There are several things to note about data entry
and the feedback you get from the Data Editor
as you enter data:
• Stata does not allow blank columns or rows in
the middle of your dataset.
• Whenever you enter new variables or
observations, always begin in the first empty
column or row. If you skip columns or rows,
Stata will fill in the intervening columns or rows
with missing values.
• Strings and value labels are color coded.
To help distinguish between the different types
of variables in the Data Editor, string values
are displayed in red, value labels are displayed
in blue, and all other values are displayed in
black.
• You can change the colors for strings
and value labels by right-clicking on
the Data Editor window and selecting
Preferences....
• A period (‘.’) represents Stata’s system
missing numeric value.
• The Cursor Location box both shows
location and is used for navigation.
• Quotes around text are unnecessary in
string variables.
Copy Spreadsheet Data
• To copy data into Data Editor from an MS
Excel spreadsheet:
i. Open Spreadsheet with data.
ii. Highlight and copy cells of interest.
iii. Paste in Data Editor (via Edit menu,
right-click, toolbar icon, or keyboard
shortcut) in 1st cell (row and column),
where you want the data to begin.
To save the data file:
Via drop-down menu: File → Save As …
Via command: save pathname/datafilename.dta
IMPORTING EXCEL DATA SETS
• Stata can read in data sets stored in many other
formats
• The command import excel is used to import
Excel data
• An Excel filename is required (with path, if not
located in working directory) after the keyword
using
• Use the sheet() option to open a particular sheet
• Use the firstrow option if variable names are on
the first row of the Excel sheet
* import excel file; change path below before executing
import excel using "C:\path\myfile.xlsx", ///
    sheet("mysheet") firstrow clear
E.g.
import excel using "C:\Users\user\Desktop\STATA Trainining\ANTSOKIA.xlsx", ///
    sheet("Sheet1") firstrow clear
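Because the paths above exist only on the trainer's machine, the following self-contained sketch first writes an Excel file and then reads it back. The file name auto_demo.xlsx is hypothetical, and the sketch assumes export excel's default sheet name of Sheet1:

```stata
sysuse auto, clear
* write the example data to an Excel file in the working directory
export excel using "auto_demo.xlsx", firstrow(variables) replace
* read it back, taking variable names from the first row
import excel using "auto_demo.xlsx", sheet("Sheet1") firstrow clear
describe, short
```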
IMPORTING .csv DATA SETS
• Comma-separated values files are also commonly
used to store data
• Use import delimited to read in .csv files
(and files delimited by other characters such as tab
or space)
• The syntax and options are very similar to import
excel
• But no need for sheet() or firstrow options
(first row is assumed to be variable names in .csv
files)
* import csv file; change path below before executing
import delimited using "C:\path\myfile.csv", clear
E.g.
import delimited using "C:\Users\user\Desktop\STATA Trainining\BEREHET.csv", clear
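The same round trip can be sketched for delimited text. The file name auto_demo.csv is hypothetical; varnames(1) states explicitly that the first row holds variable names, which import delimited would otherwise try to detect on its own:

```stata
sysuse auto, clear
export delimited using "auto_demo.csv", replace
import delimited using "auto_demo.csv", varnames(1) clear
describe, short
```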
Using the Menu to Import
Excel and .csv Data
• Because path names can be very long and many
options are often needed, menus are often
used to import data
Select File -> Import and then either
“Excel spreadsheet” or
“Text data(delimited,*.csv, …)”
IMPORTING SPSS
DATA TO STATA
• You can use the command
usespss to read SPSS files in
Stata
usespss is a user-written command, so you may
need to install it first by typing
ssc install usespss
usespss using "c:\mydata.sav"
E.g.
usespss using "C:\Users\user\Desktop\STATA Trainining\Ensaro.sav"
But this command works only in 32-bit
Stata for Windows. Stata 16 and later can read
.sav files with the built-in import spss command.
Opening an Existing Stata Datafile
❖Via drop-down menu: File → Open → scroll to find
data
❖Via command: use
❖The clear option of use clears the memory
before loading the requested data file.
❖ This is necessary because Stata can have only
one dataset in memory at a time!
VIEWING DATA
GETTING TO
KNOW YOUR DATA
browse open spreadsheet of data
list print data to Stata console
SAMPLE DATASETS FOR TRAINING
• We will use a dataset consisting of
200 observations (rows) and 13
variables (columns)
• Each observation is a student
• Variables
• Demographics – gender(1=male, 2=female),
race, ses(low, middle, high), etc
• Academic test scores
• read, write, math, science, socst
• Go ahead and load the dataset!
* Training dataset
use "C:\Users\user\Desktop\STATA Trainining\data_for_training.dta", clear
BROWSING THE DATASET
• Once the data are loaded, we can view
the dataset as a spreadsheet using the
command browse
• The magnifying glass with spreadsheet
icon also browses the dataset
• Black columns are numeric, red
columns are strings, and blue columns
are numeric with string labels
LISTING OBSERVATIONS
• The list command prints observations to
the Stata console
• Simply issuing “list” will list all observations
and variables
• Not usually recommended except for small
datasets
• Specify variable names to list only those
variables
• We will soon see how to restrict to certain
observations
* list read and write for first 5 observations
li read write in 1/5
+--------------+
| read write |
|--------------|
1. | 57 52 |
2. | 68 59 |
3. | 44 33 |
4. | 63 44 |
5. | 47 52 |
+--------------+
SELECTING OBSERVATIONS
in select by observation number
if select by condition
SELECTING BY OBSERVATION NUMBER WITH
in
• in selects by observation (row) number
• Syntax
• in firstobs/lastobs
• 30/100 – observations 30 through 100
• Negative numbers count from the end
• “L” means last observation
• -10/L – tenth observation from the
last through last observation
* list science for last 3 observations
li science in -3/L
+---------+
| science |
|---------|
198. | 55 |
199. | 58 |
200. | 53 |
+---------+
SELECTING BY CONDITION WITH if
• if selects observations that meet a
certain condition
• gender == 1 (male)
• math > 50
• if clause usually placed after the
command specification, but before
the comma that precedes the list of
options
* list gender, ses, and math if math > 70
* with clean output
li gender ses math if math > 70, clean
gender ses math
13. 1 high 71
22. 1 middle 75
37. 1 middle 75
55. 1 middle 73
73. 1 middle 71
83. 1 middle 71
97. 2 middle 72
98. 2 high 71
132. 2 low 72
164. 2 low 72
STATA LOGICAL AND RELATIONAL
OPERATORS
• == equal to
• double equals used to check for equality
• <, >, <=, >= less than, greater than, less than or
equal to, greater than or equal to
• ! not
• != not equal
• & and
• | or
* browse gender, ses, and read
* for females (gender=2) who have read > 70
browse gender ses read if gender == 2 & read > 70
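The operators above can be tried on the auto dataset shipped with Stata, as in the sketch below. One point worth keeping in mind is that Stata treats a numeric missing value as larger than any number, so a condition such as rep78 > 4 also selects observations where rep78 is missing unless they are excluded explicitly:

```stata
sysuse auto, clear
* and: foreign cars with mileage above 30
list make price mpg if foreign == 1 & mpg > 30, clean
* or: expensive cars or cars with low mileage
count if price > 10000 | mpg < 15
* not equal, combined with a check for non-missing values
count if rep78 != 3 & !missing(rep78)
* missing values count as "greater than 4" ...
count if rep78 > 4
* ... unless they are excluded
count if rep78 > 4 & !missing(rep78)
```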
Exploring data
describe get variable properties
codebook inspect variable values
summarize summarize distribution
tabulate tabulate frequencies
EXPLORE DATA BEFORE ANALYSIS
• Take the time to explore your data set before embarking on
analysis
• Get to know your sample
• Demographics of subjects
• Distributions of key variables
• Look for possible errors in variables
USE describe TO GET VARIABLE
PROPERTIES
• describe provides the following variable
properties:
• storage type (e.g. byte (integer), float (decimal),
str8 (character string variable of length 8))
• name of value label
• variable label
• describe by itself will describe all
variables
• can restrict to a list of variables (varlist in
Stata lingo)
* get variable properties
describe
Contains data from C:\Users\user\Desktop\STATA Trainining\data_for_training.dta
  obs:           200
 vars:            11                          12 Dec 2008 14:38
 size:         9,600
----------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------
gender          float   %9.0g
id              float   %9.0g
race            float   %12.0g     rl
ses             float   %9.0g      sl
schtyp          float   %9.0g
prgtype         str8    %9s
read            float   %9.0g                 reading score
write           float   %9.0g                 writing score
math            float   %9.0g                 math score
science         float   %9.0g                 science score
socst           float   %9.0g                 social studies score
----------------------------------------------------------------
USE codebook TO INSPECT VARIABLE VALUES
For more detailed information about the
values of each variable, use codebook, which
provides the following:
• For all variables
• number of unique and missing values
• For numeric variables
• range, quantiles, means and standard
deviation for continuous variables
• frequencies for discrete variables
• For string variables
• frequencies
• warnings about leading and trailing
blanks
* inspect values of variables read gender and prgtype
codebook read gender prgtype
DESCRIPTIVE ANALYSIS (SUMMARIES)
❑Summarizing continuous variable
• The summarize command
calculates a variable’s:
• number of non-missing
observations
• mean
• standard deviation
• min and max
* summarize continuous variables
summarize read math
Variable | Obs Mean Std. Dev. Min Max
----------+-----------------------------------------
read | 200 52.23 10.25294 28 76
math | 200 52.645 9.368448 33 75
* summarize read and math for females
summarize read math if gender == 2
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       109    51.73394    10.05783         28         76
        math |       109     52.3945    9.151015         33         72
DETAILED SUMMARIES
• Use the detail option with
summarize to get more estimates
that characterize the distribution,
such as:
• percentiles (including the median at
50th percentile)
• variance
• skewness
• kurtosis
* detailed summary of read for females
summarize read if gender == 2, detail
                       reading score
-------------------------------------------------------------
      Percentiles      Smallest
 1%           34             28
 5%           36             34
10%           39             34       Obs                 109
25%           44             35       Sum of Wgt.         109

50%           50                      Mean           51.73394
                        Largest       Std. Dev.      10.05783
75%           57             71
90%           68             73       Variance         101.16
95%           68             73       Skewness       .3234174
99%           73             76       Kurtosis       2.500028
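After summarize, the statistics shown in the table are also available as stored results, which is handy for further calculations. A small sketch on the shipped auto dataset (the variable choice is illustrative):

```stata
sysuse auto, clear
summarize mpg, detail
display "median = " r(p50)
display "IQR    = " r(p75) - r(p25)
return list                       // list every stored result
```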
Summarizing (Tabulating)
Frequencies of Categorical Variables
• tabulate displays counts of each value of a
variable
• useful for variables with a limited number of
levels
• use the nolabel option to display the
underlying numeric values (by removing value
labels)
* tabulate frequencies of ses
tabulate ses
ses | Freq. Percent Cum.
------------+-----------------------------------
low | 47 23.50 23.50
middle | 95 47.50 71.00
high | 58 29.00 100.00
------------+-----------------------------------
Total | 200 100.00
* remove labels
tab ses, nolabel
ses | Freq. Percent Cum.
------------+-----------------------------------
1 | 47 23.50 23.50
2 | 95 47.50 71.00
3 | 58 29.00 100.00
------------+-----------------------------------
Total | 200 100.00
TWO-WAY TABULATIONS
• tabulate can also calculate the
joint frequencies of two variables
• Use the row and col options to
display row and column percentages
• We may have found an error in a
race value (5?)
* with row percentages
tab race ses, row
| ses
race | low middle high | Total
-------------+---------------------------------+----------
hispanic | 9 11 4 | 24
| 37.50 45.83 16.67 | 100.00
-------------+---------------------------------+----------
asian | 3 5 3 | 11
| 27.27 45.45 27.27 | 100.00
-------------+---------------------------------+----------
african-amer | 11 6 3 | 20
| 55.00 30.00 15.00 | 100.00
-------------+---------------------------------+----------
white | 24 71 48 | 143
| 16.78 49.65 33.57 | 100.00
-------------+---------------------------------+----------
5 | 0 2 0 | 2
| 0.00 100.00 0.00 | 100.00
-------------+---------------------------------+----------
Total | 47 95 58 | 200
| 23.50 47.50 29.00 | 100.00
Tabulating By Sort
• First, recode student data
recode age (18 19 = 1 "18 to 19") ///
           (20/29 = 2 "20 to 29") ///
           (30/39 = 3 "30 to 39") (else = .), ///
           generate(agegroups) label(agegroups)
• Then tabulate separately by gender, with column percentages
bysort gender: tab agegroups major, col nokey
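The variables age and major above belong to a student dataset that is not part of this training file. As a runnable sketch of the same idea, the following uses the auto dataset shipped with Stata; the cut points and the new variable name mpggroup are illustrative assumptions:

```stata
sysuse auto, clear
* group mileage into three labelled categories
recode mpg (min/19 = 1 "under 20")  ///
           (20/29  = 2 "20 to 29")  ///
           (30/max = 3 "30 or more"), ///
           generate(mpggroup) label(mpggroup)
* two-way table of mileage group by repair record, separately by car type
bysort foreign: tabulate mpggroup rep78, col nokey
```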
DATA VISUALIZATION
histogram histogram
graph box boxplot
graph bar bar plots
scatter scatter plot
graph pie pie chart
DATA VISUALIZATION
(GRAPHICS)
• Data visualization is the representation of data in visual formats
such as graphs
• Graphs help us to gain information about the distributions of variables and
relationships among variables quickly through visual inspection
• Graphs can be used to explore your data, to familiarize yourself
with distributions and associations in your data
• Graphs can also be used to present the results of statistical analysis
HISTOGRAMS
• Histograms plot distributions of
variables by displaying counts of values
that fall into various intervals of the
variable
• Let’s use the auto data file for making
some graphs.
• sysuse auto.dta
• The histogram command can be used
to make a simple histogram of mpg
* histogram of mileage (mpg)
histogram mpg
[Figure: histogram of Mileage (mpg); y-axis: Density]
histogram OPTIONS
• Use the option normal with
histogram to overlay a theoretical
normal density
• Use addlabels to label each bar with its height
(here, the percentage)
• Add discrete to treat the variable as discrete, so
that each distinct value gets its own bar
• Use the width() option to specify
interval width
* histogram of rep78 in percent, with bar labels,
* normal density and intervals of length 1
hist rep78, percent discrete addlabels normal width(1)
[Figure: histogram of Repair Record 1978; y-axis: Percent; bar labels 2.899, 11.59, 43.48, 26.09, 15.94]
BOXPLOTS
• Boxplots are another popular option
for displaying distributions of
continuous variables
• They display the median, the
interquartile range, (IQR) and
outliers (beyond 1.5*IQR)
• You can request boxplots for
multiple variables on the same plot
* boxplot of mpg
graph box mpg
[Figure: boxplot of Mileage (mpg)]
BOXPLOTS
• The boxplot can be done separately
for foreign and domestic cars using
the by( ) or over( ) option.
• graph box mpg, by(foreign) or
• graph box mpg, over(foreign)
• As the resulting graph shows,
there are a pair of outliers in the box
plots produced.
• These can be removed from the box
plot using the nooutsides option in
Stata.
* boxplots of mpg by car type
graph box mpg, by(foreign)
[Figure: boxplots of Mileage (mpg) for Domestic and Foreign cars; graphs by Car type]
BAR GRAPHS TO VISUALIZE FREQUENCIES
• Bar graphs are often used to
visualize frequencies
• graph bar produces bar
graphs in Stata
• For displays of frequencies
(counts) of each level of a
variable, use this syntax:
graph bar (count), over(variable)
* bar graph of count of car type
graph bar (count), over(foreign)
[Figure: bar graph over car type (Domestic, Foreign)]
TWO-WAY BAR GRAPHS
• Multiple over(variable) options can be
specified
• The option asyvars will color the
bars by the first over() variable
* frequencies of gender by major
* asyvars colors bars by major
graph bar (count), over(major) over(gender) asyvars
[Figure: grouped bar graph of frequency by gender (Female, Male), bars colored by major (Econ, Math, Politics)]
GROUPED BAR GRAPH USING HISTOGRAM, DISCRETE
• Side-by-side or grouped bar graphs,
for levels of some grouping variable
• sort groupingvar
• histogram variablename, discrete
by(groupingvar)
• E.g.
histogram rep78, discrete by(foreign) percent addlabels ///
    xlabel(1 "1" 2 "2" 3 "3" 4 "4" 5 "5") gap(25) ///
    title("Repair Record in 1978") xtitle(" ")
[Figure: histograms of rep78 in percent, by Car type. Domestic: 4.167, 16.67, 56.25, 18.75, 4.167. Foreign: 14.29, 42.86, 42.86. Panel title: Repair Record in 1978]
TWO-WAY SCATTER PLOT
• The Stata graphing command twoway produces layered graphics,
where multiple plots can be overlaid on the same graph
• Each plot should involve a y-variable and an x-variable that appear on
the y-axis and x-axis, respectively
• Syntax (generally): twoway (plottype1 yvar xvar)
(plottype2 yvar xvar)…
• plottype is one of several types of plots available to twoway,
and yvar and xvar are the variables to appear on the y-axis and x-
axis
• See help twoway for a list of the many plottypes available
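The general syntax above can be sketched with three layers on the shipped auto dataset: a scatter plot, a linear fit and a lowess smoother. The legend text is an illustrative choice:

```stata
sysuse auto, clear
twoway (scatter mpg weight)  ///
       (lfit mpg weight)     ///
       (lowess mpg weight),  ///
       legend(order(1 "Observed" 2 "Linear fit" 3 "Lowess"))
```

Each parenthesized plot is drawn on the same axes, in the order given.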
TWO-WAY SCATTER GRAPH EXAMPLE 1
• A two-way scatter plot can be used
to show the relationship between
two variables.
• As we would expect, there is a
negative relationship
between mpg and weight.
* scatter plot of mpg against weight
graph twoway scatter mpg weight
[Figure: scatter plot of Mileage (mpg) against Weight (lbs.)]
LAYERED GRAPH
• You can also overlay separate plots by
group to the same graph with
different colors
• Use if to select groups
• the mcolor() option controls
the color of the markers
* layered scatter plots of weight and mpg, colored by rep78
twoway (scatter weight mpg if rep78 == 3, mcolor(blue)) ///
       (scatter weight mpg if rep78 == 4, mcolor(red))
[Figure: layered scatter plots of Weight (lbs.) against Mileage (mpg), one color per rep78 group]
LAYERED GRAPH
• Layered graph of scatter plot and
linear fit (best-fit line)
* layered scatter plot of mpg and weight with a linear fit
graph twoway (scatter mpg weight, msymbol(o)) (lfit mpg weight), ///
    title("Scatterplot") subtitle("with Overlay Linear Fit")
[Figure: "Scatterplot with Overlay Linear Fit"; Mileage (mpg) and Fitted values against Weight (lbs.)]
Pie CHART
• Stata can also produce pie charts.
• The graph pie command with
the over option creates a pie chart
representing the frequency of each
group.
• The plabel option places the value
labels inside each slice of the pie
chart.
graph pie, over(rep78) plabel(_all name) ///
    title("Repair Record 1978")
[Figure: pie chart titled "Repair Record 1978" with slices 1 to 5]
Pie CHART…
graph pie, over(foreign) plabel(_all name) title("Car type")
[Figure: pie chart titled "Car type" with slices Domestic and Foreign]
DATA MANAGEMENT
generate create variable
replace replace values of variable
egen extended variable generation
rename rename variable
recode recode variable values
label variable give variable description
label define generate value label set
label values apply value labels to variable
encode convert string variable to
numeric
Creating, Transforming and Labeling Variables
GENERATING VARIABLES
• Variables often do not arrive in the form
that we need
• Use generate (often abbreviated gen) to
create variables, usually from arithmetic
operations on existing variables
• sums/differences/products of variables
• squares of variables
• If an input value to a generated variable is
missing, the result will be missing
* generate a sum of 3 variables
generate total = math + science + socst
(5 missing values generated)
* it seems 5 missing values were generated
* let's look at variables
summarize total math science socst
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       total |       195    156.4564    24.63553         96        213
        math |       200      52.645    9.368448         33         75
     science |       195    51.66154    9.866026         26         74
       socst |       200      52.405    10.73579         26         71
REPLACING VALUES
• Use replace to replace values of
existing variables
• Often used with if to replace values
for a subset of observations
• Here we see the use of the missing
numeric value indicator.
• Missing value for strings is “”
* replace total with just (math+socst)
* if science is missing
replace total = math + socst if science == .
* no missing totals now
summarize total
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       total |       200      155.42    25.47565         74        213
CREATING DUMMY INDICATORS
• It is often necessary to create variables that
are 0/1 indicators for belonging to a
category of another variable, where
0=FALSE and 1=TRUE
• often called dummy variables or indicators
• Remember that Stata often prefers to work
with numeric variables
* create a variable that equals 1 if prgtype
* equals academic, 0 otherwise
gen academic = 0
replace academic = 1 if prgtype == "academic"
tab prgtype academic
| academic
prgtype | 0 1 | Total
-----------+----------------------+----------
academic | 0 105 | 105
general | 45 0 | 45
vocati | 50 0 | 50
-----------+----------------------+----------
Total | 95 105 | 200
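A logical expression in Stata evaluates to 1 when true and 0 when false, so a dummy can also be created in a single line. The sketch below uses the shipped auto dataset; the cut-off of 25 and the variable name highmpg are illustrative assumptions:

```stata
sysuse auto, clear
* 1 if mileage is above 25, 0 otherwise; stays missing if mpg is missing
generate byte highmpg = (mpg > 25) if !missing(mpg)
tabulate highmpg
```

The if !missing() qualifier prevents missing values from being coded as 1.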
EXTENDED GENERATION OF VARIABLES
• egen (extended generate) creates variables using
a wide array of functions, which include:
• statistical functions that accept multiple variables as
arguments
• e.g. means across several variables
• functions that accept a single variable, but do not
involve simple arithmetic operations
• e.g. standardizing a variable (subtract mean and
divide by standard deviation)
• See the help file for egen to see a full list of
available functions
* egen to generate variables with functions
* rowmean returns mean of all non-missing values
egen meantest = rowmean(read math science socst)
summarize meantest read math science socst
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
meantest | 200 52.28042 8.400239 32.5 70.66666
read | 200 52.23 10.25294 28 76
math | 200 52.645 9.368448 33 75
science | 195 51.66154 9.866026 26 74
socst | 200 52.405 10.73579 26 71
* standardize read
egen zread = std(read)
summarize zread
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
zread | 200 -1.84e-09 1 -2.363225 2.31836
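egen functions can also be computed within groups by adding the by() option. A brief sketch on the shipped auto dataset follows; the new variable names are illustrative:

```stata
sysuse auto, clear
* mean price within each car type, repeated on every row of the group
egen meanprice = mean(price), by(foreign)
* deviation of each car's price from its group mean
generate pricedev = price - meanprice
* standardized price across all cars
egen zprice = std(price)
summarize pricedev zprice
```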
RENAMING AND RECODING VARIABLES
• rename changes the name of a variable
• Syntax: rename old_name new_name
• recode changes the values of a variable to
another set of values
• Here we will change the gender variable
(1=male, 2=female) to “female” and will
recode its values to (0=male, 1=female)
• Thus, it will be clear what the coding of female
signifies
* renaming variables
rename gender female
* recode values to 0,1
recode female (1=0)(2=1)
tab female
female | Freq. Percent Cum.
------------+-----------------------------------
0 | 91 45.50 45.50
1 | 109 54.50 100.00
------------+-----------------------------------
Total | 200 100.00
LABELING VARIABLES (1)
• Short variable names make coding more
efficient but can obscure the variable’s
meaning
• Use label variable to give the
variable a longer description
• The variable label will sometimes be used in
output and often in graphs
* labeling variables (description)
label variable math "9th grade math score"
label variable schtyp "public/private school"
* the variable label will be used in some output
histogram math
tab schtyp
public/priv |
ate school | Freq. Percent Cum.
------------+-----------------------------------
1 | 168 84.00 84.00
2 | 32 16.00 100.00
------------+-----------------------------------
Total | 200 100.00
LABELING VALUES
• Value labels give text descriptions to the numerical values of a variable.
• To create a new set of value labels use label define
• Syntax: label define labelname # “label”…, where
labelname is the name of the value label set, and (#
“label”…) is a list of numbers, each followed by its label.
• Then, to apply the labels to variables, use label values
• Syntax: label values varlist labelname, where
varlist is one or more variables, and labelname is the value
label set name
* schtyp before labeling values
tab schtyp
public/priv |
ate school | Freq. Percent Cum.
------------+-----------------------------------
1 | 168 84.00 84.00
2 | 32 16.00 100.00
------------+-----------------------------------
Total | 200 100.00
* create and apply labels for schtyp
label define pubpri 1 public 2 private
label values schtyp pubpri
tab schtyp
public/priv |
ate school | Freq. Percent Cum.
------------+-----------------------------------
public | 168 84.00 84.00
private | 32 16.00 100.00
------------+-----------------------------------
Total | 200 100.00
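The three labeling commands can be tried together on the shipped auto dataset. In this sketch the indicator expensive, the 6,000 cut-off and the label set yesno are all illustrative assumptions:

```stata
sysuse auto, clear
generate byte expensive = price > 6000
label variable expensive "price above 6,000"    // variable label
label define yesno 0 "no" 1 "yes"               // create the value label set
label values expensive yesno                    // attach it to the variable
tabulate expensive
```

One value label set can be attached to several variables, which is why defining and applying are separate steps.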
LISTING VALUE LABEL SETS (1)
• label list displays all value label sets
• Remember that describe can be used to
see which value labels have been applied to
which variables
* list all value label set
label list
pubpri:
1 public
2 private
sl:
1 low
2 middle
3 high
rl:
1 hispanic
2 asian
3 african-amer
4 white
LISTING VALUE LABEL SETS (2)
* describe shows which value labels
* have been applied to which variables
describe
-------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------
female          float   %9.0g
id              float   %9.0g
race            float   %12.0g     rl
ses             float   %9.0g      sl
schtyp          float   %9.0g      pubpri     public/private school
prgtype         str8    %9s
read            float   %9.0g                 reading score
Encoding String Variables into Numeric (1)
• encode converts a string variable into a numeric
variable
• remember that some Stata commands require
numeric variables
• encode will use alphabetical order to order the
numeric codes
• encode will convert the original string values
into a set of value labels
• encode will create a new numeric variable,
which must be specified in option
gen(varname)
* encoding string prgtype into
* numeric variable prog
encode prgtype, gen(prog)
* we see that a value label has been
* applied to prog
describe prog
              storage   display    value
variable name   type    format     label
-----------------------------------------
prog            long    %8.0g      prog
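To see the alphabetical coding at work, the sketch below builds a tiny made-up dataset (the four string values are assumptions for illustration) and encodes it. The codes follow alphabetical order of the strings, not the order in which they appear in the data.

```stata
* a small made-up dataset with one string variable
clear
input str10 prgtype
"general"
"academic"
"vocational"
"academic"
end
* encode numbers the categories alphabetically, starting at 1
encode prgtype, gen(prog)
* the original strings are kept as a value label set named prog
label list prog
list prgtype prog, nolabel
```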
ENCODING STRING VARIABLES INTO
NUMERIC (2)
• remember to use the option nolabel to
remove value labels from tabulate output
• Notice that numbering begins at 1
* we see labels by default in tab
tab prog
prog | Freq. Percent Cum.
------------+-----------------------------------
academic | 105 52.50 52.50
general | 45 22.50 75.00
vocati | 50 25.00 100.00
------------+-----------------------------------
Total | 200 100.00
* use option nolabel to remove the labels
tab prog, nolabel
prog | Freq. Percent Cum.
------------+-----------------------------------
1 | 105 52.50 52.50
2 | 45 22.50 75.00
3 | 50 25.00 100.00
------------+-----------------------------------
Total | 200 100.00
• Dataset operations
order reorder variables
keep keep variables, drop others
drop drop variables, keep others
keep if keep observations, drop others
drop if drop observations, keep others
sort sort by variables, ascending
gsort ascending and descending sort
SHORTCUTS FOR LISTS OF VARIABLES
(1)
• We specify many varlists (lists of
variable names) in Stata, so Stata provides
many shortcuts to avoid excessive typing
• The syntax var_first-var_last
specifies all consecutive variables from
var_first to var_last
• The keyword _all means all variables in
the dataset
* summarize all consecutive variables
* from read to socst
summ read-socst
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
read | 200 52.23 10.25294 28 76
write | 200 52.775 9.478586 31 67
math | 200 52.645 9.368448 33 75
science | 195 51.66154 9.866026 26 74
socst | 200 52.405 10.73579 26 71
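A short sketch of both shortcuts on the built-in auto dataset (an assumption; any dataset works). The range price-rep78 relies on the order in which the variables are stored, so it is worth checking that order with describe first.

```stata
sysuse auto, clear
* price, mpg and rep78 are stored next to each other, so the range picks all three
summarize price-rep78
* _all stands for every variable in the dataset
describe _all
```

If the variables are later reordered, the same range may select a different set of variables.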
SHORTCUTS FOR LISTS OF VARIABLES
(2)
• The * symbol in variable names stands for
“zero or more characters”
• r* = all variables that start with “r”, followed
by anything
• r*e = all variables that start with “r”, followed
by anything, but ending with “e”
* summarize all variables that begin with r
summ r*
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
race | 200 3.44 1.049719 1 5
read | 200 52.23 10.25294 28 76
76
* summarize all variables that begin with r
* and end with e
summ r*e
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
race | 200 3.44 1.049719 1 5
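The wildcard can sit at the start or the end of a name. A minimal sketch, assuming the built-in auto dataset, where t* picks up trunk and turn and *t picks up weight and displacement:

```stata
sysuse auto, clear
* every variable whose name starts with t
summarize t*
* every variable whose name ends with t
summarize *t
```

When unsure what a pattern will match, describe t* lists the matching variables without computing anything.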
ORDERING VARIABLES
• Use order to change the ordering of
variables
• Particularly useful for datasets with many
variables
• By default, order will place the variables
listed at the beginning of the dataset
• Use option last to place them at the end
• Or use options before() and after() to
place the variables before or after another
variable
* put id and demographic variables first
order id female race ses schtyp prog
* put old prgtype variable last
order prgtype, last
describe, simple
id race schtyp read math socst academic zread
female ses prog write science total meantest prgtype
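The before() and after() options mentioned above are not shown in the example, so here is a small sketch of them. It assumes the built-in auto dataset; the choice of variables is arbitrary.

```stata
sysuse auto, clear
* move foreign so that it sits immediately after make
order foreign, after(make)
* move weight so that it sits immediately before price
order weight, before(price)
* the first four variables are now make, foreign, weight, price
describe, simple
```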
SAVE YOUR DATA BEFORE MAKING BIG
CHANGES
• We are about to make changes to the
dataset that cannot easily be reversed, so
we should save the data before continuing
• We are going to revert to this saved dataset
later
* save dataset, overwrite existing file
save hs1, replace
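The save-then-revert pattern can be rehearsed without touching any real file. The sketch below is an assumption-laden variant: it uses the built-in auto dataset and a temporary file (tempfile) instead of hs1, and it must be run as one block from a do-file because the temporary name is a local macro.

```stata
sysuse auto, clear
* tempfile reserves a scratch file name that Stata removes when the do-file ends
tempfile backup
save `backup', replace
* make a change that is hard to undo ...
drop if foreign == 1
count
* ... then revert to the saved copy
use `backup', clear
count
```

For work that should survive the session, save to a named file as in the slide above.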
KEEPING AND DROPPING VARIABLES
• keep preserves the selected variables and
drops the rest
• Use keep if you want to remove most of the
variables but keep a select few
• drop removes the selected variables and
keeps the rest
• Use drop if you want to remove a few
variables but keep most of them
* drop prgtype from dataset
drop prgtype
describe, simple
id race schtyp read math socst academic zread
female ses prog write science total meantest
* keep just id read and math
keep id read math
describe, simple
id read math
KEEPING AND DROPPING
OBSERVATIONS
• Specify if after keep or drop to preserve or
remove observations by condition
• To be clear, keep if and drop if select
observations, while keep and drop select
variables
* keep observation if reading > 40
keep if read > 40
summ read
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
read | 178 54.23596 8.96323 41 76
* now drop if math outside range [30,70]
drop if math < 30 | math > 70
summ math
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
math | 168 52.68452 8.118243 35 70
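One point worth checking when filtering with > is missing data: Stata treats a numeric missing value (.) as larger than any number, so a condition such as x > 3 is also true when x is missing. A small sketch on the built-in auto dataset, where rep78 has a few missing values (dataset and cutoff are illustrative assumptions):

```stata
sysuse auto, clear
* rep78 is missing for a few cars
count if missing(rep78)
* this count also includes the cars with missing rep78
count if rep78 > 3
* adding !missing() restricts the condition to observed values
count if rep78 > 3 & !missing(rep78)
```

The same applies to keep if and drop if, so adding !missing(varname) to the condition is a common safeguard.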
SORTING DATA (1)
• Use sort to order the observations by
one or more variables
• sort var1 var2 var3, for example, will
sort first by var1, then by var2, then by
var3, all in ascending order
* sorting
* first look at unsorted
li in 1/5
+-------------------+
| id read math |
|-------------------|
1. | 70 57 41 |
2. | 121 68 53 |
3. | 86 44 54 |
4. | 141 63 47 |
5. | 172 47 57 |
+-------------------+
SORTING DATA (2)
* now sort by read and then math
sort read math
li in 1/5
+-------------------+
| id read math |
|-------------------|
1. | 37 41 40 |
2. | 30 41 42 |
3. | 145 42 38 |
4. | 22 42 39 |
5. | 124 42 41 |
+-------------------+
SORTING DATA (3)
• Use gsort with + or - before each
variable to specify ascending and descending
order, respectively
* sort descending read then ascending math
gsort -read +math
li in 1/5
+-------------------+
| id read math |
|-------------------|
1. | 61 76 60 |
2. | 103 76 64 |
3. | 34 73 57 |
4. | 93 73 62 |
5. | 95 73 71 |
+-------------------+
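The same mixed sort can be tried on the built-in auto dataset (an assumption, used only so the lines run without the training file): highest mpg first, and cheapest first among cars with equal mpg.

```stata
sysuse auto, clear
* descending mpg, then ascending price within ties
gsort -mpg +price
list make mpg price in 1/3
```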
DATA MANAGEMENT EXERCISES (1)
• Let’s use what we have learned so far to create 3 new datasets
• males with math and reading scores above 70
• females with math and social studies scores above 70
• race and ses for all students with math score above 70 and either reading or social
studies also above 70
• We will want the student id variable in all datasets
DATA MANAGEMENT EXERCISES (2)
• Let’s create a dataset of males with math
and reading scores above 70
• First load the hs1 dataset
• Now restrict observations to males with
math and reading scores above 70
• Now drop all variables except id, female,
math and read
• Print the dataset to screen
• Save the dataset with name “males70”
* first load the hs1 dataset
use hs1, clear
* restrict to males with math and reading > 70
keep if female == 0 & math > 70 & read > 70
* keep only id female math and read
keep id female math read
* print to screen
li
+----------------------------+
| id female read math |
|----------------------------|
1. | 95 0 73 71 |
2. | 132 0 73 73 |
3. | 68 0 73 71 |
+----------------------------+
* save dataset
save males70, replace
DATA MANAGEMENT EXERCISES (3)
• Now create a dataset of females with math
and social studies scores above 70
• First load the hs1 dataset
• Now restrict observations to females with
math and socst scores above 70
• Now drop all variables except id, female,
math and socst
• Print the dataset to screen
• Save the dataset with name “females70”
* now for females with math and socst above 70
use hs1, clear
keep if female == 1 & math > 70 & socst > 70
* this time keep id female, math, socst
keep id female math socst
li
+-----------------------------+
| id female math socst |
|-----------------------------|
1. | 100 1 71 71 |
+-----------------------------+
save females70, replace
DATA MANAGEMENT EXERCISES (4)
• Finally a dataset of race and ses for students with
math above 70 and either read or socst above 70
• First load the hs1 dataset
• Now restrict observations to everyone with
math above 70 and either read or socst above 70
• Now drop all variables except id, female, race, ses,
read, math, and socst
• Print the dataset to screen
• Save the dataset with name “raceses70”
* id female race ses for students with
* math > 70 and either read > 70 or socst > 70
use hs1, clear
keep if math > 70 & (read > 70 | socst > 70)
* keep id female race ses read math socst
keep id-ses read math socst
li
+-----------------------------------------------------+
| id female race ses read math socst |
|-----------------------------------------------------|
1. | 95 0 white high 73 71 71 |
2. | 132 0 white middle 73 73 66 |
3. | 68 0 white middle 73 71 66 |
4. | 57 1 white middle 71 72 56 |
5. | 100 1 white high 63 71 71 |
+-----------------------------------------------------+
save raceses70, replace
BASIC STATISTICAL ANALYSIS
ANALYSIS OF
CONTINUOUS, NORMALLY
DISTRIBUTED OUTCOMES
ci means confidence intervals for means
ttest t-tests
anova analysis of variance
correlate correlation matrices
regress linear regression
predict model predictions
test test of linear combinations of
coefficients
MEANS AND CONFIDENCE INTERVALS
(1)
• Confidence intervals express a range of
plausible values for a population parameter,
such as the mean of a variable, consistent
with the sample data
• The mean command provides a 95%
confidence interval, as do many other
commands
• We can change the confidence level of the
interval with the ci means command and
the level() option
* many commands provide 95% CI
mean read
Mean estimation Number of obs = 200
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
read | 52.23 .7249921 50.80035 53.65965
--------------------------------------------------------------
MEANS AND CONFIDENCE INTERVALS
(2)
* 99% CI for read
ci means read, level(99)
Variable | Obs Mean Std. Err. [99% Conf. Interval]
-------------+---------------------------------------------------------------
read | 200 52.23 .7249921 50.34447 54.11553
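A higher confidence level gives a wider interval. The sketch below shows this on the built-in auto dataset (an assumption) and reads the bounds from the stored results r(lb) and r(ub).

```stata
sysuse auto, clear
* 90% interval for mean mpg
ci means mpg, level(90)
display "90% CI: " r(lb) " to " r(ub)
* 99% interval for the same variable is wider
ci means mpg, level(99)
display "99% CI: " r(lb) " to " r(ub)
```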
T-TESTS TEST WHETHER THE MEANS ARE
DIFFERENT BETWEEN 2 GROUPS
• t-tests test whether the mean of a variable is different between 2 groups
• The t-test assumes that the variable is normally distributed
• The independent samples t-test assumes that the two groups are independent
(uncorrelated)
• Syntax for independent samples t-test:
• ttest var, by(groupvar), where var is the variable whose mean will be
tested for differences between levels of groupvar
INDEPENDENT SAMPLES T-TEST
EXAMPLE
* independent samples t-test
ttest read, by(female)
Two-sample t test with equal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
0 | 91 52.82418 1.101403 10.50671 50.63605 55.0123
1 | 109 51.73394 .9633659 10.05783 49.82439 53.6435
---------+--------------------------------------------------------------------
combined | 200 52.23 .7249921 10.25294 50.80035 53.65965
---------+--------------------------------------------------------------------
diff | 1.090231 1.457507 -1.783998 3.964459
------------------------------------------------------------------------------
diff = mean(0) - mean(1) t = 0.7480
Ho: diff = 0 degrees of freedom = 198
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.7723 Pr(|T| > |t|) = 0.4553 Pr(T > t) = 0.2277
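The test statistic and p-value shown in the output are also kept as stored results, which helps when they are needed later in a do-file. A minimal sketch, assuming the built-in auto dataset with foreign as the grouping variable:

```stata
sysuse auto, clear
* compare mean mpg between domestic and foreign cars
ttest mpg, by(foreign)
* r(t) is the t statistic, r(p) the two-sided p-value
display "t = " r(t) ",  p = " r(p)
* the unequal option relaxes the equal-variance assumption
ttest mpg, by(foreign) unequal
```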
PAIRED SAMPLES T-TEST (1)
• The paired-samples (dependent samples) t-test assesses whether the means
of 2 variables are the same when the measurements of the 2 variables are not
independent
• 2 variables measured on the same individual
• one variable measured for parent, the other variable measured for child
• Syntax for paired samples t-test
• ttest var1 == var2
PAIRED SAMPLES T-TEST EXAMPLE
* paired samples t-test
ttest read == write
Paired t test
------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
read | 200 52.23 .7249921 10.25294 50.80035 53.65965
write | 200 52.775 .6702372 9.478586 51.45332 54.09668
---------+--------------------------------------------------------------------
diff | 200 -.545 .6283822 8.886666 -1.784142 .6941424
------------------------------------------------------------------------------
mean(diff) = mean(read - write) t = -0.8673
Ho: mean(diff) = 0 degrees of freedom = 199
Ha: mean(diff) < 0 Ha: mean(diff) != 0 Ha: mean(diff) > 0
Pr(T < t) = 0.1934 Pr(|T| > |t|) = 0.3868 Pr(T > t) = 0.8066
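A paired test needs two measurements per row. The sketch below uses five made-up pairs; the variable names score1 and score2 and all values are hypothetical and only illustrate the syntax.

```stata
* five made-up pairs of scores, one pair per person
clear
input score1 score2
60 64
72 70
55 61
68 71
80 83
end
* paired t-test: is the mean of score1 - score2 equal to zero?
ttest score1 == score2
display "mean difference = " r(mu_1) - r(mu_2)
```

With five pairs the test has 4 degrees of freedom, one fewer than the number of pairs.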
ANALYSIS OF VARIANCE
• Analysis of Variance (ANOVA) models traditionally assess whether means of a
continuous variable are different across multiple groups (possibly represented
by multiple categorical variables)
• ANOVA assumes the dependent variable is normally distributed
• ANOVA is not one of Stata’s strengths
• Syntax: anova depvar varlist
• where depvar is the name of the dependent variable, and varlist is a list of
predictors, assumed to be categorical
• If a predictor is to be treated as continuous (ANCOVA model), precede its variable
name with c.
ANOVA EXAMPLE
* 2-way ANOVA of write by female and prog
anova write female prog
Number of obs = 200 R-squared = 0.2408
Root MSE = 8.32211 Adj R-squared = 0.2291
Source | Partial SS df MS F Prob>F
-----------+----------------------------------------------------
Model | 4304.4027 3 1434.8009 20.72 0.0000
|
female | 1128.7049 1 1128.7049 16.30 0.0001
prog | 3128.1889 2 1564.0944 22.58 0.0000
|
Residual | 13574.472 196 69.257512
-----------+----------------------------------------------------
Total | 17878.875 199 89.843593
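The c. prefix described above is not used in the example, so here is a small ANCOVA-style sketch. It assumes the built-in auto dataset, with foreign as the categorical factor and weight as the continuous covariate.

```stata
sysuse auto, clear
* foreign is treated as categorical; c.weight enters as a continuous covariate
anova mpg foreign c.weight
```

Without the c. prefix, anova would treat weight as categorical and estimate one level per distinct value.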
CORRELATION (1)
• A correlation coefficient quantifies the
linear relationship between two
(continuous) variables on a scale between -1
and 1
• Syntax: correlate varlist
• The output will be a correlation matrix that
shows the pairwise correlation between
each pair of variables
* correlation of write and math
correlate write math
(obs=200)
| write math
-------------+------------------
write | 1.0000
math | 0.6174 1.0000
CORRELATION (2)
* correlation matrix of 5 variables
corr read write math science socst
(obs=195)
| read write math science socst
-------------+---------------------------------------------
read | 1.0000
write | 0.5960 1.0000
math | 0.6492 0.6203 1.0000
science | 0.6171 0.5671 0.6166 1.0000
socst | 0.6175 0.5996 0.5299 0.4529 1.0000
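Note the (obs=195) above: correlate uses only observations that are complete on every listed variable, so adding a variable with missing values shrinks the sample for the whole matrix. A sketch of the same effect on the built-in auto dataset (an assumption), where rep78 has missing values:

```stata
sysuse auto, clear
* rep78 has missing values, so the whole matrix uses fewer than 74 cars
correlate price mpg rep78
display "observations used: " r(N)
* without rep78 all 74 cars are used
correlate price mpg
display "observations used: " r(N) ",  correlation: " r(rho)
```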
LINEAR REGRESSION
• Linear regression, or ordinary least squares regression, models the effects of one or more
predictors, which can be continuous or categorical, on a normally-distributed outcome
• Linear regression and ANOVA are actually the same model expressed in different ways
• Syntax: regress depvar varlist, where depvar is the name of the dependent
variable, and varlist is a list of predictors, now assumed to be continuous
• To be safe, precede variable names with i. to denote categorical predictors and c. to denote
continuous predictors
• For categorical predictors with the i. prefix, Stata will automatically create dummy 0/1 indicator
variables and enter all but one (the first, by default) into the regression
LINEAR REGRESSION EXAMPLE
* linear regression of write on continuous
* predictor math and categorical predictor prog
regress write c.math i.prog
Source | SS df MS Number of obs = 200
-------------+---------------------------------- F(3, 196) = 44.20
Model | 7214.30058 3 2404.76686 Prob > F = 0.0000
Residual | 10664.5744 196 54.411094 R-squared = 0.4035
-------------+---------------------------------- Adj R-squared = 0.3944
Total | 17878.875 199 89.843593 Root MSE = 7.3764
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
math | .5476883 .0635714 8.62 0.000 .4223166 .6730601
|
prog |
general | -1.248212 1.381794 -0.90 0.367 -3.973304 1.47688
vocati | -3.84865 1.426982 -2.70 0.008 -6.66286 -1.034441
|
_cons | 25.18496 3.677755 6.85 0.000 17.9319 32.43801
------------------------------------------------------------------------------
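The same model form can be run on the built-in auto dataset (an assumption, used so the lines run without the training file). After estimation, individual coefficients can be read by name with _b[].

```stata
sysuse auto, clear
* continuous predictor weight, categorical predictor foreign
regress mpg c.weight i.foreign
* coefficients are available by name after estimation
display "slope for weight: " _b[weight]
display "foreign vs domestic: " _b[1.foreign]
```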
ESTIMATING STATISTICS BASED ON A
MODEL
• Stata provides excellent support for estimating and testing additional statistics
after a regression model has been run
• Stata refers to these as “postestimation” commands, and they can be used
after most regression models
• Some examples:
• Model predictions: predicted outcomes, residuals, influence statistics, etc.
• Joint tests of coefficients or linear combination of statistics
• Marginal estimates
POSTESTIMATION EXAMPLE 1
• predict: Predicted values of the outcome
• can only be used after running a regression
model
* add variable of predicted values of write
* for each observation
predict predwrite
* look at first 5 predicted values
li predwrite write math female in 1/5
+----------------------------------+
| predwr~e write math female |
|----------------------------------|
1. | 46.39197 52 41 0 |
2. | 50.36379 59 53 1 |
3. | 53.51192 33 54 0 |
4. | 47.07766 44 47 0 |
5. | 56.40319 52 57 0 |
+----------------------------------+
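predict can also store residuals through its residuals option. A minimal sketch, assuming the built-in auto dataset; the new variable names mpghat and mpgres are made up for illustration.

```stata
sysuse auto, clear
regress mpg c.weight i.foreign
* predicted values (the default) and residuals
predict mpghat
predict mpgres, residuals
* observed value = prediction + residual
list make mpg mpghat mpgres in 1/5
```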
POSTESTIMATION EXAMPLE 2
• test: test linear combination of regression
coefficients and joint tests of coefficients
* test of whether 2 prog coefficients are
* jointly significant
test 2.prog 3.prog
( 1) 2.prog = 0
( 2) 3.prog = 0
F( 2, 196) = 3.66
Prob > F = 0.0276
* test whether 2 prog coefs are different
test 2.prog-3.prog = 0
( 1) 2.prog - 3.prog = 0
F( 1, 196) = 2.88
Prob > F = 0.0914
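Both kinds of test can be rehearsed on the built-in auto dataset (an assumption), using the five-level variable rep78 as the categorical predictor. The first level is the reference category, so the joint test lists the other four.

```stata
sysuse auto, clear
regress mpg c.weight i.rep78
* joint test that all four rep78 coefficients are zero
test 2.rep78 3.rep78 4.rep78 5.rep78
* test whether levels 4 and 5 differ from each other
test 4.rep78 = 5.rep78
```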
ANALYSIS OF
CATEGORICAL OUTCOMES
tab …, chi2 chi-square test of
independence
logit logistic regression
CHI-SQUARE TEST OF INDEPENDENCE
• The chi-square test of independence
assesses association between 2 categorical
variables
• Answers the question: Are the category
proportions of one variable the same across
levels of another variable?
• Syntax: tab var1 var2, chi2
* chi square test of independence
tab prog ses, chi2
| ses
prog | low middle high | Total
-----------+---------------------------------+----------
academic | 19 44 42 | 105
general | 16 20 9 | 45
vocati | 12 31 7 | 50
-----------+---------------------------------+----------
Total | 47 95 58 | 200
Pearson chi2(4) = 16.6044 Pr = 0.002
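Adding the row option prints row percentages next to the counts, which makes the compared proportions visible. A sketch on the built-in auto dataset (an assumption); the statistic and p-value are also kept in r(chi2) and r(p).

```stata
sysuse auto, clear
* chi-square test with row percentages in each cell
tab foreign rep78, chi2 row
display "chi2 = " r(chi2) ",  p = " r(p)
```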
LOGISTIC REGRESSION
• Logistic regression is used to estimate the effect of multiple predictors on a
binary outcome
• Syntax very similar to regress: logit depvar varlist, where
depvar is a binary outcome variable and varlist is a list of predictors
• Add the or option to output the coefficients as odds ratios
LOGISTIC REGRESSION EXAMPLE
* logistic regression of being in academic program
* on female and math score
* coefficients as odds ratios
logit academic i.female c.math, or
Logistic regression Number of obs = 200
LR chi2(2) = 46.85
Prob > chi2 = 0.0000
Log likelihood = -114.95535 Pseudo R2 = 0.1693
------------------------------------------------------------------------------
academic | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.female | 1.144479 .3680227 0.42 0.675 .6093863 2.149429
math | 1.128431 .0229718 5.94 0.000 1.084293 1.174365
_cons | .0018648 .0020288 -5.78 0.000 .0002211 .0157282
------------------------------------------------------------------------------
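A comparable model can be fitted on the built-in auto dataset (an assumption), with the binary variable foreign as the outcome. The logistic command fits the same model and reports odds ratios without needing the or option.

```stata
sysuse auto, clear
* logit with odds ratios requested through the or option
logit foreign c.mpg c.weight, or
* logistic fits the same model and shows odds ratios by default
logistic foreign c.mpg c.weight
```

An odds ratio below 1 means the odds of the outcome fall as the predictor increases.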

STATA_Training_for_data_science_juniors.pdf

  • 1. STATA Software Training July , 2022 AHRI MNTD DEPARTMENT EMAGEN PROJECT
  • 2. Outline Introduction to STATA STATA Environment Entering Data Data import Exploring Data Descriptive Analysis Data Management
  • 3. What is STATA? ❖ It is a multi-purpose statistical package to help you explore, summarize and analyze datasets. A complete, integrated statistical package that provides data . Easy to use but very powerful data analysis software package Most operations can be accomplished either via the , or directly via typed . The official website is
  • 4. STATA: ADVANTAGES • Command syntax is very compact, saving time • Syntax is consistent across commands, so easier to learn • Competitive with other software regarding variety of statistical tools • Excellent documentation • Exceptionally strong support for • Econometric models and methods • Complex survey data analysis tools • But, limited to one dataset in memory at a time • Must open another instance of Stata to open another dataset
  • 5. The Stata Work Environment Stata’s work environment includes windows, menus and Stata-specific task bar. Windows: The Stata windows give you all the key information about the data file you are using, recent commands, and the results of those commands.
  • 6. The Stata Work Environment 1. Result Window Opens automatically Shows recent commands, output, error messages and help Keeps only about 300 – 600 lines of the most recent outputs. Use the log command to store all outputs in a file. Text is color coded.
  • 7. The Stata Work Environment 2. Command Window It opens automatically Used to enter a command 3. Variable Window It opens automatically Keeps a list of current variables Left-click to use a variable for the list
  • 8. The Stata Work Environment 4. Review Window Keeps a list of all commands typed in a Stata session Right-click window to export everything into a ‘do’ file to be able to run exact same commands latter. Left-click to reuse the command 5. Data Editor A spreadsheet used to edit and enter data It needs to be opened
  • 9. The Stata Work Environment 6. Do-File Editor A do-file is a text (also called batch) file with a series of commands to be executed in order by Stata. It is used to write or edit a stata program Also great for composing, revising, and saving Stata commands. Stata reads and executes whatever commands it contains.
  • 10. The Stata Work Environment To use a do-file: – Click on Do-File Editor. – Enter commands. – Save file with .do extension. . To execute a do-file: – Via command: do pathoffile/filename.do. – Via drop- menu: File → Do …
  • 11. The Stata Work Environment 7. Viewer Used to get help on command syntax with complete list of options and examples It opens automatically When we open Stata for the first time, it looks like the following.
  • 12. When we open Stata for the first time, it looks like
  • 13. When we open Stata for the first time, it looks like
  • 14. Menus Stata displays 8 drop-down menus across the top of the outer window, from left to right: File ➢ Open: open a Stata data file (use) ➢ Save/Save as: save the Stata data in memory to disk ➢ Do: execute a do-file ➢ Filename: copy a filename to the command line ➢ Print : print log or graph
  • 15. Menus Edit ➢ Copy/Paste: copy text among the Command, Results, and Log windows ➢ Copy Table : copy table from Results window to another file ➢ Table copy options: what to do with table lines in Copy Table
  • 16. Menus Data : for stata data management Graphics: for various kinds of statistical graphs Statistics: build and run Stata commands from menus User: menus for user-supplied Stata commands (download from Internet) Window: bring a Stata window to the front Help: Stata command syntax and keyword searches
  • 17. Stata Tool Bar • The buttons on the button bar are from left to right (equivalent command is in bold): • Open a Stata data file: use • Save the Stata data in memory to disk: save • Print a log or graph • Open a log, or suspend/close an open log: log • Open a new viewer
  • 18. Check STATA is Up-to date • If updating is needed, either do it automatically by connecting your PC to internet. • Use Help Check for updates • Or • Use files on the CD/DVD, flash that are under STATA resources.
  • 19. STATA FILE: *.dta: the “input file”. this is stata data file . *.do:the program file that’s act up on the “input file”. This is the text file containing a list of stata commands. Save your program as text file with the “do” extension. Then to run the program at the stata prompt type do filename. *.gph: a file extension used to save stata graphs.
  • 20. STATA FILE: *.log: the “output” file. This file echoes whatever appears on screen in ASCII text format. You ask a stata to echo a session by typing “log using filename” and stata automatically names it filename.log When you are done, you type “log close”.
  • 21. Variable Naming Conventions 1. Variable names can be between 1 & 32 characters 2. Variable names start with a letter or an underscore (cannot begin with a number). 3. Variable names are case sensitive E.g., income and INCOME are different.
  • 22. Variable Naming Conventions 4. Variable names must contain no spaces. 5. Command name must be in lower case. 6. With large data sets, it may be necessary to increase the memory limit in stata from the default of 1 megabyte (note that there must be no data in memory at this stage) Example set memory 100m(100 megabyte), 100gigabayte ….
  • 23. Stata Data Types
      • Each element of data is either of type string or numeric.
      • Strings are stored as str#, for instance str1, str2, str3, ..., str244 (the upper limit in older versions; Stata 13 and later allow up to str2045, plus strL for longer text).
      • Numbers are stored as byte, int, long, float, or double, with the default being float. byte, int, and long are said to be of "integer" type.
      • Display format: format varlist %fmt. Examples: %9.2f, %8.0g, %-18s.
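A small sketch of how storage types and display formats can be inspected and changed. It assumes the auto dataset that ships with Stata (not the training data); the choice of price and of the %9.2f format is purely illustrative.

```stata
* Sketch: storage types versus display formats, using the shipped auto data
sysuse auto, clear
describe make price mpg gear_ratio   // shows storage type and display format
format price %9.2f                   // changes only how price is displayed
list make price in 1/3
compress                             // stores each variable in the smallest type that fits
```

Changing the display format does not alter the stored values; compress changes the storage type but never the values themselves.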
  • 24. Menus vs. Commands
      • Pull-down menus allow the user to get results without needing to know the syntax.
      • Command syntax, in contrast, lets the user reproduce results easily.
      • Some important commands are:
          • use: loads a Stata dataset into memory
          • generate: creates a new variable from other variable(s)
          • gr matrix: scatterplot matrices
  • 25. DO-FILES
      doedit    open the do-file editor
      • Stata do-files are text files where users can store and run their commands for reuse, rather than retyping the commands into the Command window
      ➢ Do-files are scripts of commands
      • Reproducibility
      • Easier debugging and changing of commands
      • Using a do-file is always recommended when working in Stata
      • The file extension .do is used for do-files
  • 26. OPENING THE DO-FILE EDITOR
      • Use the command doedit to open the do-file editor, or click on the pencil-and-paper icon on the toolbar
      • The do-file editor is a text editor specialized for Stata
  • 27. SYNTAX HIGHLIGHTING
      • The do-file editor colors Stata commands blue
      • Comments, which are not executed, are usually preceded by * and are colored green
      • Words in quotes (file names, string values) are colored red
  • 28. RUNNING COMMANDS FROM THE DO-FILE
      • To run a command from the do-file, highlight part or all of the command, and then press Ctrl-D (Mac: Shift+Cmd+D) or click the "Execute (do)" icon, the rightmost icon on the do-file editor toolbar
      • Multiple commands can be selected and executed
  • 29. WORKING DIRECTORY • At the bottom left of the Stata window is the address of the working directory. • Stata will load from and save files to here, unless another directory is specified • Use the command cd to change the working directory
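As a brief sketch, the working directory can be checked and changed from the Command window or a do-file. The folder in the commented-out cd line is a hypothetical placeholder, not a path from this training.

```stata
* Sketch: inspect and change the working directory
pwd                              // print the current working directory
* cd "C:\path\to\my\project"     // hypothetical path: edit it, then remove the leading *
display "`c(pwd)'"               // c(pwd) holds the same path for use inside commands
```

Once the working directory is set, files can be opened and saved by name alone, without typing the full path.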
  • 30. Entering Data • Many Options: ✓Manually enter data into the Stata Data Editor. ✓Copy data into the Data Editor from another source (ex.: Excel). ✓Importing an ASCII (text) file.
  • 31. Manually Input Data
      • Open the Data Editor by:
          • clicking on the Data Editor icon (4th from right on the toolbar; it looks like a data file), or
          • via command: edit
      • You can enter numbers or text (text appears in red).
      • To define variable names, note that variables are automatically named var1, var2, …
  • 32. Manually Input Data
      • Double-click on the top of a column to view/edit "Variable Properties" and change the name.
      • Via command: rename oldvarname newvarname
      • E.g. rename var1 id
  • 33. Inputting from the keyboard
      input age weight
      8 11
      9 12
      8 10
      9 11
      10 15
      end
      • Stata allows data to be entered directly through the keyboard with the input command, even when another dataset is already in memory. This can be useful for adding data that will not be used in the ensuing statistical analysis, such as data for graphing.
      To use input:
      • variable names follow input
      • the keyword end terminates data entry
      • the number of rows does not need to match the data already in memory
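A variation on the example above, sketched with a string variable and a missing value. The names and numbers are made up for illustration; str10 declares name as a string of up to 10 characters.

```stata
* Sketch: keyboard input with a string variable and a missing value
clear
input str10 name age weight
"Abebe" 8 11
"Sara"  9 12
"Musa"  8 .
end
list
describe
```

The period in the last row is stored as Stata's missing numeric value.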
  • 34. Notes on data entry • There are several things to note about data entry and the feedback you get from the Data Editor as you enter data: Stata does not allow blank columns or rows in the middle of your dataset.
  • 35. Notes on data entry Whenever you enter new variables or observations, always begin in the first empty column or row. If you skip columns or rows, Stata will fill in the intervening columns or rows with missing values. Strings and value labels are color coded.
  • 36. Notes on data entry To help distinguish between the different types of variables in the Data Editor, string values are displayed in red, value labels are displayed in blue, and all other values are displayed in black.
  • 37. Notes on data entry You can change the colors for strings and value labels by right-clicking on the Data Editor window and selecting Preferences....
  • 38. Notes on data entry A period (‘.’) represents Stata’s system missing numeric value. The Cursor Location box both shows location and is used for navigation. Quotes around text are unnecessary in string variables.
  • 39. Copy Spreadsheet Data
      • To copy data into the Data Editor from an MS Excel spreadsheet:
          i. Open the spreadsheet with the data.
          ii. Highlight and copy the cells of interest.
          iii. Paste into the Data Editor (via the Edit menu, right-click, toolbar icon, or keyboard shortcut) in the first cell (row and column) where you want the data to begin.
  • 40. Cont…
      To save the data file:
      • Via drop-down menu: File → Save As …
      • Via command: save pathname/datafilename.dta
  • 41. IMPORTING EXCEL DATA SETS
      • Stata can read in data sets stored in many other formats
      • The command import excel is used to import Excel data
      • An Excel filename is required (with path, if not located in the working directory) after the keyword using
      • Use the sheet() option to open a particular sheet
      • Use the firstrow option if variable names are on the first row of the Excel sheet
      * import excel file; change path below before executing
      import excel using "C:\path\myfile.xlsx", sheet("mysheet") firstrow clear
      E.g.
      import excel using "C:\Users\user\Desktop\STATA Traininig\ANTSOKIA.xlsx", ///
          sheet("Sheet1") firstrow clear
  • 42. IMPORTING .csv DATA SETS
      • Comma-separated values files are also commonly used to store data
      • Use import delimited to read in .csv files (and files delimited by other characters such as tab or space)
      • The syntax and options are very similar to import excel
      • But there is no need for the sheet() or firstrow options (the first row is assumed to hold variable names in .csv files)
      * import csv file; change path below before executing
      import delimited using "C:\path\myfile.csv", clear
      E.g.
      import delimited using "C:\Users\user\Desktop\STATA Traininig\BEREHET.csv", clear
  • 43. Using the Menu to Import Excel and .csv Data
      • Because path names can be very long and many options are often needed, menus are often used to import data
      • Select File → Import and then either "Excel spreadsheet" or "Text data (delimited, *.csv, …)"
  • 44. IMPORTING SPSS DATA TO STATA
      • You can use the user-written command usespss to read SPSS files in Stata. You may need to install it first by typing
      ssc install usespss
      usespss using "c:\mydata.sav"
      E.g.
      usespss using "C:\Users\user\Desktop\STATA Traininig\Ensaro.sav"
      • But this command works only in 32-bit Stata for Windows. Stata 16 and later can read .sav files directly with the built-in import spss command.
  • 45. Opening an Existing Stata Datafile
      ❖ Via drop-down menu: File → Open → scroll to find the data file
      ❖ Via command: use filename, clear
      ❖ The clear option removes the dataset currently in memory before loading the requested data file.
      ❖ This is necessary because Stata can have only one dataset in memory at a time!
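A minimal sketch of the save and use cycle, assuming the auto dataset shipped with Stata. It writes to a temporary file (the local macro name mycopy is arbitrary), so nothing on disk is overwritten; run it as a whole do-file, because temporary files and local macros vanish when the run ends.

```stata
* Sketch: save a dataset, change the data in memory, then reload the saved copy
sysuse auto, clear
tempfile mycopy              // temporary file name, removed automatically
save `mycopy'
generate price2 = price * 2  // the data in memory now differ from the saved copy
use `mycopy', clear          // clear discards the unsaved change before loading
describe, short
```

Without the clear option, the second use would stop with an error because the data in memory have changed since they were last saved.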
  • 46. VIEWING DATA: GETTING TO KNOW YOUR DATA
      browse    open spreadsheet of data
      list      print data to Stata console
  • 47. SAMPLE DATASETS FOR TRAINING
      • We will use a dataset consisting of 200 observations (rows) and 13 variables (columns)
      • Each observation is a student
      • Variables
          • Demographics: gender (1=male, 2=female), race, ses (low, middle, high), etc.
          • Academic test scores: read, write, math, science, socst
      • Go ahead and load the dataset!
      * Training dataset
      use "C:\Users\user\Desktop\STATA Traininig\data_for_training.dta", clear
  • 48. BROWSING THE DATASET • Once the data are loaded, we can view the dataset as a spreadsheet using the command browse • The magnifying glass with spreadsheet icon also browses the dataset • Black columns are numeric, red columns are strings, and blue columns are numeric with string labels
  • 49. LISTING OBSERVATIONS
      • The list command prints observations to the Stata console
      • Simply issuing "list" will list all observations and variables
          • Not usually recommended except for small datasets
      • Specify variable names to list only those variables
      • We will soon see how to restrict to certain observations
      * list read and write for first 5 observations
      li read write in 1/5
           +--------------+
           | read   write |
           |--------------|
        1. |   57      52 |
        2. |   68      59 |
        3. |   44      33 |
        4. |   63      44 |
        5. |   47      52 |
           +--------------+
  • 50. SELECTING OBSERVATIONS
      in    select by observation number
      if    select by condition
  • 51. SELECTING BY OBSERVATION NUMBER WITH in
      • in selects by observation (row) number
      • Syntax: in firstobs/lastobs
          • 30/100: observations 30 through 100
      • Negative numbers count from the end
      • "L" means last observation
          • -10/L: tenth observation from the last through the last observation
      * list science for last 3 observations
      li science in -3/L
           +---------+
           | science |
           |---------|
      198. |      55 |
      199. |      58 |
      200. |      53 |
           +---------+
  • 52. SELECTING BY CONDITION WITH if
      • if selects observations that meet a certain condition
          • gender == 1 (male)
          • math > 50
      • The if clause is usually placed after the command specification, but before the comma that precedes the list of options
      * list gender, ses, and math if math > 70
      * with clean output
      li gender ses math if math > 70, clean
             gender      ses   math
        13.       1     high     71
        22.       1   middle     75
        37.       1   middle     75
        55.       1   middle     73
        73.       1   middle     71
        83.       1   middle     71
        97.       2   middle     72
        98.       2     high     71
       132.       2      low     72
       164.       2      low     72
  • 53. STATA LOGICAL AND RELATIONAL OPERATORS
      • == equal to
          • a double equals sign is used to check for equality
      • <, >, <=, >= less than, greater than, less than or equal to, greater than or equal to
      • ! not
      • != not equal
      • & and
      • | or
      * browse gender, ses, and read
      * for females (gender=2) who have read > 70
      browse gender ses read if gender == 2 & read > 70
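The in and if qualifiers and the operators above can be tried on the auto dataset that ships with Stata. This is a sketch on that dataset, not on the training data; the price thresholds are arbitrary. The last command illustrates a common pitfall: Stata treats missing numeric values as larger than any number, so a condition using > also selects missing values unless they are excluded.

```stata
* Sketch: in, if, and logical operators on the shipped auto data
sysuse auto, clear
list make mpg in 1/5                       // first five observations
list make mpg if foreign == 1 & mpg > 30   // and: both conditions must hold
count if price < 3500 | price > 15000      // or: either condition may hold
count if rep78 > 3                         // also counts the missing values of rep78
count if rep78 > 3 & !missing(rep78)       // excludes the missing values
```

Comparing the last two counts shows how many observations have a missing repair record.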
  • 54. Exploring data
      describe     get variable properties
      codebook     inspect variable values
      summarize    summarize distribution
      tabulate     tabulate frequencies
  • 55. EXPLORE DATA BEFORE ANALYSIS • Take the time to explore your data set before embarking on analysis • Get to know your sample • Demographics of subjects • Distributions of key variables • Look for possible errors in variables
  • 56. USE describe TO GET VARIABLE PROPERTIES
      • describe provides the following variable properties:
          • storage type (e.g. byte (integer), float (decimal), str8 (character string variable of length 8))
          • name of value label
          • variable label
      • describe by itself will describe all variables
      • it can be restricted to a list of variables (varlist in Stata lingo)
      * get variable properties
      describe
      Contains data from C:\Users\user\Desktop\STATA Traininig\data_for_training.dta
        obs:           200
       vars:            11                          12 Dec 2008 14:38
       size:         9,600
      ----------------------------------------------------------------
                    storage   display    value
      variable name   type    format     label      variable label
      ----------------------------------------------------------------
      gender          float   %9.0g
      id              float   %9.0g
      race            float   %12.0g     rl
      ses             float   %9.0g      sl
      schtyp          float   %9.0g
      prgtype         str8    %9s
      read            float   %9.0g                 reading score
      write           float   %9.0g                 writing score
      math            float   %9.0g                 math score
      science         float   %9.0g                 science score
      socst           float   %9.0g                 social studies score
      ----------------------------------------------------------------
  • 57. USE codebook TO INSPECT VARIABLE VALUES
      For more detailed information about the values of each variable, use codebook, which provides the following:
      • For all variables
          • number of unique and missing values
      • For numeric variables
          • range, quantiles, means and standard deviation for continuous variables
          • frequencies for discrete variables
      • For string variables
          • frequencies
          • warnings about leading and trailing blanks
      * inspect values of variables read, gender and prgtype
      codebook read gender prgtype
  • 58. DESCRIPTIVE ANALYSIS (SUMMARIES)
      ❑ Summarizing continuous variables
      • The summarize command calculates a variable's:
          • number of non-missing observations
          • mean
          • standard deviation
          • min and max
      * summarize continuous variables
      summarize read math
          Variable |   Obs       Mean   Std. Dev.   Min   Max
      -------------+------------------------------------------
              read |   200      52.23   10.25294     28    76
              math |   200     52.645   9.368448     33    75
      * summarize read and math for females
      summarize read math if gender == 2
          Variable |   Obs       Mean   Std. Dev.   Min   Max
      -------------+------------------------------------------
              read |   109   51.73394   10.05783     28    76
              math |   109    52.3945   9.151015     33    72
  • 59. DETAILED SUMMARIES
      • Use the detail option with summarize to get more estimates that characterize the distribution, such as:
          • percentiles (including the median at the 50th percentile)
          • variance
          • skewness
          • kurtosis
      * detailed summary of read for females
      summarize read if gender == 2, detail
                          reading score
      -------------------------------------------------------------
            Percentiles      Smallest
       1%           34             28
       5%           36             34
      10%           39             34       Obs                 109
      25%           44             35       Sum of Wgt.         109

      50%           50                      Mean           51.73394
                              Largest       Std. Dev.      10.05783
      75%           57             71
      90%           68             73       Variance         101.16
      95%           68             73       Skewness       .3234174
      99%           73             76       Kurtosis       2.500028
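The numbers that summarize prints are also stored for later use. The sketch below, run on the auto dataset shipped with Stata rather than the training data, shows how they might be retrieved; r(mean), r(p25), r(p50) and r(p75) are results left behind by summarize with the detail option.

```stata
* Sketch: reuse the results stored by summarize, detail
sysuse auto, clear
summarize mpg, detail
return list                                   // show everything summarize stored
display "mean = " r(mean) ", median = " r(p50)
display "IQR  = " r(p75) - r(p25)
```

Stored results are overwritten by the next command that stores results, so copy them into a variable, scalar or macro if they are needed later.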
  • 60. Summarizing (Tabulating) Frequencies of Categorical Variables
      • tabulate displays counts of each value of a variable
          • useful for variables with a limited number of levels
      • use the nolabel option to display the underlying numeric values (by removing value labels)
      * tabulate frequencies of ses
      tabulate ses
              ses |      Freq.     Percent        Cum.
      ------------+-----------------------------------
              low |         47       23.50       23.50
           middle |         95       47.50       71.00
             high |         58       29.00      100.00
      ------------+-----------------------------------
            Total |        200      100.00
      * remove labels
      tab ses, nolabel
              ses |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                1 |         47       23.50       23.50
                2 |         95       47.50       71.00
                3 |         58       29.00      100.00
      ------------+-----------------------------------
            Total |        200      100.00
  • 61. TWO-WAY TABULATIONS
      • tabulate can also calculate the joint frequencies of two variables
      • Use the row and col options to display row and column percentages
      • We may have found an error in a race value (5?)
      * with row percentages
      tab race ses, row
                   |               ses
              race |       low     middle       high |     Total
      -------------+---------------------------------+----------
          hispanic |         9         11          4 |        24
                   |     37.50      45.83      16.67 |    100.00
      -------------+---------------------------------+----------
             asian |         3          5          3 |        11
                   |     27.27      45.45      27.27 |    100.00
      -------------+---------------------------------+----------
      african-amer |        11          6          3 |        20
                   |     55.00      30.00      15.00 |    100.00
      -------------+---------------------------------+----------
             white |        24         71         48 |       143
                   |     16.78      49.65      33.57 |    100.00
      -------------+---------------------------------+----------
                 5 |         0          2          0 |         2
                   |      0.00     100.00       0.00 |    100.00
      -------------+---------------------------------+----------
             Total |        47         95         58 |       200
                   |     23.50      47.50      29.00 |    100.00
  • 62. Tabulating By Sort
      • First, recode the student data:
      recode age (18 19 = 1 "18 to 19") ///
          (20/29 = 2 "20 to 29") ///
          (30/39 = 3 "30 to 39") (else=.), generate(agegroups) label(agegroups)
      • Then tabulate separately for each gender, with column percentages (nokey suppresses the key explaining the cell contents):
      bysort gender: tab agegroups major, col nokey
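The same recode-then-tabulate pattern can be tried on the auto dataset that ships with Stata. The cut points and the new variable name mpggroup below are arbitrary choices for this sketch, not part of the training data.

```stata
* Sketch: group a numeric variable with recode, then tabulate by group
sysuse auto, clear
recode mpg (min/19 = 1 "under 20")  ///
           (20/29  = 2 "20 to 29")  ///
           (30/max = 3 "30 and over"), generate(mpggroup)
bysort foreign: tabulate mpggroup   // one table per car type
```

Using generate() leaves the original mpg untouched and attaches the quoted text as value labels to the new variable.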
  • 63. DATA VISUALIZATION
      histogram    histogram
      graph box    boxplot
      graph bar    bar plots
      scatter      scatter plot
      graph pie    pie chart
  • 64. DATA VISUALIZATION (GRAPHICS) • Data visualization is the representation of data in visual formats such as graphs • Graphs help us to gain information about the distributions of variables and relationships among variables quickly through visual inspection • Graphs can be used to explore your data, to familiarize yourself with distributions and associations in your data • Graphs can also be used to present the results of statistical analysis
  • 65. HISTOGRAMS
      • Histograms plot distributions of variables by displaying counts of values that fall into various intervals of the variable
      • Let's use the auto data file for making some graphs.
      sysuse auto.dta
      • The histogram command can be used to make a simple histogram of mpg
      * histogram of mileage (mpg)
      histogram mpg
      [Figure: histogram of Mileage (mpg); y-axis: Density]
  • 66. histogram OPTIONS
      • Use the option normal with histogram to overlay a theoretical normal density
      • Use percent to scale the bar heights as percentages, and addlabels to print each bar's height above it
      • Add discrete to treat the variable as discrete, so that each distinct value gets its own bar
      • Use the width() option to specify the interval width
      * histogram of rep78 in percent, with bar labels and normal density
      hist rep78, percent discrete addlabels normal
      [Figure: histogram of Repair Record 1978 (values 1 to 5); y-axis: Percent; bar labels 2.899, 11.59, 43.48, 26.09, 15.94]
  • 67. BOXPLOTS
      • Boxplots are another popular option for displaying distributions of continuous variables
      • They display the median, the interquartile range (IQR), and outliers (beyond 1.5*IQR)
      • You can request boxplots for multiple variables on the same plot
      * boxplot of mileage (mpg)
      graph box mpg
      [Figure: boxplot of Mileage (mpg)]
  • 68. BOXPLOTS
      • The boxplot can be drawn separately for foreign and domestic cars using the by( ) or over( ) option:
      graph box mpg, by(foreign)
      or
      graph box mpg, over(foreign)
      • As the graph shows, a pair of outliers appears in the box plots produced.
      • These can be removed from the box plot using the nooutsides option (noout for short).
      * boxplot of mpg by car type
      graph box mpg, by(foreign)
      [Figure: boxplots of Mileage (mpg) in two panels, Domestic and Foreign; note: Graphs by Car type]
  • 69. BAR GRAPHS TO VISUALIZE FREQUENCIES
      • Bar graphs are often used to visualize frequencies
      • graph bar produces bar graphs in Stata
      • For displays of frequencies (counts) of each level of a variable, use this syntax:
      graph bar (count), over(variable)
      * bar graph of count of car type
      graph bar (count), over(foreign)
      [Figure: bar graph over car type (Domestic, Foreign); y-axis labeled percent on the slide]
  • 70. TWO-WAY BAR GRAPHS
      • Multiple over(variable) options can be specified
      • The option asyvars will color the bars by the first over() variable
      * frequencies of gender by major
      * asyvars colors bars by major
      graph bar (count), over(major) over(gender) asyvars
      [Figure: grouped bar graph for Female and Male, bars colored by major (Econ, Math, Politics); y-axis: frequency]
  • 71. GROUPED BAR GRAPH USING HISTOGRAM, DISCRETE
      • Side-by-side or grouped bar graphs, for levels of some grouping variable:
      sort groupingvar
      histogram variablename, discrete by(groupingvar)
      • E.g.
      histogram rep78, discrete by(foreign) percent addlabels ///
          xlabel(1 "1" 2 "2" 3 "3" 4 "4" 5 "5") gap(25) ///
          title("Repair Record in 1978") xtitle(" ")
      [Figure: histograms of rep78 in two panels, Domestic (bar labels 4.167, 16.67, 56.25, 18.75, 4.167) and Foreign (bar labels 14.29, 42.86, 42.86); y-axis: Percent; panel title: Repair Record in 1978; note: Graphs by Car type]
  • 72. TWO-WAY SCATTER PLOT
      • The Stata graphing command twoway produces layered graphics, where multiple plots can be overlaid on the same graph
      • Each plot should involve a y-variable and an x-variable that appear on the y-axis and x-axis, respectively
      • Syntax (generally): twoway (plottype1 yvar xvar) (plottype2 yvar xvar)…
      • plottype is one of several types of plots available to twoway, and yvar and xvar are the variables to appear on the y-axis and x-axis
      • See help twoway for a list of the many plottypes available
  • 73. TWO-WAY SCATTER GRAPH EXAMPLE 1
      • A two-way scatter plot can be used to show the relationship between two variables.
      • As we would expect, there is a negative relationship between mpg and weight.
      * scatter plot of mpg against weight
      graph twoway scatter mpg weight
      [Figure: scatter plot; y-axis: Mileage (mpg); x-axis: Weight (lbs.)]
  • 74. LAYERED GRAPH
      • You can also overlay separate plots by group on the same graph with different colors
      • Use if to select groups
      • The mcolor() option controls the color of the markers
      * layered scatter plots of weight and mpg, colored by rep78
      twoway (scatter weight mpg if rep78 == 3, mcolor(blue)) ///
             (scatter weight mpg if rep78 == 4, mcolor(red))
      [Figure: two overlaid scatter plots; y-axis: Weight (lbs.); x-axis: Mileage (mpg)]
  • 75. LAYERED GRAPH
      • Layered graph of a scatter plot and a linear fit (best-fit line)
      * scatter plot of mpg against weight with a fitted line overlaid
      graph twoway (scatter mpg weight, msymbol(o)) (lfit mpg weight), ///
          title("Scatterplot") subtitle("with Overlay Linear Fit")
      [Figure: "Scatterplot", subtitle "with Overlay Linear Fit"; x-axis: Weight (lbs.); legend: Mileage (mpg), Fitted values]
  • 76. Pie CHART
      • Stata can also produce pie charts.
      • The graph pie command with the over option creates a pie chart representing the frequency of each group.
      • The plabel option places the value labels inside each slice of the pie chart.
      graph pie, over(rep78) plabel(_all name) title("Repair Record 1978")
      [Figure: pie chart titled Repair Record 1978, slices labeled 1 to 5]
  • 77. Pie CHART…
      • The same command, applied to car type:
      graph pie, over(foreign) plabel(_all name) title("Car type")
      [Figure: pie chart titled Car type, slices labeled Domestic and Foreign]
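Once a graph looks right, it can be kept for later. The sketch below saves the current graph in Stata's own .gph format and also exports a picture file; the file name mpg_weight is a hypothetical choice, and PNG export is assumed to be available on your installation (it may not be in console-only Stata).

```stata
* Sketch: draw a graph, then save and export it
sysuse auto, clear
twoway (scatter mpg weight) (lfit mpg weight), title("Mileage vs weight")
graph save mpg_weight.gph, replace     // Stata format, can be reopened with graph use
graph export mpg_weight.png, replace   // picture file for reports and slides
```

Both files are written to the working directory unless a path is given.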
  • 78. DATA MANAGEMENT: Creating, Transforming and Labeling Variables
      generate          create variable
      replace           replace values of variable
      egen              extended variable generation
      rename            rename variable
      recode            recode variable values
      label variable    give variable description
      label define      generate value label set
      label values      apply value labels to variable
      encode            convert string variable to numeric
  • 79. GENERATING VARIABLES
      • Variables often do not arrive in the form that we need
      • Use generate (often abbreviated gen) to create variables, usually from arithmetic operations on existing variables
          • sums/differences/products of variables
          • squares of variables
      • If an input value to a generated variable is missing, the result will be missing
      * generate a sum of 3 variables
      generate total = math + science + socst
      (5 missing values generated)
      * it seems 5 missing values were generated
      * let's look at variables
      summarize total math science socst
          Variable |   Obs       Mean   Std. Dev.   Min   Max
      -------------+------------------------------------------
             total |   195   156.4564   24.63553     96   213
              math |   200     52.645   9.368448     33    75
           science |   195   51.66154   9.866026     26    74
             socst |   200     52.405   10.73579     26    71
  • 80. REPLACING VALUES
      • Use replace to replace values of existing variables
      • Often used with if to replace values for a subset of observations
      • Here we see the use of the missing numeric value indicator (.)
      • The missing value for strings is ""
      * replace total with just (math+socst)
      * if science is missing
      replace total = math + socst if science == .
      * no missing totals now
      summarize total
          Variable |   Obs       Mean   Std. Dev.   Min   Max
      -------------+------------------------------------------
             total |   200     155.42   25.47565     74   213
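The effect of a missing value on arithmetic can be checked on a tiny made-up dataset. The scores below are invented for this sketch; missing(science) is an equivalent way of writing science == . that also catches extended missing values.

```stata
* Sketch: a sum is missing whenever one of its inputs is missing
clear
input math science socst
50 60 40
70  . 55
end
generate total = math + science + socst            // second row becomes missing
list
replace total = math + socst if missing(science)   // fall back to two scores
list
```

After the replace, the second row holds 125 (70 + 55) while the first row keeps its three-score total of 150.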
  • 81. CREATING DUMMY INDICATORS
      • It is often necessary to create variables that are 0/1 indicators for belonging to a category of another variable, where 0=FALSE and 1=TRUE
          • often called dummy variables or indicators
      • Remember that Stata often prefers to work with numeric variables
      * create a variable that equals 1 if prgtype
      * equals academic, 0 otherwise
      gen academic = 0
      replace academic = 1 if prgtype == "academic"
      tab prgtype academic
                 |       academic
         prgtype |         0          1 |     Total
      -----------+----------------------+----------
        academic |         0        105 |       105
         general |        45          0 |        45
          vocati |        50          0 |        50
      -----------+----------------------+----------
           Total |        95        105 |       200
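Two other common ways of building indicators, sketched on the auto dataset shipped with Stata. These are alternatives to the gen/replace pair above, not the slide's method; the names highmpg and rep_ and the cutoff of 25 are arbitrary.

```stata
* Sketch: indicators from a logical expression and from tabulate, generate()
sysuse auto, clear
* a logical expression evaluates to 1 (true) or 0 (false);
* the if qualifier keeps missing inputs missing instead of coding them 0
generate byte highmpg = (mpg > 25) if !missing(mpg)
tabulate highmpg
* one 0/1 variable per category of rep78: rep_1, rep_2, ..., rep_5
tabulate rep78, generate(rep_)
summarize rep_*
```

Observations with a missing rep78 get missing values in every rep_ indicator.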
  • 82. EXTENDED GENERATION OF VARIABLES
      • egen (extended generate) creates variables using a wide array of functions, which include:
          • statistical functions that accept multiple variables as arguments
              • e.g. means across several variables
          • functions that accept a single variable, but do not involve simple arithmetic operations
              • e.g. standardizing a variable (subtract mean and divide by standard deviation)
      • See the help file for egen for a full list of available functions
      * egen to generate variables with functions
      * rowmean returns mean of all non-missing values
      egen meantest = rowmean(read math science socst)
      summarize meantest read math science socst
          Variable |   Obs       Mean   Std. Dev.         Min        Max
      -------------+-----------------------------------------------------
          meantest |   200   52.28042   8.400239        32.5   70.66666
              read |   200      52.23   10.25294          28         76
              math |   200     52.645   9.368448          33         75
           science |   195   51.66154   9.866026          26         74
             socst |   200     52.405   10.73579          26         71
      * standardize read
      egen zread = std(read)
      summarize zread
          Variable |   Obs       Mean   Std. Dev.         Min        Max
      -------------+-----------------------------------------------------
             zread |   200  -1.84e-09          1   -2.363225    2.31836
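A small made-up dataset shows how the row functions of egen handle missing values. The scores are invented for this sketch; rowmiss() is another egen function that counts the missing values in each row.

```stata
* Sketch: egen row functions skip missing values
clear
input id read math science
1 50 60  .
2 40 50 60
end
egen meanscore = rowmean(read math science)   // mean of the non-missing scores
egen nmiss     = rowmiss(read math science)   // number of missing scores per row
egen zmath     = std(math)                    // (math - mean) / standard deviation
list
```

For the first student the mean is taken over the two available scores (55), whereas (read + math + science)/3 computed with generate would be missing.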
  • 83. RENAMING AND RECODING VARIABLES
      • rename changes the name of a variable
          • Syntax: rename old_name new_name
      • recode changes the values of a variable to another set of values
      • Here we will rename the gender variable (1=male, 2=female) to "female" and recode its values to (0=male, 1=female)
          • Thus, it will be clear what the coding of female signifies
      * renaming variables
      rename gender female
      * recode values to 0,1
      recode female (1=0) (2=1)
      tab female
           female |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 |         91       45.50       45.50
                1 |        109       54.50      100.00
      ------------+-----------------------------------
            Total |        200      100.00
  • 85. LABELING VARIABLES
      • Short variable names make coding more efficient but can obscure the variable's meaning
      • Use label variable to give the variable a longer description
      • The variable label will sometimes be used in output and often in graphs
      * labeling variables (description)
      label variable math "9th grade math score"
      label variable schtyp "public/private school"
      * the variable label will be used in some output
      histogram math
      tab schtyp
      public/priv |
       ate school |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                1 |        168       84.00       84.00
                2 |         32       16.00      100.00
      ------------+-----------------------------------
            Total |        200      100.00
  • 86. LABELING VALUES
      • Value labels give text descriptions to the numerical values of a variable.
      • To create a new set of value labels use label define
          • Syntax: label define labelname # "label"…, where labelname is the name of the value label set, and (# "label"…) is a list of numbers, each followed by its label.
      • Then, to apply the labels to variables, use label values
          • Syntax: label values varlist labelname, where varlist is one or more variables, and labelname is the value label set name
      * schtyp before labeling values
      tab schtyp
      public/priv |
       ate school |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                1 |        168       84.00       84.00
                2 |         32       16.00      100.00
      ------------+-----------------------------------
            Total |        200      100.00
      * create and apply labels for schtyp
      label define pubpri 1 public 2 private
      label values schtyp pubpri
      tab schtyp
      public/priv |
       ate school |      Freq.     Percent        Cum.
      ------------+-----------------------------------
           public |        168       84.00       84.00
          private |         32       16.00      100.00
      ------------+-----------------------------------
            Total |        200      100.00
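The two kinds of labels can be tried together on a tiny made-up dataset. The three rows and the wording of the variable label are invented for this sketch; the label set name pubpri follows the slide.

```stata
* Sketch: variable label plus value labels on invented data
clear
input id schtyp
1 1
2 2
3 1
end
label variable schtyp "type of school"        // describes the variable
label define pubpri 1 "public" 2 "private"    // creates the value label set
label values schtyp pubpri                    // attaches the set to schtyp
tabulate schtyp
label list pubpri
```

The stored values remain 1 and 2; only their display changes, so conditions such as if schtyp == 2 still use the numbers.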
  • 87. LISTING VALUE LABEL SETS (1)
      • label list displays all value label sets
      • Remember that describe can be used to see which value labels have been applied to which variables
      * list all value label sets
      label list
      pubpri:
                 1 public
                 2 private
      sl:
                 1 low
                 2 middle
                 3 high
      rl:
                 1 hispanic
                 2 asian
                 3 african-amer
                 4 white
  • 88. LISTING VALUE LABEL SETS (2)
      • label list displays all value label sets
      • Remember that describe can be used to see which value labels have been applied to which variables
      * describe shows which value labels
      * have been applied to which variables
      describe
      ----------------------------------------------------------------
                    storage   display    value
      variable name   type    format     label      variable label
      ----------------------------------------------------------------
      female          float   %9.0g
      id              float   %9.0g
      race            float   %12.0g     rl
      ses             float   %9.0g      sl
      schtyp          float   %9.0g      pubpri     public/private school
      prgtype         str8    %9s
      read            float   %9.0g                 reading score
  • 89. Encoding String Variables into Numeric (1)
      • encode converts a string variable into a numeric variable
          • remember that some Stata commands require numeric variables
      • encode will use alphabetical order to order the numeric codes
      • encode will convert the original string values into a set of value labels
      • encode will create a new numeric variable, which must be specified in the option gen(varname)
      * encoding string prgtype into
      * numeric variable prog
      encode prgtype, gen(prog)
      * we see that a value label has been applied to prog
      describe prog
                    storage   display    value
      variable name   type    format     label
      ----------------------------------------
      prog            long    %8.0g      prog
  • 90. ENCODING STRING VARIABLES INTO NUMERIC (2)
      • remember to use the option nolabel to remove value labels from tabulate output
      • Notice that numbering begins at 1
      * we see labels by default in tab
      tab prog
             prog |      Freq.     Percent        Cum.
      ------------+-----------------------------------
         academic |        105       52.50       52.50
          general |         45       22.50       75.00
           vocati |         50       25.00      100.00
      ------------+-----------------------------------
            Total |        200      100.00
      * use option nolabel to remove the labels
      tab prog, nolabel
             prog |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                1 |        105       52.50       52.50
                2 |         45       22.50       75.00
                3 |         50       25.00      100.00
      ------------+-----------------------------------
            Total |        200      100.00
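The alphabetical coding can be verified on a few made-up rows. The four values below are invented for this sketch and deliberately entered out of alphabetical order.

```stata
* Sketch: encode assigns codes in alphabetical order of the strings
clear
input str10 prgtype
"general"
"academic"
"vocati"
"academic"
end
encode prgtype, generate(prog)
list prgtype prog            // prog displays its value labels
list prgtype prog, nolabel   // prog displays the underlying codes
```

Here academic becomes 1, general 2 and vocati 3, regardless of the order in which the values appear in the data.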
  • 91. Dataset operations
      order      reorder variables
      keep       keep variables, drop others
      drop       drop variables, keep others
      keep if    keep observations, drop others
      drop if    drop observations, keep others
      sort       sort by variables, ascending
      gsort      ascending and descending sort
  • 92. SHORTCUTS FOR LISTS OF VARIABLES (1)
      • We specify many varlists (lists of variable names) in Stata, so Stata provides many shortcuts to avoid excessive typing
      • The syntax var_first-var_last specifies all consecutive variables from var_first to var_last
      • The keyword _all means all variables in the dataset
      * summarize all consecutive variables
      * from read to socst
      summ read-socst
          Variable |   Obs       Mean   Std. Dev.   Min   Max
      -------------+------------------------------------------
              read |   200      52.23   10.25294     28    76
             write |   200     52.775   9.478586     31    67
              math |   200     52.645   9.368448     33    75
           science |   195   51.66154   9.866026     26    74
             socst |   200     52.405   10.73579     26    71
  • 93. SHORTCUTS FOR LISTS OF VARIABLES (2)
      • The * symbol in variable names stands for "zero or more characters"
          • r* = all variables that start with "r", followed by anything
          • r*e = all variables that start with "r", followed by anything, but ending with "e"
      * summarize all variables that begin with r
      summ r*
          Variable |   Obs       Mean   Std. Dev.   Min   Max
      -------------+------------------------------------------
              race |   200       3.44   1.049719      1     5
              read |   200      52.23   10.25294     28    76
      * summarize all variables that begin with r
      * and end with e
      summ r*e
          Variable |   Obs       Mean   Std. Dev.   Min   Max
      -------------+------------------------------------------
              race |   200       3.44   1.049719      1     5
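The varlist shortcuts can be explored safely with describe and ds, which only list variables. This sketch uses the auto dataset shipped with Stata, whose variables are, in order, make, price, mpg, rep78, headroom, trunk, weight, length, turn, displacement, gear_ratio and foreign.

```stata
* Sketch: varlist shortcuts on the shipped auto data
sysuse auto, clear
describe t*               // names starting with t: trunk, turn
summarize price-rep78     // consecutive variables: price, mpg, rep78
ds *e                     // names ending in e: make, price
describe _all             // every variable in the dataset
```

Checking a pattern with describe or ds before using it in drop or keep helps avoid removing the wrong variables.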
  • 94. ORDERING VARIABLES
      • Use order to change the ordering of variables
          • Particularly useful for datasets with many variables
      • By default, order will place the variables listed at the beginning of the dataset
          • Use the option last to place them at the end
          • Or use the options before() and after() to place the variables before or after another variable
      * put id and demographic variables first
      order id female race ses schtyp prog
      * put old prgtype variable last
      order prgtype, last
      describe, simple
      id        race      schtyp    read      math      socst     academic  zread
      female    ses       prog      write     science   total     meantest  prgtype
  • 95. SAVE YOUR DATA BEFORE MAKING BIG CHANGES
      • We are about to make changes to the dataset that cannot easily be reversed, so we should save the data before continuing
      • We are going to revert to this saved dataset later
      * save dataset, overwrite existing file
      save hs1, replace
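For short detours, preserve and restore offer an alternative to saving a file. They are not covered in these slides; the sketch below uses the auto dataset shipped with Stata and should be run as one block, since a preserved dataset is restored automatically when a do-file ends.

```stata
* Sketch: temporary changes with preserve and restore
sysuse auto, clear
preserve                 // take a snapshot of the data in memory
keep if foreign == 1     // destructive changes ...
keep make mpg
count                    // only the foreign cars remain
restore                  // bring the snapshot back
count                    // all 74 cars are back
```

Saving to a file, as on the slide, remains the safer choice when the original must survive beyond the current session.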
  • 96. KEEPING AND DROPPING VARIABLES
      • keep preserves the selected variables and drops the rest
          • Use keep if you want to remove most of the variables but keep a select few
      • drop removes the selected variables and keeps the rest
          • Use drop if you want to remove a few variables but keep most of them
      * drop prgtype from dataset
      drop prgtype
      describe, simple
      id        race      schtyp    read      math      socst     academic  zread
      female    ses       prog      write     science   total     meantest
      * keep just id read and math
      keep id read math
      describe, simple
      id    read  math
  • 97. KEEPING AND DROPPING OBSERVATIONS
    • Specify if after keep or drop to preserve or remove observations by condition
    • To be clear, keep if and drop if select observations, while keep and drop select variables

    * keep observations if read > 40
    keep if read > 40
    summ read

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
            read |        178    54.23596     8.96323         41         76

    * now drop if math is outside the range [30,70]
    drop if math < 30 | math > 70
    summ math

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
            math |        168    52.68452    8.118243         35         70
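One caveat when filtering with > is that Stata treats numeric missing values as larger than any number, so a condition such as x > 3 is also true for observations where x is missing. The sketch below shows this with Stata's bundled auto dataset (an assumption; rep78 is used because it contains missing values).

```stata
* load the example dataset shipped with Stata (not hs1)
sysuse auto, clear
* this count includes observations where rep78 is missing
count if rep78 > 3
* exclude missing values explicitly
count if rep78 > 3 & !missing(rep78)
* keep only observations with a nonmissing rep78 above 3
keep if rep78 > 3 & !missing(rep78)
summ rep78
```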
  • 98. SORTING DATA (1) • Use sort to order the observations by one or more variables • sort var1 var2 var3, for example, will sort first by var1, then by var2, then by var3, all in ascending order * sorting * first look at unsorted li in 1/5 +-------------------+ | id read math | |-------------------| 1. | 70 57 41 | 2. | 121 68 53 | 3. | 86 44 54 | 4. | 141 63 47 | 5. | 172 47 57 | +-------------------+
  • 99. SORTING DATA (2) • Use sort to order the observations by one or more variables • sort var1 var2 var3, for example, will sort first by var1, then by var2, then by var3, all in ascending order * now sort by read and then math sort read math li in 1/5 +-------------------+ | id read math | |-------------------| 1. | 37 41 40 | 2. | 30 41 42 | 3. | 145 42 38 | 4. | 22 42 39 | 5. | 124 42 41 | +-------------------+
  • 100. SORTING DATA (3)
    • Use gsort with + or - before each variable to specify ascending and descending order, respectively

    * sort descending read then ascending math
    gsort -read +math
    li in 1/5

         +-------------------+
         |  id   read   math |
         |-------------------|
      1. |  61     76     60 |
      2. | 103     76     64 |
      3. |  34     73     57 |
      4. |  93     73     62 |
      5. |  95     73     71 |
         +-------------------+
  • 101. DATA MANAGEMENT EXERCISES (1) • Let’s use what we have learned so far to create 3 new datasets • males with math and reading scores above 70 • females with math and social studies scores above 70 • race and ses for all students with math score above 70 and either reading or social studies also above 70 • We will want the student id variable in all datasets
  • 102. DATA MANAGEMENT EXERCISES (2)
    • Let’s create a dataset of males with math and reading scores above 70
    • First load the hs1 dataset
    • Now restrict observations to males with math and reading scores above 70
    • Now drop all variables except id, female, math and read
    • Print the dataset to screen
    • Save the dataset with name “males70”

    * first load the hs1 dataset
    use hs1, clear
    * restrict to males with math and reading > 70
    keep if female == 0 & math > 70 & read > 70
    * keep only id female math and read
    keep id female math read
    * print to screen
    li

         +----------------------------+
         |  id   female   read   math |
         |----------------------------|
      1. |  95        0     73     71 |
      2. | 132        0     73     73 |
      3. |  68        0     73     71 |
         +----------------------------+

    * save dataset
    save males70, replace
  • 103. DATA MANAGEMENT EXERCISES (3) • Now create a dataset of females with math and social studies scores above 70 • First load the hs1 dataset • Now restrict observations to females with math and socst scores above 70 • Now drop all variables except id, female, math and socst • Print the dataset to screen • Save the dataset with name “females70” * now for females with math and socst above 70 use hs1, clear keep if female == 1 & math > 70 & socst > 70 * this time keep id female, math, socst keep id female math socst li +-----------------------------+ | id female math socst | |-----------------------------| 1. | 100 1 71 71 | +-----------------------------+ save females70, replace
  • 104. DATA MANAGEMENT EXERCISES (4) • Finally a dataset of race and ses for students with math above 70 and either read or socst above 70 • First load the hs1 dataset • Now restrict observations to everyone with math above 70 and either read or socst above 70 • Now drop all variables except id, female, race, ses, read, math, and socst • Print the dataset to screen • Save the dataset with name “raceses70” * id female race ses for students with * math > 70 and either read > 70 or socst > 70 use hs1, clear keep if math > 70 & (read > 70 | socst > 70) * keep id female race ses read math socst keep id-ses read math socst li +-----------------------------------------------------+ | id female race ses read math socst | |-----------------------------------------------------| 1. | 95 0 white high 73 71 71 | 2. | 132 0 white middle 73 73 66 | 3. | 68 0 white middle 73 71 66 | 4. | 57 1 white middle 71 72 56 | 5. | 100 1 white high 63 71 71 | +-----------------------------------------------------+ save raceses70, replace
  • 106. ANALYSIS OF CONTINUOUS, NORMALLY DISTRIBUTED OUTCOMES
    ci means     confidence intervals for means
    ttest        t-tests
    anova        analysis of variance
    correlate    correlation matrices
    regress      linear regression
    predict      model predictions
    test         test of linear combinations of coefficients
  • 107. MEANS AND CONFIDENCE INTERVALS (1) • Confidence intervals express a range of plausible values for a population statistic, such as the mean of a variable, consistent with the sample data • The mean command provides a 95% confidence interval, as do many other commands • We can change the confidence level of the interval with the ci means command and the level() option * many commands provide 95% CI mean read Mean estimation Number of obs = 200 -------------------------------------------------------------- | Mean Std. Err. [95% Conf. Interval] -------------+------------------------------------------------ read | 52.23 .7249921 50.80035 53.65965 --------------------------------------------------------------
  • 108. MEANS AND CONFIDENCE INTERVALS (2) • We can change the confidence level of the interval with the ci means command and the level() option * 99% CI for read ci means read, level(99) Variable | Obs Mean Std. Err. [99% Conf. Interval] -------------+--------------------------------------------------------------- read | 200 52.23 .7249921 50.34447 54.11553
  • 109. T-TESTS TEST WHETHER THE MEANS ARE DIFFERENT BETWEEN 2 GROUPS • t-tests test whether the mean of a variable is different between 2 groups • The t-test assumes that the variable is normally distributed • The independent samples t-test assumes that the two groups are independent (uncorrelated) • Syntax for independent samples t-test: • ttest var, by(groupvar), where var is the variable whose mean will be tested for differences between levels of groupvar
  • 110. INDEPENDENT SAMPLES T-TEST EXAMPLE * independent samples t-test ttest read, by(female) Two-sample t test with equal variances ------------------------------------------------------------------------------ Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ---------+-------------------------------------------------------------------- 0 | 91 52.82418 1.101403 10.50671 50.63605 55.0123 1 | 109 51.73394 .9633659 10.05783 49.82439 53.6435 ---------+-------------------------------------------------------------------- combined | 200 52.23 .7249921 10.25294 50.80035 53.65965 ---------+-------------------------------------------------------------------- diff | 1.090231 1.457507 -1.783998 3.964459 ------------------------------------------------------------------------------ diff = mean(0) - mean(1) t = 0.7480 Ho: diff = 0 degrees of freedom = 198 Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(T < t) = 0.7723 Pr(|T| > |t|) = 0.4553 Pr(T > t) = 0.2277
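The output on this slide is labeled "with equal variances", which is what ttest assumes by default. If that assumption looks doubtful, the unequal option requests a test that does not pool the variances (using Satterthwaite's degrees of freedom). A minimal sketch, assuming Stata's bundled auto dataset in place of hs1:

```stata
* load the example dataset shipped with Stata (not hs1)
sysuse auto, clear
* compare the two group standard deviations first
sdtest mpg, by(foreign)
* default test, which assumes equal variances
ttest mpg, by(foreign)
* same comparison without assuming equal variances
ttest mpg, by(foreign) unequal
* the two-sided p-value is stored in r(p)
display r(p)
```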
  • 111. PAIRED SAMPLES T-TEST (1) • The paired-samples (dependent samples) t-test assesses whether the means of 2 variables are the same when the measurements of the 2 variables are not independent • 2 variables measured on the same individual • one variable measured for parent, the other variable measured for child • Syntax for paired samples t-test • ttest var1 == var2
  • 112. PAIRED SAMPLES T-TEST EXAMPLE * paired samples t-test ttest read == write Paired t test ------------------------------------------------------------------------------ Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ---------+-------------------------------------------------------------------- read | 200 52.23 .7249921 10.25294 50.80035 53.65965 write | 200 52.775 .6702372 9.478586 51.45332 54.09668 ---------+-------------------------------------------------------------------- diff | 200 -.545 .6283822 8.886666 -1.784142 .6941424 ------------------------------------------------------------------------------ mean(diff) = mean(read - write) t = -0.8673 Ho: mean(diff) = 0 degrees of freedom = 199 Ha: mean(diff) < 0 Ha: mean(diff) != 0 Ha: mean(diff) > 0 Pr(T < t) = 0.1934 Pr(|T| > |t|) = 0.3868 Pr(T > t) = 0.8066
  • 113. ANALYSIS OF VARIANCE • Analysis of Variance (ANOVA) models traditionally assess whether means of a continuous variable are different across multiple groups (possibly represented by multiple categorical variables) • ANOVA assumes the dependent variable is normally distributed • ANOVA is not one of Stata’s strengths • Syntax: anova depvar varlist • where depvar is the name of the dependent variable, and varlist is a list of predictors, assumed to be categorical • If a predictor is to be treated as continuous (ANCOVA model), precede its variable name with c.
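The c. prefix for an ANCOVA model can be sketched as follows. The example assumes Stata's bundled auto dataset rather than hs1, with foreign as the categorical factor and weight as the continuous covariate.

```stata
* load the example dataset shipped with Stata (not hs1)
sysuse auto, clear
* one-way ANOVA: mpg by the categorical factor foreign
anova mpg foreign
* ANCOVA: the c. prefix makes weight a continuous covariate
anova mpg foreign c.weight
```

Without the c. prefix, anova would treat weight as categorical and try to fit one level per distinct value, which is rarely what is wanted.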
  • 114. ANOVA EXAMPLE * 2-way ANOVA of write by female and prog anova write female prog Number of obs = 200 R-squared = 0.2408 Root MSE = 8.32211 Adj R-squared = 0.2291 Source | Partial SS df MS F Prob>F -----------+---------------------------------------------------- Model | 4304.4027 3 1434.8009 20.72 0.0000 | female | 1128.7049 1 1128.7049 16.30 0.0001 prog | 3128.1889 2 1564.0944 22.58 0.0000 | Residual | 13574.472 196 69.257512 -----------+---------------------------------------------------- Total | 17878.875 199 89.843593
  • 115. CORRELATION (1) • A correlation coefficient quantifies the linear relationship between two (continuous) variables on a scale between -1 and 1 • Syntax: correlate varlist • The output will be a correlation matrix that shows the pairwise correlation between each pair of variables * correlation of write and math correlate write math (obs=200) | write math -------------+------------------ write | 1.0000 math | 0.6174 1.0000
  • 116. CORRELATION (2) • A correlation coefficient quantifies the linear relationship between two (continuous) variables on a scale between -1 and 1 • Syntax: correlate varlist • The output will be a correlation matrix that shows the pairwise correlation between each pair of variables * correlation matrix of 5 variables corr read write math science socst (obs=195) | read write math science socst -------------+--------------------------------------------- read | 1.0000 write | 0.5960 1.0000 math | 0.6492 0.6203 1.0000 science | 0.6171 0.5671 0.6166 1.0000 socst | 0.6175 0.5996 0.5299 0.4529 1.0000
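The drop from obs=200 to obs=195 happens because correlate uses only observations that are nonmissing on every listed variable. The pwcorr command instead uses all available observations for each pair. A minimal sketch, assuming Stata's bundled auto dataset (rep78 contains missing values):

```stata
* load the example dataset shipped with Stata (not hs1)
sysuse auto, clear
* listwise deletion: only rows complete on all three variables
correlate price mpg rep78
* pairwise deletion: obs shows the N used for each pair,
* sig adds the p-value for each correlation
pwcorr price mpg rep78, obs sig
```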
  • 117. LINEAR REGRESSION • Linear regression, or ordinary least squares regression, models the effects of one or more predictors, which can be continuous or categorical, on a normally-distributed outcome • Linear regression and ANOVA are actually the same model expressed in different ways • Syntax: regress depvar varlist, where depvar is the name of the dependent variable, and varlist is a list of predictors, now assumed to be continuous • To be safe, precede variable names with i. to denote categorical predictors and c. to denote continuous predictors • For categorical predictors with the i. prefix, Stata will automatically create dummy 0/1 indicator variables and enter all but one (the first, by default) into the regression
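The omitted (base) category of an i. predictor can be changed with the ib#. prefix. A minimal sketch, assuming Stata's bundled auto dataset in place of hs1, with rep78 as the categorical predictor:

```stata
* load the example dataset shipped with Stata (not hs1)
sysuse auto, clear
* default: the lowest level of rep78 (1) is the omitted base
regress price c.weight i.rep78
* ib3. makes level 3 the base category instead
regress price c.weight ib3.rep78
```

Both commands fit the same model; only the reference group for the rep78 coefficients changes.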
  • 118. LINEAR REGRESSION EXAMPLE * linear regression of write on continuous * predictor math and categorical predictor prog regress write c.math i.prog Source | SS df MS Number of obs = 200 -------------+---------------------------------- F(3, 196) = 44.20 Model | 7214.30058 3 2404.76686 Prob > F = 0.0000 Residual | 10664.5744 196 54.411094 R-squared = 0.4035 -------------+---------------------------------- Adj R-squared = 0.3944 Total | 17878.875 199 89.843593 Root MSE = 7.3764 ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- math | .5476883 .0635714 8.62 0.000 .4223166 .6730601 | prog | general | -1.248212 1.381794 -0.90 0.367 -3.973304 1.47688 vocati | -3.84865 1.426982 -2.70 0.008 -6.66286 -1.034441 | _cons | 25.18496 3.677755 6.85 0.000 17.9319 32.43801 ------------------------------------------------------------------------------
  • 119. ESTIMATING STATISTICS BASED ON A MODEL • Stata provides excellent support for estimating and testing additional statistics after a regression model has been run • Stata refers to these as “postestimation” commands, and they can be used after most regression models • Some examples: • Model predictions: predicted outcomes, residuals, influence statistics, etc. • Joint tests of coefficients or tests of linear combinations of coefficients • Marginal estimates
  • 120. POSTESTIMATION EXAMPLE 1 • predict: Predicted values of the outcome • can only be used after running a regression model * add variable of predicted values of write * for each observation predict predwrite * look at first 5 predicted values li predwrite write math female in 1/5 +----------------------------------+ | predwr~e write math female | |----------------------------------| 1. | 46.39197 52 41 0 | 2. | 50.36379 59 53 1 | 3. | 53.51192 33 54 0 | 4. | 47.07766 44 47 0 | 5. | 56.40319 52 57 0 | +----------------------------------+
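By default predict returns fitted values; options request other quantities such as residuals. A minimal sketch, assuming Stata's bundled auto dataset in place of hs1 (the new variable names mpg_hat and mpg_res are illustrative):

```stata
* load the example dataset shipped with Stata (not hs1)
sysuse auto, clear
regress mpg c.weight i.foreign
* fitted values (the default)
predict mpg_hat
* residuals: observed minus fitted
predict mpg_res, residuals
* OLS residuals from a model with a constant average to zero
summ mpg_res
li mpg mpg_hat mpg_res in 1/5
```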
  • 121. POSTESTIMATION EXAMPLE 2 • test: test linear combination of regression coefficients and joint tests of coefficients * test of whether 2 prog coefficients are * jointly significant test 2.prog 3.prog ( 1) 2.prog = 0 ( 2) 3.prog = 0 F( 2, 196) = 3.66 Prob > F = 0.0276 * test whether 2 prog coefs are different test 2.prog-3.prog = 0 ( 1) 2.prog - 3.prog = 0 F( 1, 196) = 2.88 Prob > F = 0.0914
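Two related postestimation commands are lincom, which reports the estimate and confidence interval for a linear combination of coefficients, and margins, which reports model-based means. The sketch below assumes Stata's bundled auto dataset in place of hs1.

```stata
* load the example dataset shipped with Stata (not hs1)
sysuse auto, clear
regress price c.weight i.rep78
* test whether the level 4 and level 5 coefficients differ
test 4.rep78 = 5.rep78
* lincom gives the estimated difference, SE and CI
lincom 5.rep78 - 4.rep78
* adjusted mean of price at each level of rep78
margins rep78
```

For a single linear restriction, the F test from test and the t test from lincom give the same p-value.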
  • 122. ANALYSIS OF CATEGORICAL OUTCOMES
    tab …, chi2    chi-square test of independence
    logit          logistic regression
  • 123. CHI-SQUARE TEST OF INDEPENDENCE • The chi-square test of independence assesses association between 2 categorical variables • Answers the question: Are the category proportions of one variable the same across levels of another variable? • Syntax: tab var1 var2, chi2 * chi square test of independence tab prog ses, chi2 | ses prog | low middle high | Total -----------+---------------------------------+---------- academic | 19 44 42 | 105 general | 16 20 9 | 45 vocati | 12 31 7 | 50 -----------+---------------------------------+---------- Total | 47 95 58 | 200 Pearson chi2(4) = 16.6044 Pr = 0.002
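Additional tabulate options help interpret the chi-square test. The sketch below assumes Stata's bundled auto dataset in place of hs1: row adds row percentages, expected adds the expected counts under independence, and exact requests Fisher's exact test, which may be preferable when expected counts are small.

```stata
* load the example dataset shipped with Stata (not hs1)
sysuse auto, clear
* chi-square test with row percentages
tab foreign rep78, chi2 row
* show expected counts to check for sparse cells
tab foreign rep78, chi2 expected
* Fisher's exact test as an alternative for sparse tables
tab foreign rep78, exact
```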
  • 124. LOGISTIC REGRESSION • Logistic regression is used to estimate the effect of multiple predictors on a binary outcome • Syntax very similar to regress: logit depvar varlist, where depvar is a binary outcome variable and varlist is a list of predictors • Add the or option to output the coefficients as odds ratios
  • 125. LOGISTIC REGRESSION EXAMPLE * logistic regression of being in academic program * on female and math score * coefficients as odds ratios logit academic i.female c.math, or Logistic regression Number of obs = 200 LR chi2(2) = 46.85 Prob > chi2 = 0.0000 Log likelihood = -114.95535 Pseudo R2 = 0.1693 ------------------------------------------------------------------------------ academic | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- 1.female | 1.144479 .3680227 0.42 0.675 .6093863 2.149429 math | 1.128431 .0229718 5.94 0.000 1.084293 1.174365 _cons | .0018648 .0020288 -5.78 0.000 .0002211 .0157282 ------------------------------------------------------------------------------
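Without the or option, logit reports coefficients on the log-odds scale; exponentiating a coefficient gives the corresponding odds ratio. The logistic command fits the same model and reports odds ratios by default. A minimal sketch, assuming Stata's bundled auto dataset in place of hs1, with foreign as the binary outcome:

```stata
* load the example dataset shipped with Stata (not hs1)
sysuse auto, clear
* coefficients on the log-odds scale
logit foreign c.mpg c.weight
* odds ratio for mpg computed by hand
display exp(_b[mpg])
* same model, reported as odds ratios
logit foreign c.mpg c.weight, or
* logistic reports odds ratios by default
logistic foreign c.mpg c.weight
```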