Introduction to STATA - Ali Rashed

CAPMAS
International Statistics Day
20-10-2015
Egyptian Economic Census
Workshop 2012/2013
Introduction to STATA
By Ali Rashed
Population Council
17th – 19th October 2015

 STATA is a complete, integrated statistical
package that provides everything you
need for data analysis, data management,
and graphics.
 STATA is not sold in pieces, which means
you get everything you need in one
package without annual license fees.
 Fast, Accurate, and Easy to use:
WHY STATA

• You can access all of STATA’s data
management, statistical, and analysis features
from the menus and associated dialogs.
• Command syntax: simple and consistent
• Online help, with a topical index built into the
online help system
• All analyses can be reproduced and
documented for publication and review.
WHY STATA, Cont.

• Run Stata, open a data set, describe its contents, and exit:
– Run the Stata program from the “Start” button
– Open a Stata data set with the “use” command,
or from the “File” pull-down menu
– Example:
cd "D:\My Documents\Training Courses\UNDP Jordan June 2011\Jordan LMPS2010"
use "JLMPS indiv public v1_ 0.dta", clear
– “describe” command: lists the variables of the data set in memory
– dir and cd commands work just like in DOS
• STATA commands are case-sensitive:
type them in lowercase
Opening (Using) a data set
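
The steps on this slide can be tried without the workshop file. This is only a sketch: it assumes auto.dta, the example data set shipped with Stata, as a stand-in for the census data.

```stata
* Assumption: auto.dta (shipped with Stata) stands in for the workshop data
pwd                       // show the current working directory
sysuse auto, clear        // load the example data, discarding data in memory
describe                  // variable names, storage types, and labels
describe, short           // only the size of the data set
```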

 Note the FOUR main Windows:
1. Command: where you type the commands to issue to Stata
2. Results: where the output of your commands appears
3. Variables: lists the variables of the data set
active in memory. Click on a variable name to put it
into the Command window
4. Review: keeps track of the commands issued, so
each command you type is displayed here.
– Click on a command to put it into the Command window for
editing
– Double-click on a command to execute it directly
The STATA Display

 You can resize these 4 windows
independently, and you can resize the outer window as well.
• To save your window size changes, click on Edit, Preferences, Save
Preferences Set
Main FOUR Windows, Cont.

File types:
• xxxx.do files → text files with your
commands, for future reference and editing
• xxxx.log files → text files with your output,
for future reference and printing
• xxxx.dta files → data files in Stata format
• xxxx.gph files → graph files in Stata format
• xxxx.ado files → programs in Stata
STATA File Types

 Log File: For good documentation of operations and
output
• Variable storage types:
– byte: 1 byte (integers from -127 to 100)
– int: 2 bytes (integers from -32,767 to 32,740)
– long: 4 bytes (integers of up to 9 digits)
– float: 4 bytes (about 7 digits of accuracy)
– double: 8 bytes (about 16 digits of accuracy)
• “compress” command: reduces each variable to the
smallest storage type that holds its values without loss
• set memory 500m, perm (needed only in Stata 11 and earlier; from Stata 12 on, memory is managed automatically)
Describing Data
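
A minimal sketch of logging and of what compress does to a storage type. The log file name and the variable price_d are hypothetical, and auto.dta is assumed as example data.

```stata
capture log close                      // close any log that is already open
log using storage_demo, text replace   // hypothetical log file name
sysuse auto, clear
describe price mpg                     // both are stored as int
generate double price_d = price        // hypothetical copy stored as double
compress                               // price_d holds only integers, so it shrinks
describe price_d                       // now stored as int
log close
```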

 Commands
 summarize (or summarize x y z)
 provides summary statistics for all or a subset of variables
• Remember: STATA commands are case-sensitive
• You can always use abbreviations if they are
not ambiguous
– e.g. sum x
 summarize by subgroup
 sort groupvar
 bysort groupvar: sum varname
Summarizing
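
The summarize variants listed here, sketched on auto.dta (an assumed stand-in for the workshop data):

```stata
sysuse auto, clear
summarize                       // all variables
summarize price mpg             // a subset of variables
sum price                       // unambiguous abbreviation of summarize
summarize price, detail         // adds percentiles, skewness, and kurtosis
bysort foreign: summarize mpg   // sorts by foreign, then summarizes each group
```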

 in qualifier
 Defines range of observation that command
applies to
 Examples:
list in 5/10
list gov pubpriv frame sector4d sector2d in 4/l
(the letter l refers to last)
Edit command

Specifying Subsets of the Data
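
A short sketch of the in qualifier, again assuming auto.dta as example data:

```stata
sysuse auto, clear
list make price in 5/10     // observations 5 through 10
list make price in 70/l     // observation 70 through the last one
list make price in -3/l     // the last three observations
count in 1/10               // counts the observations in the range
```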

if qualifier
• Restricts a command to the observations that satisfy a
condition
• Examples:
– sum empl weight totprod totsales totwage outputfc
totva netva netindtax profit1 population if pubpriv == 1
– sum empl weight totprod totsales totwage outputfc
totva netva netindtax profit1 population hhcount if
profit1 >= 0 & profit1 < 200
– count if profit1 < 0 & pubpriv == 1 // how many?
– tab gov if profit1 < 0 & pubpriv == 1
– tab sector4d if profit1 < 0 & pubpriv == 1
– tab sector2d if profit1 < 0 & pubpriv == 1

• == is equal to
• != is not equal to (~= also works)
• > is greater than
• < is less than
• >= is greater than or equal to
• <= is less than or equal to
• & is and, | is or, ! is not (these combine conditions)
Relational and logical operators
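
The if qualifier and the operators can be combined as in this sketch on auto.dta (assumed example data). Note that Stata treats a missing value as larger than any number, so conditions using > or != usually need an explicit missing-value check.

```stata
sysuse auto, clear
summarize price if foreign == 1
count if price >= 4000 & price < 6000
count if rep78 != 3 & !missing(rep78)   // without the check, missing values would count too
tabulate rep78 if foreign == 0 | price > 10000
```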

 “generate” command is used to create new
variables
• “replace” command is used to modify an existing
variable
• Examples:
sum profit1 // net profit
gen LnProfit = ln(profit1) // missing where profit1 <= 0
generate durEstablished = 2015 - firmage
sum durEstablished
replace durEstablished = . if durEstablished < 0
recode durEstablished (min/0 = .) // like the line above, but also sets 0 to missing
Transforming Variables
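
A runnable sketch of generate, replace, and recode on auto.dta (assumed example data). The new variable names and the cut-off of 15 are hypothetical.

```stata
sysuse auto, clear
generate ln_price = ln(price)            // hypothetical new variable
generate price_k = price / 1000          // price in thousands
replace price_k = . if price_k > 15      // treat very high values as missing
recode rep78 (1/2 = 1) (3 = 2) (4/5 = 3), generate(rep3)   // keeps rep78 intact
summarize ln_price price_k rep3
```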

Basic descriptive commands
• describe or d
– Gives a summary of the current data file:
• Number of observations/variables
• Data file size
• List of variables (name, storage type, labels)
• codebook
– Variable summary:
• Type, range, values, frequency
• list or l
– Display the values of the variables for each
observation
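
For instance, on auto.dta (assumed example data), codebook and list can be limited to a few variables and observations:

```stata
sysuse auto, clear
codebook rep78 foreign          // type, range, missing values, frequencies
list make price mpg in 1/5      // values of three variables, first five observations
```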

Basic data set management
• Sort
– Sorts the data set
– Examples: sort gov or sort gov sector2d
• Keeping variables
– Examples:
• keep id gov pubpriv sector2d profit1 : keeps only these variables
• Dropping variables
– Examples:
• drop gov pubpriv frame : drops these variables
• drop gov-prjs : drops all variables from gov to prjs (in data set order)
• drop w* : drops all variables beginning with w
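
The same commands sketched on auto.dta (assumed example data), including a variable range and a wildcard:

```stata
sysuse auto, clear
sort foreign price
keep make price mpg rep78 weight length foreign
drop weight-length      // weight and length are adjacent after the keep
drop m*                 // make and mpg
describe                // price, rep78, and foreign remain
```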

Creation of variables
• Command: generate (or gen)
• Create string variables
• gen str10 cityname = "Cairo"
• Create numeric variables
• gen Net_Profit = profit1 - netindtax (type float by default)
• gen double Sales_Per_worker = totsales / saleworktot (a ratio needs float or double; byte holds only small integers)
• Change a variable type:
• gen str7 cluster = substr(id,7,12)
• edit id cluster
• gen str4 year = "2015"
• destring year, replace
• Rename variables
– ren oldname newname, e.g. ren id firm_id
• Recode variable values
• recode profit1 netindtax (min/0=.) (recode accepts a list of variables, so the older “for var ... :” prefix is not needed)
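
A small sketch of creating string and numeric variables, converting a string to a number, and renaming. It assumes auto.dta as example data, and every new variable name is hypothetical.

```stata
sysuse auto, clear
generate str10 origin = "Domestic"
replace origin = "Foreign" if foreign == 1
generate double price_per_lb = price / weight
generate str4 year = "1978"      // hypothetical string variable
destring year, replace           // year is now numeric
rename mpg miles_per_gallon
describe origin price_per_lb year miles_per_gallon
```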

Variable Labels and Value Labels
• Labelling variables
• label var gov "Governorate"
• label var profit1 "Net Profit in ,000"
• Labelling variable values (2 steps)
• label def yesno 0 "No" 1 "Yes"
• label val public yesno
• Changing value labels
• label def yesno 1 "Yes" 2 "No", modify
• label val public yesno
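
The two-step value labelling can be sketched as follows. The indicator expensive and the 6000 threshold are hypothetical, and auto.dta is assumed as example data.

```stata
sysuse auto, clear
label variable price "Price in 1978 US dollars"
generate byte expensive = price > 6000     // hypothetical 0/1 indicator
label define yesno 0 "No" 1 "Yes"          // step 1: define the label
label values expensive yesno               // step 2: attach it to the variable
tabulate expensive
label define yesno 0 "Cheap", modify       // changes the text for value 0 only
tabulate expensive
```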

Identify and Delete duplicated
observations
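• duplicates list id : lists the observations with a duplicated id
• duplicates report id : reports how many observations have 1, 2, ... copies
• duplicates browse
• duplicates tag id, gen(tag) : tag holds the number of duplicates of each observation
• duplicates drop : drops observations that are identical on all variables
• duplicates drop id, force : force is required when a variable list is given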
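
To see these commands at work, duplicates can be created artificially. A sketch on auto.dta (assumed example data), where make identifies each car:

```stata
sysuse auto, clear
expand 2 in 1/3                       // copy the first three observations
duplicates report make
duplicates tag make, generate(dup)    // dup = number of extra copies
list make dup if dup > 0
duplicates drop make, force           // keep one observation per make
isid make                             // errors out if make is not unique
```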

 tabulate command produces frequency cross-tabs of
one or two variables
 tabu gov
 tabu gov sector2d
 tabu gov sector2d,col
 tabu gov sector2d, col row missing nolabel nofreq
 tab1 varlist - performs one-way tables for varlist (tab1 gov
sector4d sector2d )
 tab2 varlist - performs all possible 2-way tables for varlist
(tab2 age sector2d sector4d)
 Table Command
Tabulation
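
The tabulate family sketched on auto.dta (assumed example data):

```stata
sysuse auto, clear
tabulate rep78                                       // one-way table
tabulate rep78 foreign                               // two-way table
tabulate rep78 foreign, column row missing nofreq    // percentages only, missing shown
tab1 rep78 foreign                                   // a one-way table for each variable
tab2 rep78 foreign headroom                          // all pairwise two-way tables
```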

 Several types of weights
- fweight or frequency weights: are weights that indicate the
number of duplicated observations
- aweight or analytical weights: are weights that are inversely
proportional to the variance of an observation.
- iweight or importance weights: are weights that indicate
the "importance" of the observation in some vague sense.
- pweight or probability weight: or sampling weights, are
weights that denote the inverse of the probability that the
observation is included due to the sampling design.
Using Weights

 EXAMPLES
 Frequency weights
 tabu gov sector2d [fweight=int(weight)],
row col
• Analytical weights
– tabu gov sector2d [aweight=weight]
Using Weights
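
A sketch of the weight syntax on auto.dta (assumed example data). The constant weight w is hypothetical and only shows that a frequency weight of 2 doubles the table total.

```stata
sysuse auto, clear
generate int w = 2                     // hypothetical frequency weight
tabulate foreign [fweight = w]         // each observation counts twice
summarize price [aweight = weight]     // weighted mean, car weight as the weight
```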

 To add observations from two files with the same
variables
 append command
 To add variables from two files with similar
observations
 merge
 To add variables from two files with different
observations (e.g. individuals and household)
 merge idvar
Combining 2 or More STATA Files
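
Appending can be sketched by splitting auto.dta (assumed example data) into two parts and stacking them again. The temporary file name is hypothetical.

```stata
sysuse auto, clear
preserve
keep if foreign == 1
tempfile foreign_cars            // hypothetical temporary file
save `foreign_cars'
restore
keep if foreign == 0             // domestic cars only
append using `foreign_cars'      // add the foreign cars back as new observations
tabulate foreign
```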

 Merging by unique id allows you to combine variables from
two different STATA data sets
 Examples
 Merging an individual’s employment variables to his/her
demographic characteristics
 Merging the parent’s info to the individual’s demographic file.
 Merging information on a parent who is present in the
household to an individual’s demographic file
 Merging community information to the individual or
household level files
Merging Files

 The objective is to match observations that share a
unique id from two files
 The master file: the file to merge into
 The using file: the file to merge from
 Examples with two files containing indiv. information
 open the file containing the variables you need
 use filename, clear
 keep the unique id and the variables you need
 keep indid hhid gov pn varnames
Match Merge

• sort by unique id
– sort id
• save under new name
– save temp1
• use master data set
– use "ORIGINAL FILE.dta", clear
• sort by unique id
– sort id
• merge by unique id
– merge id using temp1 (old syntax; in Stata 11 and later: merge 1:1 id using temp1, which needs no prior sorting)
Match Merge (2)

• checking how successful your merge was
– tabu _merge
– _merge==3 observations in both master and using
– _merge==2 observations in using but not in master
– _merge==1 observations in master but not in using
– drop _merge
• update option
– substitutes missing values in master with nonmissing values in
using for the same variables
• replace option (used together with update)
– replaces any value in master with the non-missing value in using
Match Merge (3)
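
The whole match-merge workflow, sketched with the merge 1:1 syntax of Stata 11 and later. It assumes auto.dta as example data, with make as the unique id. Four observations are dropped from the master on purpose so that _merge takes more than one value.

```stata
sysuse auto, clear
preserve
keep make price mpg              // using file: id plus the variables to add
tempfile using_file              // hypothetical temporary file
save `using_file'
restore
keep make weight foreign         // master file
drop in 1/4                      // make the master incomplete on purpose
merge 1:1 make using `using_file'
tabulate _merge                  // 70 matched, 4 found only in the using file
keep if _merge == 3
drop _merge
```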

 1- Merging individual-level data into individual level files
 2- Merging household level data into individual-level file
 3- Merging individual-level data into household-level file
Types of Match Merge
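
The second type (group-level data merged into individual-level data) corresponds to merge m:1 in Stata 11 and later. In this sketch, the car's origin plays the role of the household and mean_price is a hypothetical group-level variable built from auto.dta (assumed example data).

```stata
sysuse auto, clear
preserve
collapse (mean) mean_price = price, by(foreign)   // one row per group
tempfile group_level
save `group_level'
restore
merge m:1 foreign using `group_level'             // many cars per group
assert _merge == 3                                // every car found its group
drop _merge
```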

 On-line help is one of the most useful aspects of
STATA
 Now connected to STATA Corp web site through the
net
 Help menu
 search
 stata commands
 Stata Technical Bulletin
Using STATA’s on-line help

 What’s new in STATA
 STATA is web-aware
 use data sets over the web
 example: use
http://www.stata.com/manual/oddeven.dta, clear
 updates
 update query
 check out help menu
For Advanced Users

 Stata can accept data in several forms.
 Stata Editor:
 Enter a small data set consisting of 6 observations, and three
variables, where var1 is the name of individual, var2 is his
income, and var3 is his/her consumption.
 Then, “list”, “describe”, and “save”.
 Stata can read ASCII (text) file,
 Delimited ASCII, data separated by : spaces, comma, tab.
 Fixed length ASCII file
 Utilities to transform data sets from one form (say SPSS,
Excel, etc.) into other forms (e.g. Stat/Transfer).
Inputting and Reading Data
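
Instead of the Data Editor, the same small data set can be typed with the input command. All names and figures below are made up for illustration, and the file name is hypothetical.

```stata
clear
input str10 name income consumption
"Person1" 1200  900
"Person2" 1500 1100
"Person3"  800  750
"Person4" 2000 1300
"Person5"  950  900
"Person6" 1750 1200
end
list
describe
save small_example, replace      // hypothetical file name
```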

 ASCII delimited files are text files where data are
separated by delimiters
• If missing values appear as spaces, the delimiter
should not be a space; use a comma instead
• For space-delimited data, the command to use is:
– infile x y z using data.txt
– x y z should be names equal in number to the variables in
each record
– x y z cannot be omitted: without a variable list, infile expects
a dictionary file (insheet, by contrast, assigns v1 v2 v3)
– describe
– compress
• infile assumes numeric format unless otherwise
specified
– Assume x is a string (alphanumeric) variable
– infile str10 x y z using data.txt
Reading Delimited ASCII files
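
A self-contained sketch of infile: it first writes a tiny space-delimited file and then reads it back. The file name, variable names, and values are all hypothetical.

```stata
tempname fh
file open `fh' using "example_data.txt", write text replace
file write `fh' "Cairo 10 2.5" _n "Giza 20 3.5" _n
file close `fh'

infile str10 city x y using "example_data.txt", clear   // city is a string
list
describe
```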

 Another common format is comma or tab delimited
data
• Variable names are assumed to be in the first row, also
comma or tab delimited
• No need to identify string variables in comma or tab
delimited files
• The appropriate STATA command is
– insheet using filename.csv, comma
– insheet using filename.txt, tab
– (from Stata 13 on, import delimited replaces insheet)
• A utility program such as Stat/Transfer can be
used to read most data formats, including SPSS,
Excel, SAS, dBase, Access, etc.
Reading Delimited ASCII files
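
The comma-delimited case, sketched with import delimited (Stata 13 and later). With insheet the reading step would be insheet using example.csv, comma clear. The file and its contents are hypothetical.

```stata
tempname fh
file open `fh' using "example.csv", write text replace
file write `fh' "gov,sector,profit" _n "Cairo,A,10" _n "Giza,B,-5" _n
file close `fh'

import delimited using "example.csv", varnames(1) clear   // first row holds the names
describe                                                  // gov and sector are strings
```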

• Fixed format ASCII files have no separators
between variables but each variable always
appears in the same positions
 This is how data typically come from data
entry packages
 Two ways of doing it:
 Without data dictionary
infix rectyp 1-2 gov 3-4 qism 5-6 psu 7-9 urbrur 10
hhgov 11-14 hhpsu 15-16 using rec02.dat
 With data dictionary
Prepare dictionary file using text editor as
explained in handout
Reading Fixed Format ASCII files
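
A sketch of infix without a dictionary. It writes two made-up fixed-width records and reads the first five variables of the slide's layout. The file name and the records are hypothetical.

```stata
tempname fh
file open `fh' using "rec_example.dat", write text replace
file write `fh' "0201005123" _n "0202011072" _n
file close `fh'

* columns: rectyp 1-2, gov 3-4, qism 5-6, psu 7-9, urbrur 10
infix rectyp 1-2 gov 3-4 qism 5-6 psu 7-9 urbrur 10 using "rec_example.dat", clear
list
```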

Using STATA Graphs
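graph twoway: scatterplots, line plots, etc.
graph matrix: scatterplot matrices
graph bar: bar charts
graph dot: dot charts
graph box: box-and-whisker plots
graph pie: pie charts
histogram: histograms
graph save: save a graph to disk
graph use: redisplay a graph stored on disk
graph display: redisplay a graph stored in memory
graph combine: combine several graphs into one
graph export: export the current graph to another format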
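
A short sketch that draws two graphs, combines them, and saves the result in Stata format. It assumes auto.dta as example data, and the graph and file names are hypothetical.

```stata
sysuse auto, clear
twoway scatter price mpg, name(sc, replace)      // scatterplot kept in memory as sc
histogram price, name(hist, replace)             // histogram kept in memory as hist
graph combine sc hist, name(both, replace)       // both panels in one graph
graph save both "example_graphs.gph", replace    // Stata-format graph file
```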

Macros
• A macro is a shorthand—one thing standing for
another. For instance:
• local list "age weight sex"
• regress outcome `list'
is the same as
• regress outcome age weight sex
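• local or global?
– A local macro exists only inside the do-file or program that defines it
– A global macro stays defined for the whole STATA session and is referenced as $name
– Globals can get you into a mess, because any do-file can overwrite them
– Better to stick with local macros unless you really need a global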
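
Both kinds of macro, sketched on auto.dta (assumed example data). The macro names are hypothetical. Run the lines together in one do-file, because a local macro disappears when the do-file ends.

```stata
sysuse auto, clear
local list "mpg weight foreign"       // local: referenced as `list'
regress price `list'                  // same as: regress price mpg weight foreign
global controls "mpg weight"          // global: referenced as $controls
regress price $controls
macro drop controls                   // remove the global when finished
```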
