November 4, 2009
Introduction to SAS
LISA Short Course Series
Mark Seiss, Dept. of Statistics
Reference Material
The Little SAS Book – Delwiche and Slaughter
SAS Programming I: Essentials
SAS Programming II: Manipulating Data with the
DATA Step
Presentation and Data
http://www.lisa.stat.vt.edu/?q=node/167
Presentation Outline
1. Introduction to the SAS Environment
2. Working With SAS Data Sets
3. Summary Procedures
4. Basic Statistical Analysis Procedures
Presentation Outline
Questions/Comments
Introduction to the
SAS Environment
1. SAS Programs
2. SAS Data Sets and Data Libraries
3. Creating SAS Data Sets
SAS Programs
• File extension - .sas
• Editor window has four uses:
• Access and edit existing SAS programs
• Write new SAS programs
• Submit SAS programs for execution
• Save SAS programs
• SAS program – sequence of steps that the user submits for
execution
• Submitting SAS programs
• Entire program
• Selection of the program
SAS Programs
• Syntax Rules for SAS statements
• Free-format – can use upper or lower case
• Usually begin with an identifying keyword
• Can span multiple lines
• Always end with a semicolon
• Multiple statements can be on the same line
• Errors
• Misspelled key words
• Missing or invalid punctuation (missing semi-colon common)
• Invalid options
• Indicated in the Log window
SAS Programs
• 2 Basic steps in SAS programs:
• Data Steps
• Typically used to create SAS datasets and manipulate data,
• Begins with DATA statement
• Proc Steps
• Typically used to process SAS data sets
• Begins with PROC statement
• The end of a DATA or PROC step is indicated by:
• RUN statement – most steps
• QUIT statement – some steps
• Beginning of another step (DATA or PROC statement)
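As a minimal sketch of how the two kinds of steps fit together, the program below contains one DATA step and one PROC step. The data set name scores and its values are made up for illustration.

```sas
/* DATA step: creates the temporary data set WORK.SCORES */
DATA scores;
   INPUT Name $ Score;   /* Name is character, Score is numeric */
   DATALINES;
Ann 90
Bob 85
;
RUN;

/* PROC step: processes the data set created above */
PROC PRINT DATA=scores;
RUN;
```

Each RUN statement closes its step; the PROC statement alone would also have ended the DATA step.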
SAS Programs
• Output generated from SAS program – 2 Windows
• SAS log
• Information about the processing of the SAS program
• Includes any warnings or error messages
• Accumulated in the order the data and procedure steps are
submitted
• SAS output
• Reports generated by the SAS procedures
• Accumulates output in the order it is generated
SAS Data Sets and Data Libraries
• SAS Data Set
• Specifically structured file that contains data values.
• File extension - .sas7bdat
• Rows and Columns format – similar to Excel
• Columns – variables in the table corresponding to fields of data
• Rows – single record or observation
• Two types of variables
• Character – contain any value (letters, numbers, symbols, etc.)
• Numeric – floating point numbers
• Located in SAS Data Libraries
SAS Data Sets and Data Libraries
• SAS Data Libraries
• Contain SAS data sets
• Identified by assigning a library reference name – libref
• Temporary
• Work library
• SAS data files are deleted when session ends
• Library reference name not necessary
• Permanent
• SAS data sets are saved after session ends
• SASUSER library
• You can create and access your own libraries
SAS Data Sets and Data Libraries
• SAS Data Libraries cont.
• Assigning library references
• Syntax
LIBNAME libref 'SAS-data-library';
• Rules for Library References
• 8 characters or fewer
• Must begin with a letter or underscore
• Other characters are letters, numbers, or underscores
SAS Data Sets and Data Libraries
• SAS Data Libraries cont.
• Identifying SAS data sets within SAS Data Libraries
libref.filename
• Accessing SAS data sets within SAS Data Libraries
Example: DATA new_data_set;
set libref.filename;
run;
• Creating SAS data sets within SAS Data Libraries
Example: DATA libref.filename;
set old_data_set;
run;
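The library syntax above can be sketched end to end as follows. The libref mylib and the data set names are hypothetical. So that the sketch runs anywhere, the libref is pointed at the WORK directory; in practice you would give the path of your own folder instead.

```sas
/* Assumption: the libref points at the WORK directory for portability.
   Normally you would quote a folder of your own here. */
LIBNAME mylib "%SYSFUNC(PATHNAME(work))";

/* A small temporary data set to start from */
DATA old_data_set;
   x = 1;
RUN;

/* Create a data set in the library: two-level name libref.filename */
DATA mylib.saved;
   SET old_data_set;
RUN;

/* Access it again and copy it into a new temporary data set */
DATA new_data_set;
   SET mylib.saved;
RUN;
```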
Creating SAS Data Sets
• Creating SAS data sets from raw data
• 4 methods
1. Importing existing raw data in SAS program
2. Manually entering raw data in SAS program
3. Importing existing data sets using Import menu option
4. Manually entering raw data using Table Editor
Creating SAS Data Sets
• Importing existing raw data in SAS program
1. Start Data step and name the SAS data set to be created
(include SAS Data library to be stored in)
DATA libref.SAS-data-set;
2. Identify the raw data file to be read (e.g., a .dat file)
INFILE 'raw-data-filename';
3. Provide instruction on how to read data from raw data file
INPUT input-specifications;
Creating SAS Data Sets
• Input Specifications
• Specifies the names of the SAS variables in the new data set
• Specifies whether the SAS variables are character or numeric
• Identifies the locations of the variables in the raw data file
• List Input
• Column Input
• Formatted Input
• Mixed Input
Creating SAS Data Sets
• List Input
• Used when raw data is separated by spaces
• All data in a row must be read in
• All missing data must be indicated by a period
• Simple character data – no embedded spaces, no lengths greater
than 8
• INPUT statement
• Simply list variables after the INPUT keyword in the order they
appear on file.
• If variables are character format, place a $ after the variable name
• Example) INPUT Name $ City $ Age Height Weight Sex $;
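A small sketch of list input using the INPUT statement above. The data values are invented, and the raw data is supplied in the program with DATALINES rather than read from a file so that the example is self-contained.

```sas
DATA people;
   /* A $ after a variable name marks it as character */
   INPUT Name $ City $ Age Height Weight Sex $;
   DATALINES;
Alice Boston 34 64.5 130 F
Bob Denver . 70.0 180 M
;
RUN;

PROC PRINT DATA=people;
RUN;
```

The period in the second data line stands in for Bob's missing Age, so the remaining values still line up with the right variables.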
Creating SAS Data Sets
• Column Input
• Used when raw data file does not have delimiters between values
(large data sets)
• Each variable’s values are found in the same columns in each row
• Numeric data must be standard – numbers, decimals, signs, and
scientific notation only
• Advantages
• No spaces required
• Missing values left blank
• Character data can have embedded spaces
• Ability to skip unwanted variables
Creating SAS Data Sets
• Column Input cont.
• INPUT Statement
• Numeric variables – list variable name then list column or range
of columns where the variable is found on the raw data file
• Character variables – list variable name, dollar sign, and then
column or range of columns
• Example) INPUT Name $ 1-10 Age 26-28 Sex $ 35;
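A sketch of column input follows. The column positions here (Name in 1-10, Age in 12-13, Sex in 15) are chosen to keep the data lines short and are not the ones from the slide's example; the values are invented.

```sas
DATA people_cols;
   /* variable name, optional $, then the column range */
   INPUT Name $ 1-10 Age 12-13 Sex $ 15;
   DATALINES;
Mary Smith 34 F
John Doe      M
;
RUN;

PROC PRINT DATA=people_cols;
RUN;
```

Name keeps its embedded space, and the blank Age field on the second line is read as a missing value.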
Creating SAS Data Sets
• Formatted Input
• Appropriate for reading:
• Data in fixed columns
• Standard and nonstandard character and numeric data
• Calendar values to be converted to SAS date values
• Read data in using SAS informats
• Instruction that SAS uses to read in data values
• General forms
– Character - $informatw.
– Numeric – informatw.d
– Date – informatw.
Creating SAS Data Sets
• Formatted Input cont.
• Character Informats
• $w. – character string with a width of w, trims leading blanks
• $charw. – character string with a width of w, does not trim leading
or trailing blanks
• Numeric Informats
• w.d – standard numeric data with width w and d numbers after
the decimal
– Raw Data Value = 1234567 -> informat = 8.2 -> SAS Data Value = 12345.67
• COMMAw.d – numeric data with embedded commas
– Raw Data Value = 1,000,001 -> informat = COMMA10. -> SAS Data Value = 1000001
Creating SAS Data Sets
• Formatted Input cont.
• SAS date values
• Stored as special numeric data
• Number of days between January 1, 1960 and the specified date
• Informats are used to read and convert the dates
Raw Data Value Informat
11/04/2009 MMDDYY10.
11/04/09 MMDDYY8.
04NOV2009 Date9.
04/11/2009 DDMMYY10.
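To see that a SAS date really is a day count, this sketch reads the same date with two of the informats above and writes it to the log, first unformatted and then formatted. The variable names d1 and d2 are hypothetical; +1 skips the blank between the two fields.

```sas
DATA dates;
   INPUT d1 MMDDYY10. +1 d2 DATE9.;
   PUT d1= d2=;                     /* raw values: days since 01JAN1960 */
   PUT d1= MMDDYY10. d2= DATE9.;    /* the same values displayed with formats */
   DATALINES;
11/04/2009 04NOV2009
;
RUN;
```

Both variables should show 18205 in the first log line, the number of days from January 1, 1960 to November 4, 2009.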
Creating SAS Data Sets
• Formatted Input cont.
• Columns read are determined by the starting point and width of the
informat
Example:
INPUT Name $10. Age 3. Height 5.1 BirthDate MMDDYY10.;
- Name – Character of length 10, columns 1-10
- Age – Numeric with length 3, columns 11-13
- Height – Numeric with length 5 (including decimal) and one
decimal place (120.9 for instance), columns 14-18
- Birthdate – Date format MMDDYY (11-04-2009 for instance),
columns 19 - 28
Creating SAS Data Sets
• Formatted Input cont.
• Pointer controls
• +n moves pointer n positions
• @n moves pointer to column n
Example:
INPUT Flight 3. +4 Date mmddyy8. @20 Destination $3.;
- Flight - Number of length 3, columns 1 through 3
- Date – Date format mmddyy (11/04/09) of length 8, columns 8 through 15
- Destination – Character of length 3, columns 20 through 22
Creating SAS Data Sets
• Mixed Formatted Input Styles
• Mix and match the previous 3 input styles
• Example:
Raw Data: Great Smoky Mountains NC/TN 1926 520,269
INPUT ParkName $ 1-22 State $ Year @40 Acreage COMMA9.;
- Parkname - Character of length 22, columns 1 through 22
- State - Character, separated by spaces
- Year - Numeric, separated by spaces
- Acreage - Numeric with informat COMMA9., starts column 40
Creating SAS Data Sets
• Manually Entering Raw Data Files in SAS program
1. Start Data step and name the SAS data set to be created
DATA library.SAS-data-set;
2. Provide instructions on how to read data from raw data file
INPUT input-specifications;
3. Manually enter raw data
DATALINES;
<Raw Data>
Creating SAS Data Sets
• Manually Entering Raw Data Files in SAS program
Example:
Data uspresidents;
INPUT President $ Party $ Number;
DATALINES;
Adams F 2
Lincoln R 16
Grant R 18
Kennedy D 35
;
Run;
Creating SAS Data Sets
• Using the import data menu option
1. File -> Import Data
2. Standard data source -> select the file format
3. Specify file location or Browse to select file
4. Create name for the new SAS data set and specify location
Creating SAS Data Sets
• Compatible file formats
• Microsoft Excel Spreadsheets
• Microsoft Access Databases
• Comma-Separated Files (.csv)
• Tab Delimited Files (.txt)
• dBASE Files (.dbf)
• JMP data sets
• SPSS Files
• Lotus Spreadsheets
• Stata Files
• Paradox Files
Creating SAS Data Sets
• Enter raw data directly into a SAS data set
1. Tools -> Table Editor
2. Enter data manually into table
- Observations in each row
- Variables in each column
3. Left Click Column -> Column Attributes
- Variable Name, Variable Label, Type – Character/Numeric,
Format, Informat
Note: Informats determine how raw data is read. Formats
determine how variable is displayed.
4. Close window -> Save Changes – Yes -> Specify file name and directory
Introduction to the
SAS Environment
Questions/Comments
Working With SAS Data Sets
1. Data Set Manipulation
2. Data Set Processing
3. Combining Data Sets
A. Concatenating/Appending
B. Merging
Data Set Manipulation
• Create a new SAS data set using an existing SAS data set as
input
• Specify name of the new SAS data set after the DATA statement
• Use SET statement to identify SAS data set being read
• Syntax:
DATA output_data_set;
SET input_data_set;
<additional SAS statements>;
RUN;
• By default the SET statement reads all observations and variables
from the input data set into the output data set.
Data Set Manipulation
• Assignment Statements
• Evaluate an expression
• Assign resulting value to a variable
• General Form: variable = expression;
• Example: miles_per_hour = distance/time;
• SAS Functions
• Perform arithmetic functions, compute simple statistics, manipulate
dates, etc.
• General Form: variable=function_name(argument1, argument2,…);
• Example: Time_worked = sum(Day1,Day2, Day3, Day4, Day5);
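One practical difference between a function and a plain arithmetic expression is how missing values are handled. In this sketch (made-up data; Time_plus is a hypothetical variable added for comparison), SUM ignores missing arguments while the + operator propagates them.

```sas
DATA hours;
   INPUT Day1-Day5;
   Time_worked = SUM(Day1, Day2, Day3, Day4, Day5);  /* ignores missing values */
   Time_plus   = Day1 + Day2 + Day3 + Day4 + Day5;   /* missing if any day is missing */
   DATALINES;
8 8 8 8 8
8 . 8 8 4
;
RUN;

PROC PRINT DATA=hours;
RUN;
```

For the second row Time_worked is 28, while Time_plus is missing.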
Data Set Manipulation
• Selecting Variables
• Use DROP and KEEP to determine which variables are written to
new SAS data set.
• 2 Ways
• DROP and KEEP as statements
– Form: DROP Variable1 Variable2;
KEEP Variable3 Variable4 Variable5;
• DROP and KEEP options in SET statement
– Form: SET input_data_set (KEEP=Var1);
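A short sketch of both approaches, using an invented trips data set. Note that the statement form takes no equals sign, whereas the data set option is written KEEP= or DROP= inside parentheses.

```sas
DATA trips;
   INPUT distance time;
   DATALINES;
120 2
300 5
;
RUN;

/* DROP statement: time is available for the calculation but is not written out */
DATA speeds;
   SET trips;
   miles_per_hour = distance / time;
   DROP time;
RUN;

/* KEEP= option on SET: only distance is read from TRIPS */
DATA distances;
   SET trips (KEEP=distance);
RUN;
```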
Data Set Manipulation
• Conditional Processing
• Uses IF-THEN-ELSE logic
• General Form: IF <expression1> THEN <statement>;
ELSE IF <expression2> THEN <statement>;
ELSE <statement>;
• <expression> is a true/false statement, such as:
• Day1=Day2, Day1 > Day2, Day1 < Day2
• Day1+Day2=10
• Sum(day1,day2)=10
• Day1=5 and Day2=5
Data Set Manipulation
• Conditional Processing
Symbolic   Mnemonic          Example
=          EQ                IF region = 'Spain';
~= or ^=   NE                IF region ne 'Spain';
>          GT                IF rainfall > 20;
<          LT                IF rainfall lt 20;
>=         GE                IF rainfall ge 20;
<=         LE                IF rainfall <= 20;
&          AND               IF rainfall ge 20 & temp < 90;
| or !     OR                IF rainfall ge 20 OR temp < 90;
           IN                IF region IN ('Rain', 'Spain', 'Plain');
           IS NOT MISSING    WHERE region IS NOT MISSING;
           BETWEEN AND       WHERE region BETWEEN 'Plain' AND 'Spain';
           CONTAINS          WHERE region CONTAINS 'ain';
• Note: IS NOT MISSING, BETWEEN-AND, and CONTAINS are valid only in WHERE
expressions (WHERE statement or WHERE= option), not in IF statements
Data Set Manipulation
• Conditional Processing cont.
• If <expression1> is true, <statement> is processed
• ELSE IF and ELSE are only processed if <expression1> is false
• Only one statement specified using this form
• Use DO and END statements to execute group of statements
• General Form: IF <expression> THEN DO;
<statements>;
END;
ELSE DO;
<statements>;
END;
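A sketch of IF-THEN/ELSE with DO groups, where each branch sets two variables. The data set, the variables climate and wet_flag, and the cutoffs are invented for illustration.

```sas
DATA weather;
   INPUT region $ rainfall temp;
   IF rainfall GE 20 AND temp < 90 THEN DO;
      climate  = 'Wet';
      wet_flag = 1;
   END;
   ELSE DO;
      climate  = 'Dry';
      wet_flag = 0;
   END;
   DATALINES;
Spain 25 80
Plain 10 95
;
RUN;

PROC PRINT DATA=weather;
RUN;
```

Without DO and END, only the single statement directly after THEN or ELSE would be conditional.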
Data Set Manipulation
• Subsetting Rows (Observations)
• We will look at two ways
• Using IF statement
• Using WHERE option in SET statement
• IF statement
• Only writes observations to the new data set for which an
expression is true
• General Form: IF <expression>;
• Example: IF career = 'Teacher';
IF sex ne 'M';
• In the second example, only observations where sex is not equal
to 'M' will be written to the output data set
Data Set Manipulation
• Subsetting Rows (Observations) cont.
• Where Option in SET statement
• Use option to only read rows from the input data set in which the
expression is true
• General Form: SET input_data_set (where=(<expression>));
• Example: SET vacation (where=(destination='Bermuda'));
• Only observations where the destination equals 'Bermuda' will be
read from the input data set
• Comparison
• Resulting output data set is equivalent
• IF statement – all rows read from the input data set
• Where option – only rows where expression is true are read from
input data set
• Difference in processing time when working with big data sets
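The two approaches can be compared side by side with a small invented vacation data set. Both steps below produce the same two observations; the log notes show how many observations each step read from the input.

```sas
DATA vacation;
   INPUT traveler $ destination $;
   DATALINES;
Ann Bermuda
Bob Paris
Cy Bermuda
;
RUN;

/* Subsetting IF: every row is read, then non-matching rows are discarded */
DATA bermuda_if;
   SET vacation;
   IF destination = 'Bermuda';
RUN;

/* WHERE= option: only matching rows are read from VACATION */
DATA bermuda_where;
   SET vacation (WHERE=(destination = 'Bermuda'));
RUN;
```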
Data Set Manipulation
• PROC SORT sorts data according to specified variables
• General Form: PROC SORT DATA=input_data_set <options>;
BY Variable1 Variable2;
RUN;
• Sorts data according to Variable1 and then Variable2
• By default, SAS sorts data in ascending order
• Number low to high
• A to Z
• Use the DESCENDING keyword before a BY variable for numbers high to low and letters Z to A
• BY City DESCENDING Population;
• SAS sorts data first by City A to Z and then by Population high to low
Data Set Manipulation
• Some Options
• NODUPKEY
• Eliminates observations that have the same values for the BY
variables
• OUT=output_data_set
• By default, PROC SORT replaces the input data set with the
sorted data set
• Using this option, PROC SORT creates a newly sorted data set
and the input data set remains unchanged
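A sketch combining the BY statement with both options, on an invented cities data set that contains one exact repeat of a City and Population pair.

```sas
DATA cities;
   INPUT City $ Population;
   DATALINES;
Salem 150000
Austin 960000
Salem 40000
Austin 960000
;
RUN;

/* Sort by City A to Z, then Population high to low.
   OUT= leaves CITIES unchanged; NODUPKEY drops rows that repeat
   the same combination of BY variable values. */
PROC SORT DATA=cities OUT=cities_sorted NODUPKEY;
   BY City DESCENDING Population;
RUN;

PROC PRINT DATA=cities_sorted;
RUN;
```

The sorted data set has three observations: Austin 960000, Salem 150000, Salem 40000.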
Data Set Processing
• Data Set Processing
• DATA steps read in data from existing data sets or raw data files one
row at a time, like a loop
• DATA step reads data from the input data set in the following way:
1. Read in current row from input data set to Program Data
Vector (PDV)
2. Process SAS statements
3. Write the PDV to the output data set
4. Set current row to the next row in the input data set
5. Iterate to Step 1
• One row at a time is processed
• Thus we cannot simply add the value of a variable in one row to the
value in another row
Data Set Processing
• Data Set Processing – Example
• Let the following be the input data set dfwlax:
Flight Date Dest FirstClass Economy
439 14955 LAX 20 137
921 14955 DFW 15 131
114 14956 LAX 15 85
982 14956 DFW 5 196
439 14957 LAX 14 116
982 14957 DFW 20 166
Data Set Processing
• Data Set Processing – Example
• Consider the following submitted code:
DATA onboard;
SET dfwlax;
Total=FirstClass+Economy;
IF FirstClass=20 then FirstClassFull=1;
ELSE FirstClassFull=0;
RUN;
Data Set Processing
• Data Set Processing – Example
• Execution of the Data Step
DATA onboard;
Current -> SET dfwlax;
Total=FirstClass+Economy;
IF FirstClass=20 then FirstClassFull=1;
ELSE FirstClassFull=0;
RUN;
PDV:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
439     14955  LAX   20          137      .      .
Onboard:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
(no observations yet)
Data Set Processing
• Data Set Processing – Example
• Execution of the Data Step
DATA onboard;
SET dfwlax;
Current -> Total=FirstClass+Economy;
IF FirstClass=20 then FirstClassFull=1;
ELSE FirstClassFull=0;
RUN;
PDV:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
439     14955  LAX   20          137      157    .
Onboard:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
(no observations yet)
Data Set Processing
• Data Set Processing – Example
• Execution of the Data Step
DATA onboard;
SET dfwlax;
Total=FirstClass+Economy;
Current -> IF FirstClass=20 then FirstClassFull=1;
ELSE FirstClassFull=0;
RUN;
PDV:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
439     14955  LAX   20          137      157    1
Onboard:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
(no observations yet)
Data Set Processing
• Data Set Processing – Example
• Execution of the Data Step
DATA onboard;
SET dfwlax;
Total=FirstClass+Economy;
IF FirstClass=20 then FirstClassFull=1;
ELSE FirstClassFull=0;
Current -> RUN;
PDV:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
439     14955  LAX   20          137      157    1
Onboard:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
439     14955  LAX   20          137      157    1
Data Set Processing
• Data Set Processing – Example
• Execution of the Data Step
Current DATA onboard;
SET dfwlax;
Total=FirstClass+Economy;
IF FirstClass=20 then FirstClassFull=1;
ELSE FirstClassFull=0;
RUN;
PDV
Onboard
Flight Date Dest FirstClass Economy Total FirstClassFull
439 14955 LAX 20 137 . .
Flight Date Dest FirstClass Economy Total FirstClassFull
439 14955 LAX 20 137 157 1
Data Set Processing
• Data Set Processing – Example
• Execution of the Data Step
DATA onboard;
Current -> SET dfwlax;
Total=FirstClass+Economy;
IF FirstClass=20 then FirstClassFull=1;
ELSE FirstClassFull=0;
RUN;
PDV:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
921     14955  DFW   15          131      .      .
Onboard:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
439     14955  LAX   20          137      157    1
Data Set Processing
• Data Set Processing – Example
• Execution of the Data Step
DATA onboard;
SET dfwlax;
Current -> Total=FirstClass+Economy;
IF FirstClass=20 then FirstClassFull=1;
ELSE FirstClassFull=0;
RUN;
PDV:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
921     14955  DFW   15          131      146    .
Onboard:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
439     14955  LAX   20          137      157    1
Data Set Processing
• Data Set Processing – Example
• Execution of the Data Step
DATA onboard;
SET dfwlax;
Total=FirstClass+Economy;
IF FirstClass=20 then FirstClassFull=1;
Current -> ELSE FirstClassFull=0;
RUN;
PDV:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
921     14955  DFW   15          131      146    0
Onboard:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
439     14955  LAX   20          137      157    1
Data Set Processing
• Data Set Processing – Example
• Execution of the Data Step
DATA onboard;
SET dfwlax;
Total=FirstClass+Economy;
IF FirstClass=20 then FirstClassFull=1;
ELSE FirstClassFull=0;
Current -> RUN;
PDV:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
921     14955  DFW   15          131      146    0
Onboard:
Flight  Date   Dest  FirstClass  Economy  Total  FirstClassFull
439     14955  LAX   20          137      157    1
921     14955  DFW   15          131      146    0
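One way to watch this row-by-row processing yourself is to write the contents of the PDV to the log on every iteration. The sketch below rebuilds the first two rows of dfwlax with DATALINES (an assumption, since the original data set is not included here) and adds a PUT _ALL_ statement to the example step.

```sas
DATA dfwlax;
   INPUT Flight Date Dest $ FirstClass Economy;
   DATALINES;
439 14955 LAX 20 137
921 14955 DFW 15 131
;
RUN;

DATA onboard;
   SET dfwlax;
   Total = FirstClass + Economy;
   IF FirstClass = 20 THEN FirstClassFull = 1;
   ELSE FirstClassFull = 0;
   PUT _ALL_;   /* writes every PDV variable, plus _N_ and _ERROR_, to the log */
RUN;
```

The log shows one line per iteration; the automatic variable _N_ counts the iterations of the DATA step.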
Combining Data Sets
• Concatenating (or Appending)
• Stacks each data set upon the other
• If one data set does not have a variable that the other datasets
do, the variable in the new data set is set to missing for the
observations from that data set.
• General Form: DATA output_data_set;
SET data1 data2;
run;
• PROC APPEND may also be used
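A sketch of concatenation with two invented data sets that share id but otherwise have different variables, to show where missing values appear.

```sas
DATA data1;
   INPUT id x;
   DATALINES;
1 10
2 20
;
RUN;

DATA data2;
   INPUT id y;
   DATALINES;
3 30
;
RUN;

/* Stack DATA1 on top of DATA2: y is missing for the rows from DATA1,
   and x is missing for the row from DATA2 */
DATA stacked;
   SET data1 data2;
RUN;

PROC PRINT DATA=stacked;
RUN;
```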
Combining Data Sets
• Merging Data Sets
• One-to-One Match Merge
• A single record in a data set corresponds to a single record in all
other data sets
• Example: Patient and Billing Information
• One-to-Many Match Merge
• Matching one observation from one data set to multiple
observations in other data sets
• Example: County and State Information
• Note: Data sets must be sorted by the BY variables before a match merge can be done
(PROC SORT)
Combining Data Sets
• One-to-One Match Merge
• Usually need at least one common variable between data sets –
matching purposes
• For the example, a patient ID would be needed
• Do not need common variable if all data sets are in exactly the same
order
• General Form: DATA output_data_set;
MERGE input_data_set1 input_data_set2;
By variable1 variable2;
RUN;
Combining Data Sets
• One-to-One Match Merge
• Example:
Performance
Month Sales
1 8223
2 6034
3 4220
Goals
Month Goal
1 9000
2 6000
3 5000
Code:
DATA compare;
MERGE performance goals;
BY month;
difference=sales-goal;
RUN;
Combining Data Sets
• One-to-One Match Merge
• Example cont.:
Compare
Month Sales Goal Difference
1 8223 9000 -777
2 6034 6000 34
3 4220 5000 -780
Combining Data Sets
• One-to-Many Match Merge
• Requires at least one common variable in the data sets for matching
purposes
• For the example, State information is in both the state and county
files
• If two data sets have variables with the same name, the variables in
the second data set will overwrite the variable in the first.
• General Form: DATA output_data_set;
MERGE Data1 Data2 Data3;
BY Variable1 Variable2;
RUN;
Combining Data Sets
• One-to-Many Match Merge
• Example:
Videos
Category Sales
Aerobics 12.99
Aerobics 13.99
Aerobics 13.99
Step 12.99
Step 12.99
Weights 15.99
Adjustment
Category Adjustment
Aerobics .20
Step .30
Weights .25
Code:
DATA prices;
MERGE videos adjustment;
BY category;
NewPrice=(1-adjustment)*sales;
RUN;
Combining Data Sets
• One-to-Many Match Merge
• Example cont.:
Prices
Category Sales Adjustment NewPrice
Aerobics 12.99 .20 10.39
Aerobics 13.99 .20 11.19
Aerobics 13.99 .20 11.19
Step 12.99 .30 9.09
Step 12.99 .30 9.09
Weights 15.99 .25 11.99
Working With SAS Data Sets
Questions/Comments
Summary Procedures
1. Print Procedure
2. Plot Procedure
3. Univariate Procedure
4. Means Procedure
5. Freq Procedure
Print Procedure
• PROC PRINT is used to print data to the output window
• By default, prints all observations and variables in the SAS data set
• General Form: PROC PRINT DATA=input_data_set <options>;
<optional SAS statements>;
RUN;
• Some Options
• input_data_set (obs=n) - Specifies the number of observations to
be printed in the output
• NOOBS - Suppresses printing observation number
• LABEL - Prints the labels instead of variable
names
Print Procedure
• Optional SAS statements
• BY variable1 variable2 variable3;
• Starts a new section of output for every new value of the BY
variables
• ID variable1 variable2 variable3;
• Prints ID variables on the left hand side of the page and
suppresses the printing of the observation numbers
• SUM variable1 variable2 variable3;
• Prints sum of listed variables at the bottom of the output
• VAR variable1 variable2 variable3;
• Prints only listed variables in the output
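The optional statements can be combined in one step. This sketch assumes the sample data set sashelp.class that ships with SAS (variables Name, Sex, Age, Height, Weight) and sorts a copy first, because a BY statement expects sorted data.

```sas
PROC SORT DATA=sashelp.class OUT=class;
   BY Sex;
RUN;

PROC PRINT DATA=class (OBS=10);
   BY Sex;            /* new section of output for each value of Sex */
   ID Name;           /* shown at the left in place of the observation number */
   VAR Age Height;    /* print only these variables */
   SUM Height;        /* total of Height at the bottom of the output */
RUN;
```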
Plot Procedure
• Used to create basic scatter plots of the data
• Use PROC GPLOT or PROC SGPLOT for more sophisticated plots
• General Form: PROC PLOT DATA=input_data_set;
PLOT vertical_variable *
horizontal_variable/<options>;
RUN;
• By default, SAS uses letters to mark points on plots
• A for a single observation, B for two observations at the same point,
etc.
• To specify a different character to represent a point
• PLOT vertical_variable * horizontal_variable = '*';
Plot Procedure
• To specify a third variable to use to mark points
• PLOT vertical_variable * horizontal_variable = third_variable;
• To plot more than one variable on the vertical axis
• PLOT vertical_variable1 * horizontal_variable='2'
vertical_variable2 * horizontal_variable='1'/OVERLAY;
Univariate Procedure
• PROC UNIVARIATE is used to examine the distribution of data
• Produces summary statistics for a single variable
• Includes mean, median, mode, standard deviation, skewness,
kurtosis, quantiles, etc.
• General Form: PROC UNIVARIATE DATA=input_data_set <options>;
VAR variable1 variable2 variable3;
RUN ;
• If the VAR statement is not used, summary statistics will be produced
for all numeric variables in the input data set.
Univariate Procedure
• Options include:
• PLOT – produces Stem-and-leaf plot, Box plot, and Normal
probability plot;
• NORMAL – produces tests of Normality
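A minimal sketch using both options, again assuming the sample data set sashelp.class that ships with SAS.

```sas
PROC UNIVARIATE DATA=sashelp.class PLOT NORMAL;
   VAR Height Weight;   /* one full set of output per variable */
RUN;
```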
Means Procedure
• Similar to the Univariate procedure
• General Form: PROC MEANS DATA=input_data_set <options>;
<Optional SAS statements>;
RUN;
• With no options or optional SAS statements, the Means procedure will
print out the number of non-missing values, mean, standard deviation,
minimum, and maximum for all numeric variables in the input data set
Means Procedure
• Options
• Statistics Available
• Note: The default confidence level for confidence limits is 95% (ALPHA=0.05). Use the
ALPHA= option to specify a different level.
CLM              Two-Sided Confidence Limits   RANGE      Range
CSS              Corrected Sum of Squares      SKEWNESS   Skewness
CV               Coefficient of Variation      STDDEV     Standard Deviation
KURTOSIS         Kurtosis                      STDERR     Standard Error of Mean
LCLM             Lower Confidence Limit        SUM        Sum
MAX              Maximum Value                 SUMWGT     Sum of Weight Variables
MEAN             Mean                          UCLM       Upper Confidence Limit
MIN              Minimum Value                 USS        Uncorrected Sum of Squares
N                Number Non-missing Values     VAR        Variance
NMISS            Number Missing Values         PROBT      Probability for Student’s t
MEDIAN (or P50)  Median                        T          Student’s t
Q1 (P25)         25% Quantile                  Q3 (P75)   75% Quantile
P1               1% Quantile                   P5         5% Quantile
P10              10% Quantile                  P90        90% Quantile
P95              95% Quantile                  P99        99% Quantile
Means Procedure
• Optional SAS Statements
• VAR Variable1 Variable2;
• Specifies which numeric variables statistics will be produced for
• BY Variable1 Variable2;
• Calculates statistics for each combination of the BY variables
• Output out=output_data_set;
• Creates data set with the default statistics
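A sketch that requests specific statistics, changes the confidence level, and saves the default statistics to a data set. It assumes the sample data set sashelp.class; class_stats is a hypothetical output name.

```sas
PROC SORT DATA=sashelp.class OUT=class;
   BY Sex;
RUN;

/* ALPHA=0.10 requests 90% confidence limits for CLM */
PROC MEANS DATA=class N MEAN MEDIAN STDDEV CLM ALPHA=0.10;
   VAR Height Weight;
   BY Sex;
   OUTPUT OUT=class_stats;   /* default statistics only: N, MIN, MAX, MEAN, STD */
RUN;
```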
FREQ Procedure
• PROC FREQ is used to generate frequency tables
• Most common usage is to create a table showing the distribution of categorical
variables
• General Form: PROC FREQ DATA=input_data_set;
TABLE variable1*variable2*variable3/<options>;
RUN;
• Options
• LIST – prints cross tabulations in list format rather than grid
• MISSING – specifies that missing values should be included in the tabulations
• OUT=output_data_set – creates a data set containing frequencies, list format
• NOPRINT – suppress printing in the output window
• Use BY statement to get percentages within each category of a variable
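A sketch of a two-way frequency table in list format, assuming the sample data set sashelp.cars that ships with SAS (it includes the categorical variables Origin and Type). The output data set name origin_type is hypothetical.

```sas
PROC FREQ DATA=sashelp.cars;
   /* LIST prints one line per Origin and Type combination;
      OUT= saves the counts and percentages to a data set */
   TABLES Origin * Type / LIST MISSING OUT=origin_type;
RUN;
```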
Summary Procedures
Questions/Comments
Statistical Analysis Procedures
1. Correlation – PROC CORR
2. Regression – PROC REG
3. Analysis of Variance – PROC ANOVA
4. Chi-square Test of Association – PROC FREQ
5. Generalized Linear Models – PROC GENMOD
CORR Procedure
• PROC CORR is used to calculate the correlations between variables
• Correlation coefficient measures the linear relationship between two variables
• Values Range from -1 to 1
• Negative correlation - as one variable increases the other decreases
• Positive correlation – as one variable increases the other increases
• 0 – no linear relationship between the two variables
• 1 – perfect positive linear relationship
• -1 – perfect negative linear relationship
• General Form: PROC CORR DATA=input_data_set <options>;
VAR Variable1 Variable2;
With Variable3;
RUN;
CORR Procedure
• If the VAR and WITH statements are not used, correlation is computed
for all pairs of numeric variables
• Options include
• SPEARMAN – computes Spearman’s rank correlations
• KENDALL – computes Kendall’s Tau coefficients
• HOEFFDING – computes Hoeffding’s D statistic
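A sketch using VAR and WITH together, assuming the sample data set sashelp.class. PEARSON is listed explicitly because, once another correlation type is requested, the Pearson correlations are no longer produced by default.

```sas
PROC CORR DATA=sashelp.class PEARSON SPEARMAN;
   VAR Height Weight;   /* columns of the correlation table */
   WITH Age;            /* rows of the correlation table */
RUN;
```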
REG Procedure
• PROC REG is used to fit linear regression models by least squares estimation
• One of many SAS procedures that can perform regression analysis
• Only continuous independent variables (Use GENMOD for categorical variables)
• General Form:
PROC REG DATA=input_data_set <options>;
MODEL dependent=independent1 independent2/<options>;
<optional statements>;
RUN;
• PROC REG statement options include
• PCOMIT=m - performs incomplete principal component estimation, using all
but the last m principal components
• CORR – displays correlation matrix for all variables in the MODEL or VAR statement
REG Procedure
• MODEL statement options include
• SELECTION=
• Specifies a model selection procedure be conducted –
FORWARD, BACKWARD, and STEPWISE
• ADJRSQ - Computes the Adjusted R-Square
• MSE – Computes the Mean Square Error
• COLLIN – performs collinearity analysis
• CLB – computes confidence limits for parameter estimates
• ALPHA=
• Sets significance value for confidence and prediction intervals
and tests
REG Procedure
• Optional statements include
• PLOT Dependent*Independent1 – generates plot of data
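A sketch of a two-predictor regression with a few of the options above, assuming the sample data set sashelp.class. The choice of Weight as the response is only for illustration.

```sas
PROC REG DATA=sashelp.class CORR;
   /* CLB adds confidence limits for the estimates, COLLIN adds
      collinearity diagnostics; ALPHA=0.10 gives 90% limits */
   MODEL Weight = Height Age / CLB COLLIN ALPHA=0.10;
RUN;
QUIT;
```

PROC REG is one of the procedures that stays active after RUN, so QUIT is used to end the step.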
ANOVA Procedure
• PROC ANOVA performs analysis of variance
• Designed for balanced data (PROC GLM is used for unbalanced data)
• Can handle nested and crossed effects and repeated measures
• General Form: PROC ANOVA DATA=input_data_set <options>;
CLASS independent1 independent2;
MODEL dependent=independent1 independent2;
<optional statements>;
Run;
• Class statement must come before model statement, used to define
classification variables
ANOVA Procedure
• Useful PROC ANOVA statement option – OUTSTAT=output_data_set
• Generates output data set that contains sums of squares,
degrees of freedom, statistics, and p-values for each effect in the
model
• Useful optional statement – MEANS independent1/<comparison type>
• Used to perform multiple comparisons analysis
• Set <comparison type> to:
• TUKEY – Tukey’s studentized range test
• BON – Bonferroni t test
• T – pairwise t tests
• Duncan – Duncan’s multiple-range test
• Scheffe – Scheffe’s multiple comparison procedure
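A sketch of a balanced one-way analysis with Tukey comparisons. The growth data set and its values are invented; the trailing @@ on the INPUT statement, which is not covered above, lets one data line hold several observations.

```sas
DATA growth;
   INPUT fertilizer $ height @@;   /* @@ holds the line for the next observation */
   DATALINES;
A 20 A 22 A 19 B 25 B 27 B 26 C 30 C 29 C 31
;
RUN;

PROC ANOVA DATA=growth OUTSTAT=growth_stats;
   CLASS fertilizer;            /* must come before MODEL */
   MODEL height = fertilizer;
   MEANS fertilizer / TUKEY;    /* pairwise comparisons of the group means */
RUN;
QUIT;
```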
FREQ Procedure
• PROC FREQ can also be used to perform analysis with categorical data
• General Form: PROC FREQ DATA=input_data_set;
TABLE variable1*variable2/<options>;
RUN;
• TABLE statement options include:
• AGREE – Tests and measures of classification agreement including McNemar’s test,
Bowker’s test, Cochran’s Q test, and Kappa statistics
• CHISQ - Chi-square test of homogeneity and measures of association
• MEASURES - Measures of association including Pearson and Spearman correlation,
gamma, Kendall’s tau-b, Stuart’s tau-c, Somers’ D, lambda, odds ratios, risk
ratios, and confidence intervals
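A sketch of a chi-square test of association on a two-way table, assuming the sample data set sashelp.cars (Origin and DriveTrain are both categorical).

```sas
PROC FREQ DATA=sashelp.cars;
   /* the asterisk requests a two-way table; CHISQ and MEASURES
      add the tests and the measures of association */
   TABLES Origin * DriveTrain / CHISQ MEASURES;
RUN;
```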
GENMOD Procedure
• PROC GENMOD is used to fit generalized linear models, in which the response
is not necessarily normal
• Logistic and Poisson regression are examples of generalized linear
models
• General Form:
PROC GENMOD DATA=input_data_set;
CLASS independent1;
MODEL dependent = independent1 independent2/
dist= <option>
link=<option>;
run;
GENMOD Procedure
• DIST = - specifies the distribution of the response variable
• LINK= - specifies the link function from the linear predictor to the mean of
the response
• Example – Logistic Regression
• DIST = binomial
• LINK = logit
• Example – Poisson Regression
• DIST = poisson
• LINK = log
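A sketch of the logistic regression case, assuming the sample data set sashelp.cars. The binary response HighMPG and its cutoff of 20 are invented for illustration, and the choice of predictors is arbitrary. The DESCENDING option asks the procedure to model the probability that HighMPG is 1 rather than 0.

```sas
DATA cars2;
   SET sashelp.cars;
   HighMPG = (MPG_City > 20);   /* 1 if the comparison is true, 0 otherwise */
RUN;

PROC GENMOD DATA=cars2 DESCENDING;
   CLASS Origin;                /* categorical predictor */
   MODEL HighMPG = Origin Weight / DIST=binomial LINK=logit;
RUN;
```

For a Poisson regression on a count response, the same layout applies with DIST=poisson and LINK=log.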
Statistical Analysis Procedures
Questions/Comments

PPTX
Principles of Marketing, Industrial, Consumers,
Belch_12e_PPT_Ch18_Accessible_university.pptx
AI-assistance in Knowledge Collection and Curation supporting Safe and Sustai...
job Avenue by vinith.pptxvnbvnvnvbnvbnbmnbmbh
BsN 7th Sem Course GridNNNNNNNN CCN.pdf
CkgxkgxydkydyldylydlydyldlyddolydyoyyU2.pptx
Probability Distribution, binomial distribution, poisson distribution
Stem Cell Market Report | Trends, Growth & Forecast 2025-2034
5 Stages of group development guide.pptx
kom-180-proposal-for-a-directive-amending-directive-2014-45-eu-and-directive-...
New Microsoft PowerPoint Presentation - Copy.pptx
Laughter Yoga Basic Learning Workshop Manual
Types of control:Qualitative vs Quantitative
WRN_Investor_Presentation_August 2025.pdf
COST SHEET- Tender and Quotation unit 2.pdf
ICG2025_ICG 6th steering committee 30-8-24.pptx
SIMNET Inc – 2023’s Most Trusted IT Services & Solution Provider
Roadmap Map-digital Banking feature MB,IB,AB
MSPs in 10 Words - Created by US MSP Network
Katrina Stoneking: Shaking Up the Alcohol Beverage Industry
Principles of Marketing, Industrial, Consumers,

Sas short course_presentation_11-4-09

  • 1. November 4, 2009 Introduction to SAS LISA Short Course Series Mark Seiss, Dept. of Statistics
  • 2. Reference Material The Little SAS Book – Delwiche and Slaughter SAS Programming I: Essentials SAS Programming II: Manipulating Data with the DATA Step Presentation and Data http://www.lisa.stat.vt.edu/?q=node/167
  • 3. Presentation Outline 1. Introduction to the SAS Environment 2. Working With SAS Data Sets 3. Summary Procedures 4. Basic Statistical Analysis Procedures
  • 5. Introduction to the SAS Environment 1. SAS Programs 2. SAS Data Sets and Data Libraries 3. Creating SAS Data Sets
  • 6. SAS Programs • File extension - .sas • Editor window has four uses: • Access and edit existing SAS programs • Write new SAS programs • Submitting SAS programs for execution • Saving SAS programs • SAS program – sequence of steps that the user submits for execution • Submitting SAS programs • Entire program • Selection of the program
  • 7. SAS Programs • Syntax Rules for SAS statements • Free-format – can use upper or lower case • Usually begin with an identifying keyword • Can span multiple lines • Always end with a semicolon • Multiple statements can be on the same line • Errors • Misspelled keywords • Missing or invalid punctuation (a missing semicolon is the most common) • Invalid options • Indicated in the Log window
  • 8. SAS Programs • 2 Basic steps in SAS programs: • Data Steps • Typically used to create SAS data sets and manipulate data • Begin with a DATA statement • Proc Steps • Typically used to process SAS data sets • Begin with a PROC statement • The end of a data or proc step is indicated by: • RUN statement – most steps • QUIT statement – some steps • Beginning of another step (DATA or PROC statement)
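The two kinds of steps can be seen side by side in a minimal sketch. The data set name `trips` and its values are made up for illustration; the variables follow the miles-per-hour example used later in the course.

```sas
/* DATA step: creates a temporary data set (hypothetical values) */
DATA work.trips;
   INPUT distance time;
   miles_per_hour = distance / time;
   DATALINES;
120 2
45 1.5
;
RUN;

/* PROC step: processes the data set created above */
PROC PRINT DATA=work.trips;
RUN;
```

Each step ends at its RUN statement; messages about both steps appear in the Log window, and the PROC PRINT report appears in the Output window.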
  • 9. SAS Programs • Output generated from SAS program – 2 Windows • SAS log • Information about the processing of the SAS program • Includes any warnings or error messages • Accumulated in the order the data and procedure steps are submitted • SAS output • Reports generated by the SAS procedures • Accumulates output in the order it is generated
  • 10. SAS Data Sets and Data Libraries • SAS Data Set • Specifically structured file that contains data values. • File extension - .sas7bdat • Rows and Columns format – similar to Excel • Columns – variables in the table corresponding to fields of data • Rows – single record or observation • Two types of variables • Character – contain any value (letters, numbers, symbols, etc.) • Numeric – floating point numbers • Located in SAS Data Libraries
  • 11. SAS Data Sets and Data Libraries • SAS Data Libraries • Contain SAS data sets • Identified by assigning a library reference name – libref • Temporary • Work library • SAS data files are deleted when session ends • Library reference name not necessary • Permanent • SAS data sets are saved after session ends • SASUSER library • You can create and access your own libraries
  • 12. SAS Data Sets and Data Libraries • SAS Data Libraries cont. • Assigning library references • Syntax LIBNAME libref 'SAS-data-library'; • Rules for Library References • 8 characters or less • Must begin with letter or underscore • Other characters are letters, numbers, or underscores
  • 13. SAS Data Sets and Data Libraries • SAS Data Libraries cont. • Identifying SAS data sets within SAS Data Libraries libref.filename • Accessing SAS data sets within SAS Data Libraries Example: DATA new_data_set; set libref.filename; run; • Creating SAS data sets within SAS Data Libraries Example: DATA libref.filename; set old_data_set; run;
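A minimal sketch of assigning a library and then writing to and reading from it. The libref `mylib` and the folder path are hypothetical and must be replaced with an existing folder; `sashelp.class` is a sample data set that ships with SAS and is assumed to be available.

```sas
/* Hypothetical path: point this at a folder that exists on your machine */
LIBNAME mylib 'C:\sas_course\data';

/* Create a permanent data set inside the library */
DATA mylib.class_copy;
   SET sashelp.class;
RUN;

/* Read it back into a temporary data set (stored in WORK) */
DATA class_temp;
   SET mylib.class_copy;
RUN;
```

Because `class_temp` has no libref, it goes to the Work library and is deleted when the session ends, while `mylib.class_copy` remains on disk.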
  • 14. Creating SAS Data Sets • Creating SAS data sets from raw data • 4 methods 1. Importing existing raw data in SAS program 2. Manually entering raw data in SAS program 3. Importing existing data sets using Import menu option 4. Manually entering raw data using Table Editor
  • 15. Creating SAS Data Sets • Importing existing raw data in SAS program 1. Start the DATA step and name the SAS data set to be created (include the SAS data library it will be stored in) DATA libref.SAS-data-set; 2. Identify the file that contains the raw data (for example, a .dat file) INFILE 'raw-data-filename'; 3. Provide instructions on how to read data from the raw data file INPUT input-specifications;
  • 16. Creating SAS Data Sets • Input Specifications • Specifies the names of the SAS variables in the new data set • Specifies whether the SAS variables are character or numeric • Identifies the locations of the variables in the raw data file • List Input • Column Input • Formatted Input • Mixed Input
  • 17. Creating SAS Data Sets • List Input • Used when raw data is separated by spaces • All data in a row must be read in • All missing data must be indicated by period • Simple character data – no embedded spaces, no lengths greater than 8 • INPUT statement • Simply list variables after the INPUT keyword in the order they appear on file. • If variables are character format, place a $ after the variable name • Example) INPUT Name $ City $ Age Height Weight Sex $;
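A minimal sketch of list input using the INPUT statement from the example above. The data values and the data set name `people` are invented for illustration, and the data are supplied in the program with DATALINES rather than read from a file.

```sas
DATA people;
   /* $ marks character variables; values are separated by spaces */
   INPUT Name $ City $ Age Height Weight Sex $;
   DATALINES;
Mark Boston 34 70 165 M
Ann Denver . 64 120 F
;
RUN;
```

The period in the second row marks a missing Age, as list input requires.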
  • 18. Creating SAS Data Sets • Column Input • Used when raw data file does not have delimiters between values (large data sets) • Each variable’s values are found in the same columns in each row • Numeric data must be standard – numbers, decimals, signs, and scientific notation only • Advantages • No spaces required • Missing values left blank • Character data can have embedded spaces • Ability to skip unwanted variables
  • 19. Creating SAS Data Sets • Column Input cont. • INPUT Statement • Numeric variables – list variable name then list column or range of columns where the variable is found on the raw data file • Character variables – list variable name, dollar sign, and then column or range of columns • Example) INPUT Name $ 1-10 Age 26-28 Sex $ 35;
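A minimal sketch of column input using the INPUT statement from the example above. The records are invented; columns 11-25 hold a city that is deliberately skipped, and the spacing in the data lines matters because each value must sit in its stated columns.

```sas
DATA people_cols;
   /* Name in columns 1-10, Age in 26-28, Sex in column 35 */
   INPUT Name $ 1-10 Age 26-28 Sex $ 35;
   DATALINES;
Mary SmithBlacksburg      34      F
Bob Lee   Roanoke          7      M
;
RUN;
```

Note that the name contains an embedded space and that no delimiter separates the name from the city, both of which list input could not handle.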
  • 20. Creating SAS Data Sets • Formatted Input • Appropriate for reading: • Data in fixed columns • Standard and nonstandard character and numeric data • Calendar values to be converted to SAS date value • Read data in using SAS informats • Instruction that SAS uses to read in data values • General forms – Character - $informatw. – Numeric – informatw.d – Date – informatw.
  • 21. Creating SAS Data Sets • Formatted Input cont. • Character Informats • $w. – character string with a width of w, trims leading blanks • $charw. – character string with a width of w, does not trim leading or trailing blanks • Numeric Informats • w.d – standard numeric data with width w and d numbers after the decimal – Raw Data Value = 1234567 → informat = 8.2 → SAS Data Value = 12345.67 • COMMAw.d – numeric data with embedded commas – Raw Data Value = 1,000,001 → informat = COMMA10. → SAS Data Value = 1000001
  • 22. Creating SAS Data Sets • Formatted Input cont. • SAS date values • Stored as a special kind of numeric data • Number of days between January 1, 1960 and the specified date • Informats are used to read and convert the dates • Raw data value and matching informat: • 11/04/2009 – MMDDYY10. • 11/04/09 – MMDDYY8. • 04NOV2009 – DATE9. • 04/11/2009 – DDMMYY10.
  • 23. Creating SAS Data Sets • Formatted Input cont. • Columns read are determined by the starting point and width of the informat Example: INPUT Name $10. Age 3. Height 5.1 BirthDate MMDDYY10.; - Name – Character of length 10, columns 1-10 - Age – Numeric with length 3, columns 11-13 - Height – Numeric with length 5 (including decimal) and one decimal place (120.9 for instance), columns 14-18 - Birthdate – Date format MMDDYY (11-04-2009 for instance), columns 19 - 28
  • 24. Creating SAS Data Sets • Formatted Input cont. • Pointer controls • +n moves pointer n positions • @n moves pointer to column n Example: INPUT Flight 3. +4 Date mmddyy8. @20 Destination $3.; - Flight - Number of length 3, columns 1 through 3 - Date – Date format mmddyy (11/04/09) of length 8, columns 8 through 15 - Destination – Character of length 3, columns 20 through 22
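A minimal sketch of formatted input with pointer controls, using the INPUT statement from the example above. The two records are illustrative; the FORMAT statement is an addition so that the stored date value prints as a readable date.

```sas
DATA flights;
   /* Flight: cols 1-3; +4 skips cols 4-7; Date: cols 8-15;
      @20 jumps to col 20; Destination: cols 20-22 */
   INPUT Flight 3. +4 Date mmddyy8. @20 Destination $3.;
   FORMAT Date DATE9.;
   DATALINES;
439    12/11/00    LAX
921    12/11/00    DFW
;
RUN;
```

Without the FORMAT statement, Date would print as the underlying number of days since January 1, 1960.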
  • 25. Creating SAS Data Sets • Mixed Formatted Input Styles • Mix and match the previous 3 input styles • Example: Raw Data: Great Smoky Mountains NC/TN 1926 520,269 INPUT ParkName $ 1-22 State $ Year @40 Acreage COMMA9.; - Parkname - Character of length 22, columns 1 through 22 - State - Character, separated by spaces - Year - Numeric, separated by spaces - Acreage - Numeric with informat COMMA9., starts column 40
  • 26. Creating SAS Data Sets • Manually Entering Raw Data Files in SAS program 1. Start Data step and name the SAS data set to be created DATA library.SAS-data-set; 2. Provide instructions on how to read data from raw data file INPUT input-specifications; 3. Manually enter raw data DATALINES; <Raw Data>
  • 27. Creating SAS Data Sets • Manually Entering Raw Data Files in SAS program Example: Data uspresidents; INPUT President $ Party $ Number; DATALINES; Adams F 2 Lincoln R 16 Grant R 18 Kennedy D 35 ; Run;
  • 28. Creating SAS Data Sets • Using the import data menu option 1. File → Import Data 2. Standard data source → select the file format 3. Specify file location or Browse to select file 4. Create name for the new SAS data set and specify location
  • 29. Creating SAS Data Sets • Compatible file formats • Microsoft Excel Spreadsheets • Microsoft Access Databases • Comma-Separated Files (.csv) • Tab-Delimited Files (.txt) • dBASE Files (.dbf) • JMP data sets • SPSS Files • Lotus Spreadsheets • Stata Files • Paradox Files
  • 30. Creating SAS Data Sets • Enter raw data directly into a SAS data set 1. Tools → Table Editor 2. Enter data manually into table - Observations in each row - Variables in each column 3. Left Click Column → Column Attributes - Variable Name, Variable Label, Type – Character/Numeric, Format, Informat Note: Informats determine how raw data is read. Formats determine how a variable is displayed. 4. Close window → Save Changes – Yes → Specify file name and directory
  • 31. Introduction to the SAS Environment Questions/Comments
  • 32. Working With SAS Data Sets 1. Data Set Manipulation 2. Data Set Processing 3. Combining Data Sets A. Concatenating/Appending B. Merging
  • 33. Data Set Manipulation • Create a new SAS data set using an existing SAS data set as input • Specify name of the new SAS data set after the DATA statement • Use SET statement to identify SAS data set being read • Syntax: DATA output_data_set; SET input_data_set; <additional SAS statements>; RUN; • By default the SET statement reads all observations and variables from the input data set into the output data set.
  • 34. Data Set Manipulation • Assignment Statements • Evaluate an expression • Assign resulting value to a variable • General Form: variable = expression; • Example: miles_per_hour = distance/time; • SAS Functions • Perform arithmetic functions, compute simple statistics, manipulate dates, etc. • General Form: variable=function_name(argument1, argument2,…); • Example: Time_worked = sum(Day1,Day2, Day3, Day4, Day5);
  • 35. Data Set Manipulation • Selecting Variables • Use DROP and KEEP to determine which variables are written to new SAS data set. • 2 Ways • DROP and KEEP as statements (no equals sign) – Form: DROP Variable1 Variable2; KEEP Variable3 Variable4 Variable5; • DROP= and KEEP= data set options in SET statement – Form: SET input_data_set (KEEP=Var1);
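A minimal sketch showing both ways of selecting variables in one DATA step. It assumes the `sashelp.class` sample data set (variables Name, Sex, Age, Height, Weight) is available; the new variable `ratio` is only an illustration.

```sas
DATA class_small;
   /* KEEP= data set option: only these variables are read in */
   SET sashelp.class (KEEP=Name Age Height Weight);
   ratio = Weight / Height;
   /* DROP statement: these variables are not written out */
   DROP Height Weight;
RUN;
```

The output data set contains Name, Age, and ratio. Height and Weight can still be used in calculations because the DROP statement only affects what is written.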
  • 36. Data Set Manipulation • Conditional Processing • Uses IF-THEN-ELSE logic • General Form: IF <expression1> THEN <statement>; ELSE IF <expression2> THEN <statement>; ELSE <statement>; • <expression> is a true/false statement, such as: • Day1=Day2, Day1 > Day2, Day1 < Day2 • Day1+Day2=10 • Sum(day1,day2)=10 • Day1=5 and Day2=5
  • 37. Data Set Manipulation • Conditional Processing • Comparison and logical operators (symbol or mnemonic, then example): • = or EQ – IF region = 'Spain'; • ~= or ^= or NE – IF region ne 'Spain'; • > or GT – IF rainfall > 20; • < or LT – IF rainfall lt 20; • >= or GE – IF rainfall ge 20; • <= or LE – IF rainfall <= 20; • & or AND – IF rainfall ge 20 & temp < 90; • | or ! or OR – IF rainfall ge 20 OR temp < 90; • IN – IF region IN ('Rain', 'Spain', 'Plain'); • Operators available only in WHERE expressions (not in IF statements): • IS NOT MISSING – WHERE region IS NOT MISSING; • BETWEEN AND – WHERE region BETWEEN 'Plain' AND 'Spain'; • CONTAINS – WHERE region CONTAINS 'ain';
  • 38. Data Set Manipulation • Conditional Processing cont. • If <expression1> is true, <statement> is processed • ELSE IF and ELSE are only processed if <expression1> is false • Only one statement specified using this form • Use DO and END statements to execute group of statements • General Form: IF <expression> THEN DO; <statements>; END; ELSE DO; <statements>; END;
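A minimal sketch of IF-THEN/ELSE with DO groups, assuming the `sashelp.class` sample data set is available. The cutoff age and the new variables `group` and `teen` are invented for illustration.

```sas
DATA class_groups;
   SET sashelp.class;
   LENGTH group $ 8;          /* set the length before first assignment */
   IF Age < 13 THEN DO;       /* several statements run when true */
      group = 'child';
      teen  = 0;
   END;
   ELSE DO;
      group = 'teen';
      teen  = 1;
   END;
RUN;
```

Without DO and END, only the single statement following THEN or ELSE would be conditional.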
  • 39. Data Set Manipulation • Subsetting Rows (Observations) • We will look at two ways • Using IF statement • Using WHERE= option in SET statement • IF statement • Only writes observations to the new data set for which the expression is true • General Form: IF <expression>; • Example: IF career = 'Teacher'; IF sex ne 'M'; • In the second example, only observations where sex is not equal to 'M' will be written to the output data set
  • 40. Data Set Manipulation • Subsetting Rows (Observations) cont. • WHERE= option in SET statement • Use the option to read only those rows from the input data set for which the expression is true • General Form: SET input_data_set (where=(<expression>)); • Example: SET vacation (where=(destination='Bermuda')); • Only observations where the destination equals 'Bermuda' will be read from the input data set • Comparison • Resulting output data set is equivalent • IF statement – all rows read from the input data set • WHERE= option – only rows where expression is true are read from input data set • Difference in processing time when working with big data sets
  • 41. Data Set Manipulation • PROC SORT sorts data according to specified variables • General Form: PROC SORT DATA=input_data_set <options>; BY Variable1 Variable2; RUN; • Sorts data according to Variable1 and then Variable2 • By default, SAS sorts data in ascending order • Numbers low to high • A to Z • Put the DESCENDING keyword before a BY variable to sort numbers high to low and letters Z to A • BY City DESCENDING Population; • SAS sorts data first by City A to Z and then by Population high to low
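A minimal sketch comparing the two subsetting approaches, assuming the `sashelp.class` sample data set is available. The output data set names are invented.

```sas
/* Subsetting IF: every row is read, only matching rows are written */
DATA girls_if;
   SET sashelp.class;
   IF Sex = 'F';
RUN;

/* WHERE= option: only matching rows are read in the first place */
DATA girls_where;
   SET sashelp.class (WHERE=(Sex = 'F'));
RUN;
```

Both data sets should contain the same observations; the Log notes report how many observations were read in each case.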
  • 42. Data Set Manipulation • Some Options • NODUPKEY • Eliminates observations that have the same values for the BY variables • OUT=output_data_set • By default, PROC SORT replaces the input data set with the sorted data set • Using this option, PROC SORT creates a newly sorted data set and the input data set remains unchanged
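A minimal sketch of PROC SORT with the options described above, assuming the `sashelp.class` sample data set is available. OUT= is used in both steps so the input data set is left unchanged.

```sas
/* Sort by Sex (A to Z), then by Age from high to low */
PROC SORT DATA=sashelp.class OUT=class_sorted;
   BY Sex DESCENDING Age;
RUN;

/* Keep one observation per distinct value of Age */
PROC SORT DATA=sashelp.class OUT=one_per_age NODUPKEY;
   BY Age;
RUN;
```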
  • 43. Data Set Processing • Data Set Processing • DATA steps read in data from existing data sets or raw data files one row at a time, like a loop • DATA step reads data from the input data set in the following way: 1. Read the current row from the input data set into the Program Data Vector (PDV) 2. Process SAS statements 3. Write the PDV to the output data set 4. Set the current row to the next row in the input data set 5. Return to Step 1 • One row at a time is processed • Thus we cannot simply add the value of a variable in one row to the value in another row
  • 44. Data Set Processing • Data Set Processing – Example • Let the following be the input data set dfwlax (columns: Flight, Date, Dest, FirstClass, Economy): • 439, 14955, LAX, 20, 137 • 921, 14955, DFW, 15, 131 • 114, 14956, LAX, 15, 85 • 982, 14956, DFW, 5, 196 • 439, 14957, LAX, 14, 116 • 982, 14957, DFW, 20, 166
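Because only one row is in the PDV at a time, carrying information across rows needs an explicit mechanism. The sketch below goes slightly beyond the slide and shows two standard ones, a sum statement and the LAG function, assuming the `sashelp.class` sample data set is available; the new variable names are invented.

```sas
DATA running;
   SET sashelp.class;
   /* Sum statement: total_weight is retained across rows and starts at 0 */
   total_weight + Weight;
   /* LAG returns the value of Age from the previous row (missing on row 1) */
   prev_age = LAG(Age);
RUN;
```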
  • 45. Data Set Processing • Data Set Processing – Example • Consider the following submitted code: DATA onboard; SET dfwlax; Total=FirstClass+Economy; IF FirstClass=20 then FirstClassFull=1; ELSE FirstClassFull=0; RUN;
  • 46. Data Set Processing • Data Set Processing – Example • Execution of the Data Step • Code: DATA onboard; SET dfwlax; Total=FirstClass+Economy; IF FirstClass=20 then FirstClassFull=1; ELSE FirstClassFull=0; RUN; • Current statement: SET dfwlax; • PDV: Flight=439 Date=14955 Dest=LAX FirstClass=20 Economy=137 Total=. FirstClassFull=. • Onboard (output data set): empty
  • 47. Data Set Processing • Data Set Processing – Example • Execution of the Data Step • Code: DATA onboard; SET dfwlax; Total=FirstClass+Economy; IF FirstClass=20 then FirstClassFull=1; ELSE FirstClassFull=0; RUN; • Current statement: Total=FirstClass+Economy; • PDV: Flight=439 Date=14955 Dest=LAX FirstClass=20 Economy=137 Total=157 FirstClassFull=. • Onboard (output data set): empty
  • 48. Data Set Processing • Data Set Processing – Example • Execution of the Data Step • Code: DATA onboard; SET dfwlax; Total=FirstClass+Economy; IF FirstClass=20 then FirstClassFull=1; ELSE FirstClassFull=0; RUN; • Current statement: IF FirstClass=20 then FirstClassFull=1; • PDV: Flight=439 Date=14955 Dest=LAX FirstClass=20 Economy=137 Total=157 FirstClassFull=1 • Onboard (output data set): empty
  • 49. Data Set Processing • Data Set Processing – Example • Execution of the Data Step • Code: DATA onboard; SET dfwlax; Total=FirstClass+Economy; IF FirstClass=20 then FirstClassFull=1; ELSE FirstClassFull=0; RUN; • Current statement: RUN; • PDV: Flight=439 Date=14955 Dest=LAX FirstClass=20 Economy=137 Total=157 FirstClassFull=1 • Onboard (output data set): row 1: 439 14955 LAX 20 137 157 1
  • 50. Data Set Processing • Data Set Processing – Example • Execution of the Data Step • Code: DATA onboard; SET dfwlax; Total=FirstClass+Economy; IF FirstClass=20 then FirstClassFull=1; ELSE FirstClassFull=0; RUN; • Current statement: DATA onboard; (start of the second iteration) • PDV: Flight=439 Date=14955 Dest=LAX FirstClass=20 Economy=137 Total=. FirstClassFull=. • Onboard (output data set): row 1: 439 14955 LAX 20 137 157 1
  • 51. Data Set Processing • Data Set Processing – Example • Execution of the Data Step • Code: DATA onboard; SET dfwlax; Total=FirstClass+Economy; IF FirstClass=20 then FirstClassFull=1; ELSE FirstClassFull=0; RUN; • Current statement: SET dfwlax; • PDV: Flight=921 Date=14955 Dest=DFW FirstClass=15 Economy=131 Total=. FirstClassFull=. • Onboard (output data set): row 1: 439 14955 LAX 20 137 157 1
  • 52. Data Set Processing • Data Set Processing – Example • Execution of the Data Step • Code: DATA onboard; SET dfwlax; Total=FirstClass+Economy; IF FirstClass=20 then FirstClassFull=1; ELSE FirstClassFull=0; RUN; • Current statement: Total=FirstClass+Economy; • PDV: Flight=921 Date=14955 Dest=DFW FirstClass=15 Economy=131 Total=146 FirstClassFull=. • Onboard (output data set): row 1: 439 14955 LAX 20 137 157 1
  • 53. Data Set Processing • Data Set Processing – Example • Execution of the Data Step • Code: DATA onboard; SET dfwlax; Total=FirstClass+Economy; IF FirstClass=20 then FirstClassFull=1; ELSE FirstClassFull=0; RUN; • Current statement: ELSE FirstClassFull=0; • PDV: Flight=921 Date=14955 Dest=DFW FirstClass=15 Economy=131 Total=146 FirstClassFull=0 • Onboard (output data set): row 1: 439 14955 LAX 20 137 157 1
  • 54. Data Set Processing • Data Set Processing – Example • Execution of the Data Step • Code: DATA onboard; SET dfwlax; Total=FirstClass+Economy; IF FirstClass=20 then FirstClassFull=1; ELSE FirstClassFull=0; RUN; • Current statement: RUN; • PDV: Flight=921 Date=14955 Dest=DFW FirstClass=15 Economy=131 Total=146 FirstClassFull=0 • Onboard (output data set): row 1: 439 14955 LAX 20 137 157 1; row 2: 921 14955 DFW 15 131 146 0
  • 55. Combining Data Sets • Concatenating (or Appending) • Stacks each data set upon the other • If one data set does not have a variable that the other datasets do, the variable in the new data set is set to missing for the observations from that data set. • General Form: DATA output_data_set; SET data1 data2; run; • PROC APPEND may also be used
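A minimal sketch of concatenation where one input lacks a variable. The data set names `q1`, `q2`, and `all_months` are invented; the values are borrowed from the merge example elsewhere in this course.

```sas
DATA q1;
   INPUT Month Sales;
   DATALINES;
1 8223
2 6034
;
RUN;

DATA q2;
   INPUT Month Sales Goal;
   DATALINES;
3 4220 5000
;
RUN;

/* Rows from q1 come first; Goal is missing for them */
DATA all_months;
   SET q1 q2;
RUN;
```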
  • 56. Combining Data Sets • Merging Data Sets • One-to-One Match Merge • A single record in a data set corresponds to a single record in all other data sets • Example: Patient and Billing Information • One-to-Many Match Merge • Matching one observation from one data set to multiple observations in other data sets • Example: County and State Information • Note: Each data set must be sorted by the BY variables before a match merge can be done (PROC SORT)
  • 57. Combining Data Sets • One-to-One Match Merge • Usually need at least one common variable between data sets – matching purposes • For the example, a patient ID would be needed • Do not need common variable if all data sets are in exactly the same order • General Form: DATA output_data_set; MERGE input_data_set1 input_data_set2; By variable1 variable2; RUN;
  • 58. Combining Data Sets • One-to-One Match Merge • Example: • Data set performance (Month, Sales): 1, 8223 • 2, 6034 • 3, 4220 • Data set goals (Month, Goal): 1, 9000 • 2, 6000 • 3, 5000 • Code: DATA compare; MERGE performance goals; BY month; difference=sales-goal; RUN;
  • 59. Combining Data Sets • One-to-One Match Merge • Example cont.: • Data set compare (Month, Sales, Goal, Difference): 1, 8223, 9000, -777 • 2, 6034, 6000, 34 • 3, 4220, 5000, -780
  • 60. Combining Data Sets • One-to-Many Match Merge • Requires at least one common variable in the data sets for matching purposes • For the example, State information is in both the state and county files • If two data sets have variables with the same name, the variables in the second data set will overwrite the variable in the first. • General Form: DATA output_data_set; MERGE Data1 Data2 Data3; BY Variable1 Variable2; RUN;
  • 61. Combining Data Sets • One-to-Many Match Merge • Example: • Data set videos (Category, Sales): Aerobics, 12.99 • Aerobics, 13.99 • Aerobics, 13.99 • Step, 12.99 • Step, 12.99 • Weights, 15.99 • Data set adjustment (Category, Adjustment): Aerobics, .20 • Step, .30 • Weights, .25 • Code: DATA prices; MERGE videos adjustment; BY category; NewPrice=(1-adjustment)*sales; RUN;
  • 62. Combining Data Sets • One-to-Many Match Merge • Example cont.: • Data set prices (Category, Sales, Adjustment, NewPrice): Aerobics, 12.99, .20, 10.39 • Aerobics, 13.99, .20, 11.19 • Aerobics, 13.99, .20, 11.19 • Step, 12.99, .30, 9.09 • Step, 12.99, .30, 9.09 • Weights, 15.99, .25, 11.99
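The one-to-one example can be run end to end with the sketch below. The data values come from the slides; entering them with DATALINES and the explicit PROC SORT steps are additions for illustration (the data are already in Month order, so the sorts only make the requirement visible).

```sas
DATA performance;
   INPUT Month Sales;
   DATALINES;
1 8223
2 6034
3 4220
;
RUN;

DATA goals;
   INPUT Month Goal;
   DATALINES;
1 9000
2 6000
3 5000
;
RUN;

/* Both inputs must be in BY-variable order before the merge */
PROC SORT DATA=performance; BY Month; RUN;
PROC SORT DATA=goals;       BY Month; RUN;

DATA compare;
   MERGE performance goals;
   BY Month;
   difference = Sales - Goal;
RUN;
```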
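The one-to-many example can likewise be reproduced with the sketch below. The data values come from the slides; the DATALINES entry, the PROC SORT steps, and the FORMAT statement (to display two decimals) are additions for illustration.

```sas
DATA videos;
   INPUT Category $ Sales;
   DATALINES;
Aerobics 12.99
Aerobics 13.99
Aerobics 13.99
Step 12.99
Step 12.99
Weights 15.99
;
RUN;

DATA adjustment;
   INPUT Category $ Adjustment;
   DATALINES;
Aerobics .20
Step .30
Weights .25
;
RUN;

PROC SORT DATA=videos;     BY Category; RUN;
PROC SORT DATA=adjustment; BY Category; RUN;

/* Each adjustment row is matched to every video in the same category */
DATA prices;
   MERGE videos adjustment;
   BY Category;
   NewPrice = (1 - Adjustment) * Sales;
   FORMAT NewPrice 8.2;
RUN;
```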
  • 63. Working With SAS Data Sets Questions/Comments
  • 64. Summary Procedures 1. Print Procedure 2. Plot Procedure 3. Univariate Procedure 4. Means Procedure 5. Freq Procedure
  • 65. Print Procedure • PROC PRINT is used to print data to the output window • By default, prints all observations and variables in the SAS data set • General Form: PROC PRINT DATA=input_data_set <options>; <optional SAS statements>; RUN; • Some Options • input_data_set (obs=n) - Specifies the number of observations to be printed in the output • NOOBS - Suppresses printing observation number • LABEL - Prints the labels instead of variable names
  • 66. Print Procedure • Optional SAS statements • BY variable1 variable2 variable3; • Starts a new section of output for every new value of the BY variables • ID variable1 variable2 variable3; • Prints ID variables on the left hand side of the page and suppresses the printing of the observation numbers • SUM variable1 variable2 variable3; • Prints sum of listed variables at the bottom of the output • VAR variable1 variable2 variable3; • Prints only listed variables in the output
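A minimal sketch combining the PROC PRINT options and statements above, assuming the `sashelp.class` sample data set is available. The data are sorted first because the BY statement expects the input in BY-variable order.

```sas
PROC SORT DATA=sashelp.class OUT=class_sorted;
   BY Sex;
RUN;

PROC PRINT DATA=class_sorted (OBS=10) LABEL;
   BY Sex;                 /* new section for each value of Sex */
   ID Name;                /* Name replaces the observation number */
   VAR Age Height Weight;  /* print only these variables */
   SUM Weight;             /* totals for Weight */
RUN;
```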
  • 67. Plot Procedure • Used to create basic scatter plots of the data • Use PROC GPLOT or PROC SGPLOT for more sophisticated plots • General Form: PROC PLOT DATA=input_data_set; PLOT vertical_variable * horizontal_variable/<options>; RUN; • By default, SAS uses letters to mark points on plots • A for a single observation, B for two observations at the same point, etc. • To specify a different character to represent a point • PLOT vertical_variable * horizontal_variable = '*';
  • 68. Plot Procedure • To specify a third variable to use to mark points • PLOT vertical_variable * horizontal_variable = third_variable; • To plot more than one variable on the vertical axis • PLOT vertical_variable1 * horizontal_variable='2' vertical_variable2 * horizontal_variable='1'/OVERLAY;
  • 69. Univariate Procedure • PROC UNIVARIATE is used to examine the distribution of data • Produces summary statistics for a single variable • Includes mean, median, mode, standard deviation, skewness, kurtosis, quantiles, etc. • General Form: PROC UNIVARIATE DATA=input_data_set <options>; VAR variable1 variable2 variable3; RUN; • If the VAR statement is not used, summary statistics will be produced for all numeric variables in the input data set.
  • 70. Univariate Procedure • Options include: • PLOT – produces Stem-and-leaf plot, Box plot, and Normal probability plot; • NORMAL – produces tests of Normality
  • 71. Means Procedure • Similar to the Univariate procedure • General Form: PROC MEANS DATA=input_data_set <options>; <Optional SAS statements>; RUN; • With no options or optional SAS statements, the Means procedure will print out the number of non-missing values, mean, standard deviation, minimum, and maximum for all numeric variables in the input data set
  • 72. Means Procedure • Options • Statistics Available • CLM – Two-Sided Confidence Limits • LCLM – Lower Confidence Limit • UCLM – Upper Confidence Limit • CSS – Corrected Sum of Squares • USS – Uncorrected Sum of Squares • CV – Coefficient of Variation • KURTOSIS – Kurtosis • SKEWNESS – Skewness • MAX – Maximum Value • MIN – Minimum Value • RANGE – Range • MEAN – Mean • MEDIAN (or P50) – Median • N – Number of Non-missing Values • NMISS – Number of Missing Values • STDDEV – Standard Deviation • STDERR – Standard Error of Mean • SUM – Sum • SUMWGT – Sum of Weight Variables • VAR – Variance • T – Student’s t • PROBT – Probability for Student’s t • Q1 (P25) – 25% Quantile • Q3 (P75) – 75% Quantile • P1 – 1% Quantile • P5 – 5% Quantile • P10 – 10% Quantile • P90 – 90% Quantile • P95 – 95% Quantile • P99 – 99% Quantile • Note: The default confidence level for confidence limits is 95% (ALPHA=0.05). Use the ALPHA= option to specify a different alpha level.
  • 73. Means Procedure • Optional SAS Statements • VAR Variable1 Variable2; • Specifies which numeric variables statistics will be produced for • BY Variable1 Variable2; • Calculates statistics for each combination of the BY variables • Output out=output_data_set; • Creates data set with the default statistics
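A minimal sketch of PROC MEANS with a few requested statistics and the optional statements above, assuming the `sashelp.class` sample data set is available. ALPHA=0.10 is chosen only to show how a 90% confidence interval would be requested.

```sas
PROC SORT DATA=sashelp.class OUT=class_sorted;
   BY Sex;
RUN;

PROC MEANS DATA=class_sorted N MEAN MEDIAN CLM ALPHA=0.10;
   VAR Height Weight;        /* analysis variables */
   BY Sex;                   /* separate statistics for each Sex */
   OUTPUT OUT=class_stats;   /* data set with the default statistics */
RUN;
```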
  • 74. FREQ Procedure • PROC FREQ is used to generate frequency tables • The most common usage is to create a table showing the distribution of categorical variables • General Form: PROC FREQ DATA=input_data_set; TABLES variable1*variable2*variable3/<options>; RUN; • Options • LIST – prints cross tabulations in list format rather than grid • MISSING – specifies that missing values should be included in the tabulations • OUT=output_data_set – creates a data set containing frequencies, list format • NOPRINT – suppresses printing in the output window • Use BY statement to get percentages within each category of a variable
  • 76. Statistical Analysis Procedures 1. Correlation – PROC CORR 2. Regression – PROC REG 3. Analysis of Variance – PROC ANOVA 4. Chi-square Test of Association – PROC FREQ 5. Generalized Linear Models – PROC GENMOD
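A minimal sketch of a two-way frequency table with the options above, assuming the `sashelp.class` sample data set is available. The output data set name is invented.

```sas
PROC FREQ DATA=sashelp.class;
   /* Cross tabulation of Sex by Age, shown as a list and saved to a data set */
   TABLES Sex*Age / LIST MISSING OUT=sex_age_counts;
RUN;
```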
  • 77. CORR Procedure • PROC CORR is used to calculate the correlations between variables • Correlation coefficient measures the linear relationship between two variables • Values range from -1 to 1 • Negative correlation – as one variable increases the other decreases • Positive correlation – as one variable increases the other increases • 0 – no linear relationship between the two variables • 1 – perfect positive linear relationship • -1 – perfect negative linear relationship • General Form: PROC CORR DATA=input_data_set <options>; VAR Variable1 Variable2; WITH Variable3; RUN;
  • 78. CORR Procedure • If the VAR and WITH statements are not used, correlation is computed for all pairs of numeric variables • Options include • SPEARMAN – computes Spearman’s rank correlations • KENDALL – computes Kendall’s Tau coefficients • HOEFFDING – computes Hoeffding’s D statistic
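A minimal sketch of PROC CORR with VAR and WITH statements, assuming the `sashelp.class` sample data set is available. PEARSON is listed explicitly because requesting SPEARMAN alone would replace the default Pearson correlations.

```sas
PROC CORR DATA=sashelp.class PEARSON SPEARMAN;
   VAR Height Weight;   /* columns of the correlation table */
   WITH Age;            /* rows of the correlation table */
RUN;
```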
  • 79. REG Procedure • PROC REG is used to fit linear regression models by least squares estimation • One of many SAS procedures that can perform regression analysis • Only continuous independent variables (Use GENMOD for categorical variables) • General Form: PROC REG DATA=input_data_set <options>; MODEL dependent=independent1 independent2/<options>; <optional statements>; RUN; • PROC REG statement options include • PCOMIT=m – performs incomplete principal component estimation, using all but the last m principal components • CORR – displays correlation matrix for the variables in the model
  • 80. REG Procedure • MODEL statement options include • SELECTION= • Specifies a model selection procedure be conducted – FORWARD, BACKWARD, and STEPWISE • ADJRSQ - Computes the Adjusted R-Square • MSE – Computes the Mean Square Error • COLLIN – performs collinearity analysis • CLB – computes confidence limits for parameter estimates • ALPHA= • Sets significance value for confidence and prediction intervals and tests
  • 81. REG Procedure • Optional statements include • PLOT Dependent*Independent1 – generates plot of data
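A minimal sketch of PROC REG using some of the options listed above, assuming the `sashelp.class` sample data set is available. The choice of Weight as the response is only for illustration.

```sas
PROC REG DATA=sashelp.class CORR;
   /* CLB: confidence limits for estimates; COLLIN: collinearity analysis */
   MODEL Weight = Height Age / CLB ALPHA=0.05 COLLIN;
RUN;
QUIT;
```

PROC REG is an interactive procedure, so QUIT ends it after the RUN group has been executed.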
  • 82. ANOVA Procedure • PROC ANOVA performs analysis of variance • Designed for balanced data (PROC GLM is used for unbalanced data) • Can handle nested and crossed effects and repeated measures • General Form: PROC ANOVA DATA=input_data_set <options>; CLASS independent1 independent2; MODEL dependent=independent1 independent2; <optional statements>; RUN; • The CLASS statement must come before the MODEL statement; it is used to define classification variables
  • 83. ANOVA Procedure • Useful PROC ANOVA statement option – OUTSTAT=output_data_set • Generates output data set that contains sums of squares, degrees of freedom, statistics, and p-values for each effect in the model • Useful optional statement – MEANS independent1/<comparison type> • Used to perform multiple comparisons analysis • Set <comparison type> to: • TUKEY – Tukey’s studentized range test • BON – Bonferroni t test • T – pairwise t tests • Duncan – Duncan’s multiple-range test • Scheffe – Scheffe’s multiple comparison procedure
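A minimal sketch of a one-way analysis of variance with a multiple comparison request, assuming the `sashelp.class` sample data set is available. The group sizes in that data set are not equal, which is acceptable for a one-way layout but would call for PROC GLM in more complex designs.

```sas
PROC ANOVA DATA=sashelp.class;
   CLASS Sex;             /* classification variable, before MODEL */
   MODEL Height = Sex;
   MEANS Sex / TUKEY;     /* Tukey's studentized range test */
RUN;
QUIT;
```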
  • 84. FREQ Procedure • PROC FREQ can also be used to perform analysis with categorical data • General Form: PROC FREQ DATA=input_data_set; TABLES variable1*variable2/<options>; RUN; • TABLES statement options include: • AGREE – Tests and measures of classification agreement including McNemar’s test, Bowker’s test, Cochran’s Q test, and Kappa statistics • CHISQ – Chi-square tests of homogeneity or independence and measures of association • MEASURES – Measures of association include Pearson and Spearman correlation, gamma, Kendall’s tau-b, Stuart’s tau-c, Somers’ D, lambda, odds ratios, risk ratios, and confidence intervals
  • 85. GENMOD Procedure • PROC GENMOD is used to fit generalized linear models, in which the response is not necessarily normal • Logistic and Poisson regression are examples of generalized linear models • General Form: PROC GENMOD DATA=input_data_set; CLASS independent1; MODEL dependent = independent1 independent2/ dist= <option> link=<option>; RUN;
  • 86. GENMOD Procedure • DIST = - specifies the distribution of the response variable • LINK= - specifies the link function from the linear predictor to the mean of the response • Example – Logistic Regression • DIST = binomial • LINK = logit • Example – Poisson Regression • DIST = poisson • LINK = log
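A minimal sketch of a logistic regression in PROC GENMOD, assuming the `sashelp.class` sample data set is available. The binary response `tall` and its cutoff of 62 are invented for illustration, and with so few observations the fit may produce convergence warnings; the point is only the shape of the syntax.

```sas
DATA class_flag;
   SET sashelp.class;
   tall = (Height > 62);   /* 1 if true, 0 otherwise */
RUN;

/* DESCENDING models the probability that tall = 1 */
PROC GENMOD DATA=class_flag DESCENDING;
   CLASS Sex;
   MODEL tall = Sex Age / DIST=binomial LINK=logit;
RUN;
```

For a Poisson regression the same pattern applies with a count response and DIST=poisson LINK=log.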