1. Data Entry, Coding and Cleaning
Objectives
Be able to:
Define variables
Understand basic coding principles
Clean data
2. Coding
Coding – process of translating information
gathered from questionnaires or other
sources into something that can be analyzed
Involves assigning a value to the information
given; often that value is also given a label
Coding can make data more consistent:
– Example: Question = Sex
– Answers = Male, Female, M, or F
– Coding will avoid such inconsistencies
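The Sex example above can be sketched in Python. This is a minimal illustration, not a prescribed method: the name code_sex and the scheme 0 = male, 1 = female are assumptions chosen to match the labeling example used later in these slides.

```python
# Assumed code scheme for illustration: 0 = male, 1 = female.
SEX_CODES = {"male": 0, "m": 0, "female": 1, "f": 1}

def code_sex(raw):
    """Return the numeric code for a written answer, or None if unrecognized."""
    return SEX_CODES.get(raw.strip().lower())

answers = ["Male", "Female", "M", "F", " f "]
coded = [code_sex(a) for a in answers]
print(coded)  # [0, 1, 0, 1, 1]
```

Unrecognized answers come back as None so they can be reviewed by hand instead of being silently assigned a code.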
3. Coding Systems
Common coding systems (code and label) for
dichotomous variables:
– 0=No 1=Yes
(1 = value assigned, Yes= label of value)
– OR: 1=No 2=Yes
When you assign a value you must also make it
clear what that value means
– In first example above, 1=Yes but in second example 1=No
– As long as it is clear how the data are coded, either is fine
You can make it clear by creating a data dictionary to
accompany the dataset
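A data dictionary can be as simple as a lookup table kept alongside the dataset. The sketch below assumes two hypothetical variables, SMOKER and INSURED, coded with the two systems shown above; the names are illustrative only.

```python
# Hypothetical data dictionary: variable name -> {code: label}.
DATA_DICTIONARY = {
    "SMOKER": {0: "No", 1: "Yes"},    # 0/1 system
    "INSURED": {1: "No", 2: "Yes"},   # 1/2 system
}

def label_for(variable, code):
    """Look up what a stored code means for a given variable."""
    return DATA_DICTIONARY[variable][code]

print(label_for("SMOKER", 1))   # Yes
print(label_for("INSURED", 1))  # No
```

The same value, 1, means Yes for one variable and No for the other, which is why the dictionary has to travel with the data.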
4. Coding: Dummy Variables
A “dummy” variable is any variable that is coded to
have 2 levels (yes/no, male/female, etc.)
Dummy variables may be used to represent more
complicated variables
– Example: number of cigarettes smoked per week, with
75 different responses ranging from 0 cigarettes to 3
packs per week
– Can be recoded as a dummy variable:
1=smokes (at all) 0=non-smoker
This type of coding is useful in later stages of
analysis
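The smoking recode above might look like this in Python. The sample values are made up for illustration.

```python
# Made-up sample: cigarettes smoked per week for five respondents.
cigarettes_per_week = [0, 5, 0, 20, 60]

# Dummy variable: 1 = smokes (at all), 0 = non-smoker.
smoker = [1 if n > 0 else 0 for n in cigarettes_per_week]
print(smoker)  # [0, 1, 0, 1, 1]
```

Keep the original counts in the dataset as well; the dummy variable is an addition, not a replacement.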
5. Coding: Attaching Labels to Values
Many analysis software packages allow you to attach
a label to the variable values
Example: Label 0’s as male and 1’s as female
Makes reading data output easier:
Without label:
Variable SEX   Frequency   Percent
0              21          60%
1              14          40%
With label:
Variable SEX   Frequency   Percent
Male           21          60%
Female         14          40%
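Statistical packages attach labels for you, but the idea can be sketched in plain Python. The function frequency_table below is a hypothetical helper, not part of any package; the data reproduce the 21 and 14 counts from the table above.

```python
from collections import Counter

SEX_LABELS = {0: "Male", 1: "Female"}
sex = [0] * 21 + [1] * 14  # 21 males and 14 females, as in the slide

def frequency_table(values, labels=None):
    """Return (name, frequency, percent) rows, using labels if provided."""
    counts = Counter(values)
    total = len(values)
    rows = []
    for code in sorted(counts):
        name = labels[code] if labels else code
        rows.append((name, counts[code], round(100 * counts[code] / total)))
    return rows

print(frequency_table(sex))              # [(0, 21, 60), (1, 14, 40)]
print(frequency_table(sex, SEX_LABELS))  # [('Male', 21, 60), ('Female', 14, 40)]
```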
6. Coding: Ordinal Variables
Coding process is similar with other categorical
variables
Example: variable EDUCATION, possible coding:
0 = Did not graduate from high school
1 = High school graduate
2 = Some college or post-high school education
3 = College graduate
Could be coded in reverse order (0=college
graduate, 3=did not graduate high school)
For this ordinal categorical variable we want to be
consistent with numbering because the value of the
code assigned has significance
7. Coding: Ordinal Variables (cont.)
Example of bad coding:
0 = Some college or post-high school education
1 = High school graduate
2 = College graduate
3 = Did not graduate from high school
Data has an inherent order but coding does
not follow that order—NOT appropriate
coding for an ordinal categorical variable
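One way to catch this kind of mistake is to check that the codes rise (or fall) steadily along the inherent order. The helper follows_order below is a hypothetical sketch, applied to the good, reversed, and bad EDUCATION codings from these slides.

```python
# Inherent order of the EDUCATION categories, lowest to highest.
EDUCATION_ORDER = [
    "Did not graduate from high school",
    "High school graduate",
    "Some college or post-high school education",
    "College graduate",
]

def follows_order(coding, ordered_labels):
    """True if codes strictly increase or strictly decrease along the order."""
    codes = [coding[label] for label in ordered_labels]
    pairs = list(zip(codes, codes[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing

good = {label: i for i, label in enumerate(EDUCATION_ORDER)}
reverse = {label: 3 - i for i, label in enumerate(EDUCATION_ORDER)}
bad = {
    "Some college or post-high school education": 0,
    "High school graduate": 1,
    "College graduate": 2,
    "Did not graduate from high school": 3,
}

print(follows_order(good, EDUCATION_ORDER))     # True
print(follows_order(reverse, EDUCATION_ORDER))  # True
print(follows_order(bad, EDUCATION_ORDER))      # False
```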
8. Coding: Nominal Variables
For coding nominal variables, order makes
no difference
Example: variable RESIDE
1 = Northeast
2 = South
3 = Northwest
4 = Midwest
5 = Southwest
Order does not matter, no ordered value
associated with each response
9. Coding: Continuous Variables
Creating categories from a continuous variable (e.g.,
age) is common
May break down a continuous variable into chosen
categories by creating an ordinal categorical variable
Example: variable = AGECAT
1 = 0–9 years old
2 = 10–19 years old
3 = 20–39 years old
4 = 40–59 years old
5 = 60 years or older
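The AGECAT recode can be sketched as a small function. It assumes age is recorded in whole years; the name age_category is illustrative.

```python
def age_category(age):
    """Map age in whole years to the AGECAT codes 1-5 shown above."""
    if age < 0:
        raise ValueError("age cannot be negative")
    if age <= 9:
        return 1   # 0-9 years old
    if age <= 19:
        return 2   # 10-19 years old
    if age <= 39:
        return 3   # 20-39 years old
    if age <= 59:
        return 4   # 40-59 years old
    return 5       # 60 years or older

ages = [0, 9, 10, 37, 59, 60, 85]
agecat = [age_category(a) for a in ages]
print(agecat)  # [1, 1, 2, 3, 4, 5, 5]
```

Note that the categories are not equally wide (two span 10 years, two span 20), so the codes are ordinal rather than evenly spaced.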
10. Coding: Continuous Variables (cont.)
May need to code responses from fill-in-the-blank
and open-ended questions
– Example: “Why did you choose not to see a doctor about
this illness?”
One approach is to group together responses with
similar themes
– Example: “didn’t feel sick enough to see a doctor”,
“symptoms stopped,” and “illness didn’t last very long”
– Could all be grouped together as “illness was not severe”
Also need to code “don’t know” responses
– Typically, “don’t know” is coded as 9
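Once a person has read the responses and decided on themes, the grouping can be recorded as a lookup table. In this sketch the theme code 1 is an assumption made for illustration; 9 for "don't know" follows the convention above.

```python
# Assumed theme codes: 1 = illness was not severe, 9 = don't know.
THEME_LABELS = {1: "Illness was not severe", 9: "Don't know"}

RESPONSE_THEMES = {
    "didn't feel sick enough to see a doctor": 1,
    "symptoms stopped": 1,
    "illness didn't last very long": 1,
    "don't know": 9,
}

def code_response(text):
    """Return the theme code for a response, or None if not yet classified."""
    key = text.strip().lower().replace("\u2019", "'")  # normalize curly apostrophes
    return RESPONSE_THEMES.get(key)

print(code_response("Symptoms stopped"))  # 1
print(code_response("Don't know"))        # 9
print(code_response("Too expensive"))     # None
```

Responses that return None have not been assigned a theme yet and still need a human decision.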
11. Coding Tip
1. All data should be numeric. Even though it is possible
to use letters or words (string variables) as data, it is not
desirable to do so.
2. Each variable for each case or participant must occupy
the same column in the Data Editor.
3. All values (codes) for a variable must be mutually
exclusive. (Otherwise, the item should be divided into a separate variable for
each possible response choice, with one value of each variable (usually 1)
corresponding to yes.)
4. Apply any coding rules consistently for all participants.
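Tip 3 can be illustrated with a "check all that apply" question. The question, the response choices, and the variable names below are hypothetical; the point is only that each choice becomes its own 1/0 variable.

```python
# Hypothetical "check all that apply" question with three response choices.
SYMPTOMS = ["fever", "cough", "nausea"]

def to_indicators(selected):
    """Split one multi-response item into one yes/no variable per choice."""
    chosen = set(selected)
    return {f"SYMPTOM_{s.upper()}": 1 if s in chosen else 0 for s in SYMPTOMS}

row = to_indicators(["fever", "nausea"])
print(row)  # {'SYMPTOM_FEVER': 1, 'SYMPTOM_COUGH': 0, 'SYMPTOM_NAUSEA': 1}
```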
16. Data Cleaning
One of the first steps in analyzing data is to
“clean” it of any obvious data entry errors:
– Outliers? (really high or low numbers)
Example: Age = 110 (really 10 or 11?)
– Value entered that doesn’t exist for variable?
Example: 2 entered where 1=male, 0=female
– Missing values?
Did the person not give an answer? Was answer
accidentally not entered into the database?
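The three checks above can be sketched as a simple screening pass. The records, the plausible age range of 0 to 100, and the function name find_problems are all assumptions for illustration; a flagged value is a prompt to go back to the questionnaire, not proof of an error.

```python
# Made-up records; None stands for a value that was never entered.
records = [
    {"id": 1, "AGE": 34, "SEX": 1},
    {"id": 2, "AGE": 110, "SEX": 0},
    {"id": 3, "AGE": 27, "SEX": 2},
    {"id": 4, "AGE": None, "SEX": 1},
]

VALID_SEX = {0, 1}     # the only codes defined for SEX
AGE_RANGE = (0, 100)   # assumed plausible range; adjust to the study

def find_problems(rows):
    """Return (id, variable, problem) for each value that needs checking."""
    problems = []
    for row in rows:
        age, sex = row["AGE"], row["SEX"]
        if age is None:
            problems.append((row["id"], "AGE", "missing"))
        elif not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
            problems.append((row["id"], "AGE", "out of range"))
        if sex is None:
            problems.append((row["id"], "SEX", "missing"))
        elif sex not in VALID_SEX:
            problems.append((row["id"], "SEX", "invalid code"))
    return problems

problems = find_problems(records)
for problem in problems:
    print(problem)
# (2, 'AGE', 'out of range')
# (3, 'SEX', 'invalid code')
# (4, 'AGE', 'missing')
```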
17. Data Cleaning
Exploratory data analysis is helpful in the data cleaning
stage (will be discussed next)
Use the Codebook menu in SPSS version 20 and above
(we will see)
Editor's Notes
#2:Once you have gathered your questionnaire or other data, you may choose to code the data for entry into a database. Coding is the process of translating the information gathered from questionnaires or other investigations into something that can be analyzed, usually using a computer program. Coding involves assigning a value to the information given in a questionnaire, and often that value is given a label. In addition, coding can make the data more consistent. For example, if you have the question “Sex?” you might end up with the answers “Male”, “Female”, or “M”, “F”, etc. Coding will avoid such inconsistencies.
#3:A common coding system (code and label) for dichotomous variables is the following:
0 = No 1 = Yes,
where the number 1 is the value assigned, and Yes is the label or meaning of that value.
Some like to use a system of ones and twos, where
1 = No 2 = Yes.
This brings out an important point in coding. When you assign a value to a piece of information, you must also make it clear what that value means. In the first example given above, 1 = Yes, but in the second example, 1 = No. Either way is fine, as long as it is clear how the data are coded. You can make it clear by creating a data dictionary as a separate file to accompany the dataset.
#4:Dichotomous variables can also be dummy variables. A “dummy” variable is any variable that is coded to have 2 levels, like the yes/no variables and male/female variables above. They can also be used to represent or stand in for more complicated variables. This is especially useful when you have many values that are more meaningful when analyzed in terms of a yes/no response.
For example, you may have collected data on the number of cigarettes smoked per week, with 75 different responses ranging from no cigarettes at all to 3 packs a week, but you can recode these data as a dummy variable: 1 = Smokes (at all), 0 = Non-smoker. You could also do this for education (1 = Any post-high school education, 0 = No post-high school education), food consumption (1 = Ate item at all during time period, 0 = Did not eat the item), and many other variables. This type of coding is useful in the later stages of analysis.
#5:Many analysis software packages allow you to attach a label to variable values. Then the computer automatically labels the 0’s as male and the 1’s as female, which makes your life much easier when you are looking at the output, as shown in the example below:
Without label: Variable SEX Frequency Percent
0 21 60%
1 14 40%
With label: Variable SEX Frequency Percent
Male 21 60%
Female 14 40%
#6:The coding process is similar with other categorical variables. For the variable EDUCATION mentioned above, we might code as follows:
0 = Did not graduate from high school
1 = High school graduate
2 = Some college or post-high school education
3 = College graduate.
Note that for this ordinal categorical variable, we need to be consistent with the numbering, because the value of the code assigned has significance. The higher the code, the more educated the respondent is. We could have also coded this variable in reverse order, so that 0 = College graduate, and 3 = Did not graduate from high school. In this case, the higher the code, the less educated the respondent is. Either way is fine, as long as we remember the coding when interpreting the analysis.
#7:The following is an example of bad coding:
0 = Some college or post-high school education
1 = High school graduate
2 = College graduate
3 = Did not graduate from high school
The data we are trying to code has an inherent order, but the coding in this example does not follow that order. This is not appropriate coding for an ordinal categorical variable.
#8:For a nominal categorical variable, however, the order makes no difference. Although we code each category with a number, the number does not represent a numerical value. For example, using the variable RESIDE mentioned earlier,
1 = Northeast
2 = South
3 = Northwest
4 = Midwest
5 = Southwest.
It doesn’t matter what order we use for these categories. Midwest can be coded as 4, 2 or 5, because there is not an ordered value associated with each response.
#9:Coding continuous variables is straightforward. If someone gives his or her age as 37 years, you enter it into the database as 37. But what if you decided that you would rather use age categories instead of the data you collected in years?
Creating categories from a continuous variable is common, and can easily be done using analysis software. With a software package, you can break down a continuous variable such as age into categories by creating an ordinal categorical variable, such as the following:
AGECAT
1 = 0–9 years old
2 = 10–19 years old
3 = 20–39 years old
4 = 40–59 years old
5 = 60 years or older.
#10:You may also need to code responses from fill-in-the-blank and open-ended questions. With an open-ended question such as “Why did you choose not to see a doctor about this illness?”, respondents will all answer somewhat differently. Also, you may give response choices for a particular question but offer an “other (specify)” option as well, where respondents can write whatever response they choose. These types of open-ended questions can be a lot of work to analyze. One way to analyze the information is to group together responses with similar themes. For the question above, responses of “didn’t feel sick enough to see a doctor,” “symptoms stopped,” and “the illness didn’t last very long,” could all be grouped together as “the illness was not severe.”
You will also need to code “don’t know” responses. Typically, “don’t know” is coded as 9.
#16:Data Cleaning
One of the first steps in analyzing data is to look at the data and “clean” it of any obvious errors due to incorrect data entry.
If there are outliers (really high or really low numbers), are those numbers correct? An age value of 110 years could be an error for someone who was really 10 or 11 (or 101!).
Was a value entered that doesn’t exist for the variable? For example, if 1 = male and 0 = female, and the number “2” was entered, there is clearly an error.
If there are missing values, did the person not give an answer, or was it accidentally not entered into the database?