Data Entry, Coding and Cleaning
Objectives
Be able to:
• Define variables
• Understand basic coding principles
• Clean data
Coding
• Coding: the process of translating information gathered from questionnaires or other sources into something that can be analyzed
• Involves assigning a value to the information given; often that value is given a label
• Coding can make data more consistent:
– Example: Question = Sex
– Answers = Male, Female, M, or F
– Coding will avoid such inconsistencies
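A minimal sketch of how coding removes such inconsistencies: the Python snippet below maps the free-text answers from the example onto numeric codes. The function name `code_sex` and the scheme 0 = male, 1 = female are assumptions for illustration; any scheme works as long as it is documented.

```python
# Hypothetical coding scheme (an assumption for illustration): 0 = male, 1 = female.
SEX_CODES = {"male": 0, "m": 0, "female": 1, "f": 1}


def code_sex(answer):
    """Return the numeric code for a free-text answer, or None if unrecognized."""
    key = answer.strip().lower()
    return SEX_CODES.get(key)


raw_answers = ["Male", "Female", "M", "F", " f ", "male"]
coded = [code_sex(a) for a in raw_answers]
print(coded)  # [0, 1, 0, 1, 1, 0]
```

Returning `None` for an unrecognized answer keeps unexpected entries visible so they can be checked by hand rather than silently guessed.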
Coding Systems
• Common coding systems (code and label) for dichotomous variables:
– 0 = No, 1 = Yes (1 = value assigned, Yes = label of value)
– OR: 1 = No, 2 = Yes
• When you assign a value, you must also make it clear what that value means
– In the first example above, 1 = Yes, but in the second example, 1 = No
– As long as it is clear how the data are coded, either is fine
• You can make it clear by creating a data dictionary to accompany the dataset
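One possible way to keep a data dictionary next to the dataset is sketched below in Python. The variable names `SMOKER` and `INSURED`, their descriptions, and the helper `label_for` are hypothetical; the two coding systems are the ones from this slide.

```python
# A hypothetical data dictionary: variable name -> description and value labels.
DATA_DICTIONARY = {
    "SMOKER": {
        "description": "Respondent currently smokes",
        "values": {0: "No", 1: "Yes"},
    },
    "INSURED": {
        "description": "Respondent has health insurance",
        "values": {1: "No", 2: "Yes"},
    },
}


def label_for(variable, code):
    """Look up the label of a coded value; raise if the code is not defined."""
    values = DATA_DICTIONARY[variable]["values"]
    if code not in values:
        raise ValueError(f"{code!r} is not a valid code for {variable}")
    return values[code]


print(label_for("SMOKER", 1))   # Yes
print(label_for("INSURED", 1))  # No
```

The same code (1) means Yes for one variable and No for the other, which is exactly why the dictionary has to travel with the data.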
Coding: Dummy Variables
• A “dummy” variable is any variable that is coded to have 2 levels (yes/no, male/female, etc.)
• Dummy variables may be used to represent more complicated variables
– Example: number of cigarettes smoked per week, with 75 different responses ranging from 0 cigarettes to 3 packs per week
– Can be recoded as a dummy variable: 1 = smokes (at all), 0 = non-smoker
• This type of coding is useful in later stages of analysis
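The recoding above can be sketched in a few lines of Python. The six values in `cigarettes_per_week` are made-up sample data, not results from any study.

```python
# Hypothetical raw data: cigarettes smoked per week for six respondents.
cigarettes_per_week = [0, 12, 60, 0, 3, 40]

# Dummy coding from the slide: 1 = smokes (at all), 0 = non-smoker.
smoker = [1 if n > 0 else 0 for n in cigarettes_per_week]

print(smoker)       # [0, 1, 1, 0, 1, 1]
print(sum(smoker))  # 4
```

Because the dummy variable is 0/1, its sum is the number of smokers and its mean is the proportion of smokers.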
Coding: Attaching Labels to Values
• Many analysis software packages allow you to attach a label to the variable values
– Example: label 0’s as male and 1’s as female
• Makes reading data output easier:
Without label:
Variable SEX    Frequency    Percent
0               21           60%
1               14           40%
With label:
Variable SEX    Frequency    Percent
Male            21           60%
Female          14           40%
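As a rough illustration of what attaching labels does, the Python sketch below builds the frequency table with and without labels. The data are constructed to match the example (21 zeros, 14 ones); the function `frequency_table` is a hypothetical helper, not a feature of any statistics package.

```python
from collections import Counter

# Coded data matching the example: 21 zeros and 14 ones.
sex = [0] * 21 + [1] * 14

# Value labels as in the example (0 = male, 1 = female).
SEX_LABELS = {0: "Male", 1: "Female"}


def frequency_table(values, labels=None):
    """Return rows of (value or label, frequency, percent) sorted by code."""
    counts = Counter(values)
    total = len(values)
    rows = []
    for code in sorted(counts):
        name = labels[code] if labels else code
        rows.append((name, counts[code], round(100 * counts[code] / total)))
    return rows


print(frequency_table(sex))              # [(0, 21, 60), (1, 14, 40)]
print(frequency_table(sex, SEX_LABELS))  # [('Male', 21, 60), ('Female', 14, 40)]
```

The stored data stay numeric in both cases; only the display changes.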
Coding: Ordinal Variables
• The coding process is similar for other categorical variables
• Example: variable EDUCATION, possible coding:
0 = Did not graduate from high school
1 = High school graduate
2 = Some college or post-high school education
3 = College graduate
• Could be coded in reverse order (0 = college graduate, 3 = did not graduate from high school)
• For this ordinal categorical variable, the numbering must be consistent because the value of the code assigned has significance
Coding: Ordinal Variables (cont.)
• Example of bad coding:
0 = Some college or post-high school education
1 = High school graduate
2 = College graduate
3 = Did not graduate from high school
• The data have an inherent order, but the coding does not follow that order; this is NOT appropriate coding for an ordinal categorical variable
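One way to check a proposed ordinal coding is sketched below in Python. The helper `follows_order` is hypothetical; it accepts codes that rise or fall steadily along the natural order of the categories, so both the forward and the reverse EDUCATION codings pass, while the bad coding does not.

```python
# Education levels listed from lowest to highest attainment.
EDUCATION_ORDER = [
    "Did not graduate from high school",
    "High school graduate",
    "Some college or post-high school education",
    "College graduate",
]

# Good coding: 0, 1, 2, 3 in the natural order of the categories.
good_coding = {level: code for code, level in enumerate(EDUCATION_ORDER)}

# The bad coding from the slide.
bad_coding = {
    "Some college or post-high school education": 0,
    "High school graduate": 1,
    "College graduate": 2,
    "Did not graduate from high school": 3,
}


def follows_order(coding, ordered_levels):
    """True if codes rise or fall steadily along the ordered levels."""
    codes = [coding[level] for level in ordered_levels]
    ascending = all(a < b for a, b in zip(codes, codes[1:]))
    descending = all(a > b for a, b in zip(codes, codes[1:]))
    return ascending or descending


print(follows_order(good_coding, EDUCATION_ORDER))  # True
print(follows_order(bad_coding, EDUCATION_ORDER))   # False
```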
Coding: Nominal Variables
• For coding nominal variables, order makes no difference
• Example: variable RESIDE
1 = Northeast
2 = South
3 = Northwest
4 = Midwest
5 = Southwest
• Order does not matter; there is no ordered value associated with each response
Coding: Continuous Variables
• Creating categories from a continuous variable (e.g., age) is common
• A continuous variable may be broken down into chosen categories by creating an ordinal categorical variable
• Example: variable = AGECAT
1 = 0–9 years old
2 = 10–19 years old
3 = 20–39 years old
4 = 40–59 years old
5 = 60 years or older
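The AGECAT categories can be derived from age in years with a short Python function. This is only a sketch: the name `age_category` is an assumption, and it assumes age is recorded in whole years and is never negative.

```python
from bisect import bisect_right

# Lower bounds of categories 2 to 5: 10, 20, 40 and 60 years.
AGE_CUTPOINTS = [10, 20, 40, 60]


def age_category(age):
    """Map age in whole years to the AGECAT code (1 to 5) from the slide."""
    if age < 0:
        raise ValueError("age cannot be negative")
    # Count how many cutpoints the age has reached, then shift to start at 1.
    return bisect_right(AGE_CUTPOINTS, age) + 1


ages = [0, 9, 10, 37, 59, 60, 85]
print([age_category(a) for a in ages])  # [1, 1, 2, 3, 4, 5, 5]
```

Keeping the original age variable alongside AGECAT means the categories can be redrawn later if different cutpoints are needed.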
Coding: Continuous Variables (cont.)
• May need to code responses from fill-in-the-blank and open-ended questions
– Example: “Why did you choose not to see a doctor about this illness?”
• One approach is to group together responses with similar themes
– Example: “didn’t feel sick enough to see a doctor,” “symptoms stopped,” and “illness didn’t last very long”
– Could all be grouped together as “illness was not severe”
• Also need to code “don’t know” responses
– Typically, “don’t know” is coded as 9
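A rough Python sketch of grouping open-ended responses by theme follows. The code 1 for “illness was not severe” is an assumption for illustration; 9 for “don’t know” follows the slide. In practice a person usually reads each response and assigns the theme, so the lookup table here stands in for that manual step.

```python
# Hypothetical code list for the open-ended question (an assumption):
# 1 = illness was not severe, 9 = don't know.
THEME_CODES = {
    "didn't feel sick enough to see a doctor": 1,
    "symptoms stopped": 1,
    "illness didn't last very long": 1,
    "don't know": 9,
}


def code_response(text):
    """Return the theme code for a response, or None if it has no code yet."""
    return THEME_CODES.get(text.strip().lower())


responses = ["Symptoms stopped", "Don't know", "Too expensive"]
print([code_response(r) for r in responses])  # [1, 9, None]
```

A response that returns `None` has not been assigned a theme yet and needs manual review. The code 9 only works if 9 is not also a substantive category for the same variable.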
Coding Tips
1. All data should be numeric. Even though it is possible to use letters or words (string variables) as data, it is not desirable to do so.
2. Each variable must occupy the same column in the Data Editor for every case or participant.
3. All values (codes) for a variable must be mutually exclusive. Otherwise, the item should be divided into a separate variable for each possible response choice, with one value of each variable (usually 1) corresponding to yes.
4. Apply any coding rules consistently for all participants.
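Tip 3 can be illustrated with a “check all that apply” item, where the response choices are not mutually exclusive. The Python sketch below splits such an item into one yes/no variable per choice. The symptom list, the sample answers, and the `SYMPTOM_` variable names are all hypothetical.

```python
# Hypothetical "check all that apply" item: symptoms reported.
SYMPTOMS = ["fever", "cough", "nausea"]

# Hypothetical answers from three participants (one list per participant).
answers = [
    ["fever", "cough"],
    [],
    ["nausea"],
]

# One yes/no variable per response choice (1 = yes, 0 = no).
rows = [
    {f"SYMPTOM_{s.upper()}": int(s in checked) for s in SYMPTOMS}
    for checked in answers
]

for row in rows:
    print(row)
```

Each participant now has exactly one value per variable, in the same column for every case, which also satisfies tip 2.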
Data Cleaning
• One of the first steps in analyzing data is to “clean” it of any obvious data entry errors:
– Outliers? (really high or low numbers)
Example: Age = 110 (really 10 or 11?)
– Value entered that doesn’t exist for the variable?
Example: 2 entered where 1 = male, 0 = female
– Missing values?
Did the person not give an answer? Was the answer accidentally not entered into the database?
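The three checks above can be sketched in Python as follows. The records are made up, `find_problems` is a hypothetical helper, and the plausible age range of 0 to 100 is an assumption that should be adjusted to the study population.

```python
# Hypothetical records; None marks a value that was never entered.
records = [
    {"id": 1, "age": 34, "sex": 1},
    {"id": 2, "age": 110, "sex": 0},
    {"id": 3, "age": 27, "sex": 2},
    {"id": 4, "age": None, "sex": 1},
]

VALID_SEX = {0, 1}    # 1 = male, 0 = female, as in the example
AGE_RANGE = (0, 100)  # assumed plausible range; adjust to the study


def find_problems(record):
    """Return a list of issues to check against the original questionnaire."""
    problems = []
    # Missing values
    for field in ("age", "sex"):
        if record[field] is None:
            problems.append(f"{field} is missing")
    # Outliers
    age = record["age"]
    if age is not None and not (AGE_RANGE[0] <= age <= AGE_RANGE[1]):
        problems.append(f"age {age} is outside {AGE_RANGE}")
    # Values that do not exist for the variable
    sex = record["sex"]
    if sex is not None and sex not in VALID_SEX:
        problems.append(f"sex {sex} is not a defined code")
    return problems


for r in records:
    issues = find_problems(r)
    if issues:
        print(r["id"], issues)
```

The function only flags records. Whether an age of 110 is a typing error or a real value still has to be decided by going back to the source questionnaire.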
Data Cleaning (cont.)
• Exploratory data analysis is helpful in the data cleaning stage (will be discussed next)
• Use the Codebook menu in SPSS version 20 and above (we will see)



Editor's Notes

  • #2: Once you have gathered your questionnaire or other data, you may choose to code the data for entry into a database. Coding is the process of translating the information gathered from questionnaires or other investigations into something that can be analyzed, usually using a computer program. Coding involves assigning a value to the information given in a questionnaire, and often that value is given a label. In addition, coding can make the data more consistent. For example, if you have the question “Sex?” you might end up with the answers “Male”, “Female”, or “M”, “F”, etc. Coding will avoid such inconsistencies.    
  • #3: A common coding system (code and label) for dichotomous variables is the following: 0 = No 1 = Yes, where the number 1 is the value assigned, and Yes is the label or meaning of that value. Some like to use a system of ones and twos, where 1 = No 2 = Yes. This brings out an important point in coding. When you assign a value to a piece of information, you must also make it clear what that value means. In the first example given above, 1 = Yes, but in the second example, 1 = No. Either way is fine, as long as it is clear how the data are coded. You can make it clear by creating a data dictionary as a separate file to accompany the dataset.   
  • #4: Dichotomous variables can also be dummy variables. A “dummy” variable is any variable that is coded to have 2 levels, like the yes/no variables and male/female variables above. They can also be used to represent or stand in for more complicated variables. This is especially useful when you have many values that are more meaningful when analyzed in terms of a yes/no response. For example, you may have collected data on the number of cigarettes smoked per week, with 75 different responses ranging from no cigarettes at all to 3 packs a week, but you can recode these data as a dummy variable: 1 = Smokes (at all), 0 = Non-smoker. You could also do this for education (1 = Any post-high school education, 0 = No post-high school education), food consumption (1 = Ate item at all during time period, 0 = Did not eat the item), and many other variables. This type of coding is useful in the later stages of analysis.     
  • #5: Many analysis software packages allow you to attach a label to variable values. Then the computer automatically labels the 0’s as male and the 1’s as female, which makes your life much easier when you are looking at the output, as shown in the example below:
    Without label:
    Variable SEX    Frequency    Percent
    0               21           60%
    1               14           40%
    With label:
    Variable SEX    Frequency    Percent
    Male            21           60%
    Female          14           40%
  • #6: The coding process is similar with other categorical variables. For the variable EDUCATION mentioned above, we might code as follows: 0 = Did not graduate from high school 1 = High school graduate 2 = Some college or post-high school education 3 = College graduate.  Note that for this ordinal categorical variable, we need to be consistent with the numbering, because the value of the code assigned has significance. The higher the code, the more educated the respondent is. We could have also coded this variable in reverse order, so that 0 = College graduate, and 3 = Did not graduate from high school. In this case, the higher the code, the less educated the respondent is. Either way is fine, as long as we remember the coding when interpreting the analysis.  
  • #7: The following is an example of bad coding: 0 = Some college or post-high school education 1 = High school graduate 2 = College graduate 3 = Did not graduate from high school The data we are trying to code has an inherent order, but the coding in this example does not follow that order. This is not appropriate coding for an ordinal categorical variable.  
  • #8: For a nominal categorical variable, however, the order makes no difference. Although we code each category with a number, the number does not represent a numerical value. For example, using the variable RESIDE mentioned earlier, 1 = Northeast 2 = South 3 = Northwest 4 = Midwest 5 = Southwest. It doesn’t matter what order we use for these categories. Midwest can be coded as 4, 2 or 5, because there is not an ordered value associated with each response.  
  • #9: Coding continuous variables is straightforward. If someone gives his or her age as 37 years, you enter it into the database as 37. But what if you decided that you would rather use age categories instead of the data you collected in years? Creating categories from a continuous variable is common, and can easily be done using analysis software. With a software package, you can break down a continuous variable such as age into categories by creating an ordinal categorical variable, such as the following: AGECAT 1 = 0–9 years old 2 = 10–19 years old 3 = 20–39 years old 4 = 40–59 years old 5 = 60 years or older.    
  • #10: You may also need to code responses from fill-in-the-blank and open-ended questions. With an open-ended question such as “Why did you choose not to see a doctor about this illness?”, respondents will all answer somewhat differently. Also, you may give response choices for a particular question but offer an “other (specify)” option as well, where respondents can write whatever response they choose. These types of open-ended questions can be a lot of work to analyze. One way to analyze the information is to group together responses with similar themes. For the question above, responses of “didn’t feel sick enough to see a doctor,” “symptoms stopped,” and “the illness didn’t last very long,” could all be grouped together as “the illness was not severe.” You will also need to code “don’t know” responses. Typically, “don’t know” is coded as 9.      
  • #16: Data Cleaning One of the first steps in analyzing data is to look at the data and “clean” it of any obvious errors due to incorrect data entry. If there are outliers (really high or really low numbers), are those numbers correct? An age value of 110 years could be an error for someone who was really 10 or 11 (or 101!). Was a value entered that doesn’t exist for the variable? For example, if 1 = male and 0 = female, and the number “2” was entered, there is clearly an error. If there are missing values, did the person not give an answer, or was it accidentally not entered into the database?    