Detecting Bad Data
CARMA Research Module
Jeff Stanton

May 18-20, 2006 Internet Data Collection Methods (Day 2-2)

Sources of Data Problems in Online Studies
  • Technical errors:
      • Programming errors: Not common, but damaging when they occur
      • Server errors: Can halt the collection of data
      • Transmission errors: Uncommon and usually isolated to one record or field
  • Response fraud:
      • Inadvertent multiple response and malicious multiple response
      • Missing data
      • Intentionally malicious patterns of response leading to outliers or self-contradictory data

Response Fraud
  • Deindividuation: Anonymous respondents, working at a distance from the researcher, have limited accountability to the research process
  • Participant incentives introduce mixed motives: necessity of completing the instrument, but not to any particular level of quality
  • Minimal frauds: skipping questions, not thinking through the answers
  • Maximal frauds: A robot that randomly answers

Duplicate Detection
  • Fingerprint each row, e.g., with the sum of the numeric columns multiplied by the SD of the same columns
  • Create a new variable that contains this "checksum" value for each row/case
  • Sort the dataset on the checksum
  • Create a lag difference variable that subtracts each row's checksum from that of the neighboring row
  • Sort on the lag variable and investigate all cases of zero or small differences
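
The checksum procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the course's SPSS code: the names `checksum` and `find_near_duplicates`, the `tolerance` parameter, and the sample responses are all assumptions made for the example.

```python
from statistics import stdev

def checksum(row):
    """Fingerprint a row: sum of its numeric values times their sample SD."""
    return sum(row) * stdev(row)

def find_near_duplicates(rows, tolerance=1e-9):
    """Return pairs of row indices whose checksums differ by <= tolerance."""
    # Sort (checksum, index) pairs so likely duplicates become neighbors.
    keyed = sorted((checksum(r), i) for i, r in enumerate(rows))
    pairs = []
    for (c_prev, i_prev), (c_next, i_next) in zip(keyed, keyed[1:]):
        # Lag difference between neighboring checksums.
        if abs(c_next - c_prev) <= tolerance:
            pairs.append((i_prev, i_next))
    return pairs

# Hypothetical data: one list of item responses per respondent.
responses = [
    [1, 2, 3, 4, 5],
    [5, 5, 4, 2, 1],
    [1, 2, 3, 4, 5],  # exact duplicate of row 0
    [3, 3, 2, 4, 1],
]
suspects = find_near_duplicates(responses)
print(suspects)
```

Because both the sum and the SD ignore the order of the answers, two rows with the same values in a different order share a checksum. A match is therefore only a prompt to inspect the original rows, not proof of a duplicate.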

Bogus Response Detection
  • Calculate common univariate statistics using the complete row of responses for each subject
  • Create new variables for the univariate summaries (mean, sd, skew, kurt, max, min)
  • Sort the cases by the mean value
  • Look for extreme outliers on the high and low ends
  • Sort the cases by standard deviation, skewness, kurtosis, maximum, minimum
  • Look for anomalies and trace them back to the original data for that subject
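
The row-wise summaries can be sketched as follows. This is a hedged illustration: the function name `row_summary`, the subject IDs, and the data are invented, and the choice of the population SD and moment-based skewness and excess kurtosis is an assumption, since the slides do not say which estimators to use.

```python
from statistics import mean, pstdev

def row_summary(row):
    """Per-respondent summaries: mean, SD, skewness, excess kurtosis, max, min."""
    n = len(row)
    m = mean(row)
    sd = pstdev(row)  # population SD (an assumption, see lead-in)
    if sd == 0:
        # Skewness and kurtosis are undefined for a constant row;
        # 0.0 is only a placeholder so the row can still be sorted.
        skew = kurt = 0.0
    else:
        skew = sum((x - m) ** 3 for x in row) / (n * sd ** 3)
        kurt = sum((x - m) ** 4 for x in row) / (n * sd ** 4) - 3.0
    return {"mean": m, "sd": sd, "skew": skew, "kurt": kurt,
            "max": max(row), "min": min(row)}

# Hypothetical responses on a 1-5 scale, keyed by subject ID.
responses = {
    "s01": [4, 3, 5, 2, 4, 3],
    "s02": [3, 3, 3, 3, 3, 3],  # same option every time
    "s03": [1, 5, 1, 5, 1, 5],  # only the scale endpoints
}
summaries = {sid: row_summary(r) for sid, r in responses.items()}

# Sort cases by SD: the extremes at either end deserve a closer look.
by_sd = sorted(summaries, key=lambda sid: summaries[sid]["sd"])
print(by_sd)
```

Sorting by each summary in turn puts the unusual cases at the top or bottom of the list, where they can be traced back to the raw responses.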

Multivariate Outlier Detection
  • Use Mahalanobis distance to detect outliers
      • Regress an arbitrary dependent variable on a set of related items (the distance depends only on the predictors, so the choice of dependent variable does not matter)
      • Sort by Mahalanobis distance: Larger distances are suggestive of outliers
  • Use autocorrelation to detect unusual data patterns
      • Flip the data: Cases become variables and variables become cases
      • Run an autocorrelation function
      • Look at the ACF graphs to find oddly regular patterns of responding (autocorrelations in excess of .5 across one or more lags)
  • I have provided example SPSS code in the utilities area of the LMS for each of these tests
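
The regression route is a convenient way to get the distances out of SPSS; they can also be computed directly from the covariance matrix of the items. The sketch below does that in plain Python, assuming complete numeric rows and a non-singular covariance matrix. The names `solve` and `mahalanobis_sq` and the sample data are illustrative, not taken from the course materials.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    p = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(p):
        # Partial pivoting: bring up the row with the largest entry.
        pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(p):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][p] / m[i][i] for i in range(p)]

def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of each row from the column means."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (divisor n - 1).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # d^2 = c' S^-1 c, computed as c . solve(S, c) for each centered row c.
    return [sum(d * s for d, s in zip(c, solve(cov, c))) for c in centered]

# Hypothetical data: two related items; the last row breaks the pattern.
rows = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [2, 3], [4, 3], [1, 5]]
d2 = mahalanobis_sq(rows)
most_extreme = max(range(len(rows)), key=lambda i: d2[i])
print(most_extreme, round(d2[most_extreme], 2))
```

Neither value in the last row is unusual on its own; the row stands out only because the combination contradicts the relationship between the two items, which is what a multivariate distance is meant to catch.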

[Figure: Mahalanobis]
[Figure: Plot, Sort, and Examine]
[Figure: An ACF Indicating No Pattern]
[Figure: An ACF with a Suspicious Pattern]
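
The autocorrelation check can also be run numerically rather than by eye. In this minimal sketch each respondent's answers are treated as a series in item order, which is what flipping the data achieves. The function names `acf` and `suspicious_lags` and the example series are assumptions. The .5 threshold comes from the slides; applying it to the absolute value, so that strong negative autocorrelations are also flagged, is an added assumption.

```python
def acf(series, max_lag):
    """Sample autocorrelations of a series at lags 1..max_lag."""
    n = len(series)
    m = sum(series) / n
    dev = [x - m for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        # A constant series has no variance, so the ACF is undefined;
        # zeros are returned as a placeholder.
        return [0.0] * max_lag
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
            for k in range(1, max_lag + 1)]

def suspicious_lags(series, max_lag=5, threshold=0.5):
    """Lags whose autocorrelation exceeds the threshold in absolute value."""
    return [k for k, r in enumerate(acf(series, max_lag), start=1)
            if abs(r) > threshold]

# Hypothetical respondents answering 20 items on a 1-5 scale.
cyclic = [1, 2, 3, 4, 5] * 4   # walks up the scale, then repeats
alternating = [1, 5] * 10      # flips between the two endpoints

print(suspicious_lags(cyclic))
print(suspicious_lags(alternating))
```

A respondent who always picks the same option is not caught here, because a constant series has no autocorrelation to measure; the standard deviation check from the bogus response step covers that case.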

Common Missing Data Mitigation Techniques
  • Item imputation
      • For composite scales expressed as the average of a set of items, ignore any missing values that appear on a small subset of the items
  • Mean substitution
      • Suppresses variability
  • Time series imputation
      • Mean of neighboring points; suppresses spikes
  • Regression imputation
      • Works well for highly intercorrelated variables
  • Full information maximum likelihood imputation
      • Available in some SEM programs
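
The three simplest techniques in this list can be sketched in Python as below. This is an illustration under stated assumptions: missing values are represented as `None`, and the function names and the `max_missing` cutoff are invented for the example. Regression and full information maximum likelihood imputation need a statistics package and are not shown.

```python
def scale_score(items, max_missing=1):
    """Average the answered items; None if too many items are missing."""
    answered = [x for x in items if x is not None]
    if not answered or len(items) - len(answered) > max_missing:
        return None
    return sum(answered) / len(answered)

def mean_substitute(column):
    """Replace missing values with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    m = sum(observed) / len(observed)
    return [m if x is None else x for x in column]

def neighbor_mean(series):
    """Replace an interior missing point with the mean of its two neighbors."""
    out = list(series)
    for i in range(1, len(series) - 1):
        before, after = series[i - 1], series[i + 1]
        if series[i] is None and before is not None and after is not None:
            out[i] = (before + after) / 2
    return out

print(scale_score([4, None, 5, 3]))     # one item missing: still scored
print(scale_score([4, None, None, 3]))  # too many missing: no score
print(mean_substitute([2, None, 4]))
print(neighbor_mean([1, None, 5, 9]))
```

Every value filled in by `mean_substitute` sits exactly at the column mean, which is why the technique suppresses variability.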

Excel Tips
  • Your friend the "fill" function
  • The power of "Paste Special"
  • Sorting: Click on Data/Sort

Excel Statistical Formulas
  • =find(<find text>, <within text>, <start>)
      • Looks for the string <find text> within the string <within text> and returns the position of the first occurrence at or after position <start>
      • Example: =find("=", "fish=head", 1) returns 5
  • =Len(<string>)
      • Returns the number of characters in a string
      • Example: =Len("Ouch") returns 4
  • =Right(<string>, <length>)
      • Returns the rightmost <length> characters in the string
      • Example: =Right("fishhead", 4) returns "head"
      • =Left(<string>, <length>) works similarly
  • =average(value, value…)
      • Gives the arithmetic mean of a collection of cells and/or numeric values
  • =stdev(value, value…) // stdevp(value, value…)
      • Gives the sample/population standard deviation of a collection of cells and/or numeric values
  • =sum(value, value…)
      • Gives the sum of a collection of cells and/or numeric values
  • =correl(vector1, vector2)
      • Gives the Pearson correlation between two vectors
  • =if(<test>, <value if true>, <value if false>)
      • Makes a logical test and returns a different value depending on whether the test is true or false
      • Example: =if(1=1, "Yes!", "No…") returns "Yes!"

Summary of Bad Data Problems
  • Multiple submissions: Same participant clicks on Submit, then Back, then Submit, then Back…
  • Unmotivated responding: Participant uses the same option over and over again
  • Malicious patterns: Participant enters some unusually regular pattern of responses
  • There are at least five errors of these kinds in the exercise dataset (see below)
