Data Manipulation and
Data Integrity
Data Manipulation
• Data manipulation is the process in which scientific data are forged, presented in a misleading or unprofessional way, or altered in disregard of academic norms.
• Data manipulation may result in a distorted perception of a
subject, which may lead to false theories being built and
tested.
• An experiment based on data that has been manipulated is
risky and unpredictable.
Consequences of Data Manipulation
• Misleading colleagues
• Impeding progress
• Causing harm to society
• Unpredictable experiments
Statistics as a tool of Data
manipulation
• One of the most common kinds of data manipulation is
misuse of statistics – many article titles on the internet are
based on misuse of statistics, as are some political and
economic arguments.
• Misuse of statistics includes outright data forgery – creating data with no connection to any actual observations.
• The most important kinds of misuse of statistics are those
that involve real data that is presented in a manner that may
be misleading and even dangerous.
Kinds of Data Manipulation and
Reasons behind Them
• Omitting important facts, factors
• Researchers are looking for results (because results mean research grants, etc.), and thus they sometimes deliberately or unintentionally manipulate data to fit their hypothesis.
• When conducting an experiment, researchers may work from an incomplete list of relevant factors.
• For example, a political poll can include the age, income, or religious beliefs of the participants.
• The weak point here is the fact that the researcher may not have included an important factor as
relevant in the study.
• If a study “Computer games – art or not?” was conducted on participants between the
ages of fifty and sixty, then its results would probably be quite different from the results
of the same study conducted on participants between the ages of fifteen and twenty.
• If, in the resulting publication, the age of the participants is not clearly stated, then that is an example
of data manipulation and, specifically, misuse of statistics.
Simpson’s Paradox
• Including another variable in the analysis can change – or even reverse – the apparent result obtained from the aggregated data.
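To make the paradox concrete, here is a minimal illustrative sketch in Python (not part of the original slides) using the well-known kidney-stone figures: within each subgroup one treatment looks better, yet pooling the subgroups reverses the ranking.

```python
# Minimal sketch of Simpson's paradox (illustrative numbers from the classic
# kidney-stone example): Treatment A is better within every subgroup, but
# Treatment B looks better once the subgroups are pooled.

groups = {
    # group: (successes_A, trials_A, successes_B, trials_B)
    "small stones": (81, 87, 234, 270),
    "large stones": (192, 263, 55, 80),
}

total_a = [0, 0]
total_b = [0, 0]
for name, (sa, na, sb, nb) in groups.items():
    print(f"{name}: A = {sa / na:.0%}, B = {sb / nb:.0%}")  # A wins in each group
    total_a[0] += sa; total_a[1] += na
    total_b[0] += sb; total_b[1] += nb

# Aggregated over both groups, the ranking flips and B appears better.
print(f"overall: A = {total_a[0] / total_a[1]:.0%}, B = {total_b[0] / total_b[1]:.0%}")
```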
How can you avoid this?
• Suppose you are planning an extensive study; you may enlist friends to help collect the data.
• Read many research papers and identify most of the factors that have been included by
researchers.
• Read about the domain and identify additional factors.
• As the main researcher, put down all the specifics of the study, all of the questions and any
other relevant information that you would expect for actual data gathering.
• Clarify and train your friends in the process of data collection.
• Your helpers will vary – some have high integrity, some a poor reputation, and some are unknown quantities.
• As a researcher, if you publish data from the study, you should describe the data collection process.
• If the collection process is not described, the data should not be taken seriously.
Can a company be hired to do the
study?
• You may instead employ a company to conduct the study.
• Verify that the company conducting the study is reputable –
• the better its reputation, the higher the chances of a transparent process.
• One can also investigate how the study is actually conducted.
• For big companies, being caught manipulating data would mean the loss of clients.
• Why would a company manipulate data anyway?
• To please the client – to deliver the expected poll results.
• A familiar example is election polling…
Pre-determined Results
• Suppose a large tobacco company commissions research on the probability that smoking the cigarettes it sells causes cancer.
• There is a definite result the company wants: that the probability is no higher than for non-smokers.
• Knowing this, the company conducting the study may be tempted to manipulate data to get results that would please the client.
• A real-life example would be the Volkswagen Scandal of 2015, in which the
Volkswagen Corporation falsified information about the gas emissions of its
cars. This led to the release of cars that polluted forty times more than allowed
by law. The falsification was done using a “defeat device” – smart software that
would turn on emission control when the car was being tested in a laboratory.
Another Reason for Data Manipulation
• Data is manipulated because research is hard to do. Conducting a study with a few thousand participants is a lot of work in many respects.
• Some instead copy data from other research on the same topic.
• This is technically plagiarism rather than data manipulation, but the two often go together.
False causality and illogical sequences
• This kind of falsification is done to deceive those who are not quite familiar with the
subject of the research.
• Situation: Research was conducted on mice of brown colour for multiple generations.
• The published theory: “Every generation of brown mice has more deaths than the
previous one”.
• The reason for this statistic (which is true) is not the stated colour of the mice but the fact that each new generation has more mice than the previous one and therefore has more deaths.
• This is misleading research, as it omits the fact of more births.
• The information and the facts are true.
• However, they do not support the causal claim the work is trying to make.
SAT Score Example
How can this be avoided?
• Do not compare apples and oranges.
• A graph or a statistic usually compares data, and it is important
to understand what kind of data it is and how it connects.
• For example: people buy more lighters – more people get cancer – therefore lighters are bad for you.
• In actuality, people get cancer because they smoke, and they buy lighters because they smoke.
• So, looking at the logical connections that the research makes is
quite important.
Data dredging and fact fitting
• Data dredging is the process in which researchers look through large amounts
of data to find patterns.
• The amounts of data picked for dredging are usually so big that there would
be at least one or two coincidences that can be used to base a theory on them.
• With the use of computers this has become even easier, because a computer can sift far larger amounts of data for apparent patterns.
• This leads to publications that are irrelevant or are based on pure coincidence.
• Fact fitting is the process which, in a sense, is the opposite of fact omitting –
facts are shaped to fit a certain theory.
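As an illustration (not from the slides), the following minimal Python sketch shows why dredging finds “patterns” in pure noise: if one outcome is tested against many unrelated random predictors, a handful will clear any correlation threshold by chance alone. All numbers are arbitrary.

```python
# Minimal sketch of data dredging: every predictor is pure noise, yet some
# will look "correlated" with the outcome purely by chance.
import random
import statistics

random.seed(0)
n, n_predictors, threshold = 100, 200, 0.2  # arbitrary illustrative values

outcome = [random.gauss(0, 1) for _ in range(n)]

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)

hits = 0
for _ in range(n_predictors):
    predictor = [random.gauss(0, 1) for _ in range(n)]
    if abs(corr(outcome, predictor)) > threshold:
        hits += 1

# Several of the 200 noise predictors clear the threshold by coincidence alone.
print(f"{hits} of {n_predictors} random predictors look 'correlated' with the outcome")
```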
Bonferroni's principle
• Assume you are trying to identify people who are cheating in
examinations within a certain population
• You know that the percentage in the population who cheat is 5%.
• Suppose you decide that people who claim to go out with friends more than three times a week are likely to be cheaters.
• If you discover that 20% of the population qualify under this rule, then even in the very best case only one-quarter of the people you flag will actually be cheaters (5% ÷ 20% = 25%).
• Furthermore, if there are any false negatives (cheaters who aren't
identified as cheaters), an even higher percentage of the "cheaters"
identified with the system would be false positives.
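The arithmetic behind the slide’s numbers can be made explicit. The sketch below uses the 5% and 20% figures from above; the number of missed cheaters is hypothetical.

```python
# Worked arithmetic for the slide's numbers: 5% of the population cheat,
# but the rule flags 20% of the population.
population = 10_000            # illustrative population size
cheaters = 0.05 * population   # 500 actual cheaters
flagged = 0.20 * population    # 2,000 people flagged by the "goes out often" rule

# Best case: every cheater happens to be among the flagged group.
best_case_precision = cheaters / flagged
print(f"at best {best_case_precision:.0%} of flagged people are cheaters")  # 25%

# If some cheaters are missed (false negatives), precision drops further.
missed = 100                   # hypothetical number of missed cheaters
precision = (cheaters - missed) / flagged
print(f"with {missed} missed cheaters: {precision:.0%} of flagged people are cheaters")  # 20%
```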
How can this be avoided?
• Check the facts: That is actually important in a lot of cases.
• If a scientific fact is actually a scientific fact, then it probably
comes up in more than one publication.
• Apply the rule of big numbers – if something like that were actually true, would you be the first one to discover it?
• A shocking new discovery? Why has it not been made before?
• Most people don’t get a chance to do a PhD – do not squander yours by publishing questionable articles.
LaCour and scientifically based data
manipulation
• There are unique situations when data is not just manipulated but
manipulated professionally.
• Facts are pushed in different directions by a person who knows exactly what he is doing.
• An easy example of this at work is any political debate.
• Case: “Irregularities in LaCour (2014)” – an exposé by David Broockman, then a graduate student at UC Berkeley, and colleagues.
• Michael LaCour forged enormous amounts of data in a high-profile 2014–2015 study on attitudes toward same-sex marriage, and the case rocked the scientific world.
Summary of LaCour’s research
• LaCour hired a company to conduct research that would prove his theory: people’s views on same-sex marriage can change dramatically after a conversation with someone who has strong feelings about the issue.
• It was a large-scale poll with about ten thousand respondents.
• The research proved LaCour’s theory, which was a new and
unique result.
• All previous results in similar works had shown that people
hardly change their political and social views.
What did Broockman do?
• Broockman was greatly impressed by LaCour’s work and wanted to conduct similar experiments.
• However, he found out the following:
• First hint: A hired company could not have conducted such research for a graduate
student’s budget
• He did not go public with this immediately, because it is easy to acquire the reputation of someone who does no work of his own and merely tries to ruin the work of others.
• Many of the scientists and researchers Broockman talked to told him not to publish such material.
• Later, Broockman, with another student, noticed some specific irregularities (the
politically correct term for “mistakes and falsifications”) in the data used.
• The data did not look random enough.
• Later, they found a database that LaCour copied, which was the last argument
needed to publish their report.
Results
• LaCour lost his newly obtained position at Princeton and his reputation – it will now be very hard for him to return to the world of science.
• Broockman made the headlines and spoke a lot about debunking
and academic integrity.
• While some initially argued about the competence of LaCour’s research, it has since been revealed that he never, in fact, hired any poll-conducting company, that he forged a letter from the company, and that he lied in later interviews.
• After all of this, there is no longer any reason to frame the issue as one of “competence”.
Lessons to be learned
• LaCour wanted a result that would make him a first-class researcher
and succeeded – until the exposure
• he had time to become quite famous.
• To obtain such a result he plagiarized data AND manipulated it to fit
his theory
• thus, he committed a set of violations of academic integrity.
• He was caught and is now a great example of how debunking works.
• The possibility of such exposure is one of the main protections of the
scientific world from academic dishonesty and thus should be
advocated.
Education on Data Manipulation
• While a lot is being done to expose and debunk data
manipulation, it is a subject that is not a part of popular culture.
• Debunking does not always involve high-class science – Broockman simply started checking how LaCour did his research.
• Publication of false data may cause real harm, as in the Volkswagen example or in the medical field.
• LaCour’s data nearly triggered reform of many systems and structures concerned with political and social views because of its “original” content.
Image Manipulation
• In research, "image manipulation" is :
• Altering or modifying a digital image using software
• With the purpose of enhancing certain features, adjusting colours, or even
completely fabricating the image.
• Unethical if it
• It leads to misrepresented data.
• changes the interpretation of the results presented in the research article
• Minor adjustments like brightness and contrast are often
acceptable.
• Researchers must clearly disclose any image manipulations made in
their methods section, explaining the rationale behind the changes.
Image Manipulation: Image by Mystic
Art Design from Pixabay
Data Integrity
The Digital Age
• The amount of data created,
captured, copied, and
consumed globally has been
growing rapidly:
• 2020: 64.2 zettabytes
• 2021: 79 zettabytes
• 2022: 97 zettabytes
• 2023: 120 zettabytes
• 2024: Projected to be 147
zettabytes
[Chart: Growth of Data Generation, 2020–2024, in zettabytes]
Units of Data Measure
• Bit: binary digit
• Byte: eight bits; one ASCII character
• Kilobyte: 1,000 bytes
• Megabyte: 1,000 kilobytes – roughly a large book
• Gigabyte: 1,000 megabytes – a typical hard disk is around 500 gigabytes
• Terabyte: 1,000 gigabytes – a huge library
• Petabyte: 1,000 terabytes – a research journal archive may run to about 5 petabytes
• Exabyte: 1,000 petabytes
• Zettabyte (ZB), Yottabyte (YB), Brontobyte (BB), Geopbyte (GPB)…
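As a small illustration of the 1,000-based prefixes above, here is a hedged helper (not from the slides) that formats raw byte counts into these units.

```python
# Illustrative helper using the decimal (1,000-based) prefixes from the slide.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(n_bytes: float) -> str:
    """Format a byte count using 1,000-based units."""
    for unit in UNITS:
        if n_bytes < 1000 or unit == UNITS[-1]:
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000

print(human_readable(64.2e21))   # the 2020 figure from the earlier slide: 64.2 ZB
print(human_readable(5e15))      # the "5 petabytes" example: 5.0 PB
```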
Data Integrity
• Research Data: Data used in scientific, engineering, and medical research
as inputs to generate research conclusions.
• Metadata: refers to descriptions of the content, context, and structure of
information objects.
• Data Integrity:
• An uncompromising adherence to ethical values, strict honesty, and absolute
avoidance of deception.
• The state of being whole and complete
• High integrity means having the confidence that the data are complete, verified,
and remain unaltered.
• Data integrity can be defined as “the state of data (valid or invalid) and/or the process of ensuring and preserving the validity and accuracy of data”.
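One minimal, common way to support the claim that data “remain unaltered” is to record a cryptographic checksum when a dataset is frozen and verify it again later. A hedged sketch follows; the file name is hypothetical.

```python
# Minimal sketch: record a SHA-256 checksum when a dataset is frozen, and
# verify later that the file has not been altered. The file name is hypothetical.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

data_file = Path("survey_responses.csv")   # hypothetical raw data file
recorded = sha256_of(data_file)            # store this alongside the data / in metadata

# ... later, before analysis or sharing, re-check the checksum ...
if sha256_of(data_file) != recorded:
    raise RuntimeError("survey_responses.csv has changed since it was frozen")
```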
Why Data Integrity?
• Key to empirical scientific research
• Leads to consistent decision making
• leads to trustworthy findings
• leads to the creation of correct knowledge
• Research data integrity is desirable from the inception of a
research project through the dissemination of findings and the
subsequent sharing of data.
Data Integrity for us
• You may be asked to describe your methods and tools for collecting
data so that the data can be checked and verified.
• You may also record the process or algorithms of pre-processing of
data, if any.
• Your analysis results should be verifiable.
• You may ensure…
• Scientific Rigor: applying scientific methods to ensure unbiased and well-controlled experimental design, methodology, analysis, interpretation, and reporting of results.
• Computational Reproducibility: obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis.
• Replicability of Results: obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.
• Reuse: using research data for a research activity or purpose other than that for which it was originally intended.
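As a sketch of what supporting computational reproducibility can look like in practice (not prescribed by the slides), one can fix the random seed and archive the inputs, seed, and environment alongside the result. The “analysis” below is only a placeholder.

```python
# Minimal sketch of supporting computational reproducibility: fix the random seed,
# and record the inputs, seed, and environment next to the result.
import hashlib
import json
import platform
import random
import sys

random.seed(42)  # fixed seed so the same code and inputs give the same numbers

input_data = [1.2, 3.4, 5.6, 7.8]    # stand-in for the real input data
result = sum(random.choice(input_data) for _ in range(1000)) / 1000  # placeholder analysis

record = {
    "input_sha256": hashlib.sha256(json.dumps(input_data).encode()).hexdigest(),
    "seed": 42,
    "python_version": sys.version,
    "platform": platform.platform(),
    "result": result,
}
print(json.dumps(record, indent=2))  # archive this alongside the code and data
```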
Integrity Policy of Publishers
• Data and methods access
• Does the journal require that all data be made available on request to journal editors and reviewers? Yes?
• Does the journal require the deposition of data in a public repository? Yes/No
• Are authors required to provide algorithms or computer programs used in the collection, report, or
analysis of data? No?
• Image manipulation
• Is image manipulation prohibited? No
• Does the journal require that image manipulation be reported? Yes
• Does the journal require that digital techniques be applied to the entire image? Yes
• Does the journal use software tests to detect image manipulation? Yes
• Ethics and Scientific Misconduct
• Is there a specified ethical statement? Yes
• Does the journal have a scientific misconduct investigation or reporting policy in place? Yes
Integrity: The Individual responsibility and
Collective Scrutiny of Research Data and Results
• Individual Responsibility is to ensure that the data are complete,
verified, and undistorted
• Collective responsibility is to ensure the data integrity of the submitted
research data and results derived from those data.
• When others can examine the steps used to generate data and the conclusions
drawn from those data, they can judge the validity of the data and results.
• Collective scrutiny of research results cannot guarantee that results will be free of error or bias.
• It does, however, bring multiple perspectives that help minimise error and bias.
Collective Scrutiny
• Data Producers’ Role:
• Make data available to others so that the data’s quality can be judged.
• Data Providers’ Role:
• Make data widely available in a form such that the data can be not only used but evaluated, which requires the availability of metadata.
• Data Users’ or Researchers’ Role:
• Perform critical evaluation of the data generated by themselves and by others.
Define the Data Processes
• You should report the following about data collection
• State the tools, techniques and procedures used to collect data
• Record anything that was done to the data thereafter
• Clearly state the models, code, and input data used
• For example, a community may decide that double-blind trials,
independent verification, or particular instrumental calibrations are
necessary for a body of data to be accepted as having high quality.
• Scientific methods include both a core of widely accepted methods and a
periphery of methods that are less widely accepted.
• Data integrity involves scrutiny of the methods used to derive those data.
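One lightweight way to “record anything that was done to the data thereafter” is to keep a machine-readable provenance log as the pipeline runs. The sketch below is illustrative; step names and parameters are hypothetical.

```python
# Minimal sketch of recording what was done to the data after collection.
# Step names and parameters below are hypothetical.
import datetime
import json

processing_log = []

def log_step(name: str, **params) -> None:
    """Append a timestamped description of a processing step to the provenance log."""
    processing_log.append({
        "step": name,
        "params": params,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

raw = [12.1, None, 13.4, 250.0, 12.9]

cleaned = [x for x in raw if x is not None]
log_step("drop_missing", removed=len(raw) - len(cleaned))

cleaned = [x for x in cleaned if x < 100.0]
log_step("remove_outliers", rule="value < 100.0", removed=1)

# Publish or archive the log together with the data so others can audit the steps.
print(json.dumps(processing_log, indent=2))
```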
PEER REVIEW AND INTEGRITY OF DATA
• Peer review of articles submitted to a scholarly journal for publication is the most important process for ensuring data integrity.
• It screens for quality and relevance and helps to ensure that professional standards are followed in data collection and analysis.
• It provides a forum in which the collective standards of a field can be enforced.
• Examines whether research questions have been framed and
addressed properly
• Examines whether findings are original and significant
• Examines whether a paper is clearly written and acknowledges
previous work.
Peer Review in Digital Age
• Digital technologies have put pressure on the peer review system.
• The volume or diversity of research data supporting a conclusion may overwhelm a reviewer’s ability to evaluate the link between the data and that conclusion. As supporting information for a finding increasingly moves to lengthy supplemental materials, reviewers may be less able to judge the merits of a paper.
• Difficult to find peer reviewers who are competent and have the time to judge
complex interdisciplinary manuscripts.
• Peer review cannot ensure that all research data are technically accurate,
though inaccuracies in data can become apparent either in review or as
researchers seek to extend or build on data.
Trust in Research
• The research system is based to a large degree on trust.
• Following the standards is a crucial factor in building trust.
• Build and maintain trust.
Breach of Trust
• In 1998, a series of remarkable papers attracted great attention within the condensed
matter physics community.
• The papers, based largely on work done at Bell Laboratories, described methods that could
create carbon-based materials with superconductivity using molecular-level switching.
• However, when other materials scientists sought to reproduce or extend the results, they
were unsuccessful.
• In 2001, several physicists inside and outside Bell Laboratories began to notice anomalies
among the papers.
• Several contained figures that were very similar, even though they described different
experimental systems.
• Some graphs seemed too smooth to describe real-life systems.
• The person who had helped create the materials, made the physical measurements on them, and was a co-author on all the papers was questioned.
• A committee was formed, which detected fabrication in 16 of 25 works published.
Using Digital Technologies and Data
Integrity
• Digital technologies can pose risks to data integrity, but they also offer ways to
improve the reliability of research data.
• They enable researchers to build checking and verification procedures into
research protocols in ways that reduce the potential for error and bias.
• Automated data collection that is quality-controlled can be much more accurate
when either substituting for or supplementing human observations.
• An example is the use of digital technologies in clinical research, including the
conduct of clinical trials and plans to link clinical trial information with individuals’
electronic health records.
• Will digitizing individuals’ electronic health records compromise their security and privacy?
• Will inappropriate usage be properly restricted?
• Will companies be able to acquire and share these data?
• Merging of two datasets might make it possible to identify patients who have been “de-
identified” in each.
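The last bullet can be made concrete with a toy example (entirely made-up records): two datasets that each look de-identified may still share quasi-identifiers, and joining on them can single out an individual.

```python
# Toy illustration (made-up records): each dataset is "de-identified" on its own,
# but joining on shared quasi-identifiers can narrow a record down to one person.
health_records = [
    {"birth_year": 1980, "postcode": "2000", "diagnosis": "diabetes"},
    {"birth_year": 1992, "postcode": "2913", "diagnosis": "asthma"},
]
voter_roll = [
    {"name": "A. Citizen", "birth_year": 1992, "postcode": "2913"},
    {"name": "B. Resident", "birth_year": 1980, "postcode": "2600"},
]

for h in health_records:
    matches = [v for v in voter_roll
               if v["birth_year"] == h["birth_year"] and v["postcode"] == h["postcode"]]
    if len(matches) == 1:
        # A unique match re-identifies the "anonymous" health record.
        print(f"{matches[0]['name']} can be linked to diagnosis '{h['diagnosis']}'")
```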
AI and data integrity
• Train people working with artificial intelligence (AI) to support
data integrity assurance in AI applications.
• AI practices should be aligned with societal values and ethical
norms.
• Key challenges and opportunities
• Artificial intelligence (AI) significantly enhances data integrity by
reducing human error and increasing efficiency in data processing.
• With its ability to efficiently process and analyse large datasets, AI has facilitated “breakthroughs in fields such as predictive analytics, personalised medicine, and autonomous systems”.
Caution
• However, when using AI systems, data integrity concerns,
including data accuracy, quality, privacy, and security, arise.
• The integrity of AI decisions is directly linked to the integrity
of the data it processes.
• Data manipulation, whether intentional or due to inherent biases in algorithms, poses serious questions about the reliability and fairness of AI-driven decision making.
AI Implications for data integrity
• AI systems are only as good as the data they are fed and how they are
programmed;
• There is a concern that if the input data is flawed or biased, AI will amplify
these issues.
• There is a need for transparency in AI algorithms to ensure data integrity.
• Policymakers should focus on developing and refining
comprehensive, adaptable regulatory frameworks for AI that
emphasise privacy, transparency, and accountability
• Institutions and organisations should invest in continuous ethical training and awareness programs for AI practitioners. This would enable them to recognise and address the ethical implications of their work, thereby ensuring data integrity and fairness in AI applications.
Data Integrity Principle [1]
• “Ensuring the integrity of research data is essential for advancing
scientific, engineering, and medical knowledge and for maintaining
public trust in the research.”
• “Researchers are ultimately responsible for ensuring the integrity of
research data.”
References
• Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the
Digital Age by Committee on Ensuring the Utility and Integrity of Research Data in
a Digital Age; National Academy of Sciences
• Oladoyinbo, Tunboson Oyewale and Olabanji, Samuel Oladiipo and Olaniyi,
Oluwaseun Oladeji and Adebiyi, Olubukola Omolara and Okunleye, Olalekan J. and
Ismaila Alao, Adegbenga, Exploring the Challenges of Artificial Intelligence in Data
Integrity and its Influence on Social Dynamics (January 13, 2024). Asian Journal of
Advanced Research and Reports, Volume 18, Issue 2, Page 1-23, 2024, Available at
SSRN: https://ssrn.com/abstract=4693987
• Condon, P., Simpson, J., and Emanuel, M. (2022) Research data integrity: A cornerstone of rigorous and reproducible research, IASSIST Quarterly 46(3), pp. 1-21. DOI: https://doi.org/10.29173/iq1033
• http://ds-wordpress.haverford.edu/psych2015/projects/chapter/plagiarism-and-data-manipulation/
Questions