Intro to Data
Science AI
Loic Merckel
Frankfurt (DE), 01/2025
Tesco: The Pilot, an Early Success of Leveraging Data?
What scares me about this is
that you know more about my
customers after three months
than I know after 30 years.
— Ian MacLaurin (while Tesco
chairman), to Clive Humby after he
presented his findings.
Some Do
Drive
Tremendous
Value
Creation By
Leveraging
Data
"If we have
data,let’s look
at data.If all we
have are
opinions,let’s
go with mine."
— Jim Barksdale (former CEO of
Netscape)
Organizations Have Data,
But Most of It May Be
Dead Weight
Key data waste findings from Veritas’ The UK 2020 Databerg Report Revisited:
– 53% is dark (unclassified) data.
– 28% is ROT (Redundant, Obsolete, Trivial) data.
– Only 19% is business-critical data.
– Organizations spend significant resources storing non-critical data.
Is Having Data
Really Enough?
– McKinsey (2023) states that success remains the exception, not the rule, and that even successful organizational transformations deliver less than their full potential.
– BCG (2020) notes that only 30% of digital transformations meet or exceed their target value and result in sustainable change.
Data alone is not enough to drive results without the right tools and strategy.
Tesco: The Sequel
"I made a mistake on Tesco.
That was a huge mistake by
me."
— Warren Buffett (2014)
– Initial Success: Tesco’s Clubcard set the standard for data-driven retail, making Tesco a global leader.
– Over-reliance on Data: data alone proved insufficient without innovation and adaptability.
– Competitive Pressures: simpler, price-focused rivals like Aldi and Lidl outpaced Tesco.
– Broader Implications: Big Data needs strategic vision and agility to sustain value (e.g., balancing analytics with customer trust).
– Schrage, M. (2014). Tesco’s downfall is a warning to data-driven retailers. Harvard Business Review.
– Warren Buffett on CNBC: https://www.cnbc.com/2014/10/03/warren-buffet-i-made-a-mistake-on-tesco.html
Gaining Insight With
Real-World Data
Adapting to data scientists something I once read on a classmate's T-shirt about software developers:
– The junior thinks it is hard.
– The mid-level thinks it is easy.
– The true senior knows it is hard.
"When the
data and the
anecdotes
disagree,the
anecdotes are
usually right."
— Jeff Bezos
youtu.be/uFUc_5OMB5s?si=NjRe4dnSfyL_OVHc
Explain?
Predict?
Describe?
What are we doing here?
In "What is the Question?" by Leek &
Peng (2015), the authors argue that
misidentifying the type of analysis is
the most frequent cause of flawed
conclusions.
Like Kafka’s K., analysts often don't know their purpose.
The Answer is 42! But
What Was The
Question?
Table copied from Leek, J. T., & Peng, R. D. (2015). What is the question?. Science, 347.
https://doi.org/10.1126/science.aaa6146. The two examples are discussed by Leek & Peng
(2015).
Google Flu Trends' Fiasco: Explanatory → Predictive
– Ambitious goal: predict flu outbreaks from search data faster than CDC reports.
– Failed spectacularly, in large part due to overfitting (50M candidate search terms vs. 1,152 data points); see the sketch below.
– Even simple models using old CDC data performed better.
Cellphones and Brain Cancer: Inferential → Causal
– Case-control studies found an association between phone use and brain cancer (30–200% increased risk), but these studies cannot infer causation due to recall bias and methodological flaws.
– Larger prospective studies found no evidence of a causal link, illustrating the confusion between inferential and causal reasoning.
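This failure mode is easy to reproduce. Below is a minimal, purely illustrative sketch (not the actual Google Flu Trends pipeline, and with made-up dimensions): when there are far more candidate predictors than observations, ordinary least squares can "explain" pure noise in-sample while offering nothing out of sample.

```python
# Illustrative toy, not the actual Google Flu Trends pipeline: with many more
# candidate predictors than observations, a linear model "explains" pure noise
# in-sample while offering no predictive value out of sample.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_obs, n_terms = 100, 5_000            # far more search terms than data points

X_train = rng.normal(size=(n_obs, n_terms))
X_test = rng.normal(size=(n_obs, n_terms))
y_train = rng.normal(size=n_obs)       # the target is pure noise
y_test = rng.normal(size=n_obs)

model = LinearRegression().fit(X_train, y_train)
print(f"in-sample R^2    : {r2_score(y_train, model.predict(X_train)):.2f}")  # ≈ 1.00
print(f"out-of-sample R^2: {r2_score(y_test, model.predict(X_test)):.2f}")    # ≈ 0 or below
```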
Chocolate
& Nobel
Prizes
Predictive modeling often needs only correlation, not causation.
Messerli, F. H. (2012). Chocolate consumption, cognitive function, and Nobel laureates. N Engl J Med,
367(16), 1562-1564.
Model Over-Fitting
"With four
parameters I can
fit an elephant,
and with five I can
make him wiggle
his trunk."
— attributed to John von Neumann
– Dyson, F. (2004). A meeting with Enrico Fermi. Nature 427, 297. https://doi.org/10.1038/427297a
– Mayer et al. (2010). Drawing an elephant with four complex parameters. American Journal of Physics.
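A minimal numpy sketch of the same point (illustrative only, not the Mayer et al. elephant parametrization): a model with as many free parameters as data points passes through any observations exactly, signal or noise alike.

```python
# Five free parameters fit five observations exactly, whether they carry signal or
# are pure noise; a perfect fit says nothing about the model being right.
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 5)
y = rng.normal(size=5)                 # five arbitrary "observations" (pure noise)

coeffs = np.polyfit(x, y, deg=4)       # degree-4 polynomial: five parameters
print(np.allclose(np.polyval(coeffs, x), y))   # True: the curve passes through every point
```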
Machine Learning vs.
Statistical Modeling
Not competitors, but complementary approaches
working together to unlock insights from data
– ML: algorithms that learn patterns from data with the goal of generalizing to unseen examples (with minimal assumptions about the data-generating process).
– SM: mathematical frameworks to understand relationships in data and quantify uncertainty in conclusions (focuses on understanding the data-generating process through explicit assumptions).
Note: The two fields are intricately related; ML workflows routinely rely on statistical methods for assessment and validation.
The Foundation of Insights:
Representative Data in SM
and ML
SM: Rigorous and Controlled (ideally)
– Set the Minimum Detectable Effect (MDE).
– Define the significance level (α) and power (1 − β).
  – α (Type I error rate): the probability of incorrectly rejecting H₀.
  – β (Type II error rate): the probability of failing to reject H₀ when it is false.
  – Power (1 − β): the probability of correctly rejecting H₀ when Hₐ is true.
– Calculate the required sample size (⚠: MDE, α, and power jointly determine n; see the sketch below).
– Conduct systematic data collection.
– Derive conclusions through structured analysis.
ML: Flexible but Data-Hungry
– Often leverages available or opportunistic data.
– Focuses on mitigating overfitting (variance).
– Uses data augmentation or synthetic data to expand datasets.
– Employs robust validation strategies (e.g., train/validation/test splits or cross-validation).
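A sketch of the sample-size step, assuming a two-sample t-test and an MDE expressed as a standardized effect size (Cohen's d); all numbers below are placeholders, and the calculation uses statsmodels.

```python
# Sketch of the sample-size step for a two-sample t-test, assuming the MDE is given
# as a standardized effect size (Cohen's d); all numbers are placeholders.
from statsmodels.stats.power import TTestIndPower

mde = 0.2      # minimum detectable effect, in standard-deviation units
alpha = 0.05   # Type I error rate
power = 0.80   # 1 - beta

n_per_group = TTestIndPower().solve_power(effect_size=mde, alpha=alpha,
                                          power=power, alternative="two-sided")
print(f"Required sample size per group: {n_per_group:.0f}")   # ≈ 394
```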
The ML's Oversize-
and-Regularize
Paradigm
– SM: typically prefers parsimonious models with fewer parameters, emphasizing interpretability and theoretical underpinnings.
– ML: often relies on large, complex models to capture intricate patterns in big datasets, mitigating overfitting through regularization.
– ResNet and GPT architectures are highly oversized but use techniques like dropout, weight decay, and other forms of regularization.
– This approach has proven highly effective for real-world tasks, driving state-of-the-art performance in modern ML (see the sketch below).
Machine Learning: Oversized Power, Like a Wrecking Ball Controlled by a Crane (Regularization) vs. Statistical Modeling: Precision and Parsimony, Like a Hammer
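A toy, tabular stand-in for the oversize-and-regularize idea (nothing like a ResNet or GPT): an over-parameterized linear model with 500 candidate coefficients for 80 observations, unregularized versus regularized, where an L1 penalty chosen by cross-validation stands in for weight decay or dropout.

```python
# Toy "oversize and regularize" comparison (illustrative only): 500 candidate
# coefficients, 80 observations, and only 5 features that truly matter.
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n, p = 80, 500
beta = np.zeros(p)
beta[:5] = 2.0                                    # the only informative coefficients
X, X_new = rng.normal(size=(n, p)), rng.normal(size=(1000, p))
y = X @ beta + rng.normal(size=n)
y_new = X_new @ beta + rng.normal(size=1000)

for name, model in [("unregularized", LinearRegression()),
                    ("L1-regularized (CV)", LassoCV(cv=5, random_state=0))]:
    model.fit(X, y)
    # The unregularized fit memorizes training noise; the regularized fit
    # typically generalizes far better on the held-out data.
    print(f"{name:20s} held-out R^2: {r2_score(y_new, model.predict(X_new)):.2f}")
```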
Sampling Is Crucial for SM
Douglas Hubbard tells a story in which the famous statistician John Tukey is quoted saying:
A random selection of three people would have been better than a group of 300 chosen by Mr. Kinsey.¹
1. Hubbard, D. W. (2010). How to Measure Anything: Finding the Value of "Intangibles" in Business.
Sampling: The Good, The
Somewhat Bad and the
Truly Ugly
Probability Sampling (The Good)
Every member of the population has a known, non-zero chance of
being selected. This method ensures representativeness and
allows for statistical inference. (E.g., random sampling, stratified
sampling, and cluster sampling.)
Non-Probability Sampling (The Somewhat Bad)
Members of the population are selected based on subjective criteria
or convenience, without ensuring representativeness. While faster and
cheaper, it risks bias and limits generalizability. (E.g., convenience
sampling and quota sampling).
Spurious Sampling (The Truly Ugly)
A flawed method where the sample is improperly drawn or defined,
leading to misleading or invalid conclusions. Often results from poor
study design, selection bias, or data contamination.
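A small simulation of why this matters, with entirely made-up quantities: when the trait being measured correlates with how easy people are to reach, a convenience sample is biased while a simple random sample of the same size is not.

```python
# Convenience sample (the easiest-to-reach respondents) vs. a simple random sample
# of the same size, on a synthetic population where spend correlates with reachability.
import numpy as np

rng = np.random.default_rng(7)
N = 100_000
reachability = rng.normal(size=N)
spend = 50 + 10 * reachability + rng.normal(scale=5, size=N)

n = 300
random_sample = rng.choice(spend, size=n, replace=False)
convenience_sample = spend[np.argsort(reachability)[-n:]]   # the 300 easiest to reach

print(f"true mean         : {spend.mean():.1f}")
print(f"random sample     : {random_sample.mean():.1f}")        # close to the truth
print(f"convenience sample: {convenience_sample.mean():.1f}")   # biased upward
```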
Samples
must be
handled
with
care!
Sampling User
Manual: Act One
We want to analyze the number of coffees served during the day at the cafeteria. We have two hypotheses:
– H01: Morning vs. Afternoon (μ_morning = μ_afternoon)
– H02: Morning vs. Evening (μ_morning = μ_evening)
Sampling User Manual:
Act Two
Family-Wise Tests
A family of tests refers to a group of hypotheses that are analyzed together and share some logical or contextual relationship. Tests may be grouped when they:
– Address related aspects of the same research question.
– Are derived from the same experimental framework.
– Share statistical or logical dependencies.
Family-Wise Error Rate (FWER)
– Probability of making at least one Type I error (i.e., falsely rejecting the null) across a family of hypothesis tests.
– FWER = 1 − (1 − α)^m, where m = number of tests.
Family-Wise Error Correction
Corrections reduce the per-test α so that the FWER remains below a desired threshold (e.g., 0.05).
Example Correction Methods
– Bonferroni: α' = α/m.
– Holm-Bonferroni: stepwise approach.
– Benjamini-Hochberg: controls the False Discovery Rate (FDR) instead.
Our Coffee Study
– Tests: Morning vs. Afternoon, Morning vs. Evening.
– These tests collectively assess the hypothesis "morning coffee consumption is higher than at other times of the day."
– Therefore, they form a family of tests (m = 2).
– Without correction: FWER = 1 − (1 − 0.05)² ≈ 0.0975 (9.75%).
– With Bonferroni (α' = 0.05/2): FWER controlled at ≈ 5% (see the sketch below).
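The coffee-study arithmetic can be checked in a few lines; the two p-values below are hypothetical, and the Bonferroni adjustment uses statsmodels.

```python
# The slide's coffee-study arithmetic, plus a Bonferroni adjustment via statsmodels.
# The two p-values are hypothetical.
from statsmodels.stats.multitest import multipletests

alpha, m = 0.05, 2
print(f"FWER without correction: {1 - (1 - alpha) ** m:.4f}")   # ≈ 0.0975

p_values = [0.030, 0.200]   # morning vs. afternoon, morning vs. evening (made up)
reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method="bonferroni")
print("Adjusted p-values:", p_adjusted)   # [0.06, 0.40]: the 0.03 no longer "passes"
print("Reject H0?       :", reject)
```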
An Introduction to AI (Formerly Data Science)
Remember
the Big
Data Buzz?
Big data drives sampling error toward zero but does not reduce other errors associated with inferences drawn from a sample... big data [is] different from little data in that we must be careful not to be fooled by our own estimated precision — Nagler & Tucker (2015, p. 1)¹
1. Nagler, J., & Tucker, J. A. (2015). Drawing Inferences and Testing Theories with Big Data. PS: Political Science & Politics, 48(1), 84–88. doi:10.1017/S1049096514001796
Big Data and the
Illusion of Precision
or Certainty
– Large sample sizes reduce sampling error, creating an illusion of precision (e.g., narrow CIs, p-value ≈ 0).
– Statistical significance can be misleading:
  – Overfitted models may capture noise, not signal.
  – Sample bias undermines validity (e.g., data not collected with a proper sampling methodology or plan).
– Precision in estimates does not guarantee meaningful or accurate results.
I have learned and taught that the primary product of a research inquiry is one or more measures of effect size, not p-values — Cohen (1990, p. 1310)¹
1. Cohen, J. (1990). Things I have learned (so far). In Annual Convention of the American Psychological Association.
Effect Size: Beyond
Statistical Significance
Why Effect Size Matters
– Statistical significance ≠ practical importance:
  – Large sample sizes can detect trivial effects (e.g., p-value ≈ 0 for a tiny mean difference).
  – Effect size reflects the magnitude and practical relevance of a result.
Focus on meaningful impact
– How large is the effect in real-world terms?
– Is the effect relevant to the decision-making process?
Example: A study finds a statistically significant difference in sales (p ≈ 0) between two products in a dataset of 1 million rows (see the simulation sketch below).
– Effect size: average difference = $0.01.
– Interpretation (Haribo candies): for items priced at $1.00, this represents a 1% difference, which could have a huge economic impact given high sales volume (≈60B Goldbears alone per year).
– Interpretation (luxury cars): for items priced at $50,000+, this difference is negligible and irrelevant for decision-making (e.g., BMW sells ≈2.5M cars per year).
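A quick simulation in the spirit of this example (all figures hypothetical): with a million rows per product, a one-cent mean difference yields p ≈ 0, yet the standardized effect size stays tiny; whether it matters is a business judgment, not a statistical one.

```python
# Hypothetical numbers in the spirit of the slide's example: ~1M observations per
# product, a $0.01 average difference around a $1.00 price point.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 1_000_000
a = rng.normal(loc=1.00, scale=0.20, size=n)   # product A sales value per item
b = rng.normal(loc=1.01, scale=0.20, size=n)   # product B: one cent higher on average

t_stat, p_value = stats.ttest_ind(a, b)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd

print(f"p-value   : {p_value:.2e}")   # effectively 0: "significant"
print(f"Cohen's d : {cohens_d:.3f}")  # ≈ 0.05: a tiny standardized effect
```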
Big or Small Data, Pay Attention
to the Tail...
"Many of the features of heavy tailed phenomena
would render our traditional statistical tools useless
at best, dangerous at worst" — Cooke & Nieboer (2011, p. 7).
Challenges with Heavy Tails
– Heavy-tailed data often defies common statistical assumptions.
– Example: income distributions, natural disasters, or stock market returns.
Central Limit Theorem (CLT):
– Assumes finite variance, which heavy-tailed distributions (like Pareto, Cauchy, Weibull, Lévy) may not have.
– Misleading p-values or confidence intervals can result (e.g., the t-test may be unreliable because the CLT may not apply); see the sketch below.
Non-Parametric Methods:
– Bootstrapping struggles with heavy tails (Hall, 1990).
– Extreme values dominate resampling, leading to unstable results.
– Cooke, R. M., & Nieboer, D. (2011). Heavy-tailed distributions: Data, diagnostics, and new developments. Resources for the Future Discussion Paper, (11-19).
– Hall, P. (1990). Asymptotic properties of the bootstrap for heavy-tailed distributions. The Annals of Probability, 1342-1360.
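A quick illustration of the CLT point, assuming a Cauchy-distributed variable (no finite mean or variance): the sample mean never settles down, no matter how much data is collected.

```python
# Cauchy-distributed data: the sample mean of n draws is itself Cauchy, so averaging
# more data never narrows the estimate, contrary to the usual CLT intuition.
import numpy as np

rng = np.random.default_rng(5)
for n in (100, 10_000, 1_000_000):
    means = np.array([rng.standard_cauchy(n).mean() for _ in range(100)])
    lo, hi = np.percentile(means, [2.5, 97.5])
    print(f"n={n:>9,}  95% spread of 100 sample means: [{lo:7.2f}, {hi:7.2f}]")
```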
Misidentifying The Type of Analysis and
Spurious Sampling Are Not The Only Causes
of Troubles
– Isaac, B. (2024). Harvard Business School Investigation Report Recommended Firing Francesca Gino. The Harvard Crimson.
– Cyranoski, D. (2014). Accusations pile up amid Japan’s stem-cell controversy. Nature.
– Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Med.
– Harford, T. (2023). Behind the fraud drama rocking academia. Financial Times. https://on.ft.com/47Z0WMc.
Notable Research
Misconduct Cases
STAP Cell Case (2014)
– RIKEN researcher claimed a groundbreaking stem-cell method in Nature.
– Data manipulation discovered; papers retracted.
– Investigation committee faced its own integrity crisis.
– Supervisor died by suicide; researcher resigned.
Francesca Gino Case (2023)
– Harvard Business School professor studying dishonesty.
– Data fabrication discovered in behavioral-science research.
– Multiple papers retracted.
– Filed a $25M lawsuit against Harvard.
Arbitrary Depiction of Dolos (Δόλος)
[T]he headline-grabbing cases of misconduct and fraud are mere distractions. The state of our science is strong, but it’s plagued by a universal problem: Science is hard — really fucking hard — Christie Aschwanden¹
1. Aschwanden, C. (2015). Science isn’t broken. FiveThirtyEight. https://fivethirtyeight.com/features/science-isnt-broken/
"Listening to Beatles
Music Makes People
Younger"
A Case Study in Data Dredging (Simmons et al., 2011)
The "Study"
This highlights how flexibility in analysis can manufacture
statistical significance, even for implausible outcomes.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data
collection and analysis allows presenting anything as significant. Psychological science, 22(11), 1359-1366.
10 students listened to "When I'm Sixty-Four." And 10 others
to "Kalimba" (Windows OS).
–
Collected many variables including birth dates.
–
Used father's age as a covariate.
–
Result: Participants were "1.5 years younger"! (P = 0.04)
–
Researcher Flexibility Issues
Researchers have too much flexibility ("researcher degrees of
freedom") in collecting and analyzing data.
The authors emphasize that adhering to established hypothesis
testing frameworks and transparent practices is essential to
avoid false positives.
Stopping Rule: Deciding when to stop collecting data.
–
Variable Selection: Choosing which measures to analyze.
–
Condition Selection: Comparing specific groups post hoc.
–
Data Exclusion: Removing or keeping observations arbitrary.
–
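A small simulation in the spirit of Simmons et al. (2011), with made-up measures: there is no true effect anywhere, yet exploiting just two degrees of freedom (several candidate outcomes plus optional stopping) pushes the false-positive rate well above the nominal 5%.

```python
# No true effect anywhere, yet flexible analysis "finds" one far too often.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

def one_flexible_study(n=20, n_outcomes=4):
    g1 = rng.normal(size=(n, n_outcomes))   # several candidate outcome measures,
    g2 = rng.normal(size=(n, n_outcomes))   # none of which truly differs between groups
    pvals = [stats.ttest_ind(g1[:, j], g2[:, j]).pvalue for j in range(n_outcomes)]
    if min(pvals) >= 0.05:                  # optional stopping: collect a few more, re-test
        g1 = np.vstack([g1, rng.normal(size=(10, n_outcomes))])
        g2 = np.vstack([g2, rng.normal(size=(10, n_outcomes))])
        pvals = [stats.ttest_ind(g1[:, j], g2[:, j]).pvalue for j in range(n_outcomes)]
    return min(pvals) < 0.05                # report whichever comparison "worked"

rate = np.mean([one_flexible_study() for _ in range(2000)])
print(f"False-positive rate under flexible analysis: {rate:.0%}")   # well above 5%
```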
The problem is not with the statistical tools themselves, but with researcher behavior and incentives. Simonsohn (2014, p. 9) notes:
The root cause of p-hacking ... lies in a conflict of interest.
Researchers are rewarded for finding certain types of results.
And no! There is no evidence that dishonesty fosters creativity... A striking example: a fraudulent paper claiming benefits from dishonesty.
– Simonsohn, U. (2014). Posterior-hacking: Selective reporting invalidates Bayesian results also. https://dx.doi.org/10.2139/ssrn.2374040.
– Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Med 2(8): e124. https://doi.org/10.1371/journal.pmed.0020124.
The Basic and Applied Social
Psychology Journal Went
Ballistic With Banning "p-
values"
and, instead, "encourage the use of larger sample sizes ...
because as the sample size increases, descriptive statistics
become increasingly stable and sampling error is less of a
problem" (Trafimow & Marks, 2015).
The journal also critiques some Bayesian procedures, reserving
judgment on their use on a case-by-case basis.
– As discussed earlier, sampling errors are reduced with larger sample sizes, but other problems remain.
– Heavy-tailed distributed variables "relating to both natural and social systems are becoming increasingly ubiquitous" (Vogel, 2024, p. 1).
– In those situations, the bold move of banning p-values may fall short of preventing fishy descriptive statistics.
– Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37(1), 1–2.
– Vogel, R. M. et al. (2024). When Heavy Tails Disrupt Statistical Inference. The American Statistician, 1–15.
Posterior-Hacking: A Bayesian
Counterpart to p-Hacking
While p-hacking is widely discussed, Bayesian methods are not immune to misuse. Posterior-hacking
highlights similar challenges in Bayesian analysis (Simonsohn, 2014).
No statistical framework is immune to misuse. Transparency and rigorous checks are critical, regardless of the
approach.
Simonsohn, U. (2014). Posterior-hacking: Selective reporting invalidates Bayesian results also. https://dx.doi.org/10.2139/ssrn.2374040
– Tweaking priors: justified refinement is fine, but arbitrary changes to favor desired posterior results can mislead (see the sketch below).
– Selective reporting of favorable models or estimates.
– Repeated re-analysis with different assumptions to achieve the "right" result.
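A minimal conjugate Beta-Binomial sketch of the first point, with hypothetical A/B-test counts and arbitrarily chosen priors: the same data yields materially different "probability the variant is better" depending on the prior, so reporting only the favorable one misleads.

```python
# Same hypothetical A/B-test data, two priors: the reported posterior probability
# that the rate exceeds 50% depends heavily on which prior one chooses to show.
from scipy import stats

successes, trials = 53, 100                        # hypothetical conversion counts

priors = {"flat Beta(1, 1)": (1, 1),
          "conveniently optimistic Beta(30, 10)": (30, 10)}

for name, (a0, b0) in priors.items():
    posterior = stats.beta(a0 + successes, b0 + trials - successes)
    print(f"{name:38s} P(rate > 0.5 | data) = {posterior.sf(0.5):.2f}")
```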
An Introduction to AI (Formerly Data Science)
Estimating the
Reproducibility of
Psychological Science
Open Science Collaboration. (2015). Estimating the reproducibility of
psychological science. Science, 349(6251).
https://doi.org/10.1126/science.aac4716
– Original studies: 97 out of 100 were deemed significant (with an estimated power of 92%), hence an expected ≈89 significant replications.
– Replication studies: only 37 out of 97 were significant.
Image from: https://doi.org/10.1126/science.aac4716
Image from: Aschwanden, C. (2015). Science isn’t broken. FiveThirtyEight. https://fivethirtyeight.com/features/science-isnt-broken/ — Research paper: Silberzahn R. et al. (2018) at
https://doi.org/10.1177/2515245917747646
The important lesson here is that a single analysis is not sufficient to find a definitive answer — Christie Aschwanden¹
1. Aschwanden, C. (2015). Science isn’t broken. FiveThirtyEight. https://fivethirtyeight.com/features/science-isnt-broken/
Aschwanden, C. (2016).
Failure Is Moving
Science Forward,
The Replication
Crisis Is a Sign That
Science Is Working.
FiveThirtyEight.
fivethirtyeight.com/features/failure-is-moving-
science-forward/
Korbmacher, M. et al. (2023).
The replication
crisis has led to
positive structural,
procedural, and
community
changes.
Commun Psychol 1, 3.
nature.com/articles/s44271-023-00003-2
The Corporate
World Isn't
Shielded
And there is no open crisis to spark positive
changes...
This wisdom, passed down from Jason Zweig's father, captures a cynical but often
accurate observation about incentives in corporate settings. It resonates particularly
when considering how data and analysis are presented in business contexts, where the
pressure to deliver "favorable results" often challenges integrity (Zweig, 2018):
– Lie to people who want to be lied to, and you'll get rich. [E.g., selectively presenting metrics to align with leadership's preferences.]
– Tell the truth to those who want the truth, and you'll make a living. [E.g., offering nuanced insights to leaders who value transparency.]
– Tell the truth to those who want to be lied to, and you'll go broke. [E.g., refusing to manipulate data in a culture where validation trumps truth.]
Zweig, J. (2018). Three Ways to Get Paid. Jason Zweig. Retrieved from https://jasonzweig.com/three-ways-to-get-paid/.
Some Wisdoms
– Twyman's law: a principle in statistics that warns against trusting unusually good results.
– Occam's razor: simpler models or explanations are preferred, but only if they perform equally well.
– Law of Diminishing Returns: more features or data are not always better; focus on quality over quantity.
– Garbage In, Garbage Out (GIGO): the quality of your data directly affects the quality of your insights.
"Every genuine
test of a theory
is an attempt to
falsify it,or to
refute it."
— Popper (1962)
So long, and thanks for all the fish... Or any questions?