Setting the stage with beginning data analyses

TG3: Setting the stage with beginning data analyses

Marianne Huebner, Saskia Le Cessie, Werner Vach, Maria Blettner, Danielle Bodicoat

“The initial examination of data is a valuable stage of most statistical investigations, not only for scrutinizing and summarizing data, but also for model formulations.”

-- Chatfield. JRSSA 1985

“In practice one has only to look at the literature to see that the methods are still generally undervalued, often neglected, and sometimes actively regarded with disfavor.”

-- Chatfield. JRSSA 1985

It’s a topic of interest:

Workflow for statistical analysis and report writing
viewed 17020 times

How to efficiently manage a statistical analysis project?
viewed 7159 times

How do you combine “Revision Control” with “Workflow” for R?
viewed 3074 times

StackExchange, StackOverflow, CrossValidated, Blogs

It takes time

80% of data analysis is spent on the process of cleaning and preparing the data.

Dasu and Johnson 2003

It’s time well spent

Even with best intentions during data collection: data integrity checks find error rates 2-5% in the “best” datasets

Feedback from practicing statisticians from various institutions

Spreadsheets can be problematic

ID  Sex     Date of Surgery  Height (cm)  Weight (kg)  Diagnosis
1   male    1/1/2011         163          68           1
2   M       15/1/99          167          80           2,1
3   F       2/1/09           166          unknown      2
4   M       2/15/11          172cm        82           2
4           8/19/12                       85           2
5   MALE    March 1, 2013    180          67           2
6   m       3/15/2008        164          62           2 (dx 5/2/11)
7   m       4-1-2013         165 ???      66           1
8   female  April, 2005      166          n.a.         1
9   F       2007-01-25       62           65kg         diabetes

Average=166

Spreadsheet – corrected

id  sex     datesurgery  height  weight  diagnosis1  diagnosis2
1   male    2011-01-01   163     68      1
2   male    1999-01-15   167     80      2           1
3   female  2009-01-02   166     NA      2
4   male    2011-02-15   172     82      2
4   male    2012-08-19   172     85      2
5   male    2013-03-01   180     67      2
6   male    2008-03-15   164     62      2
7   male    2013-04-01   165     66      1
8   female  2005-04-15   166     NA      1
9   female  2007-01-25   162     65      3
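Some of the corrections in the table above can be scripted rather than made by hand. The following is a minimal sketch in Python, assuming the raw cells arrive as strings. The function names and the set of missing-value markers are illustrative assumptions, not something the slides prescribe. Dates are deliberately left out: an entry such as 2/1/09 can be read day-first or month-first, and resolving it needs someone who knows the data source.

```python
import re
from typing import Optional

# Assumed spellings seen in the raw sheet; extend for other datasets.
SEX_MAP = {"m": "male", "male": "male", "f": "female", "female": "female"}
MISSING = {"", "unknown", "n.a.", "na", "???"}


def clean_sex(raw: str) -> Optional[str]:
    """Map M / MALE / m / F / female ... to one spelling; None if unrecognized."""
    return SEX_MAP.get(raw.strip().lower())


def clean_number(raw: str) -> Optional[float]:
    """Strip a trailing unit (cm, kg); return None for missing or unparseable cells."""
    text = raw.strip().lower()
    if text in MISSING:
        return None
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(cm|kg)?", text)
    return float(match.group(1)) if match else None


heights = [clean_number(cell) for cell in ["163", "172cm", "165 ???"]]
weights = [clean_number(cell) for cell in ["68", "unknown", "65kg"]]
```

Cells that cannot be parsed, such as "165 ???", come back as None instead of being guessed, so they can be reviewed manually.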

Structuring datasets

1.  Each variable forms a column.
2.  Each observation forms a row.

Things go wrong when:

•  column headers are values, not variable names
•  multiple variables are stored in one column
•  variables are stored in both rows and columns
•  a subject is stored in multiple tables

H. Wickham, Tidy Data 2014
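The "multiple variables in one column" problem can be illustrated with the Diagnosis cell "2,1" from the spreadsheet example. This is a small Python sketch, assuming at most two diagnosis codes per cell; the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def split_diagnosis(raw: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a cell such as "2,1" into (diagnosis1, diagnosis2)."""
    codes: List[Optional[str]] = [p.strip() for p in raw.split(",") if p.strip()]
    if len(codes) > 2:
        raise ValueError(f"more than two diagnosis codes in {raw!r}")
    codes += [None] * (2 - len(codes))  # pad so every row has two columns
    return codes[0], codes[1]


raw_cells = ["1", "2,1", "2"]
tidy = [split_diagnosis(cell) for cell in raw_cells]
```

Each variable now has its own column, and a missing second diagnosis is an explicit None rather than an absent value.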

“Despite the amount of time it takes, there has been surprisingly little research on how to clean data well. Part of the challenge is the breadth of activities it encompasses: from outlier checking, to date parsing, to missing value imputation.”

H. Wickham, Tidy Data 2014

Data quality

•  Do the date sequences make sense (birth before surgery)?
•  Are data consistent between variables? (date of surgery and date of discharge vs length of stay)
•  What is the proportion of missing values for each variable (e.g. Echocardiogram, 30% missing at one month follow-up, 70% missing at one year follow-up)?
•  What is meant by time frames of follow-up, e.g. “one month”, “one year”?
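The first two questions above lend themselves to automated checks. This is a minimal Python sketch; the field names are hypothetical, and it assumes length of stay is counted in days from surgery to discharge, which may differ from how a given hospital system defines it.

```python
from datetime import date
from typing import List


def check_record(birth: date, surgery: date, discharge: date,
                 length_of_stay: int) -> List[str]:
    """Return a list of consistency problems for one record (empty if none)."""
    problems = []
    if surgery < birth:
        problems.append("surgery before birth")
    if discharge < surgery:
        problems.append("discharge before surgery")
    elif (discharge - surgery).days != length_of_stay:
        # Assumption: stay is measured from the surgery date to discharge.
        problems.append("length of stay does not match dates")
    return problems


ok = check_record(date(1950, 6, 1), date(2011, 2, 15), date(2011, 2, 19), 4)
bad = check_record(date(1950, 6, 1), date(2011, 2, 15), date(2011, 2, 19), 40)
```

Returning a list of messages instead of a single boolean makes it easy to tabulate which kinds of problems occur most often.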

REDCap data checks

•  Field validation (incorrect data type)
•  Field validation (out of range)
•  Outliers for numerical fields

The REDCap Consortium is composed of 1,106 active institutional partners from CTSA, GCRC, RCMI and other institutions in 83 countries.

Reproducible research

Reinhart, Rogoff: Growth in a time of debt. 2010

Herndon, Ash, Pollin: A critique of Reinhart and Rogoff. 2013

o  Selective exclusion of years/countries
o  Unconventional weighting
o  Coding error (averaging of wrong cells)
o  Averaging a variable with missing data.

Image: http://thecolbertreport.cc.com/videos/dcyvro/austerity-s-spreadsheet-error
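The last two error types in the list are easy to reproduce. This Python sketch uses made-up numbers, not the Reinhart-Rogoff data. It shows how an average changes when a missing value is silently treated as zero, or when the averaged range leaves out a row.

```python
from statistics import mean

# Made-up growth rates (percent) for five hypothetical countries; None = missing.
growth = [3.0, None, 1.0, -2.0, 2.0]

observed = [g for g in growth if g is not None]

# Missing value excluded explicitly.
mean_observed = mean(observed)
# Missing value silently counted as 0.
mean_zero_filled = mean([0.0 if g is None else g for g in growth])
# Averaging range stops one row early (the kind of slip a spreadsheet allows).
mean_wrong_range = mean(observed[:3])

print(mean_observed, mean_zero_filled, mean_wrong_range)
```

In a script the handling of missing values is written down and reviewable, whereas in a spreadsheet formula it is hidden in a cell range.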

R markdown: data, code, report

Inference for means (t-interval or t-test)
The airflow rate, FEV1, is the ratio of a person’s forced expiratory volume to the vital capacity, VC (max. volume of air a person can exhale after taking a deep breath). If the enzyme has an effect, it will be to reduce the FEV1/VC ratio. The norm is 0.80 in persons with no lung dysfunction.
ratio <- c(0.61, 0.7, 0.63, 0.76, 0.67, 0.72, 0.64, 0.82, 0.88, 0.82, 0.78,
0.84, 0.83, 0.82, 0.74, 0.85, 0.73, 0.85, 0.87)
Summary statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.610 0.710 0.780 0.766 0.835 0.880
Are the data symmetric or approximately normal?
[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles; y-axis: Sample Quantiles]
Note that to get a t interval and t test the same function is used. Type
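The summary statistics in the report excerpt above can be cross-checked outside R. This Python sketch uses only the standard library and the same 19 values. It assumes that the "inclusive" quantile method corresponds to R's default quartile definition. It computes only the one-sample t statistic against the norm of 0.80, because the standard library has no t distribution from which to obtain an interval or p-value.

```python
from math import sqrt
from statistics import mean, median, quantiles, stdev

ratio = [0.61, 0.7, 0.63, 0.76, 0.67, 0.72, 0.64, 0.82, 0.88, 0.82, 0.78,
         0.84, 0.83, 0.82, 0.74, 0.85, 0.73, 0.85, 0.87]

# Quartiles with linear interpolation between order statistics.
q1, _, q3 = quantiles(ratio, n=4, method="inclusive")

summary = {
    "min": min(ratio),
    "q1": q1,
    "median": median(ratio),
    "mean": mean(ratio),
    "q3": q3,
    "max": max(ratio),
}

# One-sample t statistic for H0: mean ratio = 0.80.
t_stat = (mean(ratio) - 0.80) / (stdev(ratio) / sqrt(len(ratio)))

print(summary)
```

A negative t statistic here points in the direction of a reduced ratio; judging its significance requires the t distribution with 18 degrees of freedom.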

Report content

•  Statistical report is more extensive than what will be in the manuscript
•  Read in raw data
•  Steps of processing data
•  Numerical data summaries
•  Graphical explorations, e.g. density plots, boxplots, plots over time, plots of association of variables, overlaid density plots from different categories

-> Feedback from you?

Reproducible research

•  Data: raw, processed
•  Figures: exploratory, final
•  Code: raw script, final script
•  Text: readme files, documents, markdown/knitr/sweave file
•  Making data and code available: Markdown, Knitr, Sweave, GitHub

Baseline characteristics

•  Data summaries for each variable and/or group
    –  Location measures
    –  Small or large variation
    –  Conceptually or statistically motivated groupings
    –  Zero inflated
    –  Missingness
•  Explore missing data
    –  Table with number of missing for each variable
    –  Comparing missing and non-missing cases
    –  Always assume missingness hides a meaningful value for analysis (R. Little, T. Raghunathan)
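A table with the number of missing values per variable takes only a few lines. This Python sketch uses the height and weight columns of the corrected spreadsheet from earlier in the slides, with None standing in for NA; the function name is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]


def missing_table(rows: List[Row], variables: List[str]) -> Dict[str, Tuple[int, float]]:
    """For each variable, return (number missing, proportion missing)."""
    n = len(rows)
    table = {}
    for var in variables:
        n_missing = sum(1 for row in rows if row[var] is None)
        table[var] = (n_missing, n_missing / n)
    return table


heights = [163, 167, 166, 172, 172, 180, 164, 165, 166, 162]
weights = [68, 80, None, 82, 85, 67, 62, 66, None, 65]
rows = [{"height": h, "weight": w} for h, w in zip(heights, weights)]

table = missing_table(rows, ["height", "weight"])
```

The same row-wise structure can then be used to compare cases with and without a missing weight on the other variables.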

Exploring distribution of variables

•  What do we expect the distribution to look like?
•  Do these expectations hold?
•  Check variation and outliers
•  Do a few observations have a large influence?
•  What is to be considered in later analyses?

Length of Hospital Stay [days]

•  N=6123 from electronic health records
•  Years 2010-2013
•  Median (1st, 3rd quartile): 4 (3,6)
•  Range: 0-531 days
•  Largest five LOS: 68, 70, 70, 77, 84, 531
-> Error 531 days
•  Mean (sd): 5.4 (5.7) (without the 531 LOS)
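Listing the largest values and recomputing the mean without an implausible one mirrors the check described above. This Python sketch uses a small made-up sample, not the 6123 records from the slide, and the plausibility threshold of 365 days is an assumption chosen only for illustration.

```python
from heapq import nlargest
from statistics import mean

# Made-up lengths of stay in days; 531 plays the role of the erroneous record.
los = [0, 2, 3, 3, 4, 4, 5, 6, 9, 531]

MAX_PLAUSIBLE_DAYS = 365  # assumed threshold; choose with clinical input

largest = nlargest(3, los)                       # inspect the upper tail first
suspect = [x for x in los if x > MAX_PLAUSIBLE_DAYS]
plausible = [x for x in los if x <= MAX_PLAUSIBLE_DAYS]

mean_all = mean(los)              # pulled upward by the single extreme value
mean_plausible = mean(plausible)  # mean after setting the suspect value aside

print(largest, suspect, mean_all, mean_plausible)
```

Flagged values should be traced back to the source records before they are excluded; the threshold only decides what gets looked at.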

Length of Stay [days]

Length of Stay [% cases]

Cutoff points?

Figure 4: Real GDP growth vs. public debt/GDP, country-years, 1946–2009 (close-up)
[Figure: Real GDP Growth (y-axis, 0 to 7) plotted against Public Debt/GDP Ratio (x-axis, 0 to 150)]
Notes. Figure 4 is a close-up on a region of Figure 3. Real GDP growth is plotted against debt/GDP
for all country-years. The locally smoothed regression function is estimated with the general additive
model with integrated smoothness estimation using the mgcv package in R. The smoothing parameter
is selected with the default cross-validation method. The shaded region indicating the 95 percent
Herndon, Ash, Pollin. A Critique of Reinhart and Rogoff. 2013.

Original groupings were 0-30, 30-60, 60-90, 90+
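Cutting a continuous variable into the four groups above forces a decision about where boundary values go, and the slide does not state one. This Python sketch assumes each boundary belongs to the higher group (a ratio of exactly 30 falls in "30-60"); the function name is hypothetical.

```python
from bisect import bisect_right

CUTOFFS = [30, 60, 90]
LABELS = ["0-30", "30-60", "60-90", "90+"]


def debt_category(debt_to_gdp: float) -> str:
    """Assign a debt/GDP ratio (percent) to one of the four groups."""
    # bisect_right counts the cutoffs that are <= the value,
    # which is exactly the index of the group label.
    return LABELS[bisect_right(CUTOFFS, debt_to_gdp)]


examples = [debt_category(x) for x in (12.5, 30, 89.9, 90, 150)]
```

Country-years at 89.9 and 90 end up in different groups although they are nearly identical, which is one reason the choice of cutoff points deserves scrutiny.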

Categorization of continuous variables

Down’s syndrome or not?
Alpha-fetoprotein (AFP) to detect Down’s syndrome

Time-to-event analyses

•  How consistent/reliable is the follow-up?
    –  All subjects were contacted or only incidental recording of an event?
    –  Detailed records for a time period but sporadic after?
    –  All subjects or convenient subjects (e.g. readmissions)
•  Start time
    –  Time since diagnosis
    –  Time since surgery
    –  Time since entry into study

Correlated events

1.  Several measurements per subject
2.  Time dependent covariates (one endpoint per subject, but a covariate changes over time)
    •  Crossover treatments
    •  Lab tests
3.  Multiple events per subject
    •  Repeated infections
    •  Rehospitalizations
    •  Recurrence of tumors

Data set-up for sequential events

Choices in creating the dataset. Which model is being fit?

Therneau and Grambsch: Modeling Survival Data
Therneau and Crowson: Time dependent variables. Vignette (online)
Putter, Fiocco, Geskus: Competing risks and multistate models

id  tstart  tstop  status  event  strata  duration
1   0       221    1       0      1       221
2   0       193    0       1      0       193
2   193     1100   0       1      1       907
2   1100    1130   1       0      1       30
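The tstart, tstop and duration columns of the table above follow mechanically from each subject's ordered time points. This Python sketch covers only that step. The status, event and strata columns depend on which model is being fit and are not derived here; the function name is hypothetical.

```python
from typing import List, Tuple


def to_intervals(times: List[int]) -> List[Tuple[int, int, int]]:
    """Turn ordered time points into (tstart, tstop, duration) rows from time 0."""
    rows = []
    start = 0
    for stop in times:
        if stop <= start:
            raise ValueError("time points must be strictly increasing")
        rows.append((start, stop, stop - start))
        start = stop  # the next interval begins where this one ended
    return rows


subject1 = to_intervals([221])
subject2 = to_intervals([193, 1100, 1130])
```

Refusing non-increasing time points catches a common data-entry problem before it reaches the survival model.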

Data set-up for unordered events

Choices in creating the dataset. Which model is being fit?

id  tstart  tstop  event type  duration
1   0       221    1           221
2   0       193    2           193
2   193     366    2           173
2   366     1200   1           834

Modern Challenges

•  New technologies: complex data, high dimensional data, big data
•  Combining data from various sources: electronic health records, laboratory, pharmacy, operation notes
•  Feeding data summaries to mobile apps


Modern Challenges

•  Reading data from various sources: Images, web, API (Application Programming Interface, e.g. Twitter, Facebook), GIS
•  Merging data from various sources (SAS, SPSS, R, Minitab, Excel)

MOOC: Getting and Cleaning Data. Coursera, Jeff Leek, Johns Hopkins University
