TG3: Setting the stage with beginning data analyses
Marianne Huebner, Saskia Le Cessie, Werner Vach, Maria Blettner, Danielle Bodicoat
  
“The initial examination of data is a valuable stage of most statistical investigations, not only for scrutinizing and summarizing data, but also for model formulations.”
-- Chatfield, JRSSA 1985
  
“In practice one has only to look at the literature to see that the methods are still generally undervalued, often neglected, and sometimes actively regarded with disfavor.”
-- Chatfield, JRSSA 1985
  
	
  
It’s a topic of interest:
Workflow for statistical analysis and report writing (viewed 17020 times)
How to efficiently manage a statistical analysis project? (viewed 7159 times)
How do you combine “Revision Control” with “Workflow” for R? (viewed 3074 times)
StackExchange, StackOverflow, CrossValidated, Blogs
  
It takes time
80% of data analysis is spent on the process of cleaning and preparing the data.
Dasu and Johnson 2003
  	
  
	
  
It’s time well spent
Even with best intentions during data collection: data integrity checks find error rates of 2-5% in the “best” datasets
Feedback from practicing statisticians from various institutions
  
	
  
Spreadsheets can be problematic
ID  Sex     Date of Surgery  Height (cm)  Weight (kg)  Diagnosis
1   male    1/1/2011         163          68           1
2   M       15/1/99          167          80           2,1
3   F       2/1/09           166          unknown      2
4   M       2/15/11          172cm        82           2
4           8/19/12                       85           2
5   MALE    March 1, 2013    180          67           2
6   m       3/15/2008        164          62           2 (dx 5/2/11)
7   m       4-1-2013         165 ???      66           1
8   female  April, 2005      166          n.a.         1
9   F       2007-01-25       62           65kg         diabetes
                             Average=166
  
Spreadsheet – corrected
id  sex     datesurgery  height  weight  diagnosis1  diagnosis2
1   male    2011-01-01   163     68      1
2   male    1999-01-15   167     80      2           1
3   female  2009-01-02   166     NA      2
4   male    2011-02-15   172     82      2
4   male    2012-08-19   172     85      2
5   male    2013-03-01   180     67      2
6   male    2008-03-15   164     62      2
7   male    2013-04-01   165     66      1
8   female  2005-04-15   166     NA      1
9   female  2007-01-25   162     65      3
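A minimal sketch in base R of how some of these corrections could be scripted rather than typed by hand. The helper names `clean_sex` and `clean_number` are hypothetical, and only the sex, height and weight columns are handled. Dates are left out on purpose: entries such as 2/1/09 and 2/15/11 mix day-first and month-first conventions and cannot be resolved by code alone. Likewise, the corrected height of 162 for id 9 and the values filled in for the second row of id 4 need a look at the source records, so the sketch only flags them.

```r
# Raw values as typed in the problematic spreadsheet (kept as character strings).
raw <- data.frame(
  id     = c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9),
  sex    = c("male", "M", "F", "M", "", "MALE", "m", "m", "female", "F"),
  height = c("163", "167", "166", "172cm", "", "180", "164", "165 ???", "166", "62"),
  weight = c("68", "80", "unknown", "82", "85", "67", "62", "66", "n.a.", "65kg"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map the different spellings of sex to one coding.
clean_sex <- function(x) {
  first <- substr(tolower(trimws(x)), 1, 1)
  out <- rep(NA_character_, length(x))
  out[first == "m"] <- "male"
  out[first == "f"] <- "female"
  out
}

# Hypothetical helper: keep the leading number, drop units and comments;
# entries without a leading number ("unknown", "n.a.", "") become NA.
clean_number <- function(x) {
  x <- trimws(x)
  has_number <- grepl("^[0-9]+(\\.[0-9]+)?", x)
  out <- rep(NA_real_, length(x))
  out[has_number] <- as.numeric(sub("^([0-9]+(\\.[0-9]+)?).*$", "\\1", x[has_number]))
  out
}

clean <- data.frame(
  id     = raw$id,
  sex    = clean_sex(raw$sex),
  height = clean_number(raw$height),
  weight = clean_number(raw$weight),
  stringsAsFactors = FALSE
)

# Flag values that still need a human decision instead of fixing them silently
# (the 100 cm threshold is an arbitrary choice for this sketch).
clean$check_height <- is.na(clean$height) | clean$height < 100
print(clean)
```

Keeping the raw file untouched and applying the corrections in a script means every change is documented and can be rerun.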
  
Structuring datasets
1.  Each variable forms a column.
2.  Each observation forms a row.
Things go wrong when:
•  column headers are values, not variable names
•  multiple variables are stored in one column
•  variables are stored in both rows and columns
•  a subject is stored in multiple tables
H. Wickham, Tidy Data 2014
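As an illustration of the first problem (column headers that are values), here is a small base R sketch. The table `wide`, its column names and its numbers are invented for this example; the point is only the change of layout from one row per subject to one row per subject and follow-up time.

```r
# Hypothetical "wide" table: the column headers month1/year1 are values of a
# follow-up variable, not variable names.
wide <- data.frame(
  id     = c(1, 2, 3),
  month1 = c(55, 60, NA),
  year1  = c(58, NA, NA)
)

# One row per subject and follow-up time; each variable gets its own column.
long <- data.frame(
  id       = rep(wide$id, times = 2),
  followup = rep(c("month1", "year1"), each = nrow(wide)),
  value    = c(wide$month1, wide$year1),
  stringsAsFactors = FALSE
)
long <- long[order(long$id, long$followup), ]
rownames(long) <- NULL
print(long)
```

In the long layout the missing measurements show up as explicit rows, which makes them easy to count per follow-up time.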
  	
  
	
  
“Despite the amount of time it takes, there has been surprisingly little research on how to clean data well. Part of the challenge is the breadth of activities it encompasses: from outlier checking, to date parsing, to missing value imputation.”
H. Wickham, Tidy Data 2014
  	
  
	
  
Data quality
•  Do the date sequences make sense (birth before surgery)?
•  Are data consistent between variables (date of surgery and date of discharge vs. length of stay)?
•  What is the proportion of missing values for each variable (e.g. echocardiogram: 30% missing at one-month follow-up, 70% missing at one-year follow-up)?
•  What is meant by time frames of follow-up, e.g. “one month”, “one year”?
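The first three questions can be turned into scripted checks. The following base R sketch assumes a hypothetical data frame `dat` with made-up column names and values; it is meant to show the kind of check, not a complete validation routine.

```r
# Hypothetical example data; all column names and values are made up.
dat <- data.frame(
  id        = 1:4,
  birth     = as.Date(c("1950-03-01", "1962-07-15", "2012-01-10", "1948-11-30")),
  surgery   = as.Date(c("2011-01-01", "2011-02-15", "2011-05-02", "2012-08-19")),
  discharge = as.Date(c("2011-01-05", "2011-02-18", "2011-05-09", NA)),
  los       = c(4, 3, 10, NA)
)

# 1. Date sequence: birth must precede surgery.
dat$bad_order <- dat$birth >= dat$surgery

# 2. Consistency: recorded length of stay vs. discharge minus surgery date (days).
dat$los_from_dates <- as.numeric(dat$discharge - dat$surgery)
dat$los_mismatch <- !is.na(dat$los) & !is.na(dat$los_from_dates) &
  dat$los != dat$los_from_dates

# 3. Proportion of missing values for each variable.
prop_missing <- colMeans(is.na(dat[, c("birth", "surgery", "discharge", "los")]))

print(dat[dat$bad_order | dat$los_mismatch, ])
print(round(prop_missing, 2))
```

Rows that fail a check are listed for review rather than changed automatically, since the correct value usually has to come from the source records.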
  
REDCap data checks
•  Field validation (incorrect data type)
•  Field validation (out of range)
•  Outliers for numerical fields
The REDCap Consortium is composed of 1,106 active institutional partners from CTSA, GCRC, RCMI and other institutions in 83 countries.
  	
  
REDCap data summaries
  
Reproducible research
Reinhart, Rogoff: Growth in a time of debt. 2010
Herndon, Ash, Pollin: A critique of Reinhart and Rogoff. 2013
o  Selective exclusion of years/countries
o  Unconventional weighting
o  Coding error (averaging of wrong cells)
o  Averaging a variable with missing data
Image: http://thecolbertreport.cc.com/videos/dcyvro/austerity-s-spreadsheet-error
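To make the last two points concrete, here is a small base R sketch with invented numbers. It does not reproduce the analysis in either paper; it only shows how missing values and the choice of weights change an average.

```r
# Made-up growth rates (percent) for three hypothetical countries; NA = not recorded.
growth <- list(
  A = c(2.1, 1.8, NA, 2.4),
  B = c(-0.5, NA, NA, NA),
  C = c(3.0, 2.7, 3.3, 2.9)
)

# mean() returns NA as soon as one value is missing ...
print(sapply(growth, mean))
# ... while na.rm = TRUE silently averages over whatever is left.
country_means <- sapply(growth, mean, na.rm = TRUE)
n_obs <- sapply(growth, function(x) sum(!is.na(x)))

# Two different summaries: every country counts once,
# or every observed country-year counts once.
mean_of_country_means <- mean(country_means)
mean_of_all_years <- weighted.mean(country_means, w = n_obs)

print(data.frame(n_obs, country_means))
print(c(by_country = mean_of_country_means, by_year = mean_of_all_years))
```

Reporting the number of observations next to each mean makes it visible when a summary rests on a single value.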
  
R markdown: data, code, report
Inference for means (t-interval or t-test)
The airflow rate, FEV1, is the ratio of a person’s forced expiratory volume to the vital capacity, VC (max. volume of air a person can exhale after taking a deep breath). If the enzyme has an effect, it will be to reduce the FEV1/VC ratio. The norm is 0.80 in persons with no lung dysfunction.
ratio <- c(0.61, 0.7, 0.63, 0.76, 0.67, 0.72, 0.64, 0.82, 0.88, 0.82, 0.78,
           0.84, 0.83, 0.82, 0.74, 0.85, 0.73, 0.85, 0.87)
Summary statistics
##  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
## 0.610   0.710  0.780 0.766   0.835 0.880
Are the data symmetric or approximately normal?
[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]
Note that to get a t interval and t test the same function is used. Type
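The sentence above breaks off before naming the function. As a hedged sketch: in base R, `t.test` reports both a t interval and a t test in one call, so it is one plausible candidate. The null value `mu = 0.80` is taken from the stated norm; the two-sided default used here is an assumption (a one-sided test of a reduction would add `alternative = "less"`).

```r
ratio <- c(0.61, 0.7, 0.63, 0.76, 0.67, 0.72, 0.64, 0.82, 0.88, 0.82, 0.78,
           0.84, 0.83, 0.82, 0.74, 0.85, 0.73, 0.85, 0.87)

# Numerical summary and a normal Q-Q plot to judge symmetry / normality.
print(summary(ratio))
qqnorm(ratio)
qqline(ratio)

# One call gives the test statistic, p-value and the 95% confidence interval.
res <- t.test(ratio, mu = 0.80)
print(res)
print(res$conf.int)
```

Because the report is generated from this code, the numbers in the text cannot drift away from the data.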
Report content
•  The statistical report is more extensive than what will be in the manuscript
•  Read in raw data
•  Steps of processing data
•  Numerical data summaries
•  Graphical explorations, e.g. density plots, boxplots, plots over time, plots of association of variables, overlaid density plots from different categories
-> Feedback from you?
  	
  
Reproducible research
•  Data: raw, processed
•  Figures: exploratory, final
•  Code: raw script, final script
•  Text: readme files, documents, markdown/knitr/Sweave file
•  Making data and code available: Markdown, knitr, Sweave, GitHub
  
Baseline characteristics
•  Data summaries for each variable and/or group
  –  Location measures
  –  Small or large variation
  –  Conceptually or statistically motivated groupings
  –  Zero inflated
  –  Missingness
•  Explore missing data
  –  Table with number of missing for each variable
  –  Comparing missing and non-missing cases
  –  Always assume missingness hides a meaningful value for analysis (R. Little, T. Raghunathan)
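The two missing-data tables mentioned above could be produced along these lines. This is a base R sketch on invented data; the variable names (`age`, `sex`, `echo`) are assumptions made for the example.

```r
# Made-up data: age and sex are complete, echo (an echocardiogram measurement) is not.
dat <- data.frame(
  age  = c(54, 61, 47, 70, 66, 58, 73, 49),
  sex  = c("f", "m", "m", "f", "m", "f", "m", "f"),
  echo = c(55, NA, 60, NA, 52, 58, NA, 61),
  stringsAsFactors = FALSE
)

# Table with the number (and proportion) of missing values per variable.
missing_table <- data.frame(
  n_missing    = colSums(is.na(dat)),
  prop_missing = colMeans(is.na(dat))
)
print(missing_table)

# Compare cases with and without an echo value on the observed variables.
echo_missing <- is.na(dat$echo)
age_by_missing <- tapply(dat$age, echo_missing, mean)
print(age_by_missing)
print(table(sex = dat$sex, echo_missing = echo_missing))
```

A clear difference between the two groups, as in this constructed example, is a hint that the values are not missing completely at random.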
  
	
  
Exploring distribution of variables
•  What do we expect the distribution to look like?
•  Do these expectations hold?
•  Check variation and outliers
•  Do a few observations have a large influence?
•  What is to be considered in later analyses?
  
Length of Hospital Stay [days]
•  N=6123 from electronic health records
•  Years 2010-2013
•  Median (1st, 3rd quartile): 4 (3, 6)
•  Range: 0-531 days
•  Largest LOS values: 68, 70, 70, 77, 84, 531
-> Error: 531 days
•  Mean (sd): 5.4 (5.7) (without the 531 LOS)
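A sketch of how such an outlier check might look in base R. Only the six largest values are taken from the slide; the remaining values are invented, so the summaries printed here will not match the figures above. The 365-day threshold is an arbitrary assumption.

```r
# Hypothetical length-of-stay values in days; only the six largest values
# (68, 70, 70, 77, 84, 531) come from the slide, the rest are invented.
los <- c(0, 1, 2, 3, 3, 4, 4, 4, 5, 6, 6, 9, 14, 68, 70, 70, 77, 84, 531)

print(summary(los))
print(tail(sort(los)))   # inspect the largest values before any modelling

# Flag implausible values for review against the source records.
suspect <- los > 365
los_checked <- ifelse(suspect, NA, los)

print(c(mean_with = mean(los), mean_without = mean(los_checked, na.rm = TRUE)))
print(c(sd_with = sd(los), sd_without = sd(los_checked, na.rm = TRUE)))
```

The median and quartiles barely move when the suspect value is set aside, while the mean and standard deviation change noticeably.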
  
Length of Stay [days]
Length of Stay [% cases]
  
Cutoff points?
[Figure 4: Real GDP growth vs. public debt/GDP, country-years, 1946–2009 (close-up). Scatter plot; x-axis: Public Debt/GDP Ratio (0 to 150), y-axis: Real GDP Growth (0 to 7).]
Notes. Figure 4 is a close-up on a region of Figure 3. Real GDP growth is plotted against debt/GDP for all country-years. The locally smoothed regression function is estimated with the general additive model with integrated smoothness estimation using the mgcv package in R. The smoothing parameter is selected with the default cross-validation method. The shaded region indicating the 95 percent
Herndon, Ash, Pollin. A Critique of Reinhart and Rogoff. 2013.
Original groupings were 0-30, 30-60, 60-90, 90+
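How much a grouped summary depends on the cutoffs can be checked directly. The base R sketch below uses invented debt and growth values; the left-closed intervals and the alternative cutoff at 120 are assumptions made for illustration.

```r
# Made-up debt/GDP ratios (percent) and growth rates, for illustration only.
debt   <- c(12, 29.9, 30, 45, 59, 60, 88, 90, 91, 140)
growth <- c(4.1, 3.9, 3.0, 3.2, 2.8, 3.1, 2.9, 3.0, 1.5, 2.2)

# Original groupings: 0-30, 30-60, 60-90, 90+ (left-closed intervals assumed).
group <- cut(debt, breaks = c(0, 30, 60, 90, Inf), right = FALSE,
             labels = c("0-30", "30-60", "60-90", "90+"))
print(table(group))
print(tapply(growth, group, mean))

# Moving one cutoff changes group membership and hence the group means.
group_alt <- cut(debt, breaks = c(0, 30, 60, 120, Inf), right = FALSE,
                 labels = c("0-30", "30-60", "60-120", "120+"))
print(tapply(growth, group_alt, mean))
```

Plotting the continuous variable, as in the figure above, avoids having to choose cutoffs at all.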
  	
  
Categorization of continuous variables
Down’s syndrome or not?
Alpha-fetoprotein (AFP) to detect Down’s syndrome
  
Time-to-event analyses
•  How consistent/reliable is the follow-up?
  –  Were all subjects contacted, or were events only recorded incidentally?
  –  Detailed records for a time period but sporadic after?
  –  All subjects or convenient subjects (e.g. readmissions)?
•  Start time
  –  Time since diagnosis
  –  Time since surgery
  –  Time since entry into study
  
Correlated events
1.  Several measurements per subject
2.  Time-dependent covariates (one endpoint per subject, but a covariate changes over time)
  •  Crossover treatments
  •  Lab tests
3.  Multiple events per subject
  •  Repeated infections
  •  Rehospitalizations
  •  Recurrence of tumors
  
Data set-up for sequential events
Choices in creating the dataset. Which model is being fit?
Therneau and Grambsch: Modeling Survival Data
Therneau and Crowson: Time dependent variables. Vignette (online)
Putter, Fiocco, Geskus: Competing risks and multistate models
id  tstart  tstop  status  event  strata  duration
1   0       221    1       0      1       221
2   0       193    0       1      0       193
2   193     1100   0       1      1       907
2   1100    1130   1       0      1       30
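The table above is in the start/stop (counting process) layout, with one row per interval and subject. A base R sketch of building it and checking its integrity follows; the values are those of the table, while the check that each subject's first interval starts at 0 is an assumption of this example. With the survival package, such intervals are typically passed to a model as `Surv(tstart, tstop, status)`; which indicator column to use depends on the model being fit.

```r
# The counting-process layout from the table above: one row per interval.
ev <- data.frame(
  id     = c(1, 2, 2, 2),
  tstart = c(0, 0, 193, 1100),
  tstop  = c(221, 193, 1100, 1130),
  status = c(1, 0, 0, 1),
  event  = c(0, 1, 1, 0),
  strata = c(1, 0, 1, 1)
)

# duration is the gap time of each interval.
ev$duration <- ev$tstop - ev$tstart

# Integrity checks: positive interval lengths, and within a subject each
# interval starts where the previous one stopped (first interval at time 0).
ok_length <- all(ev$tstop > ev$tstart)
prev_stop <- ave(ev$tstop, ev$id, FUN = function(x) c(0, head(x, -1)))
ok_contiguous <- all(ev$tstart == prev_stop)

print(ev)
print(c(ok_length = ok_length, ok_contiguous = ok_contiguous))
```

Gaps or overlaps between intervals usually point to a data preparation error and are much easier to find here than after model fitting.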
  
Data set-up for unordered events
Choices in creating the dataset. Which model is being fit?
id  tstart  tstop  event type  duration
1   0       221    1           221
2   0       193    2           193
2   193     366    2           173
2   366     1200   1           834
  
Modern Challenges
•  New technologies: complex data, high dimensional data, big data
•  Combining data from various sources: electronic health records, laboratory, pharmacy, operation notes
•  Feeding data summaries to mobile apps
  
Modern Challenges
•  Reading data from various sources: images, web, API (Application Programming Interface, e.g. Twitter, Facebook), GIS
•  Merging data from various sources (SAS, SPSS, R, Minitab, Excel)
MOOC: Getting and Cleaning Data. Coursera, Jeff Leek, Johns Hopkins University
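Once the files are read in, the merge step itself deserves a check. The base R sketch below uses two invented extracts (`ehr` and `lab`) that share a patient identifier; reading the SAS, SPSS or Excel files themselves is not shown.

```r
# Two made-up extracts that share a patient identifier.
ehr <- data.frame(id = c(1, 2, 3, 4), age = c(54, 61, 47, 70))
lab <- data.frame(id = c(2, 3, 3, 5), creatinine = c(0.9, 1.1, 1.3, 0.8))

# Check the key before merging: duplicated ids multiply rows.
dup_ids <- unique(lab$id[duplicated(lab$id)])

# Keep every subject from both sources so that non-matches stay visible.
merged <- merge(ehr, lab, by = "id", all = TRUE)
only_in_ehr <- setdiff(ehr$id, lab$id)
only_in_lab <- setdiff(lab$id, ehr$id)

print(merged)
print(list(duplicated = dup_ids, only_in_ehr = only_in_ehr, only_in_lab = only_in_lab))
```

Comparing the row count before and after the merge is a quick way to notice duplicated keys or dropped subjects.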
  

