everything-biostatistics-(biostats)-pdf.
Biostatistician
- Chengcheng Hu
Epidemiologist
- Janet Foote
M.S.: Paul Kang
GRA: David Margraf / Shafquat “Shaf” Saif
Additional Faculty from the Department of
Biostatistics & Epidemiology as needed.
Finding more information online: Research
On the Research page, select Resources
Under Resources, select
Biostatistics & Study Design Services
Biostatistics & Study Design Services page
http://phoenixmed.arizona.edu/research/resources/biostatistics-services
Separate Intake Form link for Faculty & Students
• As you are starting to design your study protocol
• Prior to submitting proposals for projects, routing
• Before you pilot your survey / start data collection
• As you are preparing to analyze data
Many roads…..
We are here to help!
• What is your overall research question?
• What are your specific aims to answer your
question?
• Who is your target population?
• What approach do you think you’ll be using (e.g.,
measurements, surveys, observations/chart
reviews)?
• How much difference / change / variation is
important?
• When do you need this?
Research Question
In science, we take an organized,
methodical approach to examine
a question.
The first thing we need to do
for research is ask a question.
Once we have a question, we need to complete
background research. By reading what has
been published about a topic, we often find
out new information and sometimes change
our question a bit because we are better
informed about the topic.
The next important step in the scientific
method is constructing a hypothesis.
A hypothesis is an ‘educated idea’ about
how things will work.
There is a set way to state the hypothesis:
If _________ (I do this), then _________ (this) will happen.
The hypothesis should be measurable
and doable.
Ex: If I test the blood sugar of 100 adults, more than 20% will be at risk* for
diabetes.
If I add methotrexate to HL1 cancer cells, the cells will stop multiplying,
so the cancer cannot progress.
Independent variable Dependent variable
* Risk is defined as…….
Testing a hypothesis is not a “one and done” procedure.
In science, results must be shown to be repeatable and
consistent. Statistics helps us understand the odds that
the results we see are ‘real’ based on our study design.
• Help understand the odds that results are real
• Dependent on type and characteristics of data
• Cannot fix design / data / recruitment problems
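As an illustration of how statistics weighs a result against chance, the blood-sugar hypothesis above ("more than 20% will be at risk") could be checked with a one-sided exact binomial test. This is only a minimal sketch: the observed count of 29 at-risk adults and the function name `binomial_upper_tail` are hypothetical, and the definition of "at risk" is assumed to be fixed before testing.

```python
from math import comb

def binomial_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0): the one-sided exact p-value."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

# Hypothetical result: 29 of 100 adults meet the at-risk definition.
# Null hypothesis: the true proportion at risk is 20%.
p_value = binomial_upper_tail(29, 100, 0.20)
print(f"one-sided p-value: {p_value:.4f}")
```

A small p-value says the observed count would be unusual if the true proportion were 20%; it does not repair a biased sample or a poorly defined outcome.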
Characteristics of data
• Categorical
• Quantity
• Nominal
• Ordinal
• Binary
• Discrete and continuous data
• Interval and ratio variables
• Qualitative
• Quantitative
Categorical Data
• The objects being studied are grouped
into categories.
• Categories are usually based on a
qualitative trait.
• These data are merely labels or
categories.
• May or may not have any underlying
order.
Examples:
• Type of Bicycle
– Mountain bike, road bike, chopper, folding, BMX.
• Ethnicity
– Asian, Pacific Islander, African American, Caucasian, Latino,
Native American (note problems with these categories).
• Smoking status
– smoker, non-smoker, former smoker
Nominal Data
Categorical data in which objects fall into
unordered categories.
Ordinal Data
• Categorical data in which order is important.
• Highest Education level – elementary, high school,
college graduate
• Degree of illness – none, mild, moderate, severe
• Opinion of students about stats classes-
Very unhappy, unhappy, neutral, happy, ecstatic!
Binary Data
• Special type of categorical data in which there are
only two categories.
• Binary data can either be nominal or ordinal.
• Current smoking status: smoker, non-smoker
• Attendance: present, absent
• Class mark: pass, fail.
• Status of student: undergraduate, postgraduate.
Categorical data classified as
Nominal, Ordinal, and/or Binary
[Diagram: categorical data divide into nominal data and ordinal data; each of these can be binary or not binary]
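One way to see the distinction in practice is in how each kind of categorical variable might be represented in code. The sketch below is illustrative only; the class name `Education`, the set `BICYCLE_TYPES`, and the helper `is_binary` are hypothetical names chosen for this example.

```python
from enum import IntEnum

class Education(IntEnum):
    """Ordinal: the integer values encode the order of the levels."""
    ELEMENTARY = 1
    HIGH_SCHOOL = 2
    COLLEGE_GRADUATE = 3

# Nominal: labels only, so an unordered set is enough.
BICYCLE_TYPES = {"mountain", "road", "chopper", "folding", "BMX"}

def is_binary(categories):
    """A categorical variable is binary when it has exactly two categories."""
    return len(set(categories)) == 2

levels = [Education.COLLEGE_GRADUATE, Education.ELEMENTARY, Education.HIGH_SCHOOL]
print(sorted(levels))                       # sorting is meaningful for ordinal data
print(is_binary(["smoker", "non-smoker"]))  # two categories
print(is_binary(BICYCLE_TYPES))             # five categories
```

Sorting the education levels makes sense because the order carries meaning; sorting bicycle types would only be alphabetical.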
Quantity Data
• Whatever is under study is being ‘measured’
based on some quantitative trait.
• Data are a set of numbers.
Examples:
• Pulse rate
• Height
• Age
• Exam marks
• Size of bicycle frame
• Time to complete a statistics test
• Number of cigarettes smoked
Quantity data can be classified as
Discrete or Continuous
[Diagram: quantity data divide into discrete and continuous]
Discrete Data
Only certain values are possible (there are gaps
between the possible values). Implies counting.
Continuous Data
Theoretically, with a fine enough measuring
device, no gaps.
Discrete Data
• Number of children in a family
• Number of students passing a stats exam
• Number of crimes reported to the police
• Number of bicycles sold in a day.
Generally, discrete data are counts.
We would not expect to find 2.2 children in a family or 88.5
students passing an exam or 127.2 crimes being reported to
the police or half a bicycle being sold in one day.
Continuous data
• Size of bicycle frame
• Height
• Time to run 500 metres
• Age
‘Generally, continuous data come from
measurements (any value within an interval is possible with a fine
enough measuring device)’ (Rowntree 2000).
Discrete data: gaps between possible values (count)
[Number line: separate points at 0 1 2 3 4 5 6 7]
Continuous data: theoretically, no gaps between possible values (measure)
[Number line: unbroken from 0 to 1000]
The type of data collected in a study
determines the type of statistical analysis
used.
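A simple way to picture this is a function that picks a descriptive summary according to the data type. This is a minimal sketch, not a complete analysis plan; the function name `summarize` and the type labels are assumptions made for illustration, and ordinal values are assumed to be stored as integer ranks.

```python
from statistics import mean, median, mode

def summarize(values, data_type):
    """Pick a summary that is meaningful for the given data type."""
    if data_type == "nominal":
        # Labels have no order, so only the most frequent category makes sense.
        return {"mode": mode(values)}
    if data_type == "ordinal":
        # Assumes integer ranks (e.g. 1=mild, 2=moderate); order is meaningful.
        return {"median": median(values), "mode": mode(values)}
    if data_type in ("discrete", "continuous"):
        return {"mean": mean(values), "median": median(values)}
    raise ValueError(f"unknown data type: {data_type}")

smoking = summarize(["smoker", "non-smoker", "non-smoker"], "nominal")
heights = summarize([160.0, 170.0, 180.0], "continuous")
print(smoking)
print(heights)
```

Taking the mean of a nominal variable coded 1, 2, 3 would run without error but mean nothing, which is why the type has to be decided before the analysis.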
A database is a method of organizing
and analyzing information.
• Organize & analyze information in different ways
– Sorting
– Grouping
– Querying
– Reporting
– Exporting for statistical analysis
• Computerized database
– Speed
– Quality control
– Precision
– Automate repetitive tasks
•Excel has some limited capabilities to sort data but its
primary function is to create financial spreadsheets
– Can create “what if” scenarios to determine financial
consequences
– Can be used for small /limited research data sets & simple lists
– Not multi-user such that only one person can work on the file at
a time
•Databases: designed to collect, sort, & manipulate data
– Databases can process large amounts of data; usually limited by
hardware constraints
– Structure is in the same format for each member record of a table
– Data quality control features ensure that valid data is entered
– A relational database allows for linking of an unlimited number of
tables
– Databases are multi-user because the data can reside on a server and
multiple people can have access at the same time
– Many databases offer web interfaces thereby eliminating the need for
each user to have a copy of the program on their computer
• Many databases offer audit functions required by certain
regulatory agencies
– Tracks date record created and modified
– Tracks original and changed values
– Requires user to give reason for the change
• Databases are more suitable for importing data from multiple
sources
– More robust in connecting to different data sources
– Imports of different data types into different tables can
be linked via common identifiers such as subject ID
– Merging multiple data sources into Excel so that the rows
line up properly in a flat file format can be a challenge
• Qualtrics™ – SurveyMonkey on steroids
• REDCap™ – Clinical database
• One or more tables
• Tables store records
– Patient identifiers
– Demographics and history
– Test results
– Etc.
• A record is a collection of fields
– Patient identifiers: name, DOB, address, … are stored in
separate fields
[Figure: Records and Fields]
How is data displayed?
• Fields are displayed on layouts
– Forms
– Web
– Reports
• Data can be from a single table or many tables if using a
relational database
Relational Database Example

Subject Info
ID   Name    Age
10   Smith   50
11   Jones   55
12   Doe     60

Anthropometrics
ID   Weight (lb)   Weight (kg)
10   230           104.5
11   212           96.4
12   199           90.4

Physical Activity
ID   KCAL   KCAL/kg
10   2400   23.1
11   2652   27.5
12   2350   25.9

Treadmill Performance
ID   VO2   VO2/kg
10   2.8   26.7
11   3.2   33.1
12   2.1   23.2
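The linking of tables by a shared ID can be sketched with Python's built-in `sqlite3` module. This is an illustration under assumptions, not the database used in the slides: the table and column names (`subject_info`, `anthropometrics`, `weight_kg`) are hypothetical, and only the Subject Info and Anthropometrics values from the example above are loaded.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.executescript("""
CREATE TABLE subject_info (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE anthropometrics (id INTEGER REFERENCES subject_info(id),
                              weight_lb REAL, weight_kg REAL);
INSERT INTO subject_info VALUES (10,'Smith',50),(11,'Jones',55),(12,'Doe',60);
INSERT INTO anthropometrics VALUES (10,230,104.5),(11,212,96.4),(12,199,90.4);
""")

# The common ID column is what lets the two tables be combined.
rows = con.execute("""
    SELECT s.id, s.name, s.age, a.weight_kg
    FROM subject_info AS s
    JOIN anthropometrics AS a ON a.id = s.id
    ORDER BY s.id
""").fetchall()

for row in rows:
    print(row)
```

Each table holds one kind of measurement, and the join assembles a combined row per subject only when it is needed for display or export.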
Differences between a clinical &
research database
• Clinical database
– Form or report oriented so data is displayed for clinical
decision making
– Emphasis on displaying or reporting of individual data
rather than accumulating multiple records
• Research database
– Table oriented so that data is accumulated for eventual
export to a statistical package for data analysis and
reporting
– Less emphasis on individual records
[Figures: example of clinical data; example of research data]
Advantages of a database
• Collection of data in a centralized location
• Controls redundant data
• Data stored so as to appear to users in one location
• Data can be stored in multiple tables and come from
multiple sources
• A relational database brings it all together
Sharing and Exchanging Data
• Multiple users can access the same database via a
network
– Can be local or over the internet
– Best done when the data are stored on a database server
• Access via a client application
• Access via a web interface
– Server allows remote access over the internet from
anywhere
• Should be behind a firewall for security with access via VPN and
password protection
Database Design Considerations
• What to collect
– What questions are to be answered?
– Think of the data tables in your future publications
• Focus on the key data elements rather than collect as much as possible
• What statistical package will be used?
– Format of the data file to which the data will be exported
• Allowable characters
• Format for certain analyses
– For example, gender can be recorded in the database as M or F but
statistical package may require 0 and 1
• Length of data field labels
• Long or wide format
Long versus Wide Format
Long: each year is its own record (one observation per family per year)
Wide: each family is a record and each year is a field within that record
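The two layouts hold the same information, and one can be derived from the other. The sketch below converts long records to wide ones in plain Python; the income values, the field names, and the helper `long_to_wide` are hypothetical and serve only to illustrate the reshaping.

```python
# Long format: one row per (family, year) observation. Values are made up.
long_rows = [
    {"family_id": 1, "year": 2001, "income": 40000},
    {"family_id": 1, "year": 2002, "income": 41000},
    {"family_id": 2, "year": 2001, "income": 52000},
    {"family_id": 2, "year": 2002, "income": 50000},
]

def long_to_wide(rows, id_key, time_key, value_key):
    """Collapse long rows into one record per id with one field per time point."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row[id_key], {id_key: row[id_key]})
        record[f"{value_key}_{row[time_key]}"] = row[value_key]
    return list(wide.values())

wide_rows = long_to_wide(long_rows, "family_id", "year", "income")
for record in wide_rows:
    print(record)
```

Which layout to export depends on the statistical package and the analysis; repeated-measures procedures often expect one of the two specifically.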
Selected Elements of Data
Management Planning
Quality Control of Data Before Study
• Collect only needed variables
• Select appropriate computer hardware & software
• Plan analyses with dummy tabulations
• Develop study forms
– Precode responses
– Format boxes for data entry
– Label each page with date, time, ID
• Consider scan technology
What needs to be in the research database?
• Research variables directly related to the
hypotheses being tested: YES
• Clinical measures used for screening: MAYBE
– Blood work, ECG, medical history
• Administrative data: NO
– Contact information
– Scheduling
Where Are the Original Data?
In the source documents
What is a Source Document?
• It is the First Recording
• What does it tell?
1. It is the data that document the trial
2. Study was carried out according to protocol
Source Documents
• Original Lab reports
• Pathology reports
• Surgical reports
• Physician Progress Notes
• Nurses Notes
• Medical Record
• Letters from referring physicians
• Original radiological films
• Tumor measurements
• Patient Diary/patient interview
Common Data Elements
• Standardized, unique terms and phrases that
delineate discrete pieces of information used to
collect data in a clinical trial
• Uniform representation of demographics and data
points to consistently track trends
• Elements define study parameters and endpoints
Designing the questions
• Granular primary data
• No observer conclusions, synthesis, coding
• Categorical/ordinal data when possible—statistical power.
Re-slice at analysis
• Use validated scales/instruments
• Don’t build your own unless unavoidable
• Collect key variables with >1 question
• Avoid measurements that cluster at one end of scale
• Distribution problems, Likert scales
Operations Manual
• Defines entire study protocol, sequence
• Form-specific annotation, guidance
• Documents all post-hoc validity checks, edit checks, data
curation criteria
• Evolving document with periodic updates
• Preferably on-line
• Use for training, quality control, process planning
Data Dictionary - Operational
• For every form/table, lists:
– Variable name (database field)
– Variable description (plain English)
– Variable type (string, integer, numeric, etc.)
– Variable length (or precision)
– Nullability (missing or no value indicator)
– Range checks, allowable values
– Coding conventions, with definitions
Variable name   Code          Description
ANYSKCA         1=yes; 0=no   Any NMSC post-randomization?
ANYSKCA6        1=yes; 0=no   Any NMSC after 6 months post-randomization?
BCCOCC          1=yes; 0=no   Any BCC occurrence post-randomization?
BCCOCC6         1=yes; 0=no   Any BCC after 6 months post-randomization?
SCCOCC          1=yes; 0=no   Any SCC occurrence post-randomization?
SCCOCC6         1=yes; 0=no   Any SCC after 6 months post-randomization?
ALLSKCA         number        Total number of NMSC which occurred post-randomization
ALLSKCA6        number        Total number of NMSC after 6 months post-randomization
ALLBCC          number        Total number of BCC which occurred post-randomization
ALLBCC6         number        Total number of BCC after 6 months post-randomization
ALLSCC          number        Total number of SCC which occurred post-randomization
ALLSCC6         number        Total number of SCC after 6 months post-randomization
ANYMOS          number        Total number of months before any NMSC occurrence
ANYMOS6         number        Months (after 6 mos post-randomization) before NMSC
BCCMOS          number        Total number of months before first BCC occurrence
BCCMOS6         number        Months (after 6 mos post-randomization) before BCC
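A data dictionary becomes most useful when records can be checked against it automatically. The following is a minimal sketch using two variables from the table above; the rule keys (`type`, `allowed`, `min`), the assumption that counts cannot be negative, and the function `check_record` are illustrative choices, not part of the original study.

```python
# Two entries taken from the table above; the rule keys are hypothetical.
DATA_DICTIONARY = {
    "ANYSKCA": {"type": int, "allowed": {0, 1}},  # 1=yes; 0=no
    "ALLSKCA": {"type": int, "min": 0},           # a count, assumed non-negative
}

def check_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for name, rule in dictionary.items():
        if name not in record or record[name] is None:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{name}: {value!r} not an allowed code")
        if "min" in rule and value < rule["min"]:
            problems.append(f"{name}: {value!r} below minimum {rule['min']}")
    return problems

print(check_record({"ANYSKCA": 1, "ALLSKCA": 3}))
print(check_record({"ANYSKCA": 2, "ALLSKCA": -1}))
```

Keeping the rules in one structure means the same dictionary can drive data-entry checks, documentation, and later cleaning.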
Why code:
• Forces analyzable data structure, format
• Speeds data input/transcription
• Vastly simplifies data analysis/reporting
Example of the need for data coding
What is the subject’s sex?
male female
Male Female
M F
m f
Man Woman
Boy Girl
0 1
1 2
Gentleman Lady
Tarzan Jane
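If free-text entries like these have already been collected, they have to be mapped to a single code before analysis. The sketch below shows one possible mapping; the lookup table `SEX_CODES` and the function `standardize_sex` are hypothetical, and numeric entries are deliberately left unresolved because "1" means female under a 0/1 scheme but male under a 1/2 scheme.

```python
# Hypothetical lookup from the text variants listed above to one code.
SEX_CODES = {
    "male": "M", "m": "M", "man": "M", "boy": "M", "gentleman": "M",
    "female": "F", "f": "F", "woman": "F", "girl": "F", "lady": "F",
}

def standardize_sex(raw):
    """Map a free-text entry to 'M' or 'F'; return None if it cannot be resolved."""
    return SEX_CODES.get(str(raw).strip().lower())

for entry in [" Male ", "f", "Lady", "1", "Tarzan"]:
    print(repr(entry), "->", standardize_sex(entry))
```

Entries that come back as `None` need a human decision, which is the cost that pre-coded responses avoid.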
What do you mean & how will you
record it?
• HEADACHE
• Headache
• Pain in the head
• ACHE:
• Ache:Head
• Head Pain
• HP
Unless there is a standard code for the use
of terms, data retrieval becomes difficult
Rules for Data Entry
• Each variable has a field in the dataset
• Categorical and nominal values require a number
or string code
• Continuous values are entered directly
• Missing values must be different values from a
real response
– Common formats are “99” or bullets “·”
– Don’t know is a response; do not leave blank
– “0” is not the same as missing
• Coding instructions should be on form
• Avoid open-ended questions
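The rule that "0" is not the same as missing matters as soon as a summary is computed. A minimal sketch, assuming 99 is the missing-value code for a hypothetical cigarettes-per-day variable (the code must lie outside the range of real responses for the variable in question):

```python
MISSING = 99  # assumed missing-value code for this variable

def mean_excluding_missing(values, missing_code=MISSING):
    """Average only real responses; 0 is a real value, the missing code is not."""
    observed = [v for v in values if v != missing_code]
    if not observed:
        return None  # nothing was actually observed
    return sum(observed) / len(observed)

cigarettes_per_day = [0, 10, 99, 20, 0]  # hypothetical entries; 99 = missing
naive_mean = sum(cigarettes_per_day) / len(cigarettes_per_day)
clean_mean = mean_excluding_missing(cigarettes_per_day)
print(naive_mean, clean_mean)
```

Treating the missing code as a real value inflates the average, while dropping the zeros would discard genuine non-smokers.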
Avoid open-ended questions
Enter the subject’s gender:___________________
Enter the subject's level of education:__________
Close Ended Question
What is the subject’s sex? Check one
Male
Female
Use pre-coded responses where
possible
Data in Spreadsheet
Subject ID Gender Age
1001 Male 52
1002 Male 54
103 Mael 65
1004 Female 54
5 Female 52
1006 Female 52
1007 Femele 75
1008 Male 48
1009 M 37
1010 Female 73
11 F 54
Data Validation
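The problems in the spreadsheet above (short IDs, misspelled or abbreviated gender values) are the kind that automated checks catch at entry time. The sketch below runs three checks over some of those rows; the four-digit ID rule, the 18 to 110 age range, and the function `edit_checks` are assumptions made for illustration.

```python
import re

rows = [  # (subject_id, gender, age) as typed in the spreadsheet above
    ("1001", "Male", 52), ("103", "Mael", 65), ("5", "Female", 52),
    ("1007", "Femele", 75), ("1009", "M", 37), ("11", "F", 54),
]

def edit_checks(subject_id, gender, age):
    """Return the names of the checks that a row fails."""
    failures = []
    if not re.fullmatch(r"\d{4}", subject_id):  # assumed rule: IDs have 4 digits
        failures.append("id format")
    if gender not in ("Male", "Female"):        # inadmissible code
        failures.append("gender code")
    if not 18 <= age <= 110:                    # assumed plausible adult range
        failures.append("age range")
    return failures

flagged = {row[0]: edit_checks(*row) for row in rows if edit_checks(*row)}
for subject_id, failures in flagged.items():
    print(subject_id, failures)
```

In a database these rules would normally be enforced when the value is entered, so the bad rows never reach the analysis file.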
Types of Edit Checks
• Patient identification and record linkage
– ID #’s, name spelling, ID #’s on all pages
• Legibility
• Correct form for examination
• Missing data
• Consistency
• Range and inadmissible codes
Backup
• Data must be backed up on a regular basis to
protect against:
– Theft, fire, floods, hurricanes
– Equipment failure
• Computer backup
– Mirrored drives
– Digital tapes
• Store backup tapes off-site
Putting it All Together:
Research Data Management
• An artful selection of physical & electronic management
methods
– Signed informed consent documents
– Paper forms
– Regulatory & project management binders
– Data models and databases
– Data acquisition and display technologies
– Communications technologies for project management as
well as data management
Attributes of Successful Data
Management
• Attention to detail
• Explicit structure and process
• Robust designs
– Anticipate failures, lapses and mistakes
– Design systems that identify and correct them
• Mechanisms for verification
• Well documented
Quality
Fast is fine, but accuracy is
everything.
(Wyatt Earp)
Study Design
[Table headings: Study Design Decision; Short-hand; Measure of Association; Interpretation of Measure of Association]
Want to be thorough…..
…but as minimally-invasive as possible
everything-biostatistics-(biostats)-pdf.

More Related Content

PPTX
Introduction to Data (Data Analytics)...
PPTX
Introduction to Health statistics and biostats
PPTX
Data base and data entry presentation by mj n somya
PPTX
Biostatistics pt 1
PPSX
Data type source presentation im
PPTX
Topic 1- Introduction to Statistics , Data and Measurement Sclaes.pptx
PPTX
Basic statistical & pharmaceutical statistical applications
PPTX
Data in Research
Introduction to Data (Data Analytics)...
Introduction to Health statistics and biostats
Data base and data entry presentation by mj n somya
Biostatistics pt 1
Data type source presentation im
Topic 1- Introduction to Statistics , Data and Measurement Sclaes.pptx
Basic statistical & pharmaceutical statistical applications
Data in Research

Similar to everything-biostatistics-(biostats)-pdf. (20)

PPT
Data collection & management
PPTX
Introduction to Data Management in Human Ecology
PPTX
BIOSTATISTICS (MPT) 11 (1).pptx
PPTX
Introduction to basics of bio statistics.
PPT
1- introduction,data sources and types1 (1).ppt
PPTX
Biostatistic 2.pptx
PPT
Spss introductory session data entry and descriptive stats
PPT
18- Introduction and levels of measurements.ppt
PPTX
8. data analysis in research practice.pptx
PDF
1_Introduction to Biostatistics-2 (2).pdf
PDF
1_Introduction to Biostatistics-2 (2).pdf
PPTX
HM404 Ab120916 ch09
PPTX
Data analytics in Healthcare
PPTX
Introduction of biostatistics
PPT
Biostatistics Concept & Definition
PPTX
Chapter 2 business mathematics for .pptx
PPTX
LECTURE 1 STATISTICS for data analytics and machine learning
PPTX
Introduction of Data and Type of data in Statstics
PDF
Data Analysis with SPSS PPT.pdf
Data collection & management
Introduction to Data Management in Human Ecology
BIOSTATISTICS (MPT) 11 (1).pptx
Introduction to basics of bio statistics.
1- introduction,data sources and types1 (1).ppt
Biostatistic 2.pptx
Spss introductory session data entry and descriptive stats
18- Introduction and levels of measurements.ppt
8. data analysis in research practice.pptx
1_Introduction to Biostatistics-2 (2).pdf
1_Introduction to Biostatistics-2 (2).pdf
HM404 Ab120916 ch09
Data analytics in Healthcare
Introduction of biostatistics
Biostatistics Concept & Definition
Chapter 2 business mathematics for .pptx
LECTURE 1 STATISTICS for data analytics and machine learning
Introduction of Data and Type of data in Statstics
Data Analysis with SPSS PPT.pdf
Ad

Recently uploaded (20)

PPTX
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
PDF
OneRead_20250728_1808.pdfhdhddhshahwhwwjjaaja
PPTX
Phase1_final PPTuwhefoegfohwfoiehfoegg.pptx
PPTX
Business_Capability_Map_Collection__pptx
PPT
expt-design-lecture-12 hghhgfggjhjd (1).ppt
PPTX
eGramSWARAJ-PPT Training Module for beginners
PPTX
retention in jsjsksksksnbsndjddjdnFPD.pptx
PPTX
SET 1 Compulsory MNH machine learning intro
PDF
©️ 02_SKU Automatic SW Robotics for Microsoft PC.pdf
PPTX
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
PDF
Navigating the Thai Supplements Landscape.pdf
PDF
Systems Analysis and Design, 12th Edition by Scott Tilley Test Bank.pdf
PPTX
statsppt this is statistics ppt for giving knowledge about this topic
PPTX
chuitkarjhanbijunsdivndsijvndiucbhsaxnmzsicvjsd
DOCX
Factor Analysis Word Document Presentation
PPTX
Tapan_20220802057_Researchinternship_final_stage.pptx
PPTX
MBA JAPAN: 2025 the University of Waseda
PPTX
chrmotography.pptx food anaylysis techni
PPTX
sac 451hinhgsgshssjsjsjheegdggeegegdggddgeg.pptx
PPTX
recommendation Project PPT with details attached
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
OneRead_20250728_1808.pdfhdhddhshahwhwwjjaaja
Phase1_final PPTuwhefoegfohwfoiehfoegg.pptx
Business_Capability_Map_Collection__pptx
expt-design-lecture-12 hghhgfggjhjd (1).ppt
eGramSWARAJ-PPT Training Module for beginners
retention in jsjsksksksnbsndjddjdnFPD.pptx
SET 1 Compulsory MNH machine learning intro
©️ 02_SKU Automatic SW Robotics for Microsoft PC.pdf
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
Navigating the Thai Supplements Landscape.pdf
Systems Analysis and Design, 12th Edition by Scott Tilley Test Bank.pdf
statsppt this is statistics ppt for giving knowledge about this topic
chuitkarjhanbijunsdivndsijvndiucbhsaxnmzsicvjsd
Factor Analysis Word Document Presentation
Tapan_20220802057_Researchinternship_final_stage.pptx
MBA JAPAN: 2025 the University of Waseda
chrmotography.pptx food anaylysis techni
sac 451hinhgsgshssjsjsjheegdggeegegdggddgeg.pptx
recommendation Project PPT with details attached
Ad

everything-biostatistics-(biostats)-pdf.

  • 2. Biostatistician - Chengcheng Hu Epidemiologist - Janet Foote M.S.: Paul Kang GRA: David Margraf / Shafquat “Shaf” Saif Additional Faculty from the Department of Biostatistics & Epidemiology as needed.
  • 3. Finding more information online: Research
  • 4. On the Research page, select Resources
  • 5. Under Resources, select Biostatistics & Study Design Services
  • 6. Biostatistics & Study Design Services page http://phoenixmed.arizona.edu/research/resources/biostatistics-services
  • 7. Separate Intake Form link for Faculty & Students
  • 8. • As you are starting to design your study protocol • Prior to submitting proposals for projects, routing • Before you pilot your survey / start data collection • As you are preparing to analyze data Many roads…..
  • 9. We are here to help !
  • 10. • What is your overall research question? • What are your specific aims to answer your question? • Who is you target population? • What approach do you think you’ll be using (i.e. measurements, surveys, observations/chart reviews, etc)? • How much difference / change / variation is important? • When do you need this?
  • 12. Research Question In science, we take an organized, methodical approach to examine a question. The first thing we need to do for research, is ask a question.
  • 13. Once we have question, we need to complete background research. By reading what has been published about a topic, we often find out new information and sometimes change our question a bit because we are better informed about the topic.
  • 14. The next important step in the scientific method is constructing a hypothesis.
  • 15. A hypothesis is an ‘educated idea’ about how things will work. There is a set way one to state the hypothesis. If _________(I do this), then _________(this)________ will happen. The hypothesis should be measureable, and do able. Ex: If I test the blood sugar of 100 adults, more than 20% will be at risk* for diabetes. If I add methyltrexate to HL1 cancer cells, the cells will stop multiplying, so the cancer cannot progress. Independent variable Dependent variable * Risk is defined as…….
  • 16. Testing a hypothesis is not a “one and done” procedure. In science, results must be shown to be repeatable and consistent. Statistics helps us understand the odds that the results we see are ‘real’ based on our study design.
  • 17. • Help understand the odds that results are real • Dependent on type and characteristics of data • Cannot fix design / data / recruitment problems
  • 19. • Categorical • Quantity • Nominal • Ordinal • Binary • Discrete and continuous data. • Interval and ratio variables • Qualitative • Quantitative Characteristics of data
  • 20. Categorical Data • The objects being studied are grouped into categories. • Categories are usually based on a qualitative trait. • These data are merely labels or categories. • May or may not have any underlying order.
  • 21. Examples: • Type of Bicycle – Mountain bike, road bike, chopper, folding, BMX. • Ethnicity – Asian, Pacific Islander, African American, Caucasian, Latino, Native American (note problems with these categories). • Smoking status – smoker, non-smoker, former smoker Nominal Data Categorical data in which objects fall into unordered categories.
  • 22. Ordinal Data • Categorical data in which order is important. • Highest Education level – elementary, high school, college graduate • Degree of illness- none, mild, moderate, acute, chronic. • Opinion of students about stats classes- Very unhappy, unhappy, neutral, happy, ecstatic!
  • 23. Binary Data • Special type of categorical data in which there are only two categories. • Binary data can either be nominal or ordinal. • Current smoking status: smoker, non-smoker • Attendance: present, absent • Class mark: pass, fail. • Status of student: undergraduate, postgraduate.
  • 24. Categorical data classified as Nominal, Ordinal, and/or Binary Categorical data Not binary Binary Ordinal data Nominal data Binary Not binary
  • 25. Quantity Data • Whatever is under study is being ‘measured’ based on some quantitative trait. • Data are set of numbers. • Pulse rate • Height • Age • Exam marks • Size of bicycle frame • Time to complete a statistics test • Number of cigarettes smoked Examples
  • 26. Quantity data can be classified as Discrete or Continuous Quantity data Continuous Discrete
  • 27. Discrete Data Only certain values are possible (there are gaps between the possible values). Implies counting. Continuous Data Theoretically, with a fine enough measuring device, no gaps.
  • 28. Discrete Data • Number of children in a family • Number of students passing a stats exam • Number of crimes reported to the police • Number of bicycles sold in a day. Generally, discrete data are counts. We would not expect to find 2.2 children in a family or 88.5 students passing an exam or 127.2 crimes being reported to the police or half a bicycle being sold in one day.
  • 29. Continuous data • Size of bicycle frame • Height • Time to run 500 metres • Age ‘Generally, continuous data come from measurements. (any value within an interval is possible with a fine enough measuring device’- (Rowntree 2000)).
  • 30. Discrete data -- Gaps between possible values- count 0 1 2 3 4 5 6 7 Continuous data -- Theoretically, no gaps between possible values- measure 0 1000
  • 31. The type of data collected in a study determines the type of statistical analysis used.
  • 32. A database is a method of organizing and analyzing information.
  • 33. • Organize & analyze information in different ways  Sorting  Grouping  Querying  Reporting  Exporting for statistical analysis • Computerized database  Speed  Quality control  Precision  Automate repetitive tasks
  • 34. •Excel has some limited capabilities to sort data but its primary function is to create financial spreadsheets – Can create “what if” scenarios to determine financial consequences – Can be used for small /limited research data sets & simple lists – Not multi-user such that only one person can work on the file at a time •Databases: designed to collect, sort, & manipulate data – Databases can process large amounts of data; usually limited by hardware constraints – Structure is in the same format for each member record of a table – Data quality control features ensure that valid data is entered – A relational database allows for linking of an unlimited number of tables – Databases are multi-user because the data can reside on a server and multiple people can have access at the same time – Many databases offer web interfaces thereby eliminating the need for each user to have a copy of the program on their computer
  • 35. •Many databases offer audit functions required by certain regulatory agencies • Tracks date record created and modified • Tracks original and changed values • Requires user to give reason for the change •Databases are more suitable for importing data from multiple sources • More robust in connecting to different data sources • Imports of different data types into different tables can be linked via common identifiers such as subject ID • Merging multiple data sources into Excel so that the rows line up properly in a flat file format can be a challenge
  • 36. • Qualtrics™ – Survey Monkey on steroids • Redcap™ – Clinical database
  • 37. •One or more tables •Tables store records  Patient identifiers  Demographics and history  Test results  Etc….. •A record is a collection of fields – Patient identifiers • Name, DOB, address, …..are stored in separate fields
  • 39. How is data displayed?  Fields are displayed on layouts  Forms  Web  Reports  Data can be from a single table or many tables if using a relational database
  • 40. Id Name Age 10 Smith 50 11 Jones 55 12 Doe 60 ID Weight (lb) Weight (kg) 10 230 104.5 11 212 96.4 12 199 90.4 ID KCAL KCAL/kg 10 2400 23.1 11 2652 27.5 12 2350 25.9 Relational Database Example ID V02 V02/kg 10 2.8 26.7 11 3.2 33.1 12 2.1 23.2 Subject Info Anthropometrics Physical Activity Treadmill Performance
  • 41. Differences between a clinical & research database • Clinical database – Form or report oriented so data is displayed for clinical decision making – Emphasis on displaying or reporting of individual data rather than accumulating multiple records • Research database – Table oriented so that data is accumulated for eventual export to a statistical package for data analysis and reporting – Less emphasis on individual records
  • 44. Advantages of a database  Collection of data in a centralized location  Controls redundant data  Data stored so as to appear to users in one location  Data can be stored in multiple tables and come from multiple sources  A relational database brings it all together
  • 45. Sharing and Exchanging Data • Multiple users can access the same database via a network – Can be local or over the internet – Best done when the data are stored on a database server • Access via a client application • Access via a web interface – Server allows remote access over the internet from anywhere • Should be behind a firewall for security with access via VPN and password protection
  • 46. Database Design Considerations • What to collect – What questions are to be answered? – Think of the data tables in your future publications • Focus on the key data elements rather than collect as much as possible • What statistical package will be used? – Format of the data file to which the data will be exported • Allowable characters • Format for certain analyses – For example, gender can be recorded in the database as M or F but statistical package may require 0 and 1 • Length of data field labels • Long or wide format
  • 47. Long versus Wide Format Long: each year is represented as its own observation in a record Wide: each family is a record and each year is a field with that record
  • 48. Selected Elements of Data Management Planning
  • 49. Quality Control of Data Before Study  Collect only needed variables  Select appropriate computer hardware & software  Plan analyses with dummy tabulations  Develop study forms  Precode responses  Format boxes for data entry  Label each page with date, time, ID  Consider scan technology
  • 50. What needs to be in the research database?  Research variables directly related to the hypotheses being tested-YES  Clinical measures used for screening-MAYBE  Blood work, ECG, medical history  Administrative data-NO  Contact information  Scheduling
  • 51. Where Are the Original Data? In the source documents
  • 52. What is a Source Document? • It is the First Recording • What does it tell? 1. It is the data that document the trial 2. Study was carried out according to protocol
  • 53. Source Documents • Original Lab reports • Pathology reports • Surgical reports • Physician Progress Notes • Nurses Notes • Medical Record • Letters from referring physicians • Original radiological films • Tumor measurements • Patient Diary/patient interview
  • 54. Common Data Elements • Standardized, unique terms and phrases that delineate discrete pieces of information used to collect data in a clinical trial • Uniform representation of demographics and data points to consistently track trends • Elements define study parameters and endpoints
  • 55. Designing the questions • Granular primary data • No observer conclusions, synthesis, coding • Categorical/ordinal data when possible—statistical power. Re-slice at analysis • Use validated scales/instruments • Don’t build your own unless unavoidable • Collect key variables with >1 question • Avoid measurements that cluster at one end of scale • Distribution problems, Likert scales
  • 57. Operations Manual • Defines entire study protocol, sequence • Form-specific annotation, guidance • Documents all post-hoc validity checks, edit checks, data curation criteria • Evolving document with periodic updates • Preferably on-line • Use for training, quality control, process planning
  • 58. Data Dictionary - Operational • For every form/table, lists: – Variable name (database field) – Variable description (plain English) – Variable type (string, integer, numeric, etc.) – Variable length (or precision) – Nullability (missing or no value indicator) – Range checks, allowable values – Coding conventions, with definitions
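One way to make a data dictionary operational is to store it in machine-readable form and check each record against it. The sketch below is only an illustration: the variable names, types, ranges and allowable values are assumptions, not taken from any study described in the slides.

```python
# Hypothetical data dictionary: one entry per variable.
DATA_DICTIONARY = {
    "subject_id": {"type": int, "nullable": False, "min": 1001, "max": 9999},
    "sex": {"type": int, "nullable": False, "allowed": {0, 1}},  # assumed: 1=male, 0=female
    "age": {"type": int, "nullable": True, "min": 18, "max": 110},
}


def validate_record(record, dictionary):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for name, rule in dictionary.items():
        value = record.get(name)
        if value is None:
            # Nullability check: is a missing value permitted for this variable?
            if not rule["nullable"]:
                problems.append(f"{name}: missing value not allowed")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
            continue
        # Allowable values and range checks.
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{name}: {value!r} is not an allowable value")
        if "min" in rule and value < rule["min"]:
            problems.append(f"{name}: {value} below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{name}: {value} above maximum {rule['max']}")
    return problems


print(validate_record({"subject_id": 103, "sex": 2, "age": None}, DATA_DICTIONARY))
```

Keeping the rules in one structure means the same dictionary can serve as documentation and as the source for automated range checks.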
  • 59. Variable name | Code | Description
    ANYSKCA | 1=yes; 0=no | Any NMSC post-randomization?
    ANYSKCA6 | 1=yes; 0=no | Any NMSC after 6 months post-randomization?
    BCCOCC | 1=yes; 0=no | Any BCC occurrence post-randomization?
    BCCOCC6 | 1=yes; 0=no | Any BCC after 6 months post-randomization?
    SCCOCC | 1=yes; 0=no | Any SCC occurrence post-randomization?
    SCCOCC6 | 1=yes; 0=no | Any SCC after 6 months post-randomization?
    ALLSKCA | number | Total number of NMSC which occurred post-randomization
    ALLSKCA6 | number | Total number of NMSC after 6 months post-randomization
    ALLBCC | number | Total number of BCC which occurred post-randomization
    ALLBCC6 | number | Total number of BCC after 6 months post-randomization
    ALLSCC | number | Total number of SCC which occurred post-randomization
    ALLSCC6 | number | Total number of SCC after 6 months post-randomization
    ANYMOS | number | Total number of months before any NMSC occurrence
    ANYMOS6 | number | Months (after 6 mos post-randomization) before NMSC
    BCCMOS | number | Total number of months before first BCC occurrence
    BCCMOS6 | number | Months (after 6 mos post-randomization) before BCC
  • 60. Why code: • Forces an analyzable data structure and format • Speeds data input/transcription • Vastly simplifies data analysis and reporting
  • 61. Example of the need for data coding What is the subject’s sex? male female Male Female M F m f Man Woman Boy Girl 0 1 1 2 Gentleman Lady Tarzan Jane
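The free-text variants above can be collapsed to a single code with a lookup table. This is a rough sketch under an assumed convention (1 = male, 0 = female); the mapping and the function name `code_sex` are hypothetical. Bare numeric entries are deliberately left unmapped, because "1" could mean male under one scheme and female under another.

```python
# Assumed coding convention for illustration: 1 = male, 0 = female.
SEX_CODES = {
    "male": 1, "m": 1, "man": 1, "boy": 1, "gentleman": 1,
    "female": 0, "f": 0, "woman": 0, "girl": 0, "lady": 0,
}


def code_sex(raw):
    """Map a free-text entry to a code, or None if it is not recognized."""
    return SEX_CODES.get(str(raw).strip().lower())


entries = ["male", "Female", "M", "f", "Woman", "Tarzan", "1"]
coded = [code_sex(entry) for entry in entries]
print(coded)
```

Entries that return `None` need manual review against the source document. Precoded responses on the form avoid this cleanup step altogether.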
  • 62. What do you mean & how will you record it? • HEADACHE • Headache • Pain in the head • ACHE: • Ache:Head • Head Pain • HP • Unless there is a standard code for the use of terms, data retrieval becomes difficult
  • 63. Rules for Data Entry • Each variable has a field in the dataset • Categorical and nominal values require a number or string code • Continuous values are entered directly • Missing values must be coded differently from any real response – Common codes are “99” or a dot “·” • “Don’t know” is a response; do not leave it blank • “0” is not the same as missing • Coding instructions should be on the form • Avoid open-ended questions
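A small illustration of why the missing-value code must be distinguishable from real responses, and why "0" is not the same as missing. The variable (`drinks_per_week`), its values, and the choice of 99 as the missing code are assumptions made for this sketch.

```python
MISSING = 99  # assumed missing-value code; must not be a plausible real response


def mean_excluding_missing(values, missing_code=MISSING):
    """Average the real responses only; return None if there are none."""
    valid = [v for v in values if v != missing_code]
    if not valid:
        return None
    return sum(valid) / len(valid)


# Hypothetical responses: two subjects truly reported 0, one did not answer.
drinks_per_week = [0, 3, 99, 1, 0]

naive_mean = sum(drinks_per_week) / len(drinks_per_week)  # treats 99 as a real value
clean_mean = mean_excluding_missing(drinks_per_week)      # keeps the zeros, drops the 99

print(naive_mean, clean_mean)
```

If 99 could be a legitimate value for a variable, a different code (for example 999 or a negative number) would be needed for that variable.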
  • 64. Avoid open-ended questions Enter the subject’s gender:___________________ Enter the subject's level of education:__________
  • 65. Close Ended Question What is the subject’s sex? Check one Male Female
  • 66. Use pre-coded responses where possible
  • 67. Data in Spreadsheet
    Subject ID | Gender | Age
    1001 | Male | 52
    1002 | Male | 54
    103 | Mael | 65
    1004 | Female | 54
    5 | Female | 52
    1006 | Female | 52
    1007 | Femele | 75
    1008 | Male | 48
    1009 | M | 37
    1010 | Female | 73
    11 | F | 54
  • 71. Types of Edit Checks • Patient identification and record linkage – ID #’s, name spelling, ID#’s on all pages • Legibility • Correct form for examination • Missing data • Consistency • Range and inadmissible codes
  • 72. Backup • Data must be backed up on a regular basis to protect against: – Theft, fire, floods, hurricanes – Equipment failure • Computer backup – Mirrored drives – Digital tapes • Store backup tapes off-site
  • 73. Putting it All Together: Research Data Management • An artful selection of physical & electronic management methods – Signed informed consent documents – Paper forms – Regulatory & project management binders – Data models and databases – Data acquisition and display technologies – Communications technologies for project management as well as data management
  • 74. Attributes of Successful Data Management • Attention to detail • Explicit structure and process • Robust designs – Anticipate failures, lapses and mistakes – Design systems that identify and correct them • Mechanisms for verification • Well documented
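Several of these checks (identification, missing data, range and inadmissible codes) can be automated. The sketch below applies them to rows like those in the spreadsheet example. The rules are assumptions for illustration: IDs are taken to be four digits, gender must be exactly "Male" or "Female", and ages must fall between 18 and 110. The last row, with a blank age, is invented to show the missing-data check.

```python
# Rows as (subject_id, gender, age) strings, as they might be typed into a spreadsheet.
rows = [
    ("1001", "Male", "52"),
    ("103", "Mael", "65"),
    ("5", "Female", "52"),
    ("1007", "Femele", "75"),
    ("1009", "M", "37"),
    ("1010", "Female", ""),  # hypothetical row with a blank age
]

ALLOWED_GENDER = {"Male", "Female"}  # assumed admissible codes


def edit_checks(subject_id, gender, age):
    """Return the list of edit-check failures for one row."""
    problems = []
    if not (subject_id.isdigit() and len(subject_id) == 4):
        problems.append("ID is not a 4-digit number")
    if gender not in ALLOWED_GENDER:
        problems.append("inadmissible gender code")
    if age == "":
        problems.append("age missing")
    elif not age.isdigit() or not 18 <= int(age) <= 110:
        problems.append("age out of range")
    return problems


flagged = {}
for subject_id, gender, age in rows:
    problems = edit_checks(subject_id, gender, age)
    if problems:
        flagged[subject_id] = problems

for subject_id, problems in flagged.items():
    print(subject_id, problems)
```

Flagged rows should be resolved against the source documents, not corrected by guesswork.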
  • 75. Quality: Fast is fine, but accuracy is everything. (Wyatt Earp)
  • 77. Want to be thorough… but as minimally invasive as possible