Advanced statistics for librarians

Advanced Statistics for Librarians: How to use and evaluate statistical information in library research. Jason Price, Science & Electronic Resources Librarian, Claremont Colleges. John McDonald, Acquisitions Librarian, Caltech.

Advanced Statistics Part I: Research Design. Part II: Statistical Concepts. Part III: Evaluating Library Statistics.

Research Design Validity: How accurately an indicator measures the concept being studied. Is the technique appropriate to measure the concept being studied? Reliability: How consistent is the measurement? Does it yield the same results over repeated attempts and by different researchers? How certain are the results? Generalizability: How well (or how likely) can the findings be applied to other situations?

Research Design Steps: Research Question, Hypotheses, Data definitions, Data collection, Data analysis, Conclusions.

Research Question What is the study designed to answer? Why is the study important? The more specific, the better! Example: Should the library increase hours during finals week?

Hypothesis A statement about the expected results. What you will test after collecting data. Null Hypothesis: there is no difference between Group 1 & Group 2 or Before/After. Notated H0: Group 1 = Group 2. Alternate Hypothesis: there is a difference, and what that difference will be. Notated Ha: Group 1 ≠ Group 2. Can also be directional if theory or prior research indicates: Ha: Group 1 > Group 2.

Data collection: Observation, Interviews, Focus Groups, Surveys, Transaction Logs, Others?

Data Collection: Sampling Necessary when it is impossible to study an entire population due to logistical, geographical, monetary, or time constraints. A sample must be a good representation of the rest of the population. The larger your sample, the more sure you can be that its answers truly reflect the population. Accuracy increases when more respondents pick one choice over another, e.g. more accuracy when 99% choose one presidential candidate. The larger your population size, the larger your sample needs to be, except if your population is very large (e.g. the U.S.) or very small (e.g. your household).

Sampling Designs Simple: assumes homogeneity. Stratified: assumes heterogeneity.

Calculating Sample Sizes
1) When you have a very large pop size: SS = Z^2 * p * (1 - p) / c^2
2) When you have a finite pop size: ss = SS / (1 + (SS - 1) / pop)
Z = Z value (e.g. 1.96 for 95% confidence level)
p = percentage picking a choice, expressed as decimal (e.g. .5 for 50%)
c = confidence interval, expressed as decimal (e.g. .04 = ±4%)
Sample size spreadsheet
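The two formulas above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `sample_size`, the choice to round up to the next whole respondent, and the population of 2,000 are assumptions, not part of the slides.

```python
import math

def sample_size(z, p, c, population=None):
    """Required sample size, rounded up (rounding up is an assumption)."""
    # Formula 1: very large population
    ss = (z ** 2) * p * (1 - p) / (c ** 2)
    # Formula 2: correction for a finite population
    if population is not None:
        ss = ss / (1 + (ss - 1) / population)
    return math.ceil(ss)

# 95% confidence level (Z = 1.96), p = .5, c = .04 (±4%)
print(sample_size(1.96, 0.5, 0.04))                   # very large population
print(sample_size(1.96, 0.5, 0.04, population=2000))  # hypothetical campus of 2,000
```

With these inputs the finite-population correction lowers the required sample from 601 to 462.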

M&M Sampling Research Question: What is the color distribution of M&Ms? Sample: What is the color distribution of a simple random sample of M&Ms? Test: Does my sample yield different results than what is reported by the company? Method: Packages of M&Ms distributed to each participant. Each package is a random sample from the company.

M&M Sampling: M&M Data Collection & Testing. Let’s look at the colors in individual samples of M&Ms.

Data Definitions Data Scales: Nominal, Ordinal, Interval, Ratio. Frequency Distributions: Flat, Normal, Skewed. Variable Types: Dependent, Independent, Extraneous.

Data Scales Nominal: scaled without order, indicating that classifications are different. Example: Public & private institutions. Ordinal: scaled with order, but without distance between values. Example: Carnegie classifications. Interval: scaled with order and establishes numerically equal distances on the scale. Example: Grade level (freshman, sophomore, etc.). Ratio: scaled with equal intervals and a zero starting point. Example: Fulltext downloads. Nominal or ordinal variables are discrete, while interval and ratio variables are continuous.

Name that data type! Salary; Author of a book; Hours spent in the library; Patron status; Publication year of a journal; Ranked journal lists; Test results on instruction classes; Number of articles read; FTE.

Data Distributions Described by their kurtosis (peakedness and tail weight) and skew (asymmetry). Non-normal (skewed): extreme values with steep slopes. Normal: bell shaped curve with gradual slopes.

Fulltime Students at ARL Schools N=114 Mean = 22K SD = 10K

Total Salaries & Wages at ARL Libraries N=114 Mean = 10M SD = 6.5M

Variables Dependent: the variable being measured, studied, and predicted. Independent : variables that can be manipulated or are predictors of the dependent variable. Extraneous : variables other than the independent variables that can influence the dependent variable.

Data analysis Descriptive statistics: Mean, Median, Mode, Standard Deviation. Correlational statistics: Correlation. Inferential statistics: T-test, Regression, Chi-square, ANOVA.
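As a small illustration of the descriptive statistics listed above, the sketch below uses Python's standard `statistics` module. The download counts are made-up numbers chosen so that one extreme value pulls the mean well above the median.

```python
import statistics

# Hypothetical monthly full-text downloads for seven journals
downloads = [12, 15, 15, 18, 20, 22, 95]

mean = statistics.mean(downloads)      # sensitive to the extreme value 95
median = statistics.median(downloads)  # middle value, robust to extremes
mode = statistics.mode(downloads)      # most frequent value
sd = statistics.stdev(downloads)       # sample standard deviation

print(f"mean={mean:.2f} median={median} mode={mode} sd={sd:.2f}")
```

When the mean and median differ this much, the distribution is skewed and the median is usually the better summary.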

Correlational Statistics Correlation establishes that two measures have a relationship. Indicates direction & strength, but not causation! Allows researcher to consider other statistical tests with confidence. Requirements: random sample, interval or ratio data, normal distribution, linear relationship.

Correlational Statistics Direction Positive: As one value increases, the other does as well. Example : Age and height. Library : Enrollment & materials budget. Negative: As one value increases, the other decreases. Example : Car speed & time to destination. Library : Items purchased & shelf space. Strength Value between 1 (positive) and -1 (negative). The closer to those values, the stronger the relationship.
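Direction and strength can be seen by computing a Pearson correlation coefficient by hand. The sketch below is illustrative: the helper `pearson_r` and all data values are hypothetical, loosely modeled on the two library examples above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: enrollment (thousands) vs. materials budget ($M)
enrollment = [5, 10, 15, 20, 25]
budget = [1.0, 2.1, 2.9, 4.2, 5.0]
r_pos = pearson_r(enrollment, budget)

# Hypothetical data: items purchased vs. free shelf space (meters)
items = [100, 200, 300, 400, 500]
free_shelf = [90, 80, 65, 50, 40]
r_neg = pearson_r(items, free_shelf)

print(f"positive: r={r_pos:.3f}  negative: r={r_neg:.3f}")
```

Both values land close to the ends of the -1 to 1 range, indicating strong relationships in opposite directions.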

Inferential Statistics Parametric : assume that the dependent variable has a known underlying mathematical distribution (normal, binomial, Poisson, etc.) which serves as the basis for sample-to-population estimates. Parametric tests are robust and have great power efficiency. Non-parametric : do not assume a normal distribution ( distribution free ) & require that the data meet fewer assumptions. Allow for the analysis of a mixture of data types.

T-Test Determine if there is a difference (in a characteristic) between two populations based on data from samples of those populations. Requirements: random sample, interval or ratio data, normal distribution, equal standard deviations.
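A minimal sketch of the two-sample t statistic, assuming equal standard deviations (the pooled-variance form) as the requirements above state. The function name `pooled_t` and the test scores are hypothetical; the critical value 2.145 is the standard two-tailed t value for 14 degrees of freedom at alpha = 0.05.

```python
import math
import statistics

def pooled_t(sample_a, sample_b):
    """Two-sample t statistic with pooled variance, plus degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(sample_a)
                  + (n_b - 1) * statistics.variance(sample_b)) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / se
    return t, df

# Hypothetical post-test scores for two instruction methods
online = [78, 82, 85, 88, 90, 76, 84, 81]
classroom = [75, 80, 83, 86, 89, 77, 82, 80]

t, df = pooled_t(online, classroom)
critical = 2.145  # two-tailed, alpha = 0.05, df = 14
print(f"t={t:.3f}, df={df}, reject H0: {abs(t) > critical}")
```

Here the t statistic falls well below the critical value, so these made-up samples give no evidence of a difference between the two groups.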

Regression Predicts values of a dependent variable based on values of independent (predictor) variables. Requirements: interval or ratio data, normal distribution, correlated variables, linear relationship.
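To make the idea of prediction concrete, here is a sketch of ordinary least squares with a single predictor. The helper `fit_line` and the enrollment and budget figures are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: enrollment (thousands) predicts materials budget ($M)
enrollment = [5, 10, 15, 20, 25]   # independent (predictor) variable
budget = [1.0, 2.1, 2.9, 4.2, 5.0]  # dependent variable

slope, intercept = fit_line(enrollment, budget)
predicted = intercept + slope * 30  # predicted budget at 30K enrollment
print(f"budget = {intercept:.2f} + {slope:.3f} * enrollment; at 30K: {predicted:.2f}")
```

Predicting at 30K extrapolates beyond the observed range, so the estimate should be treated with more caution than predictions inside it.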

ANOVA Determine if there are differences between three or more sample means. Test the significance and direction of the difference. Requirements: normal distribution (in each cell), interval or ratio data, homogeneity of variance.
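The F statistic behind a one-way ANOVA compares variation between group means with variation inside the groups. The sketch below is illustrative only: `one_way_anova_f` and the weekly library-hours data are hypothetical, and 3.89 is the standard critical F value for (2, 12) degrees of freedom at alpha = 0.05.

```python
def one_way_anova_f(groups):
    """F statistic and degrees of freedom for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical weekly hours spent in the library, by class year
freshmen = [2, 3, 4, 3, 3]
sophomores = [4, 5, 4, 6, 6]
juniors = [7, 8, 6, 7, 7]

f, df_b, df_w = one_way_anova_f([freshmen, sophomores, juniors])
critical = 3.89  # F(2, 12) at alpha = 0.05
print(f"F={f:.1f}, df=({df_b}, {df_w}), reject H0: {f > critical}")
```

A large F only says that at least one group mean differs; a follow-up comparison is needed to say which groups differ.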

Chi Square Test Difference between expected and observed frequencies for nominal or ordinal data. Requirements: any type of data, large sample size (>50), similar distributions.

Chi Square Test: Pepsi Challenge
Observed: Pepsi 85, Coke 57, RC 78. Expected (equal) = 73.33
          O      E        O-E      (O-E)^2   (O-E)^2/E
Pepsi     85     73.33    11.67    136.19    1.86
Coke      57     73.33    -16.33   266.67    3.64
RC        78     73.33    4.67     21.81     0.3
Totals    220    219.99                      χ2 = 5.8
Degrees of freedom = rows - 1 = 3 - 1 = 2
Critical value of χ2 = 5.99 at alpha = 0.05
Observed value of χ2 = 5.8
Decision: Fail to reject H0 (5.8 < 5.99)
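The Pepsi Challenge arithmetic can be reproduced with a short Python sketch. The observed counts and the critical value come from the slide; the variable names are illustrative.

```python
# Observed choices from the Pepsi Challenge slide
observed = {"Pepsi": 85, "Coke": 57, "RC": 78}

total = sum(observed.values())
expected = total / len(observed)  # equal preference under H0

# Sum of (O - E)^2 / E over all categories
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1
critical = 5.99  # chi-square critical value for df = 2 at alpha = 0.05

reject_null = chi_square > critical
print(f"chi-square={chi_square:.2f}, df={df}, reject H0: {reject_null}")
```

Working from unrounded intermediate values gives about 5.79, which matches the slide's 5.8 and leads to the same decision.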

Inferential Statistics
Goal                                      Parametric                   Non-parametric
Quantify association between variables    Pearson correlation          Spearman correlation, Kendall's tau
Compare 2 unpaired groups                 Unpaired t-test              Mann-Whitney test, Fisher's test
Compare 2 paired groups                   Standard two-group t-test    Mann-Whitney, Kolmogorov-Smirnov
Compare 3+ unmatched groups               ANOVA                        Kruskal-Wallis test, Chi-square test
Compare sample to a hypothetical value    T-test                       Wilcoxon test, Chi-square
Predict value from measured variables     OLS Regression               Poisson regression, Negative Binomial reg.

Review: Research Design Research Question: What will the study answer? Hypotheses: What do you think the results will be? Data definitions: What scales are the variables, what is the distribution, and what are the dependent, independent & extraneous variables? Data collection: What is the best method for collecting the variables of interest? Data analysis: What are the proper statistical tests to use on the data? Conclusions: What does the data show us or indicate?

Case Studies
Citation Analysis: Antelman, K. (2004). “Do Open-Access Articles Have a Greater Research Impact?” College & Research Libraries, 65(5): 372-382.
Usage Analysis: Blecic, D.D. (1999). “Measurements of journal use: an analysis of the correlations between three methods.” Bull Med Libr Assoc, 87(1): 20-25.
Service Analysis: Nichols, J.; Shaffer, B.; Shockey, K. (2003). “Changing the Face of Instruction: Is Online or In-class More Effective?” College & Research Libraries, 64(5): 378-389.

“Changing the Face of Instruction…”
Research Question: Is an online tutorial as effective in teaching library instruction as a classroom setting?
Hypotheses:
H1. Students will have higher scores in information literacy tests after library instruction.
H2. Students will have the same or higher scores in info-lit tests after taking online tutorials as students taking traditional instruction.
H3. Students will report as much or more satisfaction with online instruction as students taking traditional instruction.

“Changing the Face of Instruction…”
Variables & Data Collection: Variables: Test scores & survey results. Data Collection: Pretest/Posttest & Survey.
Statistical Tests: Desc Stats incl. mean, standard deviation, standard error; T-tests (1 & 2 tailed).
Conclusions:
Accept H1: Instruction improves literacy.
Accept H2 alternative hypothesis: Online has no significant difference from traditional.
Accept H3 alternative hypothesis: Student satisfaction is equal with both methods.

“Do Open-Access Articles…” Research Question, Hypothesis, Variables and Data Collection, Statistical Tests, Conclusions, Critical Questions.

“Do Open-Access Articles…”
Research Question: Do freely available articles have a greater research impact?
Measures: Research impact: citation rates. Open Access: freely available.
Hypotheses:
H1. Scholarly articles have a greater research impact if the articles are freely available online than if they are not.
Ho (null hypothesis): There is no difference between the mean citation rates: Ho: d1 = d0

“Do Open-Access Articles…”
Variables & Data Collection: Variables: Mean citation rates. Data Collection: At least 50 articles from 10 leading journals in 4 disciplines.
Statistical Tests: Desc Stats incl. mean, standard deviation, standard error; Wilcoxon signed-rank.
Conclusions: Reject Ho: Open Access articles are cited more than those that are not OA.
Discussion: Validity? Reliability of Measures? Generalizability? Alternate hypotheses?

My favorite statistic… Baseball is 90% mental – the other half is physical.
