InnerSoft STATS
Methods and Formulas Help
METHODS AND FORMULAS HELP V2.1 InnerSoft STATS
Mean
The arithmetic mean is the sum of a collection of numbers divided by the count of numbers in the collection:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Sample Variance
The estimator of population variance, also called the unbiased sample variance, is:
$$S^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$

Source: http://en.wikipedia.org/wiki/Variance
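As an illustration, the mean and the unbiased sample variance can be sketched in plain Python (hypothetical helper functions, not part of ISSTATS):

```python
def mean(xs):
    # Arithmetic mean: sum divided by the count.
    return sum(xs) / len(xs)

def sample_variance(xs):
    # Unbiased estimator of the population variance (n - 1 denominator).
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(mean(data))             # 5.0
print(sample_variance(data))  # 32/7 ~ 4.5714
```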
Sample Kurtosis
The estimator of population kurtosis is:

$$G_2 = \frac{k_4}{k_2^2} = \frac{(n+1)\,n}{(n-1)(n-2)(n-3)} \cdot \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4}{k_2^2} - 3\,\frac{(n-1)^2}{(n-2)(n-3)}$$

where $k_2$ is the unbiased sample variance. The standard error of the sample kurtosis of a sample of size n from the normal distribution is:

$$K\ Std.\ Error = \sqrt{\frac{4\,[6n(n-1)^2(n+1)]}{(n-3)(n-2)(n+1)(n+3)(n+5)}} = \sqrt{\frac{24\,n(n-1)^2}{(n-3)(n-2)(n+3)(n+5)}}$$

Source: http://en.wikipedia.org/wiki/Kurtosis#Estimators_of_population_kurtosis
Sample Skewness
Skewness of a population sample is estimated by the adjusted Fisher–Pearson standardized moment
coefficient:
$$G = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^3$$

where n is the sample size and s is the sample standard deviation.
The standard error of the skewness of a sample of size n from a normal distribution is:

$$G\ Std.\ Error = \sqrt{\frac{6n(n-1)}{(n-2)(n+1)(n+3)}}$$

Source: https://en.wikipedia.org/wiki/Skewness#Sample_skewness
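A minimal sketch of the two G estimators and the skewness standard error above, in plain Python (function names are illustrative, not part of ISSTATS):

```python
import math

def sample_skewness(xs):
    # Adjusted Fisher-Pearson standardized moment coefficient G.
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))  # sample std dev
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

def sample_kurtosis(xs):
    # G2 estimator of the population excess kurtosis.
    n = len(xs)
    m = sum(xs) / n
    k2 = sum((x - m) ** 2 for x in xs) / (n - 1)  # unbiased sample variance
    return ((n + 1) * n / ((n - 1) * (n - 2) * (n - 3))
            * sum((x - m) ** 4 for x in xs) / k2 ** 2
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

def skewness_std_error(n):
    # Standard error of G under normality.
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
```

A symmetric sample such as [1, 2, 3, 4, 5] has skewness 0 and a negative (platykurtic) G2.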
Total Variance
Variance of the entire population is:
$$\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$
Source: http://en.wikipedia.org/wiki/Variance
Total Kurtosis
Kurtosis of the entire population is:
$$G_2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4 / n}{\sigma^4} - 3$$

where n is the number of observations and σ is the total standard deviation.
Source: http://en.wikipedia.org/wiki/Kurtosis
Total Skewness
Skewness of the entire population is:
$$G = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3 / n}{\sigma^3}$$

where n is the number of observations and σ is the total standard deviation.
Source: https://en.wikipedia.org/wiki/Skewness
Quantiles of a population
ISSTATS uses the same method as R-7, the Excel CUARTIL.INC function, SciPy (1,1), SPSS, and Minitab. $Q_p$, the estimate for the kth q-quantile, where p = k/q and h = (N−1)p + 1, is computed by

$$Q_p = x_{\lfloor h \rfloor} + (h - \lfloor h \rfloor)\big(x_{\lfloor h \rfloor + 1} - x_{\lfloor h \rfloor}\big)$$

This is linear interpolation of the modes of the order statistics for the uniform distribution on [0, 1]. When p = 1, use $x_N$.
Source: http://en.wikipedia.org/wiki/Quantile#Estimating_the_quantiles_of_a_population
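A sketch of the R-7 interpolation rule, assuming plain Python (the function name is ours, not part of ISSTATS):

```python
import math

def quantile_r7(xs, p):
    # R-7 / Excel QUARTILE.INC method: h = (N - 1) p + 1 on the sorted
    # sample, then linear interpolation between adjacent order statistics.
    s = sorted(xs)
    n = len(s)
    if p >= 1.0:
        return s[-1]            # when p = 1, use the largest observation
    h = (n - 1) * p + 1         # 1-based fractional position
    lo = math.floor(h)
    return s[lo - 1] + (h - lo) * (s[lo] - s[lo - 1])

print(quantile_r7([1, 2, 3, 4], 0.5))   # 2.5
print(quantile_r7([1, 2, 3, 4], 0.25))  # 1.75
```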
MSSD (Mean of the squared successive differences)
It is calculated as half the mean of the squared differences between consecutive observations:

$$MSSD = \frac{\sum_{i=1}^{n-1}(x_{i+1} - x_i)^2}{2(n-1)}$$
The MSSD has the desirable property that one half the MSSD is an unbiased estimator of true variance.
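The MSSD formula can be sketched as follows (illustrative Python, not the ISSTATS implementation):

```python
def mssd(xs):
    # Half the mean of squared successive differences; there are n - 1
    # consecutive differences in a series of n observations.
    n = len(xs)
    total = sum((xs[i + 1] - xs[i]) ** 2 for i in range(n - 1))
    return total / (2 * (n - 1))

print(mssd([1.0, 2.0, 4.0]))  # (1 + 4) / 4 = 1.25
```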
Pearson Chi Square Test
The value of the test-statistic is
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

Where

- $\chi^2$ is Pearson's cumulative test statistic, which asymptotically approaches a $\chi^2$ distribution with (r − 1)(c − 1) degrees of freedom.
- $O_i$ is the number of observations of type i.
- $E_i$ is the expected (theoretical) frequency of type i.
Yates's Continuity Correction
The value of the test-statistic is
$$\chi^2 = \sum_{i=1}^{n} \frac{\big(\max\{0,\ |O_i - E_i| - 0.5\}\big)^2}{E_i}$$
When $|O_i - E_i| - 0.5$ is below zero, the term contributes zero. The effect of Yates's correction is to prevent overestimation of statistical significance for small data sets. The formula is chiefly used when at least one cell of the table has an expected count smaller than 5.
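A sketch of both statistics, with and without the continuity correction (illustrative Python; inputs are flat lists of observed and expected counts):

```python
def pearson_chi2(observed, expected):
    # Sum of (O - E)^2 / E over all cells.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def yates_chi2(observed, expected):
    # Continuity-corrected version: each |O - E| is shrunk by 0.5, and a
    # cell whose corrected difference would go negative contributes zero.
    return sum(max(0.0, abs(o - e) - 0.5) ** 2 / e
               for o, e in zip(observed, expected))

print(pearson_chi2([10, 20, 30], [20, 20, 20]))  # 10.0
print(yates_chi2([10, 20, 30], [20, 20, 20]))    # 9.025
```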
Likelihood Ratio G-Test
The value of the test-statistic is
$$G = 2 \sum_{i=1}^{r} \sum_{j=1}^{c} O_{ij} \ln\!\left(\frac{O_{ij}}{E_{ij}}\right)$$

where

- $O_{ij}$ is the observed count in row i and column j
- $E_{ij}$ is the expected count in row i and column j

G has an asymptotically approximate $\chi^2$ distribution with (r − 1)(c − 1) degrees of freedom when the null hypothesis is true and n is large enough.
Mantel-Haenszel Chi-Square Test
The Mantel-Haenszel chi-square statistic tests the alternative hypothesis that there is a linear association
between the row variable and the column variable. Both variables must lie on an ordinal scale. The
Mantel-Haenszel chi-square statistic is computed as:
$$Q_{MH} = (n - 1)\,r^2$$

where r is the Pearson correlation between the row variable and the column variable and n is the sample size. Under the null hypothesis of no association, $Q_{MH}$ has an asymptotic chi-square distribution with one degree of freedom.
Fisher's Exact Test
Fisher’s exact test assumes that the row and column totals are fixed, and then uses the hypergeometric
distribution to compute probabilities of possible tables conditional on the observed row and column totals.
Fisher’s exact test does not depend on any large-sample distribution assumptions, and so it is appropriate
even for small sample sizes and for sparse tables. This test is computed for 2×2 tables such as

$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

For efficient computation, the elements of the matrix A are reordered as

$$A' = \begin{pmatrix} a' & b' \\ c' & d' \end{pmatrix}$$

where a′ is the cell of A that has the minimum marginals (minimum row and column totals). The test result does not depend on the arrangement of the cells.
The left-sided p-value sums the probabilities of all tables that have an equal or smaller a′:

$$p_{left} = P(x \le a') = \sum_{i=0}^{a'} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}$$

where $K = a' + b'$, $N = a' + b' + c' + d'$ and $n = a' + c'$.
The right-sided p-value sums the probabilities of all tables that have an equal or larger a′:

$$p_{right} = P(x \ge a') = \sum_{i=a'}^{K} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}$$
Most statistical packages output, as the one-sided test result, the minimum of $p_{left}$ and $p_{right}$. The Fisher two-tailed p-value for a table A is defined as the sum of the probabilities of all tables consistent with the marginals that are as likely as, or less likely than, the observed table.
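A sketch of the three Fisher p-values under the hypergeometric model (illustrative Python; this version works directly on the given cell a without the reordering step, which is only an efficiency device, and the small tolerance in the two-tailed comparison guards against floating-point noise):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    # Hypergeometric tail probabilities for a 2x2 table with fixed margins;
    # x is the count in the first cell.
    N = a + b + c + d
    K = a + b          # row total containing a
    n = a + c          # column total containing a
    def p(x):
        return comb(K, x) * comb(N - K, n - x) / comb(N, n)
    support = range(max(0, n - (N - K)), min(K, n) + 1)
    p_left = sum(p(x) for x in support if x <= a)
    p_right = sum(p(x) for x in support if x >= a)
    # Two-tailed: all tables no more likely than the observed one.
    p_obs = p(a)
    p_two = sum(p(x) for x in support if p(x) <= p_obs + 1e-12)
    return p_left, p_right, p_two
```

For a perfectly balanced table (2, 2; 2, 2) the two-tailed p-value is 1.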
McNemar's Test
This test is computed for 2×2 tables such as

$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$
The value of the test-statistic is
$$\chi^2 = \frac{(b - c)^2}{b + c}$$

The statistic is asymptotically distributed as chi-squared with 1 degree of freedom.
Edwards Continuity Correction
The value of the test-statistic is
$$\chi^2 = \frac{\big(\max\{0,\ |b - c| - 1\}\big)^2}{b + c}$$

When $|b - c| - 1$ is below zero, the statistic is zero. The statistic is asymptotically distributed as chi-squared with 1 degree of freedom.
McNemar Exact Binomial
Assume that b < c. Let n = b + c, and let B(x, n, p) be the binomial probability mass function. Then

$$\text{two-sided } p\text{-value} = 2 \cdot (\text{one-sided } p\text{-value}) = 2\sum_{x=0}^{b} B(x, n, 0.5) = 2\sum_{x=0}^{b}\binom{n}{x}\,0.5^x\,0.5^{n-x} = 2\cdot\frac{1}{2^n}\sum_{x=0}^{b}\binom{n}{x}$$
If b = c, the exact p-value equals 1.0.
Mid-P McNemar Test
Assume that b < c and let n = b + c.

$$\text{Mid-}p\text{ value} = 2\sum_{x=0}^{b} B(x, n, 0.5) - B(b, n, 0.5) = 2\cdot\frac{1}{2^n}\sum_{x=0}^{b}\binom{n}{x} - \binom{n}{b}\cdot\frac{1}{2^n}$$

If b = c, the mid p-value is

$$1.0 - \frac{1}{2}\binom{n}{b}\cdot\frac{1}{2^n}$$
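Both McNemar variants can be sketched as follows (illustrative Python, not the ISSTATS implementation):

```python
from math import comb

def mcnemar_exact(b, c):
    # Exact binomial McNemar test on the discordant pairs: twice the
    # smaller binomial tail at p = 0.5, capped at 1.
    n = b + c
    k = min(b, c)
    if b == c:
        return 1.0
    tail = sum(comb(n, x) for x in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def mcnemar_midp(b, c):
    # Mid-p variant: subtract the point probability of the observed count.
    n = b + c
    k = min(b, c)
    point = comb(n, k) / 2 ** n
    if b == c:
        return 1.0 - 0.5 * point
    tail = sum(comb(n, x) for x in range(k + 1)) / 2 ** n
    return 2 * tail - point

print(mcnemar_exact(1, 5))  # 14/64 = 0.21875
print(mcnemar_midp(1, 5))   # 8/64 = 0.125
```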
Bowker’s Test of Symmetry
This test is computed for an m-by-m square table as:

$$BW = \sum_{i<j} \frac{(n_{ij} - n_{ji})^2}{n_{ij} + n_{ji}}$$

For large samples, BW has an asymptotic chi-square distribution with m(m − 1)/2 − R degrees of freedom under the null hypothesis of symmetry, where R is the number of off-diagonal cells with $n_{ij} + n_{ji} = 0$.
Risk Test
Let the data be arranged as:

                   Disease status
Risk Factor        Cohort = Present    Cohort = Absent
Present            a                   b
Absent             c                   d
Odds ratio
The odds ratio (Risk Factor = Present / Risk Factor = Absent) is computed as:
$$OR = \frac{a/b}{c/d}$$

The distribution of the log odds ratio is approximately normal:

$$X \sim N\big(\log(OR),\ \sigma^2\big)$$

The standard error for the log odds ratio is approximately

$$SE = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

The 95% confidence interval for the odds ratio is computed as

$$\left[\exp\big(\log(OR) - z_{0.025}\,SE\big)\ ;\ \exp\big(\log(OR) + z_{0.025}\,SE\big)\right]$$

To test the hypothesis that the population odds ratio equals one, the two-sided p-value is computed as

$$significance\ (2\text{-}sided) = 2\,P\!\left(z \le \frac{-|\log(OR)|}{SE}\right)$$

Source: https://en.wikipedia.org/wiki/Odds_ratio
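A sketch of the odds ratio and its 95% interval (illustrative Python; the z value is the 0.025 upper quantile of the standard normal):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.959964):
    # Logit (Woolf) interval for the odds ratio of a 2x2 table.
    or_ = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

print(odds_ratio_ci(10, 20, 5, 40))  # OR = 4.0 with its interval
```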
Relative Risk
The relative risk (for cohort Disease status = Present) is computed as
$$RR = \frac{a/(a+b)}{c/(c+d)}$$

The distribution of the log relative risk is approximately normal:

$$X \sim N\big(\log(RR),\ \sigma^2\big)$$

The standard error for the log relative risk is approximately

$$SE = \sqrt{\frac{1}{a} + \frac{1}{c} - \frac{1}{a+b} - \frac{1}{c+d}}$$

The 95% confidence interval for the relative risk is computed as

$$\left[\exp\big(\log(RR) - z_{0.025}\,SE\big)\ ;\ \exp\big(\log(RR) + z_{0.025}\,SE\big)\right]$$

To test the hypothesis that the population relative risk equals one, the two-sided p-value is computed as

$$significance\ (2\text{-}sided) = 2\,P\!\left(z \le \frac{-|\log(RR)|}{SE}\right)$$
The relative risk (for cohort Disease status = Absent) is computed as

$$RR = \frac{b/(a+b)}{d/(c+d)}$$
Epidemiology Risk
All the parameters are computed for cohort Disease status = Present.
Attributable risk represents how much the risk factor increases or decreases the risk of disease:

$$AR = \frac{a}{a+b} - \frac{c}{c+d}$$

If AR > 0 there is an increase of the risk. If AR < 0 there is a reduction of the risk.
Relative Attributable Risk

$$\frac{\dfrac{a}{a+b} - \dfrac{c}{c+d}}{\dfrac{c}{c+d}} = \frac{AR}{c/(c+d)}$$

Number Needed to Harm

$$NNH = \frac{1}{\dfrac{a}{a+b} - \dfrac{c}{c+d}} = \frac{1}{AR}$$
The number needed to harm (NNH) is an epidemiological measure that indicates how many patients on
average need to be exposed to a risk-factor over a specific period to cause harm in an average of one
patient who would not otherwise have been harmed.
A negative number would not be presented as a NNH, rather, as the risk factor is not harmful, it is
expressed as a number needed to treat (NNT) or number needed to avoid to expose to risk.
Attributable risk per unit

$$ARP = \frac{RR - 1}{RR}$$

Preventive fraction

$$PF = 1 - RR$$

The etiologic fraction is the proportion of cases in which the exposure has played a causal role in disease development.

$$EF = \frac{a - c}{a}$$

Similar parameters are computed for cohort Disease status = Absent.
Source: https://en.wikipedia.org/wiki/Relative_risk
Cohen's Kappa Test
Given a k-by-k square matrix, which collects the scores of two raters who each classify N items into k mutually exclusive categories, the equation for Cohen's kappa coefficient is
$$\hat{\kappa} = \frac{p_o - p_e}{1 - p_e}$$

Where

$$p_o = \sum_{i=1}^{k} \frac{n_{ii}}{N} = \sum_{i=1}^{k} p_{ii} \qquad and \qquad p_e = \sum_{i=1}^{k} p_{i.}\,p_{.i}$$

where

$$p_{ij} = \frac{n_{ij}}{N} \qquad p_{i.} = \sum_{j=1}^{k} \frac{n_{ij}}{N} \qquad p_{.j} = \sum_{i=1}^{k} \frac{n_{ij}}{N}$$
The asymptotic variance is computed by
$$var(\hat{\kappa}) = \frac{1}{N(1-p_e)^4}\left\{ \sum_{i=1}^{k} p_{ii}\big[(1-p_e) - (p_{.i} + p_{i.})(1 - p_o)\big]^2 + (1-p_o)^2 \sum_{i=1}^{k}\sum_{j=1,\,j\ne i}^{k} p_{ij}(p_{.i} + p_{j.})^2 - (p_o p_e - 2p_e + p_o)^2 \right\}$$

The formula is given by Fleiss, Cohen, and Everitt (1969), and modified by Fleiss (1981). The asymptotic standard error is the square root of the value given above. This standard error and the standard normal distribution N(0,1) are used to compute confidence intervals:

$$\hat{\kappa} \pm z_{\alpha/2}\sqrt{var(\hat{\kappa})}$$
To compute an asymptotic test for the kappa coefficient, ISSTATS uses a standardized test statistic T
which has an asymptotic standard normal distribution under the null hypothesis that kappa equals zero
(H0: k = 0). The standardized test statistic is computed as
$$T = \frac{\hat{\kappa}}{\sqrt{var_0(\hat{\kappa})}} \approx N(0,1)$$

Where the variance of the kappa coefficient under the null hypothesis is

$$var_0(\hat{\kappa}) = \frac{1}{N(1-p_e)^2}\left\{ p_e + p_e^2 - \sum_{i=1}^{k} p_{.i}\,p_{i.}(p_{.i} + p_{i.}) \right\}$$

Refer to Fleiss (1981).
Source: https://v8doc.sas.com/sashtml/stat/chap28/sect26.htm
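A sketch of the kappa point estimate (illustrative Python; the asymptotic variance formulas are omitted for brevity):

```python
def cohen_kappa(table):
    # table[i][j]: number of items rater 1 put in class i and rater 2 in j.
    k = len(table)
    N = sum(sum(row) for row in table)
    p = [[table[i][j] / N for j in range(k)] for i in range(k)]
    p_row = [sum(p[i]) for i in range(k)]                        # p_i.
    p_col = [sum(p[i][j] for i in range(k)) for j in range(k)]   # p_.j
    p_o = sum(p[i][i] for i in range(k))          # observed agreement
    p_e = sum(p_row[i] * p_col[i] for i in range(k))  # chance agreement
    return (p_o - p_e) / (1 - p_e)

print(cohen_kappa([[20, 5], [10, 15]]))  # 0.4
```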
Nominal by Nominal Measures of Association
Contingency Coefficient
The contingency coefficient is a measure of association between two nominal variables, giving a value between 0 and 1.

$$C = \sqrt{\frac{\chi^2}{\chi^2 + N}}$$

Where

- $\chi^2$ is Pearson's cumulative test statistic, which asymptotically approaches a $\chi^2$ distribution with (r − 1)(c − 1) degrees of freedom.
- N is the total sample size.
Standardized Contingency Coefficient
If X and Y have the same number of categories (r = c), then the maximum value for the contingency
coefficient is calculated as:
$$c_{max} = \sqrt{\frac{r-1}{r}}$$

If X and Y have a differing number of categories (r ≠ c), then the maximum value for the contingency coefficient is calculated as

$$c_{max} = \sqrt[4]{\frac{(r-1)(c-1)}{r\,c}}$$
The standardized contingency coefficient is calculated as the ratio:
$$c_{Standardized} = \frac{C}{c_{max}}$$

which varies between 0 and 1, with 0 indicating independence and 1 dependence.
Phi coefficient
The phi coefficient is a measure of association for two nominal variables.
$$\Phi = \sqrt{\frac{\chi^2}{N}}$$

Where

- $\chi^2$ is Pearson's cumulative test statistic, which asymptotically approaches a $\chi^2$ distribution with (r − 1)(c − 1) degrees of freedom.
- N is the total sample size.
Cramer's V
Cramer's V is a measure of association between two nominal variables, giving a value between 0 and +1
(inclusive).
$$V = \sqrt{\frac{\chi^2 / N}{\min\{r-1,\ c-1\}}}$$

Where

- $\chi^2$ is Pearson's cumulative test statistic, which asymptotically approaches a $\chi^2$ distribution with (r − 1)(c − 1) degrees of freedom.
- N is the total sample size.
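The chi-square-based measures above can be sketched together (illustrative Python; the chi-square helper computes expected counts from the table margins):

```python
import math

def chi2_stat(table):
    # Pearson chi-square for an r x c contingency table of counts.
    r, c = len(table), len(table[0])
    N = sum(sum(row) for row in table)
    row = [sum(table[i]) for i in range(r)]
    col = [sum(table[i][j] for i in range(r)) for j in range(c)]
    return sum((table[i][j] - row[i] * col[j] / N) ** 2 / (row[i] * col[j] / N)
               for i in range(r) for j in range(c))

def contingency_c(table):
    chi2 = chi2_stat(table)
    return math.sqrt(chi2 / (chi2 + sum(sum(row) for row in table)))

def phi(table):
    return math.sqrt(chi2_stat(table) / sum(sum(row) for row in table))

def cramers_v(table):
    N = sum(sum(row) for row in table)
    m = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi2_stat(table) / N / m)

print(cramers_v([[10, 0], [0, 10]]))  # 1.0 (perfect association)
```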
Tschuprow's T
Tschuprow's T is a measure of association between two nominal variables, giving a value between 0 and
1 (inclusive).
$$T = \sqrt{\frac{\chi^2 / N}{\sqrt{(r-1)(c-1)}}}$$
Lambda
Asymmetric lambda, λ(C/R) or column variable dependent, is interpreted as the probable improvement in
predicting the column variable Y given knowledge of the row variable X. The range of asymmetric
lambda is {0, 1}. Asymmetric lambda (C/R) or column variable dependent is computed as
$$\lambda(C/R) = \frac{\sum_i r_i - r}{N - r}$$

The asymptotic variance is

$$var\big(\lambda(C/R)\big) = \frac{N - \sum_i r_i}{(N - r)^3}\left\{ \sum_i r_i + r - 2\sum_i (r_i \mid l_i = l) \right\}$$
Where

$$r_i = \max_j\{n_{ij}\}\,, \quad r = \max_j\{n_{.j}\}\,, \quad c_j = \max_i\{n_{ij}\}\,, \quad c = \max_i\{n_{i.}\}$$
The values of $l_i$ and l are determined as follows. Denote by $l_i$ the unique value of j such that $r_i = n_{ij}$, and let l be the unique value of j such that $r = n_{.j}$. Because of the uniqueness assumptions, ties in the frequencies or in the marginal totals must be broken in an arbitrary but consistent manner. In case of ties, l is defined as the smallest value of j such that $r = n_{.j}$.
For those columns containing a cell (i, j) for which $n_{ij} = r_i = c_j$, $cs_j$ records the row in which $c_j$ is assumed to occur. Initially $cs_j$ is set equal to −1 for all j. Beginning with i = 1, if there is at least one value j such that $n_{ij} = r_i = c_j$, and if $cs_j = -1$, then $l_i$ is defined to be the smallest such value of j, and $cs_j$ is set equal to i. Otherwise, if $n_{il} = r_i$, then $l_i$ is defined to be equal to l. If neither condition is true, then $l_i$ is taken to be the smallest value of j such that $n_{ij} = r_i$.
The asymptotic standard error is the square root of the asymptotic variance.
The formulas for lambda asymmetric λ(R/C) can be obtained by interchanging the indices.
$$\lambda(R/C) = \frac{\sum_j c_j - c}{N - c}$$

The symmetric lambda is the average of the two asymmetric lambdas, λ(C/R) and λ(R/C). Its range is {0, 1}. Lambda symmetric is computed as

$$\lambda = \frac{\sum_i r_i + \sum_j c_j - r - c}{2N - r - c}$$
The asymptotic variance is
$$var(\lambda) = \frac{1}{w^4}\left\{ wvy - 2w^2\Big[N - \sum_i\sum_j (n_{ij} \mid j = l_i,\ i = k_j)\Big] - 2v^2(N - n_{kl}) \right\}$$

Where

$$w = 2N - r - c\,, \qquad v = 2N - \sum_i r_i - \sum_j c_j\,,$$
$$x = \sum_i (r_i \mid l_i = l) + \sum_j (c_j \mid k_j = k) + r_k + c_l\,, \qquad y = 8N - w - v - 2x$$

The definitions of l and $l_i$ are given in the previous section. The values k and $k_j$ are defined in a similar way for lambda asymmetric (R/C).
Uncertainty Coefficient
The uncertainty coefficient U(C/R), or column variable dependent U, measures the proportion of uncertainty (entropy) in the column variable Y that is explained by the row variable X. Its range is {0, 1}. The uncertainty coefficient is computed as

$$U(C/R) = U_{column\ variable\ dependent} = \frac{H(X) + H(Y) - H(XY)}{H(Y)}$$
Where

$$H(X) = -\sum_i \frac{n_{i.}}{n}\ln\!\left(\frac{n_{i.}}{n}\right)\,, \quad H(Y) = -\sum_j \frac{n_{.j}}{n}\ln\!\left(\frac{n_{.j}}{n}\right)\,, \quad H(XY) = -\sum_i\sum_j \frac{n_{ij}}{n}\ln\!\left(\frac{n_{ij}}{n}\right)$$
The asymptotic variance is

$$var\big(U(C/R)\big) = \frac{1}{n^2\,H(Y)^4}\sum_i\sum_j n_{ij}\left\{ H(Y)\ln\!\left(\frac{n_{ij}}{n_{i.}}\right) + \big(H(X) - H(XY)\big)\ln\!\left(\frac{n_{.j}}{n}\right) \right\}^2$$

The asymptotic standard error is the square root of the asymptotic variance.
The formulas for the uncertainty coefficient U(R/C) can be obtained by interchanging the indices.
The symmetric uncertainty coefficient is computed as
$$U = \frac{2\,[H(X) + H(Y) - H(XY)]}{H(X) + H(Y)}$$

The asymptotic variance is

$$var(U) = 4\sum_i\sum_j \frac{n_{ij}\left\{ H(XY)\ln\!\left(\frac{n_{i.}\,n_{.j}}{n^2}\right) - \big(H(X) + H(Y)\big)\ln\!\left(\frac{n_{ij}}{n}\right) \right\}^2}{n^2\big(H(X) + H(Y)\big)^4}$$

The asymptotic standard error is the square root of the asymptotic variance.
Ordinal by Ordinal Measures of Association
Let $n_{ij}$ denote the observed frequency in cell (i, j) of an I×J contingency table. Let N be the total frequency and

$$A_{ij} = \sum_{k<i}\sum_{l<j} n_{kl} + \sum_{k>i}\sum_{l>j} n_{kl}\,, \qquad D_{ij} = \sum_{k>i}\sum_{l<j} n_{kl} + \sum_{k<i}\sum_{l>j} n_{kl}$$

$$P = \sum_i\sum_j n_{ij}\,A_{ij} \qquad and \qquad Q = \sum_i\sum_j n_{ij}\,D_{ij}$$
Gamma Coefficient
The gamma (G) statistic is based only on the number of concordant and discordant pairs of observations.
It ignores tied pairs (that is, pairs of observations that have equal values of X or equal values of Y).
Gamma is appropriate only when both variables lie on an ordinal scale. The range of gamma is {-1, 1}. If
the row and column variables are independent, then gamma tends to be close to zero.
Gamma is estimated by

$$G = \frac{P - Q}{P + Q}$$
The asymptotic variance is

$$var(G) = \frac{16}{(P+Q)^4}\left\{ \sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}\,(Q\,A_{ij} - P\,D_{ij})^2 \right\}$$

The asymptotic standard error is the square root of the asymptotic variance.
The variance under the null hypothesis that gamma equals zero is computed as

$$var_0(G) = \frac{4}{(P+Q)^2}\left\{ \sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}\,d_{ij}^2 - \frac{(P-Q)^2}{N} \right\}$$

Where $d_{ij} = A_{ij} - D_{ij}$. The asymptotic standard error under the null hypothesis that gamma equals zero is the square root of this variance.
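A direct, unoptimized sketch of the P and Q sums and the gamma estimate (illustrative Python, not the ISSTATS implementation):

```python
def concordance_counts(table):
    # P and Q as defined above: weighted sums of the A_ij and D_ij counts
    # of concordant and discordant cells relative to each cell (i, j).
    I, J = len(table), len(table[0])
    P = Q = 0
    for i in range(I):
        for j in range(J):
            A = sum(table[k][l] for k in range(I) for l in range(J)
                    if (k < i and l < j) or (k > i and l > j))
            D = sum(table[k][l] for k in range(I) for l in range(J)
                    if (k > i and l < j) or (k < i and l > j))
            P += table[i][j] * A
            Q += table[i][j] * D
    return P, Q

def gamma(table):
    P, Q = concordance_counts(table)
    return (P - Q) / (P + Q)

print(gamma([[10, 0], [0, 10]]))  # 1.0 (perfect positive association)
```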
Kendall's tau-b
Kendall’s tau-b is similar to gamma except that tau-b uses a correction for ties. Tau-b is appropriate only
when both variables lie on an ordinal scale. The range of tau-b is {-1, 1}. Kendall’s tau-b is estimated by
$$\tau_b = \frac{P - Q}{w}$$

Where

$$w_r = N^2 - \sum_i n_{i.}^2\,, \qquad w_c = N^2 - \sum_j n_{.j}^2\,, \qquad w = \sqrt{w_r\,w_c}$$

The asymptotic variance is

$$var(\tau_b) = \frac{1}{w^4}\left\{ \sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}(2w\,d_{ij} + \tau_b v_{ij})^2 - N^3\tau_b^2(w_r + w_c)^2 \right\}$$

where

$$v_{ij} = w_c\,n_{i.} + w_r\,n_{.j}$$
The asymptotic standard error is the square root of the asymptotic variance.
The variance under the null hypothesis that tau-b equals zero is computed as

$$var_0(\tau_b) = \frac{4}{w_r\,w_c}\left\{ \sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}\,d_{ij}^2 - \frac{(P-Q)^2}{N} \right\}$$

The asymptotic standard error under the null hypothesis that tau-b equals zero is the square root of this variance.
Stuart-Kendall's tau-c
Stuart-Kendall’s tau-c makes an adjustment for table size in addition to a correction for ties. Tau-c is
appropriate only when both variables lie on an ordinal scale. The range of tau-c is {-1, 1}. Stuart-Kendall's tau-c is estimated by

$$\tau_c = \frac{m(P - Q)}{N^2(m - 1)}$$

Where m = min{I, J}. The asymptotic variance is

$$var(\tau_c) = \frac{4m^2}{N^4(m-1)^2}\left\{ \sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}\,d_{ij}^2 - \frac{(P-Q)^2}{N} \right\}$$

The asymptotic standard error is the square root of the asymptotic variance.
The variance under the null hypothesis that tau-c equals zero is the same as the asymptotic variance.
Somers’ D
Somers’ D(C/R) and Somers’ D(R/C) are asymmetric modifications of tau-b. C/R indicates that the row
variable X is regarded as the independent variable and the column variable Y is regarded as dependent.
Similarly, R/C indicates that the column variable Y is regarded as the independent variable and the row
variable X is regarded as dependent. Somers’ D differs from tau-b in that it uses a correction only for
pairs that are tied on the independent variable. Somers’ D is appropriate only when both variables lie on
an ordinal scale. The range of Somers’ D is {-1, 1}. Somers’ D is computed as
$$D(C/R) = D_{column\ variable\ dependent} = \frac{P - Q}{w_r}$$
The asymptotic variance is

$$var\big(D(C/R)\big) = \frac{4}{w_r^4}\left\{ \sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}\big[w_r\,d_{ij} - (P-Q)(N - n_{i.})\big]^2 \right\}$$

The asymptotic standard error is the square root of the asymptotic variance.
The variance under the null hypothesis that D(C/R) equals zero is computed as

$$var_0\big(D(C/R)\big) = \frac{4}{w_r^2}\left\{ \sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}\,d_{ij}^2 - \frac{(P-Q)^2}{N} \right\}$$

The asymptotic standard error under the null hypothesis that D(C/R) equals zero is the square root of this variance.
Formulas for Somers’ D(R/C) are obtained by interchanging the indices.
The symmetric version of Somers’ d is

$$d = \frac{P - Q}{(w_r + w_c)/2}$$

The standard error is

$$ASE(d) = \frac{2\,\sigma_{\tau_b}\,w}{w_r + w_c}$$

where $\sigma_{\tau_b}$ is the asymptotic standard error of Kendall’s tau-b.
The variance under the null hypothesis that d equals zero is computed as

$$var_0(d) = \frac{16}{(w_r + w_c)^2}\left\{ \sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}\,d_{ij}^2 - \frac{(P-Q)^2}{N} \right\}$$

The asymptotic standard error under the null hypothesis that d equals zero is the square root of this variance.
Confidence Bounds and One-Sided Tests
Suppose you are testing the null hypothesis H0: θ ≥ θ0 against the one-sided alternative H1: θ < θ0. Rather than give a two-sided confidence interval for θ, the more appropriate procedure in this setting is to give an upper confidence bound. This upper confidence bound has a direct relationship to the one-sided test, namely:

1. A level α test of H0: θ ≥ θ0 against the one-sided alternative H1: θ < θ0 rejects H0 exactly when the value θ0 is above the 1−α upper confidence bound.
2. A level α test of H0: θ ≤ θ0 against the one-sided alternative H1: θ > θ0 rejects H0 exactly when the value θ0 is below the 1−α lower confidence bound.
ANOVA Test
$$SS_{Total} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{Y}_{..})^2$$

$$SS_{Inter} = \sum_{i=1}^{k} n_i(\bar{Y}_{i.} - \bar{Y}_{..})^2$$

$$SS_{Intra} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{Y}_{i.})^2 = SS_{Total} - SS_{Inter}$$
DF Total = N – 1
DF Inter = k – 1
DF Intra = N – k
$$MS_{Total} = \frac{SS_{Total}}{DF_{Total}} \qquad MS_{Inter} = \frac{SS_{Inter}}{DF_{Inter}} \qquad MS_{Intra} = \frac{SS_{Intra}}{DF_{Intra}} \qquad F = \frac{MS_{Inter}}{MS_{Intra}}$$
where

- F is the result of the test
- k is the number of different groups to which the sampled cases belong
- $N = \sum_{i=1}^{k} n_i$ is the total sample size
- $n_i$ is the number of cases in the i-th group
- $y_{ij}$ is the value of the measured variable for the j-th case from the i-th group
- $\bar{Y}_{..}$ is the mean of all $y_{ij}$
- $\bar{Y}_{i.}$ is the mean of the $y_{ij}$ for group i.
The test statistic has an F-distribution with $DF_{Inter}$ and $DF_{Intra}$ degrees of freedom. Thus the null hypothesis is rejected if $F \ge F(1-\alpha)_{k-1,\,N-k}$.
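A sketch of the one-way ANOVA F statistic (illustrative Python; `groups` is a list of lists of measurements):

```python
def one_way_anova_f(groups):
    # Returns (F, DF_Inter, DF_Intra) for a one-way layout.
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / N
    ss_inter = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_intra = sum(sum((y - sum(g) / len(g)) ** 2 for y in g) for g in groups)
    ms_inter = ss_inter / (k - 1)
    ms_intra = ss_intra / (N - k)
    return ms_inter / ms_intra, k - 1, N - k

print(one_way_anova_f([[1, 2, 3], [4, 5, 6]]))  # (13.5, 1, 4)
```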
ANOVA Multiple Comparisons
Difference of Means
$$\bar{y}_i - \bar{y}_j$$

Standard Error of the Difference of Means Estimator

$$Std.\ Error = \sqrt{MS_{Intra}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
Scheffe’s Method
Confidence Interval for Difference of Means
$$CI(1-\alpha) = \bar{y}_i - \bar{y}_j \pm \sqrt{DF_{Inter}\; MS_{Intra}\; F(1-\alpha)_{DF_{Inter},\,DF_{Intra}}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

Source: http://en.wikipedia.org/wiki/Scheff%C3%A9%27s_method
Tukey's range test HSD
Confidence Interval for Difference of Means
$$CI(1-\alpha) = \bar{y}_i - \bar{y}_j \pm q(1-\alpha)_{k,\,DF_{Intra}}\sqrt{\frac{MS_{Intra}}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

Where q is the studentized range distribution.
Source: https://en.wikipedia.org/wiki/Tukey%27s_range_test
Fisher's Method LSD
If the overall ANOVA test is not significant, you must not consider any results of the Fisher test, significant or not.
Confidence Interval for Difference of Means

$$CI(1-\alpha) = \bar{y}_i - \bar{y}_j \pm t(1-\alpha/2)_{DF_{Intra}}\sqrt{MS_{Intra}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

Where t is the Student's t distribution.
Bonferroni's Method
The family-wise significance level (FWER) is α = 1 − Confidence Level. Thus any comparison flagged by ISSTATS as significant is based on a Bonferroni correction:

$$\alpha' = \frac{2\alpha}{k(k-1)} \qquad p' = p\,\frac{k(k-1)}{2}$$

Where k is the number of groups.
Confidence Interval for Difference of Means

$$CI(1-\alpha) = \bar{y}_i - \bar{y}_j \pm t(1-\alpha'/2)_{DF_{Intra}}\sqrt{MS_{Intra}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

Where t is the Student's t distribution.
Sidak's Method
The family-wise significance level (FWER) is α = 1 − Confidence Level. So any comparison flagged by ISSTATS as significant is based on a Sidak correction:

$$\alpha' = 1 - (1-\alpha)^{\frac{2}{k(k-1)}} \qquad p' = 1 - e^{\log(1-p)\,\frac{k(k-1)}{2}} = 1 - (1-p)^{\frac{k(k-1)}{2}}$$

Where k is the number of groups.
Confidence Interval for Difference of Means

$$CI(1-\alpha) = \bar{y}_i - \bar{y}_j \pm t(1-\alpha'/2)_{DF_{Intra}}\sqrt{MS_{Intra}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

Where t is the Student's t distribution.
Welch’s Test for equality of means
The test statistic, F*, is defined as follows:

$$F^{*} = \frac{\dfrac{1}{k-1}\displaystyle\sum_{i=1}^{k} w_i(\bar{x}_i - \tilde{X})^2}{1 + \dfrac{2(k-2)}{k^2 - 1}\displaystyle\sum_{i=1}^{k} h_i}$$
where

- F* is the result of the test
- k is the number of different groups to which the sampled cases belong
- $n_i$ is the number of cases in the i-th group
- $w_i = \dfrac{n_i}{S_i^2}$
- $W = \sum_{i=1}^{k} w_i = \sum_{i=1}^{k} \dfrac{n_i}{S_i^2}$
- $\tilde{X} = \dfrac{\sum_{i=1}^{k} w_i \bar{x}_i}{W}$
- $h_i = \dfrac{(1 - w_i/W)^2}{n_i - 1}$

The test statistic has approximately an F-distribution with k − 1 and $df = \dfrac{k^2 - 1}{3\sum_{i=1}^{k} h_i}$ degrees of freedom. Thus the null hypothesis is rejected if $F^{*} \ge F(1-\alpha)_{k-1,\,df}$.
Brown–Forsythe Test for equality of means
The test statistic, F*, is defined as follows:

$$F^{*} = \frac{\displaystyle\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{X}_{..})^2}{\displaystyle\sum_{i=1}^{k}\left(1 - \frac{n_i}{N}\right) S_i^2}$$

where

- F* is the result of the test
- k is the number of different groups to which the sampled cases belong
- $n_i$ is the number of cases in the i-th group (sample size of group i)
- $N = \sum_{i=1}^{k} n_i$ is the total sample size
- $\bar{X}_{..} = \dfrac{\sum_{i=1}^{k} n_i \bar{x}_i}{N}$ is the overall mean.
The test statistic has approximately an F-distribution with k − 1 and df degrees of freedom, where df is obtained with the Satterthwaite (1941) approximation as

$$\frac{1}{df} = \sum_{i=1}^{k} \frac{c_i^2}{n_i - 1}$$

with

$$c_j = \frac{\left(1 - \dfrac{n_j}{N}\right) S_j^2}{\displaystyle\sum_{i=1}^{k}\left(1 - \frac{n_i}{N}\right) S_i^2}$$

Thus the null hypothesis is rejected if $F^{*} \ge F(1-\alpha)_{k-1,\,df}$.
Homoscedasticity Tests
Levene's Test
The test statistic, F, is defined as follows:
$$F = \frac{N-k}{k-1} \cdot \frac{\displaystyle\sum_{i=1}^{k} n_i(\bar{Z}_{i.} - \bar{Z}_{..})^2}{\displaystyle\sum_{i=1}^{k}\sum_{j=1}^{n_i}(Z_{ij} - \bar{Z}_{i.})^2}$$
where

- F is the result of the test
- k is the number of different groups to which the sampled cases belong
- $N = \sum_{i=1}^{k} n_i$ is the total sample size
- $n_i$ is the number of cases in the i-th group
- $Y_{ij}$ is the value of the measured variable for the j-th case from the i-th group
- $Z_{ij} = |Y_{ij} - \bar{Y}_{i.}|$ where $\bar{Y}_{i.}$ is the mean of the i-th group
- $\bar{Z}_{..}$ is the mean of all $Z_{ij}$
- $\bar{Z}_{i.}$ is the mean of the $Z_{ij}$ for group i.

The test statistic has an F-distribution with k − 1 and N − k degrees of freedom. Thus the null hypothesis is rejected if $F \ge F(1-\alpha)_{k-1,\,N-k}$.
Source: http://en.wikipedia.org/wiki/Levene%27s_test
Brown–Forsythe Test for equality of variances
The test statistic, F, is defined as follows:
$$F = \frac{N-k}{k-1} \cdot \frac{\displaystyle\sum_{i=1}^{k} n_i(\bar{Z}_{i.} - \bar{Z}_{..})^2}{\displaystyle\sum_{i=1}^{k}\sum_{j=1}^{n_i}(Z_{ij} - \bar{Z}_{i.})^2}$$

where

- F is the result of the test
- k is the number of different groups to which the sampled cases belong
- $N = \sum_{i=1}^{k} n_i$ is the total sample size
- $n_i$ is the number of cases in the i-th group
- $Y_{ij}$ is the value of the measured variable for the j-th case from the i-th group
- $Z_{ij} = |Y_{ij} - \tilde{Y}_{i.}|$ where $\tilde{Y}_{i.}$ is the median of the i-th group
- $\bar{Z}_{..}$ is the mean of all $Z_{ij}$
- $\bar{Z}_{i.}$ is the mean of the $Z_{ij}$ for group i.

The test statistic has an F-distribution with k − 1 and N − k degrees of freedom. Thus the null hypothesis is rejected if $F \ge F(1-\alpha)_{k-1,\,N-k}$.
Source: http://en.wikipedia.org/wiki/Levene%27s_test
Bartlett's Test
Bartlett's test is used to test the null hypothesis, H0 that all k population variances are equal against the
alternative that at least two are different.
If there are k samples with sizes $n_i$ and sample variances $S_i^2$, then Bartlett's test statistic is

$$\chi^2 = \frac{(N-k)\ln(S_p^2) - \displaystyle\sum_{i=1}^{k}(n_i - 1)\ln(S_i^2)}{1 + \dfrac{1}{3(k-1)}\left(\displaystyle\sum_{i=1}^{k}\frac{1}{n_i - 1} - \frac{1}{N-k}\right)}$$

where

- $N = \sum_{i=1}^{k} n_i$ is the total sample size
- $S_p^2 = \dfrac{\sum_{i=1}^{k}(n_i - 1)S_i^2}{N-k}$ is the pooled estimate of the variance.
The test statistic has approximately a chi-squared distribution with k − 1 degrees of freedom. Thus the null hypothesis is rejected if $\chi^2 \ge \chi^2_{k-1}(1-\alpha)$.
Source: http://en.wikipedia.org/wiki/Bartlett%27s_test
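A sketch of Bartlett's statistic (illustrative Python; equal group variances give a statistic of zero):

```python
import math

def bartlett_stat(groups):
    # Chi-squared statistic with k - 1 degrees of freedom.
    k = len(groups)
    n = [len(g) for g in groups]
    N = sum(n)
    def var(g):
        m = sum(g) / len(g)
        return sum((x - m) ** 2 for x in g) / (len(g) - 1)
    s2 = [var(g) for g in groups]
    sp2 = sum((n[i] - 1) * s2[i] for i in range(k)) / (N - k)  # pooled variance
    num = (N - k) * math.log(sp2) - sum((n[i] - 1) * math.log(s2[i])
                                        for i in range(k))
    den = 1 + (sum(1 / (n[i] - 1) for i in range(k)) - 1 / (N - k)) / (3 * (k - 1))
    return num / den
```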
Bivariate Correlation Tests
Sample Covariance
$$S_{xy} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

where N is the sample size.
Source: http://en.wikipedia.org/wiki/Covariance#Calculating_the_sample_covariance
Sample Pearson Product-Moment Correlation Coefficient
$$r = \frac{1}{N-1}\cdot\frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{S_x S_y} = \frac{S_{xy}}{S_x S_y}$$

where $S_x$ and $S_y$ are the sample standard deviations of the paired sample $(x_i, y_i)$, $S_{xy}$ is the sample covariance, and N is the total sample size.
Source: http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#For_a_sample
Test for the Significance of the Pearson Product-Moment Correlation Coefficient
The test hypotheses are:

- H0: the sample values come from a population in which ρ = 0
- H1: the sample values come from a population in which ρ ≠ 0

The test statistic is

$$t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}$$

where N is the total sample size and r is the sample Pearson product-moment correlation coefficient. The test statistic has a Student's t distribution with N − 2 degrees of freedom.
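A sketch of r and its t statistic (illustrative Python, not the ISSTATS implementation):

```python
import math

def pearson_r_t(xs, ys):
    # Sample correlation and the t statistic with N - 2 df for H0: rho = 0.
    N = len(xs)
    mx, my = sum(xs) / N, sum(ys) / N
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt(N - 2) / math.sqrt(1 - r ** 2)
    return r, t

print(pearson_r_t([1, 2, 3, 4], [1, 3, 2, 4]))  # r = 0.8
```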
Spearman Correlation Coefficient
For each of the variables X and Y separately, the observations are sorted into ascending order and
replaced by their ranks. Identical values (rank ties or value duplicates) are assigned a rank equal to the
average of their positions in the ascending order of the values. Each time t observations are tied (t > 1), the quantity $t^3 - t$ is calculated and summed separately for each variable. These sums will be designated $ST_x$ and $ST_y$.
For each of the N observations, the difference between the rank of X and the rank of Y is computed as:

$$d_i = Rank(X_i) - Rank(Y_i)$$
If there are no ties in either sample, Spearman's rho (ρ) is calculated as

$$\rho = 1 - \frac{6\sum d_i^2}{N(N^2 - 1)}$$

If there are any ties in either sample, Spearman's rho (ρ) is calculated as (Siegel, 1956):

$$\rho = \frac{T_x + T_y - \sum d_i^2}{2\sqrt{T_x T_y}}$$

where

$$T_x = \frac{N(N^2 - 1) - ST_x}{12} \qquad T_y = \frac{N(N^2 - 1) - ST_y}{12}$$

If $T_x$ or $T_y$ is 0, the statistic is not computed.
Source: http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_nonpar_corr_spearman.htm
Test for the Significance of the Spearman’s Correlation Coefficient
The test hypotheses are:

- H0: the sample values come from a population in which ρ = 0
- H1: the sample values come from a population in which ρ ≠ 0

The test statistic is

$$t = \frac{\rho\sqrt{N-2}}{\sqrt{1-\rho^2}}$$

The test statistic has a Student's t distribution with N − 2 degrees of freedom.
Kendall's Tau-b Correlation Coefficient
For each of the variables X and Y separately, the observations are sorted into ascending order and
replaced by their ranks. In situations where t observations are tied, the average rank is assigned.
Each time t > 1, the following quantities are computed and summed over all groups of ties for each
variable separately.
$$T_1 = \sum (t^2 - t) \qquad T_2 = \sum (t^2 - t)(t - 2) \qquad T_3 = \sum (t^2 - t)(2t + 5)$$
Each of the N cases is compared to the others to determine with how many cases its ranking of X and Y is
concordant or discordant. The following procedure is used. For each distinct pair of cases (i, j), where i <
j the quantity
$$d_{ij} = \big[Rank(X_j) - Rank(X_i)\big]\big[Rank(Y_j) - Rank(Y_i)\big]$$
is computed. If the sign of this product is positive, the pair of observations (i, j) is concordant. If the sign
is negative, the pair is discordant. The number of concordant pairs minus the number of discordant pairs
is
$$S = \sum_{i=1}^{N-1}\sum_{j=i+1}^{N} sign(d_{ij})$$

where $sign(d_{ij})$ is defined as +1 or −1 depending on the sign of $d_{ij}$. Pairs in which $d_{ij} = 0$ are ignored in the computation of S.
If there are no ties in either sample, Kendall's tau (τ) is computed as

$$\tau = \frac{2S}{N^2 - N}$$

If there are any ties in either sample, Kendall's tau (τ) is computed as

$$\tau = \frac{2S}{\sqrt{N^2 - N - T_{1x}}\;\sqrt{N^2 - N - T_{1y}}}$$

If the denominator is 0, the statistic is not computed.
Source: http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient#Tau-b
Test for the Significance of the Kendall's Tau-b Correlation Coefficient
The variance of S is estimated by (Kendall, 1955):
METHODS AND FORMULAS HELP V2.1 InnerSoft STATS
26
Var =
(N2
− N)(2N + 5) − T3x − T3y
18
+
T2x ∗ T2y
9(N2 − N)(N − 2)
+
T1x ∗ T1y
2(N2 − N)
The significance level is obtained using

Z = \frac{S}{\sqrt{\mathrm{Var}}}

which, under the null hypothesis that the variables are statistically independent, is approximately distributed as a standard normal.
Sources: http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient#Significance_tests
http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2F
alg_nonpar_corr_kendalls.htm
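The tie-corrected variance and the Z statistic can be sketched the same way (again a hypothetical helper, not ISSTATS code; S is computed by the same brute-force pair count):

```python
import math

def kendall_z(x, y):
    # Z = S / sqrt(Var), with the tie-correction terms T1, T2, T3
    # summed over groups of tied values in each variable.
    n = len(x)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = (x[j] - x[i]) * (y[j] - y[i])
            s += (d > 0) - (d < 0)

    def tie_terms(values):
        counts = {}
        for v in values:
            counts[v] = counts.get(v, 0) + 1
        t1 = sum(t * t - t for t in counts.values())
        t2 = sum((t * t - t) * (t - 2) for t in counts.values())
        t3 = sum((t * t - t) * (2 * t + 5) for t in counts.values())
        return t1, t2, t3

    t1x, t2x, t3x = tie_terms(x)
    t1y, t2y, t3y = tie_terms(y)
    n2 = n * n - n
    var = ((n2 * (2 * n + 5) - t3x - t3y) / 18
           + t2x * t2y / (9 * n2 * (n - 2))
           + t1x * t1y / (2 * n2))
    return s / math.sqrt(var)
```

With no ties the variance reduces to N(N − 1)(2N + 5)/18.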
Parametric Value at Risk
Value at Risk of a single asset
Given the time series of daily return rates for an asset, let μ be the daily mean of the return rates and σ² their daily variance, and let P be the position (holding or investment) in the asset.
One-day Expected Return is:

\mathrm{ER} = P\mu

The Standard Deviation or Volatility is the square root of the Variance:

\sigma = \sqrt{\sigma^2}

One-day Value at Risk is:

VaR_{1-\alpha} = -(\mu + z_\alpha \sigma)P

where z_\alpha is the left-tail α quantile of the standard normal distribution.
Total Value at Risk for n trading days is:

VaR_{1-\alpha}^{n\ \mathrm{days}} = VaR_{1-\alpha} \cdot \sqrt{n} = -(\mu + z_\alpha \sigma)P\sqrt{n}
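A minimal Python sketch of the single-asset formulas (hypothetical helper names, not ISSTATS code; `NormalDist` is in the standard library and `inv_cdf(alpha)` gives the left-tail quantile z_α):

```python
import math
from statistics import NormalDist

def one_day_var(returns, position, alpha=0.05):
    # VaR_{1-alpha} = -(mu + z_alpha * sigma) * P, with unbiased sigma.
    n = len(returns)
    mu = sum(returns) / n
    var = sum((r - mu) ** 2 for r in returns) / (n - 1)
    sigma = math.sqrt(var)
    z = NormalDist().inv_cdf(alpha)  # negative for alpha < 0.5
    return -(mu + z * sigma) * position

def n_day_var(returns, position, days, alpha=0.05):
    # Scale the one-day figure by the square root of the horizon.
    return one_day_var(returns, position, alpha) * math.sqrt(days)
```

A positive result is an expected loss, following the sign convention described below for the portfolio case.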
Portfolio Value at Risk
Given the time series of daily return rates on different assets, let μ_i be the daily mean of the return rates for the i-th asset, σ_i² the daily variance of its return rates, and σ_i its daily standard deviation (or volatility). The covariance of the daily return rates of the i-th and j-th assets is σ_ij. All parameters are unbiased estimates. Let P_i be the position (holding or investment) in each of these assets.
The total position is

P = \sum_{i=1}^{N} P_i
The weighting of each position is

w_i = \frac{P_i}{P}
The weighted mean of the portfolio is
μ 𝑃 = ∑ 𝑤𝑖 𝜇𝑖 =
𝑁
𝑖=1
1
𝑃
∑ 𝑃𝑖 𝜇𝑖
𝑁
𝑖=1
One-day Expected Return of the portfolio is the weighted mean of the portfolio multiplied by the total position:

\mathrm{ER} = P\mu_P = P \sum_{i=1}^{N} w_i \mu_i = \sum_{i=1}^{N} P_i \mu_i
The Portfolio Variance is

\sigma_P^2 = \begin{bmatrix} w_1 & \cdots & w_i & \cdots & w_n \end{bmatrix} \begin{bmatrix} \sigma_1^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_n^2 \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_i \\ \vdots \\ w_n \end{bmatrix} = W^T M W
where W is the vector of weights and M is the covariance matrix. The i-th diagonal item of M is the daily variance of the return rates for the i-th asset; the off-diagonal items are covariances.
The Portfolio Variance can also be computed as:

\sigma_P^2 = \frac{1}{P^2} \begin{bmatrix} P_1 & \cdots & P_i & \cdots & P_n \end{bmatrix} \begin{bmatrix} \sigma_1^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_n^2 \end{bmatrix} \begin{bmatrix} P_1 \\ \vdots \\ P_i \\ \vdots \\ P_n \end{bmatrix} = \frac{1}{P^2} X^T M X
where X is the vector of positions.
The Portfolio Standard Deviation or Portfolio Volatility is the square root of the Portfolio Variance:

\sigma_P = \sqrt{\sigma_P^2}

One-day Value at Risk is:

VaR_{1-\alpha} = -(\mu_P + z_\alpha \sigma_P)P
where z_\alpha is the left-tail α quantile of the standard normal distribution.
Total Value at Risk for n trading days is:

VaR_{1-\alpha}^{n\ \mathrm{days}} = VaR_{1-\alpha} \cdot \sqrt{n} = -(\mu_P + z_\alpha \sigma_P)P\sqrt{n}

VaR_{1-\alpha}^{n\ \mathrm{days}} is the minimum potential loss that a portfolio can suffer in the α% worst cases in n days.
About the Signs: A positive value of VaR is an expected loss. A negative VaR would imply the portfolio
has a high probability of making a profit.
Source: http://www.jpmorgan.com/tss/General/Risk_Management/1159360877242
Remark: Some texts about VaR express the covariance as σij = σiσjρij where ρij is the correlation
coefficient.
Remark: Sometimes VaR is taken to be the Portfolio Volatility multiplied by the position, as the expected return is assumed to be approximately zero. ISSTATS does NOT treat VaR as Portfolio Volatility and does NOT assume the expected return is zero.
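The portfolio case can be sketched directly from the position-vector form σ_P² = (1/P²)·XᵀMX (hypothetical helper, not ISSTATS code; the covariance matrix is passed in as nested lists):

```python
import math
from statistics import NormalDist

def portfolio_var(mu, cov, positions, alpha=0.05):
    # VaR_{1-alpha} = -(mu_P + z_alpha * sigma_P) * P, with
    # sigma_P^2 = (1 / P^2) * X^T M X computed from the positions X.
    n = len(positions)
    p_total = sum(positions)
    mu_p = sum(p * m for p, m in zip(positions, mu)) / p_total
    quad = sum(positions[i] * cov[i][j] * positions[j]
               for i in range(n) for j in range(n))
    sigma_p = math.sqrt(quad) / p_total
    z = NormalDist().inv_cdf(alpha)
    return -(mu_p + z * sigma_p) * p_total
```

For a single asset this reduces to the one-day formula of the previous section.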
Marginal Value at Risk
Marginal Value at Risk is the change in portfolio VaR resulting from a marginal change in the currency
(dollar, euro…) position in component i:
MVaR_i = \frac{\partial VaR}{\partial P_i}
Assuming the linearity of the risk in the parametric approach, the vector of Marginal Value at Risk is
\begin{bmatrix} MVaR_1 \\ \vdots \\ MVaR_i \\ \vdots \\ MVaR_n \end{bmatrix} = -\left( \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_i \\ \vdots \\ \mu_n \end{bmatrix} + \frac{z_\alpha}{\sigma_P} \begin{bmatrix} \sigma_1^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_n^2 \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_i \\ \vdots \\ w_n \end{bmatrix} \right)

\begin{bmatrix} MVaR_1 \\ \vdots \\ MVaR_i \\ \vdots \\ MVaR_n \end{bmatrix} = -\left( \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_i \\ \vdots \\ \mu_n \end{bmatrix} + \frac{z_\alpha}{P\,\sigma_P} \begin{bmatrix} \sigma_1^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_n^2 \end{bmatrix} \begin{bmatrix} P_1 \\ \vdots \\ P_i \\ \vdots \\ P_n \end{bmatrix} \right)
Total Marginal Value at Risk for n trading days is:
MVaR_i^{n\ \mathrm{days}} = MVaR_i \cdot \sqrt{n}
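The gradient form above translates almost line by line into a sketch (hypothetical helper, not ISSTATS code; it returns the vector of MVaR_i):

```python
import math
from statistics import NormalDist

def marginal_var(mu, cov, positions, alpha=0.05):
    # MVaR_i = -(mu_i + z_alpha * (M X)_i / (P * sigma_P))
    n = len(positions)
    p_total = sum(positions)
    quad = sum(positions[i] * cov[i][j] * positions[j]
               for i in range(n) for j in range(n))
    sigma_p = math.sqrt(quad) / p_total
    z = NormalDist().inv_cdf(alpha)
    mx = [sum(cov[i][j] * positions[j] for j in range(n)) for i in range(n)]
    return [-(mu[i] + z * mx[i] / (p_total * sigma_p)) for i in range(n)]
```

Multiplying each MVaR_i by P_i gives the Component VaR of the next section, and those components sum to the portfolio VaR (Euler decomposition).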
Component Value at Risk
Component Value at Risk is a partition of the portfolio VaR that indicates how the VaR would change if a given component were deleted.
CVaR_i = \frac{\partial VaR}{\partial P_i} P_i = MVaR_i \cdot P_i
Note that the sum of all component VaRs (CVaR) is the VaR for the entire portfolio:
VaR = \sum_{i=1}^{N} CVaR_i = \sum_{i=1}^{N} \frac{\partial VaR}{\partial P_i} P_i = \sum_{i=1}^{N} MVaR_i \cdot P_i
Total Component Value at Risk for n trading days is:
CVaR_i^{n\ \mathrm{days}} = CVaR_i \cdot \sqrt{n}
Source: http://www.math.nus.edu.sg/~urops/Projects/valueatrisk.pdf
Incremental Value at Risk
Incremental VaR of a given position is the VaR of the portfolio with the given position minus the VaR of
the portfolio without the given position, which measures the change in VaR due to a new position on the
portfolio:
IVaR(a) = VaR(P) − VaR(P − a)
Source:
http://www.jpmorgan.com/tss/General/Portfolio_Management_With_Incremental_VaR/1259104336084
Conditional Value at Risk, Expected Shortfall, Expected Tail Loss or Average Value at Risk
ES_{1-\alpha}^{1\ \mathrm{day}} is the expected value of the loss of the portfolio in the α% worst cases in one day.
Under Multivariate Normal Assumption, Expected Shortfall, also known as Expected Tail Loss (ETL),
Conditional Value-at-Risk (CVaR), Average Value at Risk (AVaR) and Worst Conditional Expectation,
is computed by
ES(-VaR) = -E(x \mid x < -VaR) \cdot P = -[\mu + ES(z_\alpha)\sigma] \cdot P = -[\mu + E(z \mid z < z_\alpha)\sigma] \cdot P = -\left[\mu + \frac{\int_{-\infty}^{z_\alpha} t\, e^{-t^2/2}\, dt}{\alpha\sqrt{2\pi}}\, \sigma\right] \cdot P = -\left(\mu - \frac{e^{-z_\alpha^2/2}}{\alpha\sqrt{2\pi}}\, \sigma\right) \cdot P
where z_\alpha is the left-tail α quantile of the standard normal distribution.
About the Sign: Because VaR is given by ISSTATS with a negative sign, as J.P. Morgan recommends, we take its original value to perform calculations (−VaR = μ + z_α σ). Once the ES is computed, it is reported with a negative sign. That is, a positive value of ES is an expected loss; a negative value of ES would imply the portfolio has a high probability of making a profit even in the worst cases.
Source: http://www.imes.boj.or.jp/english/publication/mes/2002/me20-1-3.pdf
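Under normality, the closed form above needs only the standard normal quantile and density (a minimal sketch with a hypothetical helper name, not ISSTATS code):

```python
import math
from statistics import NormalDist

def expected_shortfall(mu, sigma, position, alpha=0.05):
    # ES = -(mu - exp(-z_alpha^2 / 2) / (alpha * sqrt(2 * pi)) * sigma) * P
    z = NormalDist().inv_cdf(alpha)
    tail_factor = math.exp(-z * z / 2) / (alpha * math.sqrt(2 * math.pi))
    return -(mu - tail_factor * sigma) * position
```

By construction the ES is at least as large as the corresponding VaR: the average of the tail losses exceeds the tail threshold.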
Exponentially Weighted Moving Average (EWMA) Forecast
Given a series of k daily return rates {r_1, …, r_k} computed as Continuously Compounded Returns:

r_i = \ln\left(\frac{s_i}{s_{i-1}}\right)
where r_1 corresponds to the earliest date in the series and r_k to the latest or most recent date.
Assuming k > 50, and assuming that the sample mean of daily returns is zero, the EWMA estimates the one-day variance for a given sequence of k returns as:

\sigma^2 = (1 - \lambda) \sum_{i=0}^{k-1} \lambda^i\, r_{k-i}^2
where 0 < λ < 1 is the decay factor.
The one-day volatility is:

\sigma = \sqrt{\sigma^2}

For horizons greater than one day, the T-period (i.e., over T days) forecast of the volatility is:

\sigma_{T\ \mathrm{days}} = \sigma\sqrt{T}
For two return series, assuming that both averages are zero, the EWMA estimate of the one-day covariance for a given sequence of k returns is given by

\mathrm{cov}_{1,2} = \sigma_{1,2} = (1 - \lambda) \sum_{i=0}^{k-1} \lambda^i\, r_{1,k-i}\, r_{2,k-i}
The corresponding one-day correlation forecast for the two returns is given by

\rho_{1,2} = \frac{\mathrm{cov}_{1,2}}{\sigma_1 \sigma_2} = \frac{\sigma_{1,2}}{\sigma_1 \sigma_2}
For horizons greater than one day, the T-period (i.e., over T days) forecast of the covariance is:

\mathrm{cov}_{1,2}^{T\ \mathrm{days}} = \sigma_{1,2}\, T
Source: http://pascal.iseg.utl.pt/~aafonso/eif/rm/TD4ePt_2.pdf
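The weighted sum above can be sketched as a simple loop (hypothetical helper, not ISSTATS code; λ = 0.94 is the RiskMetrics daily decay factor):

```python
def ewma_variance(returns, decay=0.94):
    # sigma^2 = (1 - lambda) * sum_{i=0}^{k-1} lambda^i * r_{k-i}^2,
    # assuming the sample mean of daily returns is zero.
    total = 0.0
    weight = 1.0 - decay
    for r in reversed(returns):  # most recent return gets the largest weight
        total += weight * r * r
        weight *= decay
    return total
```

The covariance estimate is the same loop with `r * r` replaced by the product of the two series' returns on each day.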
Value at Risk of a single asset, Portfolio Value at Risk, Marginal Value at Risk, Component Value at Risk, and Incremental Value at Risk by the EWMA method.
See methods and formulas at Parametric Value at Risk.
Linear Regression
Given n equations for a regression model with p predictor variables, the i-th equation is

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}
The n equations stacked together and written in vector form are

\begin{bmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ 1 & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_i \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_n \end{bmatrix}
In matrix notation:

Y = X\beta + \varepsilon

X is here named the design matrix, of dimensions n-by-(p+1).
If the constant is not included, the matrices are

\begin{bmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_i \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_n \end{bmatrix}
If the constant is not included, the design matrix X has dimensions n-by-p.
The estimated value of the unknown parameter β is:

\hat{\beta} = (X^T X)^{-1} X^T Y
Estimation can be carried out if, and only if, there is no perfect multicollinearity between the predictor
variables.
If the constant is not included and there is a single predictor (or mutually orthogonal predictors), the parameters can also be estimated by

\hat{\beta}_j = \frac{\sum_{i=1}^{n} x_{ij} y_i}{\sum_{i=1}^{n} x_{ij}^2}
The standardized coefficients are

\hat{\beta}_i^{st} = \frac{\hat{\beta}_i \cdot S_{x_i}}{S_y}
where
• Sxi is the unbiased standard deviation of the i-th predictor variable
• Sy is the unbiased standard deviation of the response variable y
The estimate of the standard error of each coefficient is obtained by

se(\hat{\beta}_i) = \sqrt{MSE \cdot [(X^T X)^{-1}]_{ii}}

where MSE is the mean squared error of the regression model.
It is known that

\frac{\hat{\beta}_i}{se(\hat{\beta}_i)} \sim t_{n-p-1}

where
• p is the number of predictor variables
• n is the total number of observations (number of rows in the design matrix)

If the constant is not included, the degrees of freedom for the t statistics are n − p.
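The estimator, its standard errors, and the t values can be sketched with NumPy (an assumption, since ISSTATS itself is not Python; `ols_summary` is a hypothetical helper name):

```python
import numpy as np

def ols_summary(X, y):
    # beta_hat = (X^T X)^{-1} X^T Y,
    # se(beta_i) = sqrt(MSE * [(X^T X)^{-1}]_ii),
    # t_i = beta_i / se(beta_i) with n - p - 1 degrees of freedom.
    # X is the design matrix, already including a leading column of ones.
    n, k = X.shape  # k = p + 1 when the constant is included
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    mse = resid @ resid / (n - k)
    se = np.sqrt(mse * np.diag(xtx_inv))
    return beta, se, beta / se
```

In production code the explicit inverse would normally be replaced by a solver such as `numpy.linalg.lstsq` for numerical stability.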
ANOVA for linear regression
If the constant is included.
Component   Sum of squares   Degrees of freedom   Mean of squares         F
Model       SSM              p                    MSM = SSM/p             MSM/MSE
Error       SSE              n − p − 1            MSE = SSE/(n − p − 1)
Total       SST              n − 1                MST = SST/(n − 1)
where

SSM = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \qquad SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad SST = \sum_{i=1}^{n} (y_i - \bar{y})^2

and
• p is the number of predictor variables
• n is the total number of observations (number of rows in the design matrix)
• SSE = sum of squared residuals
• MSE = mean squared error of the regression model
The test statistic has an F-distribution with p and (n − p − 1) degrees of freedom. Thus the ANOVA null hypothesis is rejected if

F \geq F_{p,\,n-p-1}(1 - \alpha)

The coefficient of determination R² is defined as SSM/SST. It is output as a percentage.
The Adjusted R² is defined as 1 − MSE/MST. It is output as a percentage.
The square root of MSE is called the standard error of the regression, or standard error of the estimate.
If the constant is not included:

Component   Sum of squares   Degrees of freedom   Mean of squares     F
Model       SSM              p                    MSM = SSM/p         MSM/MSE
Error       SSE              n − p                MSE = SSE/(n − p)
Total       SST              n                    MST = SST/n

where

SSM = \sum_{i=1}^{n} \hat{y}_i^2 \qquad SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad SST = \sum_{i=1}^{n} y_i^2
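The three sums of squares and R² can be checked numerically (NumPy sketch with a hypothetical helper name; constant-included case):

```python
import numpy as np

def anova_sums(X, y):
    # SSM = sum (yhat_i - ybar)^2, SSE = sum (y_i - yhat_i)^2,
    # SST = sum (y_i - ybar)^2; R^2 = SSM / SST.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    ssm = float(np.sum((yhat - y.mean()) ** 2))
    sse = float(np.sum((y - yhat) ** 2))
    sst = float(np.sum((y - y.mean()) ** 2))
    return ssm, sse, sst
```

When the constant is included, SSM + SSE = SST, so R² = SSM/SST = 1 − SSE/SST.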
Unstandardized Predicted Values
The fitted values (or unstandardized predicted values) from the regression are

\hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = HY

where H is the projection matrix (also known as the hat matrix)

H = X(X^T X)^{-1} X^T
Standardized Predicted Values
Once the mean and unbiased standard deviation of the unstandardized predicted values are computed, the fitted values are standardized as

\hat{y}_i^{st} = \frac{\hat{y}_i - \bar{\hat{y}}}{S_{\hat{y}}}

When new predictions are made outside of the design matrix, they are standardized with the above values.
Prediction Intervals for Mean
Let the vector of given predictors be

X_h = (1, x_{h,1}, x_{h,2}, \ldots, x_{h,p})^T

The standard error of the fit at X_h is given by:

se(\hat{y}_h) = \sqrt{MSE \cdot X_h^T (X^T X)^{-1} X_h}

Then, the Confidence Interval for the Mean Response is

\hat{y}_h \pm t_{\alpha/2;\,n-p-1} \cdot se(\hat{y}_h)
where
• X is the design matrix
• ŷh is the "fitted value" or "predicted value" of the response when the predictor values are Xh.
• MSE is the mean squared error of the regression model
• n is the total number of observations
• p is the number of predictor variables
Prediction Intervals for Individuals
Let the vector of given predictors be

X_h = (1, x_{h,1}, x_{h,2}, \ldots, x_{h,p})^T

The standard error of the fit at X_h is given by:

se(\hat{y}_h) = \sqrt{MSE \cdot [1 + X_h^T (X^T X)^{-1} X_h]}

Then, the Confidence Interval for individuals or new observations is

\hat{y}_h \pm t_{\alpha/2;\,n-p-1} \cdot se(\hat{y}_h)
where
• X is the design matrix
• ŷh is the "fitted value" or "predicted value" of the response when the predictor values are Xh.
• MSE is the mean squared error of the regression model
• n is the total number of observations
• p is the number of predictor variables
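Both interval formulas share the quadratic form X_hᵀ(XᵀX)⁻¹X_h; a NumPy sketch (hypothetical helper name, not ISSTATS code):

```python
import numpy as np

def fit_se(X, y, xh, individual=False):
    # se(yhat_h) = sqrt(MSE * X_h^T (X^T X)^{-1} X_h) for the mean response;
    # for a new individual observation, add 1 inside the bracket.
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    mse = resid @ resid / (n - k)
    quad = xh @ xtx_inv @ xh
    return np.sqrt(mse * ((1.0 + quad) if individual else quad))
```

The interval is then the fitted value ŷ_h plus or minus the t quantile times this standard error.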
Unstandardized Residuals
The Unstandardized Residual for the i-th data unit is defined as:

\hat{e}_i = y_i - \hat{y}_i

In matrix notation

\hat{E} = Y - \hat{Y} = Y - HY = (I_{n \times n} - H)Y

where H is the hat matrix.
Standardized Residuals
The Standardized Residual for the i-th data unit is defined as:

\hat{es}_i = \frac{\hat{e}_i}{\sqrt{MSE}}

where
• êi is the unstandardized residual for the i-th data unit.
• MSE is the mean squared error of the regression model
Studentized Residuals (internally studentized residuals)
The leverage score for the i-th data unit is defined as:

h_{ii} = [H]_{ii}

the i-th diagonal element of the projection matrix (also known as the hat matrix)

H = X(X^T X)^{-1} X^T

where X is the design matrix.
The Studentized Residual for the i-th data unit is defined as:

t_i = \frac{\hat{e}_i}{\sqrt{MSE \cdot (1 - h_{ii})}}

where
• êi is the unstandardized residual for the i-th data unit.
• MSE is the mean squared error of the regression model
Source: https://en.wikipedia.org/wiki/Studentized_residual
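Leverages and studentized residuals can be sketched together (NumPy, hypothetical helper name, not ISSTATS code):

```python
import numpy as np

def studentized_residuals(X, y):
    # t_i = e_i / sqrt(MSE * (1 - h_ii)), with h_ii = [H]_ii and
    # H = X (X^T X)^{-1} X^T the hat matrix.
    n, k = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    resid = y - H @ y  # unstandardized residuals
    mse = resid @ resid / (n - k)
    h = np.diag(H)
    return resid / np.sqrt(mse * (1.0 - h))
```

The diagonal of H also feeds the centered leverage, Mahalanobis distance, and Cook's distance formulas that follow.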
Centered Leverage Values
The regular leverage score for the i-th data unit is defined as:

h_{ii} = [H]_{ii}

the i-th diagonal element of the projection matrix (also known as the hat matrix)

H = X(X^T X)^{-1} X^T

where X is the design matrix.
The centered leverage value for the i-th data unit is defined as:

clv_i = h_{ii} - 1/n

where n is the number of observations.
If the intercept is not included, then the centered leverage value for the i-th data unit is defined as:

clv_i = h_{ii}

Source: https://en.wikipedia.org/wiki/Leverage_(statistics)
Mahalanobis Distance
The Mahalanobis Distance for the i-th data unit is defined as:

D_i^2 = (n - 1)(h_{ii} - 1/n) = (n - 1) \cdot clv_i

where
• hii is the i-th diagonal element of the projection matrix.
• n is the number of observations
If the intercept is not included, the Mahalanobis Distance for the i-th data unit is defined as:

D_i^2 = n \cdot h_{ii}
Source: https://en.wikipedia.org/wiki/Mahalanobis_distance
Cook’s Distance
The Cook's Distance for the i-th data unit is defined as:

D_i = \frac{\hat{e}_i^2\, h_{ii}}{MSE \cdot (p + 1) \cdot (1 - h_{ii})^2}

where
• hii is the i-th diagonal element of the projection matrix.
• p is the number of predictor variables
• êi is the unstandardized residual for the i-th data unit.
• MSE is the mean squared error of the regression model
If the intercept is not included, the Cook's Distance for the i-th data unit is defined as:

D_i = \frac{\hat{e}_i^2\, h_{ii}}{MSE \cdot p \cdot (1 - h_{ii})^2}
Source: https://en.wikipedia.org/wiki/Cook%27s_distance
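Cook's distance follows directly from the residuals and leverages above (NumPy sketch with a hypothetical helper name; intercept-included case, so p + 1 equals the number of design-matrix columns):

```python
import numpy as np

def cooks_distance(X, y):
    # D_i = e_i^2 * h_ii / (MSE * (p + 1) * (1 - h_ii)^2)
    n, k = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    resid = y - H @ y
    mse = resid @ resid / (n - k)
    h = np.diag(H)
    return resid ** 2 * h / (mse * k * (1.0 - h) ** 2)
```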
Curve Estimation Models
Linear. Model whose equation is Y = b0 + (b1 * t). The series values are modeled as a linear
function of time.
Quadratic. Model whose equation is Y = b0 + (b1 * t) + (b2 * t**2). The quadratic model can be
used to model a series that "takes off" or a series that dampens.
Cubic. Model that is defined by the equation Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3).
Quartic. Model that is defined by the equation Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4
* t**4).
Quintic. Model that is defined by the equation Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4
* t**4) + (b5 * t**5).
Sextic. Model that is defined by the equation Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4 *
t**4) + (b5 * t**5) + (b6 * t**6).
Logarithmic. Model whose equation is Y = b0 + (b1 * ln(t)).
Inverse. Model whose equation is Y = b0 + (b1 / t).
Power. Model whose equation is Y = b0 * (t**b1) or ln(Y) = ln(b0) + (b1 * ln(t)).
Compound. Model whose equation is Y = b0 * (b1**t) or ln(Y) = ln(b0) + (ln(b1) * t).
S-curve. Model whose equation is Y = e**(b0 + (b1/t)) or ln(Y) = b0 + (b1/t).
Logistic. Model whose equation is Y = 1 / (1/u + (b0 * (b1**t))) or ln(1/y-1/u) = ln (b0) + (ln(b1)
* t) where u is the upper boundary value. After selecting Logistic, specify the upper boundary value to
use in the regression equation. The value must be a positive number that is greater than the largest
dependent variable value.
Growth. Model whose equation is Y = e**(b0 + (b1 * t)) or ln(Y) = b0 + (b1 * t).
Exponential. Model whose equation is Y = b0 * (e**(b1 * t)) or ln(Y) = ln(b0) + (b1 * t).
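Several of these models are fitted by linearizing the equation and running ordinary least squares on the transformed data. As an illustration, the Exponential model Y = b0 * (e**(b1 * t)) can be fitted through ln(Y) = ln(b0) + (b1 * t) (a pure-Python sketch with a hypothetical helper name, not ISSTATS code):

```python
import math

def fit_exponential(t, y):
    # Fit ln(y) = ln(b0) + b1 * t by simple least squares,
    # then transform back: b0 = exp(intercept).
    n = len(t)
    ly = [math.log(v) for v in y]
    tbar = sum(t) / n
    lbar = sum(ly) / n
    b1 = (sum((a - tbar) * (b - lbar) for a, b in zip(t, ly))
          / sum((a - tbar) ** 2 for a in t))
    b0 = math.exp(lbar - b1 * tbar)
    return b0, b1
```

Note that the back-transformed fit minimizes squared error in log space, not in the original units.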
© Copyright InnerSoft 2017. All rights reserved.
The lost children of the Sinclair ZX Spectrum 128K (RANDOMIZE USR 123456)
innersoft@itspanish.org
innersoft@gmail.com
http://isstats.itspanish.org/

More Related Content

PPT
Chi-square, Yates, Fisher & McNemar
PPTX
Wilcoxon Rank-Sum Test
PPTX
PPTX
Statr session 19 and 20
PPTX
PDF
Practice test ch 10 correlation reg ch 11 gof ch12 anova
PPT
Chapter12
DOC
Mc Nemar
Chi-square, Yates, Fisher & McNemar
Wilcoxon Rank-Sum Test
Statr session 19 and 20
Practice test ch 10 correlation reg ch 11 gof ch12 anova
Chapter12
Mc Nemar

What's hot (20)

PPT
Chi square using excel
PDF
Solution to the practice test ch 10 correlation reg ch 11 gof ch12 anova
PDF
Categorical data analysis
PPTX
PPTX
Student’s t test
PPT
Mc namer test of correlation
PPTX
Goodness of Fit Notation
PPTX
Analysis of variance (ANOVA)
PPT
Nonparametric statistics
PPTX
Analysis of Variance-ANOVA
PPTX
Chi square test
PPT
My regression lecture mk3 (uploaded to web ct)
PPTX
Lesson 27 using statistical techniques in analyzing data
PPT
F test Analysis of Variance (ANOVA)
ODP
Multiple linear regression
PPT
Chapter 14
PDF
PG STAT 531 Lecture 2 Descriptive statistics
PPTX
Contingency Tables
PDF
Data Science - Part IV - Regression Analysis & ANOVA
PDF
Multiple linear regression
Chi square using excel
Solution to the practice test ch 10 correlation reg ch 11 gof ch12 anova
Categorical data analysis
Student’s t test
Mc namer test of correlation
Goodness of Fit Notation
Analysis of variance (ANOVA)
Nonparametric statistics
Analysis of Variance-ANOVA
Chi square test
My regression lecture mk3 (uploaded to web ct)
Lesson 27 using statistical techniques in analyzing data
F test Analysis of Variance (ANOVA)
Multiple linear regression
Chapter 14
PG STAT 531 Lecture 2 Descriptive statistics
Contingency Tables
Data Science - Part IV - Regression Analysis & ANOVA
Multiple linear regression
Ad

Similar to InnerSoft STATS - Methods and formulas help (20)

PPTX
Simple Regression.pptx
PDF
Econometrics 1 Slide from the masters degree 1
PPTX
What is chi square test
PDF
CFA Formula Cheat Sheet. All Topics covered
PPTX
Variance component analysis by paravayya c pujeri
PDF
Data Science Cheatsheet.pdf
DOCX
Descriptive Statistics Formula Sheet Sample Populatio.docx
PPTX
Sampling distribution.pptx
PDF
PDF
Statistical parameters
PDF
Memorization of Various Calculator shortcuts
PPTX
Company Induction process and Onboarding
PDF
A Mathematical Model for the Hormonal Responses During Neurally Mediated Sync...
PDF
A Mathematical Model for the Hormonal Responses During Neurally Mediated Sync...
PDF
Bio-L8- Correlation and Regression Analysis.pdf
PPTX
Statistics78 (2)
PPTX
Test of hypothesis test of significance
PPTX
Testing of hypothesis
PPTX
Categorical data analysis full lecture note PPT.pptx
PDF
Stat3 central tendency & dispersion
Simple Regression.pptx
Econometrics 1 Slide from the masters degree 1
What is chi square test
CFA Formula Cheat Sheet. All Topics covered
Variance component analysis by paravayya c pujeri
Data Science Cheatsheet.pdf
Descriptive Statistics Formula Sheet Sample Populatio.docx
Sampling distribution.pptx
Statistical parameters
Memorization of Various Calculator shortcuts
Company Induction process and Onboarding
A Mathematical Model for the Hormonal Responses During Neurally Mediated Sync...
A Mathematical Model for the Hormonal Responses During Neurally Mediated Sync...
Bio-L8- Correlation and Regression Analysis.pdf
Statistics78 (2)
Test of hypothesis test of significance
Testing of hypothesis
Categorical data analysis full lecture note PPT.pptx
Stat3 central tendency & dispersion
Ad

More from InnerSoft (10)

PDF
InnerSoft CAD para AutoCAD, v4.0 Manual
PDF
InnerSoft STATS - Introduction
PDF
InnerSoft STATS - Index
PDF
InnerSoft STATS - Graphs
PDF
InnerSoft STATS - Analyze
PDF
Manual InnerSoft STATS
PDF
Ingeniería de caminos rurales
PDF
InnerSoft CAD Manual
PDF
Norma 3.1 ic. trazado, de la instrucción de carreteras
PDF
Manual de InnerSoft CAD en español
InnerSoft CAD para AutoCAD, v4.0 Manual
InnerSoft STATS - Introduction
InnerSoft STATS - Index
InnerSoft STATS - Graphs
InnerSoft STATS - Analyze
Manual InnerSoft STATS
Ingeniería de caminos rurales
InnerSoft CAD Manual
Norma 3.1 ic. trazado, de la instrucción de carreteras
Manual de InnerSoft CAD en español

Recently uploaded (20)

PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
Computing-Curriculum for Schools in Ghana
PDF
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PDF
Complications of Minimal Access Surgery at WLH
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
Insiders guide to clinical Medicine.pdf
PPTX
Lesson notes of climatology university.
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PDF
Pre independence Education in Inndia.pdf
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
01-Introduction-to-Information-Management.pdf
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PPTX
PPH.pptx obstetrics and gynecology in nursing
PDF
Classroom Observation Tools for Teachers
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
Abdominal Access Techniques with Prof. Dr. R K Mishra
Computing-Curriculum for Schools in Ghana
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
Complications of Minimal Access Surgery at WLH
2.FourierTransform-ShortQuestionswithAnswers.pdf
Insiders guide to clinical Medicine.pdf
Lesson notes of climatology university.
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
Pre independence Education in Inndia.pdf
102 student loan defaulters named and shamed – Is someone you know on the list?
Microbial disease of the cardiovascular and lymphatic systems
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
01-Introduction-to-Information-Management.pdf
O5-L3 Freight Transport Ops (International) V1.pdf
PPH.pptx obstetrics and gynecology in nursing
Classroom Observation Tools for Teachers
Microbial diseases, their pathogenesis and prophylaxis
Module 4: Burden of Disease Tutorial Slides S2 2025

InnerSoft STATS - Methods and formulas help

  • 2. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 2 Mean The arithmetic mean is the sum of a collection of numbers divided by the number of numbers in the collection. Sample Variance The estimator of population variance, also called the unbiased sample variance, is: 𝑆2 = ∑ (𝑥𝑖 − 𝑥̅)2𝑛 𝑖=1 𝑛 − 1 Source: http://en.wikipedia.org/wiki/Variance Sample Kurtosis The estimators of population kurtosis is: 𝐺2 = 𝑘4 𝑘2 2 = (𝑛 + 1)𝑛 (𝑛 − 1)(𝑛 − 2)(𝑛 − 3) ∗ ∑ (𝑥𝑖 − 𝑥̅)4𝑛 𝑖=1 𝑘2 2 − 3 (𝑛 − 1)2 (𝑛 − 2)(𝑛 − 3) The standard error of the sample kurtosis of a sample of size n from the normal distribution is: 𝐾 𝑆𝑡𝑑. 𝐸𝑟𝑟𝑜𝑟 = √ 4[6𝑛(𝑛 − 1)2(𝑛 + 1)] (𝑛 − 3)(𝑛 − 2)(𝑛 + 1)(𝑛 + 3)(𝑛 + 5) Source: http://en.wikipedia.org/wiki/Kurtosis#Estimators_of_population_kurtosis Sample Skewness Skewness of a population sample is estimated by the adjusted Fisher–Pearson standardized moment coefficient: 𝐺 = 𝑛 (𝑛 − 1)(𝑛 − 2) ∑ ( 𝑥𝑖 − 𝑥̅ 𝑠 ) 3𝑛 𝑖=1 where n is the sample size and s is the sample standard deviation. The standard error of the skewness of a sample of size n from a normal distribution is: 𝐺 𝑆𝑡𝑑. 𝐸𝑟𝑟𝑜𝑟 = √ 6𝑛(𝑛 − 1) (𝑛 − 2)(𝑛 + 1)(𝑛 + 3) Source: https://en.wikipedia.org/wiki/Skewness#Sample_skewness Total Variance Variance of the entire population is: 𝜎2 = ∑ (𝑥𝑖 − 𝑥̅)2𝑛 𝑖=1 𝑛
  • 3. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 3 Source: http://en.wikipedia.org/wiki/Variance Total Kurtosis Kurtosis of the entire population is: 𝐺2 = ∑ (𝑥𝑖 − 𝑥̅)4𝑛 𝑖=1 𝑛 𝜎4 − 3 where n is the sample size and σ is the total standard deviation. Source: http://en.wikipedia.org/wiki/Kurtosis Total Skewness Skewness of the entire population is: 𝐺 = ∑ (𝑥𝑖 − 𝑥̅)3𝑛 𝑖=1 𝑛 𝜎3 where n is the sample size and σ is the total standard deviation. Source: https://en.wikipedia.org/wiki/Skewness Quantiles of a population ISSTATS uses the same method as R–7, Excel CUARTIL.INC function, SciPy–(1,1), SPSS and Minitab. Qp, the estimate for the kth q–quantile, where p = k/q and h = (N–1)*p + 1, is computing by Qp = Linear interpolation of the modes for the order statistics for the uniform distribution on [0, 1]. When p = 1, use xN. Source: http://en.wikipedia.org/wiki/Quantile#Estimating_the_quantiles_of_a_population MSSD (Mean of the squared successive differences) It is calculated by taking the sum of the differences between consecutive observations squared, then taking the mean of that sum and dividing by two. 𝑀𝑆𝑆𝐷 = ∑ (𝑥𝑖+1 − 𝑥𝑖)2𝑛 𝑖=1 2(𝑛 − 1) The MSSD has the desirable property that one half the MSSD is an unbiased estimator of true variance. Pearson Chi Square Test The value of the test-statistic is
  • 4. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 4 𝜒2 = ∑ (𝑂𝑖 − 𝐸𝑖)2 𝐸𝑖 𝑛 𝑖=1 Where  𝜒2 is the Pearson's cumulative test statistic, which asymptotically approaches a 𝜒2 distribution with (r - 1)(c - 1) degrees of freedom.  𝑂𝑖 is the number of observations of type i.  𝐸𝑖 is the expected (theoretical) frequency of type i Yates's Continuity Correction The value of the test-statistic is 𝜒2 = ∑ (𝑚𝑎𝑥{0, |𝑂𝑖 − 𝐸𝑖| − 0.5})2 𝐸𝑖 𝑛 𝑖=1 When |𝑂𝑖 − 𝐸𝑖| − 0.5 is below zero, the null value is computed. The effect of Yates' correction is to prevent overestimation of statistical significance for small data. This formula is chiefly used when at least one cell of the table has an expected count smaller than 5. Likelihood Ratio G-Test The value of the test-statistic is 𝐺 = 2 (∑ ∑ 𝑂𝑖𝑗 ∗ 𝑙𝑛( 𝑂𝑖𝑗 𝐸𝑖𝑗 ) 𝑐 𝑗=1 𝑟 𝑖=1 ) where  Oij is the observed count in row i and column j  Eij is the expected count in row i and column j G has an asymptotically approximate χ2 distribution with (r - 1)(c - 1) degrees of freedom when the null hypothesis is true and n is large enough. Mantel-Haenszel Chi-Square Test The Mantel-Haenszel chi-square statistic tests the alternative hypothesis that there is a linear association between the row variable and the column variable. Both variables must lie on an ordinal scale. The Mantel-Haenszel chi-square statistic is computed as: 𝑄 𝑀𝐻 = (𝑛 − 1)𝑟2 Where r is the Pearson correlation between the row variable and the column variable, n is the sample size. Under the null hypothesis of no association, has an asymptotic chi-square distribution with one degree of freedom.
  • 5. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 5 Fisher's Exact Test Fisher’s exact test assumes that the row and column totals are fixed, and then uses the hypergeometric distribution to compute probabilities of possible tables conditional on the observed row and column totals. Fisher’s exact test does not depend on any large-sample distribution assumptions, and so it is appropriate even for small sample sizes and for sparse tables. This test is computed for 2X2 tables such as 𝐴 = ( 𝑎 𝑏 𝑐 𝑑 ) For an efficient computing, the elements of the matrix A are reordered A’ = ( 𝑎′ 𝑏′ 𝑐′ 𝑑′ ) Being a’ the cell of A that have the minimum marginals (minimum row and column totals). The test result does not depend on the cells disposition. The left-sided –value sums the probability for all the tables that have equal or smaller a’. p 𝑙𝑒𝑓𝑡 = P(𝑥 ≤ 𝑎′) = ∑ ( 𝐾 = 𝑎′ + 𝑏′ 𝑖 ) ( 𝑁 − 𝐾 𝑛 − 𝑖 ) ( 𝑁 = 𝑎′ + 𝑏′ + 𝑐′ + 𝑑′ 𝑛 = 𝑎′ + 𝑐′ ) 𝑎′ 𝑖=0 The right-sided –value sums the probability for all the tables that have equal or larger a’. p 𝑟𝑖𝑔ℎ𝑡 = P(𝑥 ≥ 𝑎′) = ∑ ( 𝐾 = 𝑎′ + 𝑏′ 𝑖 ) ( 𝑁 − 𝐾 𝑛 − 𝑖 ) ( 𝑁 = 𝑎′ + 𝑏′ + 𝑐′ + 𝑑′ 𝑛 = 𝑎′ + 𝑐′ ) 𝐾=𝑎′+𝑏′ 𝑖=𝑎′ Most of the statistical packages output -as the one-sided test result- the minimum value of pleft and pright. The Fisher two-tailed p-value for a table A is defined as the sum of probabilities for all tables consistent with the marginals that are as likely as the current table. McNemar's Test This test is computed for 2X2 tables such as 𝐴 = ( 𝑎 𝑏 𝑐 𝑑 ) The value of the test-statistic is
  • 6. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 6 𝜒2 = (𝑏 − 𝑐)2 𝑏 + 𝑐 The statistic is asymptotically distributed like a chi-squared distribution with 1 degree of freedom. Edwards Continuity Correction The value of the test-statistic is 𝜒2 = (𝑚𝑎𝑥{0, |𝑏 − 𝑐| − 1})2 𝑏 + 𝑐 When |𝑏 − 𝑐| − 1 is below zero, the statistic is zero. The statistic is asymptotically distributed like a chi-squared distribution with 1 degree of freedom. McNemar Exact Binomial Assuming that b < c. Let be n = b + c, and B(x, n, p) the binomial distribution Two − sided p − value = 2 ∗ (one − sided p − value) = 2 ∗ ∑ 𝐵(𝑥, 𝑛, 0.5) 𝑏 𝑥=0 = 2 ∗ ∑ ( 𝑛 𝑥 ) ∗ 0.5 𝑥 ∗ 0.5 𝑛−𝑥 𝑏 𝑥=0 = 2 ∗ 1 2 𝑛 ∗ ∑ ( 𝑛 𝑥 ) 𝑏 𝑥=0 If b = c, the exact p-value equals 1.0. Mid-P McNemar Test Let be n = b + c. Assuming that b < c. Mid − P value = 2 ∗ ∑ 𝐵(𝑥, 𝑛, 0.5) 𝑏 𝑥=0 − 𝐵(𝑏, 𝑛, 0.5) = 2 ∗ 1 2 𝑛 ∗ ∑ ( 𝑛 𝑥 ) − ( 𝑛 𝑏 ) ∗ 1 2 𝑛 𝑏 𝑥=0 If b = c, the mid p-value is 1.0 − 1 2 ( 𝑛 𝑏 ) ∗ 1 2 𝑛 Bowker’s Test of Symmetry This test is computed for m-by-m square matrix as: 𝐵𝑊 = ∑ ∑ (𝑛𝑖𝑗 − 𝑛𝑗𝑖)2 𝑛𝑖𝑗 + 𝑛𝑗𝑖 𝑖−1 𝑗=1 𝑚−1 𝑖=1 For large samples, BW has an asymptotic chi-square distribution with M*(M - 1)/2 – R degrees of freedom under the null hypothesis of symmetry, where R is the number of off-diagonal cells with nij + nji = 0.
  • 7. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 7 Risk Test Let be Risk Factor Disease status Cohort = Present Cohort = Absent Present a b Absent c d Odds ratio The odds ratio (Risk Factor = Present / Risk Factor = Absent) is computed as: 𝑂𝑅 = 𝑎 𝑏⁄ 𝑐 𝑑⁄ The distribution of the log odds ratio is approximately normal with: 𝜒 ~ 𝑁(log(𝑂𝑅) , 𝜎2 ) The standard error for the log odds ratio is approximately 𝑆𝐸 = √ 1 𝑎 + 1 𝑏 + 1 𝑐 + 1 𝑑 The 95% confidence interval for the odds ratio is computed as [exp(log(𝑂𝑅) − 𝑧0.025 ∗ 𝑆𝐸) ; exp(log(𝑂𝑅) + 𝑧0.025 ∗ 𝑆𝐸)] To test the hypothesis that the population odds ratio equals one, is computed the two-sided p-value as 𝑠𝑖𝑔𝑛𝑖𝑓𝑖𝑐𝑎𝑛𝑐𝑒 (2 − 𝑠𝑖𝑑𝑒𝑑) = 2 ∗ 𝑃(𝑧 ≤ −|log(𝑂𝑅)| 𝑆𝐸 ) Source: https://en.wikipedia.org/wiki/Odds_ratio Relative Risk The relative risk (for cohort Disease status = Present) is computed as 𝑅𝑅 = 𝑎 𝑎 + 𝑏⁄ 𝑐 𝑐 + 𝑑⁄ The distribution of the log relative risk is approximately normal with: 𝜒 ~ 𝑁(log(𝑂𝑅) , 𝜎2 )
  • 8. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 8 The standard error for the log relative risk is approximately 𝑆𝐸 = √ 1 𝑎 + 1 𝑏 − 1 𝑎 + 𝑏 − 1 𝑐 + 𝑑 The 95% confidence interval for the relative risk is computed as [exp(log(𝑅𝑅) − 𝑧0.025 ∗ 𝑆𝐸) ; exp(log(𝑅𝑅) + 𝑧0.025 ∗ 𝑆𝐸)] To test the hypothesis that the population relative risk equals one, is computed the two-sided p-value as 𝑠𝑖𝑔𝑛𝑖𝑓𝑖𝑐𝑎𝑛𝑐𝑒 (2 − 𝑠𝑖𝑑𝑒𝑑) = 2 ∗ 𝑃(𝑧 ≤ −|log(𝑅𝑅)| 𝑆𝐸 ) The relative risk (for cohort Disease status = Absent) is computed as 𝑅𝑅 = 𝑏 𝑎 + 𝑏⁄ 𝑑 𝑐 + 𝑑⁄ Epidemiology Risk All the parameters are computed for cohort Disease status = Present. Attributable risk, represents how much the risk factor increase/decrease the risk of disease 𝐴𝑅 = 𝑎 𝑎 + 𝑏 − 𝑐 𝑐 + 𝑑 If AR > 0 there an increase of the risk. If AR < 0 there is a reduction of the risk. Relative Attributable Risk 𝑅𝑅 = 𝑎 𝑎 + 𝑏 − 𝑐 𝑐 + 𝑑 𝑐 𝑐 + 𝑑 = 𝐴𝑅 𝑐 𝑐 + 𝑑 Number Needed to Harm 𝑁𝑁𝐻 = 1 𝑎 𝑎 + 𝑏 − 𝑐 𝑐 + 𝑑 = 1 𝐴𝑅 The number needed to harm (NNH) is an epidemiological measure that indicates how many patients on average need to be exposed to a risk-factor over a specific period to cause harm in an average of one patient who would not otherwise have been harmed. A negative number would not be presented as a NNH, rather, as the risk factor is not harmful, it is expressed as a number needed to treat (NNT) or number needed to avoid to expose to risk.
Attributable risk per unit:

$$ARP = \frac{RR - 1}{RR}$$

Preventive fraction:

$$PF = 1 - RR$$

The etiologic fraction is the proportion of cases in which the exposure has played a causal role in disease development:

$$EF = \frac{a - c}{a}$$

Similar parameters are computed for cohort Disease status = Absent.

Source: https://en.wikipedia.org/wiki/Relative_risk

Cohen's Kappa Test

Given a k-by-k square matrix that collects the scores of two raters who each classify N items into k mutually exclusive categories, Cohen's kappa coefficient is:

$$\hat{\kappa} = \frac{p_o - p_e}{1 - p_e}$$

where

$$p_o = \sum_{i=1}^{k} \frac{n_{ii}}{N} = \sum_{i=1}^{k} p_{ii} \quad and \quad p_e = \sum_{i=1}^{k} p_{i.}\, p_{.i}$$

with

$$p_{ij} = \frac{n_{ij}}{N}, \qquad p_{i.} = \sum_{j=1}^{k} \frac{n_{ij}}{N}, \qquad p_{.j} = \sum_{i=1}^{k} \frac{n_{ij}}{N}$$

The asymptotic variance is computed by:

$$var(\hat{\kappa}) = \frac{1}{N(1-p_e)^4} \left\{ \sum_{i=1}^{k} p_{ii}\left[(1-p_e) - (p_{.i} + p_{i.})(1-p_o)\right]^2 + (1-p_o)^2 \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} p_{ij}(p_{.i} + p_{j.})^2 - (p_o p_e - 2p_e + p_o)^2 \right\}$$
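The kappa point estimate defined at the start of this section is just a few sums over the agreement table; before turning to the variance, a minimal sketch (function name and example table are illustrative):

```python
def cohens_kappa(table):
    """Cohen's kappa from a k-by-k agreement table (list of rows)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(k)) / n                      # observed agreement
    row = [sum(table[i][j] for j in range(k)) / n for i in range(k)]  # p_i.
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]  # p_.j
    p_e = sum(row[i] * col[i] for i in range(k))                      # chance agreement
    return (p_o - p_e) / (1 - p_e)

kappa = cohens_kappa([[20, 5], [10, 15]])
print(kappa)
```

For this table the observed agreement is 0.7 against a chance agreement of 0.5, giving kappa = 0.4.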
The formula is given by Fleiss, Cohen, and Everitt (1969), and modified by Fleiss (1981). The asymptotic standard error is the square root of the value given above. This standard error and the standard normal distribution N(0,1) are used to compute confidence intervals:

$$\hat{\kappa} \pm z_{\alpha/2}\sqrt{var(\hat{\kappa})}$$

To compute an asymptotic test for the kappa coefficient, ISSTATS uses a standardized test statistic T which has an asymptotic standard normal distribution under the null hypothesis that kappa equals zero (H0: κ = 0). The standardized test statistic is computed as:

$$T = \frac{\hat{\kappa}}{\sqrt{var_0(\hat{\kappa})}} \approx N(0,1)$$

where the variance of the kappa coefficient under the null hypothesis is:

$$var_0(\hat{\kappa}) = \frac{1}{N(1-p_e)^2} \left\{ p_e + p_e^2 - \sum_{i=1}^{k} p_{.i}\, p_{i.}(p_{.i} + p_{i.}) \right\}$$

Refer to Fleiss (1981).

Source: https://v8doc.sas.com/sashtml/stat/chap28/sect26.htm

Nominal by Nominal Measures of Association

Contingency Coefficient

The contingency coefficient is a measure of association between two nominal variables, giving a value between 0 and 1:

$$C = \sqrt{\frac{\chi^2}{\chi^2 + N}}$$

where
- χ² is Pearson's cumulative test statistic.
- N is the total sample size.

The significance is based on the Pearson χ² statistic, which asymptotically follows a χ² distribution with (r − 1)(c − 1) degrees of freedom.

Standardized Contingency Coefficient
If X and Y have the same number of categories (r = c), the maximum value of the contingency coefficient is calculated as:

$$c_{max} = \sqrt{\frac{r-1}{r}}$$

If X and Y have a differing number of categories (r ≠ c), the maximum value of the contingency coefficient is calculated as:

$$c_{max} = \sqrt[4]{\frac{(r-1)(c-1)}{r \cdot c}}$$

The standardized contingency coefficient is calculated as the ratio:

$$c_{Standardized} = \frac{C}{c_{max}}$$

which varies between 0 and 1, with 0 indicating independence and 1 complete dependence.

Phi Coefficient

The phi coefficient is a measure of association for two nominal variables:

$$\Phi = \sqrt{\frac{\chi^2}{N}}$$

where
- χ² is Pearson's cumulative test statistic.
- N is the total sample size.

The significance is based on the Pearson χ² statistic, which asymptotically follows a χ² distribution with (r − 1)(c − 1) degrees of freedom.

Cramer's V

Cramer's V is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive):

$$V = \sqrt{\frac{\chi^2 / N}{\min\{r-1,\ c-1\}}}$$

where
- χ² is Pearson's cumulative test statistic.
- N is the total sample size.

The significance is based on the Pearson χ² statistic, which asymptotically follows a χ² distribution with (r − 1)(c − 1) degrees of freedom.
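All three coefficients above are simple transforms of Pearson's χ². A minimal sketch (function names and the example table are illustrative):

```python
def chi_square_stat(table):
    """Pearson's chi-square statistic for an r-by-c contingency table."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    chi2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))
    return n, chi2

def nominal_association(table):
    """Contingency coefficient, phi and Cramer's V from a table."""
    n, chi2 = chi_square_stat(table)
    r, c = len(table), len(table[0])
    C = (chi2 / (chi2 + n)) ** 0.5              # contingency coefficient
    phi = (chi2 / n) ** 0.5                     # phi coefficient
    V = (chi2 / n / min(r - 1, c - 1)) ** 0.5   # Cramer's V
    return C, phi, V

print(nominal_association([[10, 20], [30, 40]]))
```

For a 2×2 table min{r−1, c−1} = 1, so phi and Cramer's V coincide, and C is always slightly smaller than phi.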
Tschuprow's T

Tschuprow's T is a measure of association between two nominal variables, giving a value between 0 and 1 (inclusive):

$$T = \sqrt{\frac{\chi^2 / N}{\sqrt{(r-1)(c-1)}}}$$

Lambda

Asymmetric lambda, λ(C/R) or column-variable-dependent lambda, is interpreted as the probable improvement in predicting the column variable Y given knowledge of the row variable X. The range of asymmetric lambda is [0, 1]. Asymmetric lambda (C/R) is computed as:

$$\lambda(C/R) = \frac{\sum_i r_i - r}{N - r}$$

The asymptotic variance is:

$$var(\lambda(C/R)) = \frac{\left(N - \sum_i r_i\right)\left(\sum_i r_i + r - 2\sum_i (r_i \mid l_i = l)\right)}{(N - r)^3}$$

where

$$r_i = \max_j \{n_{ij}\}, \qquad r = \max_j \{n_{.j}\}, \qquad c_j = \max_i \{n_{ij}\}, \qquad c = \max_i \{n_{i.}\}$$

The values of l_i and l are determined as follows. Denote by l_i the unique value of j such that r_i = n_ij, and let l be the unique value of j such that r = n_.j. Because of the uniqueness assumptions, ties in the frequencies or in the marginal totals must be broken in an arbitrary but consistent manner. In case of ties, l is defined as the smallest value of j such that r = n_.j. For those columns containing a cell (i, j) for which n_ij = r_i = c_j, cs_j records the row in which c_j is assumed to occur. Initially cs_j is set equal to −1 for all j. Beginning with i = 1, if there is at least one value j such that n_ij = r_i = c_j, and if cs_j = −1, then l_i is defined to be the smallest such value of j, and cs_j is set equal to i. Otherwise, if n_il = r_i, then l_i is defined to be equal to l. If neither condition is true, then l_i is taken to be the smallest value of j such that n_ij = r_i.

The asymptotic standard error is the square root of the asymptotic variance.

The formulas for lambda asymmetric λ(R/C) can be obtained by interchanging the indices.
$$\lambda(R/C) = \frac{\sum_j c_j - c}{N - c}$$

The symmetric lambda is the average of the two asymmetric lambdas, λ(C/R) and λ(R/C). Its range is [0, 1]. Lambda symmetric is computed as:

$$\lambda = \frac{\sum_i r_i + \sum_j c_j - r - c}{2N - r - c}$$

The asymptotic variance is:

$$var(\lambda) = \frac{1}{w^4} \left\{ wvy - 2w^2\left[N - \sum_i \sum_j (n_{ij} \mid j = l_i,\ i = k_j)\right] - 2v^2(N - n_{kl}) \right\}$$

where

$$w = 2N - r - c, \qquad v = 2N - \sum_i r_i - \sum_j c_j,$$

$$x = \sum_i (r_i \mid l_i = l) + \sum_j (c_j \mid k_j = k) + r_k + c_l, \qquad y = 8N - w - v - 2x$$

The definitions of l and l_i are given in the previous section. The values k and k_j are defined in a similar way for lambda asymmetric (R/C).

Uncertainty Coefficient

The uncertainty coefficient U(C/R), or column-variable-dependent U, measures the proportion of uncertainty (entropy) in the column variable Y that is explained by the row variable X. Its range is [0, 1]. The uncertainty coefficient is computed as:

$$U(C/R) = U_{column\ variable\ dependent} = \frac{H(X) + H(Y) - H(XY)}{H(Y)}$$

where

$$H(X) = -\sum_i \frac{n_{i.}}{N}\ln\left(\frac{n_{i.}}{N}\right), \qquad H(Y) = -\sum_j \frac{n_{.j}}{N}\ln\left(\frac{n_{.j}}{N}\right), \qquad H(XY) = -\sum_i \sum_j \frac{n_{ij}}{N}\ln\left(\frac{n_{ij}}{N}\right)$$

The asymptotic variance is:

$$var(U(C/R)) = \frac{1}{N^2\,[H(Y)]^4} \sum_i \sum_j n_{ij} \left\{ H(Y)\ln\left(\frac{n_{ij}}{n_{i.}}\right) + \left[H(X) - H(XY)\right]\ln\left(\frac{n_{.j}}{N}\right) \right\}^2$$

The asymptotic standard error is the square root of the asymptotic variance.
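The U(C/R) point estimate above only needs the three entropies. A minimal sketch (function name and example table are illustrative):

```python
import math

def uncertainty_coefficient(table):
    """U(C/R): proportion of entropy in the column variable explained by
    the row variable, from a contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    hx = -sum(p / n * math.log(p / n) for p in rows if p > 0)                 # H(X)
    hy = -sum(p / n * math.log(p / n) for p in cols if p > 0)                 # H(Y)
    hxy = -sum(v / n * math.log(v / n) for row in table for v in row if v > 0)  # H(XY)
    return (hx + hy - hxy) / hy

u = uncertainty_coefficient([[30, 10], [10, 30]])
print(u)
```

For an independent table (all cells equal) the coefficient is 0; for the diagonal-heavy table above it is about 0.19.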
The formulas for the uncertainty coefficient U(R/C) can be obtained by interchanging the indices.

The symmetric uncertainty coefficient is computed as:

$$U = \frac{2\left[H(X) + H(Y) - H(XY)\right]}{H(X) + H(Y)}$$

The asymptotic variance is:

$$var(U) = \frac{4 \sum_i \sum_j n_{ij} \left\{ H(XY)\ln\left(\dfrac{n_{i.}\, n_{.j}}{N^2}\right) - \left[H(X) + H(Y)\right]\ln\left(\dfrac{n_{ij}}{N}\right) \right\}^2}{N^2\left[H(X) + H(Y)\right]^4}$$

The asymptotic standard error is the square root of the asymptotic variance.

Ordinal by Ordinal Measures of Association

Let n_ij denote the observed frequency in cell (i, j) of an I×J contingency table. Let N be the total frequency, and:

$$A_{ij} = \sum_{k<i} \sum_{l<j} n_{kl} + \sum_{k>i} \sum_{l>j} n_{kl}$$

$$D_{ij} = \sum_{k>i} \sum_{l<j} n_{kl} + \sum_{k<i} \sum_{l>j} n_{kl}$$

$$P = \sum_i \sum_j n_{ij} A_{ij} \quad and \quad Q = \sum_i \sum_j n_{ij} D_{ij}$$

Gamma Coefficient

The gamma (G) statistic is based only on the number of concordant and discordant pairs of observations. It ignores tied pairs (that is, pairs of observations that have equal values of X or equal values of Y). Gamma is appropriate only when both variables lie on an ordinal scale. The range of gamma is [-1, 1]. If the row and column variables are independent, then gamma tends to be close to zero. Gamma is estimated by:

$$G = \frac{P - Q}{P + Q}$$

The asymptotic variance is
$$var(G) = \frac{16}{(P+Q)^4} \left\{ \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}\,(Q A_{ij} - P D_{ij})^2 \right\}$$

The asymptotic standard error is the square root of the asymptotic variance.

The variance under the null hypothesis that gamma equals zero is computed as:

$$var_0(G) = \frac{4}{(P+Q)^2} \left\{ \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}\, d_{ij}^2 - \frac{(P-Q)^2}{N} \right\}$$

where d_ij = A_ij − D_ij.

The asymptotic standard error under the null hypothesis that gamma equals zero is the square root of this variance.

Kendall's Tau-b

Kendall's tau-b is similar to gamma except that tau-b uses a correction for ties. Tau-b is appropriate only when both variables lie on an ordinal scale. The range of tau-b is [-1, 1]. Kendall's tau-b is estimated by:

$$\tau_b = \frac{P - Q}{w}$$

where

$$w_r = N^2 - \sum_i n_{i.}^2, \qquad w_c = N^2 - \sum_j n_{.j}^2, \qquad w = \sqrt{w_r w_c}$$

The asymptotic variance is:

$$var(\tau_b) = \frac{1}{w^4} \left\{ \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}\,(2w\, d_{ij} + \tau_b v_{ij})^2 - N^3 \tau_b^2 (w_r + w_c)^2 \right\}$$

where

$$v_{ij} = w_c\, n_{i.} + w_r\, n_{.j}$$

The asymptotic standard error is the square root of the asymptotic variance.

The variance under the null hypothesis that tau-b equals zero is computed as
$$var_0(\tau_b) = \frac{4}{w_r w_c} \left\{ \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}\, d_{ij}^2 - \frac{(P-Q)^2}{N} \right\}$$

The asymptotic standard error under the null hypothesis that tau-b equals zero is the square root of this variance.

Stuart-Kendall's Tau-c

Stuart-Kendall's tau-c makes an adjustment for table size in addition to a correction for ties. Tau-c is appropriate only when both variables lie on an ordinal scale. The range of tau-c is [-1, 1]. Stuart-Kendall's tau-c is estimated by:

$$\tau_c = \frac{m(P - Q)}{N^2(m - 1)}$$

where m = min{I, J}. The asymptotic variance is:

$$var(\tau_c) = \frac{4m^2}{N^4(m-1)^2} \left\{ \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}\, d_{ij}^2 - \frac{(P-Q)^2}{N} \right\}$$

The asymptotic standard error is the square root of the asymptotic variance. The variance under the null hypothesis that tau-c equals zero is the same as the asymptotic variance.

Somers' D

Somers' D(C/R) and Somers' D(R/C) are asymmetric modifications of tau-b. C/R indicates that the row variable X is regarded as the independent variable and the column variable Y is regarded as dependent. Similarly, R/C indicates that the column variable Y is regarded as the independent variable and the row variable X is regarded as dependent. Somers' D differs from tau-b in that it uses a correction only for pairs that are tied on the independent variable. Somers' D is appropriate only when both variables lie on an ordinal scale. The range of Somers' D is [-1, 1]. Somers' D (column variable dependent) is computed as:

$$D(C/R) = D_{column\ variable\ dependent} = \frac{P - Q}{w_r}$$

The asymptotic variance is
$$var(D(C/R)) = \frac{4}{w_r^4} \left\{ \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}\left[w_r\, d_{ij} - (P-Q)(N - n_{i.})\right]^2 \right\}$$

The asymptotic standard error is the square root of the asymptotic variance.

The variance under the null hypothesis that D(C/R) equals zero is computed as:

$$var_0(D(C/R)) = \frac{4}{w_r^2} \left\{ \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}\, d_{ij}^2 - \frac{(P-Q)^2}{N} \right\}$$

The asymptotic standard error under the null hypothesis that D(C/R) equals zero is the square root of this variance.

Formulas for Somers' D(R/C) are obtained by interchanging the indices.

The symmetric version of Somers' d is:

$$d = \frac{P - Q}{\dfrac{w_r + w_c}{2}}$$

The standard error is:

$$ASE(d) = \frac{2\sigma_{\tau_b}\, w}{w_r + w_c}$$

where σ_τb is the asymptotic standard error of Kendall's tau-b. The variance under the null hypothesis that d equals zero is computed as:

$$var_0(d) = \frac{16}{(w_r + w_c)^2} \left\{ \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}\, d_{ij}^2 - \frac{(P-Q)^2}{N} \right\}$$

The asymptotic standard error under the null hypothesis that d equals zero is the square root of this variance.

Confidence Bounds and One-Sided Tests

Suppose you are testing the null hypothesis H0: θ ≥ θ0 against the one-sided alternative H1: θ < θ0. Rather than give a two-sided confidence interval for θ, the more appropriate procedure is to give an upper confidence bound in this setting. This upper confidence bound has a direct relationship to the one-sided test, namely:
1. A level α test of H0: θ ≥ θ0 against the one-sided alternative H1: θ < θ0 rejects H0 exactly when the value θ0 is above the 1−α upper confidence bound.
2. A level α test of H0: θ ≤ θ0 against the one-sided alternative H1: θ > θ0 rejects H0 exactly when the value θ0 is below the 1−α lower confidence bound.

ANOVA Test

$$SS_{Total} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{Y}_{..})^2$$

$$SS_{Inter} = \sum_{i=1}^{k} n_i (\bar{Y}_{i.} - \bar{Y}_{..})^2$$

$$SS_{Intra} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{Y}_{i.})^2 = SS_{Total} - SS_{Inter}$$

DF Total = N − 1
DF Inter = k − 1
DF Intra = N − k

$$MS_{Total} = \frac{SS_{Total}}{DF_{Total}}, \qquad MS_{Inter} = \frac{SS_{Inter}}{DF_{Inter}}, \qquad MS_{Intra} = \frac{SS_{Intra}}{DF_{Intra}}$$

$$F = \frac{MS_{Inter}}{MS_{Intra}}$$

where
- F is the result of the test
- k is the number of different groups to which the sampled cases belong
- N = Σ n_i is the total sample size
- n_i is the number of cases in the i-th group
- y_ij is the value of the measured variable for the j-th case from the i-th group
- Ȳ.. is the mean of all y_ij
- Ȳ_i. is the mean of the y_ij for group i.

The test statistic has an F-distribution with DF Inter and DF Intra degrees of freedom. Thus the null hypothesis is rejected if:

$$F \ge F_{k-1,\ N-k}(1 - \alpha)$$
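The sums of squares and the F statistic above can be sketched in plain Python (function name and example data are illustrative, not part of ISSTATS):

```python
def one_way_anova(groups):
    """One-way ANOVA F statistic and degrees of freedom for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total                       # overall mean
    ss_inter = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_intra = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_inter, df_intra = k - 1, n_total - k
    f = (ss_inter / df_inter) / (ss_intra / df_intra)                   # MS_Inter / MS_Intra
    return f, df_inter, df_intra

f, df1, df2 = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f, df1, df2)
```

The p-value would then come from the F distribution with (df1, df2) degrees of freedom (e.g. via a statistical library).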
ANOVA Multiple Comparisons

Difference of Means:

$$\bar{y}_i - \bar{y}_j$$

Standard Error of the Difference of Means Estimator:

$$Std.\ Error = \sqrt{MS_{Intra} \cdot \left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

Scheffé's Method

Confidence Interval for Difference of Means:

$$CI(1-\alpha) = \bar{y}_i - \bar{y}_j \pm \sqrt{DF_{Inter} \cdot MS_{Intra} \cdot F_{DF_{Inter},\ DF_{Intra}}(1-\alpha) \cdot \left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

Source: http://en.wikipedia.org/wiki/Scheff%C3%A9%27s_method

Tukey's Range Test (HSD)

Confidence Interval for Difference of Means:

$$CI(1-\alpha) = \bar{y}_i - \bar{y}_j \pm q_{k,\ DF_{Intra}}(1-\alpha)\sqrt{\frac{MS_{Intra}}{2} \cdot \left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

where q is the studentized range distribution.

Source: https://en.wikipedia.org/wiki/Tukey%27s_range_test

Fisher's Method (LSD)

If the overall ANOVA test is not significant, you must not consider any result of the Fisher test, whether significant or not.

Confidence Interval for Difference of Means:

$$CI(1-\alpha) = \bar{y}_i - \bar{y}_j \pm t_{DF_{Intra}}(1-\alpha/2)\sqrt{MS_{Intra} \cdot \left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

where t is the Student's t distribution.

Bonferroni's Method

The family-wise significance level (FWER) is α = 1 − Confidence Level. Thus any comparison flagged by ISSTATS as significant is based on a Bonferroni Correction:
$$\alpha' = \frac{2\alpha}{k(k-1)} \qquad\qquad p' = p \cdot \frac{k(k-1)}{2}$$

where k is the number of groups.

Confidence Interval for Difference of Means:

$$CI(1-\alpha) = \bar{y}_i - \bar{y}_j \pm t_{DF_{Intra}}(1-\alpha'/2)\sqrt{MS_{Intra} \cdot \left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

where t is the Student's t distribution.

Sidak's Method

The family-wise significance level (FWER) is α = 1 − Confidence Level. So any comparison flagged by ISSTATS as significant is based on a Sidak Correction:

$$\alpha' = 1 - (1-\alpha)^{\frac{2}{k(k-1)}} \qquad\qquad p' = 1 - e^{\log(1-p)\,\frac{k(k-1)}{2}} = 1 - (1-p)^{\frac{k(k-1)}{2}}$$

where k is the number of groups.

Confidence Interval for Difference of Means:

$$CI(1-\alpha) = \bar{y}_i - \bar{y}_j \pm t_{DF_{Intra}}(1-\alpha'/2)\sqrt{MS_{Intra} \cdot \left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

where t is the Student's t distribution.

Welch's Test for Equality of Means

The test statistic, F*, is defined as follows:

$$F^* = \frac{\dfrac{\sum_{i=1}^{k} w_i(\bar{x}_i - \tilde{X})^2}{k-1}}{1 + \dfrac{2(k-2)}{k^2-1} \sum_{i=1}^{k} h_i}$$

where
- F* is the result of the test
- k is the number of different groups to which the sampled cases belong
- n_i is the number of cases in the i-th group
- w_i = n_i / S_i²
- W = Σ w_i = Σ n_i / S_i²
- X̃ = Σ w_i x̄_i / W
- h_i = (1 − w_i/W)² / (n_i − 1)

The test statistic has approximately an F-distribution with k−1 and df = (k²−1)/(3 Σ h_i) degrees of freedom. Thus the null hypothesis is rejected if:

$$F^* \ge F_{k-1,\ df}(1-\alpha)$$

Brown–Forsythe Test for Equality of Means

The test statistic, F*, is defined as follows:

$$F^* = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{X}_{..})^2}{\sum_{i=1}^{k} \left(1 - \dfrac{n_i}{N}\right) S_i^2}$$

where
- F* is the result of the test
- k is the number of different groups to which the sampled cases belong
- n_i is the number of cases in the i-th group (sample size of group i)
- N = Σ n_i is the total sample size
- X̄.. = Σ n_i x̄_i / N is the overall mean.

The test statistic has approximately an F-distribution with k−1 and df degrees of freedom, where df is obtained with the Satterthwaite (1941) approximation as:

$$\frac{1}{df} = \sum_{i=1}^{k} \frac{c_i^2}{n_i - 1} \quad with \quad c_j = \frac{\left(1 - \dfrac{n_j}{N}\right) S_j^2}{\sum_{i=1}^{k} \left(1 - \dfrac{n_i}{N}\right) S_i^2}$$

Thus the null hypothesis is rejected if:

$$F^* \ge F_{k-1,\ df}(1-\alpha)$$

Homoscedasticity Tests

Levene's Test

The test statistic, F, is defined as follows:

$$F = \frac{N-k}{k-1} \cdot \frac{\sum_{i=1}^{k} n_i(\bar{Z}_{i.} - \bar{Z}_{..})^2}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (Z_{ij} - \bar{Z}_{i.})^2}$$
where
- F is the result of the test
- k is the number of different groups to which the sampled cases belong
- N = Σ n_i is the total sample size
- n_i is the number of cases in the i-th group
- Y_ij is the value of the measured variable for the j-th case from the i-th group
- Z_ij = |Y_ij − Ȳ_i.|, where Ȳ_i. is the mean of the i-th group
- Z̄.. is the mean of all Z_ij
- Z̄_i. is the mean of the Z_ij for group i.

The test statistic has an F-distribution with k−1 and N−k degrees of freedom. Thus the null hypothesis is rejected if F ≥ F_{k−1, N−k}(1−α).

Source: http://en.wikipedia.org/wiki/Levene%27s_test

Brown–Forsythe Test for Equality of Variances

The test statistic, F, is defined as follows:

$$F = \frac{N-k}{k-1} \cdot \frac{\sum_{i=1}^{k} n_i(\bar{Z}_{i.} - \bar{Z}_{..})^2}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (Z_{ij} - \bar{Z}_{i.})^2}$$

where
- F is the result of the test
- k is the number of different groups to which the sampled cases belong
- N = Σ n_i is the total sample size
- n_i is the number of cases in the i-th group
- Y_ij is the value of the measured variable for the j-th case from the i-th group
- Z_ij = |Y_ij − Ỹ_i.|, where Ỹ_i. is the median of the i-th group
- Z̄.. is the mean of all Z_ij
- Z̄_i. is the mean of the Z_ij for group i.

The test statistic has an F-distribution with k−1 and N−k degrees of freedom. Thus the null hypothesis is rejected if F ≥ F_{k−1, N−k}(1−α).

Source: http://en.wikipedia.org/wiki/Levene%27s_test

Bartlett's Test

Bartlett's test is used to test the null hypothesis H0 that all k population variances are equal against the alternative that at least two are different. If there are k samples with sizes n_i and sample variances S_i², then Bartlett's test statistic is:

$$\chi^2 = \frac{(N-k)\ln(S_p^2) - \sum_{i=1}^{k} (n_i - 1)\ln(S_i^2)}{1 + \dfrac{1}{3(k-1)}\left(\sum_{i=1}^{k} \dfrac{1}{n_i - 1} - \dfrac{1}{N-k}\right)}$$

where
- N = Σ n_i is the total sample size
- S_p² = Σ (n_i − 1)S_i² / (N − k) is the pooled estimate of the variance.

The test statistic has approximately a chi-squared distribution with k−1 degrees of freedom. Thus the null hypothesis is rejected if χ² ≥ χ²_{k−1}(1−α).

Source: http://en.wikipedia.org/wiki/Bartlett%27s_test

Bivariate Correlation Tests

Sample Covariance

$$S_{xy} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

where N is the total sample size.

Source: http://en.wikipedia.org/wiki/Covariance#Calculating_the_sample_covariance

Sample Pearson Product-Moment Correlation Coefficient

$$r = \frac{1}{N-1} \cdot \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{S_x S_y} = \frac{S_{xy}}{S_x S_y}$$

where S_x and S_y are the sample standard deviations of the paired sample (x_i, y_i), S_xy is the sample covariance and N is the total sample size.

Source: http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#For_a_sample

Test for the Significance of the Pearson Product-Moment Correlation Coefficient

The test hypotheses are:
- H0: the sample values come from a population in which ρ = 0
- H1: the sample values come from a population in which ρ ≠ 0

The test statistic is:

$$t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}$$

where
- N is the total sample size
- r is the sample Pearson product-moment correlation coefficient
The test statistic has a Student's t distribution with N−2 degrees of freedom.

Spearman Correlation Coefficient

For each of the variables X and Y separately, the observations are sorted into ascending order and replaced by their ranks. Identical values (rank ties or value duplicates) are assigned a rank equal to the average of their positions in the ascending order of the values. Each time t observations are tied (t > 1), the quantity t³ − t is calculated and summed separately for each variable. These sums will be designated ST_x and ST_y.

For each of the N observations, the difference between the rank of X and the rank of Y is computed as:

$$d_i = Rank(X_i) - Rank(Y_i)$$

If there are no ties in either sample, Spearman's rho (ρ) is calculated as:

$$\rho = 1 - \frac{6\sum d_i^2}{N(N^2 - 1)}$$

If there are any ties in either sample, Spearman's rho (ρ) is calculated as (Siegel, 1956):

$$\rho = \frac{T_x + T_y - \sum d_i^2}{2\sqrt{T_x T_y}}$$

where

$$T_x = \frac{N(N^2-1) - ST_x}{12} \qquad T_y = \frac{N(N^2-1) - ST_y}{12}$$

If T_x or T_y is 0, the statistic is not computed.

Source: http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_nonpar_corr_spearman.htm

Test for the Significance of the Spearman's Correlation Coefficient

The test hypotheses are:
- H0: the sample values come from a population in which ρ = 0
- H1: the sample values come from a population in which ρ ≠ 0

The test statistic is:

$$t = \frac{\rho\sqrt{N-2}}{\sqrt{1-\rho^2}}$$

The test statistic has a Student's t distribution with N−2 degrees of freedom.
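Both coefficients above are easy to sketch: Pearson's r with its t statistic directly, and Spearman's rho computed equivalently as the Pearson correlation of the mid-ranks (which matches the tie-corrected formula above). Function names and example data are illustrative:

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation and the t statistic for H0: rho = 0."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return r, t

def ranks(v):
    """Ranks with ties replaced by the average of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1        # average 1-based rank of the tied block
        for idx in order[i:j + 1]:
            r[idx] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))[0]

print(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))
print(spearman_rho([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))
```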
Kendall's Tau-b Correlation Coefficient

For each of the variables X and Y separately, the observations are sorted into ascending order and replaced by their ranks. In situations where t observations are tied, the average rank is assigned. Each time t > 1, the following quantities are computed and summed over all groups of ties for each variable separately:

$$T_1 = \sum (t^2 - t), \qquad T_2 = \sum (t^2 - t)(t - 2), \qquad T_3 = \sum (t^2 - t)(2t + 5)$$

Each of the N cases is compared to the others to determine with how many cases its ranking of X and Y is concordant or discordant. The following procedure is used. For each distinct pair of cases (i, j), where i < j, the quantity:

$$d_{ij} = [Rank(X_j) - Rank(X_i)][Rank(Y_j) - Rank(Y_i)]$$

is computed. If the sign of this product is positive, the pair of observations (i, j) is concordant. If the sign is negative, the pair is discordant. The number of concordant pairs minus the number of discordant pairs is:

$$S = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} sign(d_{ij})$$

where sign(d_ij) is defined as +1 or −1 depending on the sign of d_ij. Pairs in which d_ij = 0 are ignored in the computation of S.

If there are no ties in either sample, Kendall's tau (τ) is computed as:

$$\tau = \frac{2S}{N^2 - N}$$

If there are any ties in either sample, Kendall's tau (τ) is computed as:

$$\tau = \frac{2S}{\sqrt{N^2 - N - T_{1x}}\,\sqrt{N^2 - N - T_{1y}}}$$

If the denominator is 0, the statistic is not computed.

Source: http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient#Tau-b

Test for the Significance of the Kendall's Tau-b Correlation Coefficient

The variance of S is estimated by (Kendall, 1955):
$$Var = \frac{(N^2-N)(2N+5) - T_{3x} - T_{3y}}{18} + \frac{T_{2x} T_{2y}}{9(N^2-N)(N-2)} + \frac{T_{1x} T_{1y}}{2(N^2-N)}$$

The significance level is obtained using:

$$Z = \frac{S}{\sqrt{Var}}$$

which, under the null hypothesis, is approximately distributed as a standard normal when the variables are statistically independent.

Sources:
http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient#Significance_tests
http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_nonpar_corr_kendalls.htm

Parametric Value at Risk

Value at Risk of a Single Asset

Given the time series of daily return rates for an asset, let μ be the daily mean of the return rates and σ² the daily variance of the return rates. Let P be the position, holding or investment in the asset.

One-day Expected Return is:

$$ER = P\mu$$

The Standard Deviation or Volatility is the square root of the Variance:

$$\sigma = \sqrt{\sigma^2}$$

One-day Value at Risk is:

$$VaR_{1-\alpha} = -(\mu + z_\alpha \sigma)P$$

where z_α is the left-tail α quantile of the standard normal distribution.

Total Value at Risk for n trading days is:

$$VaR_{1-\alpha}^{n\ days} = VaR_{1-\alpha} \cdot \sqrt{n} = -(\mu + z_\alpha \sigma)P\sqrt{n}$$

Portfolio Value at Risk

Given the time series of daily return rates on different assets, let μ_i be the daily mean of the return rates for the i-th asset, σ_i² the daily variance of the return rates for the i-th asset, and σ_i the daily standard deviation (or volatility) of the return rates for the i-th asset. The covariance of the daily return rates of the i-th and j-th assets is σ_ij. All parameters are unbiased estimates. Given the holdings, positions or investments in each of these assets, P_i, the total position is:
$$P = \sum_{i=1}^{N} P_i$$

The weighting of each position is:

$$w_i = \frac{P_i}{P}$$

The weighted mean of the portfolio is:

$$\mu_P = \sum_{i=1}^{N} w_i \mu_i = \frac{1}{P} \sum_{i=1}^{N} P_i \mu_i$$

One-day Expected Return of the portfolio is the weighted mean of the portfolio multiplied by the total position:

$$ER = P\mu_P = P \sum_{i=1}^{N} w_i \mu_i = \sum_{i=1}^{N} P_i \mu_i$$

The Portfolio Variance is:

$$\sigma_P^2 = W^T M W$$

where W is the vector of weights and M is the covariance matrix. The i-th item in the diagonal of M is the daily variance of the return rates for the i-th asset; the items outside the diagonal are covariances. Portfolio Variance can also be computed as:

$$\sigma_P^2 = \frac{1}{P^2}\, X^T M X$$

where X is the vector of positions.

The Portfolio Standard Deviation or Portfolio Volatility is the square root of the Portfolio Variance:

$$\sigma_P = \sqrt{\sigma_P^2}$$

One-day Value at Risk is:

$$VaR_{1-\alpha} = -(\mu_P + z_\alpha \sigma_P)P$$
where z_α is the left-tail α quantile of the standard normal distribution.

Total Value at Risk for n trading days is:

$$VaR_{1-\alpha}^{n\ days} = VaR_{1-\alpha} \cdot \sqrt{n} = -(\mu_P + z_\alpha \sigma_P)P\sqrt{n}$$

VaR_{1−α}^{n days} is the minimum potential loss that a portfolio can suffer in the α% worst cases over n days.

About the signs: a positive value of VaR is an expected loss. A negative VaR would imply the portfolio has a high probability of making a profit.

Source: http://www.jpmorgan.com/tss/General/Risk_Management/1159360877242

Remark: Some texts about VaR express the covariance as σ_ij = σ_i σ_j ρ_ij, where ρ_ij is the correlation coefficient.

Remark: Sometimes VaR is taken to be the Portfolio Volatility multiplied by the position, as the expected return is supposed to be approximately zero. ISSTATS does NOT equate VaR with Portfolio Volatility and does NOT assume the expected return is zero.

Marginal Value at Risk

Marginal Value at Risk is the change in portfolio VaR resulting from a marginal change in the currency (dollar, euro, ...) position in component i:

$$MVaR_i = \frac{\partial VaR}{\partial P_i}$$

Assuming the linearity of the risk in the parametric approach, the vector of Marginal Value at Risk is:

$$\begin{bmatrix} MVaR_1 \\ \vdots \\ MVaR_n \end{bmatrix} = -\left( \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_n \end{bmatrix} + \frac{z_\alpha}{\sigma_P}\, M W \right) = -\left( \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_n \end{bmatrix} + \frac{z_\alpha}{P\,\sigma_P}\, M X \right)$$

where M is the covariance matrix, W the vector of weights and X the vector of positions.

Total Marginal Value at Risk for n trading days is:

$$MVaR_i^{n\ days} = MVaR_i \cdot \sqrt{n}$$

Component Value at Risk

Component Value at Risk is a partition of the portfolio VaR that indicates how much the VaR would change if the given component were deleted.
$$CVaR_i = \frac{\partial VaR}{\partial P_i} P_i = MVaR_i \cdot P_i$$

Note that the sum of all component VaRs (CVaR) is the VaR of the entire portfolio:

$$VaR = \sum_{i=1}^{N} CVaR_i = \sum_{i=1}^{N} \frac{\partial VaR}{\partial P_i} P_i = \sum_{i=1}^{N} MVaR_i \cdot P_i$$

Total Component Value at Risk for n trading days is:

$$CVaR_i^{n\ days} = CVaR_i \cdot \sqrt{n}$$

Source: http://www.math.nus.edu.sg/~urops/Projects/valueatrisk.pdf

Incremental Value at Risk

Incremental VaR of a given position is the VaR of the portfolio with the given position minus the VaR of the portfolio without the given position. It measures the change in VaR due to a new position in the portfolio:

IVaR(a) = VaR(P) − VaR(P − a)

Source: http://www.jpmorgan.com/tss/General/Portfolio_Management_With_Incremental_VaR/1259104336084

Conditional Value at Risk, Expected Shortfall, Expected Tail Loss or Average Value at Risk

ES_{1−α}^{1 day} is the expected value of the loss of the portfolio in the α% worst cases in one day. Under the multivariate normal assumption, Expected Shortfall, also known as Expected Tail Loss (ETL), Conditional Value-at-Risk (CVaR), Average Value at Risk (AVaR) and Worst Conditional Expectation, is computed by:

$$ES(-VaR) = -E(x \mid x < -VaR) \cdot P = -\left[\mu + E(z \mid z < z_\alpha)\sigma\right] \cdot P = -\left[\mu + \frac{\int_{-\infty}^{z_\alpha} t\, e^{-t^2/2}\, dt}{\alpha\sqrt{2\pi}}\,\sigma\right] \cdot P = -\left(\mu - \frac{e^{-z_\alpha^2/2}}{\alpha\sqrt{2\pi}}\,\sigma\right) \cdot P$$

where z_α is the left-tail α quantile of the standard normal distribution.

About the sign: because VaR is given by ISSTATS with a negative sign, as J.P. Morgan recommends, its original value is used to perform the calculations (−VaR = μ + z_α σ). Once the ES is computed, it is reported with the sign changed. That means a positive value of ES is an expected loss; a negative value of ES would imply the portfolio has a high probability of making a profit even in the worst cases.

Source: http://www.imes.boj.or.jp/english/publication/mes/2002/me20-1-3.pdf
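The one-day VaR and ES of a single asset follow directly from the closed forms above. A minimal sketch (function names and example figures are illustrative; the bisection-based normal quantile is a stand-in for a statistical library's routine):

```python
import math

def norm_ppf(p):
    """Standard normal quantile via bisection on the erf-based CDF."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def var_es(mu, sigma, position, alpha=0.05, days=1):
    """Parametric VaR and Expected Shortfall (positive values = expected loss)."""
    z = norm_ppf(alpha)                      # left-tail alpha quantile
    scale = position * math.sqrt(days)       # sqrt-of-time scaling
    var = -(mu + z * sigma) * scale
    es = -(mu - math.exp(-z * z / 2) / (alpha * math.sqrt(2 * math.pi)) * sigma) * scale
    return var, es

var1, es1 = var_es(mu=0.0005, sigma=0.02, position=1_000_000, alpha=0.05)
print(var1, es1)
```

As expected, ES exceeds VaR at the same confidence level, since it averages the losses beyond the VaR threshold.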
Exponentially Weighted Moving Average (EWMA) Forecast

Given a series of k daily return rates {r_1, ..., r_k} computed as continuously compounded returns:

$$r_i = \ln\left(\frac{s_i}{s_{i-1}}\right)$$

where r_1 corresponds to the earliest date in the series and r_k to the latest or most recent date. Assuming k > 50 and that the sample mean of daily returns is zero, the EWMA estimate of the one-day variance for a given sequence of k returns is:

$$\sigma^2 = (1-\lambda) \sum_{i=0}^{k-1} \lambda^i r_{k-i}^2$$

where 0 < λ < 1 is the decay factor. The one-day volatility is:

$$\sigma = \sqrt{\sigma^2}$$

For horizons greater than one day, the T-period (i.e., over T days) volatility forecast is:

$$\sigma^{T\ days} = \sigma\sqrt{T}$$

For two return series, assuming that both averages are zero, the EWMA estimate of the one-day covariance for a given sequence of k returns is given by:

$$cov_{1,2} = \sigma_{1,2} = (1-\lambda) \sum_{i=0}^{k-1} \lambda^i r_{1,k-i}\, r_{2,k-i}$$

The corresponding one-day correlation forecast for the two returns is given by:

$$\rho_{1,2} = \frac{cov_{1,2}}{\sigma_1 \sigma_2} = \frac{\sigma_{1,2}}{\sigma_1 \sigma_2}$$

For horizons greater than one day, the T-period (i.e., over T days) covariance forecast is:

$$cov_{1,2}^{T\ days} = \sigma_{1,2}\, T$$

Source: http://pascal.iseg.utl.pt/~aafonso/eif/rm/TD4ePt_2.pdf

Value at Risk of a single asset, Portfolio Value at Risk, Marginal Value at Risk, Component Value at Risk and Incremental Value at Risk by the EWMA method: see methods and formulas at Parametric Value at Risk.
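The EWMA variance recursion above weights the most recent squared return with weight (1−λ) and decays older observations geometrically. A minimal sketch (function name, decay factor and example returns are illustrative; λ = 0.94 is the usual RiskMetrics daily choice):

```python
import math

def ewma_volatility(returns, lam=0.94):
    """EWMA one-day volatility forecast from a return series
    (earliest first), assuming a zero mean as in RiskMetrics."""
    k = len(returns)
    # weight lam**0 on the latest return r_k, lam**(k-1) on the earliest r_1
    var = (1 - lam) * sum(lam ** i * returns[k - 1 - i] ** 2 for i in range(k))
    return math.sqrt(var)

returns = [0.01, -0.02, 0.015, -0.005, 0.01]
print(ewma_volatility(returns))
```

Note the weights here are not renormalized to sum to one; for long series (the document assumes k > 50) the truncation error (1−λ)Σ_{i≥k} λⁱ is negligible.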
Linear Regression

Given n equations for a regression model with p predictor variables, the i-th equation is:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i$$

The n equations stacked together and written in matrix form are:

$$Y = X\beta + \varepsilon$$

where Y is the n-vector of responses, β = (β_0, β_1, ..., β_p)ᵀ, ε is the n-vector of errors, and X is the design matrix, of dimensions n-by-(p+1), whose first column is a column of ones. If the constant is not included, the column of ones is dropped and the design matrix X has dimensions n-by-p.

The estimated value of the unknown parameter β is:

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

Estimation can be carried out if, and only if, there is no perfect multicollinearity between the predictor variables.

If the constant is not included and there is a single predictor (regression through the origin), the coefficient can also be estimated by:

$$\hat{\beta}_j = \frac{\sum_{i=1}^{n} x_{ij}\, y_i}{\sum_{i=1}^{n} x_{ij}^2}$$

The standardized coefficients are:

$$\hat{\beta}_i^{st} = \hat{\beta}_i \cdot \frac{S_{x_i}}{S_y}$$

where
- S_xi is the unbiased standard deviation of the i-th predictor variable
- S_y is the unbiased standard deviation of the response variable y.

The estimate of the standard error of each coefficient is obtained by:

$$se(\hat{\beta}_i) = \sqrt{MSE \cdot (X^T X)^{-1}_{ii}}$$

where MSE is the mean squared error of the regression model. It is known that:

$$\frac{\hat{\beta}_i}{se(\hat{\beta}_i)} \sim t_{n-p-1}$$

where
- p is the number of predictor variables
- n is the total number of observations (number of rows in the design matrix).

If the constant is not included, the degrees of freedom for the t statistics are n−p.

ANOVA for Linear Regression

If the constant is included:

Component | Sum of squares | Degrees of freedom | Mean of squares   | F
Model     | SSM            | p                  | MSM = SSM/p       | MSM/MSE
Error     | SSE            | n−p−1              | MSE = SSE/(n−p−1) |
Total     | SST            | n−1                | MST = SST/(n−1)   |

with

$$SSM = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

where
- p is the number of predictor variables
- n is the total number of observations (number of rows in the design matrix)
- SSE is the sum of squared residuals
- MSE is the mean squared error of the regression model.

The test statistic has an F-distribution with p and (n−p−1) degrees of freedom. Thus the ANOVA null hypothesis is rejected if:

$$F \ge F_{p,\ n-p-1}(1-\alpha)$$

The coefficient of determination R² is defined as SSM/SST. It is output as a percentage.
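For the one-predictor case with a constant, the estimator β̂ = (XᵀX)⁻¹XᵀY and the ANOVA decomposition SST = SSM + SSE reduce to a few sums. A minimal sketch (function name and example data are illustrative, not ISSTATS output):

```python
def ols_fit(x, y):
    """Simple linear regression y = b0 + b1*x via the normal equations,
    with the ANOVA decomposition SST = SSM + SSE and R^2."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx                  # determinant of X^T X
    b1 = (n * sxy - sx * sy) / det
    b0 = (sy - b1 * sx) / n
    fitted = [b0 + b1 * a for a in x]
    ybar = sy / n
    ssm = sum((f - ybar) ** 2 for f in fitted)             # model sum of squares
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))     # residual sum of squares
    sst = sum((b - ybar) ** 2 for b in y)                  # total sum of squares
    return b0, b1, ssm / sst, ssm, sse, sst

b0, b1, r2, ssm, sse, sst = ols_fit([1, 2, 3, 4], [2, 4, 5, 8])
print(b0, b1, r2)
```

The decomposition SSM + SSE = SST holds exactly (up to floating-point error) because the model includes a constant term.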
The Adjusted R² is defined as 1 − MSE/MST. It is output as a percentage.
The square root of MSE is called the standard error of the regression, or standard error of the estimate.
If the constant is not included:
Component   Sum of squares   Degrees of freedom   Mean of squares   F
Model       SSM              p                    MSM = SSM/p       MSM/MSE
Error       SSE              n−p                  MSE = SSE/(n−p)
Total       SST              n                    MST = SST/n
being
SSM = Σi ŷi²
SSE = Σi (yi − ŷi)²
SST = Σi yi²
Unstandardized Predicted Values
The fitted values (or unstandardized predicted values) from the regression are
Ŷ = Xβ̂ = X(XᵀX)⁻¹ Xᵀ Y = HY
where H is the projection matrix (also known as the hat matrix):
H = X(XᵀX)⁻¹ Xᵀ
Standardized Predicted Values
Once the mean and unbiased standard deviation of the unstandardized predicted values have been computed, the fitted values are standardized as
ŷi^st = (ŷi − mean(ŷ)) / Sŷ
When new predictions are made outside of the design matrix, they are standardized with the same mean and standard deviation.
Prediction Intervals for Mean
Define the vector of given predictors as
Xh = (1, xh1, xh2, …, xhp)ᵀ
The standard error of the fit at Xh is given by
se(ŷh) = √( MSE · Xhᵀ (XᵀX)⁻¹ Xh )
Then the confidence interval for the mean response is
ŷh ± t(α/2; n−p−1) · se(ŷh)
Where
 X is the design matrix
 ŷh is the "fitted value" or "predicted value" of the response when the predictor values are Xh
 MSE is the mean squared error of the regression model
 n is the total number of observations
 p is the number of predictor variables
Prediction Intervals for Individuals
Define the vector of given predictors as
Xh = (1, xh1, xh2, …, xhp)ᵀ
The standard error of the prediction at Xh is given by
se(ŷh) = √( MSE · [1 + Xhᵀ (XᵀX)⁻¹ Xh] )
Then the confidence interval for individuals or new observations is
ŷh ± t(α/2; n−p−1) · se(ŷh)
Where
 X is the design matrix
 ŷh is the "fitted value" or "predicted value" of the response when the predictor values are Xh
 MSE is the mean squared error of the regression model
 n is the total number of observations
 p is the number of predictor variables
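Both intervals above can be sketched with NumPy and SciPy. This is a minimal illustration on made-up data for a single-predictor model; the names are ours, not InnerSoft STATS code:

```python
import numpy as np
from scipy import stats

# Made-up data for a model y = b0 + b1*x (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.9, 5.1, 7.0, 9.2, 10.8, 13.1])
n, p = len(x), 1

X = np.column_stack([np.ones(n), x])            # design matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
MSE = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)
XtX_inv = np.linalg.inv(X.T @ X)

# Prediction at a new point x_h = 3.5
X_h = np.array([1.0, 3.5])
y_h = X_h @ beta_hat

# Standard error of the fit:  sqrt(MSE * X_h' (X'X)^(-1) X_h)
se_mean = np.sqrt(MSE * X_h @ XtX_inv @ X_h)
# Standard error for a new individual observation (extra "1 +" term):
se_ind = np.sqrt(MSE * (1 + X_h @ XtX_inv @ X_h))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - p - 1)
ci_mean = (y_h - t_crit * se_mean, y_h + t_crit * se_mean)
ci_ind = (y_h - t_crit * se_ind, y_h + t_crit * se_ind)
```

The individual interval is always wider than the mean interval, because a new observation carries its own error variance on top of the uncertainty in the fitted mean.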
Unstandardized Residuals
The unstandardized residual for the i-th data unit is defined as
êi = yi − ŷi
In matrix notation:
Ê = Y − Ŷ = Y − HY = (I − H)Y
where H is the hat matrix and I is the n-by-n identity matrix.
Standardized Residuals
The standardized residual for the i-th data unit is defined as
êi^st = êi / √MSE
Where
 êi is the unstandardized residual for the i-th data unit
 MSE is the mean squared error of the regression model
Studentized Residuals (internally studentized residuals)
The leverage score for the i-th data unit is defined as
hii = [H]ii
the i-th diagonal element of the projection matrix (also known as the hat matrix) H = X(XᵀX)⁻¹Xᵀ, where X is the design matrix.
The studentized residual for the i-th data unit is defined as
ti = êi / √( MSE · (1 − hii) )
Where
 êi is the unstandardized residual for the i-th data unit
 MSE is the mean squared error of the regression model
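The three residual types above can be sketched together with NumPy. A minimal illustration on made-up data; variable names are ours:

```python
import numpy as np

# Made-up data for a single-predictor model with intercept.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.9, 5.1, 7.0, 9.2, 10.8, 13.1])
n, p = len(x), 1

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
y_hat = H @ y                             # fitted values
e = y - y_hat                             # unstandardized residuals (Y - HY)
MSE = np.sum(e ** 2) / (n - p - 1)

e_std = e / np.sqrt(MSE)                  # standardized residuals
h = np.diag(H)                            # leverage scores h_ii
t_int = e / np.sqrt(MSE * (1 - h))        # internally studentized residuals
```

Note that H is symmetric and idempotent (H·H = H), and with an intercept in the model the residuals sum to zero.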
Source: https://en.wikipedia.org/wiki/Studentized_residual
Centered Leverage Values
The regular leverage score for the i-th data unit is defined as
hii = [H]ii
the i-th diagonal element of the projection matrix (also known as the hat matrix) H = X(XᵀX)⁻¹Xᵀ, where X is the design matrix.
The centered leverage value for the i-th data unit is defined as
clvi = hii − 1/n
where n is the number of observations. If the intercept is not included, then the centered leverage value for the i-th data unit is defined as
clvi = hii
Source: https://en.wikipedia.org/wiki/Leverage_(statistics)
Mahalanobis Distance
The Mahalanobis distance for the i-th data unit is defined as
Di² = (n − 1)·(hii − 1/n) = (n − 1)·clvi
Where
 hii is the i-th diagonal element of the projection matrix
 n is the number of observations
If the intercept is not included, the Mahalanobis distance for the i-th data unit is defined as
Di² = n·hii
Source: https://en.wikipedia.org/wiki/Mahalanobis_distance
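Leverage and Mahalanobis distance, as defined above for the intercept-included case, can be sketched as follows. A minimal illustration on made-up data in which the last observation is deliberately far from the others:

```python
import numpy as np

# Made-up data; the last x value is far from the rest (high leverage).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])
y = np.array([2.9, 5.1, 7.0, 9.2, 10.8, 21.0])
n = len(x)

X = np.column_stack([np.ones(n), x])      # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                            # leverage scores h_ii

clv = h - 1.0 / n                         # centered leverage values
D2 = (n - 1) * clv                        # squared Mahalanobis distances
```

With an intercept, every hii is at least 1/n, so the centered leverage values are non-negative, and the trace of H equals the number of fitted coefficients (here 2).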
Cook's Distance
The Cook's distance for the i-th data unit is defined as
Di = êi²·hii / [ MSE·(p + 1)·(1 − hii)² ]
Where
 hii is the i-th diagonal element of the projection matrix
 p is the number of predictor variables
 êi is the unstandardized residual for the i-th data unit
 MSE is the mean squared error of the regression model
If the intercept is not included, the Cook's distance for the i-th data unit is defined as
Di = êi²·hii / [ MSE·p·(1 − hii)² ]
Source: https://en.wikipedia.org/wiki/Cook%27s_distance
Curve Estimation Models
Linear. Model whose equation is Y = b0 + (b1 * t). The series values are modeled as a linear function of time.
Quadratic. Model whose equation is Y = b0 + (b1 * t) + (b2 * t**2). The quadratic model can be used to model a series that "takes off" or a series that dampens.
Cubic. Model whose equation is Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3).
Quartic. Model whose equation is Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4 * t**4).
Quintic. Model whose equation is Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4 * t**4) + (b5 * t**5).
Sextic. Model whose equation is Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4 * t**4) + (b5 * t**5) + (b6 * t**6).
Logarithmic. Model whose equation is Y = b0 + (b1 * ln(t)).
Inverse. Model whose equation is Y = b0 + (b1 / t).
Power. Model whose equation is Y = b0 * (t**b1), or ln(Y) = ln(b0) + (b1 * ln(t)).
Compound. Model whose equation is Y = b0 * (b1**t), or ln(Y) = ln(b0) + (ln(b1) * t).
S-curve. Model whose equation is Y = e**(b0 + (b1/t)), or ln(Y) = b0 + (b1/t).
Logistic. Model whose equation is Y = 1 / (1/u + (b0 * (b1**t))), or ln(1/Y − 1/u) = ln(b0) + (ln(b1) * t), where u is the upper boundary value. After selecting Logistic, specify the upper boundary value to use in the regression equation. The value must be a positive number greater than the largest dependent variable value.
Growth. Model whose equation is Y = e**(b0 + (b1 * t)), or ln(Y) = b0 + (b1 * t).
Exponential. Model whose equation is Y = b0 * (e**(b1 * t)), or ln(Y) = ln(b0) + (b1 * t).
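Several of the models above are fit by linearizing with logarithms, as the second form of each equation shows. A minimal sketch on made-up data, fitting the Exponential model by regressing ln(Y) on t (names are ours, not InnerSoft STATS code):

```python
import numpy as np

# Made-up series following Y = 2 * exp(0.3 * t) exactly (illustrative only).
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = 2.0 * np.exp(0.3 * t)

# Exponential model: ln(Y) = ln(b0) + b1 * t  ->  ordinary linear fit.
X = np.column_stack([np.ones(len(t)), t])
coef = np.linalg.solve(X.T @ X, X.T @ np.log(Y))
b0, b1 = np.exp(coef[0]), coef[1]
```

The same pattern applies to the Power, Compound, S-curve, and Growth models: transform to the linear form, fit by ordinary least squares, then back-transform the coefficients.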
© Copyright InnerSoft 2017. All rights reserved.
The lost children of the Sinclair ZX Spectrum 128K (RANDOMIZE USR 123456)
innersoft@itspanish.org
innersoft@gmail.com
http://isstats.itspanish.org/