Hypothesis Testing in MLAB

Hypothesis Testing

Hypothesis testing is a major paradigm in statistics. It is closely linked with the computation of specified probability distribution functions. The basic notion is simple. We obtain a sample value v of a random variable T and we ask how probable it is that the sample value v would appear under a hypothesis H which makes the distribution of T well-defined. If the probability that, under the hypothesis H, v or a more extreme value of T appears is small, we take this as evidence that the hypothesis H is unlikely to be true. In other words, we conclude that the test of the hypothesis H has not supported H.

The random variable T is called the test statistic. If many samples of various random variables are taken, they are often combined in some, possibly quite elaborate, manner to obtain a single sample of a derived test statistic T. In other cases, the test statistic may be a vector-valued random variable with a multivariate distribution function. For example, the test statistic associated with the famous t-test for testing the hypothesis that two normally-distributed random variables have the same mean is the difference between the means, or variance-adjusted means, of two sets of sample values corresponding to the two random variables being studied.

In order to compute the probability p that, under the hypothesis H, the sample value v or a more extreme value of T appears, we must be able to compute P(T ≤ v | H), which is the distribution function of the test statistic T under the hypothesis H. We shall denote a random variable with this distribution by T_H. The hypothesis H must be such that the distribution function of T_H is known; this means that H is often of the form "there is no difference between two sets of samples", since it is generally easier to deduce the distribution of T_H in this case. Thus, H is called the null hypothesis, meaning the "no difference" hypothesis.

Suppose that the distribution function of T_H is G(x) := P(T_H ≤ x). Also suppose that the density function dG(x)/dx is a unimodal "bell-shaped" curve, so that the extreme sample values of T_H lie toward +∞ and -∞. Suppose the value v is given as a sample value of T. We may compute, for example, p = P(|T_H - E(T_H)| ≥ |v - E(T_H)|). This is a particular form of a so-called two-tailed test. p is the probability that the value v or a "more extreme" value occurs as a sample value of T, given H. If p is sufficiently small, we may reject the null hypothesis H as implausible in the face of the "evidence" v. We call such a probability p the plausibility probability of H, given v.
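As a minimal sketch of the two-tailed computation just described, assume (purely for illustration) that T_H is standard normal; the function name `two_tailed_p` and the sample value 1.96 are hypothetical choices, not taken from the text.

```python
from statistics import NormalDist

def two_tailed_p(v, dist):
    """P(|T_H - E(T_H)| >= |v - E(T_H)|) for a symmetric null distribution."""
    deviation = abs(v - dist.mean)
    # For a symmetric density both tails carry the same mass,
    # so the upper-tail probability is simply doubled.
    return 2.0 * (1.0 - dist.cdf(dist.mean + deviation))

null_dist = NormalDist(mu=0.0, sigma=1.0)  # assumed distribution of T_H
p = two_tailed_p(1.96, null_dist)
print(round(p, 4))
```

With these assumed inputs p comes out at about 0.05, so a sample value of 1.96 sits right at the conventional threshold for calling H implausible.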

If the test statistic T_H were known to be non-negative and the density function dG(x)/dx were a function, such as the exponential density function, which decreases on [0,∞), then we might use a so-called one-tail test, where we compute the probability p = P(T_H ≥ v).

In general, we may specify a particular value α as our criterion of "sufficiently small" and we may choose any subset S of the range of T such that P(T_H ∉ S) = α. Then if v ∉ S, the null hypothesis H may be judged implausible. S is called the acceptance set, because, when v ∈ S, the null hypothesis H is not rejected. The value α = P(T_H ∉ S) is the probability that we make a mistake if we reject H when v ∉ S.

How should the acceptance set S be chosen? S should be chosen to minimize the chance of making the mistake of accepting H when H is, in fact, false. But this can only be done rigorously with respect to an alternative hypothesis H_a such that the distribution of T given H_a is known. We must postulate that H and H_a are the only non-negligible possibilities. Sometimes H_a = ¬H is a suitable alternate hypothesis, but more often it is not. Given H_a, the probability that we falsely accept H when the alternate hypothesis H_a is true is P(T_{H_a} ∈ S) =: β, and we can choose S such that P(T_H ∉ S) = α while P(T_{H_a} ∈ S) = β is minimized.

The value P(T_{H_a} ∉ S) = 1 - β is called the power of the test of the null hypothesis H versus the alternate hypothesis H_a. Choosing S to minimize β is the same as choosing S to maximize the power 1 - β.
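The two error probabilities and the power can be illustrated with a toy setting that is an assumption of this sketch, not part of the text: T is N(0,1) under H and N(delta,1) under H_a, and the acceptance set is S = (-∞, c]. The name `error_rates` and the value delta = 2 are hypothetical.

```python
from statistics import NormalDist

def error_rates(c, delta):
    """Return (rejection error, acceptance error, power) for S = (-inf, c]."""
    t_h = NormalDist(0.0, 1.0)       # assumed distribution of T under H
    t_ha = NormalDist(delta, 1.0)    # assumed distribution of T under H_a
    rejection_error = 1.0 - t_h.cdf(c)   # P(T_H not in S)
    acceptance_error = t_ha.cdf(c)       # P(T_{H_a} in S)
    return rejection_error, acceptance_error, 1.0 - acceptance_error

# Pick c so that the rejection error is 0.05, then inspect the other two values.
c = NormalDist().inv_cdf(0.95)
rej, acc, power = error_rates(c, delta=2.0)
print(round(rej, 3), round(acc, 3), round(power, 3))
```

In this assumed setting the acceptance error is roughly 0.36, so the power is roughly 0.64; moving delta further from 0 lowers the acceptance error without changing the rejection error.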

If we don't care about achieving the optimal power of the test with respect to a specific alternate hypothesis, but merely wish to compute the plausibility probability that v or a more extreme sample value of T would occur given H, in a fair manner, then we may proceed as follows.

Let m = median(T_H); thus, P(T_H ≥ m) = 0.5. Now, if v < m, choose r₁ = v and r₂ as the value such that P(m < T_H < r₂) = P(v < T_H < m); otherwise choose r₂ = v and choose r₁ as the value such that P(r₁ < T_H < m) = P(m < T_H < v). Then the two-tail plausibility probability is α = 1 - P(r₁ < T_H < r₂). If v < m, α = 2P(T_H ≤ v); otherwise, if v ≥ m, α = 2(1 - P(T_H ≤ v)).

If we know that the only values more extreme than v which we wish to consider as possible are those in the same tail of the density function that v lies in, then we may compute the one-tail plausibility probability as α = P(T_H ≤ v) if v ≤ m and α = P(T_H ≥ v) if v > m.
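The median-based rules above need only the distribution function of T_H. The sketch below assumes a continuous distribution and, as an illustrative choice not made in the text, uses the unit-rate exponential distribution; all function names are hypothetical.

```python
import math

def two_tail_plausibility(v, cdf):
    """2*P(T_H <= v) if v is below the median, else 2*(1 - P(T_H <= v))."""
    g = cdf(v)
    if g < 0.5:          # equivalent to v < m for a continuous distribution
        return 2.0 * g
    return 2.0 * (1.0 - g)

def one_tail_plausibility(v, cdf):
    """Probability mass in the tail on the same side of the median as v."""
    g = cdf(v)
    return g if g <= 0.5 else 1.0 - g

def exp_cdf(x, rate=1.0):
    """Distribution function of the exponential distribution (assumed T_H)."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

v = 3.0
alpha_two = two_tail_plausibility(v, exp_cdf)
alpha_one = one_tail_plausibility(v, exp_cdf)
print(round(alpha_two, 4), round(alpha_one, 4))
```

For v = 3 this gives a two-tail value of 2e^-3, about 0.0996, and a one-tail value of e^-3, about 0.0498.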

Consider testing the null hypothesis H versus the alternate hypothesis H_a using a sample value v of the test random variable T with the acceptance set S. We have the following outcomes.

             v ∈ S                        v ∉ S
             (accept H)                   (reject H)

H true       prob. 1 - α: correct         prob. α: rejection error
H_a true     prob. β: acceptance error    prob. 1 - β: correct

α = P(T_H ∉ S) = P(we falsely reject H | H) (rejection error)

1 - α = P(T_H ∈ S) = P(we correctly accept H | H) (acceptance power)

β = P(T_{H_a} ∈ S) = P(we falsely accept H | H_a) (acceptance error)

1 - β = P(T_{H_a} ∉ S) = P(we correctly reject H | H_a) (rejection power)

Let Q be the sample-space of the test statistic T. We assumed above that either H(q)=1 for all q ∈ Q or H(q)=0 for all q ∈ Q, but this universal applicability of H or H_a may be relaxed. Suppose the hypothesis H and the alternate hypothesis H_a may each hold at different points of Q, so that H and H_a define corresponding complementary Bernoulli random variables on Q. Thus H(q)=1 if H holds at the sample point q ∈ Q and H(q)=0 if H does not hold at the sample point q; H_a is defined on Q in the same manner.

Let P({q ∈ Q | H(q) = 1}) be denoted by P(H) and let P({q ∈ Q | H_a(q) = 1}) be denoted by P(H_a). P(H) is called the incidence probability of H and P(H_a) is called the incidence probability of H_a. As before, we postulate that P(H) = 1 - P(H_a). Often P(H) is 0 or 1 as we assumed above, but it may be that 0 < P(H) < 1. In this latter case, our test of hypothesis can be taken as a test of whether H(q)=1 or H_a(q)=1 for the particular sample point q at hand for which T(q)=v; the test statistic value v may be taken as evidence serving to increase or diminish the probability of H(q)=1.

Note that we cannot compute the "posterior" probability that H(q)=1 (and that H_a(q)=0), or conversely, unless we have the "prior" incidence probability of H being true in the sample-space Q. In particular, if we assume the underlying sample-space point q is chosen at random, then:

P(H(q)=1 & T(q) ∈ S) = (1 - α)P(H)

P(H(q)=1 & T(q) ∉ S) = αP(H)

P(H(q)=0 & T(q) ∈ S) = β(1 - P(H))

P(H(q)=0 & T(q) ∉ S) = (1 - β)(1 - P(H))

If we take the occurrence of H(q)=0 as being a determination of a "positive state" of the random sample-space point q, then (1 - α)P(H) is the probability of a true negative sample, αP(H) is the probability of a false positive sample, β(1 - P(H)) is the probability of a false negative sample, and (1 - β)(1 - P(H)) is the probability of a true positive sample.
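The four joint probabilities, and the posterior probability they make available once P(H) is known, can be sketched as follows. The numeric inputs (rejection error 0.05, acceptance error 0.2, incidence probability 0.9) and the function names are hypothetical values chosen only for illustration.

```python
def outcome_table(rejection_error, acceptance_error, p_h):
    """Joint probabilities of (H holds or not) and (T in S or not)."""
    return {
        "true_negative": (1 - rejection_error) * p_h,          # H holds, accepted
        "false_positive": rejection_error * p_h,               # H holds, rejected
        "false_negative": acceptance_error * (1 - p_h),        # H fails, accepted
        "true_positive": (1 - acceptance_error) * (1 - p_h),   # H fails, rejected
    }

def posterior_h_given_reject(rejection_error, acceptance_error, p_h):
    """P(H(q)=1 | T(q) not in S) by Bayes' rule."""
    t = outcome_table(rejection_error, acceptance_error, p_h)
    return t["false_positive"] / (t["false_positive"] + t["true_positive"])

table = outcome_table(0.05, 0.2, 0.9)
post = posterior_h_given_reject(0.05, 0.2, 0.9)
print(table, round(post, 2))
```

With these assumed inputs, H still holds with probability 0.36 after a rejection, which shows why the incidence probability is needed before a test result can be read as a posterior statement.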

Now let us look at a particular case of an hypothesis test, namely the so-called F-test for equal variances of two normal populations.

Suppose X₁₁, X₁₂, ..., X_{1n₁} are independent identically-distributed random variables distributed as N(μ₁,σ₁²), and X₂₁, X₂₂, ..., X_{2n₂} are independent identically-distributed random variables distributed as N(μ₂,σ₂²). The corresponding sample-variance random variables are

S₁² = Σ_{j=1}^{n₁} (X_{1j} - X̄₁)²/(n₁ - 1) and S₂² = Σ_{j=1}^{n₂} (X_{2j} - X̄₂)²/(n₂ - 1),

where X̄₁ = Σ_{j=1}^{n₁} X_{1j}/n₁ and X̄₂ = Σ_{j=1}^{n₂} X_{2j}/n₂.

Let R denote the sample variance ratio S₁²/S₂². Then R ~ (σ₁²/σ₂²)F_{n₁-1,n₂-1}, where F_{n₁-1,n₂-1} is a random variable having the F-distribution with (n₁ - 1, n₂ - 1) degrees of freedom.

We take the null hypothesis H to be σ₁/σ₂ = 1, so that, given H, the test statistic R is known to be distributed as F_{n₁-1,n₂-1}. In order to determine the acceptance region S with maximal power for fixed α, we take the alternate hypothesis H_a to be σ₁/σ₂ = a. Then S is the interval [r₁,r₂] where P(r₁/a² ≤ F_{n₁-1,n₂-1} ≤ r₂/a²) = β is minimal, subject to 1 - P(r₁ ≤ F_{n₁-1,n₂-1} ≤ r₂) = α.

Let G(z) = P(F_{n₁-1,n₂-1} ≤ z), the distribution function of F_{n₁-1,n₂-1}, and let g(z) = G′(z), the probability density function of F_{n₁-1,n₂-1}. Then we have r₁ = root_z[g(z)g(h(z)/a²) - g(h(z))g(z/a²)] and r₂ = h(r₁), where h(z) = G^{-1}(1 - α + G(z)).

A simplified, slightly less powerful, way to choose the acceptance region S is to take S = [r₁,r₂] where r₁ is the value such that P(F_{n₁-1,n₂-1} ≤ r₁) = α/2 and r₂ is the value such that P(F_{n₁-1,n₂-1} ≥ r₂) = α/2. Another way to select the acceptance region is to take S = [1/r,r], where r is the value such that P(1/r ≤ F_{n₁-1,n₂-1} ≤ r) = 1 - α. When n₁ = n₂, the acceptance region [r₁, r₂] and the acceptance region [1/r, r] are identical.

The foregoing clearly exemplifies the fact that there is a trade-off among the acceptance error probability β, the rejection error probability α, and the sample sizes (n₁,n₂). If we wish to have a smaller α, then we must have a greater β or greater values of n₁ and n₂. Similarly, β can only be reduced if we allow α or n₁ and n₂ to increase. In general, given any two of the test parameters α, β, or (n₁,n₂), we can attempt to determine the third, although a compatible value need not exist. Actually, in most cases, a fourth variable representing the distinction between the null hypothesis and the alternate hypothesis, such as the value a above, enters the trade-off balancing relations.

For the simplified two-tailed F-test with S = [r₁, r₂], the relations among α, β, a, n₁, n₂, r₁, and r₂ are listed below.

P(F_{n₁-1,n₂-1} ≤ r₁) = α/2,

P(F_{n₁-1,n₂-1} ≥ r₂) = α/2,

P(r₁ ≤ a²F_{n₁-1,n₂-1} ≤ r₂) = β.

For the alternate case with S = [1/r, r], the relations among α, β, a, n₁, n₂, and r are:

P(1/r ≤ F_{n₁-1,n₂-1} ≤ r) = 1 - α,

P(1/r ≤ a²F_{n₁-1,n₂-1} ≤ r) = β.

In order to reduce the number of unknowns, we may postulate that n₁ and n₂ are related as n₂ = q n₁, where q is a fixed constant.

The value of β is determined above by the distribution of a²F_{n₁-1,n₂-1}, because for the F-test, the test statistic, assuming the alternate hypothesis σ₁/σ₂ = a, is the random variable a²F_{n₁-1,n₂-1}, whose distribution function is just the F-distribution function with its argument scaled by 1/a². In other cases, the distribution of the alternate hypothesis test statistic T_{H_a} is more difficult to obtain.
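Because the alternate-hypothesis statistic is just a scaled F variable, the acceptance error for the [1/r, r] test reduces to two evaluations of the F distribution function. The sketch below is one way to check this numerically outside MLAB, not a description of how MLAB computes it. It integrates the F density with Simpson's rule, a stand-in chosen to keep the example dependency-free, and it takes r = 3.50297326 for (9,14) degrees of freedom from the MLAB session shown later in this document. The function names are hypothetical.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F-distribution with (d1, d2) degrees of freedom."""
    if x <= 0.0:
        return 0.0
    log_c = (math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2)
             - math.lgamma(d2 / 2) + (d1 / 2) * math.log(d1 / d2))
    return math.exp(log_c + (d1 / 2 - 1) * math.log(x)
                    - (d1 + d2) / 2 * math.log1p(d1 * x / d2))

def f_cdf(x, d1, d2, steps=2000):
    """P(F <= x) by composite Simpson integration of the density over [0, x].
    Assumes d1 > 2, so that the density is bounded near 0."""
    if x <= 0.0:
        return 0.0
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3

def acceptance_error(a, r, d1, d2):
    """P(1/r <= a^2 F <= r), i.e. the F mass between 1/(r a^2) and r/a^2."""
    return f_cdf(r / a**2, d1, d2) - f_cdf(1 / (r * a**2), d1, d2)

r = 3.50297326   # value reported by the MLAB session for alpha = .05
betas = {a: acceptance_error(a, r, 9, 14) for a in (0.5, 1.0, 1.5, 2.0)}
for a, b in betas.items():
    print(a, round(b, 3))
```

At a = 1 the alternate and null hypotheses coincide, so the computed value should be close to 1 - α = 0.95; it falls as a moves away from 1 in either direction.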

If we take the dichotomy σ₁/σ₂ = 1 vs. σ₁/σ₂ = a as the only two possibilities, then a one-tailed F-test is most appropriate. Suppose a > 1. Then we take S = (-∞, r₂], and we have the relations P(F_{n₁-1,n₂-1} ≥ r₂) = α and P(a²F_{n₁-1,n₂-1} ≤ r₂) = β. If a < 1, then with the null hypothesis σ₁/σ₂ = 1 and the alternate hypothesis σ₁/σ₂ = a, we should take S = [r₁, ∞). Then P(F_{n₁-1,n₂-1} ≤ r₁) = α and P(a²F_{n₁-1,n₂-1} ≥ r₁) = β.

Generally, hypothesis testing is most useful when a decision is to be made. Suppose instead, for example, that we are interested in the variance ratio (σ₁/σ₂)² between two normal populations for computational purposes. Then it is preferable to use estimation techniques and confidence intervals to characterize (σ₁/σ₂)², rather than to use a hypothesis test whose only useful outcome is "significantly implausible" or "not significantly implausible" at the significance level α (which is the same as the rejection error probability).

Let r₁ satisfy P(F_{n₁-1,n₂-1} ≤ r₁) = α₁, and let r₂ satisfy P(F_{n₁-1,n₂-1} ≤ r₂) = 1 - α₂, with α₁ + α₂ = α < 1. Then P((σ₁/σ₂)²r₁ > R or R > (σ₁/σ₂)²r₂) = α₁ + α₂, and P((σ₁/σ₂)²r₁ < R and R < (σ₁/σ₂)²r₂) = 1 - α₁ - α₂ = P((σ₁/σ₂)² < R/r₁ and R/r₂ < (σ₁/σ₂)²) = P(R/r₂ < (σ₁/σ₂)² < R/r₁).

Thus, [R/r₂, R/r₁] is a (1 - α)-confidence interval, that is, an interval-valued random variable that contains the true value (σ₁/σ₂)² with probability 1 - α. The length of this interval is R(1/r₁ - 1/r₂) with r₁ = G^{-1}(α₁) and r₂ = G^{-1}(1 - α₂), where G(x) = P(F_{n₁-1,n₂-1} ≤ x) and where g(x) = G′(x), the probability density function of F_{n₁-1,n₂-1}. Setting the derivative of this length with respect to α₁ to zero, subject to α₁ + α₂ = α, gives the condition r₁²g(r₁) = r₂²g(r₂). The length is therefore minimized for n₂ > 2 by choosing α₁ and α₂ such that G^{-1}(α₁)² g(G^{-1}(α₁)) - G^{-1}(1 - α₂)² g(G^{-1}(1 - α₂)) = 0. Then α₁ = root_z(G^{-1}(z)² g(G^{-1}(z)) - G^{-1}(1 - α + z)² g(G^{-1}(1 - α + z))), α₂ = α - α₁, r₁ = G^{-1}(α₁), and r₂ = G^{-1}(1 - α₂).

Let v denote the observed sample value of R. Then [v/r₂, v/r₁] is a sample (1 - α)-confidence interval for (σ₁/σ₂)².
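A sample confidence interval of this form can be sketched as below. Two simplifications are assumptions of the sketch: it uses the equal-tail choice α₁ = α₂ = α/2 rather than the minimal-length choice, and it obtains F quantiles by bisection on a Simpson-rule distribution function instead of a library routine. The inputs v = 0.577562, n₁ = 10, and n₂ = 15 are the values reported in the MLAB session later in this document; the function names are hypothetical.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F-distribution with (d1, d2) degrees of freedom."""
    if x <= 0.0:
        return 0.0
    log_c = (math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2)
             - math.lgamma(d2 / 2) + (d1 / 2) * math.log(d1 / d2))
    return math.exp(log_c + (d1 / 2 - 1) * math.log(x)
                    - (d1 + d2) / 2 * math.log1p(d1 * x / d2))

def f_cdf(x, d1, d2, steps=2000):
    """P(F <= x) by composite Simpson integration (assumes d1 > 2)."""
    if x <= 0.0:
        return 0.0
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3

def f_quantile(p, d1, d2, lo=0.0, hi=50.0):
    """Bisection for the x with P(F <= x) = p; assumes the answer is below hi."""
    for _ in range(50):
        mid = (lo + hi) / 2
        if f_cdf(mid, d1, d2) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def variance_ratio_ci(v, n1, n2, alpha=0.05):
    """Equal-tail interval [v/r2, v/r1] for (sigma1/sigma2)^2."""
    d1, d2 = n1 - 1, n2 - 1
    r1 = f_quantile(alpha / 2, d1, d2)
    r2 = f_quantile(1 - alpha / 2, d1, d2)
    return v / r2, v / r1

low, high = variance_ratio_ci(0.577562, 10, 15)
print(round(low, 3), round(high, 3))
```

The resulting interval contains 1, which agrees with the F-test in the session not finding the equal-variance hypothesis implausible.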

The MLAB mathematical and statistical modeling system contains functions for various statistical tests and also functions to compute associated power and sample-size values. Let us consider an example focusing on the simplified F-test discussed above. We are given the following data:

x1: -1.66, 0.46, 0.15, 0.52, 0.82, -0.58, -0.44, -0.53, 0.4, -1.1
x2: 3.02, 2.88, 0.98, 2.01, 3.06, 2.95, 3.4, 2.76, 3.92, 5.02, 4, 4.89, 2.64, 3.08

We may read this data into two vectors in MLAB and test whether the two data sets x₁ and x₂ have equal variances by using the MLAB F-test function QFT, which implements the [1/r,r] simplified F-test specified above. The MLAB dialog to do this is exhibited below.

*x1 = read(x1file); x2 = read(x2file);
*qft(x1,x2)

[F-test: are the variances v1 and v2 of 2 normal populations plausibly equal?]

null hypothesis H0: v1/v2 = 1. Then v1/v2 is F-distributed with (n1-1,n2-1) degrees of freedom. n1 n2 are the sample sizes. The sample value of v1/v2 = 0.577562, n1 = 10, n2 = 15

The probability P(F < 0.577562) = 0.205421 This means that a value of v1/v2 smaller than 0.577562 arises about 20.542096 percent of the time, given H0.

The probability P(F > 0.577562) = 0.794579 This means that a value of v1/v2 larger than 0.577562 arises about 79.457904 percent of the time, given H0.

The probability: 1-P(0.577562 < F < 1.731416) = 0.377689 This means that a value of v1/v2 more extreme than 0.577562 arises about 37.768896 percent of the time, given H0.

The α = .05 simplified F-test acceptance set of the form [1/r, r] can be computed directly as follows. QFF is the name of the F-distribution function in MLAB.

* n1 = nrows(x1)-1; n2 = nrows(x2)-1;
* fct rv(a) = root(z,.001,300,qff(1/z,n1,n2)+1-qff(z,n1,n2)-a)
* r = rv(.05)
* type 1/r,r
= .285471776
R = 3.50297326

Thus, a sample of F_{9,14} will lie in [.2855, 3.503] with probability .95.
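The numbers reported in the session above can be cross-checked outside MLAB. The following sketch assumes (9,14) degrees of freedom and the reported variance ratio 0.577562. It replaces QFF with a Simpson-rule integration of the F density and replaces MLAB's root operator with plain bisection, so it is an independent check rather than a reproduction of MLAB's algorithm. All function names are hypothetical.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F-distribution with (d1, d2) degrees of freedom."""
    if x <= 0.0:
        return 0.0
    log_c = (math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2)
             - math.lgamma(d2 / 2) + (d1 / 2) * math.log(d1 / d2))
    return math.exp(log_c + (d1 / 2 - 1) * math.log(x)
                    - (d1 + d2) / 2 * math.log1p(d1 * x / d2))

def f_cdf(x, d1, d2, steps=2000):
    """P(F <= x) by composite Simpson integration (assumes d1 > 2)."""
    if x <= 0.0:
        return 0.0
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3

def symmetric_r(alpha, d1, d2, lo=1.0, hi=50.0):
    """Bisection for r with P(F < 1/r) + P(F > r) = alpha."""
    for _ in range(50):
        mid = (lo + hi) / 2
        tails = f_cdf(1 / mid, d1, d2) + 1 - f_cdf(mid, d1, d2)
        if tails > alpha:
            lo = mid   # too much mass outside [1/mid, mid]: widen the set
        else:
            hi = mid
    return (lo + hi) / 2

d1, d2 = 9, 14
ratio = 0.577562
p_lower = f_cdf(ratio, d1, d2)                           # P(F < ratio)
p_two_sided = 1 - (f_cdf(1 / ratio, d1, d2) - p_lower)   # 1 - P(ratio < F < 1/ratio)
r = symmetric_r(0.05, d1, d2)
print(p_lower, p_two_sided, 1 / r, r)
```

If the assumptions hold, the printed values should agree with the session to several digits: about 0.2054 and 0.3777 for the two probabilities, and about 0.2855 and 3.503 for the ends of the acceptance set.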

The acceptance error probability β can be plotted as a function of the rejection error probability α for the sample sizes 10 and 15 by using the builtin function QFB as follows. The function QFB(α,n,q,e) returns the acceptance error probability value β that corresponds to the sample sizes n and qn, with the rejection error probability α and the alternate hypothesis variance ratio e.

* fct b(a) = qfb(a,10,3/2,e)
* for e = 1:3!10 do draw points(b,.01:.4!100)
* left title "acceptance error (beta)"
* bottom title "rejection error (alpha)"
* top title "beta vs. alpha curves for n1=10, n2=15, e=1:3!10"
* view

Suppose we want to take n samples from each of two populations to be used to test whether these populations have the variance ratio 1 versus the variance ratio e, with rejection error probability α = .05 and acceptance error probability β = .05. We can use the builtin function QFN to compute the sample size n as a function of e as follows. The function QFN(α,β,q,e) returns the sample size n that corresponds to the variance ratio numerator sample size, assuming the denominator sample size qn, and given that the rejection error probability is α, the acceptance error probability is β, and the alternate hypothesis variance ratio value is e.

* fct n(e) = qfn(.05,.05,1,e)
* draw points(n,.1:2.5!50)
* top title "sample size vs. variance ratio (with a=b=.05,t=1)"
* left title "sample size (n)"
* bottom title "variance ratio (e)"
* view