The Wilcoxon Rank Sum Test
The Wilcoxon Rank Test is a procedure for comparing independent samples of sizes n_1 and n_2 from two populations. It is a nonparametric procedure based on the ranks of all N (= n_1 + n_2) observations after the two samples are combined and the observations are ordered from smallest to largest. Generally, nonparametric tests replace assumptions about normality with less stringent assumptions, such as symmetry and continuity of distributions.
Where to Find Information About the Wilcoxon Rank Sum Test:
Textbook Companion Chapter 15 (Read section 15.1 & Optionally 15.2)
Web:
Excel and TI-83 Manuals Supplementary Reading Material:
EXAMPLE
Does the presence of small numbers of weeds reduce the yield of corn? Lamb's-quarter is a common weed in cornfields. A researcher planted corn at the same rate in eight small plots of ground, then weeded the corn rows by hand to allow no weeds in four randomly selected plots and exactly three lamb's-quarter plants per meter of row in the other four plots. The yield of corn (bushels per acre) for the eight plots, are shown in the table below.
Weeds per meter |
Yield (bu/acre) | |||
0 |
166.7 |
172.2 |
165.0 |
176.9 |
3 |
158.6 |
176.4 |
153.1 |
156.0 |
Assume that the shapes of the two distributions of yields are symmetric (or approximately symmetric).
Null hypothesis for this example:
Ho : There is no difference in the distribution of yields. (i.e., m_{1 }=_{ }m_{2 })_{}
Alternative hypothesis for this example:
Ha : Yields are significantly higher in weed-free plots
(i.e., m_{1 }>_{ }m_{2 })
The Wilcoxon Rank Sum Statistic
Let W be the sum of the ranks of the observations in the first sample (for our example, weed-free plots.) If the two populations have the same continuous distribution and the observations in both samples take different values (i.e., there are no ties when the observations are ranked), the exact distribution of W has mean
m_{w} = n_1*(N+1)/2
and standard deviation:
s_{w} = SQRT[n_1*n_2*(N+1)/12]
The Wilcoxon rank sum test rejects the hypothesis that the two populations have identical distributions when the observed rank sum W is far from its mean.
The Normal Approximation
The rank sum statistic W becomes approximately normal as the two sample sizes increase. That is, we can use the z-statistic
z = (W - m_{w})/ s_{w} = [W - n_1*(N+1)/2 ] / SQRT[n_1*n_2*(N+1)/12]
to carry out the test.
For a fixed level a test, reject Ho if:
Note: z* represents the corresponding upper critical value of a standard normal distribution.
Wilcoxon Rank Sum Test : EXAMPLE | ||||
Sample1 |
Sample2 |
Combined Sample |
Population |
Rank |
166.7 |
158.6 |
153.1 |
2 |
1 |
172.2 |
176.4 |
156.0 |
2 |
2 |
165.0 |
153.1 |
158.6 |
2 |
3 |
176.9 |
156.0 |
165.0 |
1 |
4 |
166.7 |
1 |
5 | ||
172.2 |
1 |
6 | ||
176.4 |
2 |
7 | ||
176.9 |
1 |
8 | ||
Test Values |
Excel Formulas | |||
alpha |
0.05 | |||
Sample Size |
||||
n_1 |
4 | =COUNT(Sample1) | ||
n_2 |
4 | =COUNT(Sample2) | ||
Test Parameters & Statistics |
||||
W |
23 | =SUMIF(Population, "=1", Rank) | ||
mu |
18 | =n_1*(n_1+n_2+1)/2 | ||
sigma |
3.4641 | =SQRT(n_1*n_2*(n_1+n_2+1)/12) | ||
z |
1.443 | =(W-mu)/sigma | ||
Lower Test Results |
||||
lower_z |
-1.645 | =NORMSINV(alpha) | ||
Decision |
Do Not Reject Ho |
=IF(z<lower_z, "Reject Ho", "Do Not Reject Ho) | ||
Pvalue |
0.9255 | =NORMSDIST(z) | ||
Upper Test Results |
||||
upper _z |
1.645 | =NORMSINV(1-alpha) | ||
Decision |
Do Not Reject Ho |
=IF(z>upper_z, "Reject Ho", "Do Not Reject Ho) | ||
Pvalue |
0.0745 | =1-NORMSDIST(z) | ||
Two-Sided Test Results |
||||
two_z |
1.960 | =ABS(NORMSINV(alpha/2)) | ||
Decision |
Do Not Reject Ho |
=IF(ABS(z)>two_z, "Reject Ho", "Do Not Reject Ho) | ||
Pvalue |
0.1489 | =2*(1-NORMSDIST(ABS(z))) |
Interpreting The Results
Since the alternative hypothesis for this test is
Ha : Yields are significantly higher in weed-free plots (i.e., m_{1 }> m_{2 })
We read off the Excel output that corresponds to the Upper Test Results rows in the example output above.
Conclusion: At the 5% level, we conclude that the data are not significant since both the Pvalue of 0.0745 is not less than alpha (0.05 or 5%), and the observed z value of 1.443 is not greater than the upper_z value of 1.645, as required for rejecting the null hypothesis for this type of test.
The answer to the researcher's question: Does the presence of small numbers of weeds reduce the yield of corn? Our response should be that based on the data at hand, the presence of small numbers of weeds DOES NOT reduce the yield of corn.
EXPLANATION OF EXCEL FORMULAS
COUNT( ) |
SUMIF( ) |
||
SQRT( ) |
NORMSDIST( ) |
||
NORMSINV( ) |
ABS( ) |
||
COUNT
Counts the number of cells that contain numbers and numbers within the list of arguments. Use COUNT to get the number of entries in a number field in a range or array of numbers.
Syntax
COUNT(value1,value2, ...)
Value1, value2, ... are 1 to 30 arguments that can contain or refer to a variety of different types of data, but only numbers are counted.
SUMIF
Adds the cells specified by a given criteria.
Syntax
SUMIF(range,criteria,sum_range)
· Range is the range of cells you want evaluated.
· Criteria is the criteria in the form of a number, expression, or text that defines which cells will be added. For example, criteria can be expressed as 32, "32", ">32", "apples".
· Sum_range are the actual cells to sum. The cells in sum_range are summed only if their corresponding cells in range match the criteria. If sum_range is omitted, the cells in range are summed.
Remark
Microsoft Excel provides additional functions that can be used to analyze your data based on a condition. For example, to count the number of occurrences of a string of text or a number within a range of cells, use the COUNTIF function. To have a formula return one of two values based on a condition, such as a sales bonus based on a specified sales amount, use the IF worksheet function. Learn about calculating a value based on a condition.
Example
Suppose A1:A4 contain the following property values for four homes: $100,000, $200,000, $300,000, $400,000, respectively. B1:B4 contain the following sales commissions on each of the corresponding property values: $7,000, $14,000, $21,000, $28,000.
SUMIF(A1:A4,">160000",B1:B4)
equals $63,000
SQRT
Returns the positive square root of a positive real number.
Syntax
SQRT(number)
· Number is the number for which you want the square root. If number is negative, SQRT returns the #NUM! error value.
Examples
SQRT(16)
equals 4
SQRT(-16)
equals #NUM!
SQRT(ABS(-16))
equals 4
NORMDIST
Returns the standard normal cumulative distribution function. The distribution has a mean of 0 (zero) and a standard deviation of 1 (one.) Use this function in place of a table of standard normal curve areas.
Syntax
NORMSDIST(z)
· Z is the value for which you want the distribution.
Remarks
Example
NORMSDIST(1.333333)
equals 0.908789
NORMSINV
Returns the inverse of the standard normal cumulative distribution. The distribution has a mean of zero and a standard deviation of one.
Syntax
NORMSINV(probability)
· Probability is a probability corresponding to the normal distribution and must be a positive number between 0 and 1, inclusive.
Remarks
NORMSINV uses an iterative technique for calculating the function. Given a probability value, NORMSINV iterates until the result is accurate to within ± 3x10^-7. If NORMSINV does not converge after 100 iterations, the function returns the #N/A error value.
Example: NORMSINV(0.908789)
equals 1.3333
ABS
Returns the absolute value of a number. The absolute value of a number is the number without its sign.
Syntax
ABS(number)
· Number is the real number of which you want the absolute value.
Examples
ABS(2)
equals
2
ABS(-2)
equals
2
If A1 contains -
16, then:
SQRT(ABS(A1))
equals 4