The Wilcoxon Rank Sum Test

The Wilcoxon Rank Test is a procedure for comparing independent samples of sizes n_1 and n_2 from two populations. It is a nonparametric procedure based on the ranks of all N (= n_1 + n_2) observations after the two samples are combined and the observations are ordered from smallest to largest. Generally, nonparametric tests replace assumptions about normality with less stringent assumptions, such as symmetry and continuity of distributions.

Where to Find Information About the Wilcoxon Rank Sum Test:

Textbook Companion Chapter 15 (Read section 15.1 & Optionally 15.2)

Web:

Excel and TI-83 Manuals Supplementary Reading Material:

  1. Chapter 15: Nonparametric Tests (sections 15.1),
    Excel Manual for Moore and McCabe's Introduction to the Practice of Statistics, 5e.

    (Note: This is chapter 14 in the 4th edition manual.)
  2. Chapter 15: Nonparametric Tests (sections 15.1),
    TI-83 Graphing Calculator Manual for Moore and McCabe's Introduction to the Practice of Statistics, 5e.
    (Note: This is chapter 14 in the 4th edition manual.)


EXAMPLE

Does the presence of small numbers of weeds reduce the yield of corn? Lamb's-quarter is a common weed in cornfields. A researcher planted corn at the same rate in eight small plots of ground, then weeded the corn rows by hand to allow no weeds in four randomly selected plots and exactly three lamb's-quarter plants per meter of row in the other four plots. The yield of corn (bushels per acre) for the eight plots, are shown in the table below.

Weeds per meter

Yield (bu/acre)

0

166.7

172.2

165.0

176.9

3

158.6

176.4

153.1

156.0

Assume that the shapes of the two distributions of yields are symmetric (or approximately symmetric).

  1. Let m1 be the true mean value of the distribution of yields of corn that comes from weed free plots.
  2. Let m2 be the true mean value of the distribution of yields of corn that comes from plots with 3 weeds per meter.

Null hypothesis for this example:

Ho :  There is no difference in the distribution of yields. (i.e.,  m1 = m2 )

Alternative hypothesis for this example:

Ha :  Yields are significantly higher in weed-free plots (i.e.,  m1 > m2 )

The Wilcoxon Rank Sum Statistic

Let W be the sum of the ranks of the observations in the first sample (for our example, weed-free plots.)  If the two populations have the same continuous distribution and the observations in both samples take different values (i.e., there are no ties when the observations are ranked), the exact distribution of W has mean               

mw =  n_1*(N+1)/2

and standard deviation:

sw =  SQRT[n_1*n_2*(N+1)/12]

 The Wilcoxon rank sum test rejects the hypothesis that the two populations have identical distributions when the observed rank sum W is far from its mean.

The Normal Approximation

The rank sum statistic W becomes approximately normal as the two sample sizes increase. That is, we can use the z-statistic

z = (W -  mw)/ sw = [W - n_1*(N+1)/2 ] / SQRT[n_1*n_2*(N+1)/12]

to carry out the test.

For a fixed level a test, reject Ho if:

Note:  z* represents the corresponding upper critical value of a standard normal distribution.

EXCEL Formulas for carrying out a Wilcoxon Rank Sum test using normal approximation, without the continuity correction.

Wilcoxon Rank Sum Test : EXAMPLE

         
Sample1
Sample2
Combined Sample
Population
Rank
166.7
158.6
153.1
2
1
172.2
176.4
156.0
2
2
165.0
153.1
158.6
2
3
176.9
156.0
165.0
1
4
166.7
1
5
172.2
1
6
176.4
2
7
176.9
1
8
   
 
Test Values
Excel Formulas
alpha
0.05
Sample Size
n_1
4 =COUNT(Sample1)
n_2
4 =COUNT(Sample2)
Test Parameters & Statistics
W
23 =SUMIF(Population, "=1", Rank)
mu
18 =n_1*(n_1+n_2+1)/2
sigma
3.4641 =SQRT(n_1*n_2*(n_1+n_2+1)/12)
z
1.443 =(W-mu)/sigma
Lower Test Results
lower_z
-1.645 =NORMSINV(alpha)
Decision
Do Not Reject Ho
=IF(z<lower_z, "Reject Ho", "Do Not Reject Ho)
Pvalue
0.9255 =NORMSDIST(z)
Upper Test Results
upper _z
1.645 =NORMSINV(1-alpha)
Decision
Do Not Reject Ho
=IF(z>upper_z, "Reject Ho", "Do Not Reject Ho)
Pvalue
0.0745 =1-NORMSDIST(z)
Two-Sided Test Results
two_z
1.960 =ABS(NORMSINV(alpha/2))
Decision
Do Not Reject Ho
=IF(ABS(z)>two_z, "Reject Ho", "Do Not Reject Ho)
Pvalue
0.1489 =2*(1-NORMSDIST(ABS(z)))


Interpreting The Results

Since the alternative hypothesis for this test is

Ha :  Yields are significantly higher in weed-free plots (i.e.,  m1 > m2 )

We read off the Excel output that corresponds to the Upper Test Results rows in the example output above.

Conclusion: At the 5% level, we conclude that the data are not significant since both the Pvalue of 0.0745 is not less than alpha (0.05 or 5%), and the observed z value of 1.443 is not greater than the upper_z value of 1.645, as required for rejecting the null hypothesis for this type of test.

The answer to the researcher's question: Does the presence of small numbers of weeds reduce the yield of corn? Our response should be that based on the data at hand, the presence of small numbers of weeds DOES NOT reduce the yield of corn.

EXPLANATION OF EXCEL FORMULAS

COUNT( )

SUMIF( )

   

SQRT( )

NORMSDIST( )

   

NORMSINV( )

ABS( )

   
       

COUNT

Counts the number of cells that contain numbers and numbers within the list of arguments. Use COUNT to get the number of entries in a number field in a range or array of numbers.

Syntax

COUNT(value1,value2, ...)

Value1, value2, ...   are 1 to 30 arguments that can contain or refer to a variety of different types of data, but only numbers are counted.

SUMIF

Adds the cells specified by a given criteria.

Syntax

SUMIF(range,criteria,sum_range)

        Range   is the range of cells you want evaluated.

        Criteria   is the criteria in the form of a number, expression, or text that defines which cells will be added. For example, criteria can be expressed as 32, "32", ">32", "apples".

        Sum_range   are the actual cells to sum. The cells in sum_range are summed only if their corresponding cells in range match the criteria. If sum_range is omitted, the cells in range are summed.

Remark

Microsoft Excel provides additional functions that can be used to analyze your data based on a condition. For example, to count the number of occurrences of a string of text or a number within a range of cells, use the COUNTIF function. To have a formula return one of two values based on a condition, such as a sales bonus based on a specified sales amount, use the IF worksheet function. Learn about calculating a value based on a condition.

Example

Suppose A1:A4 contain the following property values for four homes: $100,000, $200,000, $300,000, $400,000, respectively. B1:B4 contain the following sales commissions on each of the corresponding property values: $7,000, $14,000, $21,000, $28,000.

SUMIF(A1:A4,">160000",B1:B4) equals $63,000

SQRT

Returns the positive square root of a positive real number.

Syntax

SQRT(number)

        Number   is the number for which you want the square root. If number is negative, SQRT returns the #NUM! error value.

Examples

SQRT(16) equals 4

SQRT(-16) equals #NUM!

SQRT(ABS(-16)) equals 4

NORMDIST

Returns the standard normal cumulative distribution function. The distribution has a mean of 0 (zero) and a standard deviation of 1 (one.) Use this function in place of a table of standard normal curve areas.

Syntax

NORMSDIST(z)

        Z   is the value for which you want the distribution.

Remarks

Example

NORMSDIST(1.333333) equals 0.908789

NORMSINV

Returns the inverse of the standard normal cumulative distribution. The distribution has a mean of zero and a standard deviation of one.

Syntax

NORMSINV(probability)

        Probability   is a probability corresponding to the normal distribution and must be a positive number between 0 and 1, inclusive.

Remarks

NORMSINV uses an iterative technique for calculating the function. Given a probability value, NORMSINV iterates until the result is accurate to within 3x10^-7. If NORMSINV does not converge after 100 iterations, the function returns the #N/A error value.

Example:        NORMSINV(0.908789) equals 1.3333

 ABS

Returns the absolute value of a number. The absolute value of a number is the number without its sign.

Syntax

ABS(number)

        Number   is the real number of which you want the absolute value.

Examples

ABS(2) equals 2

ABS(-2) equals 2

If A1 contains -16, then:

SQRT(ABS(A1)) equals 4