__Section 1.2:__

__Describing Distributions with Numbers__

- Finding the center of a distribution
- Comparing the mean and the median
- Measuring spread (deviation from the center) in a set of data
- The Variance and Standard Deviation

__Finding the center of a distribution__

Two of the most important features of the distribution of a quantitative variable can be described using numerical measures:

- Its center, and
- The spread of its values about the center

The numbers we use to describe the center of a distribution (i.e., the location where roughly half the values are below it and the other half above it) are:

- The mean
- The median

-- The mean of the distribution of a quantitative variable is the arithmetic average of its values.

-- The median is the "middle value" of the observations once they have been arranged in ascending order. To locate it:

- Arrange all observations in increasing (or ascending) order
- If n is odd, the median is the observation at position (n+1)/2 in the ordered list
- If n is even, the median is the mean of the two center observations, at positions n/2 and n/2 + 1

(See Example 1.12 on page 39 and Example 1.13 on page 40 of Introduction to the Practice of Statistics by D. Moore and G. McCabe, 4th edition.)
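The rules above for the mean and the median can be sketched in a few lines of Python. This is a minimal illustration, not taken from the textbook: the function names and the two small data sets are made up for the example.

```python
def mean(values):
    """Arithmetic average: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the observations after sorting them in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: position (n+1)/2 counting from 1 is index n//2 counting from 0
        return ordered[mid]
    # n even: average the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data sets, one with odd n and one with even n
odd_data = [7, 3, 9, 1, 5]
even_data = [7, 3, 9, 1, 5, 11]

print(mean(odd_data), median(odd_data))
print(mean(even_data), median(even_data))
```

For the odd-sized set the median is the third of five ordered values; for the even-sized set it is the average of the third and fourth.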

__Comparing the mean and the median__

-- Because the mean is the arithmetic average of all the values in a set of data, it is strongly influenced by any extreme observations (outliers) in the set. For distributions that are skewed to the left or to the right, the mean is pulled toward the long tail, so it underestimates or overestimates the center.

-- The median, on the other hand, is resistant to any extreme observations (outliers) that the data set may include. The median is generally the better choice for describing the center of skewed distributions.

-- For symmetric distributions, the mean and median should be fairly close (or even equal) to each other.
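A small experiment makes the difference in resistance concrete. The sketch below uses Python's standard `statistics` module; the scores are invented for illustration, and the value 200 plays the role of an extreme observation.

```python
from statistics import mean, median

# Hypothetical, roughly symmetric data: mean and median agree
scores = [70, 72, 75, 78, 80]

# The same data with one extreme observation added
with_outlier = scores + [200]

# How far each measure of center moves when the outlier is included
mean_shift = mean(with_outlier) - mean(scores)
median_shift = median(with_outlier) - median(scores)

print("mean:  ", mean(scores), "->", mean(with_outlier))
print("median:", median(scores), "->", median(with_outlier))
```

The single extreme value moves the mean by more than 20 points, while the median moves by only 1.5.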

__Measuring spread (deviation from the center) of a distribution__

-- The mean and median alone do not describe the distribution of a variable completely. Numerical measures of spread give an idea of the variability in the values of a variable.

Common measures of spread

- Range (= maximum value - minimum value)
- Quartiles
- The Standard Deviation

Computing the Quartiles

-- list the observations in increasing order

-- the first (lower) quartile is the median of the first half of the data (Q_{1})

-- the second quartile is the median

-- the third (upper) quartile is the median of the second half of the data (Q_{3})

Computing the Interquartile Range (IQR)

-- IQR = Q_{3} - Q_{1}

The Five-Number Summary

Minimum, Q_{1}, Median, Q_{3}, Maximum

The Boxplot

-- The five numbers (two extremes, two quartiles, and the median) tell us a great deal about a dataset. These five numbers are also used to draw a different kind of plot, the boxplot.
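The quartiles, the IQR, and the five-number summary can be computed as in the following sketch. One detail is an assumption: when n is odd, the overall median is left out of both halves before their medians are taken. Other software may use a different quartile rule and give slightly different values. The function name and the data are made up for illustration, and the data set needs at least two observations.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for two or more observations."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    # Assumption: for odd n the middle observation belongs to neither half
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


# Hypothetical data set with n = 11
data = [15, 6, 7, 39, 41, 36, 40, 42, 43, 47, 49]

summary = five_number_summary(data)
iqr = summary[3] - summary[1]  # IQR = Q3 - Q1

print(summary)
print("IQR =", iqr)
```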

Drawing a boxplot

- Find the five-number-summary.
- Mark the locations of the median, quartiles, and extremes below a number line.
- Draw a box between the two quartiles. Mark the median with a line across the box. Draw two "whiskers" from the quartiles to the extremes.

Outliers

-- Data values that are substantially larger or smaller than the other values are referred to as outliers.

The 1.5 x IQR rule for outliers

-- Observations that fall below Q_{1} - 1.5 x (IQR) or above Q_{3} + 1.5 x (IQR) are, according to this rule, identified as potential outliers.

Note: When outliers are present in the data, a modified boxplot must be drawn. Draw a modified boxplot by ending the whiskers at the most extreme observations still within 1.5 x IQR of the quartiles, and plot each outlier individually.
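The 1.5 x IQR rule and the whisker endpoints of a modified boxplot can be sketched as follows. The quartile rule (leaving the median out of both halves when n is odd) is an assumption, as are the function names and the data.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n the middle observation belongs to neither half
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(upper)


def potential_outliers(values):
    """Apply the 1.5 x IQR rule; return (flagged values, whisker endpoints)."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    flagged = [x for x in values if x < low_fence or x > high_fence]
    # Whiskers of a modified boxplot stop at the most extreme values inside the fences
    inside = [x for x in values if low_fence <= x <= high_fence]
    return flagged, (min(inside), max(inside))


# Hypothetical data with one unusually large value
data = [52, 55, 57, 58, 60, 61, 63, 64, 66, 120]

flagged, whiskers = potential_outliers(data)
print("potential outliers:", flagged)
print("whiskers end at:", whiskers)
```

Here Q_{1} = 57 and Q_{3} = 64, so the fences sit at 46.5 and 74.5. Only 120 falls outside them, and the whiskers end at 52 and 66.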

Example 1. The following is a numerical summary of exam scores on TEST 1 from a previous Math 115 class.

| Variable | N | Mean | Median | Min | Max | Q_{1} | Q_{3} |
|---|---|---|---|---|---|---|---|
| TEST1 | 73 | 79.21 | 83.00 | 38.00 | 100.00 | 67.50 | 93.00 |

A boxplot and a histogram for these results are shown below.
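As a check, the 1.5 x IQR rule can be applied directly to the TEST1 summary values. The sketch below uses only the numbers in the summary; the variable names are made up for the example.

```python
# Values taken from the TEST1 numerical summary
q1, q3 = 67.50, 93.00
minimum, maximum = 38.00, 100.00

iqr = q3 - q1                 # interquartile range
low_fence = q1 - 1.5 * iqr    # scores below this are potential outliers
high_fence = q3 + 1.5 * iqr   # scores above this are potential outliers

# If even the extremes lie inside the fences, the rule flags nothing
has_low_outliers = minimum < low_fence
has_high_outliers = maximum > high_fence

print("IQR =", iqr)
print("fences:", low_fence, high_fence)
print("outliers flagged:", has_low_outliers or has_high_outliers)
```

The IQR is 25.5 and the fences are 29.25 and 131.25. Both the minimum (38) and the maximum (100) lie inside them, so the rule flags no scores. The mean (79.21) is lower than the median (83.00), which is what one would expect if the scores have a longer tail toward the low end.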

Properties of the standard deviation s:

- Measures the spread around the mean
- Should be used together with the mean, not the median
- If s = 0, then all the observations have the same value
- The larger s is, the more spread out the data are
- s is strongly influenced by outliers
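These properties can be checked numerically. The sketch below assumes that s is the sample standard deviation, i.e. the square root of the sum of squared deviations from the mean divided by n - 1. The function name and the data sets are invented for illustration.

```python
from math import sqrt


def sample_std(values):
    """Sample standard deviation s (assumes the n - 1 denominator)."""
    n = len(values)
    x_bar = sum(values) / n
    variance = sum((x - x_bar) ** 2 for x in values) / (n - 1)
    return sqrt(variance)


# Hypothetical data sets
same = [4, 4, 4, 4]                # all observations equal
tight = [8, 9, 10, 11, 12]         # values close to the mean of 10
wide = [0, 5, 10, 15, 20]          # same mean, more spread
with_outlier = [8, 9, 10, 11, 62]  # 'tight' with one extreme value

for name, data in [("same", same), ("tight", tight),
                   ("wide", wide), ("with_outlier", with_outlier)]:
    print(name, round(sample_std(data), 3))
```

Identical observations give s = 0, the more spread-out set gives a larger s than the tight one, and replacing a single value with an extreme one inflates s many times over.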

NOTES:

- For describing skewed distributions, the five-number summary is preferred over the mean and standard deviation.
- For describing symmetric distributions, the mean and standard deviation are preferred.

An alternative formula for computing the variance (s^2), which is easier to use if you compute the standard deviation (s) by hand, is:
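One commonly used computational form is written out below next to the defining formula. It is an assumption that this is the form intended here; it also assumes that s^2 is the sample variance, with n - 1 in the denominator. The two expressions are algebraically equal, and the second needs only the sum of the values and the sum of their squares.

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2
    = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2
      - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right],
\qquad s = \sqrt{s^2}
```

For example, for the values 8, 9, 10, 11, 12 the sum is 50 and the sum of squares is 510, so s^2 = (510 - 50^2/5)/4 = 10/4 = 2.5.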

NOTE: I recommend that you use a calculator to compute the mean and the standard deviation for a set of data. It can still be educational to see how the formulas are used to compute these numbers.

Copyright (c), 2003 by Nikos Psomas