Describing Distributions with Numbers
Finding the center of a distribution
Two of the most important features of the distribution of a quantitative variable can be described using numerical measures:
- Its center, and
- The spread of its values about the center
The numbers we use to describe the center of a distribution (i.e., the location where roughly half the values lie below it and the other half above it) are:
- The mean
- The median
-- The mean of the distribution of a quantitative variable is the arithmetic average of its values.
-- The median is the "middle value" of the observations once they have been arranged in ascending order. To find it:
- Arrange all observations in increasing (or ascending) order
- If n is odd, then the median is the (n+1)/2 th observation
- If n is even, the median is the mean of the two center observations
(See: Example 1.12 on page 39, and Example 1.13 on page 40; Introduction to the Practice of Statistics by D. Moore and G. McCabe; 4th edition.)
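The steps for the mean and the median can be sketched in a few lines of Python. This is only an illustration: the function names and the data values are made up for this example and do not come from the textbook.

```python
def mean(values):
    """Arithmetic average: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the data after sorting, following the steps above."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the (n+1)/2-th observation, which is index n // 2 when counting from 0
        return ordered[mid]
    # n even: mean of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


scores = [7, 3, 9, 5, 1]              # made-up data, n = 5 (odd)
print(mean(scores))                   # 5.0
print(median(scores))                 # 5
print(median([7, 3, 9, 5]))           # 6.0 (n = 4: mean of 5 and 7)
```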
Comparing the mean and the median
-- Because the mean is the arithmetic average of all the values in a set of data, it is strongly influenced by any extreme observations (outliers) in the set. For distributions that are skewed to the left or to the right, the mean is pulled toward the long tail and so tends to misrepresent the center (underestimating it for left-skewed data, overestimating it for right-skewed data).
-- The median, on the other hand, is resistant to any extreme observations (outliers) that the data set may include. It is usually the better choice for describing the center of a skewed distribution.
-- For symmetric distributions the mean and median should be fairly close (or even equal) to each other.
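A small experiment makes the difference in resistance concrete. The numbers below are hypothetical; the sketch uses the `mean` and `median` functions from Python's standard `statistics` module and adds a single extreme observation to a roughly symmetric data set.

```python
from statistics import mean, median

values = [30, 32, 35, 36, 40]          # hypothetical, roughly symmetric data
with_outlier = values + [400]          # the same data plus one extreme observation

print(mean(values), median(values))              # 34.6 35
print(mean(with_outlier), median(with_outlier))  # 95.5 35.5
```

One outlier moves the mean from 34.6 to 95.5, while the median only shifts from 35 to 35.5.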
Measuring spread (deviation from the center) of a distribution
-- The mean and median alone do not describe the distribution of a variable completely. Numerical measures of spread give an idea of the variability in the values of a variable.
Common measures of spread
- Range (= maximum value - minimum value)
- The Standard Deviation
Computing the Quartiles
-- list the observations in increasing order
-- the first (lower) quartile is the median of the first half of the data (Q1)
-- the second quartile is the median
-- the third (upper) quartile is the median of the second half of the data (Q3)
Computing the Interquartile Range (IQR)
-- IQR = Q3 - Q1
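The quartile steps and the IQR can be sketched as follows. The notes do not say what to do with the middle observation when n is odd; this sketch assumes it is left out of both halves, which is one common convention. The function name `quartiles` and the data are made up for illustration, and the data must contain at least two observations.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumption: when n is odd, the middle observation belongs to neither half.
    """
    ordered = sorted(values)           # list the observations in increasing order
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # first half of the data
    upper = ordered[n - half:]         # second half of the data
    return median(lower), median(ordered), median(upper)


data = [2, 4, 4, 5, 7, 8, 10, 12, 15]  # made-up data, n = 9
q1, med, q3 = quartiles(data)
iqr = q3 - q1                          # IQR = Q3 - Q1
print(q1, med, q3, iqr)                # 4.0 7 11.0 7.0
```

Statistical software often uses interpolation-based quartile definitions, so its output may differ slightly from this hand method.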
The Five-Number Summary
Minimum, Q1, Median, Q3, Maximum
-- The five numbers (two extremes, two quartiles, and the median) tell us a great deal about a dataset. These five numbers are also used to draw a different kind of plot, the boxplot.
Drawing a boxplot
- Find the five-number summary.
- Mark the locations of the median, quartiles, and extremes below a number line.
- Draw a box between the two quartiles. Mark the median with a line across the box. Draw two "whiskers" from the quartiles to the extremes.
-- Data values that are substantially larger or smaller than the other values are referred to as outliers.
The 1.5 x IQR rule for outliers
-- Observations that fall below Q1 - 1.5 x (IQR), or above Q3 + 1.5 x (IQR) are, according to this rule, identified as potential outliers.
Note: When outliers are present in the data, a modified boxplot should be drawn instead. End each whisker at the most extreme observation that is still within 1.5 x IQR of the nearer quartile, and plot each outlier individually.
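The five-number summary, the 1.5 x IQR rule, and the whisker ends of a modified boxplot can be computed together. The sketch below is an illustration under the same assumption as the hand method for quartiles (the middle observation is excluded from both halves when n is odd); the function names and the data are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])        # assumption: middle value excluded when n is odd
    q3 = median(ordered[n - half:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]


def iqr_outliers(values):
    """Flag potential outliers and find where the modified-boxplot whiskers end."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in values if x < low_fence or x > high_fence]
    inside = [x for x in values if low_fence <= x <= high_fence]
    whiskers = (min(inside), max(inside))   # most extreme non-outlying observations
    return outliers, whiskers


data = [2, 4, 4, 5, 7, 8, 10, 12, 40]  # made-up data; 40 is far from the rest
print(five_number_summary(data))       # (2, 4.0, 7, 11.0, 40)
print(iqr_outliers(data))              # ([40], (2, 12))
```

Here IQR = 7, so the cutoffs are 4 - 10.5 = -6.5 and 11 + 10.5 = 21.5. Only 40 falls outside them, and the upper whisker ends at 12 rather than at the maximum.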
The following is a numerical summary of exam scores on TEST 1 from a previous Math 115 class.
Variable    N    Mean   Median    Min     Max     Q1     Q3
TEST1      73   79.21    83.00  38.00  100.00  67.50  93.00
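As a quick check, the 1.5 x IQR rule can be applied directly to this summary. The sketch below uses only the values in the table; the variable names are chosen for illustration.

```python
# Values taken from the TEST1 summary table
q1, q3 = 67.50, 93.00
minimum, maximum = 38.00, 100.00

iqr = q3 - q1                          # 93.00 - 67.50 = 25.5
low_fence = q1 - 1.5 * iqr             # 67.50 - 38.25 = 29.25
high_fence = q3 + 1.5 * iqr            # 93.00 + 38.25 = 131.25

has_low_outliers = minimum < low_fence     # False: 38.00 is above 29.25
has_high_outliers = maximum > high_fence   # False: 100.00 is below 131.25
print(iqr, low_fence, high_fence, has_low_outliers, has_high_outliers)
```

Since the minimum and maximum both lie inside the cutoffs, the rule flags no potential outliers in these scores. The mean (79.21) sitting below the median (83.00) suggests that the distribution is somewhat skewed to the left.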
A boxplot and a histogram for these results are shown below.
Properties of the standard deviation (s):
- Measures the spread around the mean
- Should be used together with the mean, not the median
- If s = 0, then all the observations have the same value
- The larger s is, the more spread out the data are
- s is strongly influenced by outliers
- For describing skewed distributions, the five-number summary is preferred to the mean and standard deviation.
- For describing symmetric distributions, the mean and standard deviation are preferred.
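Some of these properties are easy to see numerically. The sketch below assumes the usual definition of the sample standard deviation, with divisor n - 1; the function name `sample_std` and the data are made up for illustration.

```python
import math


def sample_std(values):
    """Sample standard deviation s, assuming the n - 1 divisor."""
    n = len(values)
    xbar = sum(values) / n                                  # the mean
    variance = sum((x - xbar) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)


print(sample_std([5, 5, 5, 5]))        # 0.0: all observations have the same value
print(sample_std([4, 5, 6]))           # 1.0
print(sample_std([4, 5, 6, 50]))       # far larger: a single outlier inflates s
```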
An alternative formula for computing the variance (s^2), which is easier to use when computing the standard deviation (s) by hand, is:
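One standard shortcut form, assuming s^2 is the sample variance with divisor n - 1, is given below. It is algebraically equivalent to the definition but only needs the sum of the values and the sum of their squares.

```latex
s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right],
\qquad s = \sqrt{s^2}
```

For example, for the made-up values 4, 5, 6: the sum of squares is 77 and the sum is 15, so s^2 = (77 - 225/3) / 2 = 1 and s = 1.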
NOTE: I recommend that you use a calculator to compute the mean and the standard deviation for a set of data. It can still be instructive, though, to see how the formulas are used to compute these numbers by hand.