AP Statistics

  Measuring Spread

Measures of location summarize what is typical of elements of a list, but not every element is typical. Are all the elements close to each other? Are most of the elements close to each other? What is the biggest difference between elements? On the average, how far are the elements from each other? Measures of spread or variability tell us.

The three most common measures of spread or variability are the range, the interquartile range (IQR), and the standard deviation.

Measuring spread: the quartiles

The mean and median provide two different measures of the center of a distribution. But a measure of center alone can be misleading. The Census Bureau reports that in 2000 the median income of American households was $41,345. Half of all households had incomes below $41,345, and half had higher incomes. But these figures do not tell the whole story. Two nations with the same median household income are very different if one has extremes of wealth and poverty and the other has little variation among households. A drug with the correct mean concentration of active ingredient is dangerous if some batches are much too high and others much too low. We are interested in the spread or variability of incomes and drug potencies as well as their centers. The simplest useful numerical description of a distribution consists of both a measure of center and a measure of spread.

One way to measure spread is to calculate the range, which is the difference between the largest and smallest observations. For example, the number of home runs Barry Bonds has hit in a season has a range of 73 – 16 = 57. The range shows the full spread of the data. But it depends on only the smallest observation and the largest observation, which may be outliers. We can improve our description of spread by also looking at the spread of the middle half of the data. The quartiles mark out the middle half. Count up the ordered list of observations, starting from the smallest. The first quartile lies one-quarter of the way up the list. The third quartile lies three-quarters of the way up the list. In other words, the first quartile is larger than 25% of the observations, and the third quartile is larger than 75% of the observations. The second quartile is the median, which is larger than 50% of the observations. That is the idea of quartiles. We need a rule to make the idea exact. The rule for calculating the quartiles uses the rule for the median. Here is an example that shows how the rules for the quartiles work for both odd and even numbers of observations.

 

THE QUARTILES Q1 and Q3

To calculate the quartiles

1. Arrange the observations in increasing order and locate the median M in the ordered list of observations.

2. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median.

3. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median.

Example

Following is an ordered list of Barry Bonds' home run counts up to and including his record year in which he hit 73 home runs.

16    19    24    25    25    33    33    34   34    37    37   40    42    46    49    73

There is an even number of observations, so the median lies midway between the 8th and 9th numbers in the ordered list (34, 34). In this case the median is 34 since both values are equal. The first quartile is the median of the 8 observations to the left of the median's location (16 through 34). It lies midway between the 4th and 5th of these (25, 25), so Q1 = 25. The third quartile is the median of the 8 observations to the right of the median's location (34 through 73). It lies midway between 40 and 42, so Q3 = 41. Note that when the number of observations is odd, the median is itself one of the observations, and we don't include it when we are computing the quartiles.

The quartiles are resistant. For example, Q3 would have the same value if Bonds' record 73 were 703.
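The quartile rule above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library routine: the function name `quartiles` is made up for this example, and it assumes the list has at least two observations.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, M, Q3) using the rule above.

    The halves are the observations to the left and right of the
    median's location, so for an odd count the median is left out.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]                  # left of the median's location
    upper = ordered[len(ordered) - half:]   # right of the median's location
    return median(lower), median(ordered), median(upper)

bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 46, 49, 73]
q1, m, q3 = quartiles(bonds)
print(q1, m, q3)   # Q1 = 25, M = 34, Q3 = 41 (printed as floats)

# Resistance check: replace the record 73 with 703 and recompute.
altered = bonds[:-1] + [703]
print(quartiles(altered))   # the three values are unchanged
```

Only the largest observation changed, and none of the three values depends on it, which is what "resistant" means here.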

Be careful when several observations take on the same numerical value. Write down all of the observations and apply the rules just as if they all had distinct values. Some software packages use a slightly different rule to find quartiles, so computer results may be a bit different. The difference will be small and insignificant.
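As an illustration of how software rules can differ, Python's standard `statistics.quantiles` function interpolates between observations and offers two methods, neither of which is the median-of-halves rule used in this lesson. The sketch below simply prints what each method reports for the Bonds data so the results can be compared with Q1 = 25 and Q3 = 41. It assumes Python 3.8 or later.

```python
from statistics import quantiles

bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 46, 49, 73]

results = {}
for method in ("exclusive", "inclusive"):
    # n=4 asks for the three cut points that split the data into quarters.
    q1, m, q3 = quantiles(bonds, n=4, method=method)
    results[method] = (q1, m, q3)
    print(f"{method:9s}  Q1 = {q1}  M = {m}  Q3 = {q3}")
```

Any differences from the hand calculation are a fraction of the gap between neighboring observations.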

The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the interquartile range.

The Interquartile Range (IQR)

The interquartile range (IQR) is the distance between the first and third quartiles,

IQR = Q3 - Q1

If an observation falls between Q1 and Q3, then you know it's neither unusually high (upper 25%) nor unusually low (lower 25%). The IQR is the basis of a rule for identifying suspected outliers.

Outliers: The 1.5 × IQR Criterion

Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.

Example

Let's revisit Barry Bonds' home run data above. The IQR for the data set is Q3 - Q1 = 41 - 25 = 16. We suspect that Bonds' 73 is an outlier, so let's apply the outlier test.

Q3 + 1.5 × IQR = 41 + (1.5 × 16) = 65 (upper cutoff)

Q1 - 1.5 × IQR = 25 - (1.5 × 16) = 1 (lower cutoff)

Since 73 is above the upper cutoff, 65, Bonds' record-setting year was an outlier. Make sure you do the calculations every time. Don't rely on the graphical representation of the data, because it can be misleading.
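The outlier test can also be checked with a short Python sketch. The helper name `iqr_outliers` is hypothetical, and the quartiles are computed with the median-of-halves rule from this lesson rather than with a library quantile routine.

```python
from statistics import median

def iqr_outliers(data):
    """Return (lower cutoff, upper cutoff, outliers) for the 1.5 x IQR rule."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    # An observation is flagged only if it lies strictly beyond a cutoff.
    flagged = [x for x in ordered if x < low or x > high]
    return low, high, flagged

bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 46, 49, 73]
low, high, flagged = iqr_outliers(bonds)
print(low, high, flagged)   # cutoffs 1 and 65; only 73 is flagged
```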

The Five-number summary and boxplots

The smallest and largest observations tell us little about the distribution as a whole, but they give information about the tails of the distribution that is missing if we know only Q1, M, and Q3. To get a quick summary of both center and spread, combine all five numbers.

The Five-number Summary

The five-number summary of a data set consists of the smallest observation, the first quartile, the median, the third quartile, and the maximum. It is customary to write the five-number summary from least to greatest.

In symbols, the five number summary is

Minimum      Q1      Median    Q3     Maximum

The five-number summary of a distribution leads to a new graph, the boxplot.

Following is an example of how to construct a boxplot. Boxplots are also known as box-and-whiskers plots.

A box-and-whisker plot is useful for summarizing many data values. It lets people explore data and draw informal conclusions, especially when two or more variables are compared. It shows only certain statistics rather than all the data. The plot is a visual representation of the five-number summary, which consists of the median, the quartiles, and the smallest and greatest values in the distribution. A box-and-whisker plot immediately shows the center, the spread, and the overall range of the distribution.

To construct a box-and-whisker plot we must find the median, the lower quartile, the upper quartile, the minimum, and the maximum of a given set of data. Example: The following numbers are the numbers of marbles owned by fifteen different boys (arranged from least to greatest).

18 27 34 52 54 59 61 68 78 82 85 87 91 93 100

  • First find the median. The median is the value exactly in the middle of an ordered set of numbers.

    68 is the median

  • Next, we consider only the values to the left of the median: 18 27 34 52 54 59 61. We now find the median of this set of numbers. Remember, the median is the value exactly in the middle of an ordered set of numbers. Thus 52 is the median of the values less than the overall median, and therefore is the lower quartile.

    52 is the lower quartile

  • Now consider only the values to the right of the median: 78 82 85 87 91 93 100. We now find the median of this set of numbers. The median 87 is therefore called the upper quartile.

    87 is the upper quartile

  • You are now ready to find the interquartile range (IQR). The interquartile range is the difference between the upper quartile and the lower quartile. In our case the IQR = 87 - 52 = 35. The IQR is a very useful measurement because it is less influenced by extreme values: it limits the range to the middle 50% of the values.

          35 is the interquartile range

  • Lastly look at the minimum (lower extreme) and the maximum (upper extreme).

         18 is the minimum and 100 is the maximum
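The steps above can be sketched in Python as a single helper. The name `five_number_summary` is made up for this illustration, and the function assumes at least two observations.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using the steps above."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = median(ordered[:half])                  # median of the lower half
    q3 = median(ordered[len(ordered) - half:])   # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

marbles = [18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87, 91, 93, 100]
minimum, q1, m, q3, maximum = five_number_summary(marbles)
print(minimum, q1, m, q3, maximum)   # 18 52 68 87 100
print("IQR =", q3 - q1)              # IQR = 35
```

With fifteen observations the median (68) is the 8th value, so each half contains seven values and the median itself belongs to neither.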

Now we draw our graph.

A boxplot also gives an indication of the symmetry or skewness of a distribution. In a symmetric distribution, the first and third quartiles are equally distant from the median. In most distributions that are skewed to the right, however, the third quartile will be farther above the median than the first quartile is below it. Likewise in left skewed distributions the first quartile will be farther below the median than the third quartile is above it. The extremes behave the same way, but remember that they are just single observations and may say little about the distribution as a whole.

Outliers usually deserve special attention. Because the regular boxplot conceals outliers, we will adopt the modified boxplot, which plots outliers as isolated points.

Example: Modified boxplots

Data: 2.2  2.7  1.7  3.2  1.9  0.9  2.0  2.0  1.6  2.1  1.6  1.1  4.5  2.0  3.4  1.9  2.1  2.6  1.2  0.4  1.8  1.6  0.8
 3.1  1.6  1.8  2.2  2.9  1.2  1.8  1.9  2.0  3.3  1.5  4.1  2.1  1.4

The five-number summary is

Min    Q1     Median    Q3     Max
0.4    1.6    1.9       2.4    4.5

Modified boxplots attempt to identify possible outliers. Values are classified as outliers if they are more than a distance of 1.5 × (Q3 - Q1) above Q3 or below Q1. For our data this distance is 1.5 × (2.4 - 1.6) = 1.5 × 0.8 = 1.2, so the cutoffs are 1.6 - 1.2 = 0.4 and 2.4 + 1.2 = 3.6. Any value below 0.4 or above 3.6 is classified as an outlier. The values 4.1 and 4.5 are outliers. The minimum, 0.4, sits exactly at the lower cutoff and is not flagged.

The whiskers of a modified boxplot extend to the last data value in this range, and possible outliers are marked separately. Here the whiskers run from 0.4 to 3.4, and the outliers are plotted as isolated points.

Because the modified boxplot shows more detail, when we say “boxplot” from now on, we will mean “modified boxplot.”
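The numbers needed to draw the modified boxplot in the example above can be computed with the following sketch. It uses Python's `decimal` module rather than ordinary floating-point numbers, because the smallest observation lies exactly on the lower cutoff and binary rounding could otherwise misclassify it. The variable names are illustrative only.

```python
from decimal import Decimal
from statistics import median

raw = """2.2 2.7 1.7 3.2 1.9 0.9 2.0 2.0 1.6 2.1 1.6 1.1 4.5 2.0 3.4 1.9
         2.1 2.6 1.2 0.4 1.8 1.6 0.8 3.1 1.6 1.8 2.2 2.9 1.2 1.8 1.9 2.0
         3.3 1.5 4.1 2.1 1.4"""
data = sorted(Decimal(v) for v in raw.split())

half = len(data) // 2
q1 = median(data[:half])
q3 = median(data[len(data) - half:])
step = Decimal("1.5") * (q3 - q1)
low, high = q1 - step, q3 + step

# Outliers lie strictly beyond a cutoff; the whiskers stop at the most
# extreme observations that are still inside the cutoffs.
outliers = [x for x in data if x < low or x > high]
inside = [x for x in data if low <= x <= high]
whisker_low, whisker_high = inside[0], inside[-1]

print("Q1 =", q1, " Q3 =", q3)
print("cutoffs:", low, high)
print("outliers:", [str(x) for x in outliers])
print("whiskers:", whisker_low, whisker_high)
```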

Measuring spread: the standard deviation

The five-number summary is not the most common numerical description of a distribution. That distinction belongs to the combination of the mean to measure center and the standard deviation to measure spread. The standard deviation measures spread by looking at how far the observations are from their mean.

Before we define the standard deviation we must first look at another measure of dispersion or spread, the variance. Following is an example of how to calculate variance. Once the variance is calculated, the standard deviation is simply the positive square root of the variance.

To calculate variance another term is introduced, the sum of squares.

 Sum of squares is the sum of the squared deviations of observations from the mean of the distribution.

  • The  sum of squares is commonly represented as SS
  • The formula is $SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\bar{x}$ is the mean of the n observations
  • This quantity is also known as the variation.
    • It appears in the numerator of formulas for standard deviation and variance (below).
    • Here is a "worksheet" for computing the  sum of squares:

      Data set {5, 9, 2, 8, 6, 5, 4, 7, 4, 3, 1, 6}, with mean $\bar{x}$ = 5

      Values of x_i    Deviations of x_i from the mean    Deviations squared
           (1)                       (2)                         (3)
      _______________________________________________________________________
            5                         0                           0
            9                         4                          16
            2                        -3                           9
            8                         3                           9
            6                         1                           1
            5                         0                           0
            4                        -1                           1
            7                         2                           4
            4                        -1                           1
            3                        -2                           4
            1                        -4                          16
            6                         1                           1
      _______________________________________________________________________
      Sum  60                         0                          62

      Sum of the Squared Deviations: SS = 62
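The worksheet can be reproduced with a few lines of Python. This is a minimal sketch whose variable names are chosen for this illustration. Each list corresponds to one column of the worksheet.

```python
data = [5, 9, 2, 8, 6, 5, 4, 7, 4, 3, 1, 6]

mean = sum(data) / len(data)               # the mean of the data set
deviations = [x - mean for x in data]      # column (2)
squared = [d ** 2 for d in deviations]     # column (3)
ss = sum(squared)                          # sum of squares, SS

for x, d, d2 in zip(data, deviations, squared):
    print(f"{x:>3} {d:>6.0f} {d2:>6.0f}")
print("sum of deviations =", sum(deviations))
print("SS =", ss)
```

The deviations in column (2) always add up to zero, which is why they are squared before being summed.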

 

Variance is the average of the squared deviations, where the sum of squares is divided by n - 1 rather than n

  • The symbol for variance is s^2, accompanied by a subscript for the corresponding variable.

              The formula for variance of variable x: $s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{SS}{n - 1}$

  • For the data above, the variance computes as s^2 = 62 / (12 - 1) = 5.64
 

Standard deviation is the positive square root of the variance

  • The symbol for standard deviation is s

                      The formula for the standard deviation of variable x: $s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

  • For the data above, the standard deviation is s = $\sqrt{5.64}$ = 2.37
  • This is one of the most widely used formulas in statistics.
  • For now, learn this formula as the definition of the standard deviation.
  • s measures spread about the mean and should be used only when the mean is chosen as the measure of center.
  • s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise, s > 0. As the observations become more spread out about their mean, s gets larger.

  • s, like the mean, is not resistant. Strong skewness or a few outliers can make s very large.

  • We will get to the "theoretical reasons" later.
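The variance and standard deviation for the worksheet data can be checked with the sketch below. It computes both from the definition and then compares them with `variance` and `stdev` from Python's standard `statistics` module, which also divide by n - 1.

```python
from math import isclose, sqrt
from statistics import stdev, variance

data = [5, 9, 2, 8, 6, 5, 4, 7, 4, 3, 1, 6]
n = len(data)
mean = sum(data) / n

ss = sum((x - mean) ** 2 for x in data)   # sum of squares, 62
s2 = ss / (n - 1)                         # variance
s = sqrt(s2)                              # standard deviation

print(round(s2, 2), round(s, 2))          # 5.64 2.37

# The library functions should agree with the hand calculation.
print(isclose(s2, variance(data)), isclose(s, stdev(data)))
```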

Because the variance involves squaring the deviations, it does not have the same unit of measurement as the original observations. Lengths measured in centimeters, for example, have a variance measured in squared centimeters. Taking the square root remedies this. The standard deviation s measures spread about the mean in the original scale.

If the variance is the average of the squares of the deviations of the observations from their mean, why do we average by dividing by n – 1 rather than n? Because the sum of the deviations is always zero, the last deviation can be found once we know the other n – 1 deviations. So we are not averaging n unrelated numbers. Only n – 1 of the squared deviations can vary freely, and we average by dividing the total by n – 1. The number n – 1 is called the degrees of freedom of the variance or of the standard deviation. Many calculators offer a choice between dividing by n and dividing by n – 1, so be sure to use n – 1. Leaving the arithmetic to a calculator allows us to concentrate on what we are doing and why. What we are doing is measuring spread. The basic properties of the standard deviation s as a measure of spread are the ones listed above.
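The degrees-of-freedom argument can be seen numerically with the worksheet data set {5, 9, 2, 8, 6, 5, 4, 7, 4, 3, 1, 6}. The sketch below recovers the last deviation from the other n - 1 and then shows the two possible divisors side by side. In Python's `statistics` module, `pvariance` divides by n and `variance` divides by n - 1. The variable names are illustrative.

```python
from statistics import pvariance, variance

data = [5, 9, 2, 8, 6, 5, 4, 7, 4, 3, 1, 6]
n = len(data)
mean = sum(data) / n
deviations = [x - mean for x in data]

# The deviations sum to zero, so the last one is fixed by the other n - 1.
last_from_others = -sum(deviations[:-1])
print(last_from_others, deviations[-1])

ss = sum(d ** 2 for d in deviations)
print("divide by n:    ", ss / n, pvariance(data))
print("divide by n - 1:", ss / (n - 1), variance(data))
```

Dividing by n - 1 always gives the slightly larger value, and it is the one to use for s^2 in this course.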

You may rightly feel that the importance of the standard deviation is not yet clear. We will see in the next chapter that the standard deviation is the natural measure of spread for an important class of symmetric distributions, the normal distributions. The usefulness of many statistical procedures is tied to distributions of particular shapes. This is certainly true of the standard deviation.

Choosing measures of center and spread

How do we choose between the five-number summary and $\bar{x}$ and s to describe the center and spread of a distribution? Because the two sides of a strongly skewed distribution have different spreads, no single number such as s describes the spread well. The five-number summary, with its two quartiles and two extremes, does a better job.

The five-number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with strong outliers. Use $\bar{x}$ and s only for reasonably symmetric distributions that are free of outliers.
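To see why outliers matter for this choice, the sketch below compares the two kinds of summary on the Barry Bonds data and on a copy in which the record 73 is replaced by 703, the hypothetical value used earlier in this lesson. The helper name `summaries` is made up for this illustration.

```python
from statistics import mean, median, stdev

def summaries(data):
    """Return the mean, s, median and IQR of a data set."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])
    return {"mean": mean(ordered), "s": stdev(ordered),
            "median": median(ordered), "IQR": q3 - q1}

bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 46, 49, 73]
altered = bonds[:-1] + [703]   # hypothetical extreme value

original = summaries(bonds)
changed = summaries(altered)
for name in original:
    print(f"{name:6s} {original[name]:8.2f} {changed[name]:8.2f}")
```

The median and IQR are the same in both columns, while the mean and s are pulled upward by the single extreme value.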

Do remember that a graph gives the best overall picture of a distribution. Numerical measures of center and spread report specific facts about a distribution, but they do not describe its entire shape. Numerical summaries do not disclose the presence of multiple peaks or gaps, for example. Always plot your data.

Try Self-Check 6

You are now ready for Statistics Assignment 3: Measures of Central Tendencies

Review Measuring Center and Measuring Spread and take Multiple Choice 2.

 

© 2004 Aventa Learning. All rights reserved.