AP Statistics

  Measuring Center

Summarizing data can help us understand them, especially when the number of data is large. This section presents several ways to summarize data by a typical value (a measure of location).

What single number is most representative of an entire list of numbers? We cannot say without making "representative" a more precise notion. The mean, the median, and the mode make "representative" precise in three different, but related, ways. The median, mode and mean are defined as follows:

  • The median is the midpoint of a distribution, the number such that half the observations are smaller and the other half are larger. To find the median of a distribution:

    1. Arrange all observations in order of size, from smallest to largest.

    2. If the number of observations n is odd, the median M is the center observation in the ordered list.

    3. If the number of observations n is even, the median M is the mean of the two center observations in the ordered list.

  • The mode of a set of data (as opposed to the mode of a histogram) is the most common value among the data. It is rare that several data coincide exactly, unless the variable is discrete, or the measurements are reported with low precision.
  • The mean (more precisely, the arithmetic mean) is commonly called the average. It is the sum of the data, divided by the number of data:
    \[ \text{mean} = \frac{\text{sum of data}}{\text{number of data}} = \frac{\text{total}}{\text{number of data}}. \]
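The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `median`, `modes`, and `mean` and the small sample lists are assumptions made for this example.

```python
from collections import Counter


def median(values):
    """Median by the three-step recipe: sort, then take the middle."""
    ordered = sorted(values)              # step 1: smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                        # step 2: n odd, center observation
        return ordered[mid]
    # step 3: n even, mean of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(values):
    """All most-common values (there can be more than one mode)."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]


def mean(values):
    """Arithmetic mean: sum of the data divided by the number of data."""
    return sum(values) / len(values)


# Hypothetical sample lists, chosen only to exercise each case.
print(median([3, 1, 2]))                  # odd n
print(median([4, 1, 3, 2]))               # even n
print(modes(["red", "blue", "red"]))      # the mode also works for categories
print(mean([1, 2, 3, 6]))
```

Returning a list from `modes` is a design choice: it keeps ties visible instead of silently picking one value.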

For qualitative and categorical data, the mode makes sense, but the mean and median do not.

It is hard to see the connection between the mean, median, and mode from their definitions. However, the mean, the median, and the mode are all "as close as possible" to all the data: for each of these three measures of location, the sum of the distances between each datum and the measure of location is as small as it can be. The differences among the three measures of location are in how "distance" is defined. The three notions of distance are as follows:

  • For the mean, the distance between two numbers is defined to be the square of their difference. That is, the sum of the squares of the differences between the data and the mean is smaller than the sum of squares of the differences between the data and any other number.
  • For the median, the distance between two numbers is defined to be the absolute value of their difference. That is, the sum of the absolute values of the differences between a median and the data is no larger than the sum of the absolute values of the differences between any other number and the data.
  • For the mode, the distance between two numbers is defined to be zero if the numbers are equal, and one if they are not equal. That is, the number of data that differ from a mode is no larger than the number of data that differ from any other value. Equivalently, a mode is a number from which the fewest possible data differ: a "most common" value.
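The three minimization claims can be checked numerically. The sketch below is a brute-force illustration rather than a proof: it assumes a small hypothetical data set and searches a grid of candidate values (both chosen for this example) for the one that minimizes each kind of total distance.

```python
def sum_squared(data, c):
    """Total squared distance from c to the data (minimized by the mean)."""
    return sum((x - c) ** 2 for x in data)


def sum_absolute(data, c):
    """Total absolute distance from c to the data (minimized by a median)."""
    return sum(abs(x - c) for x in data)


def count_different(data, c):
    """Number of data that differ from c (minimized by a mode)."""
    return sum(1 for x in data if x != c)


# Hypothetical data: mean = 20/5 = 4, median = 2, mode = 2.
data = [1, 2, 2, 3, 12]

# Candidate centers 0.0, 0.1, ..., 13.0 (an arbitrary grid for illustration).
candidates = [k / 10 for k in range(0, 131)]

best_sq = min(candidates, key=lambda c: sum_squared(data, c))
best_abs = min(candidates, key=lambda c: sum_absolute(data, c))
best_cnt = min(candidates, key=lambda c: count_different(data, c))

print(best_sq)   # 4.0, the mean
print(best_abs)  # 2.0, the median
print(best_cnt)  # 2.0, the mode
```

A grid search only finds the best candidate on the grid; it works here because the mean, median, and mode of this data set all lie on it.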

All three of these measures of location are examples of statistics: numbers computed from data.

The mean, median, and mode can be related to the histogram: loosely speaking, the mode is the highest bump, the median is where half the area is to the right and half is to the left, and the mean is where the histogram would balance, were it a solid object cut out of a uniform block of metal. (Read from a histogram in this way, all three measures of center are only approximate, and they depend on the choice of class intervals.)

Random data to illustrate calculating measures of location.

data 3 5 0 1 -2 -3

The table below shows these random data sorted into increasing order, which makes it easier to calculate the median.

Sorted random data to illustrate calculating measures of location.

sorted data -3 -2 0 1 3 5

Half the data are less than or equal to any number from 0 (inclusive) up to 1 (exclusive), so any such number splits the data in half. By our definition, the median is the mean of the two middle numbers, 0 and 1, which is 0.5. For these data, every value in the list is a mode: each value occurs exactly once, so all are "most common."

Computing the mean is familiar:

\[ \frac{3 + 5 + 0 + 1 + (-2) + (-3)}{6} = \frac{4}{6} \approx 0.667 \]
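The same calculations can be reproduced with Python's standard `statistics` module. This is a quick check of the worked example, assuming Python 3.8 or later (needed for `statistics.multimode`).

```python
import statistics

data = [3, 5, 0, 1, -2, -3]

print(sorted(data))                       # [-3, -2, 0, 1, 3, 5]
print(statistics.median(data))            # 0.5, mean of the middle pair 0 and 1
print(statistics.multimode(data))         # every value, since each occurs once
print(round(statistics.mean(data), 3))    # 0.667, that is, 4/6
```

`multimode` is used instead of `mode` because it reports every tied value, which matches the observation that all six values are modes.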

In general, the mean and the median need not be close together. For exactly symmetric distributions, they are exactly equal, but for skewed distributions, they can be arbitrarily far apart. This is because data far from the "center" of the distribution have a lot of leverage on the mean, just as a light person can balance a much heavier one on a teeter-totter if she sits much farther from the fulcrum than the heavier person does. The median is smaller than the mean if the data are skewed to the right, and larger than the mean if the data are skewed to the left. Because the mean is (essentially) the balance point of the histogram, a small number of data can affect it a great deal, if they are very large (positive or negative). Corrupting just one datum can make the mean arbitrarily large or small.
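The effect of skewness on the two measures can be seen in a small sketch. The three lists below are hypothetical, constructed only to be right-skewed, left-skewed, and symmetric.

```python
import statistics

# Hypothetical data sets, chosen for illustration.
right_skewed = [1, 2, 2, 3, 3, 4, 30]        # long tail to the right
left_skewed = [-x for x in right_skewed]     # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, values in [("right-skewed", right_skewed),
                     ("left-skewed", left_skewed),
                     ("symmetric", symmetric)]:
    med = statistics.median(values)
    avg = statistics.mean(values)
    print(name, "median:", med, "mean:", round(avg, 2))

# right-skewed: median 3 is smaller than the mean (about 6.43)
# left-skewed:  median -3 is larger than the mean (about -6.43)
# symmetric:    median and mean are both 3
```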

The median is affected much less by small subsets of the data. To make the median arbitrarily large or small, one must corrupt half the data. Corrupting just one datum changes the median by a limited amount, and not at all if one of the observations above the median is made larger, or one of the observations below the median is made smaller. Statistics that are not affected too much by small subsets of the data are resistant. The median is resistant; the mean is not.
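Resistance can be illustrated by corrupting a single value in the six-number data set used earlier. The helper `replace_value` and the corrupted values are assumptions made for this sketch.

```python
import statistics

data = [-3, -2, 0, 1, 3, 5]


def replace_value(values, old, new):
    """Return a copy of values with the first occurrence of old replaced by new."""
    copy = list(values)
    copy[copy.index(old)] = new
    return copy


# Make the largest observation larger and larger: the mean follows it,
# but the median stays at 0.5 because the middle pair (0 and 1) is unchanged.
for bad in (50, 5_000, 5_000_000):
    corrupted = replace_value(data, 5, bad)
    print(bad, round(statistics.mean(corrupted), 2), statistics.median(corrupted))

# Moving an observation from below the median to far above it does change
# the median, but only by a limited amount: the middle pair becomes 1 and 3.
moved = replace_value(data, 0, 5_000_000)
print(statistics.median(moved))  # 2.0
```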

Which measure of location is the most appropriate depends on the intended use of the summary. If we are interested in a total, the mean tends to be the most meaningful, because the mean is the total divided by the number of data. For example, the mean income of the individuals in a family indicates how much the family can spend on each family member's necessities of life. On the other hand, the median can be much more informative in other situations.

Suppose we want to know how much money a family can afford to spend on housing. That depends on the total family income, which is the mean income of the family members, times the number of family members. For a family of five, consisting of two parents who work and three children with no income, the mean income, times five, is the total amount of money the family makes each year. The median income of these five family members is zero, because more than half of them make nothing.
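The family example can be made concrete with numbers. The incomes below are hypothetical figures invented for this sketch; only the structure (two earners, three children with no income) comes from the text.

```python
import statistics

# Hypothetical annual incomes: two working parents, three children.
incomes = [60_000, 40_000, 0, 0, 0]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(mean_income)                   # 20000
print(mean_income * len(incomes))    # 100000, the total family income
print(median_income)                 # 0, since more than half earn nothing
```

The mean recovers the total when multiplied by the number of family members, while the median says nothing about what the family can afford.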

On the other hand, suppose we want to decide whether a country is affluent. At issue, in some sense, is whether most of the citizens have a high income. The mean family income could be quite high even if most families earn essentially nothing--if income is highly concentrated in a few very wealthy families. Then the median family income would be a more meaningful measure: At least half the families make no more than the median, and at least half make at least as much as the median.

Similarly, suppose you are applying for a job as an architect at several large firms, and you want to get an idea of how much money you should expect to be making in five years if you join a particular firm. Consider the salaries of architects in each firm five years after they are hired. Just one very high salary could make the mean salary high, so the mean might not reflect what is typical. On the other hand, half the architects make the median salary or less, and half make the median salary or more, so the median would give you a better idea of a typical salary.


© 2004 Aventa Learning. All rights reserved.