AP Statistics

  Displaying Distributions

Displaying categorical variables: bar graphs and pie charts

The values of a categorical variable are labels for the categories, such as "male" and "female." The distribution of a categorical variable lists the categories and gives either the count or the percent of individuals who fall in each category.

Example: THE MOST POPULAR SOFT DRINK

The following table displays the sales figures and market share (percent of total sales) achieved by several major soft drink companies in 1999. That year, a total of 9930 million cases of soft drink were sold.

Company                      Cases sold (millions)   Market share (percent)
Coca-Cola Co.                4377.5                  44.1
Pepsi-Cola Co.               3119.5                  31.4
Dr. Pepper/7-Up (Cadbury)    1455.1                  14.7
Cott Corp.                    310.0                   3.1
National Beverage             205.0                   2.1
Royal Crown                   115.4                   1.2
Other                         347.5                   3.4

How to construct a bar graph:

Step 1: Label your axes and title your graph. Draw a set of axes. Label the horizontal axis "Company" and the vertical axis "Cases sold." Title your graph.

Step 2: Scale your axes. Use the counts in each category to help you scale your vertical axis. Write the category names at equally spaced intervals beneath the horizontal axis.

Step 3: Draw a vertical bar above each category name to a height that corresponds to the count in that category. For example, the height of the "Pepsi-Cola Co." bar should be at 3119.5 on the vertical scale. Leave a space between the bars in a bar graph. (See Figure 1.1a)

Figure 1.1a
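If no graphing software is at hand, the three steps above can be roughed out in a few lines of Python. This is only a sketch: it prints horizontal text bars rather than the vertical bars of Figure 1.1a, and the function name text_bar_graph and the scale of one mark per 100 million cases are assumptions made for illustration.

```python
# Cases sold in 1999 (millions), copied from the soft drink table.
cases_sold = {
    "Coca-Cola Co.": 4377.5,
    "Pepsi-Cola Co.": 3119.5,
    "Dr. Pepper/7-Up (Cadbury)": 1455.1,
    "Cott Corp.": 310.0,
    "National Beverage": 205.0,
    "Royal Crown": 115.4,
    "Other": 347.5,
}


def text_bar_graph(counts, units_per_mark=100.0):
    """Return one line per category; bar length is proportional to the count.

    units_per_mark is a hypothetical scale choice (one '#' per 100 million cases).
    """
    width = max(len(name) for name in counts)
    lines = []
    for name, count in counts.items():
        bar = "#" * round(count / units_per_mark)
        lines.append(f"{name:<{width}} | {bar} {count}")
    return lines


for line in text_bar_graph(cases_sold):
    print(line)
```

Each category gets its own separate bar, and the bar lengths make the same comparison as the bar heights in the figure.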

How to construct a pie chart:

Use a computer! Any statistical software package and many spreadsheet programs will construct these plots for you. (See Figure 1.1b)

Figure 1.1b

The bar graph in Figure 1.1a quickly compares the soft drink sales of the companies. The heights of the bars show the counts in the seven categories. The pie chart in Figure 1.1b helps us see what part of the whole each group forms. For example, the Coca-Cola "slice" makes up 44.1% of the pie because the Coca-Cola Company sold 44.1% of all soft drinks in 1999.

Pie charts can also be constructed by hand. Calculate the decimal equivalent for each slice, then multiply the decimal by 360 to obtain the degree measure for that slice. In the example above, Royal Crown has a decimal equivalent of 0.012 (1.2%). Multiplying 0.012 by 360 yields 4.32 degrees. Continue this process for each company listed. Slices this small are difficult to draw accurately by hand, which is why using a computer is suggested.
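The slice calculation is easy to automate. The sketch below applies the rule "decimal equivalent times 360" to every market share in the table; the function name slice_angles is an assumption chosen for illustration.

```python
# Market shares in percent, copied from the soft drink table.
market_share = {
    "Coca-Cola Co.": 44.1,
    "Pepsi-Cola Co.": 31.4,
    "Dr. Pepper/7-Up (Cadbury)": 14.7,
    "Cott Corp.": 3.1,
    "National Beverage": 2.1,
    "Royal Crown": 1.2,
    "Other": 3.4,
}


def slice_angles(percentages):
    """Convert each percentage to a central angle: (percent / 100) * 360."""
    return {name: pct / 100 * 360 for name, pct in percentages.items()}


angles = slice_angles(market_share)
for name, angle in angles.items():
    print(f"{name}: {angle:.2f} degrees")
print(f"Total: {sum(angles.values()):.1f} degrees")
```

Because the seven percentages add to 100, the seven angles add to a full circle of 360 degrees.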

Bar graphs and pie charts help an audience grasp the distribution quickly. To make a pie chart, you must include all the categories that make up a whole. Bar graphs are more flexible.

In 1998, the National Highway Traffic Safety Administration (NHTSA) conducted a study on seat belt use. The table below shows the percentage of automobile drivers who were observed to be wearing their seat belts in each region of the United States. The graph shows the same information.

Region       Percent wearing seat belts
Northeast    66.4
Midwest      63.6
South        78.9
West         80.8

Figure 1.2

Drivers in the South and West seem to be more concerned about wearing seat belts than those in the Northeast and Midwest. It is not possible to display these data in a single pie chart, because the four percentages cannot be combined to yield a whole (their sum is well over 100%).
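A quick arithmetic check makes the point concrete. In this sketch the function name percents_form_a_whole and the half-point rounding tolerance are assumptions. Adding to 100 is a necessary condition for a pie chart, not a sufficient one, since the categories must also be parts of one whole.

```python
def percents_form_a_whole(percentages, tolerance=0.5):
    """True when the percentages add to 100 within a rounding tolerance.

    The tolerance of 0.5 percentage points is a hypothetical allowance
    for rounding in published tables.
    """
    return abs(sum(percentages) - 100.0) <= tolerance


soft_drink_shares = [44.1, 31.4, 14.7, 3.1, 2.1, 1.2, 3.4]
seat_belt_use = [66.4, 63.6, 78.9, 80.8]

print(percents_form_a_whole(soft_drink_shares))
print(percents_form_a_whole(seat_belt_use))
print(round(sum(seat_belt_use), 1))
```

The soft drink shares pass the check, while the seat belt percentages add to 289.7 and fail it. Each seat belt figure is a percentage of a different group of drivers.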

Displaying quantitative variables: dotplots and stemplots

Several types of graphs can be used to display quantitative data. One of the simplest to construct is a dotplot.

The number of goals scored by each team in the first round of the California Southern Section Division V high school soccer playoffs is shown in the following table.

5 0 1 0 7 2 1 0 4 0 3 0 2 0

3 1 5 0 3 0 1 0 1 0 2 0 3 1

How to construct a dotplot:

Step 1: Label your axis and title your graph. Draw a horizontal line and label it with the variable (in this case, number of goals). Title your graph.

Step 2: Scale the axis based on the values of the variable.

Step 3: Mark a dot above the number on the horizontal axis corresponding to each data value. Figure 1.3 displays the completed dotplot.

Figure 1.3

Examining the dotplot, it is clear that zero was the most common number of goals scored. Dotplots give a quick view of the data and work especially well with data that have a small range; this example includes only integer values from 0 to 7.
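The three dotplot steps can also be sketched in Python. The version below stacks one asterisk per data value above a number line. The function name text_dotplot is an assumption, and the simple spacing only lines up for single-digit values such as these.

```python
from collections import Counter

# Goals scored by each team, copied from the table above.
goals = [5, 0, 1, 0, 7, 2, 1, 0, 4, 0, 3, 0, 2, 0,
         3, 1, 5, 0, 3, 0, 1, 0, 1, 0, 2, 0, 3, 1]


def text_dotplot(values):
    """Return the rows of a text dotplot, top row first, axis labels last.

    Assumes single-digit integer values so that columns stay aligned.
    """
    counts = Counter(values)
    low, high = min(values), max(values)
    tallest = max(counts.values())
    rows = []
    for level in range(tallest, 0, -1):
        cells = ["*" if counts[v] >= level else " " for v in range(low, high + 1)]
        rows.append(" ".join(cells).rstrip())
    rows.append(" ".join(str(v) for v in range(low, high + 1)))
    return rows


for row in text_dotplot(goals):
    print(row)
```

The tallest column sits above 0, and the empty column above 6 shows that no team scored exactly six goals.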

Stem and Leaf Plots

A stem and leaf plot (also known as a stem plot and stem and leaf graph) is a graphical method of displaying data. It is particularly useful when your data are not too numerous. In this section, we will explain how to construct and interpret this kind of graph.

As usual, an example will get us started. Consider Table 1. It shows the number of touchdown passes thrown by each of the 31 teams in the National Football League in the 2000 season.

Table 1. Number of touchdown passes.
37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20, 19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6

A stem and leaf plot display of the data is shown in Figure 1.4. The left portion of Figure 1.4 contains the stems. They are the numbers 3, 2, 1, and 0, arranged as a column to the left of the bars. Think of these numbers as 10's digits. A stem of 3 (for example) can be used to represent the 10's digit in any of the numbers from 30 to 39. The numbers to the right of the bar are leaves, and they represent the 1's digits. Every leaf in the graph therefore stands for the result of adding the leaf to 10 times its stem.

Figure 1.4

To make this clear, let us examine Figure 1.4 more closely. In the bottom row, the four leaves to the right of stem 3 are 2, 3, 3, and 7. Combined with the stem, these leaves represent the numbers 32, 33, 33, and 37, which are the numbers of TD passes for the first four teams in Table 1. The next row up has a stem of 2 and 12 leaves. Together, they represent 12 data points, namely, two occurrences of 20 TD passes, three occurrences of 21 TD passes, three occurrences of 22 TD passes, one occurrence of 23 TD passes, two occurrences of 28 TD passes, and one occurrence of 29 TD passes. We leave it to you to figure out what the second row represents. The top row has a stem of 0 and two leaves. It stands for the last two entries in Table 1, namely 9 TD passes and 6 TD passes. (The latter two numbers may be thought of as 09 and 06.)
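The stem and leaf construction can be sketched in Python using integer division: the quotient by 10 is the stem and the remainder is the leaf. The function name stemplot is an assumption, and the sketch assumes non-negative whole numbers below 100, as in Table 1.

```python
# Touchdown passes for the 31 NFL teams, copied from Table 1.
td_passes = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
             19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]


def stemplot(values):
    """Map each stem (tens digit) to its sorted leaves (ones digits).

    Assumes non-negative integers; every stem in the range gets a row,
    even if it has no leaves.
    """
    rows = {stem: [] for stem in range(min(values) // 10, max(values) // 10 + 1)}
    for value in sorted(values):
        rows[value // 10].append(value % 10)
    return rows


for stem, leaves in stemplot(td_passes).items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

The printed rows run from stem 0 at the top to stem 3 at the bottom, and the stem 3 row carries the leaves 2, 3, 3, and 7 described above.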

We can make our figure even more revealing by splitting each stem into two parts. Figure 1.5 shows how to do this. The bottom row is reserved for numbers from 35 to 39 and holds only the 37 TD passes made by the first team in Table 1. The next row up is reserved for the numbers from 30 to 34 and holds the 32, 33, and 33 TD passes made by the next three teams in the table. You can see for yourself what the other rows represent.

Figure 1.5

Figure 1.5 is more revealing than Figure 1.4 because the latter figure lumps too many values into a single row. Whether you should split stems in a display depends on the exact form of your data. If rows get too long with single stems, you might try splitting them into two or more parts.
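Splitting stems needs only one extra step: decide which half of the stem a leaf belongs to. In this sketch the function name split_stemplot and the (stem, half) keys are assumptions. Half 0 holds leaves 0 to 4 and half 1 holds leaves 5 to 9.

```python
# Touchdown passes for the 31 NFL teams, copied from Table 1.
td_passes = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
             19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]


def split_stemplot(values):
    """Map (stem, half) to sorted leaves for non-negative integers.

    half 0 collects leaves 0-4 and half 1 collects leaves 5-9.
    """
    rows = {}
    for stem in range(min(values) // 10, max(values) // 10 + 1):
        rows[(stem, 0)] = []
        rows[(stem, 1)] = []
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        rows[(stem, leaf // 5)].append(leaf)
    return rows


for (stem, half), leaves in split_stemplot(td_passes).items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

The row for 35 to 39 holds the single leaf 7, and the row for 30 to 34 holds the leaves 2, 3, and 3, matching the description of Figure 1.5.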

There is a variation of stem and leaf plots that is useful for comparing distributions. The two distributions are placed back to back along a common column of stems. The result is a "back to back stem and leaf graph." Figure 1.6 shows such a graph. It compares the numbers of TD passes in the 1998 and 2000 seasons. The stems are in the middle, the leaves to the left are for the 1998 data, and the leaves to the right are for the 2000 data. For example, the second row shows that in 1998 there were teams with 11, 12, and 13 TD passes, and in 2000 there were two teams with 12 and three teams with 14 TD passes.

Figure 1.6

Figure 1.6 helps us see that the two seasons were similar, but that only in 1998 did any teams throw more than 40 TD passes.

There are two things about the football data that make them easy to graph with stems and leaves. First, the data are limited to whole numbers that can be represented with a one-digit stem and a one-digit leaf. Second, all the numbers are positive. If the data include numbers with three or more digits, or contain decimals, they can be rounded to two-digit accuracy. Negative values are also easily handled. Let us look at another example.

Table 2 shows data from a study on aggressive thinking. Each value is the mean difference over a series of trials between the time it took an experimental subject to name aggressive words (like "punch") under two conditions. In one condition the words were preceded by a non-weapon word such as "bug." In the second condition, the same words were preceded by a weapon word such as "gun" or "knife." The issue addressed by the experiment was whether a preceding weapon word would speed up (or prime) pronunciation of the aggressive word, compared to a non-weapon priming word. A positive difference implies greater priming of the aggressive word by the weapon word. Negative differences imply that the priming by the weapon word was less than for a neutral word.

Table 2

You see that the numbers range from 43.2 to -27.4. The first value indicates that one subject was 43.2 milliseconds faster pronouncing aggressive words when they were preceded by weapon words than when preceded by neutral words. The value -27.4 indicates that another subject was 27.4 milliseconds slower pronouncing aggressive words when they were preceded by weapon words.

The data are displayed with stems and leaves in Figure 1.7. Stem and leaf displays can portray more than two digits (or numbers greater than two digits). It is done by putting the right-most digit in the leaf and all other digits in the stem. In this example, it is safe to round because we will not lose any valuable information about the data. However, had the data contained many decimals that were very close in value, we might have lost valuable information by rounding.

Thus, the value 43.2 is rounded to 43 and represented with a stem of 4 and a leaf of 3. Similarly, 42.9 is rounded to 43. To represent negative numbers, we simply use negative stems. For example, the top row of the figure represents the number -27. The second row represents the numbers -10, -10, -15, etc. Once again, we have rounded the original values from Table 2.

Figure 1.7

Observe that the figure contains a row headed by "0" and another headed by "-0". The stem of 0 is for numbers between 0 and 9, whereas the stem of -0 is for numbers between 0 and -9. For example, the fourth row of the figure holds the numbers 1, 2, 4, 5, 5, 8, 9 and the third row holds 0, -6, -7, and -9. Values that are exactly 0 before rounding should be split as evenly as possible between the "0" and "-0" rows. In Table 2, none of the values are 0 before rounding. The "0" that appears in the "-0" row comes from the original value of -0.2 in the table.
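The handling of the sign can be sketched in Python. The sign is taken from the original value rather than from the rounded one, which is what sends -0.2 to the "-0" row. The function name signed_stem_and_leaf is an assumption, and rounding halves away from zero is also an assumption, since the text does not say how ties are rounded. Only the four values quoted in the text are used, and values that are exactly 0 are not split between rows here.

```python
import math


def signed_stem_and_leaf(value):
    """Round to the nearest integer, then return (stem label, leaf).

    The sign comes from the unrounded value, so -0.2 lands in the "-0" row.
    Halves are rounded away from zero (an assumption, not from the text).
    """
    magnitude = math.floor(abs(value) + 0.5)
    sign = "-" if value < 0 else ""
    return f"{sign}{magnitude // 10}", magnitude % 10


# The four values quoted in the text.
for value in [43.2, 42.9, -27.4, -0.2]:
    print(value, signed_stem_and_leaf(value))
```

Both 43.2 and 42.9 end up with stem 4 and leaf 3, while -27.4 gets stem -2 and leaf 7.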

Although stem and leaf displays are unwieldy for large datasets, they are often useful for datasets with up to 200 observations. Figure 1.8 portrays the distribution of populations of 185 US cities in 1998. To be included, a city had to have between 100,000 and 500,000 residents.

Figure 1.8. Stem and leaf display of populations of US cities with populations between 100,000 and 500,000.

Since the data contained values in the hundreds of thousands and rounding did not cause us to lose any valuable information, we chose to round the numbers to the nearest 10,000 to make the stem plot easier to create. For example, the largest number (493,559) was rounded to 490,000 and then plotted with a stem of 4 and a leaf of 9. The fourth highest number (463,201) was rounded to 460,000 and plotted with a stem of 4 and a leaf of 6. Thus, the stems represent units of 100,000 and the leaves represent units of 10,000. Notice that each stem value is split into five parts: 0-1, 2-3, 4-5, 6-7, and 8-9.
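The rounding and five-way split can be sketched as follows. The function name city_stem_leaf is an assumption. The sketch relies on Python's built-in round, which sends exact ties such as 485,000 to the nearest even multiple; the text does not specify a tie rule.

```python
def city_stem_leaf(population):
    """Return (stem, leaf, part) for a population between 100,000 and 500,000.

    stem counts units of 100,000 and leaf counts units of 10,000.
    part numbers the five sub-rows of a stem:
    0 for leaves 0-1, 1 for 2-3, 2 for 4-5, 3 for 6-7, 4 for 8-9.
    """
    rounded = round(population, -4)  # nearest 10,000; ties go to the even multiple
    stem, remainder = divmod(rounded, 100_000)
    leaf = remainder // 10_000
    part = leaf // 2
    return stem, leaf, part


# The two populations quoted in the text.
for population in [493_559, 463_201]:
    print(population, city_stem_leaf(population))
```

Both cities share the stem 4, but their leaves of 9 and 6 place them in different sub-rows of that stem.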

We gain flexibility in creating stem and leaf plots when our data can be rounded without loss of important information. When this is the case, we try to fit extreme values into two successive digits, as the data in Figure 1.8 fit into the 10,000 and 100,000 places (for leaves and stems, respectively).

Here are a few tips for you to consider when you want to construct a stemplot:

• There is no magic number of stems to use. Too few stems will result in a skyscraper-shaped plot, while too many stems will yield a very flat "pancake" graph.

• Five stems is a good minimum.

• You can get more flexibility by rounding the data so that the final digit after rounding is suitable as a leaf. Do this when the data have too many digits.

The chief advantages of dotplots and stemplots are that they are easy to construct and they display the actual data values (unless we round). Neither will work well with large data sets. Most statistical software packages will make dotplots and stemplots for you. That will allow you to spend more time making sense of the data. Deciding what kind of graph is best suited to displaying your data thus requires good judgment. Statistics is not just recipes!

Try Self-Check 2

Displaying quantitative variables: histograms

Quantitative variables often take many values. A graph of the distribution is clearer if nearby values are grouped together. The most common graph of the distribution of one quantitative variable is a histogram.

Example: PRESIDENTIAL AGES AT INAUGURATION

How old are presidents at their inaugurations? Was Bill Clinton, at age 46, unusually young? Table 3 gives the data, the ages of all US presidents when they took office.

TABLE 3 Ages of the Presidents at inauguration

President         Age    President        Age    President          Age
Washington        57     Lincoln          52     Hoover             54
J. Adams          61     A. Johnson       56     F. D. Roosevelt    51
Jefferson         57     Grant            46     Truman             60
Madison           57     Hayes            54     Eisenhower         61
Monroe            58     Garfield         49     Kennedy            43
J. Q. Adams       57     Arthur           51     L. B. Johnson      55
Jackson           61     Cleveland        47     Nixon              56
Van Buren         54     B. Harrison      55     Ford               61
W. H. Harrison    68     Cleveland        55     Carter             52
Tyler             51     McKinley         54     Reagan             69
Polk              49     T. Roosevelt     42     G. Bush            64
Taylor            64     Taft             51     Clinton            46
Fillmore          50     Wilson           56     G. W. Bush         54
Pierce            48     Harding          55
Buchanan          65     Coolidge         51

How to make a histogram:

Step 1: Divide the range of the data into classes of equal width. Count the number of observations in each class. The data has a range from 42 to 69, so we choose as our classes

40 ≤ president's age at inauguration < 45

45 ≤ president's age at inauguration < 50

50 ≤ president's age at inauguration < 55

55 ≤ president's age at inauguration < 60

60 ≤ president's age at inauguration < 65

65 ≤ president's age at inauguration < 70

Be sure to specify the classes precisely so that each observation falls into exactly one class. This is done by making one of the inequalities in each of the statements above a "less than or equal to" inequality and the other a "strictly less than" inequality. Martin Van Buren, who was age 54 at the time of his inauguration, would fall into the third class interval. Grover Cleveland, who was age 55, would be placed in the fourth class interval.

Now that we have determined how we will group the data into classes, we can create a frequency table of the data. Frequency tables display the number of times a class or group of numbers occurs in the data. Here is the frequency table showing the counts for each class interval:

Class    Count
40-44      2
45-49      6
50-54     13
55-59     12
60-64      7
65-69      3
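The counting in Step 1 can be sketched in Python. Integer division by the class width assigns each age to exactly one class, which is the "less than or equal to" on the left and "strictly less than" on the right rule in code form. The function name class_counts is an assumption. The ages are transcribed from Table 3 in order of presidency, and the entries for Buchanan (65) and Coolidge (51) are included on the assumption that the table is meant to cover all 43 inaugurations counted in the frequency table.

```python
# Ages at inauguration in order of presidency, transcribed from Table 3.
# Assumption: Buchanan (65) and Coolidge (51) are included so that the
# list has the 43 entries that the frequency table counts.
ages = [
    57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48, 65,  # Washington to Buchanan
    52, 56, 46, 54, 49, 51, 47, 55, 55, 54, 42, 51, 56, 55, 51,  # Lincoln to Coolidge
    54, 51, 60, 61, 43, 55, 56, 61, 52, 69, 64, 46, 54,          # Hoover to G. W. Bush
]


def class_counts(values, start=40, width=5, n_classes=6):
    """Count values in classes [start, start + width), [start + width, ...), ...

    Each class includes its left endpoint and excludes its right endpoint,
    so every value falls into exactly one class.
    """
    counts = [0] * n_classes
    for value in values:
        index = (value - start) // width
        if not 0 <= index < n_classes:
            raise ValueError(f"{value} falls outside the chosen classes")
        counts[index] += 1
    return counts


for i, count in enumerate(class_counts(ages)):
    low = 40 + 5 * i
    print(f"{low}-{low + 4}: {count}")
```

An age of 54 lands in the third class and an age of 55 in the fourth, as with Van Buren and Cleveland above.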

Step 2: Label and scale your axes and title your graph. Label the horizontal axis "Ages" and the vertical axis "Frequency."  For the classes we chose, we should scale the horizontal axis from 40 to 70, with tick marks 5 units apart. The vertical axis contains the scale of counts and should range from 0 to at least 15.

Step 3: Draw a bar that represents the count in each class. The base of each bar should cover its class and be centered over the class mark (the midpoint of the interval), and the height of the bar should equal the class count. Leave no horizontal space between the bars (unless a class is empty, so that its bar has height 0). The figure below shows the completed histogram.

Histogram tips:

• There is no one right choice of the classes in a histogram. Too few classes will give a "skyscraper" graph, with all values in a few classes with tall bars. Too many will produce a "pancake" graph, with most classes having one or no observations. Neither choice will give a good picture of the shape of the distribution.

• Five classes is a good minimum.

• Our eyes respond to the area of the bars in a histogram, so be sure to choose classes that are all the same width. Then area is determined by height and all classes are fairly represented.

• If you use a computer or graphing calculator, beware of letting the device choose the classes.

Try Self-Check 3

Summary

We have just looked at 5 ways to display data. By no means is this exhaustive as there are other ways to graphically display data.

For categorical data we looked at bar graphs (also known as bar charts) and pie charts. These charts are used when the data are categorical (color, shape, gender, names, etc.).

Next we looked at stem plots and dot plots. These types of graphs are used to graph small data sets. There is no true definition of a small data set. When plotting a dot plot by hand, it becomes increasingly hard to keep an even spacing above each value as the data set grows. Either move to a computer program or consider changing to a histogram. The same could also be said for a stem plot. The Pittsburgh Steelers stem plot example may be a little large to do by hand, but it is doable. Every point in the data set can be recovered from a dot plot or a stem plot. Histograms, on the other hand, are graphs where quantitative data are grouped into classes. When the data are grouped into classes, the individuality of the data points is lost. This is not necessarily a bad feature as we move on to study the meaning of the shapes of graphs.

You are now ready for Statistics Assignment 2: Working with Graphs

Review the content and take Multiple Choice 1.