AP Statistics

Relations in Categorical Data

To this point we have concentrated on relationships in which at least the response variable was quantitative. Now we will shift to describing relationships between two or more categorical variables. Some variables, such as sex, race, and occupation, are inherently categorical. Other categorical variables are created by grouping values of a quantitative variable into classes. Published data are often reported in grouped form to save space. To analyze categorical data, we use the counts or percents of individuals that fall into various categories.

Example:

EDUCATION AND AGE

The table below presents Census Bureau data on the years of school completed by Americans of different ages. Many people under 25 years of age have not completed their education, so they are left out of the table. Both variables, age and education, are grouped into categories. This is a two-way table because it describes two categorical variables. Education is the row variable because each row in the table describes people with one level of education. Age is the column variable because each column describes one age group. The entries in the table are the counts of persons in each age-by-education class. Although both age and education in this table are categorical variables, both have a natural order from least to most. The order of the rows and the columns in the table reflects the order of the categories.

TABLE Years of school completed, by age, 2000 (thousands of persons)

Education 25-34 35-54 55+ Total
Did not complete high school 4,474 9,155 14,224 27,853
Completed high school 11,546 26,481 20,060 58,087
1 to 3 years of college 10,700 22,618 11,127 44,445
4 or more years of college 11,066 23,183 10,596 44,845
Total 37,786 81,435 56,008 175,230

Marginal distributions

How can we best grasp the information contained in the table above? First, look at the distribution of each variable separately. The distribution of a categorical variable just says how often each outcome occurred. The “Total” column at the right of the table contains the totals for each of the rows. These row totals give the distribution of education level (the row variable) among all people aged 25 and over: 27,853,000 did not complete high school, 58,087,000 finished high school but did not attend college, and so on. In the same way, the “Total” row on the bottom gives the age distribution. If the row and column totals are missing, the first thing to do in studying a two-way table is to calculate them. The distributions of education alone and age alone are often called marginal distributions because they appear at the right and bottom margins of the two-way table.

If you check the column totals in the table, you will notice a few discrepancies. For example, the sum of the entries in the “35 to 54” column is 81,437. The entry in the “Total” row for that column is 81,435. The explanation is roundoff error. The table entries are in the thousands of persons, and each is rounded to the nearest thousand. The Census Bureau obtained the “Total” entry by rounding the exact number of people aged 35 to 54 to the nearest thousand. The result was 81,435,000. Adding the column entries, each of which is already rounded, gives a slightly different result.
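The totals and the roundoff discrepancies described above can be checked with a short script. This is only a sketch: the variable names (counts, published_totals, and so on) are chosen here for illustration, and the numbers are typed in from the table.

```python
# Counts in thousands of persons, copied from the two-way table above.
# Rows are education levels; columns are the age groups 25-34, 35-54, 55+.
ages = ["25-34", "35-54", "55+"]
counts = {
    "Did not complete high school": [4474, 9155, 14224],
    "Completed high school": [11546, 26481, 20060],
    "1 to 3 years of college": [10700, 22618, 11127],
    "4 or more years of college": [11066, 23183, 10596],
}
# Column totals exactly as published in the "Total" row of the table.
published_totals = {"25-34": 37786, "35-54": 81435, "55+": 56008}

# Row totals: the marginal distribution of education, as counts.
row_totals = {level: sum(row) for level, row in counts.items()}

# Column totals: the marginal distribution of age, as counts.
column_totals = {
    age: sum(row[i] for row in counts.values()) for i, age in enumerate(ages)
}

# Compare the summed entries with the published totals.
for age in ages:
    gap = column_totals[age] - published_totals[age]
    print(f"{age}: summed {column_totals[age]:,}, "
          f"published {published_totals[age]:,}, gap {gap:+}")
```

The summed column totals differ from the published ones by 0, +2, and -1 thousand, which is the roundoff effect the text describes.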

Percents are often more informative than counts. We can display the marginal distribution of education level in terms of percents by dividing each row total by the table total and converting to a percent.

Years of school completed, by age, 2000 (thousands of persons), marginal distribution

Education Number Percent
Did not complete high school 27,853 15.9%
Completed high school 58,087 33.1%
1 to 3 years of college 44,445 25.4%
4 or more years of college 44,845 25.6%
Total 175,230 100.0%
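As a rough sketch of the calculation behind this table, each row total is divided by the table total and converted to a percent. The names row_totals and marginal_percent are illustrative only.

```python
# Row totals in thousands of persons, from the "Total" column of the table.
row_totals = {
    "Did not complete high school": 27853,
    "Completed high school": 58087,
    "1 to 3 years of college": 44445,
    "4 or more years of college": 44845,
}

# The table total is the denominator for every marginal percent.
table_total = sum(row_totals.values())

# Marginal distribution of education, in percent, rounded to one decimal.
marginal_percent = {
    level: round(100 * n / table_total, 1) for level, n in row_totals.items()
}

for level, pct in marginal_percent.items():
    print(f"{level:30s}{row_totals[level]:>9,}{pct:>7.1f}%")
print(f"{'Total':30s}{table_total:>9,}")
```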

Each marginal distribution from a two-way table is a distribution for a single categorical variable. In working with two-way tables, you must calculate lots of percents. Here’s a tip to help decide what fraction gives the percent you want. Ask, “What group represents the total that I want a percent of?” The count for that group is the denominator of the fraction that leads to the percent.

Example:

What percent of the group has 1 to 3 years of college?

The table shows 44,445,000 people have 1 to 3 years of college. The group contains 175,230,000 people. Therefore 44,445,000 is the numerator and 175,230,000 is the denominator, which gives 44,445 / 175,230 = 0.254, or 25.4%.

Describing relationships

The marginal distributions of age and of education separately do not tell us how the two variables are related. That information is in the body of the table. How can we describe the relationship between age and years of school completed? No single graph (such as a scatterplot) portrays the form of the relationship between categorical variables, and no single numerical measure (such as the correlation) summarizes the strength of an association. To describe relationships among categorical variables, calculate appropriate percents from the counts given. We use percents because counts are often hard to compare. For example, 11,066,000 people age 25 to 34 have completed college, and only 10,596,000 people in the 55 and over age group have done so. But the older age group is larger, so we can’t directly compare these counts.

Example:

HOW COMMON IS COLLEGE EDUCATION?

What percent of people aged 25 to 34 have completed 4 or more years of college? This is the count who are 25 to 34 and have 4 or more years of college as a percent of the age group total:

11,066 / 37,786 = 0.293 = 29.3%

“People aged 25 to 34” is the group we want a percent of, so the count for that group is the denominator. In the same way, the percent of people in the 55 and over age group who completed college is

10,596 / 56,008 = 0.189 = 18.9%

Here are the results for all three age groups:

Age group: 25 to 34 35 to 54 55+
Percent with 4 years of college 29.3 28.5 18.9

These percents help us see how the education of Americans varies with age. Older people are less likely to have completed college.
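The three percents can be reproduced with a few lines of code. This is a minimal sketch that assumes the published column totals are used as denominators; the dictionary names are made up for the example.

```python
# Counts in thousands, from the "4 or more years of college" row and the
# published "Total" row of the two-way table.
college_4plus = {"25 to 34": 11066, "35 to 54": 23183, "55+": 10596}
age_group_total = {"25 to 34": 37786, "35 to 54": 81435, "55+": 56008}

# The age group is the group we want a percent of, so its total is the
# denominator in each case.
percent_college = {
    age: round(100 * college_4plus[age] / age_group_total[age], 1)
    for age in college_4plus
}
print(percent_college)
```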

Conditional distributions

The last example does not compare the complete distributions of years of schooling in the three age groups. It compares only the percents who finished college. Let’s look at the complete picture.

CONDITIONAL DISTRIBUTIONS

Information about the 25 to 34 age group occupies the first column in the original table. To find the complete distribution of education in this age group, look only at that column. Compute each count as a percent of the column total: 37,786. Here is the distribution:

Education Did not finish high school Completed high school 1-3 years college 4 or more years of college Total
Percent 11.8 30.6 28.3 29.3 100.0

These percents add to 100% because all 25- to 34-year-olds fall in one of the educational categories. The four percents together are the conditional distribution of education, given that a person is 25 to 34 years of age. We use the term “conditional” because the distribution refers only to people who satisfy the condition that they are 25 to 34 years old.

For comparison, here is the conditional distribution of years of school completed among people age 55 and over. To find these percents, look only at the “55+” column in the table. The column total is the denominator for each percent calculation.

Education Did not finish high school Completed high school 1-3 years college 4 or more years of college Total
Percent 25.4 35.8 19.9 18.9 100

The percent who did not finish high school is much higher in the older age group, and the percents with some college and who finished college are much lower. Comparing the conditional distributions of education in different age groups describes the association between age and education. There are three different conditional distributions of education given age, one for each of the three age groups. All of these conditional distributions differ from the marginal distribution of education found in an earlier example.

Statistical software can speed the task of finding each entry in a two-way table as a percent of its column total. The figure below displays the result. The software found the row and column totals from the table entries, so they may differ slightly from those in the original table.

Education 25 to 34 35 to 54 55 + Total
Did not complete high school 4474 9155 14224 27853
  11.84% 11.24% 25.40%  
Completed high school 11546 26481 20060 58087
  30.56% 32.52% 35.82%  
1 to 3 years of college 10700 22618 11127 44445
  28.32% 27.77% 19.87%  
4 or more years of college 11066 23183 10596 44845
  29.29% 28.47% 18.92%  
Total 37786 81437 56007 175230

Each cell in this table contains a count from the original table along with that count as a percent of the column total. The percents in each column form the conditional distribution of years of schooling for one age group.
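The course does not say which software produced the figure. As one possible sketch, the same column percents can be computed in plain Python, with the column totals found from the table entries just as the software did. All names here are illustrative.

```python
# Counts in thousands of persons; columns are 25 to 34, 35 to 54, 55+.
ages = ["25 to 34", "35 to 54", "55+"]
counts = {
    "Did not complete high school": [4474, 9155, 14224],
    "Completed high school": [11546, 26481, 20060],
    "1 to 3 years of college": [10700, 22618, 11127],
    "4 or more years of college": [11066, 23183, 10596],
}

# Column totals computed from the entries (not the published totals).
column_totals = [sum(row[i] for row in counts.values()) for i in range(len(ages))]

# Each entry as a percent of its column total: one conditional
# distribution of education per age group.
column_percent = {
    level: [100 * n / column_totals[i] for i, n in enumerate(row)]
    for level, row in counts.items()
}

for level, row in counts.items():
    print(f"{level:30s}" + "".join(f"{n:>9,}" for n in row))
    print(" " * 30 + "".join(f"{p:>8.2f}%" for p in column_percent[level]))
print(f"{'Total':30s}" + "".join(f"{t:>9,}" for t in column_totals))
```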

The percents in each column add to 100% because everyone in the age group is accounted for. Comparing the conditional distributions reveals the nature of the association between age and education. The distributions of education in the two younger groups are quite similar, but higher education is less common in the 55 and over group.

No single graph (such as a scatterplot) portrays the form of the relationship between categorical variables. No single numerical measure (such as the correlation) summarizes the strength of the association. For numerical measures, we rely on well-chosen percents. You must decide which percents you need.

We compared the education of different age groups. That is, we thought of age as the explanatory variable and education as the response variable. We might also be interested in the distribution of age among persons having a certain level of education. To do this, look only at one row in the table. Calculate each entry in that row as a percent of the row total, the total of that education group. The result is another conditional distribution, the conditional distribution of age given a certain level of education. A two-way table contains a great deal of information in compact form. Making that information clear almost always requires finding percents. You must decide which percents you need. If you are studying trends in the training of the American workforce, comparing the distributions of education for different age groups reveals the more extensive education of younger people. If, on the other hand, you are planning a program to improve the skills of people who did not finish high school, the age distribution within this educational group is important information.
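Here is a small sketch of that second kind of conditional distribution: the distribution of age among people who did not complete high school. It looks only at the first row of the table and uses the row total as the denominator. The variable names are illustrative.

```python
# First row of the two-way table, in thousands of persons.
ages = ["25 to 34", "35 to 54", "55+"]
no_high_school = [4474, 9155, 14224]

# The education group is the group we want a percent of, so the row total
# is the denominator.
row_total = sum(no_high_school)

# Conditional distribution of age, given "did not complete high school".
age_given_education = {
    age: 100 * n / row_total for age, n in zip(ages, no_high_school)
}

for age, pct in age_given_education.items():
    print(f"{age:9s}{pct:6.1f}%")
```

For this row, the 55+ share comes out at roughly 51%, so about half of the people who did not finish high school are in the oldest age group.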

Example:

Number of deaths in Australia, ages 15-24

The table below shows the numbers of deaths in Australia in 1995 for people aged 15-24 years (Source: Australian Bureau of Statistics, 3303.0, pp.33-35):

Cause of death Males Females Total
Motor vehicle accident 448 146 594
Suicide 350 84 434
Other accident 257 74 331
Malignant cancer 86 50 136
Other diseases 267 153 420

Total 1,408 507 1,915

Each person who died was categorized by sex (M or F) and by cause of death. A cross-classified table is sometimes called a contingency table.

Do males and females in this age group die from the same causes?

To compare patterns of cause of death you need to consider relative frequencies or percentages because the total numbers of deaths are not the same for males and females.

Marginal frequency distribution for Deaths By Gender
  Males Females Totals
Numbers 1408 507 1915
Percent 73.5 26.5 100

Conclusion - In this age group there are nearly 3 times as many male deaths as female deaths (1408 / 507 is about 2.8).

Marginal frequency distribution for Cause of Death
Cause Number Percent
Motor vehicle accident 594 31.0
Suicide 434 22.7
Other accidents 331 17.3
Malignant neoplasms 136 7.1
Other diseases 420 21.9
Total 1915 100

This table is obtained by collapsing the original table over the factor 'sex'. Conclusion - The main causes of death in this age group are motor vehicle accidents, which cause 31% of all deaths, and suicides, which account for 22.7%.

Conditional frequency distribution
  Males Females
Cause No. Percent No. Percent
Motor vehicle accidents 448 31.8 146 28.8
Suicide 350 24.8 84 16.5
Other accidents 257 18.3 74 14.6
Cancer 86 6.1 50 9.9
Other diseases 267 19.0 153 30.2

Total 1408 100 507 100

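Conclusion - Motor vehicle accidents were the leading single cause of death for both males and females in the age group 15-24 years, accounting for about 30% of deaths in each group. (For females, the catch-all category "other diseases" is slightly larger, at 30.2%.) Suicide accounted for a larger share of male deaths (24.8%) than of female deaths (16.5%).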
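The marginal and conditional distributions in this example can be recomputed from the original counts. The following is a sketch only, with illustrative names. Because it prints unrounded values to two decimals, a few entries may differ from the printed tables in the last digit.

```python
# Deaths in the 15-24 age group, copied from the table above.
causes = [
    "Motor vehicle accident",
    "Suicide",
    "Other accident",
    "Malignant cancer",
    "Other diseases",
]
deaths = {
    "Males": [448, 350, 257, 86, 267],
    "Females": [146, 84, 74, 50, 153],
}

# Marginal distribution of sex: column totals as a percent of all deaths.
sex_totals = {sex: sum(col) for sex, col in deaths.items()}
grand_total = sum(sex_totals.values())
sex_percent = {sex: 100 * n / grand_total for sex, n in sex_totals.items()}

# Conditional distribution of cause, given sex: each count as a percent
# of the total for that sex.
cause_given_sex = {
    sex: [100 * n / sex_totals[sex] for n in col] for sex, col in deaths.items()
}

print(f"{'Cause':24s}{'Males':>8s}{'Females':>9s}")
for i, cause in enumerate(causes):
    males = cause_given_sex["Males"][i]
    females = cause_given_sex["Females"][i]
    print(f"{cause:24s}{males:8.2f}{females:9.2f}")
```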

Simpson's Paradox

Simpson’s Paradox refers to the reversal of the direction of a comparison or an association when data from several groups are combined to form a single group.

Here is an example of Simpson’s Paradox, from the Ask Marilyn column of Parade Magazine, 28 April 1996, p. 6. (Ask Marilyn® by Marilyn vos Savant is a column in Parade Magazine, published by PARADE, 711 Third Avenue, New York, NY 10017, USA. According to Parade, Marilyn vos Savant is listed in the "Guinness Book of World Records Hall of Fame" for "Highest IQ.")

A reader poses the following question:

A company decided to expand, so it opened a factory generating 455 jobs. For the 70 white collar positions, 200 males and 200 females applied. Of the females who applied, 20% were hired, while only 15% of the males were hired. Of the 400 males applying for the blue collar positions, 75% were hired, while 85% of the 100 applying females were hired.

A federal Equal Employment enforcement official noted that many more males were hired than females, and decided to investigate. Responding to charges of irregularities in hiring, the company president denied any discrimination, pointing out that in both the white collar and blue collar fields, the percentage of female applicants hired was greater than it was for males.

But the government official produced his own statistics, which showed that a female applying for a job had a 58% chance of being denied employment while male applicants had only a 45% denial rate. As the current law is written, this constituted a violation....Can you explain how two opposing statistical outcomes are reached from the same raw data?

What we have, of course, is an example of Simpson's paradox: The direction of association between gender and hiring rate appears to reverse when the data are aggregated across job classes. Marilyn correctly notes that, even though all the figures presented are correct, the two outcomes are not opposing.

She also presents her own analogy.

Say a company tests two treatments for an illness. In trial No. 1, treatment A cures 20% of its cases (40 out of 200) and treatment B cures 15% of its cases (30 out of 200). In trial No. 2, treatment A cures 85% of its cases (85 out of 100) and treatment B cures 75% of its cases (300 out of 400)....

So, in two trials, treatment A scored 20% and 85%. Also in two trials, treatment B scored only 15% and 75%. No matter how many people were in those trials, treatment A (at 20% and 85%) is surely better than treatment B (at 15% and 75%), right?

Wrong! Treatment B performed better. It cured 330 (300+30) out of the 600 cases (200+400) in which it was tried--a success rate of 55%...By contrast, treatment A cured 125 (40+85) out of the 300 cases (200+100) in which it was tried, a success rate of only about 42%.

She notes that this is exactly what happened to the employer. Because so many more men applied for the blue collar positions, even if the employer hired all the women who had applied for blue collar positions, it couldn't satisfy the government regulations.
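The reversal can be verified directly from the figures in the reader's letter. In this sketch the hire counts are worked out from the stated rates (for example, 20% of 200 is 40), and the dictionary layout is just one convenient way to hold the data.

```python
# Applicants and hires by job class and sex, from the reader's letter.
applicants = {
    "white collar": {"female": 200, "male": 200},
    "blue collar": {"female": 100, "male": 400},
}
hired = {
    "white collar": {"female": 40, "male": 30},   # 20% and 15%
    "blue collar": {"female": 85, "male": 300},   # 85% and 75%
}

# Hiring rate within each job class: females lead in both.
for job in applicants:
    for sex in ("female", "male"):
        rate = hired[job][sex] / applicants[job][sex]
        print(f"{job:12s} {sex:6s} hired {rate:.0%}")

# Hiring rate after combining the job classes: the comparison reverses.
overall = {}
for sex in ("female", "male"):
    total_hired = sum(hired[job][sex] for job in hired)
    total_applied = sum(applicants[job][sex] for job in applicants)
    overall[sex] = total_hired / total_applied
    print(f"{'overall':12s} {sex:6s} hired {overall[sex]:.1%}, "
          f"denied {1 - overall[sex]:.1%}")
```

Most male applicants applied for blue collar jobs, where hiring rates were high for everyone, while most female applicants applied for white collar jobs, where hiring rates were low. That difference in where people applied drives the combined rates.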

Simpson's Paradox in Real Life

Simpson’s Paradox arises in real-life situations. The National Science Foundation in the US conducted a study of persons who received a degree in science or engineering in 1977 or 1978. The study found that at the bachelor’s degree level, women with full-time jobs earned on average 77% of the average male salary. But within each field, the average salary for women was at least 92% of the average male salary. The explanation is a lurking variable, field of study: women were concentrated in the life sciences and social sciences, which had lower salaries in general.

The lurking variables in Simpson’s paradox are categorical. That is, they break the individuals into groups, as when job applicants are classified by job type, white collar or blue collar. Simpson’s paradox is just an extreme form of the fact that observed associations can be misleading when there are lurking variables.

We have come to the end of Exploratory Data Analysis. The tools and concepts presented to this point provide a foundation for the remainder of the course.

Try Self-Check 21

Proceed to AP Statistics Assignment 9: Two-Way Tables

Review the content of Unit 4 and proceed to the Unit Exam.