Relations in Categorical Data

To this point we have concentrated on relationships in which at least the response variable was quantitative. Now we shift to describing relationships between two or more categorical variables. Some variables, such as sex, race, and occupation, are inherently categorical. Other categorical variables are created by grouping values of a quantitative variable into classes. Published data are often reported in grouped form to save space. To analyze categorical data, we use the counts or percents of individuals that fall into various categories.

Example: EDUCATION AND AGE
TABLE Years of school completed, by age, 2000 (thousands of persons)
Marginal distributions

How can we best grasp the information contained in the table above? First, look at the distribution of each variable separately. The distribution of a categorical variable just says how often each outcome occurred. The “Total” column at the right of the table contains the totals for each of the rows. These row totals give the distribution of education level (the row variable) among all people over 25 years of age: 27,853,000 did not complete high school, 58,087,000 finished high school but did not attend college, and so on. In the same way, the “Total” row on the bottom gives the age distribution. If the row and column totals are missing, the first thing to do in studying a two-way table is to calculate them. The distributions of education alone and age alone are often called marginal distributions because they appear at the right and bottom margins of the two-way table.

If you check the column totals in the table, you will notice a few discrepancies. For example, the sum of the entries in the “35 to 54” column is 81,437, but the entry in the “Total” row for that column is 81,435. The explanation is roundoff error. The table entries are in thousands of persons, and each is rounded to the nearest thousand. The Census Bureau obtained the “Total” entry by rounding the exact number of people aged 35 to 54 to the nearest thousand. The result was 81,435,000. Adding the column entries, each of which is already rounded, gives a slightly different result.

Percents are often more informative than counts. We can display the marginal distribution of education level in terms of percents by dividing each row total by the table total and converting to a percent.

Years of school completed, by age, 2000 (thousands of persons), marginal distribution
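The marginal calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cell counts and the education-level labels below are assumed values, chosen so that they agree with the totals quoted in the text (row totals 27,853 and 58,087, and a “35 to 54” column sum of 81,437). They are not copied from the original table, and the function names are hypothetical.

```python
# Two-way table of counts, in thousands of persons.
# ASSUMPTION: these cell values and row labels are illustrative. They were
# chosen to agree with the totals quoted in the text, not taken from the table.
AGE_GROUPS = ["25 to 34", "35 to 54", "55 and over"]
COUNTS = {
    "Did not complete high school": [4474, 9155, 14224],
    "Completed high school": [11546, 26481, 20060],
    "1 to 3 years of college": [10700, 22618, 11127],
    "4 or more years of college": [11066, 23183, 10596],
}


def row_totals(counts):
    """Sum each row: the marginal counts of the row variable (education)."""
    return {level: sum(cells) for level, cells in counts.items()}


def column_totals(counts, groups):
    """Sum each column: the marginal counts of the column variable (age)."""
    return {
        group: sum(cells[i] for cells in counts.values())
        for i, group in enumerate(groups)
    }


def marginal_percents(totals):
    """Divide each marginal count by the table total and convert to a percent."""
    grand_total = sum(totals.values())
    return {key: 100 * value / grand_total for key, value in totals.items()}


education_totals = row_totals(COUNTS)
age_totals = column_totals(COUNTS, AGE_GROUPS)
education_percents = marginal_percents(education_totals)

for level, percent in education_percents.items():
    print(f"{level}: {education_totals[level]:,} thousand ({percent:.1f}%)")
```

Because the totals here are recomputed from rounded cell entries, a column sum such as the one for “35 to 54” can differ by a few thousand from a published total, which is the roundoff effect described above.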
Each marginal distribution from a two-way table is a distribution for a single categorical variable. In working with two-way tables, you must calculate lots of percents. Here’s a tip to help decide what fraction gives the percent you want. Ask, “What group represents the total that I want a percent of?” The count for that group is the denominator of the fraction that leads to the percent. Example:
Describing relationships

The marginal distributions of age and of education separately do not tell us how the two variables are related. That information is in the body of the table. How can we describe the relationship between age and years of school completed? No single graph (such as a scatterplot) portrays the form of the relationship between categorical variables, and no single numerical measure (such as the correlation) summarizes the strength of an association. To describe relationships among categorical variables, calculate appropriate percents from the counts given.

We use percents because counts are often hard to compare. For example, 11,066,000 people age 25 to 34 have completed college, and only 10,596,000 people in the 55 and over age group have done so. But the older age group is larger, so we can’t directly compare these counts.

Example: HOW COMMON IS COLLEGE EDUCATION?
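A short Python sketch shows why the percents are more informative than the counts. The two college counts come from the text. The two age-group totals are assumptions added for illustration, since the text does not quote them.

```python
# People who completed college, in thousands (both counts are quoted in the text).
college_counts = {"25 to 34": 11066, "55 and over": 10596}

# ASSUMPTION: illustrative age-group totals in thousands, not quoted in the text.
age_group_totals = {"25 to 34": 37786, "55 and over": 56008}

# Percent of each age group that completed college.
# The denominator is the size of the group we want a percent of.
college_percents = {
    group: 100 * college_counts[group] / age_group_totals[group]
    for group in college_counts
}

for group, percent in college_percents.items():
    print(f"{group}: {college_counts[group]:,} thousand, {percent:.1f}% of the group")
```

Under these assumed totals the counts differ by only a few percent, while the share of college graduates is far higher in the younger group.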
Conditional distributions

The last example does not compare the complete distributions of years of schooling in the three age groups. It compares only the percents who finished college. Let’s look at the complete picture.

CONDITIONAL DISTRIBUTIONS
Statistical software can speed the task of finding each entry in a two-way table as a percent of its column total. The figure below displays the result. The software found the row and column totals from the table entries, so they may differ slightly from those in the original table.
Each cell in this table contains a count from the original table along with that count as a percent of the column total. The percents in each column form the conditional distribution of years of schooling for one age group. The percents in each column add to 100% because everyone in the age group is accounted for. Comparing the conditional distributions reveals the nature of the association between age and education. The distributions of education in the two younger groups are quite similar, but higher education is less common in the 55 and over group.

No single graph (such as a scatterplot) portrays the form of the relationship between categorical variables. No single numerical measure (such as the correlation) summarizes the strength of the association. For numerical measures, we rely on well-chosen percents.

We compared the education of different age groups. That is, we thought of age as the explanatory variable and education as the response variable. We might also be interested in the distribution of age among persons having a certain level of education. To find it, look at only one row in the table. Calculate each entry in that row as a percent of the row total, which is the total for that education group. The result is another conditional distribution, the conditional distribution of age given a certain level of education.

A two-way table contains a great deal of information in compact form. Making that information clear almost always requires finding percents. You must decide which percents you need. If you are studying trends in the training of the American workforce, comparing the distributions of education for different age groups reveals the more extensive education of younger people. If, on the other hand, you are planning a program to improve the skills of people who did not finish high school, the age distribution within this educational group is important information.

Example: Number of deaths in Australia (1989)
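Both kinds of conditional distribution can be sketched in Python. As before, this is an illustration under assumptions: the cell counts and education-level labels are assumed values consistent with the figures quoted in the text, not a copy of the original table, and the function names are hypothetical.

```python
# Two-way table of counts, in thousands of persons.
# ASSUMPTION: illustrative cell values and labels, consistent with the
# figures quoted in the text but not taken from the original table.
AGE_GROUPS = ["25 to 34", "35 to 54", "55 and over"]
COUNTS = {
    "Did not complete high school": [4474, 9155, 14224],
    "Completed high school": [11546, 26481, 20060],
    "1 to 3 years of college": [10700, 22618, 11127],
    "4 or more years of college": [11066, 23183, 10596],
}


def education_given_age(counts, groups):
    """Each entry as a percent of its column total (one distribution per age group)."""
    result = {}
    for i, group in enumerate(groups):
        column_total = sum(cells[i] for cells in counts.values())
        result[group] = {
            level: 100 * cells[i] / column_total for level, cells in counts.items()
        }
    return result


def age_given_education(counts, groups):
    """Each entry as a percent of its row total (one distribution per education level)."""
    result = {}
    for level, cells in counts.items():
        row_total = sum(cells)
        result[level] = {
            group: 100 * cell / row_total for group, cell in zip(groups, cells)
        }
    return result


by_age = education_given_age(COUNTS, AGE_GROUPS)
by_education = age_given_education(COUNTS, AGE_GROUPS)

# Compare the education distributions across the three age groups.
for group, distribution in by_age.items():
    print(group)
    for level, percent in distribution.items():
        print(f"  {level}: {percent:.1f}%")
```

Which function to use depends on the question: `education_given_age` treats age as the explanatory variable, while `age_given_education` answers questions about the age makeup of one education group.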
Simpson's Paradox

Simpson’s Paradox refers to the reversal of the direction of a comparison or an association when data from several groups are combined to form a single group.
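A small numerical sketch in Python makes the reversal concrete. The hiring counts below are hypothetical, invented for illustration and not taken from any data in this lesson. Within each job class women are hired at a higher rate than men, yet the combined rate is higher for men because most men applied for the job class with the higher hiring rate.

```python
from fractions import Fraction

# HYPOTHETICAL counts for illustration only: (applicants, hired).
APPLICATIONS = {
    "white collar": {"men": (200, 30), "women": (200, 40)},
    "blue collar": {"men": (400, 300), "women": (100, 85)},
}


def rates_by_class(data):
    """Hiring rate for each gender within each job class."""
    return {
        job: {
            gender: Fraction(hired, applied)
            for gender, (applied, hired) in groups.items()
        }
        for job, groups in data.items()
    }


def combined_rates(data):
    """Hiring rate for each gender after pooling all job classes."""
    totals = {}
    for groups in data.values():
        for gender, (applied, hired) in groups.items():
            prev_applied, prev_hired = totals.get(gender, (0, 0))
            totals[gender] = (prev_applied + applied, prev_hired + hired)
    return {
        gender: Fraction(hired, applied)
        for gender, (applied, hired) in totals.items()
    }


within = rates_by_class(APPLICATIONS)
overall = combined_rates(APPLICATIONS)

for job, rates in within.items():
    print(job, {gender: f"{float(rate):.1%}" for gender, rate in rates.items()})
print("combined", {gender: f"{float(rate):.1%}" for gender, rate in overall.items()})
```

Exact fractions are used instead of floating-point numbers so that the comparisons between rates are not affected by rounding.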
A reader poses the following question:
What we have, of course, is an example of Simpson's paradox: The direction of association between gender and hiring rate appears to reverse when the data are aggregated across job classes. Marilyn correctly notes that, even though all the figures presented are correct, the two outcomes are not opposing. She also presents her own analogy.
She notes that this is exactly what happened to the employer. Because so many more men applied for the blue collar positions, even if the employer hired all the women who had applied for blue collar positions, it couldn't satisfy the government regulations.

Simpson's Paradox in Real Life

Simpson’s Paradox actually arises in real-life situations. The National Science Foundation in the US conducted a study of persons who received a degree in science or engineering in 1977 or 1978. The study found that at the bachelor’s degree level, women with full-time jobs earned on average 77% of the average male salary. But comparing salaries within each field, the average salary for women was in each case at least 92% of the average male salary. The explanation is what is called a lurking variable: women were concentrated in the life sciences and social sciences, which had lower salaries in general.

The lurking variables in Simpson’s paradox are categorical. That is, they break the individuals into groups, as when applicants are classified by gender as male or female. Simpson’s paradox is just an extreme form of the fact that observed associations can be misleading when there are lurking variables.

We have come to the end of Exploratory Data Analysis. The tools and concepts presented to this point provide a foundation for the remainder of the course.

Try Self-Check 21
Proceed to AP Statistics Assignment 9: Two-Way Tables
Review the content of Unit 4 and proceed to the Unit Exam.