AP Statistics
Sections:  1. Transforming Relationships  2. Cautions about Correlation and Regression  3. Relations in Categorical Data

Cautions about Correlation and Regression

Correlation and regression are powerful tools for describing the relationship between two variables. When you use these tools, you must be aware of their limitations, beginning with the fact that correlation and regression describe only linear relationships. Also remember that the correlation r and the least-squares regression line are not resistant. One influential observation or incorrectly entered data point can greatly change these measures. Always plot your data before interpreting regression or correlation. Here are some other cautions to keep in mind when you apply correlation and regression or read accounts of their use.

Interpolation and Extrapolation

We will consider interpolation and extrapolation as applied to bivariate data, that is, data involving two variables. When two variables are measured on the same object or individual, one of them is usually identified as the explanatory variable and the other as the response variable. If the data were produced in an experiment, the explanatory variable is the factor that was manipulated by the experimenter, while the values of the response variable are paired with (and possibly caused by) the values of the explanatory variable. If no variable is manipulated, the explanatory and response variables should be designated so as to preserve any plausible causal direction between them, though it may sometimes be a mistake to assume that one causes the other.

Example:

Suppose that a consumer group purchased various kinds of chocolate candies.  Each chocolate was rated on a scale of 1 to 10 according to taste, and the cost per ounce (in cents) of each type was also recorded.  The following table was constructed using the rating and the cost per ounce.

Brand Cost (¢/oz) Rating
A 10 6
B 8 6
C 13 8
D 10 7
E 9 5
F 14 9
G 12 8
H 6 5

In this example we will designate cost as the explanatory variable, since it seems reasonable to suppose that expensive chocolate will taste better, though one can see the difficulties involved when making assumptions about causality. For instance, might it not be more accurate to say that the cost and the taste are both caused by the quality of the ingredients? In another famous example, one can consider the relationship between shoe size and reading ability for children in grades K-4. There is certainly a relationship between the two variables, with better readers generally having larger shoe sizes. But you might not want to conclude that large feet cause reading aptitude.

A scatterplot of the data may assist in determining some relationship among the data, if any exists.  If the data appears to have some relationship, a mathematical model can be used to make predictions about cost and rating. Note that in a scatterplot, the explanatory variable is generally graphed on the x-axis. Here are the scatterplots:

[Figure: scatterplot of the data (left) and linear model for the data (right)]

The tick marks on each set of axes are two units apart.  The x-axis represents the cost per ounce and the y-axis represents the rating.  The points in the scatterplot appear to follow a fairly linear trend, so the relationship among the data can be modeled by a linear function.  Any line that appears to fit the data could be used as a model, but in this case the linear model is the statistically defined line of best fit.  This best-fit line was generated by software using least-squares regression, and it is visually apparent that the line follows the trend of the data fairly well.  The equation of the line is

y = 0.5x + 1.5, or

rating = 0.5 (cost) + 1.5.

A general observation about the graph would be that the greater the cost of the chocolate, the better the rating. Since our best-fit line is a reasonably suitable model for the data, we can use it to make predictions about the explanatory and response variables. When we make predictions using our model, we don't care about thorny causality issues. We are merely following the trend that is present in the data, whatever the actual cause.
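As a check on the model, the least-squares slope and intercept can be computed directly from the table using the standard formulas b = Sxy/Sxx and a = ȳ − b·x̄. This minimal Python sketch shows that the exact coefficients (about 0.515 and 1.47) round to the 0.5 and 1.5 used above:

```python
# Chocolate data from the table: cost per ounce (cents) and taste rating.
costs   = [10, 8, 13, 10, 9, 14, 12, 6]
ratings = [6, 6, 8, 7, 5, 9, 8, 5]

n = len(costs)
mean_x = sum(costs) / n
mean_y = sum(ratings) / n

# Least-squares slope b = Sxy / Sxx and intercept a = y-bar - b * x-bar.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(costs, ratings))
sxx = sum((x - mean_x) ** 2 for x in costs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(round(slope, 3), round(intercept, 3))  # prints 0.515 1.47
```

The small difference between the exact coefficients and the rounded equation has no practical effect on predictions within the range of the data.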

What rating would an 11¢ chocolate have?  From the equation of the model, y = 0.5(11) + 1.5 = 7.  Thus, we would predict that a chocolate costing 11¢ per ounce would have a rating of 7.  This prediction is called an interpolation.

Interpolation

Any prediction made between known values of data is called interpolation.

What would be the cost of a chocolate that received a rating of 3? Substituting 3 for the rating in our equation gives 3 = 0.5x + 1.5, so 0.5x = 1.5 and x = 3 cents. This might seem like a reasonable prediction, but it is a questionable conclusion: we have extended the prediction beyond the domain of the data.  A cost of 3¢ does not fall between the lower and upper bounds of the observed costs, since we collected data only for costs between 6¢ and 14¢.

What is the rating for a chocolate costing 18¢ per ounce? We get y = 0.5(18) + 1.5, so y = 10.5. We now estimate a rating of 10.5, which is beyond the rating scale.  In fact, any cost over 17¢ per ounce yields a value that is not on the rating scale.  A prediction made beyond the known values of data is called extrapolation.

Extrapolation

Any prediction made beyond known values of data is called extrapolation.

Basing predictions on extrapolation is often riskier than interpolation, because it depends on the assumption that a pattern will continue past the range of the known data. As shown above, our model cannot possibly hold for values of x greater than 17. Use caution when making predictions through extrapolation.
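The interpolation/extrapolation distinction can be made explicit in code. This sketch wraps the model above in a hypothetical helper, predict_rating, that flags any prediction made outside the observed cost range of 6¢ to 14¢:

```python
# Hypothetical helper illustrating interpolation vs. extrapolation for the
# fitted chocolate model: rating = 0.5 * cost + 1.5.
def predict_rating(cost, lo=6, hi=14):
    """Predict a rating and label the prediction by whether the cost
    lies inside (interpolation) or outside (extrapolation) the data range."""
    rating = 0.5 * cost + 1.5
    kind = "interpolation" if lo <= cost <= hi else "extrapolation"
    return rating, kind

print(predict_rating(11))  # (7.0, 'interpolation')
print(predict_rating(18))  # (10.5, 'extrapolation') -- beyond the rating scale
```

The helper does not refuse to extrapolate; it merely warns, which mirrors how regression software behaves. Judging whether an extrapolated prediction is sensible is left to the analyst.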

In this example, we considered data that followed a linear trend. Some phenomena are modeled more effectively with more complicated functions. There are statistical analysis techniques that can be used to generate a multitude of modeling functions, including quadratic, polynomial, exponential and logarithmic functions.
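As a sketch of one such technique, an exponential model y = a·bˣ can be fit with the same least-squares machinery by linearizing with logarithms (ln y = ln a + x·ln b), the approach covered in the Transforming Relationships section. The data below are made up for illustration and follow y = 2·1.5ˣ exactly:

```python
import math

# Illustrative data that follows y = 2 * 1.5**x exactly.
xs = [1, 2, 3, 4, 5]
ys = [2 * 1.5 ** x for x in xs]

# Linearize: ln y = ln a + x * ln b, then run ordinary least squares.
logs = [math.log(y) for y in ys]
n = len(xs)
mx, my = sum(xs) / n, sum(logs) / n
slope = sum((x - mx) * (ly - my) for x, ly in zip(xs, logs)) \
        / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Back-transform to recover the exponential model's parameters.
a, b = math.exp(intercept), math.exp(slope)
print(round(a, 3), round(b, 3))  # recovers 2.0 and 1.5
```

Because the transformed data are perfectly linear here, the fit recovers the parameters exactly; with real data the recovered a and b would only approximate the underlying trend.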

Lurking Variables

A variable that has an important effect on the relationship between the variables under study, but is not one of the variables being studied, is called a lurking variable.

One can safely conclude that there is a negative linear association between car engine size (measured in cubic inches or liters) and miles per gallon (mpg): the smaller the engine, the higher the mpg, and vice versa. Manufacturers place stickers on new car windows indicating the estimated city and highway mpg. Suppose you and your neighbor buy the exact same model, and after several months you compare mileage. You tell your neighbor that you are averaging about 25 miles per gallon. He is surprised, because he averages only 19 mpg. What could be going on here? You notice that your neighbor's driving habits are different from yours. He constantly "races" the motor, drives fast, and "stomps" the pedal to accelerate. Your driving habits are more conservative.

Your neighbor concludes that the correlation between engine size and mpg must not be real: if he is only getting 19 mpg, he reasons, he should have a bigger engine. This is an example of a lurking variable affecting fuel economy (mpg). A variable not usually considered is the way the operator drives the car. The driver's style of driving lurks in the background and changes the fuel economy.

The Question of Causation

In many studies of the relationship between two variables, the goal is to establish that changes in the explanatory variable cause changes in the response variable. Even when a strong association is present, the conclusion that this association is due to a causal link between the variables is often elusive. What ties between two variables (and others lurking in the background) can explain an observed association? What constitutes good evidence for causation? We begin our consideration of these questions with a set of examples. In each case, there is a clear association between an explanatory variable x and a response variable y. Moreover, the association is positive whenever the direction makes sense.

The following are some examples of observed associations between x and y:

1. x = mother’s body mass index

    y = daughter’s body mass index

2. x = amount of the artificial sweetener saccharin in a rat’s diet

    y = count of tumors in the rat’s bladder

3. x = a high school senior’s SAT score

    y = the student’s first-year college grade point average

4. x = monthly flow of money into stock mutual funds

    y = monthly rate of return for the stock market

5. x = whether a person regularly attends religious services

    y = how long the person lives

6. x = the number of years of education a worker has

    y = the worker’s income

Items 1 and 2 above are examples of direct causation. Thinking about these examples, however, shows that “causation” is not a simple idea.

1. A study of Mexican American girls aged 9 to 12 years recorded body mass index (BMI), a measure of weight relative to height, for both the girls and their mothers. People with high BMI are overweight or obese. The study also measured hours of television, minutes of physical activity, and intake of several kinds of food. The strongest correlation (r = 0.506) was between the BMI of daughters and the BMI of their mothers. Body type is in part determined by heredity. Daughters inherit half their genes from their mothers. There is therefore a direct causal link between the BMI of mothers and daughters. Yet the mothers' BMIs explain only 25.6% (that's r²) of the variation among the daughters' BMIs. Other factors, such as diet and exercise, also influence BMI. Even when direct causation is present, it is rarely a complete explanation of an association between two variables.
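The 25.6% figure is just the square of the reported correlation, expressed as a percentage:

```python
# r-squared ("coefficient of determination") from the reported correlation.
r = 0.506                      # correlation between mothers' and daughters' BMIs
print(round(r ** 2 * 100, 1))  # percent of variation explained: 25.6
```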

2. The best evidence for causation comes from experiments that actually change x while holding all other factors fixed. If y changes, we have good reason to think that x caused the change in y. Experiments show conclusively that large amounts of saccharin in the diet cause bladder tumors in rats. Should we avoid saccharin as a replacement for sugar in food? Rats are not people. Although we can’t experiment with people, studies of people who consume different amounts of saccharin show little association between saccharin and bladder tumors. Even well-established causal relations may not generalize to other settings.

Explaining Association: Common Response

“Beware the lurking variable” is good advice when thinking about an association between two variables. The diagram below illustrates common response. The observed association between the variables x and y is explained by a lurking variable z. Both x and y change in response to changes in z. This common response creates an association even though there may be no direct causal link between x and y.

The third and fourth items in the example above illustrate how common response can create an association.

3. Students who are smart and who have learned a lot tend to have both high SAT scores and high college grades. The positive correlation is explained by this common response to students’ ability and knowledge.

4. There is a strong positive correlation between how much money individuals add to mutual funds each month and how well the stock market does the same month. Is the new money driving the market up? The correlation may be explained in part by common response to underlying investor sentiment: when optimism reigns, individuals send money to funds and large institutions also invest more. The institutions would drive up prices even if individuals did nothing. In addition, what causation there is may operate in the other direction: when the market is doing well, individuals rush to add money to their mutual funds.

Explaining Association: Confounding

As noted in the example with the Mexican American girls, inheritance no doubt explains part of the association between the body mass indexes (BMIs) of daughters and their mothers. Can we use r or r² to say how much inheritance contributes to the daughters' BMIs? No. It may well be that mothers who are overweight also set an example of little exercise, poor eating habits, and lots of television. Their daughters pick up these habits to some extent, so the influence of heredity is mixed up with influences from the girls' environment. We call this mixing of influences confounding.

CONFOUNDING

Two variables are confounded when their effects on a response variable cannot be distinguished from each other. The confounded variables may be either explanatory variables or lurking variables.

When many variables interact with each other, confounding of several variables often prevents us from drawing conclusions about causation. The diagram below illustrates confounding. Both the explanatory variable x and the lurking variable z may influence the response variable y. Because x is confounded with z, we cannot distinguish the influence of x from the influence of z. We cannot say how strong the direct effect of x on y is. In fact, it can be hard to say if x influences y at all.

The last two associations in the example above (Items 5 and 6) are explained in part by confounding.

5. Many studies have found that people who are active in their religion live longer than nonreligious people. But people who attend church or mosque or synagogue also take better care of themselves than non-attendees. They are less likely to smoke, more likely to exercise, and less likely to be overweight. The effects of these good habits are confounded with the direct effects of attending religious services.

6. It is likely that more education is a cause of higher income—many highly paid professions require advanced education. However, confounding is also present. People who have high ability and come from prosperous homes are more likely to get many years of education than people who are less able or poorer. Of course, people who start out able and rich are more likely to have high earnings even without much education. We can’t say how much of the higher income of well-educated people is actually caused by their education.

Many observed associations are at least partly explained by lurking variables. Both common response and confounding involve the influence of a lurking variable (or variables) z on the response variable y. The distinction between these two types of relationships is less important than the common element, the influence of lurking variables. The most important lesson of these examples is one we have already emphasized: even a very strong association between two variables is not by itself good evidence that there is a cause-and-effect link between the variables.

Establishing Causation

How can a direct causal link between x and y be established? The best method—indeed, the only fully compelling method—of establishing causation is to conduct a carefully designed experiment in which the effects of possible lurking variables are controlled. Much of Unit 5 is devoted to the art of designing convincing experiments. Many of the sharpest disputes in which statistics plays a role involve questions of causation that cannot be settled by experiment. Does gun control reduce violent crime? Does living near power lines cause cancer? Has increased free trade helped to increase the gap between the incomes of more educated and less educated American workers? All of these questions have become public issues. All concern associations among variables. And all have this in common: they try to pinpoint cause and effect in a setting involving complex relations among many interacting variables. Common response and confounding, along with the number of potential lurking variables, make observed associations misleading. Experiments are not possible for ethical or practical reasons. We can’t assign some people to live near power lines or compare the same nation with and without free-trade agreements.

Example:

DO POWER LINES INCREASE THE RISK OF LEUKEMIA?

Electric currents generate magnetic fields. So living with electricity exposes people to magnetic fields. Living near power lines increases exposure to these fields. Really strong fields can disturb living cells in laboratory studies. What about the weaker fields we experience if we live near power lines? It isn’t ethical to do experiments that expose children to magnetic fields. It’s hard to compare cancer rates among children who happen to live in more and less exposed locations, because leukemia is rare and locations vary in many ways other than magnetic fields. We must rely on studies that compare children who have leukemia with children who don’t.

A careful study of the effect of magnetic fields on children took five years and cost $5 million. The researchers compared 638 children who had leukemia and 620 who did not. They went into the homes and actually measured the magnetic fields in the children’s bedrooms, in other rooms, and at the front door. They recorded facts about nearby power lines for the family home and also for the mother’s residence when she was pregnant. Result: no evidence of more than a chance connection between magnetic fields and childhood leukemia.

“No evidence” that magnetic fields are connected with childhood leukemia doesn’t prove that there is no risk. It says only that a careful study could not find any risk that stands out from the play of chance that distributes leukemia cases across the landscape. Critics continue to argue that the study failed to measure some lurking variables, or that the children studied don’t fairly represent all children. Nonetheless, a carefully designed study comparing children with and without leukemia is a great advance over haphazard and sometimes emotional counting of cancer cases.

Example:

DOES SMOKING CAUSE LUNG CANCER?

Despite the difficulties, it is sometimes possible to build a strong case for causation in the absence of experiments. The evidence that smoking causes lung cancer is about as strong as non-experimental evidence can be. Doctors had long observed that most lung cancer patients were smokers. Comparison of smokers and similar nonsmokers showed a very strong association between smoking and death from lung cancer. Could the association be due to common response? Might there be, for example, a genetic factor that predisposes people both to nicotine addiction and to lung cancer? Smoking and lung cancer would then be positively associated even if smoking had no direct effect on the lungs. Or perhaps confounding is to blame. It might be that smokers live unhealthy lives in other ways (diet, alcohol, lack of exercise) and that some other habit confounded with smoking is a cause of lung cancer. How were these objections overcome?

Let’s answer this question in general terms: What are the criteria for establishing causation when we cannot do an experiment?

The association is strong. The association between smoking and lung cancer is very strong.

The association is consistent. Many studies of different kinds of people in many countries link smoking to lung cancer. That reduces the chance that a lurking variable specific to one group or one study explains the association.

Higher doses are associated with stronger responses. People who smoke more cigarettes per day or who smoke over a longer period get lung cancer more often. People who stop smoking reduce their risk.

The alleged cause precedes the effect in time. Lung cancer develops after years of smoking. The number of men dying of lung cancer rose as smoking became more common, with a lag of about 30 years. Lung cancer kills more men than any other form of cancer. Lung cancer was rare among women until women began to smoke. Lung cancer in women rose along with smoking, again with a lag of about 30 years, and has now passed breast cancer as the leading cause of cancer death among women.

The alleged cause is plausible. Experiments with animals show that tars from cigarette smoke do cause cancer.

Medical authorities do not hesitate to say that smoking causes lung cancer. The U.S. Surgeon General states that cigarette smoking is "the largest avoidable cause of death and disability in the United States." The evidence for causation is overwhelming, but it is not as strong as the evidence provided by well-designed experiments.

   

Be careful when using a statistical test to test for a dependency or correlation between variables.  The apparent dependent relationship or correlation may be coincidental rather than causal.

Try Self-Check 20

Review the content from lessons Transforming Relationships and Cautions about Correlation and Regression and take Multiple Choice 7.