AP Statistics
Sections: 1. Transforming Relationships | 2. Cautions about Correlation and Regression | 3. Relations in Categorical Data

Transforming Relationships

We will see in this section that understanding how simple functions work helps us choose and use transformations. Applying a function such as the logarithm or square root to a quantitative variable is called transforming or reexpressing the data. Because we may want to transform either the explanatory variable x or the response variable y in a scatterplot, or both, we will call the variable t when talking about transforming in general.

Transforming data amounts to changing the scale of measurement that was used when the data were collected. We can choose to measure temperature in degrees Fahrenheit or in degrees Celsius, distance in miles or in kilometers. These changes of units are linear transformations, discussed earlier in the course.

Linear transformations cannot straighten a curved relationship between two variables. We will resort to functions that are not linear.

How shall we measure the size of a sphere or of such roughly spherical objects as grains of sand or bubbles in a liquid? The size of a sphere can be expressed in terms of the diameter t, in terms of surface area (proportional to t^2), or in terms of volume (proportional to t^3). Any one of these powers of the diameter may be natural in a particular application.

We commonly measure the fuel consumption of a car in miles per gallon, which is how many miles the car travels on 1 gallon of fuel. Engineers prefer to measure in gallons per mile, which is how many gallons of fuel the car needs to travel 1 mile. This is a reciprocal transformation. A car that gets 25 miles per gallon uses 1/25 = 0.04 gallons per mile.

The reciprocal is a negative power: 1/t = t^(-1).

The transformations we have mentioned—linear, positive and negative powers, and logarithms—are those used in most statistical problems. They are all monotonic.

MONOTONIC FUNCTIONS

A monotonic function f(t) moves in one direction as its argument t increases.

A monotonic increasing function preserves the order of data. That is, if a > b, then f(a) > f(b).

A monotonic decreasing function reverses the order of data. That is, if a > b, then f(a) < f(b).

The graph of a linear function is a straight line. The graph of a monotonic increasing function is increasing everywhere. A monotonic decreasing function has a graph that is decreasing everywhere. A function can be monotonic over some range of t without being everywhere monotonic. For example, the square function t^2 is monotonic increasing for t > 0. If the range of t includes both positive and negative values, the square is not monotonic: it decreases as t increases for negative values of t and increases as t increases for positive values. The figure below compares monotonic increasing functions and monotonic decreasing functions for positive values of the argument t. Many variables take only 0 or positive values, so we are particularly interested in how functions behave for positive values of t. The increasing functions for t > 0 are

         

Linear: a + bt, slope b > 0
Square: t^2
Logarithm: log t
Exponential: be^t
Quadratic: t^2 + t + b
Power: at^b

The decreasing functions for t > 0 in the lower panel of the figure above are

                    

Linear: a + bt, slope b < 0
Reciprocal square root: 1/√t, or t^(-1/2)
Reciprocal: 1/t, or t^(-1)
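The order-preserving and order-reversing behavior described above can be checked numerically. The sketch below is only an illustration: the helper name `direction` and the sample values are assumptions made for this example, not part of the lesson.

```python
import math

def direction(f, values):
    """Classify f on the sorted values as 'increasing', 'decreasing', or 'neither'."""
    ts = sorted(values)
    outputs = [f(t) for t in ts]
    pairs = list(zip(outputs, outputs[1:]))   # consecutive outputs
    if all(a < b for a, b in pairs):
        return "increasing"                   # order preserved
    if all(a > b for a, b in pairs):
        return "decreasing"                   # order reversed
    return "neither"

positive = [0.5, 1, 2, 4, 8]    # sample values with t > 0
mixed = [-3, -1, 0, 1, 3]       # sample values on both sides of zero

print("log t, t > 0:   ", direction(math.log, positive))
print("sqrt t, t > 0:  ", direction(math.sqrt, positive))
print("1/t, t > 0:     ", direction(lambda t: 1 / t, positive))
print("t^2, t > 0:     ", direction(lambda t: t ** 2, positive))
print("t^2, mixed sign:", direction(lambda t: t ** 2, mixed))
```

A check on a handful of points cannot prove monotonicity, but it is a quick way to see why the square is safe to use only when all values are positive.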

Nonlinear monotonic transformations change data enough to alter the shape of distributions and the form of relations between two variables, yet are simple enough to preserve order and allow recovery of the original data. We will concentrate on powers and logarithms. The even-numbered powers t^2, t^4, and so on are monotonic increasing for t > 0, but not when t can take both negative and positive values. The logarithm is not even defined unless t > 0. Our strategy for transforming data is therefore as follows:

1. If the variable to be transformed takes values that are 0 or negative, first apply a linear transformation to make the values all positive. Often we just add a constant to all the observations.

2. Then choose a power or logarithmic transformation that simplifies the data, for example, one that approximately straightens a scatterplot.
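The two-step strategy can be sketched in a few lines. The data values, the helper name `shift_to_positive`, and the choice to make the smallest shifted value equal to 1 are all assumptions made for illustration; any constant that makes every value positive would do.

```python
import math

def shift_to_positive(values, margin=1.0):
    """Step 1: if any value is 0 or negative, add a constant so the minimum equals margin."""
    shift = margin - min(values) if min(values) <= 0 else 0.0
    return [v + shift for v in values], shift

raw = [-2.0, 0.0, 3.0, 10.0]              # made-up data containing 0 and a negative value
shifted, shift = shift_to_positive(raw)   # linear transformation t -> t + shift
logged = [math.log(v) for v in shifted]   # Step 2: the logarithm is now defined for every value

print("constant added:", shift)
print("shifted values:", shifted)
print("log of shifted:", [round(v, 4) for v in logged])
```

Because the shift is a linear transformation, it does not change the shape of the data; only the logarithm applied afterward does.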

Table of Common Transformations

Example:

The data below show the temperature of a freshly brewed cup of coffee as it cools after being poured from the brewing pot into a serving cup. The brewing pot temperature is approximately 180 °F.


 

Time (min)    Temp (°F)
0 179.5
5 168.7
8 158.1
11 149.2
15 141.7
18 134.6
22 125.4
25 123.5
30 116.3
34 113.2
38 109.1
42 105.7
45 102.2
50 100.5

First graph the data and check to see if a linear model is appropriate.

It is obvious that a linear model is not appropriate. It is now time to try some transformations. The plot has a gentle downward curve. Let's transform the response variable, temperature, by taking the natural log.

There is still a slight curvature in the graph. Let's take the natural log of the explanatory variable, time, and use the real values for temperature.

The graph now looks linear. Check out the same graph with the line of best fit superimposed.

Even though the fit is not perfect, it is a good one, with a high r^2 of 99.4%.

The linear model fitted to the transformed data is

Temp = 222.161 - 30.9843 ln(Time).

We now have an equation that can be used to make predictions. What is the predicted temperature when t = 8 minutes?

Temp = 222.161 - 30.9843 ln(8) = 157.73 °F

How close is this to the actual temperature recorded at 8 minutes in the data table? The recorded value was 158.1 °F, a difference of 0.37 °F. The original table follows, with the predicted values added. Comparing the third column with the second shows a decent match. Why does the first cell in the last column contain an *? The natural log of 0 is undefined, so the model gives no prediction at time 0. Be careful when making transformations: some values may be undefined under the transforming function.

 

Time (min)    Temp (°F)    Predicted temp (°F)
0 179.5 *
5 168.7 172.294
8 158.1 157.731
11 149.2 147.864
15 141.7 138.254
18 134.6 132.605
22 125.4 126.387
25 123.5 122.426
30 116.3 116.777
34 113.2 112.899
38 109.1 109.453
42 105.7 106.352
45 102.2 104.214
50 100.5 100.950
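The predicted column can be regenerated directly from the regression equation. In this sketch the function name `predicted_temp` is an illustrative choice, and returning `None` at time 0 is simply one way to represent the asterisk in the table.

```python
import math

def predicted_temp(minutes):
    """Prediction from Temp = 222.161 - 30.9843 ln(Time); None where ln is undefined."""
    if minutes <= 0:
        return None
    return 222.161 - 30.9843 * math.log(minutes)

observed = [(0, 179.5), (5, 168.7), (8, 158.1), (11, 149.2), (15, 141.7),
            (18, 134.6), (22, 125.4), (25, 123.5), (30, 116.3), (34, 113.2),
            (38, 109.1), (42, 105.7), (45, 102.2), (50, 100.5)]

for minutes, temp in observed:
    pred = predicted_temp(minutes)
    shown = "*" if pred is None else f"{pred:.3f}"
    print(f"{minutes:>3}  {temp:>6.1f}  {shown:>8}")
```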

Check the normal probability plot.

This is not a perfect model by any means, but it gives a fair linear model based on transforming the data. Linear models allow us to predict because of their constancy, i.e., a constant slope.

Another way to use the equation is to calculate the time at which a given temperature is reached. For example, at what time is the temperature 120 °F?

120 = 222.161 - 30.9843 ln(t), where t is the unknown time and ln(t) is its natural log.

120 - 222.161 = -30.9843 ln(t)

-102.161 = -30.9843 ln(t)

-102.161 / -30.9843 = ln(t)

3.2972 = ln(t)

This is not the answer! The value 3.2972 must be back-transformed from the natural log.

t = e^3.2972 = 27.04 minutes

Make sure you always do the final back transformation.
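The inverse calculation, including the final back transformation, can be carried out numerically as a check on the arithmetic. The function name `time_for_temp` is an illustrative choice, and the result is in minutes, the time unit of the data table.

```python
import math

INTERCEPT = 222.161   # fitted line: Temp = INTERCEPT + SLOPE * ln(Time)
SLOPE = -30.9843

def time_for_temp(temp_f):
    """Solve temp_f = INTERCEPT + SLOPE * ln(t) for t."""
    ln_t = (temp_f - INTERCEPT) / SLOPE   # this value is still on the log scale
    return math.exp(ln_t)                 # back-transform to the original time scale

t = time_for_temp(120)
print(f"ln(t) = {math.log(t):.4f}")
print(f"t = {t:.2f} minutes")
```

Substituting the result back into the regression equation should return 120 °F, which is an easy way to catch a slip in the algebra.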

Besides the transformations mentioned above, you may run across a graph that needs more than one transformation. These relationships are called complex.

To handle these types of situations, break the data into two or more groups and perform a transformation on each group.

Try Self Check 18

Proceed to AP Statistics Assignment 8: Transforming Data

It is much more satisfactory to begin with a theory or mathematical model  that we expect to describe a relationship. The transformation needed to make the relationship linear is then a consequence of the model. One of the most common models is exponential growth.


Exponential growth

A variable grows linearly over time if it adds a fixed increment in each equal time period. Exponential growth occurs when a variable is multiplied by a fixed number in each time period. To grasp the effect of multiplicative growth, consider a population of bacteria in which each bacterium splits into two each hour.

Beginning with a single bacterium, we have 2 after one hour, 4 at the end of two hours, 8 after three hours, then 16, 32, 64, 128, and so on. These first few numbers are deceiving. After 1 day of doubling each hour, there are 2^24 (16,777,216) bacteria in the population. That number then doubles the next hour! Try successive multiplications by 2 on your calculator to see for yourself the very rapid increase after a slow start. The figure below shows the growth of the bacteria population over 24 hours.

For the first 15 hours, the population is too small to rise visibly above the zero level on the graph. It is characteristic of exponential growth that the increase appears slow for a long period, then seems to explode.

LINEAR VERSUS EXPONENTIAL GROWTH

Linear growth increases by a fixed amount in each equal time period.

Exponential growth increases by a fixed percentage of the previous total.
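The contrast between the two kinds of growth is easy to see by generating both sequences. The doubling sequence follows the bacteria example; the linear sequence, which adds one unit per hour, is an arbitrary comparison chosen for this sketch.

```python
def linear_growth(start, increment, periods):
    """Add a fixed increment in each time period."""
    return [start + increment * k for k in range(periods + 1)]

def exponential_growth(start, factor, periods):
    """Multiply by a fixed factor in each time period."""
    return [start * factor ** k for k in range(periods + 1)]

hours = 24
linear = linear_growth(1, 1, hours)          # adds 1 each hour (illustrative increment)
bacteria = exponential_growth(1, 2, hours)   # doubles each hour, as in the bacteria example

print("first hours, exponential:", bacteria[:8])
print("after 24 hours, linear:     ", linear[-1])
print("after 24 hours, exponential:", f"{bacteria[-1]:,}")
```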

The logarithm transformation

The growth curve for the number of cell phone subscribers does look somewhat like the exponential curve in the bacteria example above, but our eyes are not very good at comparing curves of roughly similar shape. We need a better way to check whether growth is exponential. If you suspect exponential growth, you should first calculate ratios of consecutive terms. In the table below, we have divided each entry in the "Subscribers" column (the y variable) by its predecessor, leaving out both the first value of y, because it doesn't have a predecessor, and the second value, because the x increment is not 1. Notice that the ratios are not exactly the same, but they are approximately the same.

TABLE  Ratios of consecutive y-values and the logarithms of the y-values for the cell phone data

Year    Subscribers    Ratio    log(y)
1990          5,283        -    3.72288
1993         16,009        -    4.20436
1994         24,134     1.51    4.38263
1995         33,786     1.40    4.52874
1996         44,043     1.30    4.64388
1997         55,312     1.26    4.74282
1998         69,209     1.25    4.84016
1999         86,047     1.24    4.93474
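The ratio and logarithm columns of this table can be recomputed as follows. This sketch assumes, as the table values suggest, that log(y) is the base-10 logarithm; the variable names are illustrative.

```python
import math

years = [1990, 1993, 1994, 1995, 1996, 1997, 1998, 1999]
subscribers = [5283, 16009, 24134, 33786, 44043, 55312, 69209, 86047]

rows = []
for i, (year, y) in enumerate(zip(years, subscribers)):
    # A ratio is computed only when the previous entry is exactly one year earlier.
    if i > 0 and year - years[i - 1] == 1:
        ratio = y / subscribers[i - 1]
    else:
        ratio = None
    rows.append((year, y, ratio, math.log10(y)))

for year, y, ratio, log_y in rows:
    ratio_text = "-" if ratio is None else f"{ratio:.2f}"
    print(f"{year}  {y:>7,}  {ratio_text:>5}  {log_y:.5f}")
```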

The next step is to apply a mathematical transformation that changes exponential growth into linear growth, and patterns of growth that are not exponential into something other than linear. But before we do the transformation, we need to review the properties of logarithms. The basic idea of a logarithm is this: log_2 8 = 3 because 3 is the exponent to which the base 2 must be raised to yield 8. Here is a quick summary of algebraic properties of logarithms:

ALGEBRAIC PROPERTIES OF LOGARITHMS

log_b x = y if and only if b^y = x

The rules for logarithms are

1. log(AB) = logA + logB

2. log(A/B) = logA – logB

3. log(X^p) = p log X
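The definition and the three rules can be spot-checked numerically. The test values A, B, X, and p below are arbitrary positive numbers chosen for illustration; a numerical check is a sanity test, not a proof.

```python
import math

A, B, X, p = 8.0, 2.0, 5.0, 3.0   # arbitrary positive test values

# Definition: log_b x = y exactly when b^y = x, for example log_2 8 = 3.
print("log_2 8 =", math.log2(8))

rule1 = math.isclose(math.log10(A * B), math.log10(A) + math.log10(B))
rule2 = math.isclose(math.log10(A / B), math.log10(A) - math.log10(B))
rule3 = math.isclose(math.log10(X ** p), p * math.log10(X))

print("rule 1 holds:", rule1)
print("rule 2 holds:", rule2)
print("rule 3 holds:", rule3)
```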

Prediction in the exponential growth model

Regression is often used for prediction. When we fit a least-squares regression line, we find the predicted response y for any value of the explanatory variable x by substituting our x-value into the equation of the line. In the case of exponential growth, the logarithms rather than the actual responses follow a linear pattern. To do prediction, we need to “undo” the logarithm transformation to return to the original units of measurement. The same idea works for any monotonic transformation. There is always exactly one original value behind any transformed value, so we can always go back to our original scale.

Make sure that you understand the big idea here. The necessary transformation is carried out by taking the logarithm of the response variable. Your calculator and most statistical software will calculate the logarithms of all the values of a variable with a single command. The essential property of the logarithm for our purposes is that it straightens an exponential growth curve. If a variable grows exponentially, its logarithm grows linearly.

Power law models

When you visit a pizza parlor, you order a pizza by its diameter, say 10 inches, 12 inches, or 14 inches. But the amount you get to eat depends on the area of the pizza. The area of a circle is π times the square of its radius. So the area of a round pizza with diameter x is

area = πr^2 = π(x/2)^2 = π(x^2/4) = (π/4)x^2

This is a power law model of the form

y = ax^p

When we are dealing with things of the same general form, whether circles or fish or people, we expect area to go up with the square of a dimension such as diameter or height. Volume should go up with the cube of a linear dimension. That is, geometry tells us to expect power laws in some settings. Biologists have found that many characteristics of living things are described quite closely by power laws. There are more mice than elephants, and more flies than mice—the abundance of species follows a power law with body weight as the explanatory variable. So do pulse rate, length of life, the number of eggs a bird lays, and so on. Sometimes the powers can be predicted from geometry, but sometimes they are mysterious. Why, for example, does the rate at which animals use energy go up as the 3/4 power of their body weight? Biologists call this relationship Kleiber’s law. It has been found to work all the way from bacteria to whales. The search goes on for some physical or geometrical explanation for why life follows power laws. There is as yet no general explanation, but power laws are a good place to start in simplifying relationships for living things.

Exponential growth models become linear when we apply the logarithm transformation to the response variable y. Power law models become linear when we apply the logarithm transformation to both variables. Here are the details:

1. The power law model is

y = ax^p

2. Take the logarithm of both sides of this equation. You see that

log y = log a + p log x

That is, taking the logarithm of both variables straightens the scatterplot of y against x.

3. Look carefully: The power p in the power law becomes the slope of the straight line that links log y to log x.

Prediction in power law models

If taking the logarithms of both variables makes a scatterplot linear, a power law is a reasonable model for the original data. We can even roughly estimate what power p the law involves by regressing log y on log x and using the slope of the regression line as an estimate of the power. Remember that the slope is only an estimate of the p in an underlying power model. The greater the scatter of the points in the scatterplot about the fitted line, the smaller our confidence that this estimate is accurate.

Transformations require patience and practice. Do not be overwhelmed by all of the possibilities. Just approach each problem methodically and you will arrive at a reasonable solution.

Example

The true antelopes are found only in Africa and Asia. They range in size from the pygmy antelopes, which are 12 inches (30 cm) high at the shoulder, to the giant elands, which are over 6 feet (180 cm) high at the shoulder. Most antelopes stand between 3 and 4 feet (90-120 cm) high at the shoulder. The horns of antelopes, unlike the antlers of deer, are unbranched, are made of a keratin sheath over a bony core, and are not shed. The majority of antelopes reside in Africa.

Data: The data below give the lengths and mid-shaft diameters of the humerus bones of African antelopes.

Diameter (mm) Length (mm)
17.6 159.9
26.0 206.9
31.9 236.8
38.9 269.9
45.8 300.6
51.2 323.6
58.1 351.7
64.7 377.6
66.7 384.1
80.8 437.2
82.9 444.7

 

Task: a.) Determine a power regression model equation to represent this data.
  b.) Graph the new equation.
  c.) Decide whether the new equation is a "good fit" to represent this data.
  d.) What length will correspond to a diameter of 49 mm?

 

 

This is the graph of the raw data. Notice the slight curvature.

 

To straighten this curve, take the natural log of both the x and y variables. The transformed graph is shown below, along with the linear regression statistics for the transformed data. Notice the high r value.

 

The linear equation for the transformed data is ln y = 3.183 + 0.6595 ln x

Exponentiating both sides converts this to a power equation, y = 24.1299 x^0.6595

Substituting 49 for x, the y value is 314.2 mm.

You can also substitute 49 in the linear equation and arrive at the same answer.

ln y = 3.183 + 0.6595 ln(49)

ln y = 5.750, and back-transforming gives y = e^5.750 = 314.2 mm

It would be too laborious to go through every type of transformation. Do not be afraid to try different methods, but keep in mind the reason for the transformation. The goal is to produce a linear model that can be used to make predictions.

   Transformation Applet


Press the "Begin" button to start the applet in another window. The window displays four scatterplots of the data set "Body/brain" weight.

You can transform the X and Y axes by clicking on the appropriate radio button. The pop-up menu at the top of the window selects other datasets. To enter or paste in your own data, choose "Enter data" from the pop-up menu.

Try Self Check 19