AP Statistics
Sections: 1. Introduction | 2. Designing Samples | 3. Designing Experiments | 4. Simulating Experiments

Introduction

Exploratory data analysis seeks to discover and describe what data say by using graphs and numerical summaries. The conclusions we draw from data analysis apply to the specific data that we examine. Often, however, we want to answer questions about some large group of individuals. To get sound answers, we must produce data in a way that is designed to answer our questions.

Suppose our question is “What percent of American adults agree that the United Nations should continue to have its headquarters in the United States?”  To answer the question, we interview American adults. We can’t afford to ask all adults, so we put the question to a sample chosen to represent the entire adult population. How shall we choose a sample that truly represents the opinions of the entire population? Statistical designs for choosing samples are the topic of the second lesson in this unit.
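The idea of putting the question to a sample rather than to every adult can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population below is invented (100,000 adults, 60% of whom agree), and the function name estimate_percent_agree is hypothetical. The sample is a simple random sample, one of the designs treated in the second lesson.

```python
import random

def estimate_percent_agree(population, sample_size, seed=0):
    """Estimate the percent of True values in `population`
    from a simple random sample drawn without replacement."""
    rng = random.Random(seed)
    # rng.sample gives every group of `sample_size` individuals
    # the same chance of being the sample chosen.
    sample = rng.sample(population, sample_size)
    return 100.0 * sum(sample) / sample_size

# HYPOTHETICAL population, invented for illustration only:
# True means "agrees", False means "does not agree".
population = [True] * 60_000 + [False] * 40_000

estimate = estimate_percent_agree(population, sample_size=1_000)
print(f"Percent agreeing in the sample: {estimate:.1f}%")
```

Interviewing 1,000 people instead of 100,000 gives an estimate that typically lands within a few percentage points of the population value, which is why a well-chosen sample is usually enough.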

Our goal in choosing a sample is a picture of the population, disturbed as little as possible by the act of gathering information. Sample surveys are one kind of observational study. In other settings, we gather data from an experiment. In doing an experiment, we don’t just observe individuals or ask them questions. We actively impose some treatment in order to observe the response. Experiments can answer questions such as “Does aspirin reduce the chance of a heart attack?” and “Does a majority of college students prefer Pepsi to Coke when they taste both without knowing which they are drinking?” Experiments, like samples, provide useful data only when properly designed. We will discuss statistical design of experiments in the third lesson. The distinction between experiments and observational studies is one of the most important ideas in statistics.

OBSERVATION VERSUS EXPERIMENT

An observational study observes individuals and measures variables of interest but does not attempt to influence the responses.

An experiment, on the other hand, deliberately imposes some treatment on individuals in order to observe their responses.

Observational studies are essential sources of data about topics from the opinions of voters to the behavior of animals in the wild. But an observational study, even one based on a statistical sample, is a poor way to gauge the effect of an intervention. To see the response to a change, we must actually impose the change. When our goal is to understand cause and effect, experiments are the only source of fully convincing data.

Example:

HELPING WELFARE MOTHERS FIND JOBS

Most adult recipients of welfare are mothers of young children. Observational studies of welfare mothers show that many are able to increase their earnings and leave the welfare system. Some take advantage of voluntary job-training programs to improve their skills. Should participation in job-training and job-search programs be required of all able-bodied welfare mothers? Observational studies cannot tell us what the effects of such a policy would be. Even if the mothers studied are a properly chosen sample of all welfare recipients, those who seek out training and find jobs may differ in many ways from those who do not. They are observed to have more education, for example, but they may also differ in values and motivation, things that cannot be observed.

To see whether a required jobs program will help mothers escape welfare, such a program must actually be tried. Choose two similar groups of mothers when they apply for welfare. Require one group to participate in a job-training program, but do not offer the program to the other group. This is an experiment. Comparing the incomes and work records of the two groups after several years will show whether requiring training has the desired effect.

When we simply observe welfare mothers, the effect of job-training programs on success in finding work is confounded with (mixed up with) the characteristics of mothers who seek out training on their own. Recall that two variables (explanatory variables or lurking variables) are said to be confounded when their effects on a response variable cannot be distinguished from each other.
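A small simulation can make the confounding concrete. The sketch below rests entirely on invented assumptions: training has no effect at all on finding work, motivation raises the chance of finding work from 0.3 to 0.6, and motivated mothers seek out training more often. A coin flip is used to form the two experimental groups; that is one way to obtain similar groups, not something the text above prescribes. All names and probabilities are hypothetical.

```python
import random

def simulate(n=20_000, seed=1):
    """Compare an observational study with an experiment under an
    ASSUMED model in which training has no real effect."""
    rng = random.Random(seed)

    def finds_work(motivated):
        # Assumption: only motivation affects the outcome.
        return rng.random() < (0.6 if motivated else 0.3)

    observed = {True: [], False: []}    # mothers choose training themselves
    experiment = {True: [], False: []}  # a coin flip assigns training

    for _ in range(n):
        motivated = rng.random() < 0.5

        # Observational study: motivated mothers seek training more often.
        chose_training = rng.random() < (0.8 if motivated else 0.2)
        observed[chose_training].append(finds_work(motivated))

        # Experiment: assignment ignores motivation entirely.
        assigned_training = rng.random() < 0.5
        experiment[assigned_training].append(finds_work(motivated))

    def rate(outcomes):
        return sum(outcomes) / len(outcomes)

    observed_gap = rate(observed[True]) - rate(observed[False])
    experiment_gap = rate(experiment[True]) - rate(experiment[False])
    return observed_gap, experiment_gap

observed_gap, experiment_gap = simulate()
print(f"Observed gap (trained minus untrained):   {observed_gap:+.3f}")
print(f"Experiment gap (trained minus untrained): {experiment_gap:+.3f}")
```

Under these assumptions the observational comparison shows a sizable gap in favor of training even though training does nothing, because the trained group contains more motivated mothers. In the experiment the two groups have a similar mix of motivation, so the gap stays near zero.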

Example:

TEENAGE DRIVING SAFETY

Sometimes correct numbers are given incorrect interpretations, particularly when an apparent cause-and-effect relationship is actually produced by a hidden, or confounding, variable.

A California study of teenage drivers showed that the accident rates for males and females were 0.162 and 0.075, respectively. That is, over the period studied, male teenagers had roughly twice the accident rate of female teenagers. Are male teenagers worse drivers?

A third variable, number of miles driven, was taken into account. The results are shown below.

An insurance company is interested only in line 1 of the table. Since males have more accidents and hence cost the company more in claims, it may charge males higher premiums. However, an employer hiring a teenage driver for a delivery route may review line 3 of the table and decide that a male or a female driver would have the same likelihood of having an accident in the company car.
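The difference between the two readings comes down to which denominator is used. The sketch below takes the two accident rates quoted in the text and treats them as accidents per driver, which is an assumption. The mileage figures are HYPOTHETICAL, chosen only so that the per-mile rates come out equal; they are not the study's data, and the names used are illustrative.

```python
# Accident rates quoted in the text (assumed here to be per driver).
MALE_RATE_PER_DRIVER = 0.162
FEMALE_RATE_PER_DRIVER = 0.075

# HYPOTHETICAL average miles driven per driver, invented for illustration.
HYPOTHETICAL_MALE_MILES = 10_800
HYPOTHETICAL_FEMALE_MILES = 5_000

def rate_per_mile(rate_per_driver, miles_per_driver):
    """Convert accidents per driver into accidents per mile driven."""
    return rate_per_driver / miles_per_driver

# Comparison an insurer might make: accidents per driver.
per_driver_ratio = MALE_RATE_PER_DRIVER / FEMALE_RATE_PER_DRIVER

# Comparison an employer might make: accidents per mile driven.
per_mile_ratio = (
    rate_per_mile(MALE_RATE_PER_DRIVER, HYPOTHETICAL_MALE_MILES)
    / rate_per_mile(FEMALE_RATE_PER_DRIVER, HYPOTHETICAL_FEMALE_MILES)
)

print(f"Male-to-female ratio per driver: {per_driver_ratio:.2f}")
print(f"Male-to-female ratio per mile:   {per_mile_ratio:.2f}")
```

With these invented mileages, males have more than twice as many accidents per driver but the same number per mile. The per-driver gap would then reflect how much each group drives rather than how well.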

The relationship between the variables can be illustrated below:

Observational studies of the effect of one variable on another often fail because the explanatory variable is confounded with lurking variables. We will see that well-designed experiments take steps to defeat confounding. Because experiments allow us to pin down the effects of the specific variables that interest us, they are the preferred method of gaining knowledge in science, medicine, and industry. In some situations, it may not be possible to observe individuals directly or to perform an experiment. In other cases, it may be logistically difficult or simply inconvenient to obtain a sample or to impose a treatment. Simulations provide an alternative method for producing data in such circumstances. The fourth lesson in this unit introduces techniques for simulating experiments.

Statistical techniques for producing data open the door to formal statistical inference, which answers specific questions with a known degree of confidence. The later units of this course book are devoted to inference. We will see that careful design of data production is the most important prerequisite for trustworthy inference.