Description
Please discuss, elaborate, and reflect on the following from chapters 4 & 5. Listed below are the important topics that you must include in your discussion. Give examples and elaborate on the applications of each topic.
Chapter 6 topics and questions:
Explain the differences between univariate and bivariate distributions.
Compare correlation and Pearson's correlation.
Where do you apply correlation?
Explain spurious correlation.
Compare correlation and regression. Describe the situations where each can be applied.
Under what conditions will regression give reliable results?
12th Edition
Exploring Statistics
Tales of Distributions
Chris Spatz
Outcrop Publishers
Conway, Arkansas
Exploring Statistics: Tales of Distributions
12th Edition
Chris Spatz
Cover design: Grace Oxley
Answer Key: Jill Schmidlkofer
Webmaster & Ebook: Fingertek Web Design, Tina Haggard
Managers: Justin Murdock, Kevin Spatz
Online study guide available at
http://exploringstatistics.com/studyguide.php
Copyright © 2019 by Outcrop Publishers, LLC
All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any
means, including photocopying, recording, or other electronic or mechanical methods, without the prior written
permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other
noncommercial uses permitted by copyright law. For permission requests, contact info@outcroppublishers.com or
write to the publisher at the address below.
Outcrop Publishers
615 Davis Street
Conway, AR 72034
Email: info@outcroppublishers.com
Website: outcroppublishers.com
Library of Congress Control Number: [Applied for]
ISBN-13 (hardcover): 978-0-9963392-2-3
ISBN-13 (ebook): 978-0-9963392-3-0
ISBN-13 (study guide): 978-0-9963392-4-7
Examination copies are provided to academics and professionals to consider for adoption as a course textbook.
Examination copies may not be sold or transferred to a third party. If you adopt this textbook, please accept it as
your complimentary desk copy.
Ordering information:
Students and professors – visit exploringstatistics.com
Bookstores – email info@outcroppublishers.com
Photo Credits – Chapter 1
Karl Pearson – Courtesy of Wellcomeimages.org
Ronald A. Fisher – R.A. Fisher portrait, 0006973, Special Collections Research Center, North Carolina State
University Libraries, Raleigh, North Carolina
Jerzy Neyman – Paul R. Halmos Photograph Collection, e_ph 0223_01, Dolph Briscoe Center for American History,
The University of Texas at Austin
Jacob Cohen – New York University Archives, Records of the NYU Photo Bureau
Printed in the United States of America by Walsworth ®
234567
24 23 22 21 20
About The Author
Chris Spatz is at Hendrix College where he twice served as chair of
the Psychology Department. Dr. Spatz’s undergraduate education
was at Hendrix, and his PhD in experimental psychology is from
Tulane University in New Orleans. He subsequently completed
postdoctoral fellowships in animal behavior at the University of
California, Berkeley, and the University of Michigan. Before
returning to Hendrix to teach, Spatz held positions at The University
of the South and the University of Arkansas at Monticello.
Spatz served as a reviewer for the journal Teaching of Psychology
for more than 20 years. He co-authored a research methods textbook,
wrote several chapters for edited books, and was a section editor for the
Encyclopedia of Statistics in Behavioral Science.
In addition to writing and publishing, Dr. Spatz enjoys the outdoors,
especially canoeing, camping, and gardening. He swims several times
a week (mode = 3). Spatz has been an opponent of high textbook prices for years, and he is
happy to be part of a new wave of authors who provide high-quality textbooks to students at
affordable prices.
Dedication
With love and affection,
this textbook is dedicated to
Thea Siria Spatz, Ed.D., CHES
Brief Contents
Preface xiv
1 Introduction 1
2 Exploring Data: Frequency Distributions and Graphs 29
3 Exploring Data: Central Tendency 45
4 Exploring Data: Variability 59
5 Other Descriptive Statistics 77
6 Correlation and Regression 94
7 Theoretical Distributions Including the Normal Distribution 127
8 Samples, Sampling Distributions, and Confidence Intervals 150
9 Effect Size and NHST: One-Sample Designs 175
10 Effect Size, Confidence Intervals, and NHST:
Two-Sample Designs 200
11 Analysis of Variance: Independent Samples 231
12 Analysis of Variance: Repeated Measures 259
13 Analysis of Variance: Factorial Design 271
14 Chi Square Tests 303
15 More Nonparametric Tests 328
16 Choosing Tests and Writing Interpretations 356
Appendixes
A Getting Started 371
B Grouped Frequency Distributions and Central Tendency 376
C Tables 380
D Glossary of Words 401
E Glossary of Symbols 405
F Glossary of Formulas 407
G Answers to Problems 414
References 466
Index 472
Contents
Preface xiv
chapter 1 Introduction 1
Disciplines That Use Quantitative Data 5
What Do You Mean, “Statistics”? 6
Statistics: A Dynamic Discipline 8
Some Terminology 9
Problems and Answers 12
Scales of Measurement 13
Statistics and Experimental Design 16
Experimental Design Variables 17
Statistics and Philosophy 20
Statistics: Then and Now 21
How to Analyze a Data Set 22
Helpful Features of This Book 22
Computers, Calculators, and Pencils 24
Concluding Thoughts 25
Key Terms 27
Transition Passage to Descriptive Statistics 28
chapter 2 Exploring Data: Frequency Distributions
and Graphs 29
Simple Frequency Distributions 31
Grouped Frequency Distributions 33
Graphs of Frequency Distributions 35
Describing Distributions 39
The Line Graph 41
More on Graphics 42
A Moment to Reflect 43
Key Terms 44
chapter 3 Exploring Data: Central Tendency 45
Measures of Central Tendency 46
Finding Central Tendency of Simple Frequency Distributions 49
When to Use the Mean, Median, and Mode 52
Determining Skewness From the Mean and Median 54
The Weighted Mean 55
Estimating Answers 56
Key Terms 58
chapter 4 Exploring Data: Variability 59
Range 61
Interquartile Range 61
Standard Deviation 63
Standard Deviation as a Descriptive Index of Variability 64
ŝ as an Estimate of σ 69
Variance 73
Statistical Software Programs 74
Key Terms 76
chapter 5 Other Descriptive Statistics 77
Describing Individual Scores 78
Boxplots 82
Effect Size Index 86
The Descriptive Statistics Report 89
Key Terms 92
Transition Passage to Bivariate Statistics 93
chapter 6 Correlation and Regression 94
Bivariate Distributions 96
Positive Correlation 96
Negative Correlation 99
Zero Correlation 101
Correlation Coefficient 102
Scatterplots 106
Interpretations of r 106
Uses of r 110
Strong Relationships but Low Correlation Coefficients 112
Other Kinds of Correlation Coefficients 115
Linear Regression 116
The Regression Equation 117
Key Terms 124
What Would You Recommend? Chapters 2-6 125
Transition Passage to Inferential Statistics 126
chapter 7 Theoretical Distributions Including the
Normal Distribution 127
Probability 128
A Rectangular Distribution 129
A Binomial Distribution 130
Comparison of Theoretical and Empirical Distributions 131
The Normal Distribution 132
Comparison of Theoretical and Empirical Answers 146
Other Theoretical Distributions 146
Key Terms 147
Transition Passage to the Analysis of Data From
Experiments 149
chapter 8 Samples, Sampling Distributions, and
Confidence Intervals 150
Random Samples 152
Biased Samples 155
Research Samples 156
Sampling Distributions 157
Sampling Distribution of the Mean 157
Central Limit Theorem 159
Constructing a Sampling Distribution When σ Is Not Available 164
The t Distribution 165
Confidence Interval About a Population Mean 168
Categories of Inferential Statistics 172
Key Terms 173
Transition Passage to Null Hypothesis Significance
Testing 174
chapter 9 Effect Size and NHST: One-Sample Designs 175
Effect Size Index 176
The Logic of Null Hypothesis Significance Testing (NHST) 179
Using the t Distribution for Null Hypothesis Significance Testing 182
A Problem and the Accepted Solution 184
The One-Sample t Test 186
An Analysis of Possible Mistakes 188
The Meaning of p in p < .05 191
One-Tailed and Two-Tailed Tests 192
Other Sampling Distributions 195
Using the t Distribution to Test the Significance of a Correlation
Coefficient 195
t Distribution Background 197
Why .05? 198
Key Terms 199
chapter 10 Effect Size, Confidence
Intervals, and NHST: Two-Sample Designs 200
A Short Lesson on How to Design an Experiment 201
Two Designs: Paired Samples and Independent Samples 202
Degrees of Freedom 206
Paired-Samples Design 208
Independent-Samples Design 212
The NHST Approach 217
Statistical Significance and Importance 222
Reaching Correct Conclusions 222
Statistical Power 225
Key Terms 228
What Would You Recommend? Chapters 7-10 229
Transition Passage to More Complex Designs 230
chapter 11 Analysis of Variance: Independent Samples 231
Rationale of ANOVA 233
More New Terms 240
Sums of Squares 240
Mean Squares and Degrees of Freedom 245
Calculation and Interpretation of F Values Using the F Distribution 246
Schedules of Reinforcement – A Lesson in Persistence 248
Comparisons Among Means 250
Assumptions of the Analysis of Variance 254
Random Assignment 254
Effect Size Indexes and Power 255
Key Terms 258
chapter 12 Analysis of Variance: Repeated Measures 259
A Data Set 260
Repeated-Measures ANOVA: The Rationale 261
An Example Problem 262
Tukey HSD Tests 265
Type I and Type II Errors 266
Some Behind-the-Scenes Information About Repeated-Measures ANOVA 267
Key Terms 270
chapter 13 Analysis of Variance: Factorial Design 271
Factorial Design 272
Main Effects and Interaction 276
A Simple Example of a Factorial Design 282
Analysis of a 2 × 3 Design 291
Comparing Levels Within a Factor – Tukey HSD Tests 297
Effect Size Indexes for Factorial ANOVA 299
Restrictions and Limitations 299
Key Terms 301
Transition Passage to Nonparametric Statistics 302
chapter 14 Chi Square Tests 303
The Chi Square Distribution and the Chi Square Test 305
Chi Square as a Test of Independence 307
Shortcut for Any 2 × 2 Table 310
Effect Size Indexes for 2 × 2 Tables 310
Chi Square as a Test for Goodness of Fit 314
Chi Square With More Than One Degree of Freedom 316
Small Expected Frequencies 321
When You May Use Chi Square 324
Key Terms 327
chapter 15 More Nonparametric Tests 328
The Rationale of Nonparametric Tests 329
Comparison of Nonparametric to Parametric Tests 330
Mann-Whitney U Test 332
Wilcoxon Signed-Rank T Test 339
Wilcoxon-Wilcox Multiple-Comparisons Test 344
Correlation of Ranked Data 348
Key Terms 353
What Would You Recommend? Chapters 11-15 353
chapter 16 Choosing Tests and Writing Interpretations 356
A Review 356
My (Almost) Final Word 357
Future Steps 358
Choosing Tests and Writing Interpretations 359
Key Term 368
Appendixes
A Getting Started 371
B Grouped Frequency Distributions and Central Tendency 376
C Tables 380
D Glossary of Words 401
E Glossary of Symbols 405
F Glossary of Formulas 407
G Answers to Problems 414
References 466
Index 472
Preface
Even if our statistical appetite is far from keen, we all of us should like to know enough
to understand, or to withstand, the statistics that are constantly being thrown at us in
print or conversation – much of it pretty bad statistics. The only cure for bad statistics is
apparently more and better statistics. All in all, it certainly appears that the rudiments of
sound statistical sense are coming to be an essential of a liberal education.
– Robert Sessions Woodworth
Exploring Statistics: Tales of Distributions (12th edition) is a textbook for a one-term statistics
course in the social or behavioral sciences, education, or an allied health/nursing field.
Its focus is conceptualization, understanding, and interpretation, rather than computation.
Designed to be comprehensible and complete for students who take only one statistics course,
it also includes elements that prepare students for additional statistics courses. For example,
basic experimental design terms such as independent and dependent variables are explained
so students can be expected to write fairly complete interpretations of their analyses. In many
places, the student is invited to stop and think or do a thought exercise. Some problems ask
the student to decide which statistical technique is appropriate. In sum, this book’s approach is
in tune with instructors who emphasize critical thinking in their course.
This textbook has been remarkably successful for more than 40 years. Students,
professors, and reviewers have praised it. A common refrain is that the book has a
conversational, narrative style that is engaging, especially for a statistics text. Other features
that distinguish this textbook from others include the following:
• Data sets are approached with an attitude of exploration.
• Changes in statistical practice over the years are acknowledged, especially the recent
emphasis on effect sizes and confidence intervals.
• Criticism of null hypothesis significance testing (NHST) is explained.
• Examples and problems represent a variety of disciplines and everyday life.
• Most problems are based on actual studies rather than fabricated scenarios.
• Interpretation is emphasized throughout.
• Problems are interspersed within a chapter, not grouped at the end.
• Answers to all problems are included.
• Answers are comprehensively explained – over 50 pages of detail.
• A final chapter, Choosing Tests and Writing Interpretations, requires active responses to
comprehensive questions.
• Effect size indexes are treated as important descriptive statistics, not add-ons to NHST.
• Important words and phrases are defined in the margin when they first occur.
• Objectives, which open each chapter, serve first for orientation and later as review
items.
• Key Terms are identified for each chapter.
• Clues to the Future alert students to concepts that come up again.
• Error Detection boxes tell ways to detect mistakes or prevent them.
• Transition Passages alert students to a change in focus in chapters that follow.
• Comprehensive Problems encompass all (or most) of the techniques in a chapter.
• What Would You Recommend? problems require choices from among techniques in
several chapters.
For this 12th edition, I increased the emphasis on effect sizes and confidence intervals,
moving them to the front of Chapter 9 and Chapter 10. The controversy over NHST is
addressed more thoroughly. Power gets additional attention. Of course, examples and
problems based on contemporary data are updated, and there are a few new problems. In
addition, a helpful Study Guide to Accompany Exploring Statistics (12th edition) was written
by Lindsay Kennedy, Jennifer Peszka, and Leslie Zorwick, all of Hendrix College. The study
guide is available online at exploringstatistics.com.
Students who engage in this book and their course can expect to:
• Solve statistical problems
• Understand and explain statistical reasoning
• Choose appropriate statistical techniques for common research designs
• Write explanations that are congruent with statistical analyses
After many editions with a conventional publisher, Exploring Statistics: Tales of
Distributions is now published by Outcrop Publishers. As a result, the price of the print
edition is about one-fourth that of the 10th edition. Nevertheless, the authorship and quality of
earlier editions continue as before.
Acknowledgments
The person I acknowledge first is the person who most deserves acknowledgment. And for the
11th and 12th editions, she is especially deserving. This book and its accompanying publishing company,
Outcrop Publishers, would not exist except for Thea Siria Spatz, encourager, supporter, proofreader, and
cheer captain. This edition, like all its predecessors, is dedicated to her.
Kevin Spatz, manager of Outcrop Publishers, directed the distribution of the 11th edition,
advised, week by week, and suggested the cover design for the 12th edition. Justin Murdock now serves
as manager, continuing the tradition that Kevin started. Tina Haggard of Fingertek Web Design created
the book’s website, the text’s ebook, and the online study guide. She provided advice and solutions for
many problems. Thanks to Jill Schmidlkofer, who edited the extensive answer section again for this
edition. Emily Jones Spatz created new drawings for the text. I’m particularly grateful to Grace Oxley for
a cover design that conveys exploration, and to Liann Lech, who copyedited for clarity and consistency.
Walsworth® turned a messy collection of files into a handsome book – thank you, Nathan Stufflebean
and Dennis Paalhar. Others who were instrumental in this edition or its predecessors include Jon Arms,
Ellen Bruce, Mary Kay Dunaway, Bob Eslinger, James O. Johnston, Roger E. Kirk, Rob Nichols, Jennifer
Peszka, Mark Spatz, and Selene Spatz. I am especially grateful to Hendrix College and my Hendrix
colleagues for their support over many years, and in particular, to Lindsay Kennedy, Jennifer Peszka, and
Leslie Zorwick, who wrote the study guide that accompanies the text.
This textbook has benefited from perceptive reviews and significant suggestions by some 90
statistics teachers over the years. For this 12th edition, I particularly thank
Jessica Alexander, Centenary College
Lindsay Kennedy, Hendrix College
Se-Kang Kim, Fordham University
Roger E. Kirk, Baylor University
Kristi Lekies, The Ohio State University
Jennifer Peszka, Hendrix College
Robert Rosenthal, University of California, Riverside
I’ve always had a touch of the teacher in me – as an older sibling, a parent, a professor, and now
a grandfather. Education is a first-class task, in my opinion. I hope this book conveys my enthusiasm for
it. (By the way, if you are a student who is so thorough as to read even the acknowledgments, you should
know that I included phrases and examples in a number of places that reward your kind of diligence.)
If you find errors in this book, please report them to me at spatz@hendrix.edu. I will post
corrections at the book’s website: exploringstatistics.com.
CHAPTER 6
Correlation and Regression
OBJECTIVES FOR CHAPTER 6
After studying the text and working the problems in this chapter, you should be able to:
1. Explain the difference between univariate and bivariate distributions
2. Explain the concept of correlation and the difference between positive and
negative correlation
3. Draw scatterplots
4. Compute a Pearson product-moment correlation coefficient, r
5. Discuss the effect size index for r
6. Calculate and discuss common variance
7. Recognize correlation coefficients that indicate a reliable test
8. Discuss the relationship of correlation to cause and effect
9. Identify situations in which a Pearson r does not accurately reflect the degree of
relationship
10. Name and explain the elements of the regression equation
11. Compute regression coefficients and fit a regression line to a set of data
12. Interpret the appearance of a regression line
13. Predict scores on one variable based on scores from another variable
CORRELATION AND REGRESSION: My guess is that you have some understanding of the
concept of correlation and that you are not as comfortable with the word regression. Speculation
aside, correlation is simpler. Correlation is a statistical technique that describes the direction and
degree of relationship between two variables.
Regression is more complex. In this chapter, you will use the regression technique to
accomplish two tasks, drawing the line that best fits the data and predicting a person’s score on
one variable when you know that person’s score on a second, correlated variable. Regression has
other uses, but you will have to put those off until you study more advanced statistics.
The ideas identified by the terms correlation and regression were developed by Sir Francis
Galton in England well over 100 years ago. Galton was a genius (he could read at age 3) who had
an amazing variety of interests, many of which he actively pursued during his 89 years. He once
listed his occupation as “private gentleman,” which meant that he had inherited money and did not
have to work at a job. Lazy, however, he was not. Galton traveled widely and wrote prodigiously
(17 books and more than 200 articles).
From an early age, Galton was enchanted with counting and quantification. Among the
many things he tried to quantify were weather, individuals, beauty, characteristics of criminals,
boringness of lectures, and effectiveness of prayers. Often, he was successful.
Quantification
Concept that translating a phenomenon into numbers produces better understanding of the phenomenon.

For example, it was Galton who discovered that atmospheric pressure highs produce clockwise winds around a calm center, and his efforts at quantifying individuals resulted in an approach to classifying fingerprints that is in use today. Because it worked so well for him, Galton actively promoted the philosophy of quantification, the idea that you can understand a phenomenon much better if you translate its essential parts into numbers.
Many of the variables that interested Galton were in the field of biological heredity. His
classic example was the heights of fathers and their adult sons. Galton thought that psychological
characteristics, too, tended to run in families. Specifically, he thought that characteristics such as
genius, musical talent, sensory acuity, and quickness had a hereditary basis. Galton’s 1869 book,
Hereditary Genius, listed many families and their famous members, including Charles Darwin,
his cousin.1
Galton wasn’t satisfied with the list in that early book; he wanted to express relationships in
quantitative terms. To get quantitative data, he established an anthropometric (people-measuring)
laboratory at a health exposition fair and at a museum in London. Approximately 17,000 people
who stopped at a booth paid three pence to be measured. They left with self-knowledge; Galton
left with quantitative data and a pocketful of coins. For one summary of Galton’s results, see
Johnson et al. (1985).
Galton’s most important legacy is probably his invention of the concepts of correlation and
regression. The task of working out theory and mathematics, however, fell to Galton’s friend and
protégé, Karl Pearson, Professor of Applied Mathematics and Mechanics at University College in
London. Pearson’s 1896 product-moment correlation coefficient and other correlation coefficients
that he and his students developed were quickly adopted by researchers in many fields and are
widely used today in psychology, sociology, education, political science, the biological sciences,
and other areas.2
Finally, although Galton’s and Pearson’s fame is for their statistical concepts, their personal
quest was to develop recommendations that would improve the human condition. Making
recommendations required a better understanding of heredity and evolution, and they saw statistics
as the best way to arrive at this better understanding.
In 1889, Galton described how valuable statistics are (and also let us in on his emotional
feelings about statistics):
Some people hate the very name of statistics, but I find them full of beauty and
interest. . . . Their power of dealing with complicated phenomena is extraordinary. They
are the only tools by which an opening can be cut through the formidable thicket of
difficulties that bars the path of those who pursue the Science of Man.3
1 Galton and Darwin had the same famous grandfather, Erasmus Darwin, but not the same grandmother. For both the personal and intellectual relationships between the famous cousins, see Fancher (2009).
2 Pearson was the first person on our exploration tour in Chapter 1. He told us about chi square, another statistic he invented. Chi square is covered in Chapter 14.
3 For a short biography of Galton, I recommend Thomas (2005) or Waller (2001).
My plan in this chapter is for you to read about bivariate distributions (necessary for both
correlation and regression), learn to compute and interpret Pearson product-moment correlation
coefficients, and use the regression technique to draw a best fitting straight line and predict
outcomes.
Bivariate Distributions
Univariate distribution
Frequency distribution of one variable.

Bivariate distribution
Joint distribution of two variables; scores are paired.

In the chapters on central tendency and variability, you worked with one variable at a time (univariate distributions). Height, time, test scores, and errors all received your attention. If you look back at those problems, you’ll find a string of numbers under one heading (see, for example, Table 3.2). Compare those distributions with the one in Table 6.1. In Table 6.1, there are scores under the variable Extraversion and other scores under a second variable, Conscientiousness. The characteristic of the data in this table that makes them a bivariate distribution is that the scores on the two variables are paired. The 45 and the 65 go together; the 65 and the 35 go together. They are paired, of course, because the same sister made the two scores. As you will see, there are also other reasons for pairing scores. All in all, bivariate distributions are fairly common. A bivariate distribution may show positive correlation, negative correlation, or zero correlation.
T A B L E 6 . 1 A bivariate distribution of scores on two personality tests taken by four sisters

        Extraversion (X variable)    Conscientiousness (Y variable)
Meg     45                           65
Beth    35                           55
Jo      55                           45
Amy     65                           35
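For readers who like to see the data structure, here is a minimal Python sketch (mine, not the text's) of Table 6.1. The point it makes is the one in the paragraph above: what makes the distribution bivariate is the pairing itself.

```python
# Each tuple is one sister's paired scores: (Extraversion, Conscientiousness).
# The pairing, not the two columns by themselves, makes the data bivariate.
sisters = {
    "Meg":  (45, 65),
    "Beth": (35, 55),
    "Jo":   (55, 45),
    "Amy":  (65, 35),
}

for name, (x, y) in sisters.items():
    print(f"{name}: X = {x}, Y = {y}")
```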
Positive Correlation
In the case of a positive correlation between two variables, high numbers on one variable tend
to be associated with high numbers on the other variable, and low numbers on one variable with
low numbers on the other. For example, tall fathers tend to have sons who grow up to be tall men.
Short fathers tend to have sons who grow up to be short men.
In the case of the manufactured data in Table 6.2, fathers have sons who grow up to be
exactly their height. The data in Table 6.2 represent an extreme case in which the correlation
coefficient is 1.00. A correlation coefficient of 1.00 is referred to as perfect correlation. (Table
6.2 is ridiculous, of course; mothers and environments have their say, too.)
T A B L E 6 . 2 Hypothetical data on two variables: heights of fathers and heights of their sons*

Father               Height (in.) X    Son           Height (in.) Y
Jacob Smith          74                Jake, Jr.     74
Michael Johnson      72                Mike, Jr.     72
Matthew Williams     70                Matt, Jr.     70
Joshua Brown         68                Josh, Jr.     68
Christopher Jones    66                Chris, Jr.    66
Nicholas Miller      64                Nick, Jr.     64

* The first names are, in order, the six most common for baby boys born in 2000 in the United States. Rounding out the top 10 are Tyler, Brandon, Daniel, and Austin (www.ssa.gov/cgi-bin/popularnames.cgi). The surnames are the six most common in the 2000 U.S. census. Completing the top 10 are Davis, Garcia, Rodriguez, and Wilson (www.census.gov/genealogy/www/data/2000surnames/index.html).
Figure 6.1 is a graph of the bivariate data in Table 6.2. One variable (height of
father) is plotted on the x-axis; the other variable (height of son) is on the y-axis. Each
data point in the graph represents a pair of scores, the height of a father and the height
of his son. The points in the graph constitute a scatterplot. Incidentally, it was when
Galton cast his data as a scatterplot graph that the idea of a co-relationship began to
become clear to him.
Scatterplot
Graph of the scores of a bivariate frequency distribution.
F I G U R E 6 . 1 A scatterplot and regression line for a perfect positive correlation (r = 1.00)
Regression line
A line of best fit for a scatterplot.

The line that runs through the points in Figure 6.1 (and in Figures 6.2, 6.3, and 6.5) is called a regression line. It is a “line of best fit.” When there is perfect correlation (r = 1.00), all points fall exactly on the regression line. It is from regression that the correlation coefficient gets its symbol, r.
Let’s modify the data in Table 6.2 a bit. If every son grew to be exactly 2 inches taller
than his father (or 1 inch or 6 inches, or even 5 inches shorter), the correlation would still be
perfect, and the coefficient would still be 1.00. Figure 6.2 demonstrates this point: You can have
a perfect correlation even if the paired numbers aren’t the same. The only requirement for perfect
correlation is that the differences between pairs of scores all be the same. If they are the same,
then all the points of a scatterplot lie on the regression line, correlation is perfect, and an exact
prediction can be made.
F I G U R E 6 . 2 A scatterplot and regression line with every son
2 inches taller than his father (r = 1.00)
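A quick way to check this claim is with NumPy's built-in corrcoef function. This sketch (my illustration, not a worked example from the text) adds a constant 2 inches to each father's height from Table 6.2 and shows that r is still 1.00.

```python
import numpy as np

fathers = np.array([74, 72, 70, 68, 66, 64])  # heights from Table 6.2
sons = fathers + 2                            # every son exactly 2 inches taller

r = np.corrcoef(fathers, sons)[0, 1]
print(round(r, 2))  # 1.0 -- a constant difference preserves perfect correlation
```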
Of course, people cannot predict their sons’ heights precisely. The correlation is not perfect,
and the points do not all fall on the regression line. As Galton found, however, there is a positive
relationship; the correlation coefficient is about .50. The points do tend to cluster around the
regression line.
In your academic career, you have taken an untold number of aptitude and achievement
tests. For several of these tests, separate scores were computed for reading/writing aptitude and
mathematics aptitude. Here is a question for you. In the general case, what is the relationship
between reading/writing aptitude and math aptitude? That is, are people who are good in one also
good in the other, or are they poor in the other, or is there no relationship? Stop for a moment and
compose an answer.
As you may have suspected, the next graph shows data that begin to answer the question.
Figure 6.3 shows the scores of eight high school seniors who took the SAT college admissions test.
The scores are for the Evidence-Based Reading and Writing section of the SAT (SAT EBRW) and
the mathematics section of the test (SAT Math).4 As you can see in Figure 6.3, there is a positive
relationship, though not a perfect one. As the reading/writing scores vary upward, mathematics
scores tend to vary upward. If the score on one is high, the other score tends to be high, and if one
is low, the other tends to be low. Later in this chapter, you’ll learn to calculate the precise degree
of relationship, and because there is a relationship, you can use a regression equation to predict
students’ math scores if you know their reading/writing scores.
F I G U R E 6 . 3 Scatterplot and regression line for SAT EBRW and
SAT Math scores for eight high school seniors (r = .72)
(Examining Figure 6.3, you could complain that it is not composed well; the data points are
bunched up in one corner, which leaves three-fourths of the space bare. It looks ungainly but I
was in a dilemma, which I’ll explain at the end of the chapter.)
Negative Correlation
Here is a scenario that leads to another bivariate distribution. Recall a time when you sat for a college
entrance examination (SAT and ACT are the two most common ones). How many others took the
exam at the same testing center that day? Next, imagine your motivation that day. Was it high, just
average, or low? Knowing the number of fellow test takers and your motivation, you have one point
on a scatterplot. Add a bunch more folks like yourself and you have a bivariate distribution.
4 Data for the current version of the SAT test were not available for this edition of Exploring Statistics. This example mirrors data from the pre-2017 test, but the terminology is that of the current version of the SAT. All SAT data are derived from 2008 College Board Seniors. Copyright © 2008, the College Board. collegeboard.com. Reproduced with permission.
Do you think that the relationship between the two variables is positive like that of the SAT
EBRW scores and SAT Math scores, or that there is no relationship, or that the relationship is
negative?
As you probably expected from this section heading, the answer to the preceding question
is negative. Figure 6.4 is a scatterplot of SAT scores and density of test takers (state averages for
50 U.S. states). High SAT scores are associated with low densities of test takers, and low SAT
scores are associated with high densities of test takers.5 This phenomenon is an illustration of the
N-Effect, the finding that an increase in number of competitors goes with a decrease in competitive
motivation and, thus, test scores (Garcia & Tor, 2009). The cartoon illustrates the N-Effect.
F I G U R E 6 . 4 Scatterplot of state SAT averages and density of test takers.
Courtesy of Stephen Garcia
[Cartoon: “Being in the Top 20%”]
5 The correlation coefficient is –.68. When Garcia and Tor (2009) statistically removed the effects of confounding variables such as state percentage of high school students who took the SAT, state population density, and other variables, the correlation coefficient was –.35.
When a correlation is negative, increases in one variable are accompanied by decreases in the
other variable (an inverse relationship). With negative correlation, the regression line goes from
the upper left corner of the graph to the lower right corner. As you may recall from algebra, such
lines have a negative slope.
Some other examples of variables with negative correlation are highway driving speed and
gas mileage, daily rain and daily sunshine, and grouchiness and friendships. As was the case
with perfect positive correlation, there is such a thing as perfect negative correlation (r = –1.00).
In cases of perfect negative correlation also, all the data points of the scatterplot fall on the
regression line.
Although some correlation coefficients are positive and some are negative, one is not more
valuable than the other. The algebraic sign simply tells you the direction of the relationship
(which is important when you are describing how the variables are related). The absolute size of r,
however, tells you the degree of the relationship. A strong relationship (either positive or negative)
is usually more informative than a weaker one.
Zero Correlation
A zero correlation means there is no linear relationship between two variables. High and low
scores on the two variables are not associated in any predictable manner. The 50 American states
differ in personal wealth; these differences are expressed as per capita income, which ranged in
2016 from $41,099 (Mississippi) to $75,923 (Connecticut). The states also differ in vehicle theft
reports per capita. You might think that these two variables would be related, but the correlation
coefficient between per capita income and vehicle thefts is .04. There is no relationship.
Figure 6.5 shows a scatterplot that produces a correlation coefficient of zero. When r = 0, the regression line is a horizontal line at a height of Ȳ. This makes sense; if r = 0, then your best estimate of Y for any value of X is Ȳ.
F I G U R E 6 . 5 Scatterplot and regression line for a zero correlation
clue to the future
Correlation comes up again in future chapters. The correlation coefficient between two
variables whose scores are ranks is explained in Chapter 15. In part of Chapter 10 and in
all of Chapter 12, correlation ideas are involved.
PROBLEMS
6.1. What is the primary characteristic of a bivariate distribution?
6.2. What is meant by the statement “Variable X and variable Y are correlated”?
6.3. Tell how X and Y vary in a positive correlation. Tell how they vary in a negative correlation.
6.4. Can the following variables be correlated, and, if so, would you expect the correlation to
be positive or negative?
a. Height and weight of adults
b. Weight of first graders and weight of fifth graders
c. Average daily temperature and cost of heating a home
d. IQ and reading comprehension
e. The first and second quiz scores of students in two sections of General Biology
f. The Section 1 scores and the Section 2 scores of students in General Biology on the
first quiz
Correlation Coefficient
The correlation coefficient is used in a wide variety of fields. It is so popular
because it provides a quantitative answer to a very common question: “What is
the degree of relationship between _________ and ________?” Supplying names
of variables to go in the blanks is easy. Try it! Gathering data, however, takes
some work.
Correlation Coefficient
Descriptive statistic that expresses the direction and degree of relationship between two variables.

The definition formula for the correlation coefficient is

$$ r = \frac{\sum z_X z_Y}{N} $$
where r = Pearson product-moment correlation coefficient
zX = a z score for variable X
zY = the corresponding z score for variable Y
N = number of pairs of scores
Think through the z-score formula to discover what happens when high scores on one variable
are paired with high scores on the other variable (positive correlation). The large positive z scores
are paired and the large negative z scores are paired. In each case, the multiplication produces
large positive products, which, when added together, make a large positive numerator. The result
is a large positive value of r. Think through for yourself what happens in the formula when there
is a negative correlation and also when there is a zero correlation.
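Here is a small Python sketch of the z-score definition formula, applied to the four sister pairs from Table 6.1 (my example, not a worked problem from the text). Note that the z scores are built from S, the standard deviation that divides by N, which is what Python's statistics.pstdev computes.

```python
import statistics as st

x = [45, 35, 55, 65]  # Extraversion scores (Table 6.1)
y = [65, 55, 45, 35]  # Conscientiousness scores, paired by sister

n = len(x)
zx = [(v - st.mean(x)) / st.pstdev(x) for v in x]  # pstdev = S, divides by N
zy = [(v - st.mean(y)) / st.pstdev(y) for v in y]

# Definition formula: sum the paired z-score products, then divide by N
r = sum(a * b for a, b in zip(zx, zy)) / n
print(round(r, 2))  # -0.80: large positive z's here pair with large negative z's
```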
Although the z-score formula is an excellent way to understand how a bivariate distribution
produces r, other formulas are better for calculation. I’ll explain one formula in some detail and
then mention an equivalent formula.
One formula for calculating r is

$$ r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}} $$

where X and Y are paired observations
XY = product of each X value multiplied by its paired Y value
N = number of pairs of observations
The expression ΣXY is called the “sum of the cross-products.” All formulas for Pearson r include ΣXY. To find ΣXY, multiply each X value by its paired Y value and then sum those products. Note that one term in the formula has a meaning different from that in previous chapters; N is the number of pairs of scores.
error detection
ΣXY is not (ΣX)(ΣY). To find ΣXY, do as many multiplications as you have pairs. Afterward, sum the products you calculated.
As for which variable to call X and which to call Y, it doesn’t make any difference for
correlation coefficients. With regression, however, it may make a big difference. More on that
later in the chapter.
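The computation formula is easy to mirror in code. This sketch (again using the Table 6.1 sister pairs, my choice of data) also demonstrates the Error Detection point: ΣXY comes from multiplying first and summing afterward, and it is not (ΣX)(ΣY).

```python
import math

x = [45, 35, 55, 65]  # Extraversion (Table 6.1)
y = [65, 55, 45, 35]  # Conscientiousness

n = len(x)
sum_xy = sum(a * b for a, b in zip(x, y))  # multiply each pair, THEN sum: 9600
print(sum_xy == sum(x) * sum(y))           # False -- (ΣX)(ΣY) is 40,000, not ΣXY

numerator = n * sum_xy - sum(x) * sum(y)
denominator = math.sqrt((n * sum(v * v for v in x) - sum(x) ** 2)
                        * (n * sum(v * v for v in y) - sum(y) ** 2))
print(round(numerator / denominator, 2))   # -0.80, same r as the z-score formula
```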
Table 6.3 shows how to calculate r for the SAT EBRW and SAT Math data used for Figure
6.3. I selected these numbers so they would produce the correlation coefficient reported by the
College Board Seniors report. Work through the numbers in Table 6.3, paying careful attention to
ΣXY. Table 6.3 also includes the calculation of means and standard deviations, which are helpful
for interpretation and necessary for the regression equation that comes later.
Many calculators have a built-in function for r. When you enter X and Y values and press the
r key, the coefficient is displayed. If you have such a calculator, I recommend that you use this
labor-saving device after you have used the computation formulas a number of times. Calculating
the components (such as ΣXY) leads to an understanding of what goes into r.
If you calculate sums that reach above the millions, your calculator may switch into scientific
notation. A display such as 3.234234 08 might appear. To convert this number back to familiar
notation, move the decimal point to the right the number of places indicated by the number
on the right. Thus, 3.234234 08 becomes 323,423,400. The display 1.23456789 12 becomes
1,234,567,890,000.
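If you prefer to let Python do the conversion, f-string formatting handles scientific notation directly (a side note of mine, not from the text):

```python
print(f"{3.234234e08:,.0f}")    # 323,423,400
print(f"{1.23456789e12:,.0f}")  # 1,234,567,890,000
```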
Table 6.4 shows the IBM SPSS output for a Pearson correlation of the two SAT variables.
Again, r = .72. The designation Sig. (2-tailed) means “the significance level for a two-tailed test,”
a concept explained in Chapter 9.
T A B L E 6 . 3 Calculation of r between SAT EBRW scores and SAT Math scores
T A B L E 6 . 4 IBM SPSS output of Pearson r for SAT EBRW scores and SAT Math scores
Now for interpretation. What does a correlation of .72 between SAT EBRW and SAT Math
scores mean? It means that they are directly related (positive correlation) and the relationship
is strong. Students who have high SAT EBRW scores can be expected to have high SAT Math
scores. Students with low scores on one test are likely to have low scores on the other test, although
neither expectation will be fulfilled in every individual case.
Note that if the correlation coefficient had been near zero, the interpretation would have been
that the two abilities are unrelated. If the correlation had been sizeable but negative, say –.72, you
could say, “good in one, poor in the other.”
What about the other descriptive statistics for the SAT data in Table 6.3? The means are about
500, which is what the test maker strives for, and the standard deviations are in the neighborhood
of 100. These two statistics are particularly helpful as context if you are asked to interpret an
individual score.
Correlation coefficients should be based on an “adequate” number of observations. The traditional, rule-of-thumb definition of adequate is 30-50. My SAT example, however, had an N of 8, and most of the problems in the text have fewer than 30 pairs. Small-N problems allow you to spend your time on interpretation and understanding rather than “number crunching.” Chapter 9 provides the reasoning behind the admonition that N be adequate.
The correlation coefficient, r, is a sample statistic. The corresponding population parameter is symbolized by ρ (the Greek letter rho). The formula for ρ is the same as the formula for r. One of the “rules” of statistical names is that parameters are symbolized with Greek letters (σ, μ, ρ) and statistics are symbolized with Latin letters (X̄, ŝ, r). Like many rules, there are exceptions to this one.
clue to the future
Quantitative scores are required for a Pearson r. If the scores are ranks, the appropriate correlation is the Spearman coefficient, rs, which is explained in Chapter 15. Also, ρ, the population correlation coefficient, returns in Chapter 9, where the reliability of r is addressed.
Here is that second formula for r that I promised. It requires you to calculate means and standard deviations first and then use them to find r. Please note that in this formula, the standard deviations are S and not ŝ.

$$ r = \frac{\frac{\sum XY}{N} - (\bar{X})(\bar{Y})}{(S_X)(S_Y)} $$
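In code, the distinction between S and ŝ is the divisor: Python's statistics.pstdev divides by N (that is S), while statistics.stdev divides by N − 1 (that is ŝ). A sketch of the mean-and-standard-deviation formula, using the same sister data as before:

```python
import statistics as st

x = [45, 35, 55, 65]
y = [65, 55, 45, 35]
n = len(x)

mean_of_products = sum(a * b for a, b in zip(x, y)) / n  # ΣXY / N
r = (mean_of_products - st.mean(x) * st.mean(y)) / (st.pstdev(x) * st.pstdev(y))
print(round(r, 2))  # -0.80 -- all three formulas agree
```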
error detection
The Pearson correlation coefficient ranges between –1.00 and +1.00. Values less than
–1.00 or greater than +1.00 indicate that you have made an error.
PROBLEMS
*6.5. This problem is based on data published in 1903 by Karl Pearson and Alice Lee. In
the original article, 1376 pairs of father–daughter heights were analyzed. The scores
here produce the same means and the same correlation coefficient that Pearson and Lee
obtained. For these data, draw a scatterplot and calculate r. For extra education, use the
other formula and calculate r again.
Father’s height, X (in.)
69 68 67 65 63 73
Daughter’s height, Y (in.)
62 65 64 63 58 63
*6.6. The Wechsler Adult Intelligence Scale (WAIS) is an individually administered test that
takes more than an hour to give. The Wonderlic Personnel Test can be given to groups
of any size in 15 minutes. X and Y represent scores on the two tests. Summary statistics
from a representative sample of 21 adults were
ΣX = 2205, ΣY = 2163, ΣX² = 235,800, ΣY² = 227,200, ΣXY = 231,100.
Compute r and write an interpretation about using the Wonderlic rather than the WAIS.
*6.7. Is the relationship between stress and infectious disease a strong one or a weak one?
Summary values that will produce a correlation coefficient similar to that found by
Cohen and Williamson (1991) are as follows:
ΣX = 190, ΣY = 444, ΣX² = 3940, ΣY² = 20,096, ΣXY = 8524, N = 10.
Calculate r. (The answer shows both formulas for r.)
Scatterplots
You already know something about scatterplots: what their elements are and what they look like when r = 1.00, r = .00, and r = –1.00. In this section, I illustrate some intermediate cases and
reiterate my philosophy about the value of pictures.
Figure 6.6 shows scatterplots of data with positive correlation coefficients (.20, .40, .80, .90)
and negative correlation coefficients (–.60, –.95). If you draw an envelope around the points in
a scatterplot, the picture becomes clearer. The thinner the envelope, the larger the correlation. To
say this in more mathematical language, the closer the points are to the regression line, the greater
the correlation coefficient.
Pictures help you understand, and scatterplots are easy to construct. Although the plots require
some time, the benefits are worth it. Peden (2001) constructed four data sets that all produce
a correlation coefficient of .82. Scatterplots of the data, however, show four different patterns, each requiring a different interpretation. So, this is a paragraph that encourages you to construct scatterplots: pictures promote understanding.

F I G U R E 6 . 6 Scatterplots of data in which r = .20, .40, –.60, .80, .90, and –.95
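If you have Python available, a scatterplot with its regression line takes only a few lines of matplotlib. This sketch (my illustration; any paired data will do) uses the father–daughter heights from Problem 6.5.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([69, 68, 67, 65, 63, 73])  # father heights, Problem 6.5
y = np.array([62, 65, 64, 63, 58, 63])  # daughter heights

slope, intercept = np.polyfit(x, y, 1)  # least-squares line of best fit
plt.scatter(x, y)
plt.plot(x, slope * x + intercept)
plt.xlabel("Father's height (in.)")
plt.ylabel("Daughter's height (in.)")
plt.show()
```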
Interpretations of r
The basic interpretation of r is probably familiar to you at this point. A correlation coefficient
shows the direction and degree of linear relationship between two variables of a bivariate distribution. Fortunately, additional information about the relationship can be obtained from a
correlation coefficient. Over the next few pages, I’ll cover some of this additional information.
However, a warning is in order: the interpretation of r can be a tricky business. I’ll alert you to
some of the common errors.
Effect Size Index for r
What qualifies as a large correlation coefficient? What is small? You will remember that you dealt
with similar questions in Chapter 5 when you studied the effect size index. In that situation, you
had two sample means that were different. The question was, Is this a big difference? Jacob Cohen
(1969) proposed a formula for d and guidelines of small (d = 0.20), medium (d = 0.50), and large
(d = 0.80). The formula and guidelines have been widely adopted.
In a similar way, Cohen sought an effect size index for correlation coefficients. You will be
delighted with Cohen’s formula for the effect size index for r; it is r itself. The remaining question
is, What is small, medium, and large? Cohen proposed that small = .10, medium = .30, and large
= .50. This proposal met resistance.
The problem is that correlations are used in such a variety of situations. For example, r is used
to measure the reliability of a multi-item test, assess the clinical significance of a medical drug,
and determine if a variable is one of several factors that influence an outcome. An r value of .50
has quite different meanings in these three examples.
One solution is to take an empirical approach; just see what rs are reported in the literature.
Hemphill (2003) did this. When thousands of correlation coefficients from hundreds of behavioral
science studies were separated into thirds, the results were
Lower third: < .20
Middle third: .20 to .30
Upper third: > .30
On the issue of adjectives for correlation coefficients, there is no simple rule of thumb. The
proper adjective for a particular r depends on the kind of research being discussed. For example,
in the section on Reliability of Tests, r < .80 is described as inadequate. But for medical treatments
and drugs, correlations such as .05 and .10 can be evidence of a positive effect.
Coefficient of Determination
The correlation coefficient is the basis of the coefficient of determination, which tells the proportion
of variance that two variables in a bivariate distribution have in common. The coefficient of
determination is calculated by squaring r; it is always a positive value between 0 and 1:
Coefficient of determination = r²
Coefficient of determination
Squared correlation coefficient, an estimate of common variance.

Look back at Table 6.2, the heights that produced r = 1.00. There is variation among the fathers’ heights as well as among the sons’ heights. How much of the variation among the sons’ heights is associated with the variation in the fathers’ heights? All of it! That is, the variation among the sons’ heights (going from 74 to 72 to 70 and so on) exactly matches the variation seen in their fathers’ heights (74 to 72 to 70 and so on). In the same way, the variation among the sons’ heights in Figure 6.2 (76 on down) is the same variation as that among their shorter fathers’ heights (74 on down). For Table 6.2 and Figure 6.2, r = 1.00 and r² = 1.00.

Now look at Table 6.3, the SAT EBRW and SAT Math scores. There is variation among the SAT EBRW scores as well as among the SAT Math scores. How much of the variation among the SAT Math scores is associated with the variation among the SAT EBRW scores? Some of it. That is, the variation among the SAT Math scores (350, 500, 400, and so on) is only partly reflected in the variation among the SAT EBRW scores (350, 350, 400, and so on). The proportion of variance in the SAT Math scores that is associated with the variance in the SAT EBRW scores is r². In this case, (.72)² = .52.
What a coefficient of determination of .52 tells you is that 52% of the variance in the two sets of scores is common variance. However, 48% of the variance is independent variance – that is, variance in one test that is not associated with variance in the other test.
Think for a moment about the many factors that influence SAT EBRW scores and SAT Math scores. Some factors influence both scores – factors such as motivation, mental sharpness on test day, and, of course, the big one: general intellectual ability. Other factors influence one test but not the other – factors such as anxiety about math tests, chance successes and chance errors, and, of course, the big ones: specific reading/writing knowledge and specific math knowledge.
Here is another example. The correlation of academic aptitude test scores with first-term college grade point averages (GPAs) is about .50. The coefficient of determination is .25. This means that of all that variation in GPAs (from flunking out to straight As), 25% is associated with aptitude scores. The rest of the variance (75%) is related to other factors. Examples of other factors that influence GPA, for good or for ill, include health, roommates, new relationships, and financial situation. Academic aptitude tests cannot predict the variation that these factors produce.
Common variance is often illustrated with two overlapping circles, each of which represents
the total variance of one variable. The overlapping portion is the amount of common variance.
The left half of Figure 6.7 shows overlapping circles for the GPA–college aptitude test scores, and the right half shows the SAT EBRW–SAT Math data.
F I G U R E 6 . 7 Illustrations of common variance for r = .50 and r = .72
Note what a big difference there is between a correlation of .72 and one of .50 when they are
interpreted using the common variance terminology. Although .72 and .50 seem fairly close, an r
of .72 predicts more than twice the amount of variance that an r of .50 predicts: 52% to 25%. By
the way, common variance is the way professional statisticians interpret correlation coefficients.
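The arithmetic behind Figure 6.7 is a one-liner; this quick check (mine, not the book's) shows why .72 predicts more than twice the variance that .50 does:

```python
for r in (0.50, 0.72):
    print(f"r = {r}: {r ** 2:.0%} common variance")
# r = 0.5: 25% common variance
# r = 0.72: 52% common variance
```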
Uses of r
Reliability of Tests

Reliability
Dependability or consistency of a measure.

Correlation coefficients are used to assess the reliability of measuring devices such as tests, questionnaires, and instruments. Reliability refers to consistency. Devices that are reliable produce consistent scores that are not subject to chance fluctuations.
Think about measuring a number of individuals and then measuring
them a second time. If the measuring device is not influenced by chance, you get the same
numbers both times. If the second measurement is always exactly the same as the first, it is
easy to conclude that the measuring device is perfectly reliable – that chance does not influence
the score you get. However, if the measurements are not exactly the same, you experience
uncertainty. Fortunately, a correlation coefficient between the test and the retest scores gives
you the degree of agreement. High correlation coefficients mean lots of agreement and therefore
high reliability; low coefficients mean lots of disagreement and therefore low reliability. But
what size r indicates reliability? The rule of thumb for social science measurements is that an r
of .80 or greater indicates adequate reliability.
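Test–retest reliability is simply a Pearson r between two administrations of the same measure. A minimal sketch with hypothetical scores (the numbers below are invented for illustration):

```python
import numpy as np

test = np.array([12, 15, 9, 20, 14, 18])     # hypothetical first administration
retest = np.array([13, 14, 10, 19, 15, 17])  # same six people, tested again

r = np.corrcoef(test, retest)[0, 1]
print(f"test-retest reliability: r = {r:.2f}")  # about .98, above the .80 rule of thumb
```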
Here is an example from Galton’s data. When the heights of 435 adults were measured twice,
the correlation was .98. It is not surprising that Galton’s method of measuring height was very
reliable. The correlation, however, for “highest audible tone,” a measure of pitch perception, was
only .28 for 349 participants whom Galton tested a second time within a year (Johnson et al.,
1985). Two interpretations are possible. Either people’s ability to hear high sounds changes during
a year, or the test was not reliable. In Galton’s case, the test was not reliable. Perhaps the test
environment was not as quiet from one time to the next, or the instruments were not calibrated the
same for both tests.
This section explains the reliability of a measuring instrument, which involves assessing the
instrument twice. The reliability of a correlation coefficient between two different variables is a
different matter. The .80 rule of thumb does not apply, a topic that will be addressed in Chapter 9.
To Establish Causation – NOT!
A high correlation coefficient does not give you the kind of evidence that allows you to make
cause-and-effect statements. Therefore, don’t do it. Ever.
Jumping to a cause-and-effect conclusion is a cognitively easy leap for humans. For example,
Shedler and Block (1990) found that among a sample of 18-year-olds whose marijuana use ranged
from abstinence to once a month, there was a positive correlation between use and psychological
health. Is this evidence that occasional drug use promotes psychological health?
Because Shedler and Block had followed their participants from age 3 on, they knew about
a third variable, the quality of the parenting that the 18-year-olds had received. Not surprisingly,
parents who were responsive, accepting, and patient and who valued originality had children
who were psychologically healthy. In addition, these same children as 18-year-olds had used marijuana on occasion. Thus, two variables – drug use and parenting style – were each correlated
with psychological health. Shedler and Block concluded that psychological health and adolescent
drug use were both traceable to quality of parenting. (This research also included a sample of
frequent users, who were not psychologically healthy and who had been raised with a parenting
style not characterized by the adjectives above.)
Of course, if you have a sizable correlation coefficient, it could be the result of a cause-and-effect relationship between the two variables. For example, early statements about cigarette
smoking causing lung cancer were based on simple correlational data. Persons with lung cancer
were often heavy smokers. In addition, comparisons between countries indicated a relationship
(see Problem 6.13). However, as careful thinkers – and also the cigarette companies – pointed
out, both cancer and smoking might be caused by a third variable; stress was often suggested.
That is, stress caused cancer and stress also caused people to smoke. Thus, cancer rates and
smoking rates were related (a high correlation), but one did not cause the other. Both were caused
by a third variable. What was required to establish the cause-and-effect relationship was data
from controlled experiments, not correlational data. Experimental data, complete with control
groups, established the cause-and-effect relationship between cigarette smoking and lung cancer.
(Controlled experiments are discussed in Chapter 10.)
To summarize this section using the language of logic: A sizable correlation is a necessary but
not a sufficient condition for establishing causality.
PROBLEMS
6.8. Estimate the correlation coefficients for these scatterplots.
6.9. For the two measures of intelligence in Problem 6.6, you found a correlation of .92. What
is the coefficient of determination, and what does it mean?
6.10. In Problem 6.7, you found that the correlation coefficient between stress and infectious
disease was .25. Calculate the coefficient of determination and write an interpretation.
6.11. Examine the following summary statistics (which you have seen before). Can you
determine a correlation coefficient? Explain your reasoning.
                 Height of women (in.)   Height of men (in.)
ΣX                      3255                    3500
ΣX²                   212,291                 245,470
N = 50 pairs
6.12. What percent of variance in common do two variables have if their correlation is .10?
What if the correlation is quadrupled to .40?
6.13. For each of 11 countries, the accompanying table gives the cigarette consumption per
capita in 1930 and the male death rate from lung cancer 20 years later in 1950 (Doll,
1955; reprinted in Tufte, 2001). Calculate a Pearson r and write a statement telling what
the data show.
Country          Per capita cigarette consumption   Male death rate (per million)
Iceland                        217                               59
Norway                         250                               91
Sweden                         308                              113
Denmark                        370                              167
Australia                      455                              172
Holland                        458                              243
Canada                         505                              150
Switzerland                    542                              250
Finland                       1112                              352
Great Britain                 1147                              467
United States                 1283                              191
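If you compute by machine, here is a minimal Python sketch (the language is my choice, not the book's) that checks a hand calculation for Problem 6.13 by applying the raw-score Pearson formula to the values transcribed from the table above:

```python
# Check of Problem 6.13: Pearson r from the raw-score formula.
# Data transcribed from the table above (Doll, 1955).
cigarettes = [217, 250, 308, 370, 455, 458, 505, 542, 1112, 1147, 1283]
deaths = [59, 91, 113, 167, 172, 243, 150, 250, 352, 467, 191]

n = len(cigarettes)
sx, sy = sum(cigarettes), sum(deaths)
sxx = sum(x * x for x in cigarettes)
syy = sum(y * y for y in deaths)
sxy = sum(x * y for x, y in zip(cigarettes, deaths))

r = (n * sxy - sx * sy) / ((n * sxx - sx**2) ** 0.5 * (n * syy - sy**2) ** 0.5)
print(round(r, 2))  # a positive correlation between consumption and death rate
```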
6.14. Interpret each of these statements.
a. The correlation between vocational-interest scores at age 20 and at age 40 for 150
participants was .70.
b. A correlation of .86 between intelligence test scores of identical twins raised together
c. A correlation of –.30 between IQ and family size
d. r = .22 between height and IQ for 20-year-old men
e. r = –.83 between income level and probability of diagnosis of schizophrenia
Strong Relationships but Low Correlation Coefficients
One good thing about understanding something is that you come to know what’s going on beneath
the surface. Knowing the inner workings, you can judge whether the surface appearance is to be
trusted or not. You are about to learn two of the “inner workings” of correlation. These will help
you evaluate the meaning of low correlation coefficients. A small correlation coefficient does not
always mean there is no relationship between two variables. Correlations that do not reflect the
true degree of the relationship are said to be spuriously low or spuriously high.
Nonlinearity
For r to be a meaningful statistic, the best-fitting line through the scatterplot of points must be a
straight line. If a curved line fits the data better than a straight line, r will be low, not reflecting the
true relationship between the two variables.
Figure 6.8 is an example of a situation in which r is inappropriate because the best-fitting line
is curved. The X variable is arousal, and the Y variable is efficiency of performance. At low levels
of arousal (sleepy, for example), performance is not very good. Likewise, at very high levels of
arousal (agitation, for example), people don’t perform well. In the middle range, however, there is
a degree of arousal that is optimum; performance is best at moderate levels of arousal.
F I G U R E 6 . 8 Generalized relationship between arousal and efficiency of performance
In Figure 6.8, there is obviously a strong relationship between arousal and performance,
but r for the distribution is –.10, which indicates a very weak relationship. The product-moment
correlation coefficient is just not useful for measuring the strength of curved relationships. For
curved relationships, researchers often measure the strength of association with the statistic eta (η)
or by calculating the formula for a curve that fits the data.
error detection
When a data set produces a low correlation coefficient, a scatterplot is especially
recommended. A scatterplot might reveal that a Pearson correlation coefficient is not
appropriate for the data set.
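To see this inner working in action, here is a minimal Python sketch; the arousal and performance numbers are invented to follow an inverted U, not taken from any study. A strong curved relationship produces a Pearson r near zero:

```python
# A minimal sketch: an inverted-U relationship yields a near-zero Pearson r.
# The data below are hypothetical, not from the book.
import numpy as np

rng = np.random.default_rng(0)
arousal = np.linspace(0, 10, 100)
performance = -(arousal - 5) ** 2 + 25 + rng.normal(0, 1, 100)  # curved, strong relation

r = np.corrcoef(arousal, performance)[0, 1]
print(round(r, 2))  # close to 0 even though the relationship is strong
```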
Truncated Range
Besides nonlinearity, a second situation produces small Pearson correlation coefficients even
though there is a strong relationship between the two variables. Spuriously low r values can occur
when the range of scores in the sample is much smaller than the range of scores in the population
(a truncated range).
I’ll illustrate with the relationship between GRE scores and grades in
graduate school.6 The relationship graphed in Figure 6.9 is based on a study
by Sternberg and Williams (1997). These data look like a snowstorm; they
lead to the conclusion that there is little relationship between GRE scores
and graduate school grades.
Truncated range: Range of a sample is smaller than the range of its population.
F I G U R E 6 . 9 Scatterplot of GRE scores and graduate school grades in one school
However, students in graduate school do not represent the full range of GRE scores; those
with low GRE scores are not included. What effect does this restriction of the range have? You can
get an answer to this question by looking at Figure 6.10, which shows a hypothetical scatterplot
of data for the full range of GRE scores. This scatterplot shows a moderate relationship. So, unless
you recognized that your sample of graduate students truncated the range of GRE scores, you
might be tempted to dismiss GRE scores as “worthless.” (A clue that Figure 6.9 has a truncated
range is on the horizontal axis. The GRE scores range from medium to high.)
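A short simulation makes the point concrete. This Python sketch uses hypothetical GRE-like scores (the numbers and the built-in .6 correlation are my assumptions, not Sternberg and Williams's data) and shows r shrinking when only high scorers are kept:

```python
# A minimal sketch: truncating the range of X shrinks Pearson r.
# All values here are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(500, 100, 5000)                          # full range of GRE-like scores
y = 0.6 * (x - 500) / 100 + rng.normal(0, 0.8, 5000)    # moderately related outcome

full_r = np.corrcoef(x, y)[0, 1]
keep = x > 600                                          # admit only the high scorers
truncated_r = np.corrcoef(x[keep], y[keep])[0, 1]
print(round(full_r, 2), round(truncated_r, 2))          # truncated r is noticeably smaller
```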
6. The Graduate Record Examination (GRE) is used by many graduate schools to help select students to admit.
F I G U R E 6 . 1 0 Hypothetical scatterplot of GRE scores and expected graduate
school grades for the population
Other Kinds of Correlation Coefficients
The kind of correlation coefficient you have been learning about—the Pearson
product-moment correlation coefficient—is appropriate for measuring the
degree of the relationship between two linearly related, continuous variables.
Sometimes, however, the data do not consist of two linearly related, continuous
variables. What follows is a description of five other situations. In each case,
you can express the direction and degree of relationship in the data with a
correlation coefficient—but not a Pearson product-moment correlation
coefficient. Fortunately, other correlation coefficients are interpreted much like
Pearson product-moment coefficients.
Dichotomous variable: Variable that has only two values.
Multiple correlation: Correlation coefficient that expresses the degree of relationship between one variable and two or more other variables.
Partial correlation: Technique that allows the separation of the effect of one variable from the correlation of two other variables.
1. If one of the variables is dichotomous (has only two values), then a
biserial correlation (rb) or a point-biserial correlation (rpb) is appropriate.
Variables such as height (recorded as simply tall or short) and gender (male or
female) are examples of dichotomous variables.
2. Several variables can be combined, and the resulting combination can
be correlated with one variable. With this technique, called multiple correlation, a more precise
prediction can be made. Performance in school can be predicted better by using several measures
of a person rather than one.
3. A technique called partial correlation allows you to separate or partial out the effects
of one variable from the correlation of two variables. For example, if you want to know the true
correlation between achievement test scores in two school subjects, it is probably necessary to
partial out the effects of intelligence because cognitive ability and achievement are correlated.
115
116
Chapter 6
4. When the data are ranks rather than scores from a continuous variable, researchers calculate
Spearman rs, which is covered in Chapter 15.
5. If the relationship between two variables is curved rather than linear, then the correlation
ratio eta (η) gives the degree of association (Field, 2005a).
These and other correlational techniques are discussed in intermediate-level textbooks.
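As an illustration of situation 1, here is a hedged Python sketch that computes a point-biserial correlation with SciPy; the height and weight values are made up for the example:

```python
# A minimal sketch of situation 1: a point-biserial correlation between a
# dichotomous variable and a continuous one (hypothetical data).
from scipy import stats

tall = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]          # dichotomous: 1 = tall, 0 = short
weight = [180, 175, 190, 140, 150, 145, 185, 155, 170, 148]

r_pb, p_value = stats.pointbiserialr(tall, weight)
print(round(r_pb, 2))
```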
PROBLEMS
6.15. The correlation between scores on a humor test and a test of insight is .83. Explain what
this means. Continue your explanation by interpreting the coefficient of determination.
End your explanation with caveats (warnings) appropriate for r.
6.16. The correlation between number of older siblings and degree of acceptance of personal
responsibility for one’s own successes and failures is –.37. Interpret this correlation. Find
the coefficient of determination and explain what it means. What can you say about the
cause of the correlation?
6.17. Examine the following data, make a scatterplot, and compute r if appropriate.
Serial position   1   2   3   4   5    6   7   8
Errors            2   5   6   9   13   10  6   4
Linear Regression
First, a caution about the word regression, which has two quite separate meanings. One meaning
of regression is a statistical technique that allows you to make predictions and draw a line of best
fit for bivariate distributions. This is the topic of this chapter.
The word regression also refers to a phenomenon that occurs when an extreme group is tested
a second time. Those who do very well the first time can be expected to see their performance
drop the second time they are tested. Similarly, those who do poorly the first time can expect a
better score on the second test. (This phenomenon is more properly referred to as regression to
the mean.) Very high scores and very low scores are at least partly the result of good luck and bad
luck. Luck is fickle, and the second test score is not likely to benefit from as much luck, either
good or bad. Regression to the mean occurs in many situations, not just with tests. Any time
something produces an extreme result and then is evaluated a second time, regression occurs. For
an informative chapter on the widespread nature of the regression phenomenon, see Kahneman
(2011, Chapter 17).
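A small simulation shows the phenomenon. In this Python sketch (the means, spreads, and cutoff are arbitrary assumptions), each score is ability plus luck, and the top tenth on the first test scores lower, on average, on the retest:

```python
# A minimal sketch of regression to the mean: each test score is true ability
# plus luck; the top scorers on test 1 score lower, on average, on test 2.
import numpy as np

rng = np.random.default_rng(2)
ability = rng.normal(100, 10, 10000)
test1 = ability + rng.normal(0, 10, 10000)   # luck on the first test
test2 = ability + rng.normal(0, 10, 10000)   # fresh luck on the second test

top = test1 > np.percentile(test1, 90)       # the extreme group on test 1
print(round(test1[top].mean(), 1), round(test2[top].mean(), 1))
# the group's mean drops on the retest, with no change in ability
```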
Linear regression: Method that produces a straight line that best fits a bivariate distribution.
Now, back to linear regression, a technique that lets you predict a specific score
on one variable given a score on the second variable. A few sections ago, I said
that the correlation between college entrance examination scores and first-semester
grade point averages is about .50. Knowing this correlation, you can predict that
those who score high on the entrance examination are more likely to succeed as
freshmen than those who score low. This statement is correct, but it is pretty general. Usually, you
want to predict a specific grade point average for a specific applicant. For example, if you were
in charge of admissions at Collegiate U., you want to know the entrance examination score that
predicts a GPA of 2.00, the minimum required for graduation. To make specific predictions, you
must calculate a regression equation.
The Regression Equation
Regression equation: Equation that predicts values of Y for specific values of X.
The regression equation is a formula for a straight line. It allows you to predict a value for Y,
given a value for X. In statistics, the regression equation is

Ŷ = a + bX
where Ŷ = Y value predicted for a particular X value
a = point at which the regression line intersects the y-axis
b = slope of the regression line
X = value for which you wish to predict a Y value
For correlation problems, the symbol Y can be assigned to either variable, but in regression
equations, Y is assigned to the variable you wish to predict.
Regression coefficients: The constants a and b in a regression equation.
To use the equation Ŷ = a + bX, you must have values for a and b, which
are called regression coefficients. To find b, the slope of the regression line,
use the formula

b = [N∑XY − (∑X)(∑Y)] / [N∑X² − (∑X)²]
Use the following alternate formula for b when raw scores are not available, which often happens
when you do a regression analysis on the data of others.
b = r(SY/SX)
where r = the correlation coefficient for X and Y
SY = standard deviation of the Y variable (N in the denominator)
SX = standard deviation of the X variable (N in the denominator)
To compute a, the regression line’s intercept with the y-axis, use the formula

a = Ȳ − bX̄

where Ȳ = mean of the Y scores
b = regression coefficient computed previously
X̄ = mean of the X scores
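For readers who compute by machine, here is a minimal Python sketch of the two formulas above; the example scores are invented for illustration:

```python
# A minimal sketch: regression coefficients b and a from raw scores,
# following the formulas above. The example data are hypothetical.
def regression_coefficients(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope
    a = sy / n - b * (sx / n)                        # intercept: a = Ybar - b*Xbar
    return a, b

a, b = regression_coefficients([1, 2, 3, 4], [2, 3, 5, 6])
print(a, b)  # the prediction line is then Y-hat = a + bX
```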
Figure 6.11 is a generic illustration that helps explain regression coefficients a and b. In
Figure 6.11, the regression line crosses the y-axis exactly at 4, so a = 4.00. The coefficient b
is the slope of the line. To independently determine the slope of a line from a graph, divide
the vertical distance the line rises by the horizontal distance the line covers. In Figure 6.11, the
line DE (vertical rise of the line FD) is half the length of FE (the horizontal distance of FD).
Thus, the slope of the regression line is 0.50 (DE/FE = b = 0.50). Put another way, the
value of Y increases one-half point for every one-point increase in X.
F I G U R E 6 . 1 1 The regression coefficients a and b
The regression coefficients a and b can have positive values or negative values. A negative
a means that the line crosses the y-axis below the zero point. A negative b means the line has a
negative slope. A line with a negative slope has its highest point to the left of its lowest point.
This might be expressed as the line slopes to the left. A line with a positive b has its highest
point to the right of its lowest point; that is, the line slopes to the right. Figure 6.11 is an
example. Small values of b (either positive or negative) indicate regression lines that are almost
horizontal.
Writing a Regression Equation
I’ll illustrate the calculation of a regression equation with college aptitude data and first-year
college grade point averages. The college aptitude data come from the SAT EBRW data in Table
6.3. The first-year college grade point averages (FY GPA) come from Kobrin et al. (2008).
Because the task is to predict FY GPA from SAT EBRW scores, FY GPA is the Y variable.
error detection
Step 1 in writing a regression equation is to identify the variable whose scores you want to
predict. Make that variable Y.
After designating the Y variable, assemble the data. Kobrin et al. (2008) supply summary data
but not raw data, so the alternative formula for b is necessary. The assembled data—correlation
coefficient, means, and standard deviations (S) of both variables—are as follows:

          SAT EBRW (X)   FY GPA (Y)
Mean          500            2.97
S              93.54          0.71
r = .48
The formula for b gives

b = r(SY/SX) = (.48)(0.71/93.54) = (.48)(0.00759) = 0.00364
The formula for a gives

a = Ȳ − bX̄ = 2.97 − (0.00364)(500) = 2.97 − 1.822 = 1.148

The b coefficient (0.00364) tells you that the slope of the regression line is almost flat. The a
coefficient tells you that the regression line intersects the y-axis at 1.148. Entering these regression
coefficient values into the regression equation produces a formula that predicts first-year GPA
from SAT EBRW scores.
Ŷ = a + bX = 1.148 + 0.00364X
To illustrate, let’s predict the freshman GPA for a student whose SAT EBRW score is one
standard deviation above the mean. This SAT EBRW score is about 594 (500 + 93.54 = 593.54 ≈
594). SAT EBRW scores are reported in multiples of 10. An SAT EBRW score of 590 is almost
one standard deviation above the mean. I’ll use 590 for this illustration.
Ŷ = 1.148 + 0.00364X
Ŷ = 1.148 + 0.00364(590) = 3.30
Thus, we’d predict a respectable GPA of 3.30 for a high school student whose SAT EBRW
score is about one standard deviation above the mean.8
Several sections back, I said that if you were in charge of admissions at Collegiate U, you’d
want to know the entrance exam score that predicts a graduation GPA of 2.00. With the regression
equation above, you can approximate that knowledge by finding the SAT EBRW score that
predicts a first-year GPA of 2.00.
Ŷ = 1.148 + 0.00364X
2.00 = 1.148 + 0.00364X
X = 234
SAT scores come in multiples of 10, so I’ll have to choose between 230 and 240. Because
I want a score that predicts applicants who will achieve a FY GPA of at least 2.00, I would
recommend 240.
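You can verify the worked example above with a few lines of Python; the numbers are the summary statistics already given, and the language is my choice rather than the book's:

```python
# Checking the worked example with the summary-statistics formula for b.
r, s_y, s_x = .48, 0.71, 93.54
mean_y, mean_x = 2.97, 500

b = r * (s_y / s_x)            # about 0.00364
a = mean_y - b * mean_x        # about 1.148

gpa_590 = a + b * 590          # predicted FY GPA for an SAT EBRW of 590
score_for_2 = (2.00 - a) / b   # SAT EBRW score that predicts a FY GPA of 2.00
print(round(b, 5), round(a, 3), round(gpa_590, 2), round(score_for_2))
```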
To find Ŷ values from summary data without calculating a and b, use this formula:

Ŷ = Ȳ + r(SY/SX)(X − X̄)

To make Ŷ predictions directly from raw scores, use this formula:

Ŷ = Ȳ + (X − X̄) × [N∑XY − (∑X)(∑Y)] / [N∑X² − (∑X)²]
PROBLEM
*6.18. Using the statistics in Table 6.3, write the equation that predicts SAT Math scores from
SAT EBRW scores.
8. You may know your own SAT EBRW score. With this equation, you can predict your own first-year college grade point average. Also, you probably already have a first-year college grade point average. How do the two compare?
Now you know how to make predictions. Predictions, however, are cheap; anyone can make
them. Respect accrues only when your predictions come true. So far, I have dealt with accuracy by
simply pointing out that when r is high, accuracy is high, and when r is low, you cannot put much
faith in your predicted values of Ŷ.
Standard error of estimate: Standard deviation of the differences between predicted outcomes and actual outcomes.
To actually measure the accuracy of predictions made from a regression
analysis, you need the standard error of estimate. This statistic is discussed
in most intermediate-level textbooks and in textbooks on testing. The concepts
in Chapters 7 and 8 of this book provide the background needed to understand
the standard error of estimate.
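Using only the definition in the margin note, a minimal Python sketch is possible (I assume, as this book's standard deviation formulas do, N in the denominator; the sample values are invented):

```python
# A minimal sketch of the standard error of estimate: the standard deviation
# of the differences between actual and predicted Y values (N in denominator).
def standard_error_of_estimate(ys, y_hats):
    n = len(ys)
    return (sum((y - yh) ** 2 for y, yh in zip(ys, y_hats)) / n) ** 0.5

# Hypothetical actual and predicted values:
print(standard_error_of_estimate([2, 3, 5, 6], [1.9, 3.3, 4.7, 6.1]))
```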
Drawing a Regression Line on a Scatterplot
To illustrate drawing a regression line, I’ll return to the data in Table 6.3, the two subtests of the
SAT. To draw the line, you need a straightedge and two points on the line. Any two points will do.
One point that is always on the regression line is (X̄, Ȳ). Thus, for the SAT data, the two means
(500, 515) identify a point. This point is marked on Figure 6.12 with an open circle.
F I G U R E 6 . 1 2 Scatterplot and regression line for SAT EBRW and SAT Math scores in Table 6.3
The second point may take a little more work. For it, you need the regression equation.
Fortunately, you have that equation from your work on Problem 6.18.
Ŷ = 165 + 0.700X
To find a second point, choose a value for X and solve for Y. Any value for X within the range
of your graph will do; I chose 400 because it made the product of 0.700X easy to calculate:
(0.700)(400) = 280, and 165 + 280 = 445. Thus, the second point is (400, 445), which is marked on Figure 6.12 with an ×.
Finally, line up the straightedge on the two points and extend the line in both directions. Notice
that the line crosses the y-axis just under 400, which may surprise you because a = 165. (I’ll come
back to this in the next section.)
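The same two-point method can be carried out in code. This Python sketch uses matplotlib (my choice; the book draws by hand) with the coefficients from Problem 6.18:

```python
# A minimal sketch of the two-point method: plot the point of means and a
# second computed point, then extend the line in both directions.
import matplotlib.pyplot as plt

a, b = 165, 0.700                       # regression coefficients from Problem 6.18
plt.scatter([500, 400], [515, 445])     # the point of means and the second point
plt.axline((500, 515), slope=b)         # extend the line across the axes
plt.xlabel("SAT EBRW")
plt.ylabel("SAT Math")
plt.show()
```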
With a graph such as Figure 6.12, you can make predictions about SAT Math scores from
SAT EBRW scores. From a score on the x-axis, draw a vertical line up to the regression line. Then
draw a horizontal line over to the vertical axis. That Y score is Ŷ, the predicted SAT Math score.
On Figure 6.12, the SAT Math score predicted for an SAT EBRW score of 375 is between 400
and 450.
In IBM SPSS, the linear regression program calculates several statistics and displays them
in different tables. The IBM SPSS table coefficients is reproduced as Table 6.5 for the SAT data.
The regression coefficients are in the B column under Unstandardized Coefficients. The intercept
coefficient, a (165.00), is labeled (Constant), and the slope coefficient, b (0.700), is labeled SAT
EBRW. The Pearson correlation coefficient (.724) is in the Beta column.
T A B L E 6 . 5 IBM SPSS output of regression coefficients and r for the SAT EBRW and
SAT Math scores in Table 6.3
The Appearance of Regression Lines
Now, I’ll return to the surprise I mentioned in the previous section: The regression line in Figure
6.12 crosses the y-axis just below 400 although a = 165. The appearance of regression lines
depends not only on the calculated values of a and b but also on the units chosen for the x- and
y-axes and whether there are breaks in the axes. Look at Figure 6.13. Although the two lines
appear different, b = 1.00 for both. They don’t appear the same because the y-axes are different.
The units in the left graph are twice as large as units in the right graph.
I can now explain the dilemma I faced when I designed the poorly composed Figure 6.3. The
graph is ungainly because it is square (100 X units is the same length as 100 Y units) and because
both axes start at zero, which forces the data points up into a corner. I composed it the way I did
because I wanted the regression line to cross the y-axis at a (note that it does, 165) and because I
wanted its slope to appear equal to b (note that it does, 0.70).
The more attractive Figure 6.12 is a scatterplot of the same data as those in Figure 6.3. The
difference is that Figure 6.12 has breaks in the axes, and 100 Y units is about one-third the length
of 100 X units. I’m sure by this point you have the message—you cannot necessarily determine a
and b by looking at a graph.
F I G U R E 6 . 1 3 Two regression lines with the same slope (b = 1.00) but with
different appearances. The difference is caused by different-sized units on the y-axis
Finally, a note of caution: Every scatterplot has two regression lines. One is called the
regression of Y onto X, which is what you did in this chapter. The other is the regression of X
onto Y. The difference between these two depends on which variable is designated Y. So, in your
calculations, be sure you assign Y to the variable you want to predict.
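A tiny Python sketch with invented scores shows that the two regressions give different lines unless r = 1.00:

```python
# A minimal sketch: regressing Y on X and X on Y yield different lines
# (hypothetical data).
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 2, 4, 4, 6])

b_yx = np.polyfit(x, y, 1)[0]        # slope predicting Y from X
b_xy = np.polyfit(y, x, 1)[0]        # slope predicting X from Y
print(b_yx, 1 / b_xy)                # expressed on the same axes, the slopes differ
```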
PROBLEMS
6.19. In Problem 6.5, the father–daughter height data, you found r = .513.
a. Compute the regression coefficients a and b. Let fathers = X; daughters = Y.
b. Use your scatterplot from Problem 6.5 and draw the regression line.
6.20. In Problem 6.6, the two different intelligence tests (with the WAIS test as X and the
Wonderlic as Y), you computed r.
a. Compute a and b.
b. What Wonderlic score would you predict for a person who scored 130 on the WAIS?
6.21. Regression is a technique that economists and businesspeople rely on heavily. Think
about the relationship between advertising expenditures and sales. Use the data in the
accompanying table, which are based on national statistics.
a. Find r.
b. Write the regression equation.
c. Plot the regression line on the scatterplot.
d. Predict sales for an advertising expenditure of $10,000.
e. Explain whether any confidence at all can be put in the prediction you made.
Advertising, X ($ thousands)   Sales, Y ($ thousands)
             3                            70
             4                           120
             3                           110
             5                           100
             6                           140
             5                           120
             4                           100
6.22. The correlation between Stanford–Binet IQ scores and Wechsler Adult Intelligence Scale
(WAIS) IQs is about .80. Both tests have a mean of 100. Until recently, the standard
deviation of the Stanford–Binet was 16. For the WAIS, S = 15. What WAIS IQ do you
predict for a person who scores 65 on the Stanford–Binet? Write a sentence summarizing
these results. (An IQ score of 70 has been used by some schools as a cutoff point between
regular classes and special education classes.)
6.23. Many predictions about the future come from regression equations. Use the following
data from the National Center for Education Statistics to predict the number of college
graduates with bachelor’s degrees in the year 2017. Use the time period numbers rather
than years in your calculations and carry four decimal places. Carefully choose which
variable to call X and which to call Y.
Time period             1      2      3      4      5
Year                   2011   2012   2013   2014   2015
Graduates (millions)   1.72   1.79   1.84   1.87   1.89
6.24. Once again, look over the objectives at the beginning of the chapter. Can you do them?
6.25. Now it is time for integrative work on the descriptive statistics you studied in Chapters
2–6. Choose one of the two options that follow.
a. Write an essay on descriptive statistics. Start by jotting down from memory
things you could include. Review Chapters 2–6, adding to your list additional
facts or other considerations. Draft the essay. Rest. Revise it.
b. Construct a table that summarizes the descriptive statistics in Chapters 2–6.
List the techniques in the first column. Across the top of the table, list topics
that distinguish among the techniques—topics such as purpose, formula, and
so forth. Fill in the table. Whether you choose option a or b, save your answer
for that time in the future when you are reviewing what you are learning in this
course (final exam time?).
KEY TERMS
Bivariate distribution (p. 96)
Causation and correlation (p. 110)
Coefficient of determination (p. 108)
Common variance (p. 108)
Correlation coefficient (p. 102)
Dichotomous variable (p. 115)
Effect size index for r (p. 107)
Intercept (p. 117)
Linear regression (p. 116)
Multiple correlation (p. 115)
Negative correlation (p. 99)
Nonlinearity (p. 113)
Partial correlation (p. 115)
Positive correlation (p. 96)
Quantification (p. 95)
Regression coefficients (p. 117)
Regression equation (p. 117)
Regression line (p. 98)
Reliability (p. 110)
Scatterplot (p. 97, 106)
Slope (p. 117)
Standard error of estimate (p. 121)
Truncated range (p. 114)
Univariate distribution (p. 96)
Zero correlation (p. 101)
What Would You Recommend? Chapters 2-6
At this point in the text (and at two later points), I have a set of problems titled What would you
recommend? These problems help you review and integrate your knowledge. For each problem
that follows, recommend a statistic from among those you learned in the first six chapters. Note
why you recommend that statistic.
a. Registration figures for the American Kennel Club show which dog breeds are common
and which are uncommon. For a frequency distribution for all breeds, what central tendency
statistic is appropriate?
b. Among a group of friends, one person is the best golfer. Another person in the group is the
best at bowling. What statistical technique allows you to determine that one of the two is better
than the other?
c. Tuition at Almamater U. has gone up each of the past 5 years. How can I predict what it
will be in 25 years when my child enrolls?
d. Each of the American states has a certain number of miles of ocean coastline (ranging from
0 to 6640 miles). Consider a frequency distribution of these 50 scores. What central tendency
statistic is appropriate for this distribution? Explain your choice. What measure of variability do
you recommend?
e. Jobs such as “appraiser” require judgments about the value of a unique item. Later, a sale
price establishes an actual value. Suppose two applicants for an appraiser’s job made judgments
about 30 items. After the items sold, an analysis revealed that when each applicant’s errors were
listed, the average was zero. What other analysis of the data might provide an objective way to
decide that one of the two applicants was better?
f. Suppose you study some new, relatively meaningless material until you know it all. If you
are tested 40 minutes later, you recall 85%; 4 hours later, 70%; 4 days later, 55%; and 4 weeks
later, 40%. How can you express the relationship between time and memory?
g. A table shows the ages and the number of voters in the year 2018. The age categories start
with “18–20†and end with “65 and over.†What statistic can be calculated to best describe the age
of a typical voter?
h. For a class of 40 students, the study time for the first test ranged from 30 minutes to 6
hours. The grades ranged from a low of 48 to a high of 98. What statistic describes how the
variable Study time is related to the variable Grade?
Transition Passage
To Inferential Statistics
YOU ARE NOW through with the part of the book that is clearly about descriptive statistics.
You should be able to describe a set of data with a graph; a few choice words; and numbers such
as a mean, a standard deviation, and (if appropriate) a correlation coefficient.
The next chapter serves as a bridge between descriptive and inferential statistics. All the
problems you will work give you answers that describe something about a person, score, or
group of people or scores. However, the ideas about probability and theoretical distributions
that you use to work these problems are essential elements of inferential statistics.
So, the transition this time is to concepts that prepare you to plunge into material on
inferential statistics. As you will see rather quickly, many of the descriptive statistics that you
have been studying are elements of inferential statistics.
CHAPTER 7
Theoretical Distributions
Including the Normal Distribution
OBJECTIVES FOR CHAPTER 7
After studying the text and working the problems in this chapter, you should be
able to:
1. Distinguish between theoretical and empirical distributions
2. Distinguish between theoretical and empirical probability
3. Describe the rectangular distribution and the binomial distribution
4. Find the probability of certain events from knowledge of the theoretical distribution
of those events
5. List the characteristics of the normal distribution
6. Find the proportion of a normal distribution that lies between two scores
7. Find the scores between which a certain proportion of a normal distribution falls
8. Find the number of cases associated with a particular proportion of a normal
distribution
THIS CHAPTER HAS more figures than any other chapter, almost one per page. The
reason for all these figures is that they are the best way I know to convey ideas about
theoretical distributions and probability. So, please examine these figures carefully,
making sure you understand what each part means. When you are working problems,
drawing your own pictures is a big help.
I’ll begin by distinguishing between empirical distributions and theoretical
distributions. In Chapter 2, you learned to arrange scores in frequency distributions. The
scores you worked with were selected because they were representative of
scores from actual research. Distributions of observed scores are empirical
distributions.
Empirical distribution: Scores that come from observations.
This chapter has a heavy emphasis on theoretical distributions. Like
the empirical distributions in Chapter 2, a theoretical distribution is a
presentation of all the scores, usually presented as a graph. Theoretical
distributions, however, are based on mathematical formulas and logic rather
than on empirical observations.
Theoretical distribution: Hypothesized scores based on mathematical formulas and logic.
Theoretical distributions are used in statistics to determine probabilities. When there
is a correspondence between an empirical distribution and a theoretical distribution, you
can use the theoretical distribution to arrive at probabilities about future empirical events.
Probabilities, as you know, are quite helpful in reaching decisions.
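A few lines of Python preview this idea (the language and the seed are my choices): the empirical proportion of heads from many simulated coin flips settles toward the theoretical probability of .50.

```python
# A minimal sketch: an empirical distribution of coin flips approaches
# the theoretical probability as observations accumulate.
import random

random.seed(3)
flips = [random.random() < 0.5 for _ in range(10000)]
print(sum(flips) / len(flips))  # empirical proportion, close to the theoretical .50
```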