
Requirement and guideline
Research Critique: You must provide an in-depth critique of existing research in the Criminology and
Criminal Justice field (see the list below). You must select two (2) research publications (peer-reviewed
articles, academic presentations, policy white papers, etc.). For each publication, you must assess the
validity, accuracy, and ethics of the process used and the findings presented. The goal is to highlight
both the flaws and the successes of the research in a manner consistent with a peer review. The
research critique for both publications together must be between 6 and 7 pages, single-spaced, 12-point
font, with one-inch margins. It must be well written and organized, and you must provide endnotes.
Select any two (2) research publications relevant to any of the topics listed below in the Criminology
and Criminal Justice field:
➢ Neither Justice nor Treatment
➢ The Color of Justice
➢ Keeping Our Schools Safe
Format: The research critique for both publications together must be between 6 (minimum) and 7
(maximum) pages, single-spaced, 12-point font, one-inch margins.
Endnotes page: replaces footnotes, which would normally appear at the bottom of each essay page.
Textbook: (Use the text from any of chapters 1-12 as appropriate to support and defend your position.)
Note: I have provided chapters 1 and 2 below; you can access the remaining chapters 3-12 as needed
from the library for free.
Fundamentals of Research in Criminology and Criminal Justice, 5th edition, by Bachman and Schutt
Ch1: https://us.sagepub.com/sites/default/files/upm-assets/109198_book_item_109198.pdf
Ch2: https://us.sagepub.com/sites/default/files/upm-assets/109199_book_item_109199.pdf
Other sources: (You may use any sources as appropriate to support and defend your position, but they
must be reliable and accessible, with links.)
Mono-Method Bias
To have more than one operational representation of a construct does not
necessarily imply that all irrelevancies have been made heterogeneous. Indeed,
when all the manipulations are presented the same way, or all the measures use
the same means of recording responses, then the method is itself an irrelevancy
whose influence cannot be dissociated from the influence of the target construct.
Thus, if all the experts in the previous hypothetical example had been
presented to respondents in writing, it would not logically be possible to
generalize to experts who are seen or heard. Thus it would be more accurate to
label the treatment as “experts presented in writing.” To cite another example,
attitude scales are often presented to respondents without apparent thought to
(a) using methods of recording other than paper-and-pencil, (b) varying whether
the attitude statements are positively or negatively worded, or (c) varying
whether the positive or negative end of the response scale appears on the right
or left of the page. On these three points depends whether one can test if
“personal private attitude” has been measured as opposed to “paper-and-pencil
nonaccountable responses,” “acquiescence,” or “response bias.”
Hypothesis-Guessing Within Experimental Conditions
The internal validity threats called “resentful demoralization” and “compensatory
rivalry” were assumed to result because persons who received less
desirable treatments compared themselves to persons who received more
desirable treatments, making it unclear whether treatment effects of any kind
occurred in the treatment group. Reactive research may not only obscure true
treatment effects, but also result in effects of diminished interpretability. This
is especially true if it is suspected that persons in one treatment group compared
themselves to persons in other groups and guessed how the experimenters
expected them to behave. Indeed, in many situations it is not difficult to guess
what the experimenters hope for, especially in education or industrial organizations.
Hypothesis-guessing can occur without social comparison processes, as
when respondents know only about their own treatment but persist in trying to
discover what the experimenters want to learn from the research.
The problem of hypothesis-guessing can best be avoided by making hypotheses
(if present) hard to guess, by decreasing the general level of reactivity in
the experiment, or by deliberately giving different hypotheses to different
respondents. But these solutions are at best partial, since respondents are not
passive and can always generate their own treatment-related hypotheses which
may or may not be the same as the experimenters’. Learning an hypothesis
does not necessarily imply either the motivation or the ability to alter one’s
behavior because of the hypothesis. Despite the widespread discussion of
treatment confounds that are presumed to result from wanting to give data that
will please the researcher-which we suspect is a result of discussions of the
Hawthorne effect-there is neither widespread evidence of the Hawthorne effect
in field experiments (see reviews by D. Cook, 1967; Diamond, 1974), nor is
there evidence of a similar orientation in laboratory contexts (Weber and Cook,
1972). However, we still lack a sophisticated and empirically corroborated
theory of the conditions under which hypothesis-guessing (a) occurs, (b) is
treatment specific, and (c) is translated into behavior that (d) could lead to
erroneous conclusions about the nature of a treatment construct when (e) the
research takes place in a field setting.
Evaluation Apprehension
Rosenberg (1969) has reviewed considerable evidence from laboratory
experiments which indicates that respondents are apprehensive about being
evaluated by persons who are experts in personality adjustment or the assessment of human skills. In such cases respondents attempt to present themselves
to such persons as both competent and psychologically healthy. It is not clear
how widespread such an orientation is in social science experiments in field
settings, especially when treatments last a long time and populations do not
especially value the way that social scientists or their sponsors evaluate them.
Nonetheless, it is possible that some past treatment effects were due to respondents being willing to present themselves to experimenters in ways that would
lead to a favorable personal evaluation. Being evaluated favorably by experimenters is rarely the target construct around which experiments are designed. It
is a confound.
Experimenter Expectancies
There is some literature (Rosenthal, 1972) which indicates that an experimenter’s expectancies can bias the data obtained. When this happens, it will
not be clear whether the causal treatment is the treatment-as-labeled or the
expectations of the persons who deliver the treatments to respondents. This
threat can be decreased by employing experimenters who have no expectations
or have false expectations, or by analyzing the data separately for persons who
deliver the treatments and have different kinds or levels of expectancy. Experimenter expectancies are thus a special case of treatment-correlated irrelevancy,
and they may well operate in some (but certainly not all) field settings.
Confounding Constructs and Levels of Constructs
Experiments can involve the manipulation of several discrete levels of an
independent variable that is continuous. Thus, one might conclude from an
experiment that A does not affect B when in fact A-at-level-one does not affect
B, whereas A-at-level-four might well have affected B if A had been manipulated
as far as level four. This threat is a problem when A and B are not linearly
related along the whole continuum of A; and it is especially prevalent, we
assume, when treatments have only a weak impact. If they do because low
levels of A are manipulated, and if conclusions are drawn about A without any
qualifications concerning the strength of the manipulation, then misleading
negative conclusions can be drawn. The best control for this threat is to conduct
parametric research in which many levels of A are varied and many levels of B
are measured.
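The reasoning above can be made concrete with a small sketch (the function and numbers are hypothetical, not from the text): when the A-B relation is nonlinear, a design that manipulates only low levels of A invites a misleading null conclusion, while a parametric design over many levels reveals where the effect begins.

```python
# Hypothetical illustration (names and numbers assumed, not from the
# text): a threshold-shaped relation in which low levels of A leave B
# unchanged but higher levels affect it.

def true_effect(a):
    # Assumed nonlinear dose-response: no effect until A exceeds level 2.
    return 0.0 if a <= 2 else 5.0 * (a - 2)

# Weak manipulation: contrast only level 1 against level 0.
low_level_contrast = true_effect(1) - true_effect(0)

# Parametric design: probe many levels of A.
parametric_effects = {a: true_effect(a) for a in range(5)}

print(low_level_contrast)    # 0.0 -> looks like "A does not affect B"
print(parametric_effects)    # effect emerges only at levels 3 and 4
```

The weak two-level contrast and the full parametric sweep describe the same underlying relation; only the latter supports a properly qualified conclusion about A.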
Interaction of Different Treatments
This threat occurs if respondents experience more than one treatment, which
is common in laboratory research but quite rare in field settings. We do not
know in such an instance whether we could generalize any findings to the
situation where respondents received only a single treatment. More importantly,
we would not be able to unconfound the effects of the treatment from the effects
of the context of several treatments. The solution to this problem is either to
give only one treatment to respondents or, wherever possible, to conduct
separate analyses of the first and succeeding treatments which respondents received.
Interaction of Testing and Treatment
To which kinds of testing situations can a cause-effect relationship be generalized? In particular, can it be generalized beyond the testing conditions that
were originally used to probe the hypothesized cause-effect relationship? The
latter is an especially important question when the pretesting of respondents is
involved and might condition the reception of the experimental stimulus,
although the previously cited work of Lana (1969) suggests that pretest sensitization is far from omnipresent. We would want to know whether the same
result would have been obtained without a pretest, and a posttest-only control
group is necessary for this. Similarly, if repeated posttest measurements are
made, we would want to know whether the same results would be obtained if
respondents were posttested once rather than at each delay interval. We would
want to know whether the effect does or does not have to be specified as
including the frequency of posttest measurement. The recommended solution to
this problem is to have independent experimental groups at each delayed-test
interval.
Restricted Generalizability Across Constructs
When social science results are presented to audiences, it is very common to
hear comments such as: “Yes, I accept that the youth job-training program increases the likelihood of being employed immediately after graduation. But what
does it do to adaptive job skills- punctuality, the ability to follow orders, and
so on?” When such questions can be answered, we have a fuller picture of a
treatment’s total impact and are more likely to gain a comprehensive assessment
of the program. Sometimes treatments will affect dependent variables quite differently, implying a positive effect on some construct and an unintended negative
effect on another. While it is impossible to measure all the constructs that a particular treatment could affect, it is useful to explore with other persons how a
treatment might influence constructs other than those that first come to mind in
the original formulation of the research question. Particularly in the program
evaluation area, we could cite many studies where the guiding research questions
were not well explored and where it would have been feasible to collect more
outcome measures, making the research more useful.
Construct Validity, Preexperimental Tailoring,
and Postexperimental Specification
Our presentation of the construct validity of putative causes and effects has
thus far emphasized the researcher critically (a) thinking through how a construct
should be defined, (b) isolating the cognate constructs from which any particular
construct has to be differentiated, and (c) deciding which measures or manipulations
he can use to index the particular hypothetical construct of interest. Then, we
emphasized both (d) the need to have multiple measures or manipulations wherever
possible. This need does not deny that some measures are better than others
but merely indicates that no single measure is perfect, and also indicates (e) the
need to present the manipulations or measures in multiple delivery modes. All of
these points are geared toward helping the researcher answer the major conceptual
questions guiding the research, whether the questions are theoretical or applied.
Data analyses do not always produce the desired results that suggest high
construct validity. Consider, first, direct measures which are collected to test whether
the treatment varied what it should have varied and did not vary what it was not
supposed to have varied. If a reliable measure of, say, communicator credibility
suggests that a communicator was not perceived to be more credible in one experimental
group than another, then it is not easy to say that credibility caused any
effects that may have been inferred from the outcome data. The investigator is then
forced to become a detective whose goal is to use whatever means are available to
specify what might have caused the observed effects if credibility did not.
Next, consider what might happen if the data indicate that a manipulation
affected two reliably measured exemplars of a particular construct but not three
others that were equally well measured. How is the effect to be labeled in this
case, since the planned label does not fit all the results and so seems inappropriate? Feldman’s (1968) experiment in Boston, Athens, and Paris offers a concrete
example of this. He used five measures of “cooperation” in an effort to test
whether compatriots receive greater cooperation than foreigners. The measures
were giving street directions; doing a favor by mailing a lost letter; giving back
money that one could easily, but falsely, claim as one’s own; giving correct
change when one did not have to; and charging the correct amount to passengers
in taxis. The data suggested that giving street directions and mailing the lost letter
were differently related to the experimental manipulations than were foregoing
chances to cheat in ways that would be to one’s advantage. Thus, the data forced
Feldman to specify two kinds of “cooperation” (involving low-cost favors versus
foregoing one’s own financial advantage) where initially he had tailored his measures to reflect what he had hoped was the unitary construct of cooperation. Moreover, since his respecification of the constructs came after the data were received,
we can place less confidence in them than might otherwise have been warranted.
This is not to downplay Feldman’s research, which was exemplary given his
research question. If he had not had the five measures, a much less differentiated-and hence less accurate-picture would have emerged of the differences in
help given to compatriots and foreigners.
The important point is that construct validity consists of more than merely
assessing the fit between planned constructs and the operations that were tailored
to these constructs. One can use the obtained pattern of data to edit one’s thinking
about both the cause and effect constructs, and one can suggest, after the fact,
other constructs that might fit the data better than those with which the experiment
began. Often, the data force one to be more specific in one’s labeling than originally planned, as in the Feldman example or with the research of Parker (1963),
who set out to test whether the introduction of television caused a decrease in per
capita library circulation. He finally concluded that it did for the circulation of
fiction books but not of factual ones. The process of hypothesizing
constructs and testing how well treatment and outcome operations fit these constructs
is similar whether it occurs before the research begins or after the data are received.
The major difference is that in the later stage one specifies constructs that fit the data,
whereas in the earlier stage one derives operations from constructs.
In their pathfinding discussion of construct validity, Cronbach and Meehl
(1955) stressed the utility of drawing inferences about constructs from the fit
between patterns of data that would be predicted if a particular theoretical construct was operating and the multivariate pattern of data was actually obtained in
the research. They used the term “nomological net” to refer to the predicted
pattern of relationships that would permit naming a construct. For instance, a
current version of dissonance theory predicts that being underpaid for a counterattitudinal advocacy will result in greater belief change than being overpaid, provided that the individual who makes the advocacy thinks he has a free choice to
refuse to perform the advocacy. The construct “dissonance” would therefore be
partially validated if experimental data showed that underpayment caused more
belief change than overpayment but only under free choice conditions. However,
the fit between the complex prediction and the complex data only facilitates belief
in “dissonance” to the extent that other theoretical constructs could not explain
this same data pattern. Bem (1972) obviously believes that reinforcement constructs do as good a job of complex prediction in this case as “dissonance.”
We have implicitly used the “nomological net” idea in our presentation of
construct validity. First, we discussed the usefulness-for labeling the treatment-of
examining whether the planned treatment is related to direct measures of the
treatment process and is not related to cognate processes. Second, we discussed
the advantages of determining in what ways the outcome variables are related to
treatments and the type of treatment that could have resulted in such a differentiated impact. For instance, if the introduction of television decreases the circulation of fiction but not fact books, one can hypothesize that the causal impact is
mediated by television taking time away from activities that are functionally
similar-such as fantasy amusement-but not from functionally dissimilar
activities-such as learning specific facts. However, our emphasis has differed slightly from
that of Cronbach and Meehl (1955) inasmuch as we are more interested in fitting
cause and effect operations to a generalizable construct (see Campbell, 1960-the
discussion of “trait validity”) than we are in using complex predictions and data
patterns to validate entirely hypothetical scientific constructs like “anxiety,”
“intelligence” or “dissonance.” However, we readily acknowledge that the way
the data turn out in experiments helps us edit the constructs we deal with, as when
we find that a foreman’s “supervision” has different consequences from less than
ten feet as opposed to more than ten feet.
Under external validity, Campbell and Stanley originally listed the threat of
not being able to generalize across exemplars of a particular presumed cause or
effect construct. We have obviously chosen to incorporate this feature under
construct validity as “mono-operation bias.” The reason for listing this threat
differently from Campbell and Stanley is not fundamental. Rather it is meant to
emphasize that most researchers want to draw conclusions about constructs, but
the Campbell and Stanley discussion had a flavor of definitional operationalism,
although a multiple definitional operationalism. We have tried to avoid this flavor
by invoking construct validity to replace generalizing across cause and effect
exemplars. The other features of Campbell and Stanley’s conceptualization of
external validity are preserved here and elaborated upon. They have to do with (1)
generalizing to particular target persons, settings, and times, and (2) generalizing
across types of persons, settings, and times.
Bracht and Glass (1968) have succinctly explicated external validity, pointing
out that a two-stage process is involved: a target population of persons, settings,
or times has first to be defined, and then samples are drawn to represent these
populations. Very occasionally, the samples are drawn from the populations with
known probabilities, thereby maximizing the final representativeness discussed in
textbooks on sampling theory (e.g., Kish, 1965). But usually the samples cannot
be drawn so systematically and are drawn instead because they are convenient and
give an intuitive impression of representativeness, even if it is only the
representativeness entailed by class membership (e.g., I want to generalize to Englishmen,
and the people I found on street corners in Birkenhead, England, belong to the
class called Englishmen). Accidental sampling, as it is technically labeled, gives
us no guarantee that the achieved population (a subset of Englishmen who hang
around street corners in Birkenhead) is representative of the target population of
which they are members. Consequently, we find it useful to distinguish among (1)
target populations, (2) formally representative samples that correspond to known
populations, (3) samples actually achieved in field research, and (4) achieved
populations.
One of many examples that could be cited to illustrate these last points concerns
the design of the first negative income tax experiment. Practical administrative
considerations led to the study being conducted in a few localities within New
Jersey and in one city in neighboring Pennsylvania. Since the basic question guiding the research did not require such a restricted geographical location, the New
Jersey-Pennsylvania setting must be considered a limitation which reduces one’s
ability to generalize to the implicit target population of the whole United States.
(To criticize the study because the achieved sample of settings was not formally
representative of the target population may appear unduly harsh in light of the fact
that financial and logistical resources for the experiment were limited, and so
sampling was conducted for convenience rather than formal representativeness.
We shall return to this point later. For the present, however, it is worth noting
that accidental samples of convenience do not make it easy to infer the target
population, nor is it clear what population is actually achieved.)
Generalizing to well-explicated target populations should be clearly distinguished
from generalizing across populations. Each is germane to external
validity: the former is crucial for ascertaining whether any research goals that
specified populations have been met, and the latter is crucial for ascertaining
which different populations (or subpopulations) have been affected by a
treatment, i.e., for assessing how far one can generalize. Let us give an example.
Suppose a new television show were introduced that was aimed at teaching
basic arithmetic to seven-year-olds in the United States. Suppose, further, that
one could somehow draw a random sample of all seven-year-olds to give a
representative national sample within known limits of sampling error. Suppose,
further, that one could then randomly assign each of the children to watching
or not watching the television show. This would result in two randomly
formed, and thus equivalent, experimental groups which were representative of
all seven-year-olds in the United States. Imagine, now, that the data analysis
indicated that the average child in the viewing group gained more than the
average child in the nonviewing group. One could generalize such a finding to
the average seven-year-old in the nation, the target population of interest.
This is equivalent to saying that the results were obtained despite possible
variations in how much different kinds of children in the experimental viewing
group may have gained from the show. A more differentiated data analysis
might show that the boys gained more than the girls (or even that only the
boys gained), or the analysis might show that children with certain kinds of
home background gained while children from different backgrounds did not.
Such differentiated findings indicate that the effects of the televised arithmetic
show could not be generalized across all subpopulations of seven-year-old
viewers, even though they could be generalized to the population of seven-year-old viewers in the United States.
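The to/across distinction can be sketched numerically (the gain scores below are hypothetical, not from the text): the overall viewing-versus-nonviewing contrast supports generalizing to the average seven-year-old, while the subgroup contrasts show the effect failing to generalize across boys and girls.

```python
# Hypothetical gain scores for the televised-arithmetic example:
# boys benefit from viewing, girls do not.
from statistics import mean

viewing = {"boys": [6, 7, 8, 7], "girls": [1, 0, 1, 0]}
nonviewing = {"boys": [1, 2, 1, 2], "girls": [1, 1, 0, 2]}

def gain(group):
    # Viewing-minus-nonviewing difference for one subpopulation.
    return mean(viewing[group]) - mean(nonviewing[group])

# Contrast for the full (pooled) sample: generalizing *to* the target.
overall = mean(viewing["boys"] + viewing["girls"]) - mean(
    nonviewing["boys"] + nonviewing["girls"])

print(overall)          # 2.5 -> a gain for the average child
print(gain("boys"))     # 5.5 -> large gain for boys
print(gain("girls"))    # -0.5 -> roughly no gain for girls
```

The positive overall contrast and the divergent subgroup contrasts are both true of the same data, which is exactly the situation the passage describes.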
To generalize across subpopulations like boys and girls logically presupposes being able to generalize to boys and girls. Thus, the logical distinction
between generalizing to and across should not be overstressed. The distinction
is most useful for its practical implications insofar as many researchers who are
concerned about generalizing across populations are usually not as concerned
with careful samplings as are persons who want to generalize to target populations. Many researchers with the former focus would be happy to conclude that
a treatment had a specific effect with the particular achieved sample of boys or
girls in the study, irrespective of how well the achieved population of boys or
girls can be specified.
The distinction between generalizing to target populations and across multiple
populations or subpopulations is also useful because commentators on external
validity have often implicitly stressed one over the other. For instance, some persons
discuss external validity as though it were only about estimating limits of
generalizability, as is evidenced by comments such as: “Sure, the treatment affected
seven-year-olds in Tucson, Arizona, and that was your target group. But what about
children of different ages in other areas of the United States?” Other commentators
discuss external validity exclusively in terms of the fit between samples and target
populations, as is evidenced by comments such as: “I’m not sure whether the treatment is really effective with children who have learning disabilities, for if you look
at the pretest achievement means for the groups in your experiment, you’ll see that
they are as high as the test publisher quotes for the national average. How could
children with learning disabilities have scored so high? I doubt that the research
really involved the kind of child you said it did.”
Finally, we make the distinction between generalizing to and across in order to
emphasize the greater stress that we shall place in this presentation on generalizing
across. The rationale for this is that formal random sampling for representativeness
is rare in field research, so that strict generalizing to targets of external
validity is rare. Instead, the practice is more one of generalizing across haphazard
instances where similar-appearing treatments are implemented. Any inferences
about the targets to which one can generalize from these instances are necessarily
fallible and their validity is only haphazardly checked by examining the instances
in question and any new instances that might later be experimented upon. It is also
worth noting that the formal generalization to target populations of persons is
often associated with large-scale experiments. These are often difficult to administer both in terms of treatment implementation and securing high-quality measurement. Moreover, attrition is almost inevitable, and so the sample with which
one finishes the research may not represent the same population with which one
began the research. A case can be made, therefore, that external validity is enhanced
more by a number of smaller studies with haphazard samples than by a single study
with initially representative samples if the latter could be implemented. Of course, it
should not be forgotten that all the haphazard instances of persons and settings that
are examined can belong to the class of persons or settings to which one would like
to be able to generalize research findings. Indeed, they should belong to such a
class.
List of Threats to External Validity
Tests of the extent to which one can generalize across various kinds of persons, settings, and times are, in essence, tests of statistical interactions. If there is
an interaction between, say, an educational treatment and the social class of children, then we cannot say that the same result holds across social classes. We
know that it does not. Where effects of different magnitude exist, we must then
specify where the effect does and does not hold and, hopefully, begin to explore
why these differences exist. Since the method we prefer of conceptualizing external validity involves generalizing across achieved populations, however unclearly
defined, we have chosen to list all of the threats to external validity in terms of
statistical interaction effects.
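What "threat as statistical interaction" means can be sketched with hypothetical cell means (the values are assumed, not from the text): the interaction is the difference between the treatment effect in one subgroup and the effect in another, and a nonzero difference means the result does not generalize across those subgroups.

```python
# Hypothetical cell means for a treatment-by-social-class design:
# mean outcome, indexed as means[social_class][condition].
means = {
    "working": {"treated": 12.0, "control": 10.0},
    "middle":  {"treated": 18.0, "control": 11.0},
}

def treatment_effect(social_class):
    # Simple treated-minus-control contrast within one class.
    cell = means[social_class]
    return cell["treated"] - cell["control"]

# The interaction: how much the treatment effect differs across classes.
interaction = treatment_effect("middle") - treatment_effect("working")

print(treatment_effect("working"))  # 2.0
print(treatment_effect("middle"))   # 7.0
print(interaction)                  # 5.0 -> effect differs by class
```

With an interaction of zero, one could report a single effect for both classes; here the nonzero interaction forces the more specific statement that the effect holds more strongly for middle-class children.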
Interaction of Selection and Treatment
In which categories of persons can a cause-effect relationship be generalized?
Can it be generalized beyond the groups used to establish the initial relationship-to
various racial, social, geographical, age, sex, or personality groups? Even when
respondents belong to a target class of interest, systematic recruitment factors lead
to findings that are only applicable to volunteers, exhibitionists, hypochondriacs,
scientific do-gooders, those who have nothing else to do, and so forth. One feasible
way of reducing this bias is to make cooperation in the experiment as convenient
as possible. For example, volunteers in a television-radio audience experiment
who have to come downtown to participate are much more likely to be atypical
than are volunteers in an experiment carried door-to-door. An experiment involving
executives is more likely to be ungeneralizable if it takes a day’s time than if it
takes only ten minutes, for only the latter experiment is likely to include those
People who have little free time.
Interaction of Setting and Treatment
Can a causal relationship obtained in a factory be obtained in a bureaucracy, in a military camp, or on a university campus? The solution here is to
vary settings and to analyze for a causal relationship within each. This threat is
of particular relevance to organizational psychology since its settings are on
such disparate levels as the organization, the small group, and the individual.
When can we generalize from any one of these units to the others? The threat
is also relevant because of the volunteer bias as to which organizations cooperate. The refusal rate in getting the cooperation of industrial organizations,
school systems, and the like must be nearer 75% than 25%, especially if we
include those that were never contacted because it was considered certain they
would refuse. The volunteering organizations will often be the most progressive, proud, and institutionally exhibitionist. For example, Campbell (1956),
although working with Office of Naval Research funds, could not get access to
destroyer crews and had to settle for high-morale submarine crews. Can we
extrapolate from such situations to those where morale, exhibitionism, pride, or
self-improvement needs are lower?
Interaction of History and Treatment
To which periods in the past and future can a particular causal relationship be
generalized? Sometimes an experiment takes place on a very special day (e.g.,
when a president dies), and the researcher is left wondering whether he would
have obtained the same cause-effect relationship under more mundane circumstances. Even when circumstances are relatively more mundane, we still cannot
logically extrapolate findings from the present to the future. Yet, while logic can
never be satisfied, “commonsense” solutions for short-term historical effects lie
either in replicating the experiment at different times (for other advantages of
consecutive replication, see Cook, 1974a) or in conducting a literature review to
see if prior evidence exists which does not refute the causal relationship.
Models to Be Followed in Increasing External Validity
In many instances researchers know that they want to generalize to specific
target populations of persons, settings, or times. This is particularly the case in
much applied research, although it is also found among basic researchers interested
in contingency theories (e.g., a theory of schizophrenia, or of behavior in
street settings, which requires the ability to make inferences about schizophrenics
and street settings, however these are defined). Clearly, when target populations
are specified, it is necessary that the research samples be “representative” in
some way.
In other instances, the researchers may not have specific populations in mind.
This is most likely to be the case with someone developing a general theory, but it
is also sometimes appropriate in developing more limited theories or conducting
applied research. For instance, the applied researcher in education may have
fourth-grade inner-city children as the primary intended target population. But he
or she may not have a specific target group of persons in mind for giving the
achievement tests. Yet if all the posttest measurement is conducted by middle-class
testers hired for the particular project, the researcher cannot extrapolate
beyond such testers. In a sense, he or she has drawn an unintended secondary
sample with an unclear population referent that has no intrinsic interest, and without further evidence no generalization beyond such testers is warranted. How
much better it would be if the irrelevant factor of tester social status were not fixed
but varied. Then, one could analyze the data to test whether similar effects were
obtained despite background differences among testers-that is, one could test
whether it is possible to generalize across factors like tester status that are irrelevant to major research goals.
When a target population has been specified, it is appropriate-where possible-to draw up a sampling frame and select instances so that the sample is representative of the population within known limits of sampling error. Many textbooks
on sampling theory exist and are informative about the advantages and disadvantages of drawing samples in different ways. Formally speaking, the most representative samples will be those that are randomly chosen from the population, and it
is possible for these randomly selected units to be randomly assigned to various
experimental groups. We might label the first stage in such a two-stage randomization process as following the random sampling for representativeness model.
It is probably only feasible to follow this model when sampling intended primary targets of persons, the more so if generalization to a limited setting is
required (e.g., to residents of Detroit, rather than the whole United States). However, random sampling for representativeness is theoretically possible on a larger
scale, particularly if multistage area sampling of, say, the whole nation is undertaken. But studies on this scale require considerable resources. Moreover, while it
is clear that the model can be followed for some issues where it is important to
generalize to particular target populations of persons, it is less clear whether it is
often feasible to generalize to target settings, except where these are highly restricted. For instance, by selecting a representative national sample of persons,
one should be able to generalize to various geographical settings (i.e., cities,
towns, and the like). But regions do not exhaustively define settings, and the
nationwide representative experiments of which we are aware-all of which
embed treatments within polling studies- take place in the respondents’ homes
rather than in the street or in factories. While a restriction to living rooms is
desirable for anyone interested in generalizing to settings where opinion polls typically take place, it is less desirable for the majority of researchers who have no
such particular target setting in mind. The point to be noted is that the model of
random sampling for representativeness requires considerable resources which are
probably more readily available for sampling target populations of persons than of
settings or historical times and which are probably more available for restricted
populations of persons (e.g., inhabitants of Detroit) than for the United States at large.
A second model for increasing external validity is the model of deliberate
sampling for heterogeneity. Here the concern is to define target classes of persons,
settings, and times and to ensure that a wide range of instances from within each
class is represented in the design. Thus, a general educational experiment might
be designed to include boys and girls from cities, towns, and rural settings who
differ widely in aptitude and in the value placed on achievement in their home
settings. The task would then be to test whether an educational innovation has
comparable effects in each of the subgroups of children and settings. If the
achieved sample sizes do not permit this, then the task would be to test whether
the innovation has observable effects despite differences between kinds of children
and kinds of settings. The first task involves an obvious attempt at multiple replication, either by testing for interactions of the treatment and student characteristics
or by statistical tests of whether treatment has any observed effects within each
group. The second task involves testing whether a treatment effect is obtained
even though differences between persons and settings are not taken into account in
the data analysis and are inflating the error terms that are used for testing treatment effects.
Deliberate sampling for heterogeneity does not require random sampling at any
stage in the sampling design. Hence one cannot-technically speaking-generalize from the achieved samples to any formally meaningful populations. All one
has are purposive quotas of persons with specified attributes. These quotas permit
one to conclude that an effect has or has not been obtained across the particular
variety of samples of persons, settings, and times that were under study, which is
like saying: “We tried to have children of Types I and II in the experiment in
order to see if the effect would hold with each of them. It did. We’re not sure how
well one can generalize from our particular achieved samples of children to children of Type I and Type II in general, but at least we learned that the effect holds
with at least one sample of Type I children and at least one sample of Type II
children. What we cannot do with any confidence is specify the populations of
children involved.” To have a sample of persons in an experiment with Type I
characteristics is not at all sufficient for formally concluding that we can generalize any findings to the average Type I persons.
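The quota logic described above can be made concrete with a small simulation. Everything in the snippet is an assumption for illustration: the "Type I"/"Type II" labels, the outcome model, the effect size, and the quota sizes are all invented:

```python
import random
from statistics import mean

random.seed(1)

def simulate_outcome(treated, effect=8.0):
    # Hypothetical outcome: baseline noise plus a constant treatment effect.
    return random.gauss(50, 5) + (effect if treated else 0.0)

# Purposive quotas: one convenience sample of each "type" of child.
# The labels and sizes are illustrative; no formal population is defined.
quotas = {"Type I": 80, "Type II": 80}

effects = {}
for group, n in quotas.items():
    treated = [simulate_outcome(True) for _ in range(n // 2)]
    control = [simulate_outcome(False) for _ in range(n // 2)]
    # Within-quota mean difference: did the effect hold for this sample?
    effects[group] = mean(treated) - mean(control)

# The most we can claim: the effect appeared in the one sample we have
# of each type -- not that it generalizes to all Type I or Type II
# children, since the quotas were not randomly drawn.
assert all(diff > 0 for diff in effects.values())
```

The final assertion mirrors the verbal conclusion in the text: the effect "held" within each quota, but no formal population inference follows.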
When one samples nonrandomly, it is usually advantageous to obtain opportunistic samples that differ as widely as possible from each other. Thus, if it were
possible, one might choose to implement a treatment both in a “Magnet School,”
that is, a school established to exemplify teaching conditions at their presumed
best, and also in one of the city’s worst problem schools. If each instance produced comparable effects, then one might begin to suspect that the effect would
hold in many other kinds of schools. However, there is a real danger in having
only extreme instances at each end of some implicit, impressionistic continuum.
This can best be highlighted by asking: “What would you conclude about external
validity if an effect were obtained at one school but not the other?” In this case,
one would be hard pressed to conclude anything about the effects of the innovation in the majority of schools between the extremes. For this reason, it is especially advantageous if deliberate sampling for heterogeneity results in at least one
instance of the impressionistic mode of the class under investigation as well as
instances at each extreme. In other words, at least one instance should be representative of the “typical school” of a particular city (or nation), and at least one
instance each should represent the best and the worst schools.
The model of deliberate sampling for heterogeneity is especially useful in
avoiding the pitfall of restricted inference that results from the failure to consider
sampling questions about secondary targets of inference (e.g., the social class of
educational testers as opposed to the social class of school children). Unless one
has good reasons for matching the class of testers and children, the model based
on seeking heterogeneity indicates that it would be unwise to sample from a
homogeneous group of testers with a common background. Comparable background does not mean identical testers, of course, for testers of any one class differ from each other in a multitude of ways. Nonetheless, social class is relatively
homogeneous, should plausibly affect test scores, and is an irrelevant source of
homogeneity that can often be made heterogeneous at little or no extra cost.
Deliberate purposive sampling for heterogeneity is usually more feasible than
random sampling for representativeness. Imagine conducting an experiment in a
school district to which you want to generalize. You could draw up a list of
schools and randomly select a number of them in order to generalize with confidence. But resources and politics often prevent working with so many schools.
Instead, the researcher is often lucky if he can afford (or be granted) access to
more than one or two schools-an achieved sample of convenience. This being
so, the researcher should seek convenient samples which differ considerably on
attributes that he or she especially wants to generalize across and should take care
not to be inadvertently restricted to populations, particularly those of secondary interest.
A third model for extending external validity is the impressionistic modal
instance model. Here, the concern is to explicate the kinds of persons, settings, or
times to which one most wants to generalize and then to select at least one
instance of each class that is impressionistically similar to the class mode. We
alluded to this strategy earlier in detailing the desirability of having at least one
school similar to the average school in a district. To achieve this aim is simple.
Where comprehensive records exist, one can detail the average size of schools,
average achievement levels, average per capita expenditure, and so forth, and
choose one or more schools that most closely approximate the modal school
characteristics that have been “drawn up.” Should there be no obvious single
mode, one can then define the multiple modes and try to obtain at least one sample
of each. Thus, in many urban school districts, one might find three modes corresponding to all-black, all-white, and heavily desegregated schools. Then a choice
of one group from each class would be called for. Where no suitable archive
measures exist, it should nonetheless be possible for the researcher to sample the
opinions of experts and interested parties to obtain their impression of what the
average school or student is like. A composite impression is then derived for all
the single impressions, and this composite forms the framework for deciding the
order in which potential respondents (or which access-granting authorities) should
be approached for permission to do the study in their locale.
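Where archive measures exist, selecting the modal instance reduces to a simple nearest-to-average computation. The sketch below assumes a hypothetical district archive; the school names and figures are invented for illustration:

```python
# Hypothetical district archive: size, mean achievement, and
# per-pupil expenditure for each school (all figures invented).
schools = {
    "Lincoln":   {"size": 420, "achievement": 61, "per_pupil": 4900},
    "Roosevelt": {"size": 650, "achievement": 72, "per_pupil": 6100},
    "Whitman":   {"size": 510, "achievement": 65, "per_pupil": 5400},
}

# District-wide averages "drawn up" from the archive.
avg = {
    key: sum(s[key] for s in schools.values()) / len(schools)
    for key in ("size", "achievement", "per_pupil")
}

def distance(profile):
    # Relative distance from the average profile, so that no single
    # characteristic dominates merely because of its units.
    return sum(abs(profile[k] - avg[k]) / avg[k] for k in avg)

# The impressionistically modal instance: the school whose profile
# most closely approximates the district averages.
modal_school = min(schools, key=lambda name: distance(schools[name]))
```

If the archive revealed multiple modes instead of one, the same distance function could be applied separately within each mode to choose one instance per class, as the text suggests.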
The definition and selection of modal instances is probably most easy in consultant work or project evaluation where very limited generalization is required.
For instance, an industrial manager knows that he or she wants to generalize to the
present work force in its current setting carrying out its present tasks, the effectiveness of which is measured by means of locally established indicators of productivity, profitability, absenteeism, lateness, and the like. The consultant or evaluator
then knows that he or she has to select respondents and settings to reflect these
circumscribed targets. A feasible method is to concentrate on sampling impressionistically modal instances if sampling has to be carried out at all. (The evaluator might also do well to select out exemplary instances in order to gain a
preliminary understanding of what a business or project is capable of. But that is
another matter.)
The determination of modal instances is more difficult the closer one comes to
theoretical research. This is because target populations are less likely to be specified. For instance, in testing propositions about “helping” behavior, it is not
desirable to generalize only to workers who are presently employed in a particular
factory, working at a particular task, and producing a particular product. The
persons, the settings, the task, and the product would be irrelevant to any helping
theory. Yet-logically speaking-the factors incorporated into a particular test of
a proposition about helping determine the external validity of the findings, and the
researcher presumably does not welcome this restriction. Instead, he or she would
like to generalize to all persons (in the United States? beyond our shores?), all
settings (the street, the home, the factory?), and all tasks (helping someone who
has fainted, helping the permanently disabled?). The feasibility of confident generalizations of such breadth is low, and the most that the basic researcher can do is
to attempt to replicate his or her original findings across settings with different
restrictions or to wait until others have conducted the replications. Sampling for
heterogeneity is at issue here rather than sampling to obtain impressionistically
modal instances that the researcher cannot convincingly define.
It should be clear by now that, where targets are specified, the model of random sampling for representativeness is the most powerful model for generalizing
and that the model of impressionistic modal instances is the least powerful. The
model of heterogeneous instances lies between the two. However, the last model
has advantages over the other two in that it can be used when no targets are
specified and the major concern is not to be limited in one’s generalizations.
Moreover, it can be used with small numbers of samples of convenience. In many
cases the random selection of instances results in generalizing to targets that are of
minimal significance for persons whose interests differ from those of the original
researcher. For instance, to be able to generalize to all whites living in the
Detroit area, while of interest for some purposes, is generally of little interest to
most people. However, it is worth noting that whites in Detroit differ in age, SES,
intelligence, and the like so that it is possible to test whether a particular treatment
can have similar effects despite such differences. In addition, subgroup analyses
can be conducted to examine generality across subpopulations. In other words,
formal randomization from populations of low interest can be used to test causal
relationships across heterogeneous subpopulations; an important function of
random samples is to permit examining the data for differential effects
on a variety of subpopulations. Given the negative relationship between “inferential” power and feasibility, the model of heterogeneous instances would seem
most useful, particularly if great care is taken to include impressionistically modal
instances among the heterogeneous ones.
In the last analysis, external validity-like construct validity-is a matter of
replication. It is worth noting that one can have multiple replication both within a
single study-subgroup analyses exemplify this-and also across studies-as
when one investigator is intrigued by a pattern of findings and tries to replicate
them using his or her own procedures or procedures that have been closely modeled on the original investigators’.
Three dimensions of replication are worth noting. First is the simultaneous or
consecutive replication dimension, with the latter being preferable since it offers
some test, however restricted, of whether a causal relationship can be corroborated
at two different times. (Generalizing across times is necessarily more difficult than
generalizing across persons or settings.) Second is the independent or nonindependent investigator dimension, with the former being more important, especially if
both independent investigators have different expectations about how an experiment
will turn out. Third is the dimension of demonstrated or assumed replication. The
former is assessed by explicit comparisons among different types of persons and
settings where some persons did or did not receive a particular treatment. The
latter is inferred from treatment effects that are obtained with heterogeneous samples, but no explicit statistical cognizance is taken of the differences among persons, settings, and times. Demonstrated replication is clearly more informative
than assumed, for to obtain an effect with a mixed sample of, say, boys and girls,
does not logically entail that the effect could be obtained separately for both boys
and girls. It only entails that the effect was obtained despite any differences
between boys and girls in how they reacted to the treatment.
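The contrast between assumed and demonstrated replication can be shown with a short simulation. The subgroups, effect sizes, and outcome model below are all invented for illustration; the design deliberately gives the treatment an effect for boys but none for girls:

```python
import random
from statistics import mean

random.seed(7)

def outcome(treated, effect):
    # Hypothetical score: baseline noise plus a subgroup-specific effect.
    return random.gauss(100, 10) + (effect if treated else 0.0)

# Illustrative subgroups with deliberately different true effects.
true_effects = {"boys": 12.0, "girls": 0.0}

samples = {
    g: {
        "treated": [outcome(True, e) for _ in range(100)],
        "control": [outcome(False, e) for _ in range(100)],
    }
    for g, e in true_effects.items()
}

# Assumed replication: a single estimate from the pooled, mixed sample.
all_treated = samples["boys"]["treated"] + samples["girls"]["treated"]
all_control = samples["boys"]["control"] + samples["girls"]["control"]
pooled = mean(all_treated) - mean(all_control)

# Demonstrated replication: explicit within-subgroup estimates.
by_group = {
    g: mean(s["treated"]) - mean(s["control"]) for g, s in samples.items()
}

# The pooled effect is positive, yet the within-group estimates show it
# plainly does not replicate for girls -- exactly the inference that
# assumed replication cannot license.
assert pooled > 0
assert by_group["boys"] > by_group["girls"]
```

The pooled estimate answers only the weaker question (did an effect appear despite heterogeneity?), while the per-group estimates answer the stronger one (did the effect appear within each group?).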
The difficulties associated with external validity should not blind experimenters
to practical steps that can be taken to increase generalizability. For instance, one can
often deliberately choose to perform an experiment at three or more sites where
different kinds of persons live or work. Or, if one can randomly sample, it is useful
to do so even if the population involved is not meaningful, for random sampling
ensures heterogeneity. Thus, in their experiment on the relationship between beliefs
and behavior about open housing, Brannon et al. (1973) chose a random sample of
all white households in the metropolitan Detroit area. While few of us are interested
in generalizing to such a population, the sample was nonetheless considerably more
heterogeneous than that used in most research, despite the homogeneity on the attributes of race and geographical residence.
In addition, our three models for increasing external validity can be used in
combination, as has been achieved in some survey research experiments on
improving survey research procedures (Schuman and Duncan, 1974). Usually,
random samples of respondents are chosen in such experiments, but the interviewers are not randomly chosen; they are merely impressionistically modal of all
experienced interviewers. Moreover, the physical setting of the research is limited
to one target setting that is of little interest to anyone who is not a survey
researcher-the respondent’s living room-and the range of outcome variables is
usually limited to those that survey researchers typically study-that is, those that
can be assessed using paper and pencil. However, great care is normally taken that
these questions cover a wide range of possible effects, thereby ensuring considerable heterogeneity in the effect constructs studied.
Our pessimism about external validity should not be overgeneralized. An
awareness of targets of generalization, of the kinds of settings in which a target
class of behaviors most frequently occurs, and of the kinds of persons who most
often experience particular kinds of natural treatments will, at the very least, prevent the designing of experiments that many persons shrug off willy-nilly as
“irrelevant.” Also, it is frequently possible to conduct multiple replications of an
experiment at different times, in different settings, and with different kinds of
experimenters and respondents. Indeed, a strong case can be made that external
validity is enhanced more by many heterogeneous small experiments than by one or
two large experiments, for with the latter one runs the risks of having heterogeneous
treatment, measures that are not as reliable as they could be, and measures
that do not reflect the unique nature of the treatment at different sites. Many
small-scale experiments with local control and choice of measures are in many
ways preferable to giant national experiments with a promised standardization that
is neither feasible nor even desirable from the standpoint of making irrelevancies heterogeneous.
Internal Validity and Statistical Conclusion Validity
Drawing false positive or false negative conclusions about causal hypotheses
is the essence of internal validity. This was a major justification for Campbell
(1969) adding “instability” to his list of threats to internal validity. “Instability” was defined as “unreliability of measures, fluctuations in sampling persons
or components, autonomous instability of repeated or equivalent measures,” all
of which are threats to drawing correct conclusions about covariation and hence
about a treatment’s effect. (What precipitated the need for this additional threat
was the viewpoint of some sociologists who had argued against using tests of
significance unless the comparison followed random assignment to treatments.
See Winch and Campbell, 1969, for further details.)
The status of statistical conclusion validity as a special case of internal
validity can be further illustrated by considering the distinction between bias
and error. Bias refers to factors which systematically affect the value of means;
error refers to factors which increase variability and decrease the chance of
obtaining statistically significant effects. If we erroneously conclude from a
quasi-experiment that A causes B, this might either be because threats to internal validity bias the relevant means or because, for a specifiable percentage of
possible comparisons, sample differences as large as those found in a study
would be obtained by chance. If we erroneously conclude that A does not
affect B (or cannot be demonstrated to affect B), this can either be because
threats to internal validity bias means and obscure true differences or because
the uncontrolled variability obscures true differences. Statistical conclusion
validity is concerned not with sources of systematic bias but with sources of
random error and with the appropriate use of statistics and statistical tests.
An important caveat has to be added to the preceding statement that random errors reduce the chances of statistically corroborating true differences. This
does not imply that random errors invariably inflate standard errors or that they
never lead to false positive conclusions about covariation. Let us try to illustrate these points. Imagine multiple replications of an unbiased experiment
where the treatment had no effect. The distribution of sample mean differences
should be normal with a mean of zero. However, many individual sample
mean differences will not be zero. Some will inevitably be larger or smaller
than zero, even to a statistically significant degree.
Imagine, now, the same assumptions except that bias is operating. Because
of the bias, the distribution of sample mean differences will no longer have a
mean of zero, and the difference from zero indicates the magnitude of the bias.
However, the point to be emphasized is that some sample mean differences
will be as large when there is bias as when there is not, although the proportion of differences reaching the specified magnitude will vary between the bias
and nonbias cases depending on the direction and magnitude of bias. Since
sampling error, which is one kind of random error, affects both sample means
and variances, it can lead to both false positive and false negative differences.
In this respect, sampling error is like internal validity. But it is unlike internal
validity in that it cannot affect population means. Only sources of bias-threats
to internal validity-can do the latter.
Construct Validity and External Validity
Making generalizations is the essence of both construct and external validity. It is instructive, we think, to analyze the similarities and differences
between the two types of validity. The major similarity can perhaps best be
summarized in terms of the notion of statistical interaction-that is, the sign or
direction of a treatment effect differs across populations. It is easy to see how
person, setting, and time variables can moderate the effectiveness of a treatment. It is probably also easy to see how an estimate of the effect may depend
on such threats to construct validity as the number of treatments a respondent
receives or the frequency with which outcomes are measured. It may be less
easy to see how a treatment effect can interact with (i.e., depend on) the particular method used for collecting data (mono-method bias), or the expectancies of the persons implementing a treatment (experimenter expectancies), or
the guesses that respondents make about how they are supposed to ·behave
(hypothesis-guessing). But in all these instances an internally valid effect can
be obtained under one condition (say, when paper-and-pencil measures of attitude are used) and a different, but still valid, effect may result when attitude is
measured some other way.
Specifying the factors that codetermine the direction and size of a particular cause-effect relationship is useful for inferring cause and effect constructs.
This is because some of the causes or effects that might explain a particular
relationship observed under one condition may not be able to explain why there
are different causal relationships under other conditions. It should especially be
noted that specifying the populations of persons, settings, and times over which
a relationship holds can also clarify construct validity issues. For instance, suppose a negative income tax causes more married women than men to withdraw
their labor from the labor market (see the summary statements of the four negative income tax experiments in Cook, Del Rosario, Hennigan, Mark, and Trochim, 1978). Such an action might suggest that the causal treatment can be
understood, not just in monetary terms but also in terms of a possible shift in
economic risks (i.e., where the family breadwinner is guaranteed an income,
the withdrawal of his or her labor could have extremely serious consequences
if the income guarantee were withdrawn or if the guaranteed sum failed to rise
With inflation. But where a source of more marginal income is involved-as
with some married women-the withdrawal of their labor is less critical since
the family is not so heavily dependent on that one source of income.) Other
interpretations of why men and women are affected differently are also possible. Their existence highlights the difficulty of inferring causal constructs where
the clarifying inference is indirect, being based on differences in responding
across populations rather than on attempts to refine the causal operations
directly so that they better fit a planned construct. The major point to be noted,
however, is that both external and construct validity are concerned with specifying the contingencies on which a causal relationship depends and all such
specifications have important implications for the generalizability and nature of
causal relationships. Indeed, external validity and construct validity are so
highly related that it was difficult for us to classify some of the threats as
belonging to one validity type or another. In fact, two of them are differently
placed in this book than in Cook and Campbell (1976). These are “the interaction of treatments” and “the interaction of testing and treatment.” They were
formerly included as threats to external validity on grounds that the number of
treatments and testings were part of the research setting. On reflection, however, we think they are more useful for specifying cause and effect constructs
than for delimiting the settings under which a causal relationship holds, though
they obviously can serve both purposes.
The major difference between external and construct validity has to do with
the extent to which real target populations are available. In the case of external
validity the researcher often wants to generalize to specific populations of persons, settings, and times that have a grounded existence, even if he or she can
only accomplish this by impressionistically examining data patterns across accidental samples. However, with cause and effect constructs it is more difficult
to specify a particular construct-what, for instance, is aggression? Any definitions would be disputed and would not have the independent existence of, say,
the population of American citizens over 18 years of age. Even though the latter is a theoretical construct, it is obviously more grounded in reality than such
constructs as “attitude towards authority” or “a negative income tax.”
Issues of Priority Among Validity Types
Some ways of increasing one kind of validity will probably decrease
another kind. For instance, internal validity is best served by carrying out
randomized experiments, but the organizations willing to tolerate these are
probably less representative than organizations willing to tolerate passive measurement. Second, statistical conclusion validity is increased if the experimenter
can rigidly control the stimuli impinging on respondents, but this procedure can
decrease both external and construct validity. And third, increasing the construct validity of effects by multiply operationalizing each of them is likely to
increase the tedium of measurement and to cause either attrition from the
experiment or lower reliability for individual measures.
These countervailing relationships-and there are many others-suggest how
crucial it is to be explicit about the priority ordering among validity types in
planning any experiment. Means have to be developed for avoiding all unnecessary trade-offs between one kind of validity and another, and to minimize the
loss entailed by the necessary trade-offs. However, since some trade-offs are
inevitable, we think it unrealistic to expect that a single piece of research will
effectively answer all of the validity questions surrounding even the simplest
causal relationship.
The priority among validity types varies with the kind of research being
conducted. For persons interested in theory testing it is almost as important to
show that the variables involved in the research are constructs A and B (construct validity) as it is to show that the relationship is causal and goes from
one variable to the other (internal validity). Few theories specify crucial target
settings, populations, or times to or across which generalization is desired.
Consequently, external validity is of relatively little importance. In practice, it
is often sacrificed for the greater statistical power that comes through having
isolated settings, standardized procedures, and homogeneous respondent populations. For investigators with theoretical interests our estimate is that the types
of validity, in order of importance, are probably internal, construct, statistical
conclusion, and external validity.
We also estimate that the construct validity of causes may be more important for such researchers than the construct validity of effects, particularly in
psychology. Think, for example, of how simplistically “attitude” is operationalized in many persuasion experiments, or “cooperation” in bargaining studies,
or “aggression” in studies of interpersonal violence. Think, on the other hand,
about how much care goes into demonstrating that a particular manipulation
varied “cognitive dissonance” and not reactance, communicator expertise and
not experimenter expectancies or evaluation apprehension. Might not the construct validity of effects rank lower than statistical conclusion validity for most
theory-testing researchers? If it does, this would be ironic since multiple operationalism makes it easier to achieve higher construct validity of effects than of causes.
Much applied research has a different set of priorities. It is concerned with
testing whether a particular problem has been alleviated by a treatment-recidivism in criminal justice settings, achievement in education, or productivity in
industry (high internal validity and high construct validity of the effect). It is
also crucial that any demonstration of change in the indicator be made in a
context which permits either wide generalization or generalization to the specific target settings or persons in whom the researcher or his clients are particularly interested (high interest in external validity). The researcher is relatively
less concerned with determining the causally efficacious components of a complex treatment package, for the major issue is whether the treatment as implemented caused the desired change (low interest in construct validity of the
cause). The priority ordering for many applied researchers is something like
internal validity, external validity, construct validity of the effect, statistical
conclusion validity, and construct validity of the cause.
For the kinds of causal problems we have been discussing, the primacy of internal validity should be noted for both basic and applied researchers. The reasons for this will be given below, and they relate to the often considerable costs of being wrong about the magnitude and direction of causal relations, and the often minimal gains in external validity that are achieved in moving from
initial accidental samples of convenience that belong in the class to which generalization is desired to other types of samples. Consequently, jeopardizing
internal validity for the sake of increasing external validity usually entails a
minimal gain for a considerable loss.
There is also a circular justification for the primacy of internal validity that
pertains in any book dealing with experiments. The unique purpose of experiments is to provide stronger tests of causal hypotheses than is permitted by
other forms of research, most of which were developed for other purposes. For
instance, surveys were developed to describe population attitudes and reported
behaviors while participant observation methods were developed to describe and
generate new hypotheses about ongoing behaviors in situ. Given that the
unique original purpose of experiments is cause-related, internal validity has to
assume a special importance in experimentation since it is concerned with how
confident one can be that an observed relationship between variables is causal
or that the absence of a relationship implies no cause. The relative desirability
of randomized experiments over quasi-experiments becomes even clearer in this
context, for the former allows stronger tests of causal hypotheses than the latter. This is not to say that the randomized experiment guarantees a perfect test
of internal validity. Far from it. However, it usually allows a stronger test than
most quasi-experiments; and most of the quasi-experiments we discuss in
chapters 3 and 5 of this volume permit stronger tests than the nonexperiments
we shall discuss in chapter 7.
Though experiments are designed to test causal hypotheses, and internal
validity is the sine qua non of causal inference, there are contexts where it
would not be advisable to subordinate too much to internal validity. Someone
commissioning research to improve the efficiency of his own organization
might not take kindly to the idea of testing a proposed improvement in a laboratory setting with sophomore respondents. A necessary condition for meeting
such a client’s needs is that he can generalize any findings to his own organization and to the indicators of efficiency that he regularly uses for monitoring
performance. Indeed, his need in this respect may be so great that he is prepared to sacrifice some gains in internal validity for a necessary minimum of
external validity. We would tend to agree with him if increasing internal validity meant going outside his organization or organizations like his own into
some completely different type of setting, e.g., the psychological laboratory. In
most cases, the desirable minimum of external validity would be that the
achieved samples of persons, settings, and measures belong to the specified target “populations,” however accidental the samples finally achieved happened
to be. However, we would be less inclined to agree with him if class membership were not enough and he insisted on, say, the formal random sampling of
respondents when this type of selection precluded random assignment to treatments, which it might if it were feared that many of the potential respondents
would refuse to be in the study if random assignment to treatments took place.
In this last case, the gain in external validity in moving from accidental samples to samples that were initially formally random would not usually seem
worth the loss in internal validity that is associated with going from random to
systematic assignment to treatments.
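The distinction the passage turns on, between random selection of respondents (which serves external validity) and random assignment to treatments (which serves internal validity), can be sketched in a few lines. This is an illustrative sketch with an invented sampling frame, not a procedure from the text:

```python
import random

random.seed(42)

# Hypothetical sampling frame of 10,000 potential respondents.
population = list(range(10_000))

# Random SELECTION supports external validity: the achieved sample is,
# within sampling error, representative of the frame it was drawn from.
sample = random.sample(population, 200)

# Random ASSIGNMENT supports internal validity: the two groups are
# probabilistically equivalent on all pre-treatment characteristics.
shuffled = sample[:]
random.shuffle(shuffled)
treatment, control = shuffled[:100], shuffled[100:]
```

The trade-off discussed above arises because a researcher can often obtain one of these randomizations but not both: a formally random sample whose members refuse random assignment, or a convenience sample that accepts it.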
Many basic researchers specify target populations when they formulate their guiding research questions, and they want to test causal theories about specific classes of persons (e.g., alcoholics) or settings (e.g., urban ghettoes), for their research would be trivialized by any procedures that increased internal validity through conducting research with groups other than alcoholics or in settings other than ghettoes. Thus, when targets of generalization are specified in guiding research questions, cognizance has to be taken of this in designing an experiment, and instances should be chosen that at least belong in the class to which generalization is desired. Unfortunately, being a member of a class does not necessarily imply being representative of that class.
A number of criticisms of the original Campbell and Stanley distinction
between internal and external validity have recently appeared, and we wish to
discuss them here (Gadenne, 1976; Kruglanski and Kroy, 1975; Hultsch and
Hickey, 1978; Cronbach, in preparation). These critics make partially overlapping but also independent criticisms which we shall address one by one.
The first objection is to the claim that random assignment rules out all
threats to internal validity. The argument is made that it is in principle impossible to rule out all validity threats because the true cause of an observed effect
may be either the planned treatment, or procedural correlates of the treatment,
or the interaction of the treatment and the procedures in which the treatment is
embedded. The critics cite Rosenthal’s work on experimenter expectancies to
support this point, arguing that such expectancies are just one of many conceivable, and some as yet inconceivable, forces that operate in an experiment
and can be treatment related.
On one level this is an important point, highlighting theorists’ concerns
with generalizing from operationalized independent variables to theoretical
causal constructs. But the objection does not take into account the fact that
Campbell and Stanley conceived of procedural variables, like experimenter
expectancies, as threats to external and not internal validity. This is because
such threats cast doubt on whether a causal relationship can be generalized
beyond particular settings (e.g., where the experimenter had an hypothesis
about the outcome of the study). They do not cast doubt on whether there was
a causal relationship from the independent-variable-as-manipulated to the dependent-variable-as-measured. Indeed, the critics seem quite prepared to acknowledge that causal inference is involved in studies where the experimenters’
expectancies may play a role, and their concern is with how the treatment
should be labeled. Is it an effect of a particular theory-relevant construct or of
the theoretical irrelevancy of “experimenter expectancies”? Such an issue of
“confounding” has been discussed in this chapter as an issue of construct validity, while in Campbell and Stanley it was an issue of external validity and was never an issue of internal validity.
Nonetheless, the critics do perform an essential service, for it is indeed false to claim that randomization controls for all threats to internal validity. For instance, one can set up a randomized experiment but still have systematic
selection because of differential attrition. Moreover, the process of distributing
valued resources on a random basis (instead of by need or merit, say) can lead
to the operation of threats like compensatory rivalry or compensatory equalization. These, in their turn, can lead to false inferences about the effects of a
treatment. While randomization is the best single means of increasing our confidence in causal inferences, it is not a panacea. Indeed, a book devoted to
quasi-experiments implies that randomized experiments are not achievable at
will. In the chapter devoted to randomized experiments we will stress the need
to design interpretable quasi-experiments along with randomized experiments so
the researcher has strong alternative designs should the initial random assignment to treatments break down. Though this book advocates random assignment, it does so in a more explicitly qualified manner than its predecessors.
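The point that differential attrition can break a randomized experiment can be made concrete with a small simulation. Every number below is invented for illustration: the true treatment effect is zero, yet a spurious positive difference appears because low scorers selectively drop out of the treated group before the posttest:

```python
import random

random.seed(1)

# Hypothetical simulation: 5,000 units, a pre-existing ability score,
# and a treatment that does NOTHING to the outcome.
n = 5000
scores = [random.gauss(50, 10) for _ in range(n)]
units = list(range(n))
random.shuffle(units)                      # random assignment
treated, control = set(units[:n // 2]), set(units[n // 2:])

# Differential attrition: low-scoring treated units mostly drop out
# (only 30% of them are observed at posttest); control is intact.
observed_treated = [scores[i] for i in treated
                    if scores[i] > 40 or random.random() < 0.3]
observed_control = [scores[i] for i in control]

def mean(xs):
    return sum(xs) / len(xs)

# Positive "effect" despite a true effect of zero: randomization has
# been undone by selection through attrition.
bias = mean(observed_treated) - mean(observed_control)
```

The observed group difference here reflects who remained in the study, not what the treatment did, which is exactly the sense in which attrition reintroduces a selection threat after random assignment.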
The second objection made by the critics is that Campbell and Stanley,
while explicitly rejecting inductive inference, nonetheless base their concept of
external validity on inductive inference-going from samples to populations. At
first glance this seems to be a telling criticism, for the language of external
validity is the language of generalizing from samples of persons and settings to
populations of persons and settings. However, it should be noted that in previous discussions, Campbell (1969b) has stressed that, because of the problem
of induction, all generalization in the social sciences is particularly presumptive
and that external validity is inherently more problematic than even internal
validity whose bases are more obviously deductive.
It should also be noted that the relationship of samples to populations can
be specified in deductive terms that permit falsification. For instance, if one
conducts an experiment with a random sample drawn from a well-designated
universe (say, the city of Detroit), one can rule out the threat that the universe
is biased towards white or male or upper-income inhabitants of the city either
by understanding what random selection is and then examining how it was
implemented in the experiment in question, or by comparing background
characteristics of the sample with (hopefully, recent) census data on the population. Stated more formally, the threat to be ruled out is: There is a race bias in
the study. One deduces from this that the percentages of persons from different
races who are in the sample should not differ from the percentages in the population to a greater degree than is warranted by sampling error. This deduction
is testable by collecting data on the race of persons in the sample. If the percentages differ from what is expected on the basis of the (hopefully, recent)
census, it would not be false to say that there is probably a race bias. However, if the percentages are as expected from the census, then it would be false
to say there is a race bias though there may be other sources of bias.
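The deductive test described above, comparing the sample's racial composition against census figures within the bounds of sampling error, amounts to a simple test of a proportion. The census figure and sample counts below are hypothetical, chosen only to show the mechanics:

```python
import math

# Hypothetical figures: suppose a recent census puts the relevant group
# at 44% of the city's population, and the achieved sample of 400
# contains 150 members of that group.
census_p = 0.44
n, observed = 400, 150
sample_p = observed / n

# Under the null hypothesis of "no bias," the sample proportion should
# differ from the census proportion by no more than sampling error.
se = math.sqrt(census_p * (1 - census_p) / n)
z = (sample_p - census_p) / se

# |z| > 1.96 falsifies the no-bias deduction at the 5% level.
biased = abs(z) > 1.96
```

With these invented numbers the deviation exceeds what sampling error warrants, so the "no bias" deduction would be rejected; had the deviation fallen within the interval, the sample would pass this test while, as the text notes, remaining open to other sources of bias.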
External validity can also be deductively tested when sampling is carried
out to achieve heterogeneity rather than formal representativeness. The postulate is, say: The treatment does not affect black inhabitants of Detroit. The
deduction is: The effect will not be observed among blacks. The falsifying test
of this is to examine empirically whether there is a causal relationship among
blacks. If there is, one cannot say that the causal relationship generalizes to all
blacks, but one can at least say that the relationship is not false when tested
with a particular biased sample of blacks. If there is no causal relationship
among blacks, one can confidently conclude that the effect does not hold with
all black inhabitants of Detroit. It is simply not the case, therefore, that external validity rests on a base of inductive inference that flies in the face of the
acknowledged limitations of inductive inference.
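The falsifying test sketched above, checking whether a causal relationship appears within a particular (possibly biased) subsample, can be carried out with a permutation test, which needs no distributional assumptions. All scores below are fabricated for illustration:

```python
import random

random.seed(7)

# Hypothetical posttest scores for the subgroup of interest only. If the
# postulate "the treatment does not affect this subgroup" is true, the
# treated/control labels are exchangeable within the subgroup.
treated = [12, 15, 14, 16, 13, 17, 15, 14]
control = [10, 11, 12, 10, 13, 11, 12, 10]

def mean_diff(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

observed = mean_diff(treated, control)

# Permutation test: how often does shuffling the labels produce a
# difference at least as large as the one observed?
pooled = treated + control
extreme = 0
trials = 5000
for _ in range(trials):
    random.shuffle(pooled)
    if mean_diff(pooled[:8], pooled[8:]) >= observed:
        extreme += 1
p_value = extreme / trials
```

A small p-value falsifies the "no effect in this subgroup" postulate for the sample at hand; as the text stresses, a null result with a biased subsample would license only the weaker conclusion that the universal claim of an effect fails.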
Campbell and Stanley included under external validity threats having to do
with generalizing from manipulations and measures to target constructs. Their
discussion of these issues was explicitly nonpositivist, espousing multiple operationalism. Nonetheless, there was a flavor of positivism in that the inductive
framework may have encouraged readers to think that the operations ”somehow” were the constructs, that an observed response really was “aggression”
or “love.” The present book, in distinguishing between external and construct
validity, has been written to avoid such a positivist flavor and to stress that
constructs are hypothetical entities not “corporeally” represented by samples of research operations.
A third objection the critics make is to note that all guiding research propositions must be couched in general/universal terms whereas internal validity is
couched in the language of causal connections involving research operations.
The critics wonder how one can have any validity internal to an experiment
when the propositions whose validity is being tested are phrased in general
terms external to the experiment (e.g., A is “causally” related to B). The concern with the universal nature of research propositions goes beyond internal
validity, of course, as can be seen in the fact that the guiding research propositions are likely to be phrased as: “What is the causal effect of school desegregation on the academic achievement of children in the public schools of
Evanston in 1969?” Or, “What was the causal effect of a guaranteed income
on the labor force participation of working poor persons in New Jersey and
Scranton, Pennsylvania, between 1969 and 1972?” Given the universal terms
in these propositions, critics point out that validity depends on the fit between
research operations and referent constructs (construct validity) or populations
(i.e., external validity). Most critics invoke Brunswik at one point or another
and call for a “representative” social science in which (1) the target populations and constructs are clearly stated and sampling takes place so as to represent these populations in the research operations; and (2) the targets are not
conceived solely in terms of respondents–the representativeness of settings and
procedures is also crucial for Brunswik and his followers.
We have a great deal of sympathy with the position that all aspects of research design test propositions of a general and universal nature and that sampling is the means by which one approximates representing general constructs about causes, effects, types of people, or the like. However, it is easier to conceive of the representativeness of constructs and populations than of relationships among variables. How does one, for instance, sample to represent “causality”? We find it difficult to imagine representative samples of causal instances. Instead, we think that testing the nature of an observed relationship between an independent and dependent variable has to revolve around the particularities of a single study: around details concerning covariation, temporal precedence, and the ruling out of alternative interpretations about the nature of the relationship in the experiment on hand. Of course, we do not deny that the notion of “cause” is an abstract one and that the single study only approximates causal knowledge. But we believe it is confusing to insist that internal validity is a contradiction in terms because all validity is external, referring to abstract concepts beyond a study and not to concrete research operations within a study. It is confusing because the choice of populations and the fit between samples and populations determines representativeness, whereas neither populations nor samples are necessary for inferring cause.
Nonetheless, the critics make a very useful point, for if the goals of a research project are formulated well enough to permit specifying target constructs and populations, then the research operations have to represent these targets if the research is to be relevant either to theory or policy. Moreover, a focus on representativeness has historically entailed a heightened sensitivity to
unplanned and irrelevant targets that unnecessarily limit generalizability, as
when all the persons who collect posttest achievement data in an early childhood experiment with economically disadvantaged children are of the same
social class. Clearly, relevant research demands representativeness where target
constructs or populations are specified. It also demands heterogeneity where
irrelevant populations could limit the applicability of the research. Though we
advocate putting considerable resources into the preexperimental explication of
relevant theory or policy questions-and hence targets-this should not be
interpreted in any way as an exclusive focus. As we tried to demonstrate in the
discussion of both construct and external validity, it is sometimes the case that
the data, once collected and analyzed, force us to restrict (or extend) generalizability beyond the scope of the original formulation of target constructs and
populations. The data edit the kinds of general statements we can make.
For instance, in his experiment on the help given to compatriots and foreigners, Feldman (1968) wanted to generalize to “cooperation.” He deduced
that if his independent variable affected cooperation, he would find five dependent variable measures related to his treatment. But only two were related, and
the data outcomes forced him to conclude tentatively that his treatment was differently related to two kinds of cooperation. Similarly, the designers of the
New Jersey Negative Income Tax Experiment wanted to generalize to working
poor persons, but the data forced them tentatively to conclude that working
poor blacks responded one way to the treatments, working poor persons who
were Spanish speaking reacted another way, and working poor whites probably
did not respond to the treatments at all. The point is this: While it is laudable
to sample for representativeness when targets of generalization are specified in
advance-and we heartily endorse such sampling-in the last analysis it is the
patterning of data outcomes which determines the range of constructs and populations over which one can claim a treatment effect was obtained. One has
always to be alert to the data demanding a respecification of the affected populations and constructs and to the possibility that the affected populations and
constructs will not be the same as those originally specified.
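The Negative Income Tax illustration, in which the data forced a respecification of the affected populations, corresponds computationally to estimating the treatment effect within each pre-specified subgroup rather than only in the pooled sample. The subgroup names echo the example, but every number below is fabricated:

```python
# Hypothetical sketch of letting the data "edit" a generalization claim:
# (subgroup, treated?, outcome) tuples with entirely made-up values.
data = [
    ("black", True, 5.1), ("black", False, 3.0),
    ("black", True, 4.9), ("black", False, 3.2),
    ("spanish", True, 2.0), ("spanish", False, 3.9),
    ("spanish", True, 2.2), ("spanish", False, 4.1),
    ("white", True, 3.0), ("white", False, 3.1),
    ("white", True, 3.2), ("white", False, 2.9),
]

def subgroup_effect(group):
    """Treated-minus-control mean difference within one subgroup."""
    t = [y for g, tr, y in data if g == group and tr]
    c = [y for g, tr, y in data if g == group and not tr]
    return sum(t) / len(t) - sum(c) / len(c)

effects = {g: round(subgroup_effect(g), 2)
           for g in ("black", "spanish", "white")}
# Opposite-signed effects in different subgroups force a narrower
# statement of the population over which the treatment claim holds.
```

A pooled estimate here would average away the divergent subgroup responses, which is precisely why the patterning of data outcomes, not the original target specification alone, determines the admissible generalization.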
A fourth objection has been directed towards Campbell and Stanley’s stress
on the primacy of internal over external validity. The critics argue that no kind
of validity can logically have precedence over another. Of what use, critics
say, is a theory-testing experiment if the true causal variable is not what the
researchers say it is; or of what use is a policy experiment about the effects of
school desegregation if it involves a school in rural Mississippi when most
desegregation is in large, northern cities where white children have fewer alternatives to public schools than in the deep South? This point of view has been
best expressed by Snow (1974). He uses the term “referent validity” to designate the extent to which research operations correspond to their referent terms
in research propositions of the form: “Counseling for pregnant teenagers
improves their mental health” or “The introduction of national health insurance
causes an increase in the use of outpatient services.” Without using our terminology, Snow notes that such propositions usually contain implicit or explicit
references to populations, settings, and times (external validity), to the nature
of the presumed cause and effect (construct validity), to whether the operations
representing the cause and effect covary (statistical conclusion validity), and to
whether this covariation is plausibly the result of causal forces (internal validity).
For a study to be useful, the argument goes, each part of the proposition
should be given approximately equal weight. There is no need to stress the causality term over any other. Other critics (Hultsch and Hickey, 1978; Cronbach, in
preparation) take the argument one step further and stress the primacy of external
over internal validity. Hultsch claims that if we have a target population of special
interest-for example, the educable mentally retarded-then it is better to test
causal propositions about this group on representative samples. He maintains this
should be done even if less rigorous means have to be used for testing causal
propositions than would be the case if a study was restricted to easily available but
nonrepresentative subgroups of the educable mentally retarded or to children who
were not educable and retarded. Cronbach (in preparation) echoes this argument
and adds, first, that in much applied social research the results are needed quickly
and, second, that a high degree of confidence about causal attribution is less
important in the decisions of policy-makers (broadly conceived) than is confidence
in knowing that one is working with formally or impressionistically representative
samples. Consequently, Cronbach contends that the time demands of experiments with experimenter-controlled manipulanda and the reality of how research is (and is not) used in decision making suggest a higher priority for speedy research using available data sets, simple one-wave measurement studies, or qualitative studies as opposed to studies which, like quasi-experiments, take more time and explicitly stress internal validity.
It is in some ways ironic that the charge of neglecting external validity should be leveled against one of the persons who invented the construct and elevated its importance in the eyes of those who restricted experimentation to laboratory settings and who wrote about experimentation without formally mentioning the special problems that arise in field settings. But this aside, we have no quarrel in the abstract with the point of view that, where causal propositions include references to populations of persons and settings and to constructs about cause and effect, these should be equally weighted in empirical tests of these propositions. The real difficulty comes in particular instances of research design and implementation where very often the investigator is forced to make undesirable choices between internal and external validity. Gaining a representative sample of educable, mentally retarded students across the whole nation demands considerable resources.
Even gaining such a sample in a few cities located more closely together is difficult, requiring resources for implementing a treatment, ensuring its consistent delivery, collecting the required pretest and posttest data, and doing the necessary
public relations work. Without such resources, one runs the risk of a large, poorly
implemented study with a representative sample or of a smaller, better implemented study where the small sample size limits our confidence in generalizing.
Since random sampling is so rare for purposes of achieving representativeness,
it is useful to consider the trade-off between internal and external validity when
heterogeneous but unrepresentative sampling is used or when impressionistically
modal but unrepresentative instances are selected that at least belong in the general
class to which generalization is desired. Samples selected this way will have
unknown initial biases, since not all schools will volunteer to permit measurement, even fewer schools will agree to deliberate manipulation of any kind, and
the sample of schools that will agree to randomized manipulation will probably be
even more circumscribed than the sample of schools that agrees to measurement
with or without quasi-experimentation. The crucial issue is this: Would one do
better to work with the initially more representative sample of schools in a particular geographical area that volunteered to permit measurement, even though no
deliberate manipulation took place? Or would one rather work with a less representative sample of schools where both measurement and deliberate manipulation
took place?
Solving this problem boils down, we think, to asking whether the internal
validity costs of eschewing deliberate manipulation and more confident causal
inferences are worth the gains for external validity of having an initially more
representative sample from which bias-inducing attrition will nonetheless take
place. Any resolution must also consider two other factors. First, the study which
stresses internal validity has at least to take place in a setting and with persons
who belong in the class to which generalization is desired, however formally
unrepresentative of the class they might be. Second, the study which stresses external
validity and has apparently more representative samples of settings and persons will
result in less confident causal conclusions because more powerful techniques of field
experimentation were not used or were not used as well as they might have been under
other circumstances.
The art of designing causal studies is to minimize the need for trade-offs and
to try to estimate in any particular instance the size of the gains and losses in
internal and external validity that are involved in different trade-off options.
Scholars differ considerably in their estimate of gains and losses. Cronbach maintains that timely, representative, but less rigorous studies can still lead to reasonable approximations to causal inference, even if the studies are nonexperimental
and of the kind we shall discuss (somewhat pessimistically) in chapter 7. Campbell and Boruch (1975), on the other hand, maintain that causal inference is problematic with nonexperimental and single-wave quasi-experiments because of the
many threats to internal validity that remain unexamined or have to be ruled out
by fiat rather than through direct design or measurement. The issue involves estimating how to balance timeliness and the quality of causal inference, whether the
costs of being wrong in one’s causal inference are not greater than the costs of
being late with the results.
Consider two cases of timely research aimed at answering causal questions
which used manifestly inadequate experimental procedures. Head Start was evaluated by Ohio-Westinghouse (Cicirelli, 1969) in a design with only one wave of
measurement of academic achievement. The conclusion: Head Start was harmful.
Analysis of the same data using different statistical models appeared to corroborate
this conclusion (Barnow, 1973); to reverse it completely, making Head Start appear
helpful (Magidson, 1977); or to result in no-difference findings (Bentler and Woodward, 1978). Since we do not know the effects of Head Start, any timely decisions
based on the data would have been premature and perhaps harmful. The second
example worth citing is the Coleman Report (Coleman et al., 1966). In this large-scale, one-wave study it was concluded that black children gained more in achievement the higher the percentage of white children in their classes. This finding was used to justify school desegregation. However, better designed subsequent research has shown that if blacks gain at all because of desegregation (which is not clear), they gain much less than was originally claimed. It is important, we feel, not to underestimate the costs of producing timely results about cause, particularly its direction, which turn out to be wrong. Clearly, the chances of being wrong about cause are higher the more one deviates from an experimental model and conducts nonexperimental research using primitive one-wave quasi-experiments.
Since timeliness is important in policy research-though less so for basic
researchers for whom this book is also intended-we shall devote some of the next
chapter to quasi-experimental designs that do not require pretests and to ways in
which archives can be used for rigorous and timely causal analysis. In the end,
however, each investigator has to try to design research which maximizes all kinds
of validity and, if he or she decides to place a primacy on internal validity, this
cannot be allowed to trivialize the research.
We have not tried to place internal validity above other forms of validity. Rather, we wanted to outline the issues. In a sense, by writing a book about experimentation in field settings, we are assuming that readers already believe that internal validity is of great importance, for the raison d'être of experiments is to facilitate causal inference. Other forms of knowledge about the social world are more accurately or more efficiently gained through other means, e.g., surveys or
participant observation. Our aim differs, therefore, from that of the last critics we
discussed. They argue that experimentation is not necessary for causal inference or
that it is harmful to the pursuit of knowledge which will be useful in policy formulation. We assume that readers believe causal inference is important and that
experimentation is one of the most useful, if not the most useful, way of gaining
knowledge about cause.
Protests against “scientism” have been prominent in recent commentaries on the
theory of conducting social science. Such protests focus on inappropriate and
blind efforts to apply “the scientific method” to the social sciences. Critics argue
that quantification, random assignment, control groups and the deliberate intrusion
of treatments-all techniques borrowed from the physical sciences–distort the
context in which social research takes place. Their protest against scientism is
often linked with the now-pervasive rejection of the logical positivist philosophy
of science and is frequently accompanied by a greater emphasis on humanistic and
qualitative research methods such as ethnography, participant observation, and
ethno-methodology. Critics also point to the irreducibly judgmental and subjective
components in all social science research and to the pretensions to scientific precision found in many current studies.
We agree with much of this criticism and have addressed the issue in our
previous work (Campbell, 1966, 1974, 1975; Cook, 1974a; Cook and Cook,
1977; Cook and Gruder, 1978; Cook and Reichardt, in press). However, some of
the critics of scientism (Guttentag, 1971, 1973; Weiss and Rein, 1970; Hultsch
and Hickey, 1978; Mitroff and Bonoma, 1978; Mitroff and Kilman, 1978; Cronbach, in preparation) have cited Campbell and Stanley (1966) and Cook and
Campbell (1976) as prime examples of the scientistic norm to which they object.
While the identification of our previous work with scientism oversimplifies and
blurs the issues, we acknowledge that in this volume, as in the past, we advocate
using the methods of experiments and quantitative science that are shared in part
with the physical sciences. We cannot here comment extensively on these criticisms of our background assumptions, which go beyond criticisms of causation
issues alone. But we can indicate in broad terms the approach we would take in
responding to these objections.
First, we of course agree with the critics of logical positivism. The philosophy
was wrong in describing how physical science achieved its degree of validity,
which was not through descriptive best-fit theories and definitional operationalism.
Although the error did not have much impact on the practice of physics, its effect
on social science methods was disastrous. We join in the criticism of positivist
social science when positivist is used in this technical sense rather than as a synonym for “science.” We do not join critics when they advocate giving up the
search for objective, intersubjectively verifiable knowledge. Instead we advocate
substituting a critical-realist philosophy of science, which will help us understand
the success of the physical sciences and guide our efforts to achieve a more valid
social science. Critical realists (Mandelbaum, 1964) or “metaphysical realists”
(Popper, 1972), “structural realists” (Maxwell, 1972), or “logical realists”
(Northrop, 1959; Northrop and Livingston, 1964) are among the most vigorous
modern critics of logical positivism. Critical realists particularly concerned with
the social sciences identify their position with Marx’s materialist criticism of idealism and positivism, e.g., Bhaskar, 1975, 1978; Keat and Urry, 1975.
Second, it is generally agreed that the social disciplines, pure or applied, are
not truly successful as sciences. In fact, they may never have the predictive and
explanatory power of the physical sciences-a pessimistic conclusion that merits
serious debate (Herskovits, 1972; Campbell, 1972). This book, with its many
categories of threats to validity and its general tone of modesty and caution in
making causal inferences, supports such pessimism and underscores the equivocal
nature of our conclusions. However, it is sometimes forgotten that these threats
are not limited to quantitative or deliberately experimental studies. They also arise
in less formal, more commonsense, humanistic, global, contextual, integrative, and
qualitative approaches to knowledge. Even the "regression artifacts," identified
with measurement error, are an observational-inferential illusion that occurs in
ordinary cognition (see Tversky and Kahneman, 1974, and Fischhoff, 1975).
We feel that those who advocate qualitative methods for social science
research are at their best when they expose the blindness and gullibility of specious quantitative studies. Field experimentation should always include qualitative
research to describe and illuminate the context and conditions under which
research is conducted. These efforts often may uncover important site-specific
threats to validity and contribute to valid explanations of experimental results in
general and of perplexing or unexpected outcomes in particular. We also believe,
along with many critics, that quantitative researchers in the past have used poorly
framed questions to generate quantitative scores and that these scores have then
been applied uncritically to a variety of situations. (Chapters 4 and 7, in particular, highlight some of the abuses associated with traditions of quantitative data
analysis which have probably led to many specious findings.) In uncritical quantitative research, measurement has been viewed as an essential first step in the
research process, whereas in physics the routine measures are the products of past
crucial experiments and elegant theories, not the essential first steps. Also, the
definitional operationalism of logical positivists has supported the uncritical reification of measures and has encouraged research practitioners to overlook the measures' inevitable shortcomings and the consequences of these shortcomings. A
fundamental oversight of uncritical quantifiers has been to misinterpret quantifications as replacing rather than depending upon ordinary perception and judgment,
even though quantification at its best goes beyond these factors (Campbell, 1966,
1974, 1975). Experimental and quantitative social scientists have often used tests
of significance as though they were the sole and final proof of their conclusions.
From our perspective, tests of significance render implausible only one of the
many plausible threats to validity that are continually arising. Naive social quantifiers continue to overlook the presumptive, qualitatively judgmental nature of all
science. In cont…