Should We Be Measuring Effect Size in Applied Behavior Analysis? (Continued)
by Sigurdur Oli Sigurdsson & John Austin, Ph.D.
Western Michigan University
Should we use baseline standard deviation (sd) or pooled (baseline plus intervention) sd?
The rationale for using the sd of the control group in between–group
comparisons is based on the assumption that the population variances of the
control and experimental groups are equal. When applying the d statistic
to applied behavioral data, it is unclear whether the homogeneity of variance
assumption is met when considering different conditions. We can assume that
consecutive baseline and intervention conditions have equal variance and therefore
use the baseline standard deviation when calculating the d statistic.
However, we must understand how this assumption affects the d statistic.
Using only the baseline sd would most likely lead to a more conservative
estimate of ES, because many behavioral interventions reduce the variability
of behavior in addition to changing its level. In such cases, using only the
baseline sd, as opposed to the pooled sd, would result in a smaller d (because
the standard deviation is the denominator in the equation). The other option
for calculating d is to pool the sd of the baseline and intervention phases.
If the sd is indeed smaller for intervention data than for baseline data,
a larger ES would then be obtained.
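The two options can be compared with a minimal sketch. The data are hypothetical, the function names are illustrative, and the pooled sd uses the conventional degrees-of-freedom weighting, which is an assumption rather than a formula given in this article.

```python
from math import sqrt
from statistics import mean, stdev


def d_baseline_sd(baseline, intervention):
    """Mean difference divided by the baseline sd only."""
    return (mean(intervention) - mean(baseline)) / stdev(baseline)


def d_pooled_sd(baseline, intervention):
    """Mean difference divided by the pooled sd of both phases."""
    n_b, n_i = len(baseline), len(intervention)
    var_b, var_i = stdev(baseline) ** 2, stdev(intervention) ** 2
    # Conventional pooled variance: each phase weighted by its degrees of freedom.
    pooled_var = ((n_b - 1) * var_b + (n_i - 1) * var_i) / (n_b + n_i - 2)
    return (mean(intervention) - mean(baseline)) / sqrt(pooled_var)


# Hypothetical data: the intervention raises the level and reduces variability.
baseline = [40, 55, 35, 60, 45, 50]
intervention = [78, 82, 80, 79, 81, 80]

print("d (baseline sd):", round(d_baseline_sd(baseline, intervention), 2))
print("d (pooled sd):  ", round(d_pooled_sd(baseline, intervention), 2))
```

With data of this kind, where variability drops during intervention, the pooled denominator is smaller than the baseline sd, so the pooled d comes out larger.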
How many data points per experimental phase are needed to calculate ES?
When conducting applied research, it is sometimes unrealistic to ask researchers
to conduct lengthy baseline periods. A cursory review by the first author of
the applied organizational behavior management literature revealed that many
conditions have fewer than five data points. This may contribute to an inaccurate
measure of the variance, which could be corrected if a greater number of data
points were available. It is also evident that interventions in applied behavior
analysis are not carried out in a manner that facilitates ES calculations.
That is, rather than using some numerical criterion to determine when to change
phases, conditions in applied behavior analysis are continued until the data
appear to be stable. An unequal number of data points for baseline and intervention
is therefore often unavoidable, even though an equal number of observations
for the baseline and intervention phases would be preferable.
The stability criterion for the introduction of a new experimental phase can
also lead to a relatively small number of data points. A small number of data
points in an experimental phase can, for example, lead to an inflated measure
of sd that is not representative of the variability in the behavior.
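How unstable the sd becomes with short phases can be illustrated with a small simulation. This is only a sketch under strong assumptions that the article does not make: observations are independent and normally distributed with a true sd of 1, and the function name and phase lengths are hypothetical.

```python
import random
from statistics import stdev


def sd_estimate_spread(n_points, n_trials=2000, seed=0):
    """Spread (sd) of the sample sd across simulated phases of n_points each."""
    rng = random.Random(seed)
    estimates = [
        stdev([rng.gauss(0, 1) for _ in range(n_points)])  # true sd is 1
        for _ in range(n_trials)
    ]
    return stdev(estimates)


for n in (3, 5, 10, 20):
    print(n, "points per phase: spread of sd estimates =",
          round(sd_estimate_spread(n), 2))
```

Under these assumptions, the sd estimates from three-point phases scatter far more widely around the true value than those from twenty-point phases, so any d computed from a short phase inherits that uncertainty.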
Should the means for the same applications of the independent variable be pooled?
For example, in an A–B–A–B design, should the means for both
intervention phases be combined for the sake of simplicity? This approach would
yield only one measure of ES. On the other hand, calculating a separate d for
each intervention phase may be a better alternative to demonstrate the reliability
of an effect. If we select the latter approach, we would essentially be treating
each replication of an effect as an additional example of the effects of the
independent variable.
Given that applied behavioral data typically will show several intervention
effects for the same group of participants over time, is it appropriate to
treat each effect separately, as we would do when evaluating ES in group designs?
The issue of ES comparisons between single–subject designs and between–group
designs is further addressed below.
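The two approaches to an A–B–A–B design can be sketched as follows. The phase data are hypothetical, d is computed here with the baseline sd as the denominator, and "combining" is interpreted as pooling the observations of like phases; all three are assumptions made for illustration.

```python
from statistics import mean, stdev


def d_stat(baseline, intervention):
    """Mean difference divided by the baseline sd."""
    return (mean(intervention) - mean(baseline)) / stdev(baseline)


# Hypothetical A-B-A-B data set.
phases = {
    "A1": [10, 12, 11, 13, 14],
    "B1": [20, 22, 21, 23, 24],
    "A2": [12, 14, 13, 15, 16],
    "B2": [22, 24, 23, 25, 26],
}

# Option 1: a separate d for each baseline-intervention pair (each replication).
d_first = d_stat(phases["A1"], phases["B1"])
d_second = d_stat(phases["A2"], phases["B2"])

# Option 2: one d after combining like phases.
d_combined = d_stat(phases["A1"] + phases["A2"], phases["B1"] + phases["B2"])

print("d, first replication: ", round(d_first, 2))
print("d, second replication:", round(d_second, 2))
print("d, combined phases:   ", round(d_combined, 2))
```

In this example the combined d is smaller than either separate d, because the shift in level between the two baseline phases is absorbed into the combined baseline sd.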
What are the concerns relevant to the unit of analysis?
In between–group designs, the mean and sd are calculated by
combining scores from numerous participants. In single–subject designs,
the mean is an aggregate of an individual's responses, and the sd is
based on the variability of those responses. These two kinds of data sets are
different in the sense that variability in an individual's responses is not
taken into account in group data. How this difference affects ES comparisons
between the two designs remains to be evaluated. Moreover, data from single–subject
designs may be autocorrelated (see below).
How does autocorrelation affect ES measures?
Autocorrelation is defined by Johnston and Pennypacker (1993) as "a description
of data that indicates the extent to which values in one subset of a series
predict values of another subset" (p. 363). Autocorrelation in behavioral
data would lead to a reduction in variation, which leads to an inflated d,
suggesting that effects are larger than they actually are. It must be noted,
however, that autocorrelation is not a mysterious aspect of behavioral data,
nor is it inherent in the data by definition. The extent to which autocorrelation
is present in a dataset can be analyzed through statistical procedures (Huitema,
1985). In fact, Huitema (1985) demonstrated that autocorrelation probably does
not present as big a problem as is often feared. Huitema’s results reveal
that autocorrelation is extant in behavioral data, but not in a manner that
precludes ES calculations.
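One common way to quantify autocorrelation in a single phase is the lag-1 autocorrelation coefficient. The sketch below uses the conventional estimator and hypothetical data; it is not presented as the specific procedure used by Huitema (1985).

```python
from statistics import mean


def lag1_autocorrelation(series):
    """Conventional lag-1 autocorrelation estimate for one data series."""
    m = mean(series)
    # Products of deviations for each pair of adjacent observations.
    numerator = sum(
        (series[t] - m) * (series[t + 1] - m) for t in range(len(series) - 1)
    )
    denominator = sum((x - m) ** 2 for x in series)
    return numerator / denominator


# Hypothetical series: a steady trend versus a strictly alternating pattern.
trending = [1, 2, 3, 4, 5, 6]
alternating = [1, 5, 1, 5, 1, 5]

print("lag-1, trending:   ", round(lag1_autocorrelation(trending), 2))
print("lag-1, alternating:", round(lag1_autocorrelation(alternating), 2))
```

A positive value indicates that adjacent observations tend to resemble each other, and a negative value indicates that they tend to alternate. With the short phases typical of applied research, such estimates are themselves imprecise.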
How do we report ES?
Effect–size calculations demand a change in the way we report results.
If ES is to be calculated for behavioral data, it must also become a standard
for behavior analysts to report means and standard deviations. Reporting this
information is an important aid to researchers interested in conducting meta–analyses
of behavioral interventions. For example, the dearth of such information contributed,
in part, to the small number of articles included in the meta–analysis
of organizational behavior modification interventions by Stajkovic and Luthans
(1997). Stajkovic and Luthans set a number of criteria for inclusion in their
analysis, and ultimately included only 19 of the 125 articles that were initially
identified through a search of the literature.
These issues, and no doubt more, will have to be discussed, especially if meta–analyses
are to be carried out for applied behavior data. Some decision rules have to
be agreed upon, and advanced statisticians and behavior analysts will undoubtedly
find faults with the reasoning put forth here. However, the utilization of
the effect–size measure in behavior analysis seems to be an important
issue.
References
Balcazar, F., Hopkins, B. L., & Suarez, Y. (1985). A critical, objective review
of performance feedback. Journal of Organizational Behavior Management, 7, 65-89.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New
York: Academic Press.
Huitema, B. E. (1985). Autocorrelation in applied behavior analysis: A myth.
Behavioral Assessment, 7, 107-118.
Johnston, J. M., & Pennypacker, H. S. (1993). Strategies and tactics
of behavioral research (2nd ed.). Hillsdale: Lawrence Erlbaum Associates.
Rosenthal, R., Rosnow, R. L., & Rubin, D. B. (2000). Contrasts and effect
sizes in behavioral research: A correlational approach. Cambridge: Cambridge
University Press.
Stajkovic, A. D., & Luthans, F. (1997). A meta–analysis of the effects of
organizational behavior modification on task performance, 1975-95. Academy
of Management Journal, 40, 1122-1149.