Organizational Behavior Management Network


Should We Be Measuring Effect Size in Applied Behavior Analysis?
Continued

by Sigurdur Oli Sigurdsson & John Austin, Ph.D.
Western Michigan University


Should we use baseline standard deviation (sd) or pooled (baseline plus intervention) sd?
The rationale for using the sd of the control group in between–group comparisons rests on the assumption that the population variances of the control and experimental groups are equal. When the d statistic is applied to applied behavioral data, it is unclear whether this homogeneity of variance assumption holds across conditions. We could assume that consecutive baseline and intervention conditions have equal variance and therefore use the baseline standard deviation when calculating d. However, we must understand how this assumption affects the d statistic. Using only the baseline sd would most likely lead to a more conservative estimate of ES, because many behavioral interventions reduce variability as well as change the level of behavior. In such cases the baseline sd is larger than the intervention sd, and because the standard deviation is the denominator of the equation, using it alone yields a smaller d than the pooled sd would. The other option is to pool the sd of the baseline and intervention phases. If the sd is indeed smaller for intervention data than for baseline data, a larger ES would then be obtained.
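The difference between the two denominators can be made concrete with a short calculation. The following is a minimal sketch, not a recommended procedure: the function names and the data are hypothetical, and the pooled sd is computed with the usual weighted-variance formula, which is an assumption since no pooling formula is specified here.

```python
from math import sqrt
from statistics import mean, variance


def d_baseline_sd(baseline, intervention):
    """d with the baseline-phase sd alone as the denominator."""
    return (mean(intervention) - mean(baseline)) / sqrt(variance(baseline))


def d_pooled_sd(baseline, intervention):
    """d with the sd pooled across baseline and intervention phases.

    Assumption: pooled variance is the degrees-of-freedom-weighted
    average of the two sample variances.
    """
    n_a, n_b = len(baseline), len(intervention)
    pooled_var = (
        (n_a - 1) * variance(baseline) + (n_b - 1) * variance(intervention)
    ) / (n_a + n_b - 2)
    return (mean(intervention) - mean(baseline)) / sqrt(pooled_var)


# Hypothetical data: the intervention raises the level of behavior
# and also reduces its variability.
baseline = [40, 55, 45, 60, 50]
intervention = [78, 82, 80, 79, 81]

d_base = d_baseline_sd(baseline, intervention)
d_pool = d_pooled_sd(baseline, intervention)
print(f"d using baseline sd: {d_base:.2f}")
print(f"d using pooled sd:   {d_pool:.2f}")
```

With these made-up numbers the baseline-only d is roughly 3.79 and the pooled d roughly 5.26, which is the pattern described above: when variability drops during intervention, the baseline sd gives the more conservative estimate.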

How many data points per experimental phase are needed to calculate ES?
When conducting applied research, it is sometimes unrealistic to ask researchers to conduct lengthy baseline periods. A cursory review by the first author of the applied organizational behavior management literature revealed that many conditions have fewer than five data points. This may contribute to an inaccurate measure of the variance, which could be corrected if a greater number of data points were available. It is also evident that interventions in applied behavior analysis are not carried out in a manner that facilitates ES calculations. That is, rather than using some numerical criterion to determine when to change phases, conditions in applied behavior analysis are continued until the data appear to be stable. An unequal number of data points for baseline and intervention is therefore often unavoidable, even though an equal number of observations for the baseline and intervention phases would have been preferred.
The stability criterion for the introduction of a new experimental phase can also lead to a relatively small number of data points. A small number of data points in an experimental phase can, for example, lead to an inflated measure of sd that is not representative of the variability in the behavior.
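How unreliable an sd based on very few observations can be is easy to show with a small simulation. This is only a sketch under strong assumptions that are ours rather than the article's: observations are treated as independent draws from a normal distribution with a hypothetical true mean of 50 and true sd of 10, and the function name and parameters are invented for illustration.

```python
import random
from statistics import stdev


def sd_estimates(n_points, replications=2000, true_mean=50.0,
                 true_sd=10.0, seed=0):
    """Simulate many phases of `n_points` observations each and
    return the sample sd obtained from every simulated phase."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        stdev([rng.gauss(true_mean, true_sd) for _ in range(n_points)])
        for _ in range(replications)
    ]


short_phase = sd_estimates(n_points=3)   # e.g. a three-point baseline
long_phase = sd_estimates(n_points=20)   # a much longer phase

# Spread of the sd estimates themselves: larger spread means any one
# phase is more likely to give an unrepresentative sd.
spread_short = stdev(short_phase)
spread_long = stdev(long_phase)
print(f"spread of sd estimates, n=3:  {spread_short:.2f}")
print(f"spread of sd estimates, n=20: {spread_long:.2f}")
```

Under these assumptions the sd estimates from three-point phases scatter far more widely around the true value than those from twenty-point phases, so a single short phase can easily overstate or understate the variability in the behavior, and d inherits that error through its denominator.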

Should the means for the same applications of the independent variable be pooled?

For example, in an A–B–A–B design, should the means for both intervention phases be combined for the sake of simplicity? This approach would yield only one measure of ES. On the other hand, calculating a separate d for each intervention phase may be a better alternative to demonstrate the reliability of an effect. If we select the latter approach, we would essentially be treating each replication of an effect as an additional example of the effects of the independent variable.
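The two approaches can be compared on a small example. The sketch below uses hypothetical A–B–A–B data and, as an assumption made only for illustration, takes the baseline sd as the denominator of d; the function name is ours.

```python
from statistics import mean, stdev


def d_statistic(baseline, intervention):
    """d with the baseline sd as denominator (an illustrative choice)."""
    return (mean(intervention) - mean(baseline)) / stdev(baseline)


# Hypothetical A-B-A-B data for one participant or group.
a1 = [10, 12, 11, 13, 14]   # first baseline
b1 = [18, 20, 19, 21, 22]   # first intervention
a2 = [12, 14, 13, 15, 16]   # return to baseline
b2 = [21, 23, 22, 24, 25]   # second intervention

# Option 1: one d per replication of the effect.
d_first = d_statistic(a1, b1)
d_second = d_statistic(a2, b2)

# Option 2: combine like phases and report a single d.
d_combined = d_statistic(a1 + a2, b1 + b2)

print(f"d, first A-B pair:  {d_first:.2f}")
print(f"d, second A-B pair: {d_second:.2f}")
print(f"d, phases combined: {d_combined:.2f}")
```

Here the separate values are about 5.06 and 5.69, while the combined value is about 4.66. The combined d is smaller than either separate d because merging the two baselines adds the shift in level between them to the denominator. This is a property of these particular made-up data, but it shows that the two options need not agree.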
Given that applied behavioral data will typically show several intervention effects for the same group of participants over time, is it appropriate to treat each effect separately, as we would when evaluating ES in group designs? The issue of ES comparisons between single–subject designs and between–group designs is addressed further below.

What are the concerns relevant to the unit of analysis?

In between–group designs, the mean and sd are calculated by combining scores from numerous participants. In single–subject designs, the mean is an aggregate of one individual's responses, and the sd is based on the variability of those responses. The two kinds of data sets differ in that variability within an individual's responses is not taken into account in group data. How this difference affects ES comparisons between the two designs remains to be evaluated. Moreover, data from single–subject designs may be autocorrelated (see below).

How does autocorrelation affect ES measures?
Autocorrelation is defined by Johnston and Pennypacker (1993) as "a description of data that indicates the extent to which values in one subset of a series predict values of another subset" (p. 363). Autocorrelation in behavioral data would lead to a reduction in variation, which in turn leads to an inflated d, suggesting that effects are larger than they actually are. It must be noted, however, that autocorrelation is not a mysterious aspect of behavioral data, nor is it inherent in the data by definition. The extent to which autocorrelation is present in a dataset can be analyzed through statistical procedures (Huitema, 1985). In fact, Huitema (1985) demonstrated that autocorrelation probably does not present as big a problem as is often feared. Huitema’s results indicate that autocorrelation is present in behavioral data, but not to a degree that precludes ES calculations.
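One simple way to check a data series for autocorrelation is the lag-1 coefficient, which asks how well each observation predicts the next one. The sketch below uses the conventional textbook estimator; it is not presented as the specific procedure used by Huitema (1985), and the function name and data are hypothetical.

```python
from statistics import mean


def lag1_autocorrelation(series):
    """Conventional lag-1 autocorrelation estimate for one data series.

    Sum of products of adjacent deviations from the series mean,
    divided by the sum of squared deviations.
    """
    m = mean(series)
    deviations = [x - m for x in series]
    denominator = sum(dev ** 2 for dev in deviations)
    if denominator == 0:
        raise ValueError("series has no variability; coefficient undefined")
    numerator = sum(a * b for a, b in zip(deviations, deviations[1:]))
    return numerator / denominator


# Hypothetical series: a steady trend and a strict alternation.
trending = [1, 2, 3, 4, 5, 6]
alternating = [1, 5, 1, 5, 1, 5]

r_trending = lag1_autocorrelation(trending)
r_alternating = lag1_autocorrelation(alternating)
print(f"lag-1 autocorrelation, trending series:    {r_trending:.2f}")
print(f"lag-1 autocorrelation, alternating series: {r_alternating:.2f}")
```

The trending series gives a coefficient of 0.50 and the alternating series about -0.83. A value near zero would suggest that adjacent observations are not predicting one another. With phases as short as those common in applied research, any such estimate is itself imprecise and should be read cautiously.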

How do we report ES?

Effect–size calculations demand a change in the way we report results. If ES is to be calculated for behavioral data, reporting means and standard deviations must become standard practice for behavior analysts. Reporting this information is an important aid to researchers interested in conducting meta–analyses of behavioral interventions. For example, the dearth of such information contributed, in part, to the small number of articles included in the meta–analysis of organizational behavior modification interventions by Stajkovic and Luthans (1997). Stajkovic and Luthans set a number of criteria for inclusion in their analysis, and included only 19 of the 125 articles initially identified through a search of the literature.
These issues, and no doubt more, will have to be discussed, especially if meta–analyses are to be carried out for applied behavioral data. Some decision rules have to be agreed upon, and advanced statisticians and behavior analysts will undoubtedly find fault with the reasoning put forth here. However, the use of the effect–size measure in behavior analysis seems to be an important issue.


References


Balcazar, F., Hopkins, B. L., & Suarez, Y. (1985). A critical, objective review of performance feedback. Journal of Organizational Behavior Management, 7, 65-89.


Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic Press.

Huitema, B. E. (1985). Autocorrelation in applied behavior analysis: A myth. Behavioral Assessment, 7, 107-118.

Johnston, J. M., & Pennypacker, H. S. (1993). Strategies and tactics of behavioral research (2nd ed.). Hillsdale: Lawrence Erlbaum Associates.

Rosenthal, R., Rosnow, R. L., & Rubin, D. B. (2000). Contrasts and effect sizes in behavioral research: A correlational approach. Cambridge: Cambridge University Press.

Stajkovic, A. D., & Luthans, F. (1997). A meta–analysis of the effects of organizational behavior modification on task performance, 1975-95. Academy of Management Journal, 40, 1122-1149.