Effectiveness and Social Validity of FBAs for Youth At-Risk or With High Incidence Disabilities: A Meta-Analysis

This meta-analysis examined the effectiveness and social validity of 44 functional behavioral assessment (FBA) studies using single case research designs (SCRDs) conducted with youth who displayed challenging behaviors or had high incidence disabilities. Three effect sizes were calculated: standard mean difference (SMD), Tau-U, and improvement rate difference (IRD). Fisher’s conservative dual criterion (CDC), a statistical aid to visual analysis, was also applied. Social validity was assessed using indicators described by Kazdin (2010). Effect sizes were in ranges indicating moderate to large effects. Approximately 71% of AB contrasts reflected CDC systematic change. However, only 44% of studies assessed social validity. There were no significant differences in effectiveness of interventions whether or not a functional analysis was conducted or whether the controlling function was escape or attention. Results are discussed in terms of FBA implementation issues related to social validity and the necessity for conducting a functional analysis for these youth.


Introduction
Applied Behavior Analysis (ABA) is a performance-based, self-evaluative method for changing behavior whose dimensions were described by Baer, Wolf, and Risley (1968) approximately half a century ago in the first issue of the Journal of Applied Behavior Analysis. There have been many important developments in the ABA literature since then, but the issue of social validity and the development of functional behavioral assessment (FBA) arguably have been the most instrumental (Maag, 2014). Ten years after Baer and his colleagues wrote their seminal article, Wolf (1978) wrote another influential article on social validity, which addresses whether a relevant audience (e.g., educators, mental health providers) finds interventions in real-life settings to be acceptable in terms of their goals, methods, outcomes, and ease of implementation. Around that time, Carr (1977) described how self-injurious behaviors resulted from either positive reinforcement or negative reinforcement. Based on his hypotheses, Iwata, Dorsey, Slifer, Bauman, and Richman (1982) conducted what many consider the first study on FBA.
FBA refers to a series of heuristic approaches for determining the purpose (i.e., source of environmental reinforcement) that youths' challenging behaviors serve. An important byproduct of an FBA is the development of a behavior intervention plan (BIP) that addresses the identified function(s). It is believed that the most effective interventions implemented in school and clinical settings are those based on the purpose maladaptive behaviors serve (Ervin et al., 2001). Hundreds of studies have been conducted on various aspects of FBA methodologies across different participant characteristics, and approximately 17 systematic reviews have been conducted on various procedures and populations, eight of which used meta-analytic approaches to calculate effect sizes (Bruni et al., 2017; Common, Lane, Pustejovsky, Johnson, & Johl, 2017; Delfs & Campbell, 2010; Gage, Lewis, & Stichter, 2012; Goh & Bambara, 2012; Losinski, Maag, Katsiyannis, & Ennis, 2014; McKenna, Flower, Kim, Ciullo, & Haring, 2015; Miller & Lee, 2013). Some of the variables addressed in these reviews included, but were not limited to, single case synthesis, effect size approaches, different populations, quality of studies, and positive supports in schools and clinics.
However, SMD is considered unreliable because the small number of observations and floor effects limit variability, which results in overestimates of the parametric treatment effects (Horner, Swaminathan, Sugai, & Smolkowski, 2012; Scruggs & Mastropieri, 2012). Therefore, a ceiling based on the 3rd quartile (i.e., 3.40) was used to decrease overestimates. Improvement rate difference (IRD) was computed because it provides an effect size similar to the risk difference used in medical treatment research, which has a proven track record in hundreds of studies (Parker, Vannest, & Brown, 2009). Finally, Tau-U values were computed because Tau-U controls for monotonic trend.
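As a rough illustration of two of the calculations described above, the following sketch computes a capped SMD and a basic A-versus-B Tau for a single AB contrast. The function names and sample data are hypothetical. Using the baseline standard deviation as the SMD denominator is an assumption (the text does not specify it), and the Tau shown here omits the baseline-trend correction that Tau-U adds. Lower scores are assumed to indicate improvement.

```python
from statistics import mean, stdev

SMD_CEILING = 3.40  # ceiling reported in the text (3rd quartile)


def smd(baseline, intervention, ceiling=SMD_CEILING):
    """Standardized mean difference for a behavior-reduction target.

    Assumption: the difference is scaled by the baseline SD and a
    positive value means the behavior decreased.
    """
    sd = stdev(baseline)
    if sd == 0:
        raise ValueError("baseline has no variability; SMD is undefined")
    d = (mean(baseline) - mean(intervention)) / sd
    return min(d, ceiling)  # cap overestimates at the ceiling


def tau_ab(baseline, intervention):
    """Basic A-vs-B Tau: (improving pairs - worsening pairs) / all pairs.

    This is the non-overlap component only; it does not apply the
    baseline-trend correction used by Tau-U.
    """
    improving = sum(1 for a in baseline for b in intervention if b < a)
    worsening = sum(1 for a in baseline for b in intervention if b > a)
    return (improving - worsening) / (len(baseline) * len(intervention))


# Hypothetical AB contrast: disruptive episodes per session.
baseline_phase = [8, 9, 7, 10, 8]
intervention_phase = [4, 3, 5, 2, 3]
print(smd(baseline_phase, intervention_phase))
print(tau_ab(baseline_phase, intervention_phase))
```

With so few baseline points and little variability, the uncapped SMD in this example exceeds the ceiling, which is the kind of overestimate the cap is meant to limit.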
Additional analysis. Independent t-tests were computed to compare differences in effectiveness between study interventions that did and did not conduct a functional analysis to corroborate hypothesized functions. Independent t-tests were also computed to compare differences in effectiveness between study interventions developed based on escape versus attention functions. These t-tests were computed for all three effect size calculations.
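The comparison described above can be sketched as a two-group t statistic computed on study-level effect sizes. The pooled-variance (Student) form is an assumption, since the text does not say whether equal variances were assumed, and the function name and sample values are hypothetical. Obtaining a p-value additionally requires the t distribution; one option is `scipy.stats.ttest_ind`.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(group1, group2):
    """Pooled-variance independent-samples t statistic and its df."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / standard_error, df


# Hypothetical study-level IRD values for the two groups being compared.
with_functional_analysis = [0.9, 0.8, 0.7, 0.6]
indirect_measures_only = [0.7, 0.6, 0.5, 0.4]
t_value, df = independent_t(with_functional_analysis, indirect_measures_only)
print(round(t_value, 3), df)
```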
Conservative dual criterion. The CDC lines were computed from AB contrasts (i.e., baseline and intervention adjacent phases) extracted from the graphs. Two lines were calculated: the mean line and the least squares regression line. Each line was dropped 0.25 standard deviations and then superimposed on the intervention data phase. Then the criteria developed by Fisher, Kelley, and Lomas (2003) were applied to determine whether changes in data were systematic or nonsystematic by examining the number of data points during the intervention phase that were below the 0.25 standard deviation least squares regression line. Intervention phases must consist of at least five data points in order to apply CDC lines.
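The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the goal is behavior reduction, it counts intervention points falling below both lowered lines (the usual description of a dual criterion, whereas the text above mentions only the regression line), and it derives the required count from a binomial test with p = .5 and alpha = .05 rather than from the published table in Fisher et al. (2003). All function names and data are hypothetical.

```python
from math import comb
from statistics import mean, stdev


def cdc_lines(baseline, n_intervention, shift_sd=0.25):
    """Project the lowered baseline mean and trend lines over the intervention phase."""
    n = len(baseline)
    sessions = range(n)
    x_bar, y_bar = mean(sessions), mean(baseline)
    sxx = sum((x - x_bar) ** 2 for x in sessions)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sessions, baseline))
    slope = sxy / sxx  # least squares slope of the baseline
    intercept = y_bar - slope * x_bar
    shift = shift_sd * stdev(baseline)  # lower each line by 0.25 baseline SD
    future = range(n, n + n_intervention)
    mean_line = [y_bar - shift for _ in future]
    trend_line = [intercept + slope * x - shift for x in future]
    return mean_line, trend_line


def required_count(n_points, alpha=0.05):
    """Smallest k with P(X >= k) < alpha for X ~ Binomial(n_points, 0.5).

    Returns None when no count can reach alpha (assumed criterion).
    """
    for k in range(n_points + 1):
        tail = sum(comb(n_points, j) for j in range(k, n_points + 1)) / 2 ** n_points
        if tail < alpha:
            return k
    return None


def is_systematic(baseline, intervention):
    """True when enough intervention points fall below both lowered lines."""
    mean_line, trend_line = cdc_lines(baseline, len(intervention))
    below_both = sum(
        y < m and y < t for y, m, t in zip(intervention, mean_line, trend_line)
    )
    needed = required_count(len(intervention))
    return needed is not None and below_both >= needed


print(is_systematic([8, 9, 7, 10, 8], [4, 3, 5, 2, 3]))
```

Under the binomial assumption used here, no count satisfies the criterion when the intervention phase has fewer than five points, which is consistent with the five-point minimum stated above.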

Social Validity
Social validity was assessed for each study based on the four components described by Kazdin (2010). First, did researchers have a specifically stated goal related to social validity of the study? Second, was there a social comparison component? That is, were data collected on the dependent variables for a peer(s) who did not display challenging behaviors to determine whether participants' intervention data were commensurate with the peers' level? Third, did the researchers include a social validity scale that assessed the degree to which participating teachers found interventions helpful and easy to implement and outcomes positive? Fourth, did researchers report on the level to which the intervention addressed the specific goal?

Publication Bias
Publication bias, or the "file drawer" effect, was addressed. This phenomenon refers to the potential bias that exists because published research is more likely to show positive findings (Rosenthal, 1979). In a meta-analysis of group design studies, the Meta-Win's Fail-Safe function (Rosenberg, Adams, & Gurevitch, 2000) can be used to estimate the number of unpublished studies with null results sufficient to reduce observed effect sizes to a minimal level (i.e., < .20). However, there is no comparable formula for SCRD meta-analyses. Therefore, to address the "file drawer" effect, cases with no effect were added to the group of study effect sizes until the overall effect was reduced to insignificant or suspect levels (d < .20; IRD < .37; Tau < .20).
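One way to read the procedure above is as a search for the smallest number of zero-effect cases that pulls the mean effect below a chosen threshold. The sketch below illustrates that reading only; the function name and sample values are hypothetical, and it is not claimed to reproduce the counts reported in this review.

```python
def zero_cases_needed(effects, threshold):
    """Smallest number of added zero-effect cases that brings the mean below threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    total, n = sum(effects), len(effects)
    added = 0
    # Each added case contributes 0 to the sum and 1 to the denominator.
    while total / (n + added) >= threshold:
        added += 1
    return added


# Hypothetical effect sizes and an illustrative threshold.
print(zero_cases_needed([1.0, 0.5], threshold=0.25))
```

Because the added cases contribute nothing to the sum, the required number grows in proportion to how far the observed mean sits above the threshold.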

Inter-Rater Reliability
Interrater reliability (IRR) checks were conducted on 20 randomly selected articles out of the 44 included studies (45%) for the eight coded study characteristics. This percentage is congruent with other published SCRD systematic reviews (e.g., Maggin, Briesch, & Chafouleas, 2013). Social validity IRR was conducted on all 44 studies to determine the presence or absence of the four indicators described previously. IRR was calculated for both study characteristics and social validity components by dividing the total number of agreements by the total number of agreements plus disagreements for each item and then averaging across items. The author and one doctoral-level graduate assistant coded the articles for all variables; IRR was 88% (range: 72%-100%) for study characteristics and 85% (range: 68%-100%) for social validity.
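The agreement calculation described above can be sketched as follows. The item names and codes are hypothetical and serve only to show the per-item percentage and the average across items.

```python
def percent_agreement(codes_a, codes_b):
    """Agreements / (agreements + disagreements) for one coded item, as a percentage."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must rate the same non-empty set of studies")
    agreements = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100 * agreements / len(codes_a)


def mean_item_agreement(items):
    """Average the per-item agreement percentages across all coded items."""
    per_item = [percent_agreement(a, b) for a, b in items.values()]
    return sum(per_item) / len(per_item)


# Hypothetical codes from two coders for two items across five studies.
coded_items = {
    "design": (["AB", "ABAB", "MBL", "AB", "ABAB"],
               ["AB", "ABAB", "MBL", "AB", "MBL"]),
    "setting": (["gen", "gen", "self", "gen", "sped"],
                ["gen", "gen", "self", "gen", "sped"]),
}
print(mean_item_agreement(coded_items))
```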

Characteristics of Participants and Settings
A total of 91 participants were included in the 44 studies contained in this analysis. Descriptions of participant age, gender, grade level (when stated), and disability/diagnosis/at-risk status appear in Table 1. Participants' ages ranged from six years (e.g., Grady & Peck, 1997) to 15 years (Patterson, 2009), with a mean age of 8.86 years and a median age of 9 years. Eight studies, with a total of 14 participants (13 males, 1 female), reported only grade level and not age; their median grade was fourth. All studies reported gender, with more males (n = 76) than females (n = 15) represented. The majority of participants were at risk and displayed challenging behaviors but did not have a high incidence disability or any of the psychiatric conditions described previously (n = 56), followed by participants identified as EBD (n = 17), ADHD (n = 13), and LD (n = 5). Approximately 39% of the studies (n = 17) had only one participant.
The majority of studies were conducted in a general education classroom (n = 34). The next most common setting was a self-contained classroom (n = 8). Two studies identified the setting as "special education" (Bessett & Wills, 2007; Clarke et al., 1995).
Table 1 notes. a Many studies indicated using a "multi-element" design. These designs were used during functional analysis (i.e., testing hypotheses), whereas the present review was only interested in the subsequent design used to analyze the efficacy of the intervention developed from the FBA. b There was a second participant, but that participant was diagnosed with a pervasive developmental disorder. c There were four boys, but only three were included in the analysis because the fourth had autism. d There were two additional boys, but both were 4 years old and, consequently, were excluded for not meeting inclusion criteria.

Design Features
The majority of SCRDs were reversal (n = 16) and multiple baseline (n = 14). Other designs used were a simple AB (n = 7), alternating treatments (n = 3), changing conditions (n = 3), and multi-element (n = 1). Only designs that evaluated the efficacy of an intervention developed from the FBA were used. Many studies initially indicated using a multi-element design but those were for determining and testing the hypothesized functions and not the designs they used to assess the effectiveness of the interventions developed from them.

Dependent variables and identified function(s).
The majority of studies targeted between three and five dependent variables. The three most commonly targeted behaviors were talking to others (n = 28), being out of seat/walking around (n = 27), and not following directions/noncompliance (n = 26). Some studies used fairly subjective terms such as aggression (n = 9), while others were more specific, such as hitting (n = 5) and kicking (n = 6). Two studies targeted sexually explicit comments (Trussell, Lewis, & Raynor, 2016; Turton, Umbreit, & Mathur, 2011).

FBA developed interventions.
Most studies (n = 38) developed multi-component interventions, with only six using a single-element intervention: differential reinforcement of other behavior (DRO; Broussard & Northup, 1997), modifying the interest level of tasks/assignments (Clarke et al., 1995), reducing task difficulty (Haydon, 2012), the teacher spending two minutes talking to the student before a lesson (Patterson, 2009), and assigning more challenging tasks (Umbreit, 1995). The most common intervention components were teaching replacement behaviors, contingent attention, extinction, differential reinforcement of alternative behavior (DRA), differential negative reinforcement of alternative behavior (DNRA), self-monitoring, and rearranging antecedent cues for the occurrence of appropriate behavior. From the descriptions of the interventions, quite elaborate and complicated techniques were used to determine whether problem behaviors displayed by participants during academic-related activities were maintained by either attention or escape.

Statistical Analysis
Effects of studies. Effect sizes were calculated for 145 AB contrasts and then averaged for each study; the study-level averages appear in Table 2. Overall omnibus effect sizes for each type were as follows: SMD (mean = 2.26, SD = 1.266, range = 0.12-3.40); IRD (mean = .7754, SD = 0.267, range = 0-1); and Tau (mean = .7712, SD = 0.272, range = 0-1). Independent samples t-tests showed no significant differences on any of the three effect size types between interventions for which a functional analysis was conducted to confirm hypothesized functions and those using only indirect measures: IRD (t = 1.301, p = .09), Tau-U (t = 1.038, p = .15), and SMD (t = 0.983, p = .16). There also were no significant differences in the effectiveness of interventions based on the function of attention versus escape for IRD (t = -0.856, p = .19), Tau-U (t = -0.750, p = .22), and SMD (t = -0.605, p = .27).

Conservative dual criterion.
A total of 112 AB contrasts met evaluation criteria; the remainder were excluded (i.e., too few intervention data points or an alternating treatments design with no baseline). Based on the criteria developed by Fisher et al. (2003), 80 (71%) of AB contrasts demonstrated systematic change while 32 (29%) represented nonsystematic change.
Publication bias. To address the "file drawer effect," the number of studies with results of zero required to reduce the overall effect to insignificant or suspect levels was determined for SMD, IRD, and Tau effect sizes. It would take an average of 219 cases, each with an effect size of 0, to bring the overall SMD, IRD, and Tau into small to ineffective ranges. There are typically between one and six participants in SCRD studies. Using an average of three participants per study, 73 "filed" studies (more than one and a half times the 44 that met inclusion criteria) would be needed to bring obtained effect sizes into the ineffective range.
[Table 2. Effect sizes by study; only fragments are recoverable. Values are listed in source column order. 9. Dunlap et al. (1996): .91, 0.117, .97, 0.047, 3.40, 2.60. 19. Kamps et al. (2006): .60, 0.149, .66, 0.199, 1.93, 0.580. 20. Kennedy et al. (2001): .75, 0.248, .67, 0.364, 1.61, 1.482. Rows without surviving values: 10. Dwyer et al. (2012); 12. Ellis & Magee (1999); 14. Grady & Peck (1997); 21. Kern et al. (2001); 26. Lo & Cartledge (2006); Payne et al. (2007). One unattributed row reported n/a in all six columns.]

Characteristics of the Data
It was not possible to calculate CDC lines for 44 of the AB contrasts because their intervention phases had fewer than five data points. Specifically, there were 20 intervention phases with four data points, 18 with three data points, five with two data points, and one with one data point. Many of the baselines had very unstable trends (e.g., Bessett & Wills, 2007; Broussard & Northup, 1997; Cho & Blair, 2017; Edwards, Magee, & Ellis, 2002; Kern, Ringdahl, Hilt, & Sterling-Turner, 2001).
There were other peculiarities with the data, especially related to measurement and quantity of dependent variables. Some studies had very low baseline levels. These low levels were sometimes related to the dependent variable, such as aggression (e.g., Bessett & Wills, 2007), which tends to be a low frequency, low duration, but high intensity behavior. Some studies had five or fewer episodes of the target behavior during baseline (e.g., Cho & Blair, 2017) or low percentages of intervals, with one average baseline level being 6% (Dejager & Filter, 2015). Three studies had y-axis values of 0, 0.5, 1, 1.5, 2, 2.5, and 3 for episodes of the dependent variable during one-minute observations (e.g., Christensen et al., 2012; Haydon, 2012; Luiselli & Pine, 1999). It is difficult to interpret 0.5 of a disruptive behavior. Another curiosity was the dependent variable of "out of seat" being recorded with frequency instead of duration or interval recording (e.g., Patterson, 2009).

Social Validity
A little less than half the studies (n = 20 [44%]) addressed social validity by having teachers rate their satisfaction with interventions developed from the FBAs. The most common way to assess social validity was through surveys and questionnaires that typically had Likert-scale ratings. Two studies interviewed teachers to determine social validity (Ingram et al., 2005; Moore et al., 2005). Eight (18%) studies included an explicitly stated social validity goal (Broussard & Northup, 1997; Dejager & Filter, 2015; Kamps et al., 2006; Lane et al., 2006; Lane, Weisenbach, Phillips, & Wehby, 2007; Packenham et al., 2004; Shumate & Wills, 2010; Skinner et al., 2009). All but one of these studies (Dejager & Filter, 2015) stated the impact of the intervention related to the stated goal.
In terms of interventionists, teachers were the sole agent in nine studies, researchers in 13, and both teacher and researcher in 16. In the latter, the researcher(s) typically conducted the FBA and/or functional analysis and then teachers were trained to implement the FBA-based intervention. In studies that identified only teachers as the agent, the teachers were trained to conduct both the FBA and the intervention. There were six studies in which the intervention agent was a paraeducator, school staff member, graduate student, or therapist (Bessett & Wills, 2007; Campbell & Anderson, 2008; Edwards et al., 2002; Ellis & Magee, 1999; Hansen et al., 2014; Haydon, 2012).

Discussion
The present meta-analysis reviewed the literature on the use of FBAs with youth in kindergarten through 12th grade who either had high incidence disabilities or were at risk and displayed challenging behaviors in classroom settings during academic-related tasks/activities. Calculated effect sizes were commensurate with those obtained by other reviewers (e.g., Bruni et al., 2017; Losinski et al., 2014; Miller & Lee, 2013). No previous review calculated CDC lines in order to determine the percentage of systematic change (71% of AB contrasts), and only 44% of studies reported social validity. There were no significant differences in effectiveness of interventions based on whether or not a functional analysis was conducted or whether the controlling function was escape or attention.

Descriptive Analysis
Several conclusions can be reached from the descriptive analysis. First, most participants did not have any disabilities or psychiatric disorders but rather displayed challenging behaviors in the classrooms of general education teachers. These participants were typically males with a mean age of approximately nine years. Second, most of the studies were conducted in general education classrooms during academic lessons, activities, or tasks. Third, the most typical types of behaviors targeted were talking to others, being out of seat/walking around, and those under the category of defiance and refusal to follow directions. Fourth, the identified functions, except in one case, were either attention or escape, and quite elaborate and complicated techniques were used to make these determinations, sometimes for only one participant. Fifth, most interventions based on the FBAs involved teaching replacement behaviors, positive reinforcement for appropriate behaviors and extinction for inappropriate behaviors, rearranging antecedents, DNRA (e.g., giving students whose behaviors were maintained by escape breaks for task completion), and self-monitoring.
The use of FBAs has been considered best practice in schools generally, especially those using multi-tiered systems of support (MTSS), because such systems follow a universal supports paradigm that addresses struggling students regardless of the presence or absence of a disability. However, certain instances, particularly in the United States for students with disabilities served under the Individuals with Disabilities Education Act (IDEA), mandate its use (Individuals with Disabilities Education Act, 2004). Specifically, FBAs must be conducted for behaviors that interfere with the learning environment, for students who are suspended for more than ten school days, when misconduct results in a manifestation determination, or when weapons, drugs, or serious bodily injury are involved (Katsiyannis & Maag, 2001). Nevertheless, as schools continue to adopt MTSS frameworks, they will be required to build capacity among staff in function-based thinking and assessment for students across tiers of support within both general and special education settings.

FBA Effectiveness
All previous meta-analyses obtained effect sizes in acceptable or effective ranges using a variety of calculations, although some showed large variability and heterogeneity (e.g., Common et al., 2017). It is difficult to draw conclusions regarding intervention effectiveness from SCRD effect sizes because of autocorrelation, lack of independence between observations, and the tendency of some calculations (i.e., SMD) to overestimate results (Campbell, 2004; Hershberger, Wallace, Green, & Marquis, 1999; Olive & Smith, 2005).
In the present review, both baseline and intervention data could be characterized as moderately to highly unstable, which complicates the interpretation of effect sizes. Specifically, baselines were characterized by few data points, low numbers/percentages, and highly unstable trends. Collectively, these are serious methodological flaws in interpreting SCRD results (Kazdin, 2010).
Perhaps these data problems reflect studies conducted in natural environments, such as the general education classroom; more stable data and observations may be obtained in studies conducted in more controlled settings such as clinics, residential facilities, or psychiatric hospitals. Lang, Sigafoos, Lancioni, Didden, and Rispoli (2010) reviewed a small set of studies conducted in different settings and concluded that in some instances FBA procedures were similar between settings and in others they were different.

Social Validity
It is somewhat surprising, given the acceptance of using FBAs and interventions generated from them, that less than half the studies (44%) addressed social validity. However, 32% of studies had researchers or graduate students conducting all aspects of the FBA process, which would make assessing social validity moot and also reflects the complexity of the process and the levels of expertise required for implementation. The researcher and teacher collaborated in 13 of the studies, but only six collected measures of social validity. In these instances the teacher implemented the function-based intervention that the researchers developed.
The issue is not that general education teachers are incapable of learning and implementing FBAs (e.g., Maag & Larson, 2004; Moore et al., 2002; Packenham et al., 2004), but whether they will use this methodology consistently, with fidelity, and independently when not being observed, especially if the procedures are perceived to be time-consuming and to require high levels of expertise. For example, Lane, Smither, et al. (2007) engaged in six activities to determine that for one participant the function of his behavior was peer and adult attention: (a) Preliminary Functional Assessment Survey, (b) functional assessment interview with the student, (c) 10 hours of direct observation using an A-B-C approach, (d) teacher-completed Motivation Assessment Scale, (e) teacher version of the Social Skills Rating System, and (f) the School Archival Record Search. These measures would be a daunting undertaking for any teacher, particularly considering that many studies conducted actual experimental manipulations to confirm hypothesized functions using multi-component designs (e.g., Broussard & Northup, 1997; Clarke et al., 1995; Dwyer, Rozewiski, & Simonsen, 2012; Edwards et al., 2002). It is difficult to imagine that school personnel, even school psychologists, would have the expertise and time to engage in these activities for one student.
One positive result related to social validity was that interventions based on FBAs were no more effective when functional analyses were conducted than when only indirect measures were used. Both direct and indirect measures have advantages and disadvantages, but indirect measures, which make conducting functional analyses unnecessary, may be more socially acceptable if teachers were aware of their existence and use. For example, Dufrene, Kazmerski, and Labrot (2017) reviewed the social validity of indirect FBA instruments and concluded that teachers and other school personnel have very limited knowledge of their existence and relevance. This result corroborates the need for more extensive and varied social validity measures to be included in FBA intervention studies for youth with high incidence disabilities or those simply displaying challenging behaviors.
Another mixed conclusion involves the social validity of interventions developed from FBAs. In the present review, except for replacement behavior training and perhaps self-monitoring, the remaining interventions were simple procedures such as rearranging antecedents, providing positive reinforcement, changing assignment difficulty, or giving students who misbehave breaks contingent on completing certain portions of their work. These simple techniques are based on elementary principles of applied behavior analysis that are included in special education teacher preparation programs but are rarely part of the general education curriculum. For example, Merrett and Wheldall (1993) interviewed 176 secondary school teachers regarding their professional training and behavior management. Nearly three-quarters of them were dissatisfied with their preparation in this area, while a majority indicated interest in attending training courses in behavior management. Gebbie, Ceglowski, Taylor, and Miels (2012) found that when classroom teachers of preschool children with disabilities were surveyed, their most frequent request was how to address students' challenging behaviors. Ironically, it is not a stretch to imagine that if more general education teachers in the studies reviewed had possessed these rudimentary behavior management skills, some of the participants identified as displaying challenging behaviors would not even have been included.
A final issue related to social validity was that only surveys and questionnaires, typically involving Likert-type scales, were used, which may limit the amount and type of information obtained (Leko, 2014). Several researchers have expanded on the social validity construct by using qualitative approaches in intervention research such as interviews (e.g., Broer, Doyle, & Giangreco, 2005; Copeland et al., 2004; Gerber & Popp, 1999; Leko, 2014; Lyst, Gabriel, O'Shaughnessy, Meyers, & Meyers, 2005). In the present meta-analysis, researchers and teachers jointly developed and conducted the FBA and subsequent interventions in 16 studies. These studies would have been an excellent format in which to conduct interviews between researchers and teachers.

Conclusion
There is no question that FBAs and interventions created from them are an essential component in the education and treatment of individuals with moderate to severe disabilities. Further, the results of the present review corroborate those obtained by Common et al. (2017) but extend them to the corpus of literature reaching back to the 1980s. It is not surprising that the two main functions obtained from the reviewed studies were escape and attention. The studies were conducted mostly in general education classrooms during academic lessons and tasks, typically during independent paper-and-pencil seatwork. Some students with high-incidence disabilities or those at risk display academic deficits and, therefore, may find instructional lessons and the task demands that accompany them aversive and act out to escape or gain attention from peers. The question remains whether intensive multi-component FBAs such as those conducted in the reviewed studies are necessary and even possible for educators who may lack the expertise for doing so. However, when the interventionist was a teacher or other educational personnel, they tended to find the interventions developed from the FBA acceptable in terms of their social validity.