
A Revised Meta-analysis of the Mental Practice Literature on Motor Skill Learning

Concomitant with the cognitive revolution in psychology has been the resurgence of research on mental practice. As a specific form of practice, mental practice has also been referred to as symbolic rehearsal (Sackett, 1935), imaginary practice (Perry, 1939), covert rehearsal (Corbin, 1967), implicit practice (Morrisett, 1956), mental rehearsal (Whiteley, 1962), conceptualizing practice (Egstrom, 1964), mental preparation (Weinberg, 1982), and visualization (Seiderman & Schneider, 1983). According to Richardson (1967, p. 95), "mental practice refers to the symbolic rehearsal of a physical activity in the absence of any gross muscular movements." Such covert activity is commonly observed among musicians and athletes prior to their performances. For example, a gymnast who imagines going through the motions of performing a still ring routine is engaged in mental practice.

Since the 1930s there have been over 100 studies on mental practice. The specific research question addressed in these studies has been whether a given amount of mental practice prior to performing a motor skill will enhance one's subsequent motor performance. Unfortunately, definitive answers to this question have not been readily forthcoming. Although there are existing narrative (Corbin, 1972; Richardson, 1967a, b; Weinberg, 1982) and meta-analytic (Feltz & Landers, 1983) reviews of the mental practice literature, the conclusions have been contradictory. There is a need, therefore, to conduct a comprehensive review of the mental practice literature using more sophisticated meta-analytic procedures and examining more study features than used in previous studies (e.g., Feltz & Landers, 1983).

MENTAL PRACTICE PARADIGMS

Most experiments on skill acquisition have been variants on a research design which employs four groups of subjects randomly selected from a homogeneous parent population or equated on initial levels of performance. These groups have been (a) mental practice, (b) physical practice, (c) combined physical and mental practice, and (d) no physical or mental practice (i.e., control). Most studies compared the performances (pre-post) of subjects who had received mental practice to those of a control group that had not received mental practice instructions. In the mental practice group the time intervening between pretest and posttest was usually occupied in sitting or standing and rehearsing the skill in imagination for a set amount of time. The members of the no-practice group were simply instructed not to practice the skill physically or mentally during the interval. A more appropriate control has required members of the no-practice group to participate in the same number of practice sessions as the mental and physical practice groups, but with activity that has been irrelevant to the task. Quite often, these groups were also contrasted to a combined mental and physical practice group and a group receiving physical practice. A practice period was then instituted which varied considerably in the number of trials in each session and in the total number and spacing of practice trials. In combined mental-physical practice groups, practice periods involved either

alternating mental and physical practice trials, mentally practicing a number of trials followed by physical practice, or physically practicing a number of trials followed by mental practice. Following this practice period, the subjects' skills were tested under standard conditions to determine whether their performance scores differed as a result of the practice condition administered.

The scope of the present meta-analytic review is considerably broader than in previous reviews. Whereas Feltz and Landers (1983) limited their review to only comparisons between mental practice and no practice, all four groups are compared in the present review. The previous meta-analytic study included only studies that had pretest scores or a control group with which to be compared. By contrast, the present review included only single or multiple group studies having pre and posttest scores. The use of pre-post designs permitted a determination of a change-score effect size for each group examined in this set of mental practice studies.

PREVIOUS REVIEWS

Research studies examining the effects of mental practice on motor learning and skilled performance have been reviewed on a selective basis. The reviews by Richardson (1967a) and Corbin (1972) included from 22 to 56 studies and provided contradictory conclusions. Richardson (1967a) reviewed studies of three types: (a) those that focused on how mental practice could facilitate the initial acquisition of a perceptual motor skill, (b) those that focused on aiding the continued retention of a motor skill,

and (c) those that focused on improving the immediate performance of a skill. He concluded that in a majority of the studies reviewed, mental practice facilitates the acquisition of a motor skill. There were not enough studies to draw any conclusions regarding the effect of mental practice on retention or immediate performance of a task. Five years later, Corbin (1972), who reviewed many other factors that could affect mental practice, was much more cautious in his interpretation of the effects of mental practice on acquisition and retention of skilled motor behavior. In fact, he maintained that the studies were inconclusive and that a host of individual, task, and methodological factors used with mental practice produced different mental practice results.

In a 1982 review of "mental preparation," Weinberg reviewed 27 studies dealing with mental practice. Although Weinberg noted the equivocal nature of this literature, he maintained that the following consistencies were apparent: (a) physical practice is better than mental practice; and (b) mental practice combined and alternated with physical practice is more effective than either physical practice or mental practice alone. The latter conclusion is similar to Richardson's (1967a) cautious inference that the combined practice group is as good as or better than physical practice trials only. Another conclusion reached by Weinberg (1982) was that for mental practice to be effective individuals had to achieve a minimum skill proficiency. However, in their meta-analysis, Feltz and Landers (1983) found no significant differences between

the effect sizes determined for novice and experienced performers.

It is not surprising that, with all of the significant and nonsignificant findings in the numerous mental practice studies, it is exceedingly difficult in these narrative reviews (Corbin, 1972; Richardson, 1967; Weinberg, 1982) to obtain any clear patterns. The insights about directions for future research that were provided in previous reviews by Richardson (1967), Corbin (1972) and Weinberg (1982) were helpful. In the above reviews, however, the conclusions about mental practice effects may have been distorted for one or more of the following reasons: (a) too few studies have been included to accurately portray the overall empirical findings in the area; (b) only a subset of possible studies was included, leaving open the possibility that bias on the reviewers' part may have influenced them to include studies that supported their position, while excluding those that may have contradicted their beliefs; (c) although the reviewers speculated about a range of variables that may influence the effectiveness of mental practice, the style used in these reviews was more narrative and rhetorical than technical and statistical, thus making it difficult to systematically identify the variables; and (d) the reviews have ignored the issue of relationship strength, which may have allowed weak disconfirmation, or the equal weighting of conclusions based on few studies with conclusions based on several studies (see Cooper, 1979). In other words, they had a smaller pool of studies, and at that time, more sophisticated tools for research

integration were not widely available. Thus, some of their conclusions may no longer be tenable.

Given the current confusion that may have resulted from the basic limitations of previous reviews, there is a need for a more comprehensive review of existing research, using a more powerful method of combining results than summary impression. The methodology recommended for such a purpose is meta-analysis, which examines the magnitude of differences between conditions as well as the probability of finding such differences.

AN OVERVIEW OF META-ANALYSIS TECHNIQUES

This section provides an overview of the concept and practice of meta-analysis, the quantitative synthesis of research findings. A brief introduction is followed by a discussion of Cooper's (1984) formulation of the process of integrative research reviewing. The effect size, as popularized by Glass (1976), is next introduced; this measure serves as an index of the effectiveness of mental practice training in our review. An overview of hypotheses tested by statistical methods designed specifically for analyzing effect-size data (e.g., Hedges & Olkin, 1985) concludes the section.

Introduction

"Meta-analysis" (Glass, 1976), or the analysis of analyses, is an approach to research reviewing that is based upon the quantitative synthesis of results of related research studies. Although the idea of statistically combining measures of study outcomes is not new in the agricultural or physical sciences

(e.g., Birge, 1932; Fisher, 1932), this approach was not often used to summarize research results in the social sciences until Glass (1976) proposed the idea of meta-analysis. Glass described meta-analysis as "a rigorous alternative to the casual, narrative discussions of research studies which typify our attempts to make sense of the rapidly expanding research literature" (1976, p. 3). The book by Glass, McGaw, and Smith (1981) presents an overview of the process as it was first conceptualized. In Glass's view, the task of the meta-analyst is to explore the variation in the findings of studies in much the same way that one might analyze data in primary research. Questions of the effects of differences in study design or treatment implementation on study results are addressed empirically. Thus we avoid the practice of eliminating all but a few studies not believed to be deficient in design or analysis, and basing the conclusions of the review on the remaining results.

Some critics (e.g., Eysenck, 1978; Slavin, 1984) have claimed that meta-analysis (as it is generally applied) is little more than the thoughtless application of statistical summaries to the results of studies of questionable quality. In fact, as is true for some published primary research, some published meta-analyses are flawed because of problems in data collection, data analysis, or other important aspects. However, when thoughtfully conducted, a meta-analysis can provide a more rigorous and objective alternative to the traditional narrative review. Additionally, the development of statistical analyses designed

especially for effect sizes makes the thoughtful meta-analysis a necessity rather than an option.

The Integrative Review

Both Jackson (1980) and Cooper (1982, 1984) have conceived of the steps involved in an integrative research review as parallel to those familiar in the conduct of primary research. Cooper (1984) outlines and details five steps in a research review and the "functions, sources of variance, and potential threats to validity associated with each stage of the review process" (1984, p. 12). These five stages are outlined below.

Problem Formulation

At this first stage of the review, the researcher must outline the research questions for the review and the kinds of evidence that should be sought in order to address those questions. Here the reviewer deals with the conceptualization and operationalization of constructs, the specificity versus generality of conclusions to be drawn, and the question of whether to conduct a review which tests hypotheses on the basis of "study-generated evidence" or a review which proposes hypotheses on the basis of "review-generated evidence." Study-generated evidence comprises information about effects examined within studies, such as treatment effects or the relationships of critical subject characteristics to treatment effects. Review-generated evidence concerns effects that cannot be, or usually are not, tested within single studies. For example, evidence about the relationship to study results of features of research design or methodology would be review-generated evidence.

Data Collection

At this stage of the review, the issue is the identification and collection of studies. Cooper details many literature-search procedures, and discusses ways to evaluate their adequacy.

Data Evaluation

This stage of the research involves the accumulation of study results and the "coding" of study features which may later serve as explanations for patterns of study outcomes. During this step, the meta-analyst computes quantitative indices of study outcomes (representing treatment effects, degrees of relationships between variables, or other outcomes) which will later be analyzed. Also at this stage the issues of subject and treatment characteristics and study quality become crucial. Features of the subjects (both experimental and control subjects), the treatments, and the context of the study may be related either purposely or accidentally to study outcomes. Some guidance about which features should be important will come from the problem formulation stage of the review. Important treatment features and subject characteristics that have theoretical importance must be noted for each study in order to examine plausible explanations for differences (or similarities) in study results.

Cooper describes two approaches for evaluating study quality, the "threats-to-validity" approach and the "methods-description" approach. The threats-to-validity approach involves determining whether each study in the review is subject to any of a number of threats to validity (such as those listed by Campbell

and Stanley, 1963), and the methods-description approach involves the description of the features of study design via coding of the primary researchers' descriptions of the methodology of the studies. Clearly, either approach has the weakness that different reviewers may choose to list different threats to validity or methodological features, but the methods-description approach has the advantages of requiring fewer judgments and being more detailed (because finer details of study methods are noted).

Data Analysis and Interpretation

At this stage the reviewer selects and applies procedures in order to draw inferences about the questions formulated at the first stage of the review procedure. Different procedures are available for analyzing measures of effect magnitude such as correlations and standardized mean differences, and for analyzing probability values from independent studies. Different inferences can be based on these two kinds of analyses.

Public Presentation of Results

Finally, the reviewer must prepare the results of the integrative review for public consumption. Here issues of the amount of detail that should be reported about the conduct of the four previous stages are critical. Clearly the inclusion of every detail, regardless of its eventual importance in the findings of the review, is unwise. However, Cooper argues that the omission of details about the conduct of the review constitutes a primary threat to the validity of the review.

Summary

The clarification alone of the process of conducting an integrative review has done much to enable researchers to take a

more rigorous and systematic approach to research reviewing. Even so, in each review there will be special considerations suggested by the nature of the research topic or the data available that do not allow the conduct of such a review to be an automatic, thoughtless process.

Glass's Effect Size

For many years the quantitative summarization of measures of effect magnitude was not possible for much of the research in the social sciences. Glass's popularization of the effect size, or standardized mean difference, as a measure of treatment effect that could be compared across studies using nonidentical instruments or measures, was the breakthrough that allowed the broad application of quantitative research synthesis techniques in the social and behavioral sciences. The effect size for a comparison between the experimental and control groups in a study is the standardized mean difference

    g = (YE - YC) / S,    (1)

where YE and YC are the experimental and control group means, respectively, and S is the pooled within-groups estimate of σ, the common population standard deviation of the scores. (Though Glass proposed using the control group standard deviation as S, Hedges (1981) noted that the pooled standard deviation is a more precise estimate of σ when the assumption of equal population variances is satisfied.)
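As a minimal sketch of this computation (the group summaries below are invented for illustration, not drawn from any study in the review), the effect size with a pooled standard deviation can be calculated as follows:

```python
import math

def pooled_sd(s_e, n_e, s_c, n_c):
    # Pooled within-groups standard deviation for two independent samples.
    return math.sqrt(((n_e - 1) * s_e**2 + (n_c - 1) * s_c**2) / (n_e + n_c - 2))

def effect_size(mean_e, mean_c, s):
    # Standardized mean difference g = (YE - YC) / S.
    return (mean_e - mean_c) / s

# Hypothetical study: experimental mean 24.0 (SD 5.0, n = 20),
# control mean 21.0 (SD 4.0, n = 20).
s = pooled_sd(5.0, 20, 4.0, 20)
g = effect_size(24.0, 21.0, s)
print(round(g, 2))  # 0.66
```

Using the control-group standard deviation alone, as Glass originally proposed, would simply replace `s` with 4.0 here; pooling uses both samples and so is less variable when the population variances are equal.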

The effect size represents the difference between the means of the experimental and control groups relative to the amount of random variation within those groups. Many reviewers discuss values of the effect size in terms of standard-deviation units, in much the same way that a z score or standard score would be discussed. Thus, an effect size of 0.75 indicates that the means of the experimental and control groups differ by three-fourths of one standard deviation. Another way to interpret the effect size is in terms of the performance of an average subject in the control group. An effect size of 0.75 indicates that the treatment implemented raises the score of the average subject three-fourths of one standard deviation.

Statistical Analyses for Effect-Size Data

Glass's Analyses

When Glass proposed using quantitative methods to summarize effect sizes, he argued that the effect sizes could be treated as "typical" data and analyzed using familiar procedures (e.g., ANOVA, regression). The rationale for using such analyses was that the reviewer wanted to examine variation in the results of studies, in much the same way that a researcher might examine differences between subjects in a primary data analysis. Thus, analysis of variance was used to compare results of classes or categorizations of studies, and regression was used to examine the relationships of continuous predictor variables to the study results.

Though many meta-analyses have been based on this approach to summarizing data from series of studies, including our

original review of the mental practice literature (Feltz & Landers, 1983), the approach is problematic because the effect-size "data" (or the correlations or proportions) do not usually satisfy the homoscedasticity assumption required of standard statistical analyses. The variance of the effect-size estimate is inversely related to the size of the sample for which it is calculated (Hedges, 1981), and sample sizes of studies in research reviews often differ by several orders of magnitude. Furthermore, though the influence on decisions of violations of this sort has not been well studied, it seems likely to be associated with serious errors in the significance levels of tests (e.g., t and F tests) based on the analyses (Hedges, 1984). Thus, analyses designed specifically for the examination of effect-size data are to be preferred over the seemingly sensible ad hoc methods used initially.

Analyses for Effect Sizes

Analyses based on sample effect sizes allow inferences about corresponding population parameters. Hedges (1981) noted that g estimates a population effect size, δ, which may be written as

    δ = (μE - μC) / σ.    (2)

The parameters μE and μC are the population means on Y for the experimental and control groups, respectively, and σ is the population standard deviation of the Y scores within the groups of the study. When the reviewer considers a set of k studies, the parameters δ1, ..., δk are the population values about which inferences are made when sample effect sizes are analyzed.

Though there are many similarities between the familiar analyses first employed in meta-analysis (like ANOVA and regression) and the analyses designed specifically for effect sizes, there are also differences. Statistical analyses designed specifically for effect sizes not only avoid the statistical problems of traditional analysis methods, but also provide tests of the adequacy of proposed models for the effect sizes which are not available from traditional methods. Rather than detail the statistical theory for the effect-size analyses, which is presented clearly by Hedges (e.g., Hedges, 1982a, b; Hedges & Olkin, 1985), we outline here the hypotheses that are addressed by the analyses.

Hypotheses for Effect-Size Analysis

The hypotheses appropriate for effect-size data are discussed here for the context of studies comparing one experimental group to a control group on a simple posttest measure. The simplest null hypothesis for effect-size data is that all of the studies are of populations in which there are essentially no treatment effects. This is typically tested in two steps using statistical analyses for effect sizes. First, the hypothesis that all of the studies provide similar or consistent results is tested. This is the model

    δ1 = δ2 = ... = δk = δ,    (3)

where k is the number of studies being summarized. Hedges (1982a) and Rosenthal and Rubin (1982) showed how to test this model using a chi-square statistic with k - 1 degrees of freedom, which provides a test similar to the goodness-of-fit test from a log-linear model. Because this homogeneity test informs the

reviewer about differences in the size of the treatment effect across studies, in a sense it provides a test for a "study by treatments" interaction.

If the results from all of the studies are consistent with the model of a single underlying population effect size (i.e., one treatment effect), the meta-analyst can test whether the value of that single effect (δ) differs from zero. The formal hypothesis to be tested is that δ = 0. A z score is calculated by dividing the weighted mean effect size by its standard error. The test is done by comparing that sample z to a table of standard normal values.

Further Hypotheses

If the test of homogeneity for the effect sizes (model 3) is rejected and the reviewer concludes that the results are "not consistent," many alternative methods of analyzing the effect sizes are available as a next step. Many of these methods are covered in detail by Hedges (1982b, c; 1983) and others (Raudenbush & Bryk, 1985; Rosenthal & Rubin, 1982). The logic behind these alternatives is described briefly.

The goal of the alternative statistical analyses designed for effect sizes is to either "explain," estimate, or identify the sources of variability in study results. Tests for the significance of specific explanatory models are accompanied by tests for the adequacy of those models. Similarly, methods for identifying outliers provide ways to assess the impact of the omission of the outliers on the data analysis.
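The two-step procedure above can be sketched in code. Assuming each effect size gi has a known sampling variance vi (the three effect sizes and variances below are invented for illustration), the homogeneity statistic and the z test of δ = 0 are:

```python
import math

def weighted_mean_effect(effects, variances):
    # Weighted mean effect size, weighting each g by the inverse of its variance.
    w = [1.0 / v for v in variances]
    return sum(wi * gi for wi, gi in zip(w, effects)) / sum(w)

def homogeneity_q(effects, variances):
    # Homogeneity statistic: compare to a chi-square with k - 1 df.
    w = [1.0 / v for v in variances]
    mean_g = weighted_mean_effect(effects, variances)
    return sum(wi * (gi - mean_g) ** 2 for wi, gi in zip(w, effects))

def z_for_mean(effects, variances):
    # z test of delta = 0: weighted mean effect divided by its standard error.
    se = math.sqrt(1.0 / sum(1.0 / v for v in variances))
    return weighted_mean_effect(effects, variances) / se

g = [0.40, 0.55, 0.48]   # hypothetical effect sizes from k = 3 studies
v = [0.05, 0.04, 0.06]   # their (assumed known) sampling variances
q = homogeneity_q(g, v)  # small relative to chi-square(2): results consistent
z = z_for_mean(g, v)     # well beyond 1.96: common effect differs from zero
```

With these invented values the homogeneity statistic is about 0.25 (far below the chi-square critical value of 5.99 at k - 1 = 2 df), so the single-effect model would be retained, and z is about 3.79, so the common effect would be judged different from zero.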

"Fixed-effects" analyses assume that all effect-size parameters are functions of known concomitant variables (study or sample characteristics), and thus can be "explained" in terms of an appropriate statistical model. The model may be a regression-like linear model relating predictors (e.g., sample or study features) to the effect-size outcomes (Hedges, 1982c) or a categorical model (conceptually similar to ANOVA) which posits different population effect sizes for qualitatively different sets of studies.

Other analyses assume that "random-effects" or mixed models are more appropriate for describing effect-size outcomes. The underlying assumption of these methods is that the effect-size parameters vary in much the same way as their sample realizations. The goal of random-effects analyses is to estimate the amount of random parameter variability in a set of outcomes. Mixed models do not obviate the possibility of between-study differences due to fixed factors. Such models simply do not presume that such fixed differences can explain all variability in outcomes. Thus, a reviewer using mixed-model methods might seek to reduce outcome variation via explanatory models but would not expect to eliminate that variation. In this approach tests of model "adequacy" are often accompanied or replaced by estimates of residual variation in effects.

Another approach to the analysis of effect sizes, which is often combined with those mentioned above, involves the identification of outliers, or unusual effect-size estimates. Methods described by Hedges and Olkin (1985) allow the reviewer to locate studies that contribute heavily to the misspecification

of proposed models for differences in effect sizes. The studies from which these estimates arise sometimes differ from other studies in ways that were not coded or thought important during preliminary data evaluation. Sometimes the features of such unusual studies can be included in a model which then explains adequately the pattern of results. Occasionally outliers are eliminated if they result from incommensurate outcome measures or because of problems in effect-size computation. The methods for identifying unusual studies can be used not only to identify problem studies, but also to identify exemplary studies.

THE NEED FOR THE PRESENT STUDY

This reanalysis of the mental practice literature will be valuable for several reasons. First, the analysis will improve upon the earlier review by expanding the set of studies investigated to include those examining a treatment featuring combinations of mental and physical practice. The Feltz and Landers (1983) meta-analysis examined only the comparison of mental practice to no practice at all.

Second, our present study will improve upon the earlier review by Feltz and Landers (1983) by using modern statistical analyses for effect sizes. Feltz and Landers employed the meta-analysis strategy initially proposed by Glass, which is problematic both because of the violation of the assumption of homogeneity of variances discussed previously, and because of the inability of this strategy to assess the adequacy of the models

for differences in effect sizes. We will use the methods described by Hedges and Olkin to avoid these problems. Furthermore, we will use the methods described by Hedges and Olkin for identifying outliers or unusual studies to pinpoint very large effect sizes. Thus, we will be able to select studies that show particularly strong mental-practice or combined mental and physical practice effects, which might serve to identify problem studies or exemplars for the design of mental-practice interventions.

Our reanalysis will also use a slightly modified version of Glass's effect size as a measure of the effectiveness of mental practice training. In their previous review, Feltz and Landers (1983) used the typical experimental versus control effect size, contrasting motor-skill performance between mental-practice and control groups. In our reanalysis, we will use separate effect sizes for the mental practice and control groups (as well as for combined mental and physical practice groups) to represent change in motor skill performance. The use of this "difference-score effect size" (discussed in more detail below) will enable us to estimate not only the difference in performance due to the mental-practice intervention, but also the amount of change that would be expected for groups receiving no training or a combination of mental and physical practice. Thus, our overall null hypothesis will be that all studies show on average the same degree of change in motor skill for the mental practice, physical practice, combined, and control groups.
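The difference-score index described above yields one effect size per group rather than one per experimental-control contrast. A small sketch (with invented pretest and posttest means, not values from the reviewed studies) shows how each of the four conditions would receive its own change-score effect size:

```python
def change_effect_size(pre_mean, post_mean, pre_sd):
    # Difference-score effect size: pre-to-post change in pretest SD units.
    return (post_mean - pre_mean) / pre_sd

# Hypothetical means on the same motor-skill test (pretest SD = 10)
# for the four practice conditions in a single study.
effects = {
    "mental":   change_effect_size(50.0, 57.5, 10.0),  # 0.75
    "physical": change_effect_size(50.0, 62.0, 10.0),  # 1.20
    "combined": change_effect_size(50.0, 64.0, 10.0),  # 1.40
    "control":  change_effect_size(50.0, 51.0, 10.0),  # 0.10
}
```

In such a layout the null hypothesis stated above amounts to the claim that, across studies, the four entries would on average be equal; any excess of the mental-practice change over the control change would measure the benefit of mental practice itself.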

METHOD

In this section, we detail the methodology for our meta-analysis. The first section details the literature search procedures used to identify our collection of studies. Next, the definition and computation of effect-size measures, and the coding of study features, are discussed. The remaining section deals with the analysis of our effect-size data. We discuss the comparisons of practice paradigms to be made, as well as other discrete (grouping) variables that may be related to the amount of change in motor skills. Then we present a rationale for the investigation of several continuous predictor variables for the amount of change in motor skill performance. Finally, we discuss our rationale and methodology for the examination of outliers.

The Collection of Studies

Study sources were obtained from the Feltz and Landers (1983) review and from a manual search of the literature subsequent to 1982. From this search we identified 50 unpublished sources, 48 of which were obtainable, and 48 published sources, all of which were obtainable. This resulted in a total of 96 distinct sources that were retrieved and identified as having examined the effects of some form of mental practice on motor performance. Each article was then read, effect-size measures were extracted where sufficient data were provided, and relevant study features were coded. This procedure produced 55 studies from which effect sizes could be obtained. Of the 41 studies that could not be used, 37 did not report enough

information on which to calculate effect sizes and four were not relevant to the purpose of this review.

Definition and Computation of Effect-Size Measures

Notation for a Series of Studies

Consider a series of k studies, each examining the treatment effect in one or several samples. Let Xijl and Yijl be the pretest (X) and posttest (Y) scores, respectively, for the lth person in the jth sample of the ith study. If a study examines the pretest and posttest motor-skill performance of subjects in mental-practice and control groups, it has two independent samples. Denote Ji as the number of independent samples in the ith study and nij as the sample size in the jth sample of study i, and assume that in sample j of study i, Xijl and Yijl are independently normally distributed with means μXij and μYij and with variances σ²Xij and σ²Yij, respectively. Thus

    Xijl ~ N(μXij, σ²Xij), and
    Yijl ~ N(μYij, σ²Yij),

for l = 1, ..., nij, j = 1, ..., Ji, and i = 1, ..., k.

The Difference-Score Effect Size

We define the difference-score effect size as the difference between the posttest and pretest means for a single sample, divided by the pretest standard deviation.

We write

    gij = (Yij - Xij) / Sij,    (4)

where Yij and Xij are the posttest and pretest means, respectively, and Sij is the pretest standard deviation in the jth sample from the ith study.

We define the difference-score effect size in the metric of the pretest scores for two reasons. The primary reason is interpretability. By dividing by a standard deviation of scores (rather than of change scores) we obtain an effect size in score units. Thus a difference-score effect size of 0.75 for a mental practice group indicates that the average subject in that group increased his or her performance by three-fourths of one standard deviation. If the skill in question were basketball jump shots, and the standard deviation of the number of pretreatment shots made was 10, then the average change is easily seen to be 7.5 additional shots made.

The second reason is that the pretest standard deviations would not be influenced by the treatments. They should be roughly equivalent across groups within studies, assuming that subjects were randomly assigned to groups; thus large difference-score effects should not result from decreased variation in scores in groups where the treatment may have affected score variability (note the influence of σ² on the variance of g).

The sample change-score effect size, gij, estimates a population effect size, ξij, which may be written as

    δ_ij = (ν_ij − μ_ij) / σ_ij                (5)

As above, μ_ij and ν_ij are the population means on X and Y for the jth sample in the ith study, and σ_ij is the population standard deviation of the X scores within the jth sample of study i. Below we will see that the sampling distribution for the effect size is greatly simplified if we assume that σ_ij² = σ̃_ij². Inferences are made about the k parameters δ_1, ..., δ_k when sample effect sizes or their significance values are analyzed.

Computation of Effect Sizes

Most studies provided the pretest and posttest means and the pretest standard deviation needed to compute the effect size directly, as shown in equation 4. Effect sizes were computed for as many distinct control, mental practice, physical practice, or combined mental/physical practice groups as were examined. Thus, a single source could provide any number of difference-score effect sizes. In the present review, the maximum number of effect sizes from any one source was 96 (Wills, 1966). The Wills (1966) study measured 8 outcomes for 12 independent samples of subjects. When several outcomes were studied, when single outcomes were scored in more than one way (e.g., in terms of both speed and accuracy), and when multiple test trials were reported, we computed several (dependent) effect sizes for each group. (No interdependent data are combined in our analyses, however.) When raw means and pretest standard deviations were not reported, effect sizes were computed in other ways. In two studies, the posttest standard deviation replaced the pretest

standard deviation in the computation of g. In some cases, gain-score standard deviations (S_g) or t tests for change in performance were reported. In these cases g was computed via a simple algebraic identity,

    g = (Ȳ − X̄) √(2(1 − r)) / S_g,

using the gain-score standard deviation, and via

    g = t √(2(1 − r) / n),

where n is the sample size of the group and t is the t test for change for only that group. (Note that the square root of the corresponding change-score F could also be used in place of t here.) The correlation r represents the correlation between X and Y, the pretest and posttest measures. Values of r were not generally reported and thus had to be obtained from a subset of studies which reported either the pre-post correlation or raw data (which allowed computation of r). The values of r used for the four treatment groups were r = +.69 for the control groups, r = +.64 for the mental practice groups, r = +.20 for the physical practice groups, and r = +.16 for the combined mental/physical groups. These values are the median correlations retrieved from the subset of studies which reported r or raw data. Each median r is based on a set of between seven and 10 correlations. The values of the median correlations suggest that the pretest-posttest correlation is quite strong for control and mental-practice groups. Where some intervening physical practice has taken place, the relationship is weaker; the correlations for physical and combined groups are less than one-third the size of the control and mental practice correlations.

We also computed some effect sizes by approximating the value of S_g with the pooled within-groups mean square from a gain-score analysis of variance. Thus, with this method, we used the same standard deviation for all groups resulting from one article or study. Our formula for g was

    g = (Ȳ − X̄) √(2(1 − r)) / √MS_W.

Preliminary analyses indicated, however, that effect sizes computed using this approach were systematically larger than effect sizes from studies similar in other respects. This may have resulted from between-group differences in variation, or pretest versus posttest differences in variation, which could not be detected (because the necessary variances were not reported). Six studies with effect sizes computed via this method were eliminated from further statistical analysis.

Variance of the Effect Size

Hedges (1981) presented asymptotic distribution theory for Glass's estimate of effect size. The gain-score effect size has a similar distribution. The gain-score effect size is biased, but an unbiased estimate of the population value is computed as

    d = c(n − 1) g,

where c(m) = 1 − 3/(4m − 1), and the variance of d is approximately

    v = 2(1 − r)/n + d²/(2(n − 1)).

Again r is the estimated pre-post correlation and n is the sample size. The estimate d is asymptotically normal with an expected value of δ, the population difference-score effect size, and a variance given by v. Analyses of our difference-score effect sizes are based on those described in detail by Hedges (e.g., Hedges, 1982; Hedges & Olkin, 1985).

Coding of Study Features

Numerous study characteristics were coded for the 55 studies in the final collection. Table 1 presents a list of the study features used in our analyses. These study features are the same as those used by Feltz and Landers (1983) with the exception of subject's sex and design characteristics, as well as categories of open/closed skills. Subject's sex was not found to be important in moderating the effect of mental practice and was, therefore, not coded in our review. Because difference-score effect sizes were computed in our analysis, the design characteristics used by Feltz and Landers were not appropriate.

Types of Comparisons

Our primary comparison of interest was among the treatment groups or different types of practice. It has been theorized

that combined mental and physical practice is better than either physical practice or mental practice alone (Corbin, 1972). However, this comparison has not yet been made within a meta-analysis. In addition, as was done in the Feltz and Landers (1983) review, comparisons were made by task type, publication status, subject experience, and time of posttest. Comparisons that had not been made previously were between studies using different types of dependent measures and between studies using subjects with different levels of imagery ability.

The continuous predictor variables that were investigated were number of practice sessions and number of practice trials per session or length of each practice session in seconds. Some researchers have suggested that the greater the number of mental rehearsals, the greater the effect on performance (Sackett, 1935; Smyth, 1975), whereas others have suggested that there may be an optimal number of practice sessions and length of practice at which mental practice is most effective (Corbin, 1972; Twining, 1949). Feltz and Landers (1983) found no linear or curvilinear relationship between number of practice sessions and effect size; however, they did find curvilinear relationships between length of practice and effect size. Unfortunately, they were not able to determine, statistically, whether other variables (e.g., task type) moderated these relationships.

Rationale and Methodology for Outliers

Outliers were examined in the first step of the data analysis to identify unusual studies that could bias subsequent results. Confidence intervals were computed and plotted for each effect size. Unusual results were identified by examining the

confidence interval plots for the separate treatment groups. Studies identified were then re-read to determine any unusual features. On the basis of this preliminary analysis, six studies that had effect sizes computed by approximating the value of S_g with the pooled within-groups mean square were eliminated from further analysis. One study (Corbin, 1966) was eliminated because the pretest task was different from the posttest task. In addition, the Kelsey (1961) study was eliminated because it was the only study that measured muscular endurance; consequently, the physical practice sample in this study had extremely high effect sizes.

RESULTS

Overall Test of Homogeneity

From the 55 studies in which effect sizes were computed, 48 were used in our meta-analysis. These 48 studies had examined change in motor skills for 223 separate samples. A summary of the characteristics for these studies is presented in Table 2. Included in this table is an indication of random assignment of subjects to groups, whether pretreatment group differences existed, and how effect sizes were computed.¹

We first tested the consistency of change in motor skill across the 223 samples. The overall homogeneity test value, H_T, was 788.32, which, as a chi-square variable with k − 1 = 221 degrees of

¹ The effect sizes for these studies can be obtained by writing the first author.

freedom, is quite large (p < .001). All the change-score effect sizes cannot be represented by one population parameter. This does not seem surprising, since the biased (uncorrected) effect sizes range from −0.38 to 13.91. The weighted average effect size for all studies is estimated to be 0.43 standard deviations, which differs from zero (p < .05). This value represents the average change effect from pre- to posttest across all types of practice treatments. The value is just slightly lower than the unweighted average effect size (0.48) reported by Feltz and Landers (1983), which was computed using the mental practice versus control means rather than computing difference-score effect sizes.

Categorical Comparisons

We next grouped the effects according to treatment group or type of practice. Table 3 shows the homogeneity statistics obtained for this categorical analysis and the overall homogeneity test (Hedges, 1982b). An overall test of the within-groups homogeneity, H_W, is the sum of the homogeneity values for each subgroup. Its value, 668.69, is significant at the .001 level (df = 218). Thus, there is still considerable variation in the sizes of change over practice within the treatment groups. The results within the four treatment categories are also not homogeneous.

The test for differences among mean effect sizes for the treatment groups is given by H_B, which is also a chi-square variable, with 3 degrees of freedom. We conclude that the four

sets of pre-post differences have different population effect sizes, since H_B = 119.63 is significant. Mean change differences for all of the treatment groups were significantly greater than zero, with physical practice showing the greatest change effects (0.79) and, as we would expect, the control groups showing the smallest change effects (0.22). The average weighted change-score effect size for mental practice groups (0.47) is very close to the unweighted effect size reported by Feltz and Landers (1983). Contrary to what has been previously theorized in the literature (Corbin, 1972), combined mental and physical practice does not appear to be more effective than either mental or physical practice alone.

We next subdivided the different treatment groups according to task type, since this was the categorical variable that Feltz and Landers (1983) found to be most significant in differentiating effect sizes. The task-type categories were motor tasks, cognitive tasks, and strength tasks. The homogeneity statistics for task type divided by treatment group are shown in Table 4. An inspection of Table 4 indicates that most of the variation in effect sizes occurs with the motor tasks. The overall test of within-groups homogeneity is significant, H_W(df = 155) = 547.74, as are the tests within the four treatment categories.

Since grouping the studies by task type for four treatment groups did not fully explain the variation in pre/post differences, we explored the use of another study feature, type of dependent measure used, as a grouping variable for motor-type tasks. The dependent measure categories were accuracy, speed, form, distance, and time on target or in balance. The

homogeneity statistics for measure type by treatment group are shown in Table 5. It appears that most of the variation in effect sizes for motor tasks is from studies using measures of accuracy or time on target/in balance.

Analyses Using Continuous Predictors

In order to determine the influence of number of practice sessions and length of practice per session, we conducted separate regression analyses for each predictor variable for each of the four treatment groups. In each regression analysis, we tested for (a) overall significance of the regression model using four polynomial predictors (linear, quadratic, cubic, and quartic), (b) the fit of the regression model (analogous to the H_W homogeneity tests), and (c) Z tests for significance of individual predictors. Table 6 contains the summary statistics for these analyses. For the number of practice sessions variable, the overall models were significant for the mental practice, physical practice, and combined practice groups, but the chi-squares for model fit were also significant, indicating a large amount of error in the models. For the length of practice per session variable, which was measured in terms of number of practice trials, the overall models were significant for the control, mental practice, and physical practice groups, with the control group having the only nonsignificant chi-square for model fit. Although the control group regression analysis was significant and showed good fit, none of the individual polynomial predictors was significant using a Z test. This may be due to the multicollinearity among

the predictors. Thus, unlike Feltz and Landers (1983), who found a curvilinear relationship between length of practice and effect size, we found no linear or curvilinear relationships between the continuous variables measured and effect size.

Discussion

Comparing across all types of tasks and practice conditions used in the 48 studies reviewed, the results of the meta-analysis showed that the average difference in effect size from pretest to posttest was 0.43 standard deviations (p < .05). Likewise, the average effect size for mental practice was 0.47 (p < .05). The overall learning, as indicated by the magnitude of the difference in pretest to posttest effect sizes, is of similar magnitude to the overall mental practice effect size (0.48) reported by Feltz and Landers (1983). Regardless of whether the effect size was computed using mental practice versus control (Feltz & Landers, 1983) or computed using change-score effect sizes, the resulting effect sizes represent approximately one-half a standard deviation. Considering the marked differences in types of tasks, ages, backgrounds of subjects, and research designs/methodologies employed in the studies subjected to meta-analysis, it is clear that: (a) mental practice does facilitate learning, (b) these results are replicable, and (c) they have surprisingly good generality.

When the overall effect sizes were broken down to examine the moderating variables of task type and type of dependent measure, most of the variation was found in tasks that predominantly involved accuracy or tasks that were primarily "motor" in nature

(versus cognitive and strength). The failure to find variation for strength and cognitive tasks, as well as for speed, distance, time-on-target/in-balance, and form-dependent measures, was most likely due to the insufficient number of samples in some practice conditions (N < 5). Examination of the categorical comparisons of practice conditions for the motor and accuracy tasks showed that the learning associated with mental practice was twice as great as that achieved from the minimal (but significant) learning demonstrated by the subjects in the no-practice (control) condition. Compared to physical practice, however, mental practice was 41-45% less effective. These results support the general findings in the literature that physical practice is a more effective learning strategy than mental practice (Weinberg, 1982). Although some learning was achieved by the control subjects, it was 71-73% less than that achieved through physical practice.

Of particular interest in the present meta-analytic review were the categorical comparisons for the combined practice condition. Previous reviewers (Richardson, 1967; Weinberg, 1982) have maintained that a combination of mental and physical practice "is more effective than either physical practice or mental practice alone" (Weinberg, 1982, p. 203). Richardson (1967a) is much more cautious, suggesting only a trend for the motor performance of combined practice to be "as good or better than physical practice trials only" (p. 103). These conclusions were not supported by the findings of the meta-analysis. Where

the number of effect sizes was sufficient for legitimate statistical comparisons to be made,² the results showed that the effect sizes for combined practice were always less than those for physical practice. For the effect size summed across types of tasks, as well as the effect sizes for motor and accuracy tasks, the combined practice was, respectively, 22%, 8%, and 27% less than that achieved by the exclusive employment of physical practice.

It appears that overall there is a reduction in performance efficiency when physical practice is replaced by mental practice. However, there are times when such a loss may be acceptable or even desirable. For example, for some motor or accuracy tasks for which actual physical practice may be expensive, time-consuming, physically or mentally fatiguing, or potentially dangerous, the small decrements in performance resulting from combined practice may make it an effective teaching-learning strategy, since its effects are nearly as good as physical practice with only half the number of physical practice trials. With only one exception (Oxendine, 1969), most of the combined practice consisted of a 50:50 ratio of physical practice to mental practice trials. In Oxendine's (1969) study, only one of the three tasks examined showed differences among the following ratios of physical practice to mental practice trials: 8:0, 6:2, 4:4, and 2:6. The 8:0 and 6:2 ratios had the greatest improvement in time-on-target scores, with means of 4.37 and 4.43,

² For task measures of time-on-target/in balance, combined practice actually had a larger difference-score effect size than either physical or mental practice. However, this finding is of questionable significance due to the relatively small number of samples and a much larger standard error of measurement.

respectively. With fewer physical practice trials, the scores were considerably less (i.e., 3.98 for the 4:4 ratio and 2.94 for the 2:6 ratio). Although much more research is needed to confirm these findings, it appears that the conclusions of Richardson (1967a) and Weinberg (1982) may be valid, but only if the ratio of physical to mental practice trials is at least 75:25.
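The fixed-effects computations that underlie the results reported above can be summarized in a short sketch: the bias correction d = c(n − 1)g, the approximate variance v of d, the inverse-variance weighted average effect size, and the chi-square homogeneity statistic. This is a minimal illustration of our own (function names and sample values are hypothetical, not data or code from the studies reviewed):

```python
def unbiased_d(g, n):
    """Bias-corrected gain-score effect size d = c(n-1) * g,
    where c(m) = 1 - 3 / (4m - 1)."""
    m = n - 1
    return (1 - 3 / (4 * m - 1)) * g

def variance_d(d, n, r):
    """Approximate variance of d: v = 2(1-r)/n + d^2 / (2(n-1)),
    where r is the estimated pre-post correlation."""
    return 2 * (1 - r) / n + d * d / (2 * (n - 1))

def weighted_mean_and_homogeneity(ds, vs):
    """Inverse-variance weighted mean effect size, and the
    homogeneity statistic H = sum((d_i - dbar)^2 / v_i), which is
    compared to a chi-square distribution with k - 1 df."""
    weights = [1 / v for v in vs]
    dbar = sum(w * d for w, d in zip(weights, ds)) / sum(weights)
    H = sum((d - dbar) ** 2 / v for d, v in zip(ds, vs))
    return dbar, H

# Hypothetical mental-practice samples as (g, n) pairs, using the
# median pre-post correlation r = .64 reported for that group.
samples = [(0.50, 20), (0.40, 15), (0.62, 30)]
ds = [unbiased_d(g, n) for g, n in samples]
vs = [variance_d(d, n, 0.64) for d, (_, n) in zip(ds, samples)]
dbar, H = weighted_mean_and_homogeneity(ds, vs)
```

Here dbar plays the role of the weighted average change-score effect size for a treatment group, and H is the analogue of the H_T, H_W, and H_B statistics in Tables 3-5, referred to a chi-square distribution with k − 1 degrees of freedom.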