March 25, 2010
What are the major issues for state-level policymakers in using value-added teacher effectiveness measures?
Value-added models for estimating the effects of teachers, schools, or districts on student learning hold some potential for enhancing efforts at assessment and accountability. The complexity of designing effective models, however, makes implementing a model challenging for policymakers. Several outstanding areas of investigation remain for developing value-added models, including issues of test design, data system requirements, and model selection.
Value-added models are statistical models for estimating the contribution of teachers, schools, or districts to changes in measures of student achievement. These statistical models are complex regressions that can model the effects of many independent variables. The models may allow researchers to disentangle teacher and school effects from other factors that influence student learning, especially individual student attributes such as prior achievement or family characteristics.
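To illustrate the basic idea, the sketch below fits a simple covariate-adjustment regression to synthetic data. The variable names, data, and specification are hypothetical and far simpler than any operational value-added system; the teacher indicators capture each teacher's average contribution to current scores after adjusting for prior achievement and one student characteristic.

```python
# Illustrative sketch only: a simple covariate-adjustment value-added regression
# on synthetic data. Variable names and the specification are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600
teacher = rng.integers(0, 20, n)              # 20 hypothetical teachers
prior = rng.normal(50, 10, n)                 # prior-year test score
frl = rng.integers(0, 2, n)                   # example student characteristic
true_effect = rng.normal(0, 2, 20)            # simulated teacher contributions
score = 5 + 0.9 * prior - 1.5 * frl + true_effect[teacher] + rng.normal(0, 5, n)

df = pd.DataFrame({"score": score, "prior": prior, "frl": frl, "teacher": teacher})

# Teacher indicators estimate each teacher's contribution to current scores,
# net of prior achievement and the student covariate.
model = smf.ols("score ~ prior + frl + C(teacher)", data=df).fit()
print(model.params.filter(like="C(teacher)").head())
```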
Some recent research suggests that teachers have significant effects on student achievement, and that these effects may persist over time. Value-added methods estimate the size of such effects using test score growth rather than score levels. This approach may be useful because some research suggests that changes in a student's test scores over time may be less sensitive to student characteristics than is the student's test score level at any one time. As a result, value-added methods may be able to reduce or even eliminate the confounding effects of these student characteristics on measures of student achievement. Some advocates have suggested that the potential advantages of value-added methods include improved accuracy and fairness of school and teacher evaluation systems and reduced incentives to “game” those systems as currently designed. Others cite the potential for value-added methods to handle problems that might otherwise threaten the validity of simpler student achievement measures, such as decays in student learning or the provision of extra resources to high-achieving students (Toch 2005; Meyer and Christian 2008).
Questions remain, however, as to whether value-added evaluation can actually deliver on these promises. The models can be difficult to understand for those without some background in statistics. Additionally, proprietary models and data systems developed and implemented by private firms are not always fully accessible to educators or the public at large. Implementation difficulties, including issues of test design, data system requirements, and model selection, can be substantial. Perhaps most importantly, it is not clear whether the statistical methods currently available can separate teacher effects on student achievement from the many other influences on it. In addition, there is currently no evidence-based link between statistically estimated teacher or school effects and teacher attributes or instructional practices. Value-added models, even when they accurately identify high-performing schools or teachers, cannot yet tell us what these individuals or organizations are doing to produce that performance.
Construct Validity. Value-added evaluation requires tests that are accurate measures of student achievement. If the test covers some standards poorly, then teachers who target those standards can be disadvantaged by value-added measures. On the other hand, teachers who focus only on the standards addressed by the test may leave gaps in student ability that will show up in later grades. Value-added measures, therefore, can be sensitive to the nature of the tests used (Braun 2004; Braun et al. 2010).
Vertical Scaling. Some types of value-added models require that test scores be “vertically scaled” so that students’ scores are comparable between years. Because proper vertical scaling involves careful test design and thorny psychometric issues, it can itself be difficult to implement. According to Richard J. Patz of CTB/McGraw Hill, developing vertically scaled tests requires several elements: vertically aligned content standards with considerable grade-to-grade overlap and a systematic, intentional increase in difficulty, as well as psychometric test development and data collection approaches designed to support linking scores across grades.
Since test companies are not always willing to make their test design methods fully transparent, the reliability of vertical scaling can be hard to assess.
Other test-related issues can also influence student achievement and teacher effect estimates, including changes in test timing and frequency, alternative weightings of topics, and differences in scaling (Patz 2007; McCaffrey et al. 2003; Braun et al. 2010).
Test Scores as Measures of Student Achievement. Two additional problems may thwart attempts to use test score gains as measures of student achievement that are truly independent of initial ability levels. First, insufficient variability of test scores across students and over time can reduce the reliability of value-added teacher effect measures. If, for example, there is little change in the test score ordering across students between successive test offerings, then growth measures (the difference in test scores) may provide little additional information about student achievement compared to simply looking at score levels.
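One way to see this first problem is through the classical test theory formula for the reliability of a difference score. The sketch below is a generic illustration with invented numbers, not drawn from the cited reports: when two individually reliable tests are highly correlated across students, the gain score between them can be far less reliable than either test alone.

```python
# Illustrative only: classical-test-theory reliability of a gain (difference)
# score. The numbers below are hypothetical.
def gain_reliability(sd_pre, sd_post, rel_pre, rel_post, corr):
    """Reliability of (post - pre) given each test's SD, each test's
    reliability, and the correlation between the two administrations."""
    cov = corr * sd_pre * sd_post
    true_var = rel_pre * sd_pre**2 + rel_post * sd_post**2 - 2 * cov
    total_var = sd_pre**2 + sd_post**2 - 2 * cov
    return true_var / total_var

# Two reliable tests (0.90 each); when students keep nearly the same rank order
# across administrations (correlation 0.85), the gain score is far less reliable.
print(round(gain_reliability(10, 10, 0.90, 0.90, 0.85), 2))  # 0.33
```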
Second, changes in test scores across students of different initial abilities are not necessarily equivalent in terms of the amount of teacher effort or quality needed to produce the gains. For example, the level of effort required to move a student from, say, 35% correct responses to 45% may not be the same as the level of effort required to move a student from 85% to 95%. If this is the case, then the relationship between teacher effectiveness and changes in test scores will still be dependent on initial student ability level.
Data Requirements. Value-added evaluation requires individual student data that are linked over time and linked to each teacher. The models require at least two years of linked data (and some versions require at least three years) to conduct teacher evaluations. This implies that substantial amounts of data and extensive computing capacity can be required to model specific teacher effects. Additionally, certain types of value-added models (though not all) require the data to be scaled or transformed to allow comparisons of student progress expectations across teachers, schools, or districts that are independent of students’ beginning achievement level (Goldschmidt 2008; McCaffrey et al. 2003; Sanders et al. 2009).
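As a purely hypothetical illustration of what such linked data look like in practice, each record below ties a student's score in a given year to the teacher responsible for that subject, which is what allows growth to be computed and attributed.

```python
# Hypothetical example of the linked, longitudinal records value-added models
# need: each row joins a student, a year, a teacher, and a score.
import pandas as pd

records = pd.DataFrame({
    "student_id": [101, 101, 101, 102, 102, 102],
    "year":       [2007, 2008, 2009, 2007, 2008, 2009],
    "teacher_id": ["T1", "T4", "T7", "T2", "T4", "T8"],
    "math_score": [420, 448, 469, 455, 471, 502],
})

# Growth is the within-student change from one year to the next; the link to
# teacher_id is what lets that growth be attributed to a teacher.
records = records.sort_values(["student_id", "year"])
records["growth"] = records.groupby("student_id")["math_score"].diff()
print(records)
```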
Missing Data. At this point, little is known about the effects of missing student achievement or student-teacher link data on value-added teacher effect estimates. Missing data can be a serious problem because poorer and lower-ability students are often more likely to miss exams or move away from a district. In such cases, simply excluding the potentially non-random sample of students who have missing data from the larger sample can result in biased estimates of teacher effects (McCaffrey et al. 2003).
Access to data can also be an issue when value-added modeling is performed by private firms. In Tennessee, for example, the data used for value-added assessments are maintained entirely separately from the state’s longitudinal data system.
One key statistical modeling issue is what assumptions are made about the nature of teacher effects, since different assumptions may produce different teacher effect estimates. For example, some types of models are good at measuring the effectiveness of teachers close to average performance levels, but perform less well than other types at measuring the effectiveness of very good or very poor teachers (“outliers”). Terry Hipbshman of the Kentucky Education Professional Standards Board suggests that there is likely to be a tradeoff between the accuracy and precision of estimates and the mathematical and computational complexity of the model (Hipbshman 2004; McCaffrey et al. 2003).
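The sketch below illustrates one such modeling choice on made-up data, contrasting teacher fixed effects with random effects that “shrink” estimates toward the average for teachers with few students. It is a generic illustration, not the specification used by any particular system.

```python
# Generic illustration of fixed- vs. random-effect teacher estimates on
# simulated data; not the model used by any operational system.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for t in range(15):
    n_students = rng.integers(5, 30)          # class sizes vary
    effect = rng.normal(0, 2)                 # simulated teacher contribution
    prior = rng.normal(50, 10, n_students)
    score = 0.9 * prior + effect + rng.normal(0, 6, n_students)
    rows.append(pd.DataFrame({"teacher": t, "prior": prior, "score": score}))
df = pd.concat(rows, ignore_index=True)

# Fixed effects: one free parameter per teacher.
fixed = smf.ols("score ~ prior + C(teacher)", data=df).fit()
print(fixed.params.filter(like="C(teacher)").head(3))

# Random effects: teacher effects drawn from a common distribution, so
# estimates for teachers with few students are pulled toward the overall mean.
mixed = smf.mixedlm("score ~ prior", data=df, groups=df["teacher"]).fit()
for teacher_id, effect in list(mixed.random_effects.items())[:3]:
    print(teacher_id, float(effect.iloc[0]))
```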
Certain types of models may also be more or less sensitive to sampling errors. As Dale Ballou argues, because teachers are not assigned a random sample of students, student achievement can differ across teachers for reasons that are unrelated to teacher performance. Michael Hansen, former REL Appalachia field representative to the Kentucky Department of Education, believes that the lack of information about the student-teacher matching process (how much influence do teachers have over where they teach, or parents over which teachers teach their children?) is the biggest barrier to obtaining valid value-added measures of teacher effectiveness. One consequence of small sample sizes is teacher effectiveness estimates that jump around from year to year (see Ballou 2008; Braun et al. 2010; Hansen 2007; Sawchuk 2009; Rothstein 2009).
A second important issue involves choosing the set of variables to control for. This issue is crucial, argue researchers at RAND Education, especially when different schools serve distinctly different student populations. Such differences, if they influence both achievement levels and growth, can lead to biased estimates of teacher effects if measures of them are left out of the statistical model. Additionally, classroom effects (e.g., the presence of disruptive students in a class), school effects, district effects, the effects of prior teachers, and peer learning effects can all be very difficult to disentangle from the effects of current individual teachers. Hipbshman concludes that some types of models that do not include student or school-level control variables may nonetheless be able to identify teachers at the extremes of the effectiveness distribution, although they might be biased in favor of teachers who work with more advantaged populations (McCaffrey et al. 2003; Hipbshman 2004).
Perhaps the most prominent current example of the use of value-added methods in the U.S. is the Tennessee Value Added Assessment System (TVAAS). In 1992, Tennessee began rating each of its schools based on value-added measures. The system is now run by SAS Institute of Cary, NC. SAS provides the state with three services through TVAAS: value-added scoring (teacher effects), a growth model for use in meeting NCLB accountability reporting requirements, and predictive information on students (ACT scores, for example).
The basic statistical model used in TVAAS is a multivariate (it simultaneously fits the entire set of observed test scores belonging to each student), linear, mixed (multilevel, estimating both “fixed” school effects and “random” teacher-level effects), longitudinal (the data are repeated observations on individual students) regression model. This model requires that the data be scaled or transformed to allow valid comparisons of achievement growth for students with differing initial achievement levels. According to the model’s developer, William Sanders, these data transformation and estimation methods allow students to serve as their own controls and eliminate the need to control for socioeconomic status and other student characteristics. Many researchers and policymakers are skeptical of these claims, and SAS and Sanders have at times been criticized for being less than fully transparent about exactly how their methods are carried out (Sanders et al. 1997; Sanders et al. 2009).
To date, Tennessee has not required that the system be used for formal teacher evaluation. The state’s governor, Phil Bredesen, is attempting to change state law to facilitate incorporation of TVAAS into teacher evaluation. As mentioned, TVAAS has been used as the basis for a school ranking system based on student achievement growth. The transition to this system from the older accountability system based on achievement levels generated substantial resistance from schools that scored well under the traditional system but were ranked lower under the growth model.
Memphis City Schools is in the early stages of implementing a new teacher evaluation system, of which value-added measures of teacher effectiveness are one component. The Memphis “Teacher Effectiveness Measure” (TEM) is based on four factors: value-added scores (35%), classroom observation (35%), teacher knowledge measured by tests (15%), and stakeholder perceptions measured by surveys (15%). Memphis plans to use the measure to inform retention, promotion, tenure, and compensation decisions, as well as the allocation of high-quality teachers to high-need schools.
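The weights above come from the district's proposal; the component scores in the sketch below are invented solely to show the arithmetic of combining them, and a 0-100 scale for each component is assumed for illustration.

```python
# Arithmetic of a weighted TEM composite using the proposal's weights;
# the component scores and the 0-100 scale are hypothetical.
WEIGHTS = {"value_added": 0.35, "observation": 0.35, "knowledge": 0.15, "perception": 0.15}

def tem_composite(components):
    """Weighted average of the four TEM components."""
    return sum(WEIGHTS[name] * score for name, score in components.items())

example = {"value_added": 72, "observation": 85, "knowledge": 90, "perception": 78}
print(round(tem_composite(example), 2))  # 0.35*72 + 0.35*85 + 0.15*90 + 0.15*78 = 80.15
```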
To develop the TEM, the district (which educates 118,000 students in 185 schools) has proposed the following annual budget, to be sustained for five years:
Hire 7 new FTEs to support the TEM office $ 670,000
Purchase additional assessments for grades K-2 $ 750,000
Develop new rubric and technology to support classroom observation $ 1,300,000
Purchase surveys to capture stakeholder perception $ 600,000
Total $ 3,320,000 per year
Note that this proposed budget does not include the cost of salary increases or bonuses to high-performing teachers (Memphis City Schools 2009).
Dallas schools have also used value-added methods for both school ranking and teacher evaluation since the 1990s. In contrast with Tennessee’s approach, however, the Dallas system also controls for a host of additional variables, including students’ ethnicity, language proficiency, and socioeconomic status, as well as school-level mobility, overcrowding, racial composition, and poverty indicators, among others. Data requirements include at least two prior years of data on each school-level outcome variable. Dallas is also relatively unusual in that it estimates its own value-added models in-house.
Originally, the school and teacher indices estimated by the system were used to identify high- and low-performing schools. Group bonuses were apportioned to staff at the high-performing schools, while poor performers were targeted for additional resources, replacement of administrators, or restructuring. Thomas Toch cites a study suggesting that schools’ value-added ratings were correlated with other attributes often associated with good schools: good discipline, qualified teachers, extra help for students who needed it, and flexible personnel policies.
The Dallas system also included “teacher effectiveness indices” generated by identifying students whose actual test scores were above or below the values predicted by the statistical model. Initially these indices were used only for internal planning, but during the mid-1990s an effort was underway to incorporate them into teacher evaluations (Webster and Mendro 1997; Toch 2005).
Ballou, Dale (2008). “Value-Added Analysis: Issues in the Economics Literature.” Vanderbilt University, October 6, 2008.
Braun, Henry (2004). “Value-Added Modeling: What Does Due Diligence Require?” Educational Testing Service, December 20, 2004.
Braun, Henry, Naomi Chudowsky, and Judith Koenig, eds (2010). “Getting Value Out of Value-Added: Report of a Workshop.” Committee on Value-Added Methodology for Instructional Improvement, Program Evaluation, and Accountability; National Research Council. http://www.nap.edu/catalog/12820.html
Goldschmidt, Pete (2008). “Practical Considerations for Using Value Added Models for Monitoring Teacher Effectiveness.” National Center for Research on Evaluation, Standards, and Student Testing, California State University at Northridge, October 2008.
Hansen, Michael (2007). “Value Added Assessment Methods and Their Implications for Teacher Evaluation in Kentucky.” REL Appalachia at CNA for Kentucky Department of Education.
Hipbshman, Terry (2004). “A Review of Value Added Models.” Kentucky Education Professional Standards Board, September 2004.
McCaffrey, Daniel F., J.R. Lockwood, Daniel M. Koretz, and Laura S. Hamilton (2003). Evaluating Value-Added Models for Teacher Accountability. RAND Education.
Memphis City Schools (2009). “Teacher Effectiveness Initiative (TEI) Proposal.”
Meyer, Robert H., and Michael S. Christian (2008). “Value-Added and Other Methods for Measuring School Performance: An Analysis of Performance Measurement Strategies in Teacher Incentive Fund Proposals.” Value-Added Research Center, University of Wisconsin-Madison.
Miller, Ann (2008). “Non-traditional Compensation for Teachers.” REL Appalachia at CNA for West Virginia Department of Education, June 17, 2008.
Patz, Richard J. (2007). “Vertical Scaling in Standards-Based Educational Assessment and Accountability Systems.” Council of Chief State School Officers.
Rothstein, Jesse (2009). “Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement.” Princeton University and NBER. Forthcoming in the Quarterly Journal of Economics. http://www.princeton.edu/~jrothst/published/rothstein_vam_may152009.pdf
Sanders, William L., Arnold M. Saxton, and William P. Horn (1997). “The Tennessee Value Added Assessment System: A Quantitative, Outcomes-Based Approach to Educational Assessment.” In Millman, Jason, ed. Grading Teachers, Grading Schools: Is Student Achievement a Valid Evaluation Measure? Thousand Oaks, CA: Corwin Press.
Sanders, William L., S. Paul Wright, June C. Rivers, and Jill G. Leandro (2009). “A Response to Criticisms of SAS® EVAAS®.” SAS Institute, November 13, 2009.
Sawchuk, Stephen (2009). “Some Academics Push Back on Teacher-Student Link in 'Race to the Top'” Education Week, September 11, 2009.
Toch, Thomas (2005). “Measure for Measure.” Atlantic Monthly, October/November 2005.
Webster, William J. and Robert L. Mendro (1997). “The Dallas Value-Added Accountability System.” In Millman, Jason, ed. Grading Teachers, Grading Schools: Is Student Achievement a Valid Evaluation Measure? Thousand Oaks, CA: Corwin Press.
This publication was prepared under a contract with the U.S. Department of Education’s Institute of Education Sciences, Contract ED-06-CO-0021, by Regional Educational Laboratory Appalachia, administered by CNA. The content of the publication does not necessarily reflect the views or policies of IES or the U.S. Department of Education, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. government.