REL Reference Desk

Value Added Measures of Teacher Effectiveness

March 25, 2010

Request

What are the major issues for state-level policymakers in using value-added teacher effectiveness measures?

Response

Value-added models for estimating the effects of teachers, schools, or districts on student learning hold some potential for enhancing efforts at assessment and accountability. The complexity of designing effective models, however, makes implementing a model challenging for policymakers. Several outstanding areas of investigation remain for developing value-added models, including issues of test design, data system requirements, and model selection.

Introduction – What are Value-Added Models?

Value-added models are statistical models for estimating the contribution of teachers, schools, or districts to changes in measures of student achievement. These statistical models are complex regressions that can model the effects of many independent variables. The models may allow researchers to disentangle teacher and school effects from other factors that influence student learning, especially individual student attributes such as prior achievement or family characteristics.
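
As a purely illustrative sketch (not the specification of any particular operational system), a simple value-added regression might relate a student's current score to a prior score, observed student characteristics, and an effect attributed to the student's current teacher:

    y_{it} = \lambda y_{i,t-1} + X_{it}\beta + \theta_{j(i,t)} + \varepsilon_{it}

Here y_{it} is student i's test score in year t, X_{it} collects measured student characteristics, \theta_{j(i,t)} is the effect attributed to the student's year-t teacher, and \varepsilon_{it} captures unexplained variation. Operational models differ substantially in how many prior scores they use, which controls they include, and how the teacher effects are estimated.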

Strengths and Weaknesses of Value-Added Evaluation

Some recent research suggests that teachers have significant effects on student achievement, and that these effects may persist over time. Value-added methods estimate the size of such effects using test score growth rather than score levels. This approach may be useful because some research suggests that changes in a student’s test scores over time may be less sensitive to student characteristics than is the student’s score level at any one time. As a result, value-added methods may be able to reduce or even eliminate the confounding effects of these student characteristics on measures of student achievement. Some advocates have suggested that the potential advantages of value-added methods include improved accuracy and fairness of school and teacher evaluation systems and reduced incentives to “game” those systems as currently designed. Others cite the potential for value-added methods to handle problems that might otherwise threaten the validity of simpler student achievement measures, such as decay in student learning or the provision of extra resources to high-achieving students (Toch 2005; Meyer and Christian 2008).

Questions remain, however, as to whether value-added evaluation can actually deliver on these promises. The models can be difficult to understand for those without some background in statistics. Additionally, proprietary models and data systems developed and implemented by private firms are not always fully accessible to educators or the public at large. Implementation difficulties, including issues of test design, data system requirements, and model selection, can be substantial. Perhaps most importantly, it is not completely clear whether the statistical methods currently available can tease out teacher effects on student achievement from the many other influences on it. In addition, there is currently no evidence-based link between statistically estimated teacher or school effects and teacher attributes or instructional practices. Value-added models, even when they accurately identify high-performing schools or teachers, cannot yet tell us what these individuals or organizations are doing to produce that performance.

Implementation Difficulties
Testing requirements

Construct Validity. Value-added evaluation requires tests that are accurate measures of student achievement. If the test covers some standards poorly, then teachers who target those standards can be disadvantaged by value-added measures. Teachers who instead focus only on the standards the test does address, on the other hand, may leave gaps in student ability that show up in later grades. Value-added measures, therefore, can be sensitive to the nature of the tests used (Braun 2004; Braun et al. 2010).

Vertical Scaling. Some types of value-added models require that test scores be “vertically scaled” so that students’ scores are comparable between years. Because proper vertical scaling involves careful test design and thorny psychometric issues, it can itself be difficult to implement. According to Richard J. Patz of CTB/McGraw Hill, developing vertically scaled tests requires vertically aligned content standards (with considerable grade-to-grade overlap and a systematic, intentional increase in difficulty) as well as psychometric test development and data collection approaches that include

  • sufficient numbers of common items across levels or sufficient numbers of students taking multiple forms
  • conditions closely approximating operational conditions
  • statewide data or large, statistically representative samples of students

Since test companies are not always willing to make their test design methods fully transparent, the reliability of vertical scaling can be hard to assess.

Other test-related issues can also influence student achievement and teacher effect estimates, including changes in test timing and frequency, alternative weightings of topics, and differences in scaling (Patz 2007; McCaffrey et al. 2003; Braun et al. 2010).

Test Scores as Measures of Student Achievement. Two additional problems may thwart attempts to use test score gains as measures of student achievement that are truly independent of initial ability levels. First, insufficient variability of test scores across students and over time can reduce the reliability of value-added teacher effect measures. If, for example, there is little change in the test score ordering across students between successive test offerings, then growth measures (the difference in test scores) may provide little additional information about student achievement compared to simply looking at score levels.
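
A standard psychometric result illustrates the point. Under the simplifying assumptions that the two administrations have equal score variances and a common reliability \rho_{tt}, the reliability of the gain (difference) score is

    \rho_{gain} = \frac{\rho_{tt} - \rho_{12}}{1 - \rho_{12}}

where \rho_{12} is the correlation between students’ scores on the two administrations. As that correlation approaches the tests’ own reliability (that is, as students’ relative standing barely changes between administrations), the reliability of the gain score approaches zero.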

Second, changes in test scores across students of different initial abilities are not necessarily equivalent in terms of the amount of teacher effort or quality needed to produce the gains. For example, the level of effort required to move a student from, say, 35% correct responses to 45% may not be the same as the level of effort required to move a student from 85% to 95%. If this is the case, then the relationship between teacher effectiveness and changes in test scores will still be dependent on initial student ability level.

Data requirements

Value-added evaluation requires individual student data that are linked over time and linked to each teacher. The models require at least two years of linked data (and some versions require at least three years) to conduct teacher evaluations. This implies that substantial amounts of data and extensive computing capacity can be required to model specific teacher effects. Additionally, certain types of value-added models (though not all) require the data to be scaled or transformed to allow comparisons of student progress expectations across teachers, schools, or districts that are independent of students’ beginning achievement level (Goldschmidt 2008; McCaffrey et al. 2003; Sanders et al. 2009).
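
The kind of linked, longitudinal data these models require can be illustrated with a small sketch. The file name and column names below are hypothetical, not drawn from any particular state system; the sketch simply shows one row per student-teacher-year link and a basic check of how many students carry the two or more linked years of scores the models need.

    import pandas as pd

    # Hypothetical longitudinal file: one row per student, per tested year,
    # linked to that student's teacher of record for the year.
    panel = pd.read_csv("student_teacher_links.csv")
    # expected columns: student_id, teacher_id, school_id, year, subject, scale_score

    # Count the number of distinct tested years linked to each student...
    years_per_student = panel.groupby("student_id")["year"].nunique()

    # ...and report the share of students meeting a two-year or three-year minimum.
    print((years_per_student >= 2).mean())
    print((years_per_student >= 3).mean())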

Missing Data. At this point, little is known about the effects of missing student achievement or student-teacher link data on value-added teacher effect estimates. Missing data can be a serious problem because poorer and lower-ability students are often more likely to miss exams or move away from a district. In such cases, simply excluding the potentially non-random sample of students who have missing data from the larger sample can result in biased estimates of teacher effects (McCaffrey et al. 2003).

Access to data can also be an issue when value-added modeling is performed by private firms. In Tennessee, for example, the data used for value-added assessments are kept entirely separate from the state’s longitudinal data system.

Choosing a statistical model

One key statistical modeling issue is what assumptions are made about the nature of teacher effects. Different assumptions may produce different teacher effect estimates. For example, some types of models are good at measuring the effectiveness of teachers close to average performance levels but perform less well than other types at measuring the effectiveness of very good or very poor teachers (“outliers”). Terry Hipbshman of the Kentucky Education Professional Standards Board suggests that there is likely to be a tradeoff between the accuracy and precision of estimates and the mathematical and computational complexity of the model (Hipbshman 2004; McCaffrey et al. 2003).
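
The practical difference between such assumptions can be made concrete with a brief sketch. Using the hypothetical file and column names from the data example above (plus an assumed prior_score column holding each student’s previous-year score), the first specification treats teacher effects as random draws from a common distribution, which shrinks estimates for teachers with few students toward the overall average, while the second estimates a separate fixed coefficient for every teacher. Neither is offered as the correct model; they simply embody different assumptions and can rank the same teachers differently.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical linked panel from the earlier sketch, with a prior_score column added
    panel = pd.read_csv("student_teacher_links.csv")

    # (a) Mixed (multilevel) model: teacher effects as random intercepts around a common mean
    mixed = smf.mixedlm("scale_score ~ prior_score",
                        data=panel, groups=panel["teacher_id"]).fit()

    # (b) Fixed-effects model: one dummy variable per teacher
    fixed = smf.ols("scale_score ~ prior_score + C(teacher_id)", data=panel).fit()

In the mixed model, the estimated teacher effects can be read from mixed.random_effects; the fixed-effects version reports them directly as coefficients on the teacher dummies.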

Certain types of models may also be more or less sensitive to sampling errors. As Dale Ballou argues, since teachers are not assigned a random sample of students, student achievement can differ across teachers for reasons that are not related to teacher performance. Michael Hansen, former REL-Appalachia field representative to the Kentucky Department of Education, believes that the lack of information about the student-teacher matching process (how much influence do teachers have over where they teach, or parents over which teachers teach their children?) is the biggest barrier to obtaining valid value-added measures of teacher effectiveness. One consequence of small sample sizes is teacher effectiveness estimates that can fluctuate substantially from year to year. (See Ballou 2008; Braun et al. 2010; Hansen 2007; Sawchuk 2009; Rothstein 2009).

A second important issue involves choosing the set of variables to be controlled for. This issue is crucial, argue researchers at RAND Education, especially when different schools serve distinctly different student populations. Such differences, if they influence both achievement levels and growth, can lead to biased estimates of teacher effects when measures of them are omitted from the statistical model. Additionally, classroom effects (e.g., the presence of disruptive students in a class), school effects, district effects, the effects of prior teachers, and peer learning effects can all be very difficult to disentangle from the effects of current individual teachers. Hipbshman concludes that some types of models that do not include student or school-level control variables may nonetheless be able to identify teachers at the extremes of the effectiveness distribution, although they might be biased in favor of teachers who work with more advantaged populations (McCaffrey et al. 2003; Hipbshman 2004).
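
One way to see the difficulty, purely schematically, is to add these intervening layers to the simple regression sketched earlier:

    y_{it} = \lambda y_{i,t-1} + X_{it}\beta + \theta_{j(i,t)} + c_{m(i,t)} + s_{k(i,t)} + \varepsilon_{it}

where \theta is the current teacher’s effect, c is a classroom-level effect (for example, the peer mix or a disruptive classmate), and s is a school- or district-level effect. Because a teacher typically teaches only one or a few classrooms in a single school in a given year, the data frequently cannot separate \theta from c and s without strong additional assumptions.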

Teacher Evaluation Systems that Use Value Added

Tennessee

Perhaps the most prominent current example of the use of value-added methods in the U.S. is the Tennessee Value Added Assessment System (TVAAS). In 1992, the state began rating each of its schools based on value-added measures. The system is now run by SAS Institute of Cary, NC. SAS provides the state with three services through TVAAS: value-added scoring (teacher effects), a growth model for use in meeting NCLB accountability reporting requirements, and predictive information on students (ACT scores, for example).

The basic statistical model used in TVAAS is a multivariate, linear, mixed, longitudinal regression model: multivariate because it simultaneously fits the entire set of observed test scores belonging to each student; mixed (multilevel) because it can estimate both “fixed” school effects and “random” teacher-level effects; and longitudinal because the data are repeated observations on individual students. This model requires that the data be scaled or transformed to allow valid comparisons of achievement growth for students with differing initial achievement levels. According to the model’s developer, Bill Sanders, these data transformation and estimation methods allow students to serve as their own controls and eliminate the need to control for socioeconomic status and other student characteristics. Many researchers and policymakers are skeptical of these claims. SAS and Sanders have at times been criticized for being less than fully transparent about exactly how their methods are carried out (Sanders et al. 1997; Sanders et al. 2009).
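
A simplified version of the “layered” structure commonly used to describe TVAAS-type models (see McCaffrey et al. 2003) illustrates both the longitudinal character of the model and its persistence assumption: each score is modeled as a system-wide mean for that year plus the accumulated effects of the current and all prior teachers,

    y_{it} = \mu_{t} + \sum_{s=1}^{t} \theta_{j(i,s)} + \varepsilon_{it}

with the errors for a given student allowed to be correlated across years. Notably, no student background variables appear in this form; the claim that students “serve as their own controls” rests on the use of each student’s full score history rather than explicit covariates.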

To date, Tennessee has not required that the system be used for formal teacher evaluation. The state’s governor, Phil Bredesen, is attempting to change state law to facilitate incorporation of TVAAS into teacher evaluation. As mentioned, TVAAS has been used as the basis for a school ranking system based on student achievement growth. The transition to this system from the older accountability system based on achievement levels generated substantial resistance from schools that scored well under the traditional system but were ranked lower under the growth model.

Memphis City Schools

Memphis City Schools is in the early stages of implementing a new teacher evaluation system, of which value-added measures of teacher effectiveness are one component. The Memphis “Teacher Effectiveness Measure” (TEM) is based on four factors: value added (35%), classroom observation (35%), teacher knowledge as measured by tests (15%), and stakeholder perceptions as measured by surveys (15%). Memphis plans to use the measure to make retention, promotion, tenure, and compensation decisions, as well as to guide the allocation of high-quality teachers to high-need schools.
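
Mechanically, the TEM is a weighted composite of these four components. The sketch below is illustrative only: it assumes, hypothetically, that each component has already been placed on a common 0-100 scale, and it does not reflect Memphis’s actual scaling or business rules.

    # Hypothetical component scores for one teacher, each already on a 0-100 scale
    components = {
        "value_added": 62.0,             # weighted 35%
        "observation": 78.0,             # weighted 35%
        "teacher_knowledge": 85.0,       # weighted 15%
        "stakeholder_perception": 70.0,  # weighted 15%
    }
    weights = {
        "value_added": 0.35,
        "observation": 0.35,
        "teacher_knowledge": 0.15,
        "stakeholder_perception": 0.15,
    }

    tem_score = sum(components[k] * weights[k] for k in weights)
    print(round(tem_score, 2))  # about 72.25 with these illustrative inputs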

To develop the TEM, the district (which educates 118,000 students in 185 schools) has proposed the following annual budget, to be sustained over five years:

  • Hire 7 new FTEs to support the TEM office: $670,000
  • Purchase additional assessments for grades K-2: $750,000
  • Develop a new rubric and technology to support classroom observation: $1,300,000
  • Purchase surveys to capture stakeholder perception: $600,000
  • Total: $3,320,000 per year

Note that this proposed budget does not include the cost of salary increases or bonuses to high-performing teachers (Memphis City Schools 2009).

Dallas Independent School District

Dallas schools have also used value-added methods for both school ranking and teacher evaluation since the 1990s. In contrast with Tennessee’s approach, however, the Dallas system also controls for a wide range of additional variables, including students’ ethnicity, language proficiency, and socioeconomic status, as well as school-level mobility, crowding, racial composition, and poverty indicators, among others. Data requirements include at least two prior years of data on each school-level outcome variable. Dallas is also relatively unusual in that it estimates its own value-added models in-house.

Originally, the school and teacher indices estimated by the system were used to identify high- and low-performing schools. Group bonuses were apportioned to staff at the high-performing schools, while poor performers were targeted for additional resources, replacement of administrators, or restructuring. Thomas Toch cites a study suggesting that schools’ value-added ratings were correlated with other attributes often associated with good schools: good discipline, qualified teachers, extra help for students who needed it, and flexible personnel policies.

The Dallas system also included “teacher effectiveness indices” generated by identifying students whose actual test scores were above or below the values predicted by the statistical model. Initially these indices were used only for internal planning, but during the mid-1990s an effort was underway to incorporate them into teacher evaluations (Webster and Mendro 1997; Toch 2005).

Outstanding Issues

  1. Evaluating the sensitivity of estimated teacher effects to other factors such as test design and scaling, student characteristics, and missing data. This step can help ensure the accuracy of such estimates (McCaffrey et al. 2003).
  2. When integrating value-added teacher effectiveness measures into evaluation and/or compensation systems:
    1. Be aware of potential unintended negative consequences.
      1. Threat of score inflation. Dallas, for example, developed a system to monitor patterns of excessively large test score gains to permit further investigation of potential cheating (McCaffrey et al. 2003; Webster and Mendro 1997). A simple sketch of this kind of screen appears after this list.
      2. Changes in the intended curriculum, instruction and test preparation. For example, there is some evidence that pay-for-performance systems based only on reading and math scores give teachers incentives to diminish focus on other subjects (Miller 2008).
      3. Teacher morale and retention. Compensation systems based strictly on pay for performance measured only by standardized tests often generate teacher resistance (Miller 2008).
    2. Link value-added estimates to alternative indicators of teacher effectiveness. This step can help to validate the value-added estimates and may avoid problems with teacher resistance to strict test-based pay-for-performance systems. Perhaps more importantly, this step would allow value-added methods to serve as a research tool for studying the practices of high-performing teachers or schools (McCaffrey et al. 2003; Hansen 2007).
    3. Initially implement value-added evaluation/compensation systems through pilot programs. Research suggests that including stakeholders in pilot program development and informing pilot designs with information on previously implemented systems and expert views increase the chances for success (Miller 2008).
  3. Conducting a cost-benefit analysis
    1. Balancing Costs and Benefits
      1. Potential Benefits: better ability to identify and target rewards to high-performing teachers and extra staff development resources or sanctions to lower performers
      2. Potential Costs: setup costs – personnel, test redesign, data system changes, validating a statistical model; costs of providing bonuses or extra staff development. Costs may vary depending on whether some or all of this work is done by outside vendors or in-house.
    2. Comparing Different Decision Rules. Consider questions such as: What kinds of decisions are going to be made with value-added methods (e.g., who gets rewards or sanctions)? What are the costs of making errors in those decisions? How can an evaluation system be implemented to minimize such costs? (McCaffrey et al. 2003)
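
As a purely illustrative sketch of the kind of gain-pattern screen mentioned in the score-inflation item above (the file, column names, and thresholds are hypothetical, not Dallas’s actual procedure), a district could flag classrooms whose average year-to-year gain sits unusually far above the norm and route them for follow-up review:

    import pandas as pd

    panel = pd.read_csv("student_teacher_links.csv")  # hypothetical file used earlier
    # assume a 'gain' column: current scale_score minus the prior year's score

    # Mean gain and number of tested students for each teacher's classroom
    class_stats = panel.groupby("teacher_id")["gain"].agg(["mean", "count"])

    # Flag classrooms with at least 10 tested students whose mean gain exceeds the
    # districtwide average classroom gain by more than three standard deviations
    mu = class_stats["mean"].mean()
    sd = class_stats["mean"].std()
    flagged = class_stats[(class_stats["count"] >= 10) &
                          (class_stats["mean"] > mu + 3 * sd)]
    print(flagged)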

Resources

Ballou, Dale (2008). “Value-Added Analysis: Issues in the Economics Literature.” Vanderbilt University, October 6, 2008.

The author discusses issues in the specification and estimation of value-added models. These issues include: (1) Non-random assignment of students to teachers, and the potential for omitted variable and selection bias; (2) Model misspecification, due to uncertainty about the relationship between schooling inputs and achievement; and (3) Properties of achievement tests, including: (a) how well tests are aligned with the curriculum; (b) ceiling and floor effects; (c) test measurement error and its implications for measured achievement gains; (d) whether test scores are reported on interval scales required for forms of value-added analysis in current use; and (e) the timing of test administration, which rarely coincides neatly with the beginning or end of an academic year.

Braun, Henry (2004). “Value-Added Modeling: What Does Due Diligence Require?” Educational Testing Service, December 20, 2004.

This paper provides an introduction to value-added modeling by introducing seven “key assumptions” that (the author argues) must hold if the use of value-added metrics for teacher evaluation is to be fully valid. These assumptions are: (1) the construct validity of tests; (2) the interval scaling property of test scores; (3) the absence of selection bias in assignment of students to classes; (4) the absence of missing data; (5) the use of linear mixed statistical models; (6) the persistence of teacher effects; and, (7) that any estimated achievement effects are caused by the teacher and not some other factor. The author argues that most of these assumptions have only a low to moderate level of reasonableness, and in several cases departures from the assumptions will compromise the validity of estimates.

Braun, Henry, Naomi Chudowsky, and Judith Koenig, eds. (2010). “Getting Value Out of Value-Added: Report of a Workshop.” Committee on Value-Added Methodology for Instructional Improvement, Program Evaluation, and Accountability; National Research Council. http://www.nap.edu/catalog/12820.html

Report of the Workshop on Value-Added Methodology for Instructional Improvement, Program Evaluation, and Educational Accountability, held on November 13 and 14, 2008, in Washington, DC. Introduces the concept of value-added models; discusses use of the models for research, school and instructional improvement, and program evaluation; problems in measuring student achievement; different methods for estimating value added; and considerations for policymakers, including implementation issues and incorporating value-added measures into evaluation systems.

Goldschmidt, Pete (2008). “Practical Considerations for Using Value Added Models for Monitoring Teacher Effectiveness.” National Center for Research on Evaluation, Standards, and Student Testing, California State University at Northridge, October 2008.

Covers measurement issues – specifically, the basics of value-added models, comparison of different types of models, and the reliability and validity of assessments – as well as data requirements and modeling issues.

Hansen, Michael (2007). “Value Added Assessment Methods and Their Implications for Teacher Evaluation in Kentucky.” REL Appalachia at CNA for Kentucky Department of Education.

A memo from the REL Appalachia field scientist that considers value-added assessment and its implications for the way in which the Kentucky Department of Education assesses the contributions of teachers. The author concludes that, although the rationale for value-added models is compelling, in practice they are not actually able to measure the true impact of teachers on student learning because of the non-random way in which teachers and students are assigned to one another and a lack of detailed information describing the assignment process.

Hipbshman, Terry (2004). “A Review of Value Added Models.” Kentucky Education Professional Standards Board, September 2004.

This report identifies the major models that have been proposed, and assesses their applicability for use in Kentucky. Differences in the models stem from efforts by statisticians to resolve the various technical problems that have arisen as the field has developed. The report assesses strengths and limitations of the various approaches. The author urges caution in using any of the models for high-stakes purposes, as none of them solves all of the known technical problems, and some problems have proven intractable. The author believes that there is no generally-applicable value-added model whose results can be trusted for rank-ordering teachers or schools with enough precision to justify their use in a high-stakes environment. Some of the models, however, have proven useful for specific limited purposes, such as identifying teachers at the extremes of the performance distribution.

McCaffrey, Daniel F., J.R. Lockwood, Daniel M. Koretz, and Laura S. Hamilton (2003). Evaluating Value-Added Models for Teacher Accountability. RAND Education.

This monograph attempts to clarify the primary questions raised by the use of value-added models for measuring teacher effects, review the most important recent applications, and discuss a variety of important statistical and measurement issues that might affect the validity of value-added inferences. The authors identify four groups of issues involved with modeling student achievement data to estimate teacher effects: (1) choice of statistical model (e.g. single-year vs. full multivariate models, “fixed” vs. “random” effects, etc), (2) effects of omitted variables and missing data, (3) use of achievement test scores as outcome measures (e.g. effects on estimates of differences in test content, timing, scaling), and (4) uncertainty about estimated effects. The authors conclude that the research base is currently insufficient to support the use of value-added models for high-stakes decisions.

Memphis City Schools (2009). “Teacher Effectiveness Initiative (TEI) Proposal.”

This proposal describes Memphis City Schools’ Teacher Effectiveness Initiative (TEI) for improving student achievement through improved teaching. This ambitious plan involves developing a common definition of effective teaching and a more informative teacher evaluation process, recognizing and rewarding excellent teaching, and more effectively addressing mediocre teaching. Central to the plan is the creation of a new Teacher Effectiveness Measure (TEM), consisting of growth in student learning, classroom observations, measures of student and parent perceptions, and measures of teacher knowledge (standardized tests). The TEM serves as the basis for planned improvements in the teacher evaluation process, connection of professional development to individual teacher needs, creation of differentiated career paths, payment of teachers based on these differentiated roles and performance, and placement of the best teachers where they are most needed. MCS plans to support these measures with additional steps to improve principal leadership and school culture, and develop a new technology platform that will support data-driven decision-making. The proposal includes initial cost estimates for putting the program in place.

Meyer, Robert H., and Michael S. Christian (2008). “Value-Added and Other Methods for Measuring School Performance: An Analysis of Performance Measurement Strategies in Teacher Incentive Fund Proposals.” Value-Added Research Center, University of Wisconsin-Madison.

This paper reviews methods proposed by U.S. Department of Education Teacher Incentive Fund (TIF) grantees for measuring the performance of schools, teachers, and administrators with respect to student achievement. The most commonly used measure of school performance among the TIF grantees is value added. Other methods used and discussed in the paper include gain, movement across proficiency levels, proficiency rates/average attainment, and individual achievement plans. The authors conclude that, relative to the others, value-added models can provide valid, reliable estimates of teacher, administrator, and school performance that can also take into account important locally specific conditions. Attainment indicators, which do not take into account prior student achievement, perform poorly relative to value-added methods. Simple average gain measures can produce similar results to value-added, but only under the following conditions: little or no decay in student achievement over time, little or no differential resource allocation based on student ability, no differences in pretest and posttest scales, little nonlinearity in the test scaling algorithm, and little influence of differences in student characteristics across schools on variation in average achievement growth. The larger the deviations from these assumptions, the worse the gain indicator will perform relative to value-added in accurately measuring school performance. Proficiency-level indicators perform similarly to gain measures, but only when student achievement growth is large enough to enable individuals to cross a proficiency threshold.

Miller, Ann (2008). “Non-traditional Compensation for Teachers.” REL Appalachia at CNA for West Virginia Department of Education, June 17, 2008.

This literature review provides an overview of teacher compensation systems that differ from the traditional single seniority-based salary schedule with additional compensation based on educational attainment. It describes specific types of reward systems, discusses unintended consequences associated with some of them, and addresses issues of pay system design. The success of non-traditional pay systems varies by location and program type, suggesting a need for state- and district-specific pilots and designed solutions attentive to local conditions. Programs that pay individuals strictly for performance as measured by standardized tests have not endured, due to pushback from teachers, unions, and parents, as well as changes in leadership.

Patz, Richard J. (2007). “Vertical Scaling in Standards-Based Educational Assessment and Accountability Systems.” Council of Chief State School Officers.

Vertical scaling is a method for linking a set of test forms of increasing difficulty. This paper explores advantages and limitations of using these methods in the design of standards-based educational achievement tests under both status-based and growth-based accountability systems. The paper discusses the application of vertical scales to support growth models, as well as alternatives to vertical scaling that meet the needs of accountability systems.

Rothstein, Jesse (2009). “Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement.” Princeton University and NBER. Forthcoming in the Quarterly Journal of Economics. http://www.princeton.edu/~jrothst/published/rothstein_vam_may152009.pdf

This paper presents evidence that value-added measures of long-run teacher effects are not persistent – they tend to fade out after a year or so – and are widely variable over time.

Sanders, William L., Arnold M. Saxton, and William P. Horn (1997). “The Tennessee Value Added Assessment System: A Quantitative, Outcomes-Based Approach to Educational Assessment.” In Millman, Jason, ed. Grading Teachers, Grading Schools: Is Student Achievement a Valid Evaluation Measure? Thousand Oaks, CA: Corwin Press.

This chapter provides an introduction to the Tennessee Value Added Assessment System (TVAAS). Most of the chapter discusses statistical methodology, including basic system properties, data encoding, and interpretation of results. There is also a (now dated) discussion of computing requirements.

Sanders, William L., S. Paul Wright, June C. Rivers, and Jill G. Leandro (2009). “A Response to Criticisms of SAS® EVAAS®.” SAS Institute, November 13, 2009.

This short paper answers some common criticisms that have been made about the Educational Value Added Assessment System (EVAAS, the general version of Tennessee’s TVAAS). Issues considered include: using standardized tests as outcome measures, effects of missing student test data, class size and use of “shrinkage” estimation, lack of controls for socioeconomic factors, excessive complexity and lack of transparency associated with the method, lack of peer review of the method, lack of verification of predictions, and using EVAAS for formative assessment.

Sawchuk, Stephen (2009). “Some Academics Push Back on Teacher-Student Link in 'Race to the Top'.” Education Week, September 11, 2009.

Summarizes questions that some researchers have raised about value-added methods, especially regarding the ability of value-added measures to disentangle teacher effects from other influences.

Toch, Thomas (2005). “Measure for Measure.” Atlantic Monthly, October/November 2005.

This article provides a non-technical introduction to use of value-added models in measuring school and teacher effectiveness. The article compares value-added methods to the NCLB “adequate yearly progress” accountability system.

Webster, William J. and Robert L. Mendro (1997). “The Dallas Value-Added Accountability System.” In Millman, Jason, ed. Grading Teachers, Grading Schools: Is Student Achievement a Valid Evaluation Measure? Thousand Oaks, CA: Corwin Press.

This book chapter introduces the value-added system used in the Dallas Independent School District since the mid-1980s. The chapter discusses how the district uses the approach to identify effective schools and teachers (including coverage of the statistical method), and assess rewards and penalties based on those ratings.

This publication was prepared under a contract with the U.S. Department of Education’s Institute of Education Sciences, Contract ED-06-CO-0021, by Regional Educational Laboratory Appalachia, administered by CNA. The content of the publication does not necessarily reflect the views or policies of IES or the U.S. Department of Education, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. government.