Lessons—Flaws in Annual Testing

These pieces originally appeared as a weekly column entitled “Lessons” in The New York Times between 1999 and 2003.

[ THIS ARTICLE FIRST APPEARED IN THE NEW YORK TIMES ON JANUARY 24, 2001 ]

Flaws in Annual Testing

The big idea in President Bush’s education plan, sent to Congress yesterday, is to hold schools with poor children accountable, using an annual test. Pupils at schools with two years of inadequate test-score gains can transfer to another public school. After a third year of little progress, they can use public money at private schools.

It seems reasonable to have tests identify schools that don’t improve. But the president may put too much faith in scores that are less accurate than he thinks.

Under President Bush’s proposal, commonplace measurement error could cause states to identify as “failing” some schools that don’t deserve the label.

This column previously described the pitfalls of using one test to evaluate a student. Because any day’s score might differ from a student’s average, or true, score if the test were taken many times, one test should not decide promotion. Judging a student by one test is like judging a baseball player by one day’s batting average.

It may seem that these problems don’t apply schoolwide because when some students have good days, others have bad ones. If these average out, accountability using a test should work.

But even a school average can wobble around its true value, so sanctions based on annual score changes run a risk of unfairness.

Thomas J. Kane, an economist at the Hoover Institution in Palo Alto, Calif., and Douglas O. Staiger, an economics professor at Dartmouth College, have studied the accuracy of North Carolina’s tests. When schools there have above-average score gains, teachers get bonuses.

But Dr. Kane says even tiny sampling errors can keep scores from reflecting true performance. Consider how teachers think of their classes as “good” or “bad” in any year. A less-able teacher could produce greater score gains in a class with better pupils than a more-able teacher could produce with worse students, even if the classes are demographically identical.

Dr. Kane says that in typical North Carolina elementary schools, nearly one-third of the variance in school reading gains is a result of this “luck of the draw,” that is, whether this year’s seemingly identical students are easier to teach than last year’s.

This affects small schools more, Dr. Kane says, because a few extreme scores can more easily distort an average. Thus, small schools with big gains may seem more effective. But small schools also more often post tiny gains or even losses. A small school’s results may have more to do with sampling error than school quality.
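The arithmetic behind the small-school effect can be sketched in a few lines. This is only an illustration, not Dr. Kane's method: it assumes pupils are a simple random draw and uses a hypothetical student-level spread of 10 points (`ASSUMED_SCORE_SD`) and hypothetical school sizes, none of which come from the North Carolina data.

```python
import math


def standard_error_of_mean(score_sd: float, n_students: int) -> float:
    """Typical wobble of a school average, assuming a simple random sample."""
    if n_students <= 0:
        raise ValueError("n_students must be positive")
    return score_sd / math.sqrt(n_students)


# Hypothetical values chosen for illustration only.
ASSUMED_SCORE_SD = 10.0

for n in (25, 100, 400):
    se = standard_error_of_mean(ASSUMED_SCORE_SD, n)
    print(f"{n:>3} pupils tested: average wobbles by about {se:.1f} points")
```

Under these assumptions, a school testing 25 pupils has an average four times as wobbly as one testing 400, which is why the smallest schools turn up at both the top and the bottom of the rankings.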

Also, even at large schools, a rainy day or other random events may change children’s dispositions. This can affect a school’s rank: nearly another third of the variance in score gains is linked to the fact that a school’s average can vary from one day to another.

North Carolina solved one problem while creating another. Like the Bush plan, the state’s program judges schools by annual gains, not absolute scores, to avoid rewarding schools that test well only because they have privileged students. But sampling and random events affect both earlier and later scores, compounding the inaccuracy.
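Why a gain is shakier than either year's score can be shown with a short sketch. It assumes, hypothetically, that the errors in the two years are independent, so their variances add; the function name and the numbers are illustrative, not drawn from either state's program.

```python
import math


def gain_standard_error(se_earlier: float, se_later: float) -> float:
    """Error in (later - earlier), assuming the two years' errors are independent."""
    return math.sqrt(se_earlier ** 2 + se_later ** 2)


# Hypothetical: each year's school average is off by about 1 point.
se_one_year = 1.0
se_gain = gain_standard_error(se_one_year, se_one_year)
print(f"one-year error: {se_one_year:.2f}, error in the gain: {se_gain:.2f}")
```

With equal errors in both years, the error in the gain is about 1.4 times the error in a single year's average, while the true gain being measured is usually much smaller than the score itself.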

David Rogosa, a professor of education statistics at Stanford University, is analyzing error rates in California, where teachers at schools with rising scores will soon receive bonuses as high as $25,000 each.

Professor Rogosa estimated that if awards were based on school averages alone, over one-fourth of schools with no gains would still qualify.

But California avoids this problem by insisting that schools succeed for low-income students and each minority group, as well as schoolwide. This reduces the chances of undeserved awards, because simultaneous false gains in each group are unlikely.

But the reverse is also true. Schools deserving rewards will be more likely to lose them because if any group fails as a result of random events or sampling error, the school will be disqualified. Diverse schools will fail more often; they have more subgroups where false declines can occur.
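Both sides of this trade-off follow from simple probability, sketched below. The sketch assumes each group's result is an independent chance event, which real subgroups in one school are not, and the rates of 0.25 and 0.10 are hypothetical placeholders rather than Professor Rogosa's estimates for subgroups.

```python
def prob_all_checks_pass(p_false_gain_each: float, n_checks: int) -> float:
    """Chance an undeserving school shows a false gain on every check."""
    return p_false_gain_each ** n_checks


def prob_any_check_fails(p_false_decline_each: float, n_checks: int) -> float:
    """Chance a deserving school shows a false decline on at least one check."""
    return 1.0 - (1.0 - p_false_decline_each) ** n_checks


# Hypothetical per-check error rates, assumed independent across checks.
P_FALSE_GAIN = 0.25
P_FALSE_DECLINE = 0.10

for n in (1, 2, 4):
    undeserved = prob_all_checks_pass(P_FALSE_GAIN, n)
    denied = prob_any_check_fails(P_FALSE_DECLINE, n)
    print(f"{n} checks: undeserved award {undeserved:.3f}, deserved award lost {denied:.3f}")
```

As the number of checks rises, the first figure shrinks quickly and the second grows steadily, which is the penalty on diverse schools described above.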

President Bush wants to hold schools accountable for gains schoolwide as well as for disadvantaged students. So if either the school or the disadvantaged group posts a false decline, the school could be wrongly labeled failing. But the president’s requirement that sanctions follow two successive years of failure is a partial safeguard against measurement error.
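How much protection the two-year rule offers can be estimated roughly. The sketch assumes a school's false failures are independent from one year to the next, and the one-year rate of 0.19 is a hypothetical figure, not one taken from the president's plan or from any state's data.

```python
def prob_wrongly_sanctioned(p_false_fail_one_year: float, years_required: int = 2) -> float:
    """Chance of false failures in every one of the required successive years,
    assuming the years are independent."""
    return p_false_fail_one_year ** years_required


# Hypothetical chance that an adequate school is falsely labeled in a single year.
P_FALSE_FAIL = 0.19

print(f"one year:  {prob_wrongly_sanctioned(P_FALSE_FAIL, 1):.3f}")
print(f"two years: {prob_wrongly_sanctioned(P_FALSE_FAIL, 2):.3f}")
```

The risk falls sharply but does not vanish, and across thousands of schools even a small rate means some are sanctioned by chance alone.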

The administration’s proposal is intended to spur achievement. Schools showing adequate progress will be encouraged to continue their practices; failing schools should change their ways.

But if the wrong schools are sanctioned, what message do we send? Will unfairly sanctioned schools drop methods that work? Will ineffective schools get rewards and continue poor teaching?

Surely we can find better ways to measure schools than relying on an annual test.
