These pieces originally appeared as a weekly column entitled “Lessons” in The New York Times between 1999 and 2003.
[THIS ARTICLE FIRST APPEARED IN THE NEW YORK TIMES ON SEPTEMBER 13, 2000]
How Tests Can Drop The Ball
Mike Piazza, batting .332, could win this year’s Most Valuable Player award. He has been good every year, with a .330 career average, twice a runner-up for M.V.P. and a member of each All-Star team since his rookie season. The Mets reward Piazza for this high achievement at the rate of $13 million a year.
But what if the team decided to pay him based not on overall performance but on how he hit during one arbitrarily chosen week? How well do one week’s at-bats describe the ability of a true .330 hitter?
Not very. Last week Piazza batted only .200. But in the second week of August he batted .538. If you picked a random week this season, you would have only a 7-in-10 chance of choosing one in which he hit .250 or higher.
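A quick simulation shows why. The sketch below treats each at-bat as an independent chance at a hit for a true .330 hitter and assumes roughly 25 at-bats in a week, an illustrative figure rather than Piazza’s actual workload. Under those simplifying assumptions, a week at .250 or better turns up about three times in four, in the same neighborhood as the 7-in-10 figure from his actual weeks.

```python
import random

TRUE_AVG = 0.330       # career-long "true" ability
AT_BATS = 25           # assumed at-bats in one week (illustrative)
TRIALS = 100_000       # simulated weeks

good_weeks = 0
for _ in range(TRIALS):
    hits = sum(random.random() < TRUE_AVG for _ in range(AT_BATS))
    if hits / AT_BATS >= 0.250:
        good_weeks += 1

print(f"Share of weeks at .250 or better: {good_weeks / TRIALS:.2f}")
```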
Are standardized-test scores, on which many schools rely heavily to make promotion or graduation decisions, more indicative of true ability than a ballplayer’s weekly average?
Not really. David Rogosa, a professor of educational statistics at Stanford University, has calculated the “accuracy” of tests used in California to abolish social promotion. (New York uses similar tests.)
Consider, Dr. Rogosa says, a fourth-grade student whose “true” reading score is exactly at grade level (the 50th percentile). The chances are better than even (58 percent) that this student will score either above the 55th percentile or below the 45th on any one test.
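That figure can be reproduced from a standard measurement model. The sketch below assumes normally distributed scores and a test reliability of 0.95, an assumed value typical of published reading tests rather than one reported by Dr. Rogosa; under those assumptions, a true 50th-percentile student lands outside the 45th-to-55th band about 57 percent of the time, essentially the published figure.

```python
import random
from statistics import NormalDist

RELIABILITY = 0.95     # assumed; not reported in the column
TRIALS = 100_000

norm = NormalDist()
err_sd = (1 / RELIABILITY - 1) ** 0.5   # error sd when true scores have sd 1
obs_sd = (1 + err_sd ** 2) ** 0.5       # sd of observed scores

outside = 0
for _ in range(TRIALS):
    observed = random.gauss(0.0, err_sd)        # true score fixed at z = 0
    percentile = norm.cdf(observed / obs_sd)    # rank among all test takers
    if percentile < 0.45 or percentile > 0.55:
        outside += 1

print(f"Chance outside the 45th-55th band: {outside / TRIALS:.2f}")  # ~0.57
```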
Results for students at other levels of true performance are also surprisingly inconsistent. So if students are held back, required to attend summer school or denied diplomas largely because of a single test, many will be punished unfairly.
About half of fourth-grade students held back for scores below the 30th percentile on a typical reading test will actually have “true” scores above that point. And on any particular test, nearly 7 percent of students whose true scores are at the 40th percentile will fail, scoring below the 30th percentile.
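The same model gives such cutoff misclassification rates in closed form. The function below is a sketch, again assuming normal scores and normal measurement error; with an assumed reliability of 0.96 it puts a true 40th-percentile student’s chance of failing at about 8 percent, close to the figure Dr. Rogosa reports, though the exact number is quite sensitive to the reliability assumed.

```python
from statistics import NormalDist

norm = NormalDist()

def chance_below_cutoff(true_pct, cutoff_pct, reliability):
    """Chance that a student at `true_pct` scores below `cutoff_pct`
    on one administration, assuming normal scores and normal error."""
    err_sd = (1 / reliability - 1) ** 0.5        # true scores scaled to sd 1
    obs_sd = (1 + err_sd ** 2) ** 0.5            # sd of observed scores
    true_z = norm.inv_cdf(true_pct)              # the student's true score
    cutoff = norm.inv_cdf(cutoff_pct) * obs_sd   # cutoff in observed units
    return norm.cdf((cutoff - true_z) / err_sd)

# A true 40th-percentile student facing a 30th-percentile cutoff,
# with reliability 0.96 assumed for illustration:
print(round(chance_below_cutoff(0.40, 0.30, 0.96), 2))   # about 0.08
```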
Are Americans prepared to require large numbers of students to repeat a grade when they deserve promotion?
Professor Rogosa’s analysis is straightforward. He has simply converted technical reliability information from test publishers (Harcourt Educational Measurement, in this case) to more understandable “accuracy” guides.
Test publishers calculate reliability by analyzing thousands of student tests to estimate chances that students who answer some questions correctly will also answer others correctly. Because some students at any performance level will miss questions that most students at that level get right, test makers can estimate the reliability of each question and of an entire test.
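One common internal-consistency measure built on this idea is coefficient alpha, which compares the score variation on individual questions with the variation in students’ total scores. The sketch below computes it for a made-up set of scored answer sheets; both the data and the choice of alpha are illustrative, since publishers use several related methods.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Internal-consistency reliability from scored item responses.
    `responses` is a list of student rows, each a list of 0/1 item
    scores. This is coefficient alpha, one standard formula."""
    k = len(responses[0])                       # number of items
    items = list(zip(*responses))               # item-wise columns
    item_var = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var / total_var)

# Hypothetical scored answer sheets for six students on five items:
sheets = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
print(round(cronbach_alpha(sheets), 2))   # about 0.79 for this toy data
```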
Typically, districts and states use tests marketed as having high reliability. Yet few policy makers understand that seemingly high reliability assures only rough accuracy: for example, that true 80th-percentile students will almost always score higher than true 20th-percentile students.
But when test results are used for high-stakes purposes like promotion or graduation decisions, there should be a different concern: How well do they identify students who are truly below a cutoff point like the 30th percentile? As Dr. Rogosa has shown, a single test may do a poor job of this.
Surprisingly, there has not yet been a wave of lawsuits by parents of children penalized largely because of a single test score. As more parents learn about tests’ actual accuracy, litigation regarding high-stakes decisions is bound to follow. Districts and states will then have to abandon an unfair reliance on single tests to evaluate students.
When Mike Piazza comes to bat, he may face a pitcher who fools him more easily than most pitchers do, or fools him more easily on that day. Piazza may not have slept well the night before, the lights may bother him, or he may be preoccupied by a problem at home. On average, over a full season, the distractions do not matter much, and the Mets benefit from his overall ability.
Likewise, when a student takes a test, performance is affected by random events. He may have fought with his sister that morning. A test item may stimulate daydreams not suggested by items in similar tests, or by the same test on a different day. Despite a teacher’s warning to eat a good breakfast, he may not have done so.
If students took tests over and over, average accuracy would improve, just as Mike Piazza’s full-season batting average reflects his hitting prowess more accurately than any single week does. But school is not baseball; if students took tests every day, there would be no time left for learning.
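The underlying statistics are simple: averaging several independent scores shrinks the measurement error by the square root of the number of tests. Under the same assumed model as above (normal scores, single-test reliability of 0.95), the chance that a true 50th-percentile student’s average falls outside the 45th-to-55th band drops from well over half with one test to a few percent with sixteen.

```python
from statistics import NormalDist

norm = NormalDist()
err_sd = (1 / 0.95 - 1) ** 0.5        # single-test error, reliability 0.95
obs_sd = (1 + err_sd ** 2) ** 0.5
band = norm.inv_cdf(0.55) * obs_sd    # 55th-percentile edge, single-test scale

for n_tests in (1, 4, 16):
    sd_of_mean = err_sd / n_tests ** 0.5   # error shrinks with sqrt(n)
    outside = 2 * (1 - norm.cdf(band / sd_of_mean))
    print(f"{n_tests:>2} tests: {outside:.2f} chance outside 45th-55th band")
```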
So to make high-stakes decisions, like whether students should be promoted or attend summer school, giving great importance to a single test is not only bad policy but extraordinarily unfair. Courts are unlikely to permit it much longer.