Commentary | Education

Lessons—Testing Reaches a Fork in the Road

These pieces originally appeared as a weekly column entitled “Lessons” in The New York Times between 1999 and 2003.

[ THIS ARTICLE FIRST APPEARED IN THE NEW YORK TIMES ON MAY 22, 2002 ]

Testing reaches a fork in the road

By Richard Rothstein

Children take one of two types of standardized test, one “norm-referenced,” the other “criteria-referenced.” Although those names have an arcane ring, most parents are familiar with how the exams differ.

Norm-referenced tests give percentile scores, as when a student is said to be at the 40th percentile; that means the child did better than 40 percent of a sample of students who took the test in an earlier, base year, and worse than 60 percent.

Criteria-referenced tests, on the other hand, tell if a student has learned the assigned curriculum: a child who meets the state’s standard is termed “proficient,” one at a higher standard is “advanced,” one at a lower standard is “basic,” and one even lower is “below basic.”
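
To make the two reporting schemes concrete, here is a minimal sketch in Python. The base-year sample and the cutoff scores are invented for illustration; no real test uses these numbers.

    from bisect import bisect_left

    # Hypothetical base-year norming sample of raw scores, sorted ascending.
    BASE_YEAR_SAMPLE = sorted([12, 18, 22, 25, 27, 30, 33, 35, 38, 41,
                               44, 46, 49, 52, 55, 58, 61, 64, 68, 73])

    # Hypothetical state cutoffs, expressed as raw scores, for the four bands.
    BANDS = [("below basic", 0), ("basic", 30), ("proficient", 42), ("advanced", 60)]

    def percentile_rank(raw):
        """Norm-referenced report: percent of the base-year sample scoring below raw."""
        below = bisect_left(BASE_YEAR_SAMPLE, raw)
        return 100.0 * below / len(BASE_YEAR_SAMPLE)

    def proficiency_label(raw):
        """Criteria-referenced report: the highest band whose cutoff the score meets."""
        label = BANDS[0][0]
        for name, cutoff in BANDS:
            if raw >= cutoff:
                label = name
        return label

    print(percentile_rank(42))    # 50.0 -- better than half the base-year sample
    print(proficiency_label(42))  # 'proficient' under these invented cutoffs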

The difference between the two exams has now become important, because a group of liberal Democratic senators is insisting that the new federal education law be interpreted as requiring the use of criteria-referenced tests. The senators say tests giving proficiency levels are better because they tell whether students have learned what they should know, while tests giving percentiles “compare students only to each other.”

The criticism is misguided. To see why, consider the familiar college entrance exam, the SAT. It is a norm-referenced test, although the College Board converts percentiles to plain numbers before releasing results. Thus, “500” simply means doing as well as the average senior did in 1991, when SAT norms were set; the College Board could just as well say that the student has a 50th percentile rank. Similarly, a score of 600 is only another way to say that a student has an 84th percentile rank, again relative to the base year, 1991.
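
The arithmetic behind that conversion is easy to see if one assumes, purely as a sketch, that the scaled scores follow a normal curve with mean 500 and standard deviation 100 (the College Board actually uses empirical conversion tables, so these figures are approximations):

    from statistics import NormalDist

    # Assumed norming distribution: mean 500, standard deviation 100.
    norms = NormalDist(mu=500, sigma=100)

    def score_to_percentile(score):
        """Percentile rank implied by a scaled score under the assumed norms."""
        return 100.0 * norms.cdf(score)

    def percentile_to_score(pct):
        """Scaled score implied by a percentile rank under the assumed norms."""
        return norms.inv_cdf(pct / 100.0)

    print(round(score_to_percentile(500)))  # 50 -- the base-year average
    print(round(score_to_percentile(600)))  # 84 -- one standard deviation above the mean
    print(round(percentile_to_score(84)))   # 599 -- the same fact, run backward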

Such norm-referenced scores convey a good sense of student abilities. We know that a student with a verbal score of 600 can comprehend difficult material and that one with 700 can probably excel at an elite college. That is better information than if the SAT were criteria-referenced and only labeled those two students “advanced.”

Critics of norm-referenced tests for elementary-school pupils complain that they are mostly multiple choice, without essays that a good curriculum requires. But test publishers have been addressing that complaint by adding essay questions, while still giving percentile scores.

Critics also say that tests giving percentile ranks deliberately include some questions that many students will miss and others that most students will miss (along with those that few will miss), in order to distinguish among students at varying points in the range of scores. Criteria-referenced tests, which report proficiency levels rather than ranks, do not require big score spreads, and with reason, the critics say: in good schools, all students should get most questions right.

That complaint would be valid if criteria-referenced tests told only whether students had passed or failed. But tests that have several possible results (advanced, proficient, basic and below basic) must also be designed with a range of difficulty. So the designs of criteria- and norm-referenced tests actually differ very little.

Percentile ranks are more meaningful than proficiency levels, in part because expert judgments as to what constitutes proficiency are arbitrary. There is little difference between one student at the 39th percentile and a second at the 41st. But if the 40th percentile on a norm-referenced test corresponds to minimum proficiency on a criteria-referenced test, then the criteria-referenced report will say the first student failed and the second passed.
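
A toy calculation, reusing a hypothetical cutoff at the 40th percentile, makes the boundary problem plain:

    PROFICIENCY_CUTOFF = 40  # hypothetical: "proficient" set at the 40th percentile

    def verdict(percentile):
        """Criteria-referenced pass/fail verdict for a student's percentile rank."""
        return "passed" if percentile >= PROFICIENCY_CUTOFF else "failed"

    print(verdict(39))  # 'failed'
    print(verdict(41))  # 'passed' -- though the two students are nearly identical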

Sometimes the outcomes on state criteria-referenced exams defy common sense: when, for example, “proficient” scores are achieved by 40 percent of third and fifth graders alike but by only 30 percent of fourth graders. If the criteria were valid, such results would suggest (implausibly) that fourth-grade teachers were awful but fifth-grade teachers terrific. Instead, when faced with such anomalies, states have quietly finagled their standards to place fourth-grade proficiency as well at the 40th percentile.

Criteria-referenced reporting can’t detect growth except when a student crosses one of only a few fixed points on the scale. Compare a student who goes from the equivalent of the 25th percentile to the equivalent of the 35th with one who goes from the 35th to the 45th. If a score at the 40th percentile corresponds to proficiency, criteria-referenced reports will judge the second student (going from “basic” to “proficient”) improved, but not the first.
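
Carrying the same hypothetical 40th-percentile cutoff forward, a short sketch shows why two equal gains draw unequal verdicts:

    PROFICIENCY_CUTOFF = 40  # same hypothetical cutoff as above

    def growth_verdict(before, after):
        """Compare a norm-referenced gain with the criteria-referenced verdict."""
        crossed = before < PROFICIENCY_CUTOFF <= after
        level = "improved (basic -> proficient)" if crossed else "no change in level"
        return "+%d percentile points; report says: %s" % (after - before, level)

    print(growth_verdict(25, 35))  # +10 points; report says: no change in level
    print(growth_verdict(35, 45))  # +10 points; report says: improved (basic -> proficient)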

Many people share the view that percentile ranks compare students only with one another. But as with the SAT, test norms not only come from national samples but also typically remain fixed for several years. Percentiles do not compare students within the same school (or even the whole nation) within the same year; the comparison is against test-takers in the initial national pool.

So if states don’t frequently change their tests, percentile ranks, like SAT scores, will come to signify real achievement in the public mind. That is better than switching to criteria-referenced tests, with their flawed proficiency definitions and limited ability to detect progress.
