May 20, 2001
NONE OF THE ABOVE / First of two articles
Right Answer, Wrong Score: Test Flaws Take Toll
By DIANA B. HENRIQUES and JACQUES STEINBERG
One day last May, a few weeks before commencement, Jake Plumley was pulled out of the classroom at Harding High School in St. Paul and told to report to his guidance counselor.
The counselor closed the door and asked him to sit down. The news was grim. Jake, a senior, had failed a standardized test required for graduation. To try to salvage his diploma, he had to give up a promising job and go to summer school. "It changed my whole life, that test," Jake recalled.
In fact, Jake should have been elated. He actually had passed the test. But the company that scored it had made an error, giving Jake and 47,000 other Minnesota students lower scores than they deserved.
An error like this, made by NCS Pearson, the nation's biggest test scorer, is every testing company's worst nightmare. One executive called it "the equivalent of a plane crash for us."
But it was not an isolated incident. The testing industry is coming off its three most problem-plagued years. Its missteps have affected millions of students who took standardized proficiency tests in at least 20 states.
An examination of recent mistakes and interviews with more than 120 people involved in the testing process suggest that the industry cannot guarantee the kind of error-free, high-speed testing that parents, educators and politicians seem to take for granted.
Now President Bush is proposing a 50 percent increase in the workload of this tiny industry, a handful of giants with a few small rivals. The House could vote on the Bush plan this week, and if Congress signs off, every child in grades 3 to 8 will be tested each year in reading and math. Neither the Bush proposal nor the Congressional debate has addressed whether the industry can handle the daunting logistics of this additional business.
Already, a growing number of states use these so-called high-stakes exams (not to be confused with the SAT, the college entrance exam) to determine whether students in grades 3 to 12 can be promoted or granted a diploma. The tests are also used to evaluate teachers and principals and to decide how much tax money school districts receive. How well schools perform on these tests can even affect property values in surrounding neighborhoods.
Each recent flaw had its own tortured history. But all occurred as the testing industry was struggling to meet demands from states to test more students, with custom-tailored tests of greater complexity, designed and scored faster than ever.
In recent years, the four testing companies that dominate the market have experienced serious breakdowns in quality control. Problems at NCS, for example, extend beyond Minnesota. In the last three years, the company produced a flawed answer key that incorrectly lowered multiple-choice scores for 12,000 Arizona students, erred in adding up scores of essay tests for students in Michigan and was forced with another company to rescore 204,000 essay tests in Washington because the state found the scores too generous. NCS also missed important deadlines for delivering test results in Florida and California.
"I wanted to just throw them out and hire a new company," said Christine Jax, Minnesota's top education official. "But then my testing director warned me that there isn't a blemish-free testing company out there. That really shocked me."
One error by another big company resulted in nearly 9,000 students in New York City being mistakenly assigned to summer school in 1999. In Kentucky, a mistake in 1997 by a smaller company, Measured Progress of Dover, N.H., denied $2 million in achievement awards to deserving schools. In California, test booklets have been delivered to schools too late for the scheduled test, were left out in the rain or arrived with missing pages.
Many industry executives attribute these errors to growing pains.
The boom in high-stakes tests "caught us somewhat by surprise," said Eugene T. Paslov, president of Harcourt Educational Measurement, one of the largest testing companies. "We've turned around, and responded to these issues, and made some dramatic improvements."
Despite the recent mistakes, the industry says, its error rate is infinitesimal on the millions of multiple-choice tests scored by machine annually. But that is only part of the picture. Today's tests rely more heavily on essay-style questions, which are more difficult to score. The number of multiple-choice answer sheets scored by NCS more than doubled from 1997 to 2000, but the number of essay-style questions more than quadrupled in that period, to 84.4 million from 20 million.
Even so, testing companies turn the scoring of these writing samples over to thousands of temporary workers earning as little as $9 an hour.
Several scorers, speaking publicly for the first time about problems they saw, complained in interviews that they were pressed to score student essays without adequate training and that they saw tests scored in an arbitrary and inconsistent manner.
"Lots of people don't even read the whole test: the time pressure and scoring pressure are just too great," said Artur Golczewski, a doctoral candidate, who said he has scored tests for NCS for two years, most recently in April.
NCS executives dispute his comments, saying that the company provides careful, accurate scoring of essay questions and that scorers are carefully supervised.
Because these tests are subject to error and subjective scoring, the testing industry's code of conduct specifies that they not be the basis for life-altering decisions about students. Yet many states continue to use them for that purpose, and the industry has done little to stop it.
When a serious mistake does occur, school districts rarely have the expertise to find it, putting them at the mercy of testing companies that may not be eager to disclose their failings. The surge in school testing in the last five years has left some companies struggling to find people to score tests and specialists to design them.
"They are stretched too thin," said Terry Bergeson, Washington State's top education official. "The politicians of this country have made education everybody's top priority, and everybody thinks testing is the answer for everything."
The Mistake: When 6 Wrongs Were Rights
The scoring mistake that plagued Jake Plumley and his Minnesota classmates is a window into the way even glaring errors can escape detection. In fact, NCS did not catch the error. A parent did.
Martin Swaden, a lawyer who lives in Mendota Heights, Minn., was concerned when his daughter, Sydney, failed the state's basic math test last spring. A sophomore with average grades, Sydney found math difficult and had failed the test before.
This time, Sydney failed by a single answer. Mr. Swaden wanted to know why, so he asked the state to see Sydney's test papers. "Then I could say, `Syd, we gotta study maps and graphs,' or whatever," he explained.
But curiosity turned to anger when state education officials sent him boilerplate e-mail messages denying his request. After threatening a lawsuit, Mr. Swaden was finally given an appointment. On July 21, he was ushered into a conference room at the department's headquarters, where he and a state employee sat down to review the 68 questions on Sydney's test.
When they reached Question No. 41, Mr. Swaden immediately knew that his daughter's "wrong" answer was right.
The question showed a split-rail fence, and asked which parts of it were parallel. Sydney had correctly chosen two horizontal rails; the answer key picked one horizontal rail and one upright post.
"By the time we found the second scoring mistake, I knew she had passed," Mr. Swaden said. "By the third, I was concerned about just how bad this was."
After including questions that were being field-tested for future use, someone at NCS had failed to adjust the answer key, resulting in 6 wrong answers out of 68 questions. Even worse, two quality control checks that would have caught the errors were never done.
Eric Rud, an honor-roll student except in math, was one of those students mislabeled as having failed. Paralyzed in both legs at birth, Eric had achieved a fairly normal school life, playing wheelchair hockey and dreaming of becoming an architect. But when he was told he had failed, his spirits plummeted, his father, Rick Rud, said.
Kristle Glau, who moved to Minnesota in her senior year, did not give up on high school when she became pregnant. She persevered, and assumed she would graduate because she was confident she had passed the April test, as, in fact, she had.
"I had a graduation party, with lots of presents," she recalled angrily. "I had my cap and gown. My invitations were out." Finally, she said, her mother learned what her teachers did not have the heart to tell her: according to NCS, she had failed the test and would not graduate.
When the news of NCS's blunder reached Ms. Jax, the state schools commissioner, she wept. "I could not believe," she said, "how we could betray children that way."
But when she learned that the error would have been caught if NCS had done the quality control checks it had promised in its bid, she was furious. She summoned the chief executive of NCS, David W. Smith, to a news conference and publicly blamed the company for the mistake.
Mr. Smith made no excuses. "We messed up," he said. "We are extremely sorry this happened." NCS has offered a $1,000 tuition voucher to the seniors affected, and is covering the state's expenses for retesting. It also paid for a belated graduation ceremony at the State Capitol.
Jake Plumley and several other students are suing NCS on behalf of Minnesota teenagers who they say were emotionally injured by NCS's mistake. NCS has argued that its liability does not extend to emotional damages.
The court cases reflect a view that is common among parents and even among some education officials: that standardized testing should be, and can be, foolproof.
The Task: Trying to Grade 300 Million Test Sheets
The mistake that derailed Jake Plumley's graduation plans occurred in a bland building in a field just outside Iowa City. From the driveway on North Dodge Street, the structure looks like an overgrown suite of medical offices with a small warehouse in the back.
Casually dressed workers, most of them hired for the spring testing season, gather outside a loading dock to smoke, or wander out for lunch at Arby's.
This is ground zero for the testing industry, NCS's Measurement Services unit. More of the nation's standardized tests are scored here than anywhere else. Last year, nearly 300 million answer sheets coursed through this building, the vast majority without mishap. At this facility and at other smaller ones around the country, NCS scores a big chunk of the exams from other companies. What the company does in this building affects not only countless students, but the reputation of the entire industry.
Inside, machines make the soft sound of shuffling cards as they scan in student answers to multiple-choice questions. Handwritten answers are also scanned in, to be scored later by workers.
But behind the soft whirring and methodical procedures is an often frenzied rush to meet deadlines, a rush that left many people at the company feeling overwhelmed, current and former employees said.
"There was a lack of personnel, a lack of time, too many projects, too few people," sighed Nina Metzner, an education assessment consultant who worked at NCS. "People were spread very, very thin."
Those concerns were echoed by other current and former NCS employees, several of whom said those pressures had played a role in the Minnesota error and other problems at the company.
Mr. Smith, the NCS chief executive, disputed those reports. The company has sustained a high level of accuracy, he said, by matching its staffing to the volume of its business. The Minnesota mistake, he said, was not caused by the pressures of a heavy workload but by "pure human error caused by individuals who had the necessary time to perform a quality function they did not perform."
Betsy Hickok, a former NCS scoring director, said she had worked hard to ensure the accurate scoring of essays. But that became more difficult, she said, as she and her scorers were pressed into working 12-hour days, six days a week.
"I became concerned," Ms. Hickok said, "about my ability, and the ability of the scorers, to continue making sound decisions and keeping the best interest of the student in mind."
Mr. Smith said NCS was "committed to scoring every test accurately."
The Workers: Some Questions About Training
The pressures reported by NCS executives are affecting the temporary workers who score the essay questions in vogue today, said Mariah Steele, a former NCS scorer and a graduate student in Iowa City.
In today's tight labor markets, Ms. Steele is the testing industry's dream recruit. She is college-educated but does not have a full-time job; she lives near a major test-scoring center and is willing to work for $9 an hour.
For her first two evenings, she and nearly 100 other recruits were trained to score math tests from Washington State. This training is critical, scoring specialists say, to make sure that scorers consistently apply a state's specific standards, rather than their own.
But one evening in late July, as the Washington project was ending, Ms. Steele said, she was asked by her supervisor to stop grading math and switch to a reading test from another state, without any training.
"He just handed me a scoring rubric and said, `Start scoring,' " Ms. Steele said. Perhaps a dozen of her co-workers were given similar instructions, she added, and were offered overtime as an inducement.
Baffled, Ms. Steele said she read through the scoring guide and scored tests for about 30 minutes. "Then I left, and didn't go back," she said. "I really was not confident in my ability to score that test."
Two other former scorers for NCS say they saw inconsistent grading.
Renée Brochu of Iowa City recalled when a supervisor explained that a certain response should be scored as a 2 on a two-point scale. "And someone would gasp and say, `Oh, no, I've scored hundreds of those as a 1,' " Ms. Brochu said. "There was never the suggestion that we go back and change the ones already scored."
Another former scorer, Mr. Golczewski, accused supervisors of trying to manipulate results to match expectations. "One day you see an essay that is a 3, and the next day those are to be 2's because they say we need more 2's," he said.
He recalled that the pressure to produce worsened as deadlines neared. "We are actually told," he said, "to stop getting too involved or thinking too long about the score, to just score it on our first impressions."
Mr. Smith of NCS dismissed these anecdotes as aberrations that were probably caught by supervisors before they affected scores.
"Mistakes will occur," he said. "We do everything possible to eliminate those mistakes before they affect an individual test taker."
New York City did not use NCS to score its essay-style tests; instead, like a few other states, it used local teachers. But like the scorers in Iowa, they also complained that they had not been adequately trained.
One reading teacher said she was assigned to score eighth-grade math tests. "I said I hadn't been in eighth-grade math class since I was in eighth grade," she said.
Another teacher, she said, arrived late at the scoring session and was put right to work without any training.
Roseanne DeFabio, assistant education commissioner in New York State, said she thought the complaints were exaggerated. State audits each year of 10 percent of the tests do not show any major problems, she said, "so I think it's unlikely that there's any systemic problem with the scoring."
The Demand: States Pushing For More, Faster
Testing specialists argue that educators and politicians must share the blame for the rash of testing errors because they are asking too much of the industry.
They say schools want to test as late in the year as possible to maximize student performance, while using tests that take longer to score. Yet schools want the results before the school year ends so they can decide about school financing, teacher evaluations, summer school, promotions or graduation.
"The demands may just be impossible," said Edward D. Roeber, a former education official who is now vice president for external affairs for Measured Progress.
The law's "unrealistic" deadlines, state auditors said later, contributed to the numerous quality control problems that plagued the test contractor, Harcourt Educational Measurement, for the next two years.
That state audit, and an audit done for Harcourt by Deloitte & Touche, paint a devastating portrait of what went wrong. There was not time to test the computer link between Harcourt, the test contractor, and NCS, the subcontractor. When needed, it did not work, causing delays. Some test materials were delivered so late that students could not take the tests on schedule.
It got worse. Pages in test booklets were duplicated, missing or out of order. One district's test booklets, more than two tons of paper, were dumped on the sidewalk outside the district offices at 5 p.m. on a Friday in the rain. Test administrators were not adequately trained. When school districts got the computer disks from NCS that were supposed to contain the test results, some of the data was inaccurate and some of the disks were blank.
In 1998, nearly 700 of the state's 8,500 schools got inaccurate test results, and more than 750,000 students were not included in the statewide analysis of the test results.
Then, in 1999, Harcourt made a mistake entering demographic data into its computer. The resulting scores made it appear that students with a limited command of English were performing better in English than they actually were, a politically charged statistic in a state that had voted a year earlier to eliminate bilingual education in favor of a one-year intensive class in English.
"There's tremendous political pressure to get tests in place faster than is prudent," said Maureen G. DiMarco, a vice president at Houghton Mifflin.
Dr. Paslov, who became president of Harcourt Educational Measurement after the 1999 problems, said that the current testing season in California is going smoothly and that Harcourt has addressed concerns about errors and delays.
But California is still sprinting ahead.
In 1999, Gov. Gray Davis signed a bill directing state education officials to develop another statewide test, the California High School Exit Exam. Once again, industry executives said, speed seemed to trump all other considerations.
None of the major testing companies bid on the project because of what Ms. DiMarco called "impossible, unrealistic time lines."
With no bidders, the state asked the companies to draft their own proposals. "We had just 10 days to put it together," recalled George W. Bohrnstedt, senior vice president for research at the American Institutes for Research, which has done noneducational testing but is new to school testing.
Phil Spears, the state testing director, said A.I.R. faced a "monumental task, building and administering a test in 18 months."
"Most states," Mr. Spears said, "would take three-plus years to do that kind of test."
The new test was given for the first time this spring.
The Concern: Life Choices Based on a Score
States are not just demanding more speed; they are demanding more complicated exams. Test companies once had a steady business selling the same brand-name tests, like Harcourt's Stanford Achievement Test or Riverside's Iowa Test of Basic Skills, to school districts. These "shelf" tests, also called norm-referenced tests, are the testing equivalent of ready-to-wear clothing. Graded on a bell curve, they measure how a student is performing compared with other students taking the same tests.
But increasingly, states want custom tailoring, tests designed to fit their homegrown educational standards. These "criterion-referenced" tests measure students against a fixed yardstick, not against each other.
That is exactly what Arizona wanted when it hired NCS and CTB/McGraw-Hill in December 1998. What it got was more than two years of errors, delays, escalating costs and angry disappointment on all sides.
Some of the problems Arizona encountered occurred because the state had established standards that, officials later conceded, were too rigorous. But the state blames other disruptions on NCS.
"You can't trust the quality assurance going on now," said Kelly Powell, the Arizona testing director, who is still wrangling with NCS.
For its part, NCS has thrown up its hands on Arizona. "We've given Arizona nearly $2 of service for every dollar they have paid us," said Jeffrey W. Taylor, a senior vice president of NCS. Mr. Taylor said NCS would not bid on future business in that state.
Each customized test a state orders must be designed, written, edited, reviewed by state educators, field-tested, checked for validity and bias, and calibrated to previous tests, an arduous process that requires a battery of people trained in educational statistics and psychometrics, the science of measuring mental function.
While the demand for such people is exploding, they are in extremely short supply despite salaries that can reach into the six figures, people in the industry said. "All of us in the business are very concerned about capacity," Mr. Bohrnstedt of A.I.R. said.
And academia will be little help, at least for a while, because promising candidates are going into other, more lucrative areas of statistics and computer programming, testing executives say.
Kurt Landgraf, president of the Educational Testing Service in Princeton, N.J., the titan of college admission tests but a newcomer to high-stakes state testing, estimated that there are about 20 good people coming into the field every year.
Already, the strain on the test-design process is showing. A supplemental math test that Harcourt developed for California in 1999 proved statistically unreliable, in part because it was too short. Harcourt had been urged to add five questions to the test, state auditors said, but that was never done.
Even more troubling, most test professionals say, is the willingness of states like Arizona to use standardized tests in ways that violate the testing industry's professional standards. For example, many states use test scores for determining whether students graduate. Yet the American Educational Research Association, the nation's largest educational research group, specifically warns educators against making high-stakes decisions based on a single test.
Among the reasons for this position, testing professionals say, is that some students are emotionally overcome by the pressure of taking standardized tests. And a test score, "like any other source of information about a student, is subject to error," noted the National Research Council in a comprehensive study of high-stakes testing in 1999.
But industry executives insist that, while they try to persuade schools to use tests appropriately, they are powerless to enforce industry standards when their customers are determined to do otherwise. A few executives say privately that they have refused to bid on state projects they thought professionally and legally indefensible.
"But we haven't come to the point yet, and I don't know if we will, where we are going to tell California, where we sell $44 million worth of business, `Nope! We don't like the way you people are using these instruments, so we're not going to sell you this test,' " Dr. Paslov said.
Besides, as one executive said, "If I don't sell them, my competitors will."
The Expectations: Bush Proposal Raises the Bar
President Bush explained in a radio address on Jan. 24 why he wanted to require annual testing of students in grades 3 to 8 in reading, math and science. "Without yearly testing," he said, "we do not know who is falling behind and who needs our help."
While many children will clearly need help, so will the testing industry if it is called upon to carry out Mr. Bush's plan, education specialists said.
Currently, only 13 states test for reading and math in all six grades required by the Bush plan. If Mr. Bush's plan is carried out, the industry's workload will grow by more than 50 percent.
Ms. Jax, Minnesota's top school official, says she is not close to being ready. "It's just impossible to find enough people," she said. "I will have to add at least four tests. I don't have the capacity for that, and I'm not convinced that the industry does either."
Certainly the industry has been generating revenues that could support some expansion. In 1999, its last full year as an independent company, NCS reported revenues of more than $620 million, up 30 percent from the previous year. The other major players, all corporate units, do not disclose revenues.
Several of the largest testing companies have assured the administration that the industry can handle the additional work. "It's taken the testing industry a while to gear up for this," said Dr. Paslov of Harcourt. "But we are ready."
Other executives are far less optimistic. "I don't know how anyone can say that we can do this now," said Mr. Landgraf of the Educational Testing Service.
Russell Hagen, chief executive of the Data Recognition Corporation, a midsize testing company in Maple Grove, Minn., worries that the added workload from the Bush proposal would create even more quality control problems, with increasingly serious consequences for students. "Take the Minnesota experience and put it in 50 states," he said.
The Minnesota experience is still a fresh fact of life for students like Jake Plumley, who is working nights for Federal Express and hoping to find another union job like the one he gave up last summer.
But despite his difficult experience, he does not oppose the kind of testing that derailed his post-graduation plans. "The high-stakes test, it keeps kids motivated. So I understand the idea of the test," he said. "But they need to do it right."