Putting Education Reform To The Test

New Florida Writing Test Will Use Computers To Grade Student Essays

Florida writing tests will be graded by a human and a computer program, according to bid documents for the new test. And just 2 percent of students will take a pencil and paper exam in 2015.

A computer program will grade student essays on the writing portion of the standardized test set to replace the FCAT, according to bid documents released by the Florida Department of Education.

Each essay will be scored by a human and a computer, but the computer's score will matter only if it differs significantly from the human reviewer's. If that happens, the documents indicate the essay will be scored by another human reviewer.
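The routing rule described above can be sketched in a few lines of Python. This is only an illustration of the logic as reported, not AIR's procedure: the names `route_essay` and `Decision` are hypothetical, and the one-point cutoff is an assumption, since the reported documents do not say what counts as "significantly different."

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the article does not define "significantly different".
# A gap of more than 1 rubric point is used here purely for illustration.
ASSUMED_THRESHOLD = 1


@dataclass
class Decision:
    needs_second_human: bool
    score: Optional[int]  # None until a second human has reviewed the essay


def route_essay(human_score: int, computer_score: int,
                threshold: int = ASSUMED_THRESHOLD) -> Decision:
    """Decide whether the human score stands or the essay is re-read."""
    if abs(human_score - computer_score) > threshold:
        # Scores diverge: the essay goes to another human reviewer.
        return Decision(needs_second_human=True, score=None)
    # Scores are close enough: the human reviewer's score stands.
    return Decision(needs_second_human=False, score=human_score)
```

Under these assumptions, `route_essay(4, 3)` keeps the human score of 4, while `route_essay(5, 2)` flags the essay for a second human reader. How the final score is set after that second reading is not described in the article, so the sketch leaves it open.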

Florida writing tests are currently graded by two human scorers, and the state has never used computerized grading on the exam.

The Florida Department of Education announced Monday it chose the non-profit American Institutes for Research to produce new tests tied to Florida’s Common Core-based math and language arts standards. Spokesmen for the agency and AIR said they had yet to sign a contract, were still working out the details and declined to comment about the specifics of the new test.

“It’s speculative at this point to think about what is on the assessments,” said Joe Follick, communications director for the Florida Department of Education.

But the bid documents show that using computers to grade the state writing test will save $30.5 million over the course of the six-year, $220 million contract with AIR. The change was part of a list of cuts that trimmed more than $100 million from AIR’s initial proposal.

The documents also indicate Florida will license its test items from Utah in 2015, the first year the new Florida test will be given. AIR will create Florida-specific questions by the time the test is administered in 2016, saving $20.4 million in licensing fees.

Florida would save another $14.5 million by limiting the number of pencil and paper tests in favor of online exams. The documents call for just 2 percent of tests to be delivered by pencil and paper the first two years, and 1 percent in future years.

That would put more pressure on school districts to ensure they have the bandwidth and computers necessary to administer the new test.

And Florida will eliminate all paper reporting of test results, saving $14 million.
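For readers who want to check the arithmetic, the four savings figures cited above can be tallied with a short Python sketch. The labels and the name `SAVINGS_TENTHS` are illustrative; the dollar figures are the ones reported in this article, and it is assumed here that all four belong to the same list of cuts.

```python
# Savings itemized in the article, in tenths of a million dollars.
# Integers are used so the sum is exact.
SAVINGS_TENTHS = {
    "computer-assisted essay scoring": 305,
    "Utah items in 2015, Florida-specific questions in 2016": 204,
    "fewer pencil-and-paper tests": 145,
    "no paper reporting of results": 140,
}

total = sum(SAVINGS_TENTHS.values()) / 10
print(f"Itemized savings: ${total:.1f} million")
```

The four items add up to $79.4 million, so they account for most, but not all, of the more than $100 million trimmed from the initial proposal. The remaining reductions are not itemized in the article.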

The use of computer-graded essays may become a necessity, said University of Akron researcher Mark Shermis, because Common Core-tied exams will expand the number of students taking writing exams each year.

Currently, Florida students in grades four, eight and ten take the FCAT writing exam. Under Common Core, students take a writing exam every year.

Florida and 44 other states have fully adopted the Common Core. The standards outline what students should know at the end of each grade.

“Even if you had the money,” Shermis said, “you wouldn’t have the people to do the vast amount of grading required under the Common Core State Standards.”

Shermis found that computer programs, including AIR’s AutoScore, performed at least as well as human graders in two of three trials. His research concluded computers were reliable enough to be used as a second reviewer for high-stakes tests.

But while the technology is improving, Shermis said districts need to study whether computer-graded essays put any class of students at a disadvantage.

Other researchers are less bullish on the technology.

“Of the 12 errors noted in one essay, 11 were incorrect,” Les Perelman of the Massachusetts Institute of Technology told our colleagues at StateImpact Ohio in 2012. “There were a few places where I intentionally put in some comma errors and it didn’t notice them. In other words, it doesn’t work very well.”

Many states and the two multi-state consortia developing Common Core-tied tests said they are watching computerized essay grading.

Utah has used computer essay grading since 2010, said Utah Department of Education spokesman Mark Peterson. The state trusts the technology enough that computers provide the primary scoring for the state’s writing exams. Peterson said state reviews have found fewer biases in computer grading than human grading.

Utah uses Measurement Incorporated technology to grade essays and will switch to AIR when the current contract runs out, Peterson said.

Smarter Balanced spokesman Jackie King said the test would use only human grading on the writing portion, but that the technology is promising. Officials with the Partnership for Assessment of Readiness for College and Careers, or PARCC, said they have not yet made a decision about the use of computerized grading.


  • midgardia

    The title of this article is completely misleading. As the content section clearly states, a *human* grader scores the writing component. The computer is nothing more than an accessory verifier of scores.

    The author of this piece needs to take a course in ethics.

  • PAGster
  • Mike

    If we are judging a computer’s ability to judge student writing by its ability to identify misplaced commas, we’re missing the point. A computer cannot assess a student’s ability to think critically and put that thinking into writing. Because computers can’t think. (See, I used a sentence fragment for emphasis. I guess my computer robot judge is docking me points right now.)

    Asking a non-reasoning computer to assess whether students can reason is asinine. If anyone had any doubt that the whole standardized testing movement was more about profiting off public school dollars than helping students learn, this piece of evidence should be the clincher.

  • racdula

    Writing? There should be a more up-to-date word to describe what poses as writing today. Most people who actually use a writing tool today can’t even hold it correctly.
    And far too many ‘print’ letters; they do not write!

  • Folks – in a high-stakes test no one (human or computer) really reads or comprehends the subtlety of thinking (or its lack) within an essay. It’s also the case that the scoring process yields little useful for helping a student improve his or her writing.

    Consider the following facts:
    1) Assessment firms develop essay prompts.
    2) They develop a (usually single-dimensional) rubric. This is the point where the baby tends to go out with the bathwater – because it’s a decision to compress all possible student essays into 4 or 5 clusters.
    3) Along with the essay, the firm develops a set of essays representative of each level within the rubric (these are known as “anchor papers” or a “training set”).
    4) You train a team of human scorers using the anchor papers – and correct for deviance from the “official scores”.
    5) Once the scoring team is trained, you present them with real papers (along with a trickle of anchor papers to ensure their training doesn’t drift out of spec).
    6) When two humans disagree by only a point, you split the difference (e.g., if I score a paper as “3” and you score it as “4”, it is reported as “3.5”).
    7) When two humans differ by more than one point, the scoring firm passes the paper to a senior scorer to arbitrate and assign a final score. If the dispute shows that there’s a “loophole” in the rubric implying another valid interpretation, then the rubric is updated and the affected papers might be rescored. This usually doesn’t happen beyond the field test.

    At this point, all we’ve done is train humans to mechanically score papers to force them into a rubric. It’s not that far a jump to use the computer as the second scorer – by training it with the same anchor set as the humans. The computer doesn’t get bored or tired so it applies its “understanding” of the rubric consistently. In fact, Prof Shermis’ report shows that the computer can achieve an “inter-rater reliability” higher than a second human.

    The flaw is not using a computer, but causing students to write essays for which they will get no substantive feedback that helps them improve as writers. If test sponsors simply want to bin students based on a rubric, there’s not much loss by having a computer cut the labor cost by 50%. Summative testing isn’t terribly interested in stimulating learning – it’s merely an ex post facto quality control check on what happened prior to the test.

  • Les Perelman

    See my refutation of the Shermis study

