As millions of students prepare to take, for the first time, a battery of computer-based assessments aligned with the Common Core, at least portions of the tests will have to be scored the old-fashioned way: by humans.

That’s because the so-called Smarter Balanced tests, aligned with the Common Core State Standards, include essay questions designed to measure critical thinking skills. Even the math tests require students to explain how they reach their answers.

And unlike the old multiple-choice California Standards Tests that students took every year until the spring of 2013, those more complex portions of the Smarter Balanced tests can’t be easily scored by machine.

To score them, the Educational Testing Service,* which will administer the tests under contract with the California Department of Education, is in the process of hiring 6,500 scorers in California. It has nearly reached that goal: as of Feb. 19, it had recruited 6,294 people to work as hand scorers, pending certification. Of those, 3,777 had passed certification, according to the California Department of Education.

The use of people rather than computers to rate essay questions has been routine for portions of tests like the Graduate Record Examinations and the Graduate Management Admission Test. But human scorers will be grading many more answers on the Smarter Balanced tests than on previous tests taken by K-12 students in California.

While human scoring has advantages over automated computer scoring on essay-type questions, it also has its disadvantages. How well the testing service recruits and trains scorers could have an impact on individual students’ scores.

A 2013 paper published by the Educational Testing Service noted that “humans can make mistakes due to cognitive limitations that can be difficult or even impossible to quantify, which in turn can add systematic biases to the final scores.” That’s on top of the logistics of managing and training – and paying – thousands of scorers, a process that the testing service paper described as “labor intensive, time consuming and expensive.”

According to an Educational Testing Service recruiting flyer, becoming a test “rater” – the term used in testing parlance – requires a bachelor’s degree in any field, although teaching experience is “strongly preferred.”

Among those who have been hired so far, only 241 are current California teachers.

One of them is Christopher Vue, a math teacher at Washington Union High School in Easton, near Fresno. To get certified as a rater, he recently sat down in front of his home computer to figure out the best way to grade the critical thinking skills of students he’s never met. The test results he was asked to score were those of students who took the Smarter Balanced field tests administered last spring.

On one middle school math problem on “proportional relationships” that Vue was asked to score, a student’s response could earn up to three points. A clear set of guidelines provided by the testing service helped him figure out how to score five possible responses to the same problem, he said.

It took Vue two hours to read and review all of the material for the training and certification. When he begins scoring student responses this spring, he said, he will be able to ask a team leader if he is unsure how to score a particular answer.

While Vue will earn $13 an hour for working as a scorer, his reason for signing up was not the extra cash. “Because the test is so new, I wanted to see exactly what they’re looking for when they’re assessing students,” Vue said.

Natalie Albrizzio, a math specialist in the Ventura Unified School District, had a similar motivation for becoming a test scorer. But she has already had experience with the process.

When California adopted the Common Core Standards in 2010, her school district created its own math exams that required students to explain the reasoning behind their answers. To determine how to score those exams on a scale of 0 to 3, she said, teachers discussed all the possible responses to the questions. For example, she said, they pondered whether to give a student whose answer was nonsensical a 1 for effort.

The funds to pay for hand scorers will come out of a $24 million budget that’s set aside for test processing, scoring and analysis of the Smarter Balanced assessments, according to officials at the California Department of Education.

California may have used hand scorers in a more limited way on statewide K-12 assessments, but their use has been widespread for decades in other states, including Washington and Connecticut, according to Shelbi Cole, the deputy director of content for the Smarter Balanced Assessment Consortium.

She said that the math and English Language Arts scoring guidelines that Smarter Balanced has distributed to California and other states using its assessments were developed by educators at meetings where they considered numerous answers students might provide to the test questions.

What kinds of responses earn a high score depends on the complexity or difficulty of a test item. For example, to earn a top score of 4 on an essay in which a student argues that the British Museum should return the Rosetta Stone to Egypt requires clear sourcing and citations, the use of expert opinions to rebut opposing views, and the appropriate use of vocabulary.

The Educational Testing Service is looking into ways more of the Common Core tests can be scored without human intervention, and experts in the testing field believe that more machine scoring is inevitable. The 2013 testing service report concluded that “advances in artificial intelligence technologies have made machine scoring of essays a realistic option… and that it will be used more widely in educational assessments in the near future.”

But for now, scorers like Vue and Albrizzio will be essential to the process.

Administration of the Smarter Balanced assessments will begin in some districts at the end of March, and testing will run into June, depending on the district. Vue is now waiting for instructions about what to do next, including being told which grade levels he’ll be expected to score. “I feel like I’m really in the dark about what’s going to happen next,” he said.

Like Vue, Albrizzio is looking forward to getting a closer look at the tests themselves. “Especially for math, I think it’s important for teachers to participate,” she said.

*Correction: An earlier version of this article incorrectly stated the name of the Educational Testing Service. The story was also updated to reflect the extent to which California is using hand scorers for K-12 assessments.

Comments (7)

  1. Melanee Johnson 8 years ago

    As stated by a previous commenter, this is not the first time California has used human scorers. The state writing test, which was just done away with a couple of years ago, has always been scored by humans. And back in the 90s, during the short-lived CLAS era, some colleagues of mine and I, along with hundreds of other teachers, scored writing, math, or reading responses. After calibrating a small group of scorers, our trainer sat at the table with us as we scored. Items were scored by 2 people, and the trainer checked to make sure there was no more than a 1-point discrepancy in the 2 scores. If the scores on a particular test item were 2 points or more different (using a 4-point rubric), the trainer would then score the item herself, and the scorer who was “off” had to be recalibrated.

  2. el 8 years ago

    Asking the dreaded “explain your answer” question for a standardized math test seems extremely problematic to me. Unless it is drilled ahead of time, it’s likely to degrade the scores of kids who are actually pretty capable, and that drilling time required to highlight all the elements expected in the answer is likely to end up being a distraction from the time for the actual material, rather than an activity that increases understanding.

  3. Replies

    • el 8 years ago

      I agree. The best way to ensure your schools are graduating kids who are college ready is to have an entrance exam and a high tuition.

      • Gary Ravani 8 years ago

        Actually, El, it’s a bit more complicated than that.

        First, a caveat about any comparison of private, charter, and regular public schools (see below): the small numbers of children attending non-regular public schools make the statistical comparisons “iffy.” That being said, what follows is just a “raw” comparison of three high schools from a Stanford study that looks at SAT scores and college attendance. It shows that “comparable” schools, whether private, regular public, or magnet, have similar results for students in those areas, though in the case of the “magnet” school (that is, when a public school is allowed to treat students in the same fashion as private schools, with entrance tests and the like), the magnet school students outperform all other students by considerable margins.

        The main thing you can glean from looking at a broad array of studies is that 1) students’ SES factors are the main determinant in SAT scores, college attendance, etc.; and 2) with the caveats in mind, when public and private school students are compared and adjustments are made for SES differences, public school students outperform other students in several significant ways.

        (Note: It’s just got to be amazing that the public high school cited in Orange County performs so well and seems to have dodged all of the “bad teachers” that allegedly plague the rest of CA’s schools even though that school operates under the same statutes as lesser performing schools and draws from the same pool of teachers. Then again, maybe it’s not statutes and teachers that make up the greater part of the differences in school performance. Just sayin’.)

        National Student Clearinghouse Research Center

        “It is important to note, however, that the sample sizes for charter and private high schools are relatively smaller than those for public non-charter schools. The results for charter and private schools are therefore subject to higher variance and uncertainty than the results for public non-charter schools.”

        Public Education vs. Private Education (excerpted)

        Robin Walker
        Spring Quarter 1998-1999
        Stanford University

        “Mater Dei High School is a private, Catholic school located in Orange County, California. In order to be admitted into this school, students must pass a standardized exam, which is also used for the placement of students into more advanced classes, and students must also write a personal essay stating why he or she would like to attend the school.

        In respect to SAT scores, Mater Dei students average a verbal score of 522 and a math score of 527. These scores are 26 and 13 points higher than the average scores for California students, and 17 and 16 points higher than the U.S. average. From this data, one sees that a private education is definitely a worthwhile option. To emphasize the validity of this statement, 97% of the graduating class attends college, with 70% attending a four-year college and the remaining 27% going to a two-year college.

        Los Alamitos High School is one such public school that provides a comparable education when compared to a private school. The school is located in Orange County, California and is responsible for providing a college-preparatory education for 2836 students.

        When comparing the students’ SAT scores, 524 in verbal and 541 in math, to Mater Dei’s average, Los Alamitos actually ranks higher. The graduation rate at Los Alamitos is 98.7%, with 86% of those students moving on towards a college education. Of the 86% who goes to college, 52% attend a four-year college and 34% go to a two-year college. Those 14% who do not move on, either enroll in vocational training, the military, or find employment.

        Another type of high school education that was mentioned earlier was magnet schools. One example of a magnet school is Thomas Jefferson High School for Science and Technology, which is located in Alexandria, Virginia. There are exactly 1600 students at the school, with 400 in each grade level. In order to apply for admission into Thomas Jefferson, students must live in a region served by the high school. Along with this, students must meet three criteria: have an aptitude for successful study of science, math, computer science, and related technology, have a record of prior academic achievement, and pass an admissions examination.

        With such a top-level education, it is no surprise that the average SAT score at Thomas Jefferson is 1470, with a 690 average in verbal and 760 in math. This average surpasses the national average by a couple of hundred points. The SAT scores alone prove that magnet schools have the potential for being just as good as the best private schools. This is substantiated when one looks at the graduation rate of Thomas Jefferson. Of the 400 senior students, 392 graduate, with 99% of those students attending a four-year university. “

  4. Bruce William Smith 8 years ago

    Even if reforms such as these are moving in the right direction, a badly inferior service is still being foisted on the state school pupils in California and other states. Tests like these, federally required by No Child Left Behind, have not closed the achievement gap between the United States and leading educational jurisdictions overseas; instead, the gap has widened. The Common Core won’t close it, since its mathematics standards leave American students three years behind the Chinese and two years behind everyone else in east Asia, even if they are learned properly and on schedule. Assessments based on inferior standards cannot themselves be particularly good; and this article, which incorrectly asserts that essays have not been used in California’s K-12 schools previously (the essay portion of the California High School Exit Exam has been in existence for many years), shows that the use of non-educators to rate the exams is a cheap practice Americans have not gotten away from, in contrast to European countries, where teachers, and only teachers, rate essay responses as a basic portion of their job duties.

  5. Donald LaPlante 8 years ago

    The statement that this is the first time that raters have been used to score tests by K-12 students is simply incorrect. I served as a rater for some of the English/Language Arts essays about 20 years ago. We sat in a big room for a week double rating each essay. (That was how students could get an “odd” number score. One rater gave a “3” and another a “4,” etc. That was too expensive so they went to one reader and just doubled that person’s score to get the old 2 to 8 rating.)

    Computers can do a lot of things, but they aren’t likely to spot some nuances and the like. I’m sure an essay by Hemingway probably would get a “1” from the computer.