“Robo-readers aren’t as good as human readers — they’re better,” the headline says. Hmmm. Annie Murphy Paul writes,

Instructors at the New Jersey Institute of Technology have been using a program called E-Rater in this fashion since 2009, and they’ve observed a striking change in student behavior as a result. Andrew Klobucar, associate professor of humanities at NJIT, notes that students almost universally resist going back over material they’ve written. But, Klobucar told Inside Higher Ed reporter Scott Jaschik, his students are willing to revise their essays, even multiple times, when their work is being reviewed by a computer and not by a human teacher. They end up writing nearly three times as many words in the course of revising as students who are not offered the services of E-Rater, and the quality of their writing improves as a result. Crucially, says Klobucar, students who feel that handing in successive drafts to an instructor wielding a red pen is “corrective, even punitive” do not seem to feel rebuked by similar feedback from a computer….

When critics like Les Perelman of MIT claim that robo-graders can’t be as good as human graders, it’s because robo-graders lack human insight, human nuance, human judgment. But it’s the very non-humanness of a computer that may encourage students to experiment, to explore, to share a messy rough draft without self-consciousness or embarrassment. In return, they get feedback that is individualized, but not personal — not “punitive,” to use the term employed by Andrew Klobucar of NJIT.

There are some serious conceptual confusions and evaded questions here. The most obviously evaded question is this: When students are robo-graded, the quality of their writing improves by what measure?

Les Perelman’s objections are vital here. He has written,

Robo-graders do not score by understanding meaning but almost solely by use of gross measures, especially length and the presence of pretentious language. The fallacy underlying this approach is confusing association with causation. A person makes the observation that many smart college professors wear tweed jackets and then believes that if she wears a tweed jacket, she will be a smart college professor.

Robo-graders rely on the same twisted logic. Papers written under time pressure often have a significant correlation between length and score. Robo-graders are able to match human scores simply by over-valuing length compared to human readers. A much publicized study claimed that machines could match human readers. However, the machines accomplished this feat primarily by simply counting words.

And there’s this:

ETS says its computer program tests “organization” in part by looking at the number of “discourse units” – defined as having a thesis idea, a main statement, supporting sentences and so forth. But Perelman said that the reward in this measure of organization is for the number of units, not their quality. He said that under this rubric, discourse units could be flopped in any order and would receive the same score – based on quantity.

Other parts of the formula, he noted, punish creativity. For instance, the computer judges “topical analysis” by favoring “similarity of the essay’s vocabulary to other previously scored essays in the top score category.” “In other words, it is looking for trite, common vocabulary,” Perelman said. “To use an SAT word, this is egregious.” Word complexity is judged, among other things, by average word length…. And the formula also explicitly rewards length of essay.

Perelman went on to show how Lincoln would have received a poor grade on the Gettysburg Address (except perhaps for starting with “four score,” since it was short and to the point).
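To make Perelman's complaint concrete, here is a minimal sketch of a scorer built only from the kinds of surface features the excerpts mention: essay length, average word length, and a count of units. This is not E-Rater or any real product. The function names, the weights, and the choice to treat each sentence as a "discourse unit" are all assumptions made for illustration.

```python
import re

def surface_features(essay: str) -> dict:
    """Collect gross surface measures; nothing here looks at meaning."""
    words = re.findall(r"[A-Za-z']+", essay)
    # Assumption: a "discourse unit" is approximated by one sentence.
    units = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "word_count": len(words),
        "avg_word_length": avg_len,
        "unit_count": len(units),
    }

def toy_score(essay: str) -> float:
    """Hypothetical weights: longer essays and longer words score higher."""
    f = surface_features(essay)
    return (0.01 * f["word_count"]
            + 0.5 * f["avg_word_length"]
            + 0.2 * f["unit_count"])

def reverse_units(essay: str) -> str:
    """Put the sentences in reverse order, scrambling the argument."""
    units = [s.strip() for s in re.split(r"[.!?]+", essay) if s.strip()]
    return ". ".join(reversed(units)) + "."

essay = ("Writing matters. Machines count words. "
         "Therefore length wins. Readers disagree.")

# Same units in a different order: the toy score cannot tell the difference.
print(toy_score(essay) == toy_score(reverse_units(essay)))
# Repeating the essay verbatim adds words and units, so the toy score rises.
print(toy_score(essay + " " + essay) > toy_score(essay))
```

Under these assumptions, a scorer that only counts things cannot distinguish an ordered argument from the same sentences shuffled, and padding alone raises the score. That is the shape of the objection, whatever the details of any actual commercial formula.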

Notice, not incidentally, that Perelman’s actual arguments belie Paul’s statement that when “critics like Les Perelman of MIT claim that robo-graders can’t be as good as human graders, it’s because robo-graders lack human insight, human nuance, human judgment.” It’s perfectly clear even from these excerpts that Perelman’s point is not that the robo-graders are non-human, but that they reward bad writing and punish good. And since the software only follows the algorithms that have been programmed into it, the problem actually begins with the programmers, who may not have any real understanding of what makes writing effective, or (and this seems to me more likely) can’t find algorithms that identify it.

I suspect, then, that with this automated grading we’re moving perilously close to a model that redefines good writing as “writing that our algorithms can recognize.” So why would any teachers ever adopt such software? That one has a simple answer: because the students are happier when they interact with the machines about their writing than when they have to respond to human teachers. If you read Paul’s whole essay, you’ll see that that’s all the system has to commend it: it pacifies the children, while the teachers just stand by and watch. The software really is teaching the children, and what it’s teaching them is to do what the software tells them to do. The achievement here is not improved writing, but improved obedience to algorithmic machines.

Welcome to the future of education.

Text Patterns

August 13, 2014


  1. We recently ran a lovely piece on this theme in The New Atlantis, under the title "Machine Grading and Moral Learning":

    "Responding to and evaluating students’ written work does more than just describe students, or distinguish them. Grading is also pedagogical: it corrects and informs, rewards and reinforces someone’s understanding of the world….

    "Grading should communicate not only what students have achieved but what they can. A professor can encourage intelligent but lazy students with a lower grade than their work strictly merits, and struggling but passionate students with one higher…. Good professors will challenge a gifted student to address an overlooked problem on a passable term paper purely for the joy of initiating him or her into the life of the mind. They will discourage the well-meaning student from following a line of thought whose path that they know to be littered with intellectual blind alleys and moral dead ends….

    "To grade as if the point were to identify and label mistakes is to grade as mechanics give estimates: this is what is broken and what it will cost to fix. To grade with charity is to treat students not as busted but as becoming. It is to take even their mistaken ideas seriously when they are sincerely offered, by responding with truth and with hope. It means treating grading as a means to continue a conversation older than any of us, and wisdom as both a goal and a common good."

  2. I'm working on a chapter of my book on Teaching Machines on "robo-graders" and "robo-readers" — this chapter in particular is tied to issues around labor, and I must say, I like the article linked above that talks about grading and “love” rather than drudgery.

    I feel like this whole article (not yours, Paul’s) is built on a series of deliberate elisions — done of course, as her work always frames itself, in the name of making education a better “science.”

    Take Shermis’s research about the superiority of robot readers over human graders, for example. This is the opening assertion of her story, and she situates Perelman’s work in contrast to Shermis’s “scientific” claims.

    But who are these graders who Shermis’s study analyzed? (Hint: not students’ teachers.) What were they reading? (Hint: long-form responses on standardized tests — not necessarily even “essays.”) What were the circumstances under which the students worked (wrote)? What were the circumstances under which the graders worked (graded)?

    I’m interested in Shermis’s claims about the inferiority of human graders because of what I know about the hiring practices and the grading practices (the working conditions) of the people hired by major testing companies to grade papers. These jobs, which are often advertised on Craigslist, require a college degree; turnover is very high. Human scorers are pushed to move quickly through students’ papers, forced to follow a rubric that is not of their design, and punished if their scores deviate too much from one another. (See: Making the Grades: My Misadventures in the Standardized Testing Industry by Todd Farley.) It’s not a surprise that a robot can be programmed to perform this task better.

    But my response isn’t: replace the humans with robots. It’s to ask why this sort of writing, for a narrow rubric, is what we want students to do.

    It’s to ask why “immediate feedback” is important for writing development. (Is it?) It’s to ask why the phrase “individualized feedback” — like “personalization” — is used to describe a process that is algorithmic, not personal.

    And although Paul insists that these robots can have distinct usages, I must ask too: why is the “work” of reading different than the “work” of grading? How? Can we really automate the former and maintain with a straight face that it’s a separate effort from automating the latter?

  3. From Paul's article: "In this role, the computer functions not as a grader but as a proofreader and basic writing tutor, providing feedback on drafts, which students then use to revise their papers before handing them in to a human." This role is narrowly defined and completely ignored by Alan Jacobs. That omission is intellectually dishonest, and Alan Jacobs's article is at best misleading if the reader trusts him for an accurate accounting.
