Watson, Can You Hear Me? (The Significance of the “Jeopardy” AI win)

Yesterday, on Jeopardy!, a computer handily beat its human competitors. Stephen Gordon asks, “Did the Singularity Just Happen on Jeopardy?” If so, then I think it’s time for me and my co-bloggers to pack up and go home, because the Singularity is damned underwhelming. This was one giant leap for robot publicity, but only a small step for robotkind.

Unlike the 1997 victory of Deep Blue, the IBM computer that defeated the world chess champion Garry Kasparov, the Jeopardy! victory gave me no indication of any remarkable innovation in artificial intelligence methods. IBM’s Watson computer is essentially search engine technology with some basic natural language processing (NLP) capability sprinkled on top. Most Jeopardy! clues contain definite, specific keywords associated with the correct response, so that you could probably Google those keywords and find the correct response somewhere in the first page of results. The game is already very amenable to what computers do well.
In fact, Stephen Wolfram shows that you can get a remarkable amount of the way to building a system like Watson just by putting Jeopardy! clues straight into Google.
Once you’ve got that, it only requires a little NLP to extract a list of candidate responses, some statistical training to weight those responses properly, and then a variety of purpose-built tricks to accommodate the various quirks of Jeopardy!-style categories and jokes. Watching Watson perform, it’s not too difficult to imagine the combination of algorithms used.
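To make that guess concrete, here is a toy sketch of such a pipeline in Python. It is emphatically not IBM's code: the snippets are hand-written stand-ins for search results, the capitalized-phrase regex stands in for "a little NLP," raw frequency stands in for trained statistical weights, and the 0.5 ring-in threshold is an assumption chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-ins for search-engine snippets returned for the clue
# "'Bang bang' his 'silver hammer came down upon her head'".
SNIPPETS = [
    "Maxwell's Silver Hammer is a song by the Beatles from Abbey Road.",
    "In the Beatles song, Maxwell Edison brings his silver hammer down.",
    "Paul McCartney wrote Maxwell's Silver Hammer in 1968.",
]

# Assumed filter: words from the category ("Beatles People") plus a few stopwords.
EXCLUDED = ["Beatles", "People", "In", "The"]


def extract_candidates(snippets, excluded_words):
    """Pull runs of capitalized words out of each snippet as crude candidates."""
    pattern = re.compile(r"[A-Z][\w']*(?:\s[A-Z][\w']*)*")
    excluded = {w.lower() for w in excluded_words}
    candidates = []
    for snippet in snippets:
        for phrase in pattern.findall(snippet):
            words = {w.lower() for w in phrase.split()}
            if words & excluded:
                continue  # skip phrases that merely repeat the category
            candidates.append(phrase)
    return candidates


def rank(candidates):
    """Turn raw frequencies into (phrase, confidence) pairs, best first."""
    counts = Counter(candidates)
    total = sum(counts.values())
    return [(phrase, n / total) for phrase, n in counts.most_common()]


def should_ring_in(ranked, threshold=0.5):
    """Ring in only when the top candidate clears the (assumed) threshold."""
    return bool(ranked) and ranked[0][1] >= threshold


ranked = rank(extract_candidates(SNIPPETS, EXCLUDED))
print(ranked[0], should_ring_in(ranked))
```

On this toy data the top-ranked candidate is the song title rather than the person, and at 40 percent confidence the sketch declines to ring in. Nothing in it knows what kind of thing the clue is asking for, which is exactly the weakness that shows up in the errors below.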
Compiling Watson’s Errors
On that large share of search-engine-amenable clues, Watson almost always did very well. More interesting are the types of clues on which Watson performed very poorly. Perhaps the best example was the Final Jeopardy clue from the first game (which was broadcast on the second of three nights). The category was “U.S. Cities,” and the clue was “Its largest airport is named for a World War II hero; its second largest, for a World War II battle.” Both of the human players correctly responded Chicago, but Watson incorrectly responded Toronto, and the audience audibly gasped when it did.
Watson performed poorly on this Final Jeopardy because there were no words in either the clue or the category that are strongly and specifically associated with Chicago — that is, you wouldn’t expect “Chicago” to come up if you were to stick something like this clue into Google (unless you included pages talking about this week’s tournament). But there was an even more glaring error here: anyone who knows enough about Toronto to know about its airports will know that it is not a U.S. city.
There were a variety of other instances like this of “dumb” behavior on Watson’s part. The partial list that follows gives a flavor of the kinds of mistakes the machine made, and can help us understand their causes.
  • With the category “Beatles People” and the clue “‘Bang bang’ his ‘silver hammer came down upon her head,’” Watson responded, “What is Maxwell’s silver hammer.” Surprisingly, Alex Trebek accepted this response as correct, even though the category and clue were clearly asking for the name of a person, not a thing.
  • With the category “Olympic Oddities” and the clue “It was the anatomical oddity of U.S. gymnast George Eyser, who won a gold medal on the parallel bars in 1904,” Watson responded, “What is leg.” The correct response was, “What is he was missing a leg.”
  • In the “Name the Decade” category, Watson at one point didn’t seem to know what the category was asking for. With the clue “Klaus Barbie is sentenced to life in prison & DNA is first used to convict a criminal,” none of its top three responses was a decade. (Correct response: “What is the 1980s?”)
  • Also in the category “Name the Decade,” there was the clue, “The first modern crossword puzzle is published & Oreo cookies are introduced.” Ken responded, “What are the twenties.” Trebek said no, and then Watson rang in and responded, “What is 1920s.” (Trebek came back with, “No, Ken said that.”)
  • With the category “Literary Character APB,” and the clue “His victims include Charity Burbage, Mad Eye Moody & Severus Snape; he’d be easier to catch if you’d just name him!” Watson didn’t ring in because its top option was Harry Potter, with only 37% confidence. Its second option was Voldemort, with 20% confidence.
  • On one clue, Watson’s top option (which was correct) was “Steve Wynn.” Its second-ranked option was “Stephen A. Wynn” — the full name of the same person.
  • With the clue “In 2002, Eminem signed this rapper to a 7-figure deal, obviously worth a lot more than his name implies,” Watson’s top option was the correct one — 50 Cent — but its confidence was too low to ring in.
  • With the clue “The Schengen Agreement removes any controls at these between most EU neighbors,” Watson’s first choice was “passport” with 33% confidence. Its second choice was “Border” with 14%, which would have been correct. (Incidentally, it’s curious to note that one answer was capitalized and the other was not.)
  • In the category “Computer Keys” with the clue “A loose-fitting dress hanging from the shoulders to below the waist,” Watson incorrectly responded “Chemise.” (Ken then incorrectly responded “A,” thinking of an A-line skirt. The correct response was a “shift.”)
  • Also in “Computer Keys,” with the clue “Proverbially, it’s ‘where the heart is,’” Watson’s top option (though it did not ring in) was “Home Is Where the Heart Is.”
  • With the clue “It was 103 degrees in July 2010 & Con Ed’s command center in this N.Y. borough showed 12,963 megawatts consumed at 1 time,” Watson’s first choice (though it did not have enough confidence to ring in) was “New York City.”
  • In the category “Nonfiction,” with the clue “The New Yorker’s 1959 review of this said in its brevity & clarity it is ‘unlike most such manuals, a book as well as a tool.’” Watson incorrectly responded “Dorothy Parker.” The correct response was “The Elements of Style.”
  • For the clue “One definition of this is entering a private place with the intent of listening secretly to private conversations,” Watson’s first choice was “eavesdropper,” with 79% confidence. Second was “eavesdropping,” with 49% confidence.
  • For the clue “In May 2010 5 paintings worth $125 million by Braque, Matisse & 3 others left Paris’ museum of this art period,” Watson responded, “Picasso.”
We can group these errors into a few broad, somewhat overlapping categories:
  • Failure to understand what type of thing the clue was pointing to, e.g. “Maxwell’s silver hammer” instead of “Maxwell”; “leg” instead of “he was missing a leg”; “eavesdropper” instead of “eavesdropping.”
  • Failure to understand what type of thing the category was pointing to, e.g. “Home Is Where the Heart Is” for “Computer Keys”; “Toronto” for “U.S. Cities.”
  • Basic errors in worldly logic, e.g. repeating Ken’s wrong response; considering “Steve Wynn” and “Stephen A. Wynn” to be different responses.
  • Inability to understand jokes or puns in clues, e.g. 50 Cent being “worth” “more than his name implies”; “he’d be easier to catch if you’d just name him!” about Voldemort.
  • Inability to respond to clues lacking keywords specifically associated with the correct response, e.g. the Voldemort clue; “Dorothy Parker” instead of “The Elements of Style.”
  • Inability to correctly respond to complicated clues that involve inference and combining facts in subsequent stages, rather than combining independent associated clues; e.g. the Chicago airport clue.
What these errors add up to is that Watson really cannot process natural language in a very sophisticated way; if it could, it would not suffer from the category errors that marked so many of its wrong responses. Nor does it have much ability to perform the inference required to integrate several discrete pieces of knowledge, as required for understanding puns, jokes, wordplay, and allusions. On clues involving these skills and lacking search-engine-friendly keywords, Watson stumbled. And when it stumbled, it often seemed not just ignorant, but completely thoughtless.
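For what it's worth, the two simplest kinds of mistake in the list (wrong type of answer, and the same answer counted twice) are the kind of thing a sanity check over the candidate list could catch. The Python sketch below is only an illustration of that idea: the type table, the alias table, and every confidence figure in the usage lines are made up for the example and are not Watson's, and summing the confidences of merged aliases is just one assumed way of combining them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    confidence: float


# Hypothetical, hand-built knowledge tables (assumptions for this sketch).
KNOWN_TYPES = {
    "Toronto": {"city", "canadian_city"},
    "Chicago": {"city", "us_city"},
}
ALIASES = {"Stephen A. Wynn": "Steve Wynn"}


def merge_aliases(candidates):
    """Collapse different names for the same entity into one candidate."""
    merged = {}
    for c in candidates:
        name = ALIASES.get(c.text, c.text)
        merged[name] = merged.get(name, 0.0) + c.confidence
    return [Candidate(name, conf) for name, conf in merged.items()]


def filter_by_type(candidates, expected_type):
    """Keep only candidates known to be the type the category asks for."""
    return [
        c for c in candidates
        if expected_type in KNOWN_TYPES.get(c.text, set())
    ]


# Illustrative figures only, not Watson's actual confidences.
cities = [Candidate("Toronto", 0.3), Candidate("Chicago", 0.2)]
print(filter_by_type(cities, "us_city"))

names = [Candidate("Steve Wynn", 0.5), Candidate("Stephen A. Wynn", 0.25)]
print(merge_aliases(names))
```

Of course, the hard part is everything the sketch takes for granted: knowing that “U.S. Cities” demands a U.S. city, and knowing which names refer to the same person, is precisely the understanding Watson appeared to lack.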
I expect you could create an unbeatable Jeopardy! champion by allowing a human player to look at Watson’s weighted list of possible responses, even without the weights being nearly as accurate as Watson has them. While Watson assigns percentage-based confidence levels, any moderately educated human will immediately be able to sort potential responses into three relatively discrete categories: “makes no sense,” “yes, that sounds right,” and “don’t know, but maybe.” Watson hasn’t come close to touching this.
The Significance of Watson’s Jeopardy! Win
In short, Watson is not anywhere close to possessing true understanding of its knowledge — neither conscious understanding of the sort humans experience, nor unconscious, rule-based syntactic and semantic understanding sufficient to imitate the conscious variety. (Stephen Wolfram’s post accessibly explains his effort to achieve the latter.) Watson does not bring us any closer, in other words, to building a Mr. Data, even if such a thing is possible. Nor does it put us much closer to an Enterprise ship’s computer, as many have suggested.
In the meantime, of course, there were some singularly human characteristics on display in the Jeopardy! tournament, and evident only in the human participants. Particularly notable was the affability, charm, and grace of Ken Jennings and Brad Rutter. But the best part was the touches of genuine, often self-deprecating humor by the two contestants as they tried their best against the computer. This culminated in Ken Jennings’s joke on his last Final Jeopardy response:
Nicely done, sir. The closing credits, which usually show the contestants chatting with Trebek onstage, instead showed Jennings and Rutter attempting to “high-five” Watson and show it other gestures of goodwill:
I’m not saying it couldn’t ever be done by a computer, but it seems like joking around will have to be just about the last thing A.I. will achieve. There’s a reason Mr. Data couldn’t crack jokes. Because, well, humor — it is a difficult concept. It is not logical. All the more reason, though, why I can’t wait for Saturday Night Live’s inevitable “Celebrity Jeopardy” segment where Watson joins in with Sean Connery to torment Alex Trebek.

Progress in Robotics and AI: The Coming Demise of “Jeopardy”

With some irony, I expect, Gizmodo gave the following headline to a story this week about a rudimentary sprinting robot: “Someday, this robot will run faster than us all.” This week also brings the news that in a couple of months we will have a chance to see if IBM has made a champion artificially intelligent Jeopardy player. I for one do not doubt that eventually robots, maybe even the same robot, will be able to run faster than us all and win at Jeopardy and cook my dinner, or at least provide me with a recipe that will use all the stray leftovers in my refrigerator. And then what will AI and robotics researchers do?

A hint to answering this question can be found by going to the IBM Research home page and putting in the search term “Deep Blue,” the name of the company’s chess-playing computer that famously beat World Chess Champion Garry Kasparov. The first results take you to what seem to be orphaned Web pages from 1997. Eventually you reach a page that acknowledges that the team has moved on to other projects. So too with the MIT Media Lab Personal Robotics Group, whose site abounds in aspirational descriptions and videos but seems short on actual results that conform to those aspirations. Has the teddy-bear robot called “Huggable” in fact been turned, as its makers expected, into a communication avatar, an early education companion, or a therapeutic companion? One would be hard-pressed to know.

My guess is that graduate students graduate and funding opportunities change. And some questions get answered, or perhaps not; in either case researchers move on, maybe building on what they have done, maybe moving in a new direction entirely. Doubtless, as in any other kind of research, there are times when the results have a nearly immediate impact in the wider world, or eventually get filtered into products and processes that we come to take for granted. But in these academic fields, as in all others, it looks to me like a good deal of what gets done amounts to lines, sometimes very expensive lines, on a C.V.

For those of us who observe this world from the outside, knowing it works this way provides two cautionary lessons. First, there is not necessarily a great idea or accomplishment behind every great-sounding press release or polished website. No surprise there, I hope. Second, it usually takes some time to judge the full impact of the new knowledge and abilities that we gain in these kinds of research programs. If IBM’s “Watson” program wins its Jeopardy match, we will doubtless be treated to a good deal of speculation about what it means — I might be tempted to engage in some myself. But the best response will still probably be that we can only wait and see. That’s good, because time is a useful thing for us slow-thinking humans. But it is also problematic, as the frog in the slowly warming pan of water eventually finds out.
