Testing Intelligent Personal Assistants Joe Buzzanga
Intelligent Personal Assistants: Testing Part 1
This post is the first in a series evaluating the performance of the Intelligent Personal
Assistants (IPAs) from Apple (Siri), Google (Google Now), and Microsoft (Cortana). All tests
were conducted on an iPhone 6s running iOS 9.2.1 on January 23, 2016.
This initial test looks at how these systems handle a conversational task. We are looking for the
ability to go beyond answering a factual question to actually engaging in a simple two-level
dialog with a user. The test consists of two questions spoken into the phone. First we ask: “Who
wrote For Whom the Bell Tolls?” If the IPA answers “Ernest Hemingway,” we follow up with
another question: “When did he die?”
This is, of course, a simple Q/A dialog that any person could handle, provided he or
she knew the answers. The tricky part is knowing that the pronoun “he” refers to “Ernest
Hemingway.” This is simple for a human but difficult for a computer. The ability to make this
cognitive-linguistic connection is technically called “anaphora resolution.” Our particular
example is known as intersentential pronominal anaphora resolution: it involves connecting a
pronoun (“he”) to an antecedent (“Ernest Hemingway”) occurring in a different sentence.
We’ll refer to this exercise as a two-level dialog.
Level 1: Who wrote “For Whom the Bell Tolls”
Answer: Ernest Hemingway
Level 2: When did he die?
Answer: July 22, 1961
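The connection the Level 2 question requires can be illustrated with a toy resolver. The sketch below uses a naive “most recently mentioned person” heuristic in Python; the entity gazetteer and pronoun list are illustrative assumptions, not how any of these assistants actually works:

```python
# Naive intersentential pronominal anaphora resolution: resolve a pronoun
# to the most recently mentioned person entity in the dialog history.
# The entity gazetteer and pronoun list are toy assumptions for illustration.

KNOWN_PERSONS = {"Ernest Hemingway"}
PRONOUNS = {"he", "she", "him", "her", "his"}

def resolve_anaphora(turns):
    """Return the dialog with third-person pronouns replaced by their antecedent."""
    antecedent = None
    resolved = []
    for turn in turns:
        # Track the most recent person entity mentioned so far.
        for person in KNOWN_PERSONS:
            if person in turn:
                antecedent = person
        rewritten = []
        for token in turn.split():
            core = token.strip("?.,!").lower()
            if antecedent and core in PRONOUNS:
                # Substitute the antecedent, keeping any trailing punctuation.
                rewritten.append(antecedent + token[len(token.rstrip("?.,!")):])
            else:
                rewritten.append(token)
        resolved.append(" ".join(rewritten))
    return resolved

dialog = [
    "Who wrote For Whom the Bell Tolls?",
    "Ernest Hemingway wrote For Whom the Bell Tolls.",
    "When did he die?",
]
print(resolve_anaphora(dialog)[-1])  # When did Ernest Hemingway die?
```

Once the pronoun is rewritten, the follow-up becomes a standalone factual question that any of the three assistants could answer.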
Summary
Siri, Google Now, and Cortana all passed the Level 1 test easily. They “understood” the question
and answered correctly. Siri and Google Now responded with complete sentences. Cortana
simply responded with the name. Siri had the most “personality” in its response, while Google
Now and Cortana were devoid of any attempt to seem human.
Siri and Cortana both failed the Level 2 test. Neither could work out how to handle the
pronoun; in technical terms, they could not perform anaphora resolution successfully.
Google Now, on the other hand, was fully adept at answering not just the Level 2
question but a series of follow-up questions, all referring to Hemingway via pronouns. On this
type of conversational task Google appears to be far ahead of Apple and Microsoft.
Test Results—Siri
Level 1—Who wrote “For Whom the Bell Tolls?”
Grade: Passed
Siri’s response was “Hmm let me have a look. It looks like the author of For Whom the Bell
Tolls was Ernest Hemingway.” The screen displayed a rich set of facts about Ernest Hemingway
(Figure 1).
Figure 1: Siri Level 1 “Who Wrote For Whom the Bell Tolls?”
The input interpretation is presented as well and shows Siri is quite accurate in identifying “For
Whom the Bell Tolls” as a book. Similarly, it recognizes that Ernest Hemingway is an author.
Level 2—“When did he die?”
Grade: Failed
Siri was utterly lost on the follow-up question. It responded with the
nonsensical statement “Here’s what I found on the web for When did For Whom the Bell Tolls
die” (Figure 2).
Figure 2: Siri Level 2 “When Did He Die?”
The answer here shows that Siri cannot connect “he” to “Ernest Hemingway,” instead resolving
it to the book title. Perhaps more disappointing is that Siri doesn’t recognize that death is a
property of humans and other living organisms and cannot logically apply to book titles.
Test Results—Google Now
Level 1—Who wrote “For Whom the Bell Tolls?”
Grade: Passed
Google Now responded directly: “Ernest Hemingway wrote For Whom the Bell Tolls”. Unlike
Siri, Google Now is notably lacking in playfulness or personality. But that is a matter of taste and
preference. Its answer was correct.
Figure 3: Google Now Level 1 "Who Wrote For Whom the Bell Tolls?"
Level 2—“When did he die?”
Grade: Passed
Google Now answered correctly: “He died on July 22, 1961”
Figure 4: Google Now Level 2 "When Did He Die?"
We posed follow-up questions to see how deep Google could go. The answer is: surprisingly
deep. Here are our follow-up questions:
Level 3—“How did he die?”
Google Now: “The cause of death of Ernest Hemingway was suicide”
We went even further, and in each case Google Now responded correctly. We omit the answers
here, but they were correct and were conveyed in complete English sentences.
Level 4—Where did he die?
Level 5—Where was he born?
Level 6—What was his first book?
It is striking to fire these questions at Google Now and receive correct spoken responses. It
almost feels like you are successfully interrogating a human. Once you pose the initial question
and name Ernest Hemingway, you can apparently follow up with an indefinite number of
questions using only pronouns. Whatever the underlying technology, Google Now maintains the
dialog context, “remembering” that the pronouns continue to refer to Ernest Hemingway.
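This behavior can be approximated with a small dialog-state sketch: the manager remembers the entity named in its own last answer and uses it to interpret pronoun follow-ups. Everything below, including the fact table and the keyword matching, is a hypothetical Python illustration, not Google’s actual design; the facts are drawn from the dialog above plus Hemingway’s widely documented birthplace.

```python
# Toy dialog manager: answer a direct question, remember the salient entity,
# then resolve pronoun follow-ups against it. The fact table and keyword
# rules are illustrative assumptions, not any vendor's implementation.

AUTHORS = {"For Whom the Bell Tolls": "Ernest Hemingway"}
PERSON_FACTS = {
    "Ernest Hemingway": {
        "die": "He died by suicide in 1961.",
        "born": "He was born in Oak Park, Illinois.",
    }
}

def ask(state, question):
    q = question.lower()
    # Direct factual question: look up the author and remember the entity.
    for title, author in AUTHORS.items():
        if "wrote" in q and title.lower() in q:
            state["salient"] = author
            return f"{author} wrote {title}."
    # Pronoun follow-up: resolve it against the remembered salient entity.
    if state.get("salient") and {"he", "she", "his", "her"} & set(q.rstrip("?").split()):
        for keyword, fact in PERSON_FACTS[state["salient"]].items():
            if keyword in q:
                return fact
    return "Sorry, I don't know."

state = {"salient": None}
print(ask(state, "Who wrote For Whom the Bell Tolls?"))  # Ernest Hemingway wrote For Whom the Bell Tolls.
print(ask(state, "How did he die?"))                     # He died by suicide in 1961.
print(ask(state, "Where was he born?"))                  # He was born in Oak Park, Illinois.
```

The key design point is that the salient entity is set by the system’s own answer, not the user’s question, which is exactly what the Level 2 test probes.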
Test Results—Cortana
Level 1—Who wrote “For Whom the Bell Tolls?”
Grade: Passed
Cortana answered simply “Ernest Hemingway.” It did not respond with a complete sentence
and felt much less polished than both Siri and Google Now. Its screen display was notably
lacking in supplementary material.
Figure 5: Cortana Level 1 "Who Wrote For Whom the Bell Tolls?"
Level 2—“When did he die?”
Grade: Failed
Cortana was unable to grasp the question and didn’t even attempt a verbal response. It
displayed a web page, apparently selected by literally matching the query phrase “did he die”
to a corresponding text snippet.
Figure 6: Cortana Level 2 "When Did He Die?"
