Testing Intelligent Personal Assistants Joe Buzzanga
Intelligent Personal Assistants: Testing Part 1
This post is the first in a series evaluating the performance of the Intelligent Personal
Assistants (IPAs) from Apple (Siri), Google (Google Now), and Microsoft (Cortana). All tests
were conducted on an iPhone 6s running iOS 9.2.1 on January 23, 2016.
This initial test looks at how these systems handle a conversational task. We are looking for the
ability to go beyond answering a factual question to actually engaging in a simple two-level
dialog with a user. The test consists of two questions spoken into the phone. First we ask: “Who
wrote For Whom the Bell Tolls?” If the IPA answers “Ernest Hemingway,” we follow up with
another question: “When did he die?”
This is, of course, a simple Q/A dialog that any person could handle, provided he or
she knew the answers. The tricky part is knowing that the pronoun “he” refers to “Ernest
Hemingway.” This is simple for a human but difficult for a computer. The ability to make this
cognitive-linguistic connection is technically called “anaphora resolution.” Our particular
example is known as intersentential pronominal anaphora resolution: it involves connecting a
pronoun (“he”) to an antecedent (“Ernest Hemingway”) occurring in a different sentence.
We’ll refer to this exercise as a two-level dialog.
Level 1: Who wrote “For Whom the Bell Tolls”
Answer: Ernest Hemingway
Level 2: When did he die?
Answer: July 22, 1961
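The connection the Level 2 question requires can be illustrated with a toy resolver. The sketch below uses a naive “most recently mentioned person” heuristic in Python; the entity gazetteer and pronoun list are illustrative assumptions, not how any of these assistants actually works:

```python
# Naive intersentential pronominal anaphora resolution: resolve a pronoun
# to the most recently mentioned person entity in the dialog history.
# The entity gazetteer and pronoun list are toy assumptions for illustration.

KNOWN_PERSONS = {"Ernest Hemingway"}
PRONOUNS = {"he", "she", "him", "her", "his"}

def resolve_anaphora(turns):
    """Return the dialog with third-person pronouns replaced by their antecedent."""
    antecedent = None
    resolved = []
    for turn in turns:
        # Track the most recent person entity mentioned so far.
        for person in KNOWN_PERSONS:
            if person in turn:
                antecedent = person
        rewritten = []
        for token in turn.split():
            core = token.strip("?.,!").lower()
            if antecedent and core in PRONOUNS:
                # Substitute the antecedent, keeping any trailing punctuation.
                rewritten.append(antecedent + token[len(token.rstrip("?.,!")):])
            else:
                rewritten.append(token)
        resolved.append(" ".join(rewritten))
    return resolved

dialog = [
    "Who wrote For Whom the Bell Tolls?",
    "Ernest Hemingway wrote For Whom the Bell Tolls.",
    "When did he die?",
]
print(resolve_anaphora(dialog)[-1])  # When did Ernest Hemingway die?
```

Once the pronoun is rewritten, the follow-up becomes a standalone factual question that any of the three assistants could answer.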
Summary
Siri, Google Now, and Cortana all passed the Level 1 test easily. They “understood” the question
and answered correctly. Siri and Google Now responded with complete sentences. Cortana
simply responded with the name. Siri had the most “personality” in its response, while Google
Now and Cortana were devoid of any attempt to seem human.
Siri and Cortana both failed the Level 2 test. Neither could work out how to handle the
pronoun; in technical terms, they could not perform anaphora resolution successfully.
Google Now, on the other hand, was fully adept at answering not just the Level 2
question but a series of follow-up questions, all referring to Hemingway via pronouns. On this
type of conversational task Google appears to be far ahead of Apple and Microsoft.
Test Results—Siri
Level 1—Who wrote “For Whom the Bell Tolls?”
Grade: Passed
Siri’s response was “Hmm let me have a look. It looks like the author of For Whom the Bell
Tolls was Ernest Hemingway.” The screen displayed a rich set of facts about Ernest Hemingway
(Figure 1).
Figure 1: Siri Level 1 “Who Wrote For Whom the Bell Tolls?”
The input interpretation is presented as well and shows Siri is quite accurate in identifying “For
Whom the Bell Tolls” as a book. Similarly, it recognizes that Ernest Hemingway is an author.
Level 2—“When did he die?”
Grade: Failed
Siri was utterly lost on the follow-up question. It responded with the
nonsensical statement “Here’s what I found on the web for When did For Whom the Bell Tolls
die” (Figure 2).
Figure 2: Siri Level 2 “When Did He Die?”
The answer here shows that Siri cannot connect “he” to “Ernest Hemingway,” instead resolving
it to the book title. Perhaps more disappointing is that Siri doesn’t recognize that death is a
property of humans and other living organisms and cannot logically apply to book titles.
Test Results—Google Now
Level 1—Who wrote “For Whom the Bell Tolls?”
Grade: Passed
Google Now responded directly: “Ernest Hemingway wrote For Whom the Bell Tolls”. Unlike
Siri, Google Now is notably lacking in playfulness or personality. But that is a matter of taste and
preference. Its answer was correct.
Figure 3: Google Now Level 1 "Who Wrote For Whom the Bell Tolls?"
Level 2—“When did he die?”
Grade: Passed
Google Now answered correctly: “He died on July 22, 1961”
Figure 4: Google Now Level 2 "When Did He Die?"
We posed follow-up questions to see how deep Google could go. The answer is: surprisingly
deep. Here are our follow-up questions:
Level 3—“How did he die?”
Google Now: “The cause of death of Ernest Hemingway was suicide”
We went even further, and in each case Google Now responded correctly. We omit the answers
here, but they were correct and were conveyed in complete English sentences.
Level 4—Where did he die?
Level 5—Where was he born?
Level 6—What was his first book?
It is striking to fire these questions at Google Now and receive correct spoken responses. It
almost feels like you are successfully interrogating a human. Once you pose the initial question
and name Ernest Hemingway, you can apparently follow up with an indefinite number of
questions using only pronouns. Whatever the underlying technology, Google Now maintains the
dialog context, “remembering” that the pronouns continue to refer to Ernest Hemingway.
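This behavior can be approximated with a small dialog-state sketch: the manager remembers the entity named in its own last answer and uses it to interpret pronoun follow-ups. Everything below, including the fact table and the keyword matching, is a hypothetical Python illustration, not Google’s actual design; the facts are drawn from the dialog above plus Hemingway’s widely documented birthplace.

```python
# Toy dialog manager: answer a direct question, remember the salient entity,
# then resolve pronoun follow-ups against it. The fact table and keyword
# rules are illustrative assumptions, not any vendor's implementation.

AUTHORS = {"For Whom the Bell Tolls": "Ernest Hemingway"}
PERSON_FACTS = {
    "Ernest Hemingway": {
        "die": "He died by suicide in 1961.",
        "born": "He was born in Oak Park, Illinois.",
    }
}

def ask(state, question):
    q = question.lower()
    # Direct factual question: look up the author and remember the entity.
    for title, author in AUTHORS.items():
        if "wrote" in q and title.lower() in q:
            state["salient"] = author
            return f"{author} wrote {title}."
    # Pronoun follow-up: resolve it against the remembered salient entity.
    if state.get("salient") and {"he", "she", "his", "her"} & set(q.rstrip("?").split()):
        for keyword, fact in PERSON_FACTS[state["salient"]].items():
            if keyword in q:
                return fact
    return "Sorry, I don't know."

state = {"salient": None}
print(ask(state, "Who wrote For Whom the Bell Tolls?"))  # Ernest Hemingway wrote For Whom the Bell Tolls.
print(ask(state, "How did he die?"))                     # He died by suicide in 1961.
print(ask(state, "Where was he born?"))                  # He was born in Oak Park, Illinois.
```

The key design point is that the salient entity is set by the system’s own answer, not the user’s question, which is exactly what the Level 2 test probes.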
Test Results—Cortana
Level 1—Who wrote “For Whom the Bell Tolls?”
Grade: Passed
Cortana answered simply “Ernest Hemingway.” It did not respond with a complete sentence
and felt much less polished than both Siri and Google Now. Its screen display was notably
lacking in supplementary material.
Figure 5: Cortana Level 1 "Who Wrote For Whom the Bell Tolls?"
Level 2—“When did he die?”
Grade: Failed
Cortana was unable to grasp the question and didn’t even attempt a verbal response. It
displayed a web page, apparently selected by literally matching the query phrase “did he die”
to a corresponding text snippet.
Figure 6: Cortana Level 2 "When Did He Die?"
