Natural Language Processing

Taylor Berg-Kirkpatrick – CMU
Slides: Dan Klein – UC Berkeley
Language Technologies

Goal: Deep Understanding
§ Requires context, linguistic structure, meanings…

Reality: Shallow Matching
§ Requires robustness and scale
§ Amazing successes, but fundamental limitations
Speech Systems

§ Automatic Speech Recognition (ASR)
§ Audio in, text out
§ SOTA: 0.3% error for digit strings, 5% dictation, 50%+ TV

§ Text to Speech (TTS)
§ Text in, audio out
§ SOTA: totally intelligible (if sometimes unnatural)

[Image: “Speech Lab”]
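The error rates quoted above are word error rates (WER). As a concrete illustration, here is a minimal sketch of the standard WER computation via word-level edit distance; the example strings are made up.

```python
# Minimal sketch: word error rate (WER), the metric behind ASR figures
# like "5% dictation". Counts the substitutions, insertions, and
# deletions needed to turn the hypothesis into the reference,
# normalized by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("recognize speech", "wreck a nice beach"))  # 2.0: WER can exceed 1
```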
Example: Siri

§ Siri contains
§ Speech recognition
§ Language analysis
§ Dialog processing
§ Text to speech

Image: Wikipedia
Text Data is Superficial

An iceberg is a large piece of freshwater ice that has broken off from a snow-formed glacier or ice shelf and is floating in open water.
… But Language is Complex

§ Semantic structures
§ References and entities
§ Discourse-level connectives
§ Meanings and implicatures
§ Contextual factors
§ Perceptual grounding
§ …

An iceberg is a large piece of freshwater ice that has broken off from a snow-formed glacier or ice shelf and is floating in open water.
Syntactic Analysis

§ SOTA: ~90% accurate for many languages when given many training examples, some progress in analyzing languages given few or no examples

Hurricane Emily howled toward Mexico 's Caribbean coast on Sunday packing 135 mph winds and torrential rain and causing panic in Cancun , where frightened tourists squeezed into musty shelters .
Corpora

§ A corpus is a collection of text
§ Often annotated in some way
§ Sometimes just lots of text
§ Balanced vs. uniform corpora

§ Examples
§ Newswire collections: 500M+ words
§ Brown corpus: 1M words of tagged “balanced” text
§ Penn Treebank: 1M words of parsed WSJ
§ Canadian Hansards: 10M+ words of aligned French / English sentences
§ The Web: billions of words of who knows what
Corpus-Based Methods

§ A corpus like a treebank gives us three important tools:
§ It gives us broad coverage

ROOT → S
S → NP VP .
NP → PRP
VP → VBD ADJ
Corpus-Based Methods

§ It gives us statistical information

[Figure: how often NPs expand as NP PP, DT NN, or PRP, in three contexts:]

              All NPs   NPs under S   NPs under VP
NP → NP PP      11%          9%           23%
NP → DT NN       9%          9%            7%
NP → PRP         6%         21%            4%
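A minimal sketch of how such rule statistics are read off a treebank, assuming NLTK and its bundled Penn Treebank sample are installed (via nltk.download('treebank')); counts over the small sample will not reproduce the exact percentages above.

```python
# Count grammar rules in a treebank and report the most common
# NP expansions, the kind of statistic tabulated above.
from collections import Counter
from nltk.corpus import treebank

rule_counts = Counter()
for tree in treebank.parsed_sents():
    for prod in tree.productions():        # e.g. NP -> DT NN
        if not prod.is_lexical():          # skip word-level rules
            rule_counts[prod] += 1

# Treebank labels carry function tags (NP-SBJ etc.), so match by prefix.
np_rules = Counter({r: c for r, c in rule_counts.items()
                    if str(r.lhs()).startswith("NP")})
total = sum(np_rules.values())
for rule, count in np_rules.most_common(5):
    print(f"{rule}   {count / total:.1%}")
```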
Corpus-Based Methods

§ It lets us check our answers
Semantic Ambiguity

§ NLP is much more than syntax!
§ Even correct tree-structured syntactic analyses don’t fully nail down the meaning
§ In general, every level of linguistic structure comes with its own ambiguities…

I haven’t slept for ten days
John’s boss said he was doing better
Other Levels of Language

§ Tokenization/morphology:
§ What are the words, what is the sub-word structure?
§ Often simple rules work (period after “Mr.” isn’t a sentence break; see the sketch below)
§ Relatively easy in English, other languages are harder:
§ Segmentation
§ Morphology
§ Discourse: how do sentences relate to each other?
§ Pragmatics: what intent is expressed by the literal meaning, how to react to an utterance?
§ Phonetics: acoustics and physical production of sounds
§ Phonology: how sounds pattern in a language

sarà andata
be+fut+3sg go+ppt+fem
“she will have gone”
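As a sketch of the “simple rules” point above, here is a toy rule-based sentence splitter that treats a period after a known abbreviation as non-final; the abbreviation list is illustrative, not exhaustive.

```python
# Toy rule-based sentence splitter: end a sentence on ., !, or ?
# unless the token is a known abbreviation like "Mr.".
import re

ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Inc.", "U.S."}

def split_sentences(text: str) -> list[str]:
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if re.search(r"[.!?]$", tok) and tok not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Smith arrived. He was late."))
# ['Mr. Smith arrived.', 'He was late.']
```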
Question Answering

§ Question Answering:
§ More than search
§ Ask general comprehension questions of a document collection
§ Can be really easy: “What’s the capital of Wyoming?”
§ Can be harder: “How many US states’ capitals are also their largest cities?”
§ Can be open ended: “What are the main issues in the global warming debate?”

§ SOTA: Can do factoids, even when text isn’t a perfect match
Example: Watson
Summarization

§ Condensing documents
§ An example of analysis with generation
Extractive Summaries
Lindsay Lohan pleaded not guilty Wednesday to felony grand theft of a
$2,500 necklace, a case that could return the troubled starlet to jail rather
than the big screen. Saying it appeared that Lohan had violated her
probation in a 2007 drunken driving case, the judge set bail at $40,000 and
warned that if Lohan was accused of breaking the law while free he would
have her held without bail. The Mean Girls star is due back in court on Feb.
23, an important hearing in which Lohan could opt to end the case early.
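A toy sketch of the extractive idea (not the system behind the example above): score each sentence by average word frequency in the document and output the top-scoring sentences in their original order. The variable lohan_article in the usage comment is hypothetical.

```python
# Toy extractive summarizer: frequent words mark "central" sentences.
import re
from collections import Counter

def extract_summary(text: str, n_sentences: int = 1) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sent: str) -> float:
        toks = re.findall(r"[a-z']+", sent.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    # Keep the top-scoring sentences, but emit them in document order.
    chosen = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in chosen)

# e.g. extract_summary(lohan_article, n_sentences=1) would pick the one
# sentence whose words best reflect the article's overall vocabulary.
```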
Machine Translation

§ Translate text from one language to another
§ Recombines fragments of example translations (see the toy sketch after this slide)
§ Challenges:
§ What fragments?  [learning to translate]
§ How to make efficient?  [fast translation search]
§ Fluency (next class) vs fidelity (later)
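As promised above, a toy sketch of fragment recombination: a hypothetical phrase table with greedy longest-match decoding, no reordering, and no language model, so it only hints at what real systems do.

```python
# Toy fragment-recombination "translator": at each position, emit the
# translation of the longest known source fragment. The phrase table
# entries are illustrative, not from a real system.
PHRASE_TABLE = {
    ("cela",): "that",
    ("constituerait",): "would be",
    ("une", "solution", "transitoire"): "a transitional solution",
    ("qui",): "which",
}

def translate(source: list[str]) -> str:
    out, i = [], 0
    while i < len(source):
        # Try the longest fragment first, fall back to copying the word.
        for span in range(len(source) - i, 0, -1):
            frag = tuple(source[i:i + span])
            if frag in PHRASE_TABLE:
                out.append(PHRASE_TABLE[frag])
                i += span
                break
        else:
            out.append(source[i])  # unknown word: pass through
            i += 1
    return " ".join(out)

print(translate("cela constituerait une solution transitoire".split()))
# that would be a transitional solution
```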
More Data: Machine Translation

SOURCE: Cela constituerait une solution transitoire qui permettrait de conduire à terme à une charte à valeur contraignante.

HUMAN: That would be an interim solution which would make it possible to work towards a binding charter in the long term.

1x DATA: [this] [constituerait] [assistance] [transitoire] [who] [permettrait] [licences] [to] [terme] [to] [a] [charter] [to] [value] [contraignante] [.]

10x DATA: [it] [would] [a solution] [transitional] [which] [would] [of] [lead] [to] [term] [to a] [charter] [to] [value] [binding] [.]

100x DATA: [this] [would be] [a transitional solution] [which would] [lead to] [a charter] [legally binding] [.]

1000x DATA: [that would be] [a transitional solution] [which would] [eventually lead to] [a binding charter] [.]
Data and Knowledge

§ Classic knowledge representation worry: How will a machine ever know that…
§ Ice is frozen water?
§ Beige looks like this:
§ Chairs are solid?

§ Answers:
§ 1980: write it all down
§ 2000: get by without it
§ 2020: learn it from data
Deeper Understanding: Reference

Names vs. Entities

Example Errors

Discovering Knowledge

Grounded Language

Grounding with Natural Data

… on the beige loveseat.
What is Nearby NLP?

§ Computational Linguistics
§ Using computational methods to learn more about how language works
§ We end up doing this and using it

§ Cognitive Science
§ Figuring out how the human brain works
§ Includes the bits that do language
§ Humans: the only working NLP prototype!

§ Speech Processing
§ Mapping audio signals to text
§ Traditionally separate from NLP, converging?
§ Two components: acoustic models and language models
§ Language models in the domain of stat NLP (see the bigram sketch below)
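A minimal sketch of the language-model half: a bigram model with add-one smoothing, trained on a made-up three-sentence corpus.

```python
# Bigram language model with add-one (Laplace) smoothing.
from collections import Counter

corpus = ["the dog barks", "the cat meows", "the dog runs"]
bigrams, context = Counter(), Counter()
vocab = set()
for sent in corpus:
    toks = ["<s>"] + sent.split()
    vocab.update(toks)
    for prev, cur in zip(toks, toks[1:]):
        bigrams[(prev, cur)] += 1
        context[prev] += 1

def prob(cur: str, prev: str) -> float:
    # P(cur | prev), smoothed so unseen bigrams get nonzero mass
    return (bigrams[(prev, cur)] + 1) / (context[prev] + len(vocab))

print(prob("dog", "the"))    # frequent continuation: relatively high
print(prob("meows", "dog"))  # unseen bigram: small but nonzero
```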
  
Example: NLP Meets CL

§ Example: Language change, reconstructing ancient forms, phylogenies

… just one example of the kinds of linguistic models we can build
What is NLP research?

§ Three aspects we often investigate:

§ Linguistic Issues
§ What is the range of language phenomena?
§ What are the knowledge sources that let us disambiguate?
§ What representations are appropriate?
§ How do you know what to model and what not to model?

§ Statistical Modeling Methods
§ Increasingly complex model structures
§ Learning and parameter estimation
§ Efficient inference: dynamic programming, search, sampling

§ Engineering Methods
§ Issues of scale
§ Where the theory breaks down (and what to do about it)
Some Early NLP History

§ 1950’s:
§ Foundational work: automata, information theory, etc.
§ First speech systems
§ Machine translation (MT) hugely funded by military
§ Toy models: MT using basically word substitution
§ Optimism!

§ 1960’s and 1970’s: NLP Winter
§ Bar-Hillel (FAHQT) and ALPAC reports kill MT
§ Work shifts to deeper models, syntax
§ … but toy domains / grammars (SHRDLU, LUNAR)

§ 1980’s and 1990’s: The Empirical Revolution
§ Expectations get reset
§ Corpus-based methods become central
§ Deep analysis often traded for robust and simple approximations
§ Evaluate everything

§ 2000+: Richer Statistical Methods
§ Models increasingly merge linguistically sophisticated representations with statistical methods, confluence and clean-up
§ Begin to get both breadth and depth
Problem: Structure

§ Headlines:
§ Enraged Cow Injures Farmer with Ax
§ Teacher Strikes Idle Kids
§ Hospitals Are Sued by 7 Foot Doctors
§ Ban on Nude Dancing on Governor’s Desk
§ Iraqi Head Seeks Arms
§ Stolen Painting Found by Tree
§ Kids Make Nutritious Snacks
§ Local HS Dropouts Cut in Half

§ Why are these funny?
Problem: Scale

[Figure: one of many possible syntactic analyses, labeled with POS tags (DET, ADJ, NOUN, PLURAL NOUN) and phrases (NP, PP, CONJ)]

§ People did know that language was ambiguous!
§ … but they hoped that all interpretations would be “good” ones (or ruled out pragmatically)
§ … they didn’t realize how bad it would be
Classical NLP: Parsing

§ Write symbolic or logical rules:
§ Use deduction systems to prove parses from words
§ Minimal grammar on “Fed raises” sentence: 36 parses
§ Simple 10-rule grammar: 592 parses
§ Real-size grammar: many millions of parses
§ This scaled very badly and didn’t yield broad-coverage tools

Grammar (CFG)
ROOT → S
S → NP VP
NP → DT NN
NP → NN NNS
NP → NP PP
VP → VBP NP
VP → VBP NP PP
PP → IN NP
…

Lexicon
NN → interest
NNS → raises
VBP → interest
VBZ → raises
…
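A runnable illustration of the ambiguity counts above, assuming NLTK is installed. The grammar is a small adaptation of the toy grammar on this slide (not the original 10-rule grammar); even so, a four-word sentence gets two parses.

```python
# "Fed raises interest rates": [the] Fed raises [interest rates],
# vs. [Fed raises] (pay raises at the Fed) interest [rates] (someone).
import nltk

grammar = nltk.CFG.fromstring("""
    ROOT -> S
    S -> NP VP
    NP -> NNP | NN NNS | NNS | NP PP
    VP -> VBZ NP | VBP NP
    PP -> IN NP
    NNP -> 'Fed'
    NN -> 'Fed' | 'interest'
    NNS -> 'raises' | 'rates'
    VBZ -> 'raises'
    VBP -> 'interest'
    IN -> 'in'
""")

parser = nltk.ChartParser(grammar)
parses = list(parser.parse("Fed raises interest rates".split()))
print(len(parses))  # 2
for tree in parses:
    print(tree)
```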
Problem: Sparsity

§ However: sparsity is always a problem
§ New unigram (word), bigram (word pair), and rule rates in newswire

[Figure: fraction of tokens whose type was already seen, as a function of corpus size (0 to 1,000,000 words); the unigram curve approaches 1 quickly while the bigram curve stays far lower]
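A sketch of how a curve like this is computed: stream tokens and track, at checkpoints, what fraction of token occurrences had a previously seen type. The Zipf-like random stream below is a stand-in for real newswire text.

```python
# "Fraction seen" curves for unigrams vs. bigrams over a token stream.
import random

random.seed(0)
# Stand-in corpus: heavy-tailed integer "words", roughly Zipf-like
tokens = [str(min(int(random.paretovariate(1.2)), 50_000))
          for _ in range(200_000)]

def fraction_seen_curve(items, checkpoint=20_000):
    seen, hits, total, curve = set(), 0, 0, []
    for i, item in enumerate(items, start=1):
        total += 1
        hits += item in seen   # was this type seen before?
        seen.add(item)
        if i % checkpoint == 0:
            curve.append((i, hits / total))
    return curve

unigram_curve = fraction_seen_curve(tokens)
bigram_curve = fraction_seen_curve(list(zip(tokens, tokens[1:])))
print(unigram_curve[-1])  # unigrams: most occurrences already seen
print(bigram_curve[-1])   # bigrams: far lower, as in the figure
```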
The (Effective) NLP Cycle

§ Pick a problem (usually some disambiguation)
§ Get a lot of data (usually a labeled corpus)
§ Build the simplest thing that could possibly work
§ Repeat:
§ Examine the most common errors
§ Figure out what information a human might use to avoid them
§ Modify the system to exploit that information
§ Feature engineering
§ Representation redesign
§ Different machine learning methods

§ We do this over and over again
