ToxicDocs
Gautam Shine
ToxicDocs
Background
‱ Columbia and CPR have 4 million newly declassified legal documents
‱ Of interest to journalists, historians, attorneys, public health officials
‱ How can this repository be made more accessible and useful?
‱ More generally, what can we do to get insights out of a document dump?
ToxicDocs
I aim to:
1. Categorize documents into types
such as memo, ad, scientific study
2. Provide retrieval of documents
similar to a given one
3. Infer missing attributes (such as the
year) by parsing text
4. Visualize trends for topics over time
The Actual Data
DIVISION OF SAFETY ANIj . n -IENE
The Industrial Commission of Ohio 700
W. Third Avenue, Columbus, Ohio 43212
.e~2e70 ~ .e~.sslle
IlIdullri,1 COOt",lIlio.
M. IIOlLAWO nlSE ChI"m,.
mm l. tHOOU Go'ftrttt
""i,,,.DIYII;'" of
wllty ",41
101m t. wl.mEt MtIlMr
lHOI1W0,,,',I1f.11G,,.,U,,,U",GHU
~
~ ~L.yI0Il" P. SHEOIAN
OCR Quality
‱ Out-of-vocabulary (OOV) text, i.e. meaningless gibberish, is limited in most documents
‱ Most documents have >60% readable text
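One way to quantify the readable share of an OCR'd page is to count the tokens that appear in a reference vocabulary. The sketch below is only an illustration: the function name `readable_fraction`, the toy vocabulary, and the letters-only tokenization are assumptions, not the measure used in this work.

```python
import re


def readable_fraction(text, vocabulary):
    """Return the share of alphabetic tokens that appear in `vocabulary`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)


# Toy vocabulary (assumption); a real run would load a full word list.
vocab = {"the", "industrial", "commission", "of", "ohio"}

# One clean line followed by three garbled OCR tokens.
sample = "The Industrial Commission of Ohio IlIdullri COOt lIlio"
print(readable_fraction(sample, vocab))  # 5 of 8 tokens are known -> 0.625
```

A document would then count as mostly readable when this fraction exceeds a chosen cutoff such as 0.6.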
Classification Task
‱ 50k documents
‱ 1k randomly selected and labeled
‱ 15 unbalanced classes
‱ Documents can be:
  ‱ subtle
  ‱ handwritten
  ‱ illustrations
  ‱ not in English
  ‱ blank (?)
Classification Algorithm
Model
‱ Linear-kernel one-vs-all SVM (liblinear)
‱ Squared hinge loss + L2 regularization
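To make the model concrete, here is a toy one-vs-all linear SVM with squared hinge loss and L2 regularization, trained by plain batch gradient descent. It is a pure-Python stand-in for illustration, not liblinear; the function names, the learning rate, the epoch count, and the three-feature toy data are all assumptions.

```python
def decision(w, b, x):
    """Signed score w.x + b; its magnitude grows with distance from the hyperplane."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_binary(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimise 0.5*||w||^2 + C*sum(max(0, 1 - y*(w.x + b))^2) by gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # the L2 term contributes w to the gradient
        for xi, yi in zip(X, y):
            slack = 1.0 - yi * decision(w, b, xi)
            if slack > 0:  # only margin violators contribute to the loss
                for j, xj in enumerate(xi):
                    grad_w[j] -= 2.0 * C * slack * yi * xj
                grad_b -= 2.0 * C * slack * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def train_one_vs_all(X, labels, **kwargs):
    """Fit one binary model per class: that class is +1, every other class is -1."""
    return {
        c: train_binary(X, [1 if label == c else -1 for label in labels], **kwargs)
        for c in sorted(set(labels))
    }


def predict(models, x):
    """Return the class whose binary model gives the highest score."""
    return max(models, key=lambda c: decision(*models[c], x))


# Toy feature vectors (assumption): one dominant feature per document type.
X = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
     [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
     [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
labels = ["memo", "memo", "ad", "ad", "study", "study"]

models = train_one_vs_all(X, labels)
print(predict(models, [0.8, 0.2, 0.0]))
```

Sorting the per-class scores instead of taking only the maximum gives the ranked list needed for a top-3 prediction.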
Performance
‱ Accuracy: 62% (top 3: 88%), mean F1: 0.57
‱ Baseline: 15% using regular expressions
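The two headline metrics can be computed as sketched below. This assumes "mean F1" is the unweighted (macro) average of per-class F1 scores, which the slide does not state; the function names and the toy labels are hypothetical.

```python
def top_k_accuracy(ranked_predictions, truth, k):
    """Share of documents whose true label is among the k highest-ranked classes."""
    hits = sum(1 for ranked, t in zip(ranked_predictions, truth) if t in ranked[:k])
    return hits / len(truth)


def macro_f1(predicted, truth):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for c in sorted(set(truth) | set(predicted)):
        tp = sum(1 for p, t in zip(predicted, truth) if p == c and t == c)
        fp = sum(1 for p, t in zip(predicted, truth) if p == c and t != c)
        fn = sum(1 for p, t in zip(predicted, truth) if p != c and t == c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (assumption): each inner list ranks classes from most to least likely.
truth = ["memo", "ad", "study", "memo"]
ranked = [["memo", "ad", "study"],
          ["memo", "ad", "study"],
          ["ad", "study", "memo"],
          ["ad", "memo", "study"]]
top1 = [r[0] for r in ranked]

print(top_k_accuracy(ranked, truth, 1))
print(top_k_accuracy(ranked, truth, 2))
print(macro_f1(top1, truth))
```

With unbalanced classes, the macro average weights rare classes as heavily as common ones, which is why it can sit below plain accuracy.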
Features
‱ Sublinear TF-IDF on n-gram features (56% accuracy)
‱ +6% from document length, NER, and semi-supervised label propagation
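As a rough illustration of the main feature set, the sketch below builds sublinear TF-IDF vectors over word unigrams and bigrams. The exact weighting used in this work is not given; the formula here, (1 + ln tf) * (ln(N / df) + 1) followed by L2 normalization, is one common variant and is an assumption, as are the function names and toy documents.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """All word n-grams of length 1..n_max, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def sublinear_tfidf(docs, n_max=2):
    """Return one {term: weight} dict per document, L2-normalised."""
    counts = [Counter(ngrams(doc.lower().split(), n_max)) for doc in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    n_docs = len(docs)
    vectors = []
    for c in counts:
        vec = {term: (1.0 + math.log(tf)) * (math.log(n_docs / df[term]) + 1.0)
               for term, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({term: v / norm for term, v in vec.items()})
    return vectors


docs = ["vinyl chloride vinyl chloride exposure", "vinyl memo"]
vectors = sublinear_tfidf(docs)

# "vinyl chloride" occurs in one document, "vinyl" in both,
# so the bigram receives the larger weight in the first document.
print(vectors[0]["vinyl chloride"] > vectors[0]["vinyl"])
```

The logarithm dampens repeated terms, so a word that appears 100 times in a long document does not count 100 times as much as a word that appears once.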
Visual Features
‱ Visual structure contains independent information
‱ 1-layer feed-forward network gets an accuracy of 15%
‱ Recent advances allow OCR with probabilistic models for language using recurrent networks
[Figure: panels labeled 1000 x 1000, 100 x 100, and 10 x 10]
Putting it to Use
‱ A search for “vinyl” (H2C=CHCl) reveals the following trend in the composition of documents over time
‱ Time (x) is inferred and type (y) is predicted
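A trend of this kind can be tabulated roughly as follows. The slides do not say how the year is inferred; the regular expression below, which picks the most frequent four-digit year in the text, is a hypothetical stand-in, and the function names and sample documents are invented for illustration.

```python
import re
from collections import Counter, defaultdict

YEAR = re.compile(r"\b(19\d{2}|20\d{2})\b")


def infer_year(text):
    """Return the most frequent four-digit year in the text, or None."""
    years = [int(y) for y in YEAR.findall(text)]
    return Counter(years).most_common(1)[0][0] if years else None


def composition_by_year(docs, query):
    """docs: iterable of (text, predicted_type). Return {year: {type: share}}."""
    tallies = defaultdict(Counter)
    for text, doc_type in docs:
        year = infer_year(text)
        if year is not None and query in text.lower():
            tallies[year][doc_type] += 1
    return {year: {t: n / sum(c.values()) for t, n in c.items()}
            for year, c in sorted(tallies.items())}


# Invented sample documents with already-predicted types.
docs = [("Memo dated 1971 on vinyl chloride", "memo"),
        ("Vinyl study, 1974. Revised 1974.", "study"),
        ("1974 vinyl press notice", "ad"),
        ("Unrelated 1974 letter", "memo")]

result = composition_by_year(docs, "vinyl")
print(result)
```

Each year maps to the share of every document type among the matching documents, which is the quantity plotted on the y-axis.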
Evolution of a Crisis
‱ chemicalindustryarchives.org/dirtysecrets/
‱ “By 1971 the industry knew without doubt that vinyl chloride caused cancer in animals.”
Evolution of a Crisis
‱ “In January 1974, B.F. Goodrich announced the presence of a rare liver cancer, angiosarcoma, in its polyvinyl chloride workers at its Louisville plant.”
Evolution of a Crisis
‱ “In May of 1974, the Occupational Safety and Health Administration (OSHA) proposed a maximum exposure level for vinyl chloride at a no detectable level”
toxicdocs.org
‱ Launching some time after the election
‱ This work will be scaled and integrated
‱ github.com/GautamShine/toxic-docs
About me:
‱ Ph.D. candidate at Stanford in electrical engineering
‱ Interned at an AI startup (ML/NLP) and twice at Intel (supercomputing)
‱ Outdoor enthusiast (climbing, diving) and backpacker (30+ countries)
ToxicDocs
Semi-supervised Learning
‱ General idea: make use of unlabeled data
‱ For SVMs, distance from separating hyperplane serves as prediction confidence
‱ Procedure used:
1. h_1 = argmax f(X_train)
2. Ć·_unlabeled = h_1(X_unlabeled)
3. X_absorbed = {x ∈ X_unlabeled | h_1(x) > C}
4. h_2 = argmax f(X_train + X_absorbed)
5. Ć·_test = h_2(X_test)
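The five steps above can be sketched as a short self-training loop. To stay self-contained, the example swaps the SVM for a simpler linear classifier (a hyperplane midway between the two class means) and handles only two classes; `fit_centroid`, `self_train`, the threshold value, and the toy points are assumptions. Confidence is taken as the absolute value of the signed score, so that confident points on either side of the hyperplane are absorbed.

```python
def fit_centroid(X, y):
    """Stand-in linear classifier: hyperplane midway between the class means.

    Returns a function giving the signed score w.x + b for a point x.
    """
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    mu_p = [sum(col) / len(pos) for col in zip(*pos)]
    mu_n = [sum(col) / len(neg) for col in zip(*neg)]
    w = [p - n for p, n in zip(mu_p, mu_n)]
    b = -sum(wi * (p + n) / 2 for wi, p, n in zip(w, mu_p, mu_n))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


def self_train(X_train, y_train, X_unlabeled, threshold, fit=fit_centroid):
    h1 = fit(X_train, y_train)                           # step 1: fit on labeled data
    scores = [h1(x) for x in X_unlabeled]                # step 2: score unlabeled data
    absorbed = [(x, 1 if s > 0 else -1)                  # step 3: keep confident points
                for x, s in zip(X_unlabeled, scores) if abs(s) > threshold]
    X_aug = list(X_train) + [x for x, _ in absorbed]
    y_aug = list(y_train) + [label for _, label in absorbed]
    h2 = fit(X_aug, y_aug)                               # step 4: refit on the union
    return h2, absorbed


X_train = [[0.0, 0.0], [2.0, 2.0]]
y_train = [-1, 1]
X_unlabeled = [[3.0, 3.0], [1.1, 1.0], [-1.0, -1.0]]

h2, absorbed = self_train(X_train, y_train, X_unlabeled, threshold=1.0)
print(absorbed)          # the point near the boundary is left out
print(h2([2.0, 2.0]))    # step 5: score new points with the refit model
```

The threshold trades coverage for label noise: a low value absorbs more points but risks training the second model on wrong pseudo-labels.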
Kernels for Text
‱ n-gram space is high-d and sparse
‱ Most points lie in a low-d subspace
‱ Data is often trivially separable
‱ Projection to higher dimensions doesn’t reduce bias much
‱ But pays a price with higher variance
‱ Big speed difference in NLP problems
‱ Linear kernel is O(p) from SGD on hinge loss
‱ Non-linear kernels are O(p^2) from coordinate ascent on the Lagrange dual
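The O(p) claim for the linear case can be seen in a single SGD step on the hinge loss: each update touches every feature weight once. The sketch below uses a Pegasos-style step size of 1 / (lam * t) and no bias term; those choices, the function name, and the toy data are assumptions for illustration, with p denoting the number of features.

```python
import random


def sgd_hinge(X, y, lam=0.01, epochs=20, seed=0):
    """SGD on L2-regularised hinge loss; every update costs O(p) for p features."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)                                     # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))   # O(p) dot product
            w = [(1.0 - eta * lam) * wj for wj in w]                  # O(p) shrink from L2 term
            if margin < 1:                                            # hinge loss is active
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]  # O(p) correction
    return w


# Toy data (assumption): the first feature marks class +1, the second class -1.
X = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
y = [1, 1, -1, -1]

w = sgd_hinge(X, y)
print(w[0] > 0, w[1] < 0)
```

With sparse n-gram vectors, the dot product and the correction only need to visit the nonzero features of the sampled document, which keeps each step cheap even when p is very large.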
