ToxicDocs

Background
• Columbia and CPR have 4 million
newly declassified legal documents
• Of interest to journalists, historians,
attorneys, public health officials
• How can this repository be made
more accessible and useful?
• More generally, what can we do to get
insights out of a document dump?

I aim to:
1. Categorize documents into types
such as memo, ad, scientific study
2. Provide retrieval of documents
similar to a given one
3. Infer missing attributes (such as the
year) by parsing text
4. Visualize trends for topics over time
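
Aim 3 can be illustrated with a minimal sketch. It assumes the year is recoverable as the most frequent plausible four-digit number in the OCR text; the function name `infer_year` and the 1800-2019 range are illustrative assumptions, not the project's actual parser.

```python
import re
from collections import Counter

# Assumed plausible range for document years (1800-2019); adjust to the corpus.
YEAR_PATTERN = re.compile(r"\b(1[89]\d{2}|20[01]\d)\b")

def infer_year(text):
    """Return the most frequent plausible four-digit year in text, or None."""
    years = YEAR_PATTERN.findall(text)
    if not years:
        return None
    return int(Counter(years).most_common(1)[0][0])

print(infer_year("Memo dated May 3, 1974. Re: 1974 proposal, see 1971 study."))
```

The word boundaries keep longer digit runs, such as ZIP codes, from being read as years.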

The Actual Data
DIVISION OF SAFETY ANIj . n -IENE
The Industrial Commission of Ohio 700
W. Third Avenue, Columbus, Ohio 43212
.e~2e70 ~ .e~.sslle
IlIdullri,1 COOt",lIlio.
M. IIOlLAWO nlSE ChI"m,.
mm l. tHOOU Go'ftrttt
""i,,,.DIYII;'" of
wllty ",41
101m t. wl.mEt MtIlMr
lHOI1W0,,,',I1f.11G,,.,U,,,U",GHU
~
~ ~L.yI0Il" P. SHEOIAN

OCR Quality
• Out-of-Vocabulary (OOV) text, i.e. meaningless
gibberish, is limited in most documents
• Most documents have >60% readable text
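
One way to estimate the readable share of a document is to count tokens found in a known vocabulary. The sketch below is a rough illustration; the toy vocabulary and the name `readable_fraction` are assumptions, and a real run would load a full word list.

```python
import re

def readable_fraction(text, vocabulary):
    """Fraction of alphabetic tokens found in vocabulary (case-insensitive)."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0
    in_vocab = sum(1 for token in tokens if token in vocabulary)
    return in_vocab / len(tokens)

# Toy vocabulary for illustration only.
vocab = {"the", "industrial", "commission", "of", "ohio", "division", "safety"}
clean = "The Industrial Commission of Ohio"
noisy = "IlIdullri COOt lIlio Commission of Ohio"

print(readable_fraction(clean, vocab))  # 1.0
print(readable_fraction(noisy, vocab))  # 0.5
```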

Classification Task
• 50k documents
• 1k randomly sampled and labeled
• 15 unbalanced classes
• Documents can be:
• subtle
• handwritten
• illustrations
• not in English
• blank (?)

Classification Algorithm
Model
• Linear kernel one-v-all SVM (liblinear)
• Squared hinge loss + L2 regularization
Performance
• Accuracy: 62% (top 3: 88%), mean F1: 0.57
• Baseline: 15% using regular expressions
Features
• Sublinear TF-IDF on n-gram features (56%)
• +6% from document length, NER, and
semi-supervised label propagation
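
Sublinear TF-IDF dampens repeated terms by replacing the raw count tf with 1 + ln(tf). The sketch below shows the idea on word n-grams using only the standard library. The unsmoothed idf form ln(N / df), whitespace tokenization, and the function names are assumptions made for illustration; library implementations typically smooth and normalize differently.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """All word n-grams of length 1..n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def sublinear_tfidf(docs, n_max=2):
    """Map each document to {ngram: (1 + ln tf) * ln(N / df)}."""
    counts = [Counter(ngrams(doc.lower().split(), n_max)) for doc in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    return [{term: (1 + math.log(tf)) * math.log(n_docs / df[term])
             for term, tf in c.items()}
            for c in counts]

vectors = sublinear_tfidf(["vinyl chloride vinyl chloride memo", "vinyl study"])
print(vectors[0]["chloride"])
```

With this idf form, an n-gram that appears in every document gets weight zero, so only distinguishing n-grams contribute to the feature vector.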

Visual Features
• Visual structure contains independent information
• 1-layer feed-forward network gets an accuracy of 15%
• Recent advances allow OCR with probabilistic
models for language using recurrent networks
[Figure: panels labeled 1000 x 1000, 100 x 100, and 10 x 10]

Putting it to Use
• A search for “vinyl” (H₂C=CHCl) reveals the following trend
in the composition of documents over time
• Time (x) is inferred and type (y) is predicted
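
A trend like this can be tabulated by grouping matching documents by inferred year and predicted type. The sketch below assumes each document is a dict with hypothetical keys `text`, `year`, and `type`; the sample records are made up for illustration and are not results from the corpus.

```python
from collections import Counter, defaultdict

def type_composition_by_year(records, query):
    """Per year, the share of each predicted type among documents matching query."""
    counts = defaultdict(Counter)
    for rec in records:
        # Skip documents whose year could not be inferred.
        if rec["year"] is not None and query in rec["text"].lower():
            counts[rec["year"]][rec["type"]] += 1
    return {year: {t: n / sum(c.values()) for t, n in c.items()}
            for year, c in sorted(counts.items())}

# Made-up records for illustration only.
records = [
    {"text": "Vinyl chloride memo", "year": 1971, "type": "memo"},
    {"text": "vinyl study", "year": 1971, "type": "scientific study"},
    {"text": "vinyl ad", "year": 1974, "type": "ad"},
    {"text": "asbestos memo", "year": 1974, "type": "memo"},
    {"text": "vinyl undated", "year": None, "type": "memo"},
]
print(type_composition_by_year(records, "vinyl"))
```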

Evolution of a Crisis
• chemicalindustryarchives.org/dirtysecrets/
• “By 1971 the industry knew without doubt that vinyl
chloride caused cancer in animals.”

• “In January 1974, B.F. Goodrich announced the presence of
a rare liver cancer, angiosarcoma, in its polyvinyl chloride
workers at its Louisville plant.”

• “In May of 1974, the Occupational Safety and Health
Administration (OSHA) proposed a maximum exposure level
for vinyl chloride at a no detectable level”

toxicdocs.org
• Launching some time after the election
• This work will be scaled and integrated
• github.com/GautamShine/toxic-docs
About me:
• Ph.D. candidate at Stanford in electrical engineering
• Interned at an AI startup (ML/NLP) and twice at Intel (supercomputing)
• Outdoor enthusiast (climbing, diving) and backpacker (30+ countries)

Semi-supervised Learning
• General idea: make use of unlabeled data
• For SVMs, distance from separating hyperplane
serves as prediction confidence
• Procedure used:
1. h_1 = argmax f(X_train)
2. ŷ_unlabeled = h_1(X_unlabeled)
3. X_absorbed = {x ∈ X_unlabeled | h_1(x) > C}
4. h_2 = argmax f(X_train + X_absorbed)
5. ŷ_test = h_2(X_test)
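
The procedure above can be sketched in a few lines. To stay self-contained, this sketch swaps the SVM for a nearest-centroid classifier and uses the gap between the two nearest centroids as a stand-in for distance from the hyperplane; all function names are illustrative assumptions. It needs at least two classes.

```python
def fit_centroids(X, y):
    """Stand-in classifier (not an SVM): one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence is the distance gap between
    the nearest and second-nearest centroid."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5, label)
                   for label, c in centroids.items())
    (d1, label), (d2, _) = dists[0], dists[1]
    return label, d2 - d1

def self_train(X_train, y_train, X_unlabeled, threshold):
    h1 = fit_centroids(X_train, y_train)                            # step 1
    preds = [predict_with_confidence(h1, x) for x in X_unlabeled]   # step 2
    absorbed = [(x, label) for x, (label, conf) in zip(X_unlabeled, preds)
                if conf > threshold]                                # step 3
    X_aug = X_train + [x for x, _ in absorbed]
    y_aug = y_train + [label for _, label in absorbed]
    return fit_centroids(X_aug, y_aug), absorbed                    # step 4

h2, absorbed = self_train([[0.0, 0.0], [4.0, 0.0]], ["memo", "ad"],
                          [[1.0, 0.0], [2.1, 0.0], [5.0, 0.0]], threshold=1.0)
print(predict_with_confidence(h2, [2.1, 0.0])[0])                   # step 5
```

Only confidently predicted points are absorbed, so the ambiguous point at 2.1 is left out of the second fit.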

Kernels for Text
• n-gram space is high-d and sparse
• Most points lie in a low-d subspace
• Data is often trivially separable
• Projection to higher dimensions doesn’t
reduce bias much
• But pays a price with higher variance
• Big speed difference in NLP problems
• Linear kernel is O(p) from SGD on hinge loss
• Non-linear kernels are O(p^2) from
coordinate ascent on the Lagrange dual
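
To see where the O(p) cost of the linear case comes from, here is a minimal sketch of stochastic subgradient descent on the L2-regularized hinge loss. Each update touches every one of the p weights once. The hyperparameters, the absence of a bias term, and the function name are assumptions for illustration; this is not liblinear's solver.

```python
def sgd_linear_svm(X, y, lam=0.01, lr=0.1, epochs=20):
    """Subgradient descent on lam/2 * ||w||^2 + max(0, 1 - y * w.x),
    one example at a time. Labels must be +1 or -1. No bias term."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            # Margin computation: one pass over p features.
            margin = label * sum(wi * xi for wi, xi in zip(w, x))
            # Weight update: another pass over p features.
            for i in range(len(w)):
                grad = lam * w[i] - (label * x[i] if margin < 1 else 0.0)
                w[i] -= lr * grad
    return w

# Toy, linearly separable data for illustration only.
X = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]
w = sgd_linear_svm(X, y)
print([1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1 for x in X])
```

With sparse n-gram features, both passes could iterate over only the non-zero entries of x if the regularization shrinkage is applied lazily.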
