© 2019 The MITRE Corporation. All rights reserved.
Apache Tika
Tim Allison
tallison@apache.org, @_tallison
April 24, 2019
Haystack Conference
Approved for Public Release; Distribution Unlimited. Case Number 18-3138-6
Overview
▪ What is Tika
▪ tika-eval
▪ Running Tika safely
▪ Coming out in 1.21 and beyond
Text/Metadata Extraction
Things Can Happen
▪ Tired:
– Exceptions
– Unsupported file formats
– Encrypted files
– Garbled text
– Missing text
▪ Wired:
– OOM
– Seg fault
– Infinite loops
– Multithreaded garbage collector pegging all CPU resources
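One common defence against the "wired" failures above (OOM, seg faults, infinite loops) is to run each extraction in a child process guarded by a timeout, so a bad file takes down only the child. The following is a minimal sketch of that general technique, not Tika's own implementation; `run_isolated` and `DEMO_CMD` are hypothetical names, and the demo command is a trivial stand-in for a real extractor invocation.

```python
import subprocess
import sys

# Hypothetical stand-in for a real extractor command line; it only prints a string.
DEMO_CMD = [sys.executable, "-c", "print('extracted text')"]


def run_isolated(cmd, timeout_s):
    """Run cmd in a child process and classify the outcome instead of raising.

    Returns a (status, payload) tuple where status is "ok", "crash" or "timeout".
    """
    try:
        # subprocess.run kills the child if the timeout expires, then raises.
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # Covers infinite loops and hangs: the parent stays responsive.
        return ("timeout", "")
    if proc.returncode != 0:
        # Covers crashes in the child (non-zero exit, or killed by a signal).
        return ("crash", proc.stderr)
    return ("ok", proc.stdout)
```

A caller would log the status per file, so that timeouts and crashes show up in the statistics rather than silently producing empty documents.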
Stands up on Soap Box
Upgrade from PDFBox 1.8.6->1.8.7
Soap Box
If your search system can’t tell the difference between those two…
👍You’ve got a neat, little demo!👍
You don’t have a search system.
Steps Off of Soap Box
tika-eval
▪ Profile individual runs
▪ Compare two runs
▪ Exceptions by mime
▪ Out of vocabulary (OOV) statistics
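To make two of the statistics above concrete, here is a rough sketch of an out-of-vocabulary rate and a per-mime exception rate. It assumes a caller-supplied vocabulary set and a simple letters-only tokenizer; the slides do not show how tika-eval tokenizes or builds its word lists, so the function names and inputs here are illustrative only.

```python
import re
from collections import Counter

# Assumption: a token is a run of Unicode letters (no digits, no underscore).
TOKEN = re.compile(r"[^\W\d_]+")


def oov_rate(text, vocabulary):
    """Fraction of tokens in text that are not in vocabulary (0.0 if no tokens)."""
    tokens = [t.lower() for t in TOKEN.findall(text)]
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in vocabulary)
    return missing / len(tokens)


def exceptions_by_mime(records):
    """records: iterable of (mime, exception_name_or_None) pairs, one per file.

    Returns {mime: fraction of files of that mime that raised an exception}.
    """
    totals, failures = Counter(), Counter()
    for mime, exc in records:
        totals[mime] += 1
        if exc is not None:
            failures[mime] += 1
    return {mime: failures[mime] / totals[mime] for mime in totals}
```

A high OOV rate is a hint that the extracted text is garbled (for example, words split into fragments), even when the parser raised no exception.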
tika-eval: Eating our own dog food
▪ 3 million files (~1 TB) from Common Crawl and govdocs1 hosted on a public virtual machine, provided by Rackspace
▪ Code to profile a single run or compare two runs before release
▪ Evaluation methodology co-developed with and now co-run by open source colleagues (around the world) on the MSOffice parser project and the PDF parser project
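Comparing two runs before a release can be pictured as follows: extract the same files with the old and new parser versions, then flag files whose text changed substantially. This is a simplified sketch under stated assumptions, not tika-eval's code: runs are plain dicts mapping file name to extracted text, tokens are whitespace-separated, and similarity is a Dice-style overlap of token counts. The names `token_overlap` and `compare_runs` and the 0.9 threshold are hypothetical.

```python
from collections import Counter


def token_overlap(text_a, text_b):
    """Dice-style overlap of token multisets: 1.0 = identical, 0.0 = disjoint."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 1.0  # two empty extracts count as identical
    shared = sum((a & b).values())  # multiset intersection
    return 2 * shared / total


def compare_runs(run_a, run_b, threshold=0.9):
    """Return [(file_name, overlap)] for files present in both runs
    whose overlap falls below threshold, sorted by file name."""
    flagged = []
    for name in sorted(run_a.keys() & run_b.keys()):
        score = token_overlap(run_a[name], run_b[name])
        if score < threshold:
            flagged.append((name, score))
    return flagged
```

Files that drop out of one run entirely (for example, because of a new exception) would need separate reporting; this sketch only looks at files present in both.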
Tika 1.21 and beyond
▪ Tika 1.21
– csv/tsv detector and parser (Apache commons-csv)
– Improved zip-based (.docx, .pptx, .xlsx) file detection and parsing
▪ Beyond
– Modularize tika-eval and include stats within the extract for scalability and aggregation of stats within Solr/Elastic
– Increase coverage/speed of zip-based file detection; can we move entirely to streaming detection?
– Improve language coverage/lang id component within tika-eval
▪ Help!
– What do you need?
– How can you help us help you?
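As background for the zip-based detection items above: .docx, .pptx and .xlsx files are all zip containers, so telling them apart means looking inside the archive. The sketch below is a simplified heuristic and an assumption on my part, not Tika's detector: it checks for the conventional main part names and uses Python's `zipfile`, which reads the central directory at the end of the file and is therefore not a streaming approach. `guess_ooxml_kind` and `make_zip` are hypothetical helper names.

```python
import io
import zipfile

# Conventional main part names for each format (heuristic, not exhaustive).
MARKERS = {
    "word/document.xml": "docx",
    "ppt/presentation.xml": "pptx",
    "xl/workbook.xml": "xlsx",
}


def guess_ooxml_kind(data):
    """Return "docx", "pptx", "xlsx", plain "zip", or None if data is not a zip."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return None
    for marker, kind in MARKERS.items():
        if marker in names:
            return kind
    return "zip"


def make_zip(names):
    """Build a tiny in-memory zip with the given entry names (for experiments)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "<x/>")
    return buf.getvalue()
```

Because this needs the whole archive available before it can answer, it illustrates why a fully streaming detector, which would decide from the leading entries alone, is posed as an open question.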

Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison