Presented by: Rohit Sharma
What Is Apache Tika?
Apache Tika is a library used for document type detection and content extraction from various file formats.
Internally, Tika uses various existing document parsers and document type detection techniques to detect and
extract data.
Tika provides a single generic API for parsing different file formats. It relies on 83 existing specialized parser
libraries, such as Apache PDFBox and Apache POI, to handle the individual document types.
All these parser libraries are encapsulated under a single interface called the Parser interface.
Why Tika?
According to filext.com, there are about 15k to 51k content types, and this number is growing day by day.
Data is stored in various formats such as text documents, Excel spreadsheets, PDFs, images, and multimedia
files.
Therefore, applications such as search engines and content management systems need additional support for easy
extraction of data from these document types.
Apache Tika serves this purpose by providing a generic API to locate and extract data from multiple file formats.
Apache Tika Applications
Search Engines
Tika is widely used while developing search engines to index the text contents of digital documents.
Digital Asset Management
Some organizations manage digital assets such as photographs, e-books, drawings, music, and video
using a special application known as a digital asset management (DAM) system.
Such applications rely on document type detectors and metadata extractors to classify the various
documents.
Content Analysis
Websites like Amazon recommend newly released content to individual users according to
their interests.
To do so, these websites draw on social media sites like Facebook to extract the required information,
such as users' likes and interests. The gathered information arrives as HTML tags or in other
formats that require further content type detection and extraction.
Features Of Tika
Unified parser interface: Tika encapsulates all the third-party parser libraries within a single parser interface.
As a result, the user is spared the burden of selecting a suitable parser library and using it
according to the file type encountered.
Fast processing: Quick content detection and extraction can be expected.
Parser integration: Tika can use the various parser libraries available for each document type in a single application.
MIME type detection: Tika can detect and extract content from all the media types included in the MIME
standards.
Language detection: Tika includes a language identification feature, so it can be used to classify documents by
language on multilingual websites.
Functionalities Of Tika
Tika supports various functionalities:
Document type detection
Content extraction
Metadata extraction
Language detection
Tika Facade Class
Users can embed Tika in their applications using the Tika facade class. It has methods to explore all the
functionalities of Tika.
Since it is a facade class, Tika abstracts the complexity behind its functions.
It abstracts all the internal implementations and provides simple methods to access the Tika functionalities.
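The facade idea can be sketched without Tika at all. The snippet below is only an illustration: TextParser and MiniFacade are hypothetical names, the extension-based lookup and the naive tag stripping are assumptions made for brevity, and none of it reflects how Tika is implemented internally.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a format-specific parser; not a Tika type. */
interface TextParser {
    String extract(String raw);
}

/** Hypothetical facade: picks a parser by file extension and hides that choice. */
final class MiniFacade {
    private final Map<String, TextParser> parsers = new LinkedHashMap<>();

    MiniFacade() {
        // Plain text is passed through after trimming.
        parsers.put("txt", raw -> raw.trim());
        // Very naive tag stripping, for illustration only.
        parsers.put("html", raw -> raw.replaceAll("<[^>]*>", "").trim());
    }

    /** Single entry point: the caller never selects a parser directly. */
    String parseToString(String fileName, String raw) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextParser parser = parsers.get(ext);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for: " + fileName);
        }
        return parser.extract(raw);
    }

    public static void main(String[] args) {
        MiniFacade facade = new MiniFacade();
        System.out.println(facade.parseToString("a.txt", "  hello  "));
        System.out.println(facade.parseToString("b.html", "<p>hello <b>world</b></p>"));
    }
}
```

The caller only ever sees parseToString(); adding a new format means registering one more parser inside the facade.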
Some Methods of the Tika Facade Class
String parseToString (File file) :- This method and all its variants parse the file passed as a parameter and return the
extracted text content as a String. By default, the length of the returned string is limited.
int getMaxStringLength () :- Returns the maximum length of strings returned by the parseToString methods.
void setMaxStringLength (int maxStringLength) :- Sets the maximum length of strings returned by the parseToString methods.
Reader parse (File file) :- This method and all its variants parse the file passed as a parameter and return the extracted text content as
a java.io.Reader object.
String detect (InputStream stream, Metadata metadata) :- This method and all its variants accept an InputStream object and a
Metadata object as parameters, detect the type of the given document, and return the document type name as a String. This
method abstracts the detection mechanisms used by Tika.
Parser interface
This is the interface implemented by all the parser classes in the Tika package.
The following is the most important method of the Tika Parser interface:
❖ parse (InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
This method parses the given document into a sequence of XHTML SAX events. After parsing, it places the extracted
document content in the ContentHandler object and the metadata in the Metadata object.
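To make "a sequence of XHTML SAX events" concrete, here is a small sketch that uses only the JDK's SAX parser. BodyTextHandler is a hypothetical class written for this example; it collects the text found inside the body element, which is roughly the role a body-oriented content handler plays, but it is not Tika code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that keeps only character data inside <body>. */
final class BodyTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than 0 while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    /** Feeds an XHTML string through a SAX parser and returns the body text. */
    static String bodyText(String xhtml) {
        BodyTextHandler handler = new BodyTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return handler.toString();
    }
}
```

Each start tag, end tag, and run of text arrives as a separate callback; the handler decides what to keep, which is why the same parse can feed very different handlers.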
Metadata class
The following are methods of this class:
1. add (Property property, String value)
2. add (String name, String value)
3. String get (Property property)
4. String get (String name)
5. Date getDate (Property property)
6. String[] getValues (Property property)
7. String[] getValues (String name)
8. String[] names()
9. set (Property property, Date date)
10. set(Property property, String[] values)
LanguageIdentifier class
This class identifies the language of the given content.
LanguageIdentifier (String content) : This constructor instantiates a language identifier from the given String of text content.
String getLanguage () : Returns the language identified for the current LanguageIdentifier object.
Type detection in Tika
Tika supports all the Internet media types defined by MIME. Whenever a file is passed through Tika, it detects the
file's document type.
Type detection using the facade class
The detect() method of the facade class is used to detect the document type. This method accepts a file as input.
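Eg:
import java.io.File;
import org.apache.tika.Tika;

File file = new File("example.mp3");
Tika tika = new Tika();
// detect(File) may throw IOException; it returns the media type name as a String.
String filetype = tika.detect(file);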
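One way to picture what a detector does is to look at a file's leading "magic" bytes and fall back to the file name when nothing matches. The sketch below is a library-free illustration under that assumption; TypeSniffer and its short signature list are hypothetical and cover only a few formats, so it is not a substitute for Tika's detect().

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical detector: checks leading magic bytes, then falls back to the file name. */
final class TypeSniffer {
    static String detect(byte[] head, String fileName) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        if (startsWith(head, "ID3".getBytes(StandardCharsets.US_ASCII))) {
            return "audio/mpeg"; // MP3 file that begins with an ID3v2 tag
        }
        // No signature matched: fall back to the extension, if any.
        String name = fileName == null ? "" : fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".mp3")) {
            return "audio/mpeg";
        }
        if (name.endsWith(".txt")) {
            return "text/plain";
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Checking content before the name means a PDF saved with the wrong extension is still reported as a PDF.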
Content extraction using Tika
For parsing documents, the parseToString() method of the Tika facade class is generally used.
The steps involved in the parsing process are abstracted by the parseToString()
method.
Content extraction using the Parser interface
The parser package of Tika provides several interfaces and classes with which we can parse a text document.
parse() method
Along with parseToString(), you can also use the parse() method of the Parser interface. The prototype of this
method is:
parse (InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
Given below is an example that shows how the parse() method is used.
1. Instantiate any of the classes providing the implementation for this interface.
Parser parser = new AutoDetectParser();
(or)
Parser parser = new CompositeParser();
(or)
an object of any of the individual parsers provided in the Tika library
2. Create a handler class object.
BodyContentHandler handler = new BodyContentHandler( );
3. Create the Metadata object as shown below:
Metadata metadata = new Metadata();
4. Create any of the input stream objects, and pass your file that should be extracted to it.
File file = new File(filepath);
FileInputStream inputstream = new FileInputStream(file);
(or)
InputStream inputstream = TikaInputStream.get(new File(filename));
5. Create a parse context object as shown below:
ParseContext context =new ParseContext();
6. Invoke the parse method on the parser object, passing all the required objects.
parser.parse(inputstream, handler, metadata, context);
Metadata extraction using the Parser interface
Besides content, Tika also extracts metadata from a file. Metadata is the additional information supplied with a file.
For an audio file, for example, the artist name, album name, and title are metadata.
Whenever we parse a file using parse(), we pass an empty Metadata object as one of the parameters.
The method extracts the metadata of the given file (if it contains any) and places it in the Metadata object.
Therefore, after parsing the file using parse(), we can read the metadata from that object.
Adding new and setting values in existing metadata
Adding New
We can add new metadata values using the add() method of the Metadata class:
metadata.add("author", "Tutorials point");
The Metadata class has some predefined properties.
metadata.add(Metadata.SOFTWARE,"ms paint");
Edit Existing
You can set values to the existing metadata elements using the set() method.
metadata.set(Metadata.DATE, new Date());
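The difference between add() and set() is easiest to see on a multi-valued map. The class below is a hypothetical stand-in (SimpleMetadata is not Tika's Metadata class) that assumes add() appends a value while set() replaces whatever was stored, which is what the method lists above suggest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical multi-valued metadata map; a stand-in, not Tika's Metadata class. */
final class SimpleMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Appends a value, keeping any values already stored under the name. */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Replaces whatever was stored under the name with a single value. */
    void set(String name, String value) {
        values.put(name, new ArrayList<>(Arrays.asList(value)));
    }

    /** Returns the first value for the name, or null if there is none. */
    String get(String name) {
        List<String> list = values.get(name);
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    /** Returns all values for the name, or an empty array if there are none. */
    String[] getValues(String name) {
        List<String> list = values.get(name);
        return list == null ? new String[0] : list.toArray(new String[0]);
    }

    /** Returns the names that currently have at least one value. */
    String[] names() {
        return values.keySet().toArray(new String[0]);
    }
}
```

Calling add("author", ...) twice leaves two values under the same name, while a later set("author", ...) leaves exactly one.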
Language detection in Tika
Of the 184 languages standardized by ISO 639-1, Tika can detect 18.
Language detection in Tika is done using the getLanguage() method of the LanguageIdentifier class.
This method returns the code name of the language in String format.
Given below is the list of the 18 language-code pairs detected by Tika:
While instantiating the LanguageIdentifier class, pass the content whose language is to be detected as a String.
Eg:
LanguageIdentifier object = new LanguageIdentifier("this is english");
Language Detection of a Document
To detect the language of a given document, you have to parse it using the parse() method.
The parse() method parses the content and stores it in the handler object, which was passed to it as one of the arguments.
Eg:
parser.parse(inputstream, handler, metadata, context);
LanguageIdentifier object = new LanguageIdentifier(handler.toString());
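A common way to guess a language from a String is to compare its character n-grams against per-language profiles. The toy below illustrates that general idea only: TrigramGuesser, its two seed sentences, and the simple overlap count are all assumptions made for this example, and it is not a description of how LanguageIdentifier works internally.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy language guesser based on character trigram overlap. */
final class TrigramGuesser {
    private final Map<String, Set<String>> profiles = new LinkedHashMap<>();

    TrigramGuesser() {
        // Tiny seed texts, assumed here only for illustration; real profiles
        // would be built from far larger corpora.
        profiles.put("en", trigrams("the quick brown fox jumps over the lazy dog "
                + "and this is the english language"));
        profiles.put("de", trigrams("der schnelle braune fuchs springt "
                + "und das ist die deutsche sprache"));
    }

    /** Returns the set of distinct three-character windows in the text. */
    static Set<String> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    /** Returns the code whose profile shares the most trigrams with the text. */
    String guess(String text) {
        Set<String> grams = trigrams(text);
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : profiles.entrySet()) {
            int hits = 0;
            for (String gram : grams) {
                if (entry.getValue().contains(gram)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

With profiles this small the guess is only reliable for text that resembles the seed sentences; short or mixed-language input is where any such detector struggles.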
Questions?
Thanks!
