Apache Tika

What Is Apache Tika?
Apache Tika is a library for document type detection and content extraction from various file formats.
Internally, Tika uses existing document parsers and document type detection techniques to detect and
extract data.
Tika provides a single generic API for parsing different file formats. It uses existing specialized parser libraries
for each document type, such as Apache PDFBox and Apache POI.
All these parser libraries are encapsulated under a single interface called the Parser interface.

Why Tika?
According to filext.com, there are about 15k to 51k content types, and this number is growing day by day.
Data is being stored in various formats such as text documents, Excel spreadsheets, PDFs, images, and multimedia
files.
Therefore, applications such as search engines and content management systems need additional support for easy
extraction of data from these document types.
Apache Tika serves this purpose by providing a generic API to locate and extract data from multiple file formats.

Apache Tika Applications
Search Engines
Tika is widely used while developing search engines to index the text contents of digital documents.
Digital Asset Management
Some organizations manage their digital assets such as photographs, e-books, drawings, music and video
using a special application known as digital asset management (DAM).
Such applications rely on document type detectors and metadata extractors to classify the various
documents.
Content Analysis
Websites like Amazon recommend newly released content to individual users according to
their interests.
To do so, these websites take the help of social media websites like Facebook to extract required information
such as the likes and interests of the users. This gathered information will be in the form of HTML tags or other
formats that require further content type detection and extraction.

Features Of Tika
Unified parser interface : Tika encapsulates all the third-party parser libraries within a single parser interface.
Due to this feature, the user is freed from the burden of selecting a suitable parser library and using it
according to the file type encountered.
Fast processing : Quick content detection and extraction from applications can be expected.
Parser integration : Tika can use the various parser libraries available for each document type in a single application.
MIME type detection : Tika can detect and extract content from all the media types included in the MIME
standards.
Language detection : Tika includes a language identification feature, so it can be used to classify documents by
language on multilingual websites.

Functionalities Of Tika
Tika supports various functionalities:
Document type detection
Content extraction
Metadata extraction
Language detection

Tika Facade Class
Users can embed Tika in their applications using the Tika facade class. It has methods that expose all the
functionalities of Tika.
Since it is a facade class, Tika hides the complexity behind its functions: it abstracts all the internal
implementations and provides simple methods to access the Tika functionalities.

Some Methods Of Tika Facade Class
String parseToString (File file) :- This method and all its variants parse the file passed as a parameter and return the
extracted text content as a String. By default, the length of the returned string is limited.
int getMaxStringLength () :- Returns the maximum length of strings returned by the parseToString methods.
void setMaxStringLength (int maxStringLength) :- Sets the maximum length of strings returned by the parseToString methods.
Reader parse (File file) :- This method and all its variants parse the file passed as a parameter and return the extracted text content in
the form of a java.io.Reader object.
String detect (InputStream stream, Metadata metadata) :- This method and all its variants accept an InputStream object and a
Metadata object as parameters, detect the type of the given document, and return the document type name as a String object. This
method abstracts the detection mechanisms used by Tika.
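The length limit described above can be sketched as follows. This is a minimal illustration, not a definitive usage pattern: it assumes tika-core and the standard Tika parsers are on the classpath, and the class name FacadeLimitSketch and its helper method are hypothetical names introduced only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class, used only for illustration.
class FacadeLimitSketch {

    // Extracts text from an in-memory document, capped at maxLength characters.
    static String extract(String content, int maxLength) throws IOException, TikaException {
        Tika tika = new Tika();
        tika.setMaxStringLength(maxLength); // limit applied by parseToString
        try (InputStream stream =
                new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8))) {
            return tika.parseToString(stream);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Default limit: " + new Tika().getMaxStringLength());
        String text = "Apache Tika extracts text from many file formats.";
        System.out.println("Capped: " + extract(text, 11));
        System.out.println("Full:   " + extract(text, 1000).trim());
    }
}
```

When the limit is reached, parseToString is expected to return the text collected so far rather than fail, so callers that need the whole document should raise the limit first.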

Parser interface
This is the interface that is implemented by all the parser classes of the Tika package.
The following is the important method of the Tika Parser interface:
❖ parse (InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
This method parses the given document into a sequence of XHTML SAX events. After parsing, the extracted
document content is available through the ContentHandler object and the metadata through the Metadata object.

Metadata class
Following are some of the methods of this class:
1. void add (Property property, String value)
2. void add (String name, String value)
3. String get (Property property)
4. String get (String name)
5. Date getDate (Property property)
6. String[] getValues (Property property)
7. String[] getValues (String name)
8. String[] names ()
9. void set (Property property, Date date)
10. void set (Property property, String[] values)

LanguageIdentifier class
This class identifies the language of the given content.
LanguageIdentifier (String content) : This constructor instantiates a language identifier from a String of text content.
String getLanguage () : Returns the language detected for the current LanguageIdentifier object.

Type detection in Tika
Tika supports all the Internet media document types provided in MIME. Whenever a file is passed through Tika, it detects the
file's document type.
Type detection using Facade class
The detect() method of facade class is used to detect the document type. This method accepts a file as input.
Eg:
File file = new File("example.mp3");
Tika tika = new Tika();
String filetype = tika.detect(file);
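As a slightly fuller sketch of the same idea, the example below detects a type from a file name alone and from a real file on disk. It assumes tika-core is on the classpath; the class name DetectSketch and the temporary file it creates are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.Tika;

// Hypothetical helper class, used only for illustration.
class DetectSketch {

    // Detection from the file name alone; the file does not need to exist.
    static String detectByName(String fileName) {
        return new Tika().detect(fileName);
    }

    // Detection from a real file: Tika can look at the content as well as the name.
    static String detectTempTextFile() throws IOException {
        Path path = Files.createTempFile("tika-sketch", ".txt");
        try {
            Files.write(path, "plain text sample".getBytes(StandardCharsets.UTF_8));
            return new Tika().detect(path.toFile());
        } finally {
            Files.deleteIfExists(path);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(detectByName("example.mp3"));
        System.out.println(detectTempTextFile());
    }
}
```

Note that detect(File) reads the file, so unlike the name-based variant it fails if the file is missing.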

Content extraction using Tika
For parsing documents, the parseToString() method of the Tika facade class is generally used.
The individual steps involved in the parsing process are abstracted by the Tika parseToString()
method.
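A minimal sketch of parseToString() on a file might look like the following. It assumes tika-core and the standard Tika parsers are on the classpath; ParseToStringSketch and the temporary file are hypothetical names used only to keep the example self-contained.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class, used only for illustration.
class ParseToStringSketch {

    // Writes the content to a temporary .txt file and lets the facade extract it.
    static String extractFromTempFile(String content) throws IOException, TikaException {
        Path path = Files.createTempFile("tika-sketch", ".txt");
        try {
            Files.write(path, content.getBytes(StandardCharsets.UTF_8));
            // Type detection, parser selection and parsing all happen inside this call.
            return new Tika().parseToString(path.toFile());
        } finally {
            Files.deleteIfExists(path);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractFromTempFile("Hello from a text file."));
    }
}
```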

Content extraction using Parser interface
The parser package of Tika provides several interfaces and classes with which we can parse a text document.

parse() method
Along with parseToString(), you can also use the parse() method of the Parser interface. The prototype of this
method is shown below.
parse (InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
Given below is an example that shows how the parse() method is used.
1. Instantiate any of the classes providing the implementation for this interface.
Parser parser = new AutoDetectParser();
(or)
Parser parser = new CompositeParser();
(or)
an object of any individual parser provided in the Tika library
2. Create a handler class object.
BodyContentHandler handler = new BodyContentHandler( );

3. Create the Metadata object as shown below:
Metadata metadata = new Metadata();
4. Create an input stream object for the file whose content should be extracted.
File file = new File(filepath);
FileInputStream inputstream = new FileInputStream(file);
(or)
InputStream inputstream = TikaInputStream.get(new File(filepath));
5. Create a parse context object as shown below:
ParseContext context = new ParseContext();
6. Invoke the parse method on the parser object, passing all the required objects.
parser.parse(inputstream, handler, metadata, context);
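The six steps above can be put together as in the sketch below. It is one possible arrangement rather than the only one: it assumes tika-core and the standard Tika parsers are on the classpath, reads from an in-memory byte array instead of a file so that it is self-contained, and uses the hypothetical class name ParserSketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, used only for illustration.
class ParserSketch {

    // Parses the document bytes and returns the body text; metadata is filled in place.
    static String extractBody(byte[] document, Metadata metadata)
            throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();                  // step 1
        BodyContentHandler handler = new BodyContentHandler();   // step 2
        ParseContext context = new ParseContext();               // step 5
        try (InputStream stream = new ByteArrayInputStream(document)) { // step 4
            parser.parse(stream, handler, metadata, context);    // step 6
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();                      // step 3
        byte[] document = "Hello from Tika".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractBody(document, metadata).trim());
        System.out.println(metadata.get("Content-Type"));
    }
}
```

AutoDetectParser is used here because it selects a concrete parser from the detected type; an individual parser could be substituted when the format is known in advance.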

Metadata extraction using Parser interface
Besides content, Tika also extracts the metadata from a file. Metadata is the additional information supplied with a file.
For an audio file, for example, the artist name, album name, and title are all metadata.
Whenever we parse a file using parse(), we pass an empty Metadata object as one of the parameters.
This method extracts the metadata of the given file (if the file contains any) and places it in the Metadata object.
Therefore, after parsing the file using parse(), we can read the metadata from that object.
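Reading the metadata back after a parse could look like the sketch below. It assumes tika-core and the standard Tika parsers are on the classpath; MetadataSketch and the small HTML sample are hypothetical, and the exact set of metadata names returned depends on the parser and the Tika version.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, used only for illustration.
class MetadataSketch {

    // Parses the document and returns every metadata name with its joined values.
    static Map<String, String> extractMetadata(byte[] document)
            throws IOException, SAXException, TikaException {
        Metadata metadata = new Metadata(); // starts empty, filled by parse()
        try (InputStream stream = new ByteArrayInputStream(document)) {
            new AutoDetectParser().parse(
                    stream, new BodyContentHandler(), metadata, new ParseContext());
        }
        Map<String, String> result = new TreeMap<>();
        for (String name : metadata.names()) {
            result.put(name, String.join(", ", metadata.getValues(name)));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><p>Hello</p></body></html>";
        extractMetadata(html.getBytes(StandardCharsets.UTF_8))
                .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```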

Adding new and setting values in existing metadata
Adding New
We can add new metadata values using the add() method of the Metadata class.
metadata.add("author", "Tutorials point");
The Metadata class has some predefined properties.
metadata.add(Metadata.SOFTWARE,"ms paint");
Edit Existing
You can set values for existing metadata elements using the set() method.
metadata.set(Metadata.DATE, new Date());
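The difference between add() and set() can be seen in a small sketch that uses only the String-based overloads. It assumes tika-core is on the classpath; the class name MetadataEditSketch and the "author" and "title" keys are illustrative choices, not predefined Tika properties.

```java
import java.util.Arrays;

import org.apache.tika.metadata.Metadata;

// Hypothetical helper class, used only for illustration.
class MetadataEditSketch {

    static Metadata build() {
        Metadata metadata = new Metadata();
        // add() appends, so a name can end up with several values.
        metadata.add("author", "First Author");
        metadata.add("author", "Second Author");
        // set() replaces whatever value was stored under the name.
        metadata.set("title", "Draft");
        metadata.set("title", "Final");
        return metadata;
    }

    public static void main(String[] args) {
        Metadata metadata = build();
        System.out.println(Arrays.toString(metadata.getValues("author")));
        System.out.println(metadata.get("title"));
    }
}
```

For a multi-valued name, get() returns only the first value, so getValues() is the safer call when add() has been used.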

Language detection in Tika
Of the 184 languages standardized by ISO 639-1, Tika can detect 18.
Language detection in Tika is done using the getLanguage() method of the LanguageIdentifier class.
This method returns the code of the detected language as a String.
Given below is the list of the 18 language-code pairs detected by Tika:

Language detection in Tika
While instantiating the LanguageIdentifier class, you should pass the content to be analyzed as a String.
Eg:
LanguageIdentifier object = new LanguageIdentifier("this is english");
Language Detection of a Document
To detect the language of a given document, you have to parse it using the parse() method.
The parse() method parses the content and stores it in the handler object, which was passed to it as one of the arguments.
Eg:
parser.parse(inputstream, handler, metadata, context);
LanguageIdentifier object = new LanguageIdentifier(handler.toString());
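A self-contained sketch of the LanguageIdentifier calls is given below. It assumes a Tika 1.x release on the classpath that still ships org.apache.tika.language.LanguageIdentifier (the class was deprecated in later 1.x releases in favour of the LanguageDetector API); the class name LanguageSketch and the sample sentence are hypothetical. In a real application the text would typically come from handler.toString() after a parse, as shown above.

```java
import org.apache.tika.language.LanguageIdentifier;

// Hypothetical helper class, used only for illustration.
class LanguageSketch {

    // Returns the language code that Tika considers the closest match for the text.
    static String detectLanguage(String text) {
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        return identifier.getLanguage();
    }

    public static void main(String[] args) {
        String text = "Apache Tika detects and extracts metadata and text "
                + "from many different kinds of documents.";
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        System.out.println("Language: " + identifier.getLanguage());
        // Very short inputs often do not give a reliable result.
        System.out.println("Reasonably certain: " + identifier.isReasonablyCertain());
    }
}
```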
