Applying OCR to Extract Information: Text-Mining
Data processing steps:
Step 1: Get access to the scanned PDF documents
Step 2: Use the Apache Tika library to extract textual data
Step 3: Extract information from the text into structured tables
Step 1: Access scanned PDF documents
Connect to the Hadoop server to access the scanned PDF files (documents) from a Python environment.
Step 2: Text extraction
Use the parser from the Apache Tika library to extract text from each assessment order and store it in a table with two columns, "Assessment Order ID" and "Actual Text".
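A minimal sketch of Step 2, assuming the tika-python package (which needs a Java runtime) and assuming the file name stem serves as the "Assessment Order ID"; both are illustrative choices, not details from the slides. For scanned pages, Tika generally depends on an OCR engine such as Tesseract being installed, which is also assumed here. The extractor is passed in as a parameter so the table-building logic can be tried without Tika.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def tika_extract(path: str) -> str:
    """Extract plain text from one file with tika-python."""
    # Imported lazily so the rest of the sketch runs without Tika installed.
    from tika import parser

    parsed = parser.from_file(path)
    # "content" can be None when no text was recovered.
    return (parsed.get("content") or "").strip()


def build_text_table(
    paths: Iterable[str],
    extract: Callable[[str], str] = tika_extract,
) -> List[Dict[str, str]]:
    """Return one row per document with the two columns named in Step 2."""
    rows = []
    for p in paths:
        rows.append(
            {
                # Hypothetical ID scheme: the file name without its extension.
                "Assessment Order ID": Path(p).stem,
                "Actual Text": extract(str(p)),
            }
        )
    return rows
```

The resulting list of rows can be loaded into whatever table structure the pipeline uses, for example a data frame or a database table.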
[Figure: synopsis of the text extracted into a table]
Step 3: Information extraction from text
Extract the following information from the Actual Text of each document using regular expressions (pattern search):
1) Name
2) Financial Year
3) PAN
4) Legal Citation (including citations of SC, HC, and ITAT)
5) Legal Issues associated with each document
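The pattern search in Step 3 could look like the sketch below for two of the five fields. It relies on the standard PAN layout (five uppercase letters, four digits, one uppercase letter) and assumes financial years are written like "2015-16" or "2015-2016". The actual patterns used in the project are not shown in the slides, so these are illustrative. Name, legal citations, and legal issues need document-specific patterns and are not attempted here.

```python
import re

# PAN: five uppercase letters, four digits, one uppercase letter.
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")

# Assumed financial-year notation: "2015-16" or "2015-2016".
FY_RE = re.compile(r"\b((?:19|20)\d{2})\s*-\s*(\d{2}|\d{4})\b")


def extract_fields(text: str) -> dict:
    """Return the distinct PANs and financial years found in one document."""
    pans = sorted(set(PAN_RE.findall(text)))
    # Normalise both notations to the short form, e.g. "2016-17".
    years = sorted({f"{start}-{end[-2:]}" for start, end in FY_RE.findall(text)})
    return {"PAN": pans, "Financial Year": years}
```

Applied to each row's "Actual Text", the returned dictionary supplies additional columns for the structured table.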
Analytics Cycle
1. Define
2. Design
3. Deploy
4. Analyze
5. Act
Define:
Identify specific requirements within use cases, highlight risk factors, and estimate value opportunities.
Design:
Design a tracking strategy that captures the appropriate data, with KPIs that match the business requirement.
Deploy:
Implement the technologies required to capture the data, along with the measurement strategy design.
Analyze:
Run insight-driven analyses to expose challenges and identify opportunities.
Act:
Use the analysis to describe the challenges, prescribe solutions, and uncover hidden opportunities.
Text Analytics
Text Analytics (TA) is a process that takes unseen texts as input and produces fixed-format, unambiguous data as output. This data may be displayed directly to users, stored in a database or spreadsheet for later analysis, or used for indexing in Information Retrieval (IR) applications.
Documents
• Text Mining
• Topic Modeling
• Text Classification
• Named Entity Recognition
• Relation extraction
• Event detection
• Natural Language Toolkit (NLTK)
• Gensim
• Scikit-Learn
Multi-dimensional Text Mining Tools
Word Frequency Analysis:
• Most Frequent words
• Frequency Distribution
Results
Text Classification:
• Multi-label, domain-specific classified texts
Collocation Analysis
• Bigrams
• Trigrams
• N-grams
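The word frequency and collocation analyses listed above can be sketched with the Python standard library alone. NLTK, named earlier on this slide, offers comparable helpers; this version avoids the dependency. The tokenizer is a deliberately simple assumption (lowercase alphabetic runs) and does not handle stop words or punctuation-sensitive terms.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Split text into lowercase alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens: list, n: int) -> list:
    """Return consecutive n-token tuples: n=2 gives bigrams, n=3 trigrams."""
    return list(zip(*(tokens[i:] for i in range(n))))


def most_frequent(items: list, k: int) -> list:
    """Return the k most common items with their counts."""
    return Counter(items).most_common(k)
```

For example, `most_frequent(tokenize(text), 10)` gives a word frequency list, and `most_frequent(ngrams(tokenize(text), 2), 10)` gives the most common bigrams.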
Keyword Analysis
• Keyword Counts
• Most prominent Categories
Topic Modeling
• Discovering Topics and Categories
Performance Measures
• Accuracy, Precision, Recall, F-Measures
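As a worked illustration of these measures, the sketch below computes accuracy, precision, recall, and the F1 score (the balanced F-measure) for a single class from true and predicted labels. The function name and the binary, single-label setting are assumptions made for brevity. A multi-label classifier would need these computed per label and then averaged.

```python
def classification_metrics(y_true: list, y_pred: list, positive=1) -> dict:
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard each ratio against an empty denominator.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```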
Comprehensive tool set for
• Data editing and visualization
• Rapid application development
• Manual annotation
• Ontology management
User Interface
Text Analytics - Process Flow
