International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 03 | Mar 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 1959
PDF Extraction Using Data Mining Techniques
Madhuri Badhe1, Vrushali Thakur2, Pooja Patil3, Rukhsar khan4, Prof. N. L. Bhale5
1,2,3,4Student, Dept. of Information Technology, Matoshri College of Engineering and Research Centre,
Maharashtra, India
5Head of Department, Dept. of Information Technology, Matoshri College of Engineering and Research Centre,
Maharashtra, India
-----------------------------------------------------------------------***------------------------------------------------------------------------
Abstract - With a vast amount of information now available
on the internet, efficient mechanisms for extracting
information quickly are increasingly important. Manually
summarizing a large text document is very difficult for
human readers, and the volume of text available online
creates two problems: finding the relevant documents among
the many available, and absorbing the relevant information
from them. Automatic text summarization addresses both
problems. Text summarization is the process of identifying
the most important, meaningful information in a document
or set of related documents and compressing it into a
shorter version that preserves the overall meaning.
Keywords: Documentations, Text, Summarization, Quickly,
easy
1. INTRODUCTION
A summary can be defined as a brief and accurate
representation of the important concepts of the given source
documents. When humans summarize a text, they understand
the concepts of the source document and create a summary
that conveys its essence, whereas for automated systems
this is a complex task. As the quantity of information
available in electronic format continues to grow, research
into automatic text summarization has gained considerable
importance. There are two types of summary: extractive
and abstractive. An abstractive summary relies on Natural
Language Processing (NLP), whereas an extractive summary
is built by copying exact sentences from the source
document. At present, computers cannot understand every
aspect of natural language, so our scope is limited to
extractive summarization.
1.1 Aim
Our aim is to identify the most important points of a text
and express them in a shorter document. The summarization
process is:
1. Interpret the text.
2. Extract the relevant information (topics of the source).
3. Condense the extracted information and create a summary
representation.
4. Present the summary representation to the reader in natural
language.
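The four steps can be illustrated with a minimal extractive sketch in
Python. It is not the system's actual implementation: the stop-word
list, the function names and the choice of average term frequency as
the sentence score are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
              "that", "for", "on", "are", "be"}


def split_sentences(text):
    """Step 1 (interpret): split raw text into sentences after ., ! or ?."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lower-case a sentence and keep only tokens that are not stop words."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def summarize(text, max_sentences=2):
    sentences = split_sentences(text)
    # Step 2 (extract): term frequencies stand in for the topics of the source.
    freq = Counter(w for s in sentences for w in tokenize(s))

    # Step 3 (condense): score a sentence by the average frequency of its terms.
    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    # Step 4 (present): emit the chosen sentences in their original order.
    return " ".join(sentences[i] for i in chosen)
```

Calling `summarize(text, max_sentences=2)` returns at most two of the
original sentences, copied verbatim, which is what makes the summary
extractive rather than abstractive.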
1.2 Motivation of the Project
In daily life, many people, students in particular, are
reluctant to read books and long documents because doing so
is time-consuming and tedious. This motivates a system that
condenses the text of a file, whether a PDF or another
format, and outputs only a summary, so that anyone can
easily understand the content of the document.
1.3 Objectives
Automatic summarization reduces a text file to a
passage or paragraph that conveys the main meaning of
the text. Searching a large text file for important
information is a very difficult job for users, so the
objective is to extract the important information, or a
summary, from the text file automatically.
Such a summary saves users the time of reading the whole
text file and provides quick access to information from a
large document. Today, extracting information from the
World Wide Web is very easy, and the extracted information
forms a huge text repository.
With the rapid growth of the World Wide Web,
information overload is becoming a problem for an
increasingly large number of people. Automatic
summarization can be an indispensable solution for reducing
the information overload problem on the web.
2. LITERATURE SURVEY
Paper 1: CyberPDF: Smart and Secure Coordinate-based
Automated Health PDF Data Batch Extraction
Data extraction from files is a prevalent activity in today's
electronic health record systems, and it can be laborious.
Paper 2: PDF Scrutinizer: Detecting JavaScript-based
attacks in PDF documents
PDF documents have long been part of the everyday life of
the average computer user, corporate businesses and
critical structures.
Paper 3: Data Mining Based Strategy for Detecting
Malicious PDF Files
Portable Document Format (PDF) is one of the most widely
accepted document formats. The file can be viewed on any
information processing system with a PDF viewer in the
year 2014.
Paper 4: A new method of information extraction from
PDF files
With the rapid increase in the number of PDF files on the
Internet, managing and searching PDF files efficiently and
quickly has become an urgent problem.
3. ARCHITECTURE
3.1 Problem Statement / Definition
This project describes a system for the summarization of
single and multiple documents. The system produces both
single-document and multi-document summaries, using data
mining techniques to identify common terms across the set of
documents. For each term, the system identifies
representative passages that are included in the final
summary. Results of our evaluation are also presented.
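As a rough illustration of this idea, the Python sketch below finds
terms shared by several documents and then picks one passage per term.
The paper does not specify how terms or passages are selected, so the
document-frequency threshold, the stop-word list and the rule "the
sentence that mentions the term most often" are assumptions of this
sketch, as are all function names.

```python
import re
from collections import Counter

# Hypothetical stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "and", "for", "are", "with", "that", "this"}


def terms(text):
    """Return the set of candidate terms in a text (short words and stop words dropped)."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOP_WORDS}


def common_terms(documents, min_docs=2):
    """Return, in sorted order, the terms that occur in at least `min_docs` documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(terms(doc))  # a set, so each document counts a term once
    return sorted(t for t, n in doc_freq.items() if n >= min_docs)


def representative_passage(term, documents):
    """Pick the first sentence, across all documents, with the most mentions of `term`."""
    best, best_count = None, 0
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            count = re.findall(r"[a-z]+", sentence.lower()).count(term)
            if count > best_count:
                best, best_count = sentence, count
    return best
```

A multi-document summary could then be assembled by calling
`representative_passage` once for each entry of `common_terms(documents)`
and removing duplicate sentences.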
3.2 Proposed Architecture
We introduce a system that allows a user to extract
meaningful information from a given PDF using a text
characteristic algorithm. The user uploads the file to the
system, which processes it and returns the output to the
user.
Figure 3.1: Proposed Architecture Diagram
CONCLUSION
The outcome of this project is a web application that
executes on the client side and produces a summary of the
input document according to the user's requirements. An
effective diversity-based method is combined with the
K-means clustering algorithm to generate the summary of the
document. The clustering algorithm supports the method in
finding the most distinct ideas in the text. The results
suggest that employing multiple factors helps to find the
diversity in the text: isolating all similar sentences in
one group solves part of the redundancy problem among the
document sentences, and the other part of that problem is
solved by the diversity-based method.
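The clustering step can be sketched in plain Python as follows. The
paper does not give the sentence features, the initialisation or the
details of the diversity-based method, so this sketch assumes
bag-of-words vectors, centroids initialised at evenly spaced sentences,
and a simple stand-in for diversity: one sentence per cluster, the one
closest to the cluster centroid. It also assumes k is no larger than
the number of sentences.

```python
import re


def vectorize(sentences):
    """Turn sentences into bag-of-words count vectors over a shared vocabulary."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    return [[float(toks.count(w)) for w in vocab] for toks in tokenized]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Lloyd's algorithm; centroids start at evenly spaced inputs (an assumption)."""
    step = len(vectors) / k
    centroids = [list(vectors[int(i * step)]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def cluster_summary(sentences, k=2):
    """Keep one sentence per cluster so that similar sentences are not repeated."""
    vectors = vectorize(sentences)
    labels, centroids = kmeans(vectors, k)
    picks = []
    for c in range(k):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            picks.append(min(members, key=lambda i: sq_dist(vectors[i], centroids[c])))
    return [sentences[i] for i in sorted(picks)]
```

Because near-duplicate sentences fall into the same cluster and only
one of them is kept, the output avoids the redundancy that a purely
score-based selection could produce.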
