International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 02 | Feb -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 1152
Extract and Analyze Data from PDF File and Web : A Review
Darshana Jadhav1, Dhanashree Jadhav2, Pooja More3, Harshali Nikam4
1 Darshana Jadhav, Dept. of Computer Engineering, MET, Nashik
2 Dhanashree Jadhav, Dept. of Computer Engineering, MET, Nashik
3 Pooja More, Dept. of Computer Engineering, MET, Nashik
4 Harshali Nikam, Dept. of Computer Engineering, MET, Nashik
Assistant Professor: Mr. Tusharsaheb Patil
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - A survey of current practice shows that universities
(e.g. Pune University) declare engineering results as a result
gazette in PDF format. The PDF contains details such as seat
number, centre, permanent registration number (PRN), name,
subjects and marks. At present the PDF file is converted to an
Excel file so that the various reports required by the
department, college and university at different levels can be
prepared. This involves a largely manual process. The existing
approaches have several limitations: the process is only
semi-automated, there is no GUI, SMS and e-mail gateways are
not supported, and, most importantly, graphical analysis of the
data is not available. In our survey, the existing applications
we came across were either semi-automated or automated with
restrictions that prevent full automation of result analysis in
a proper format; none of them supported full automation. To
overcome these drawbacks, we propose a new automated system for
result analysis with features such as automatic output
generation in different formats (Excel, PDF or MySQL, as
selected by the user) for compatibility with other ERP systems,
an active SMS gateway, an active e-mail gateway, an interactive
and user-friendly GUI, and graphical result analysis with text.
The proposed system targets these limitations to provide an
effective solution for result analysis. It will also work with
the current grade system and will maintain a database of
students that shows the complete status of each student. The
automated solutions provided by the system will make the
activities of the exam department more efficient by addressing
the main drawbacks of the manual system, namely speed,
precision and simplicity. It is also intended to work as a
generalized system that supports any type and format of PDF
file. A centralized system will ensure that the activities in
the context of an examination can be managed effectively, while
also making them more accessible and convenient for both staff
and students.
Key Words: Information Extraction, Pattern Matching,
Data Mining, Web Mining.
1. INTRODUCTION
Result evaluation and analysis require a great deal of manual
work, so a system that supports automation is needed to reduce
this effort. Our system works on university results. At
present, in most engineering colleges the traditional method is
to enter the data for each student manually into an Excel sheet
from the PDF file provided by the university. Many formulas are
then used to categorize students as toppers, pass, fail,
droppers, etc. This is an entirely manual process in which the
chance of mistakes is high. Similarly, results for diploma
colleges are declared online, so the data is taken from the
web, entered into an Excel sheet manually, and then evaluated
and analyzed according to the requirements of the result
reports. This process is very time consuming. To ease the work
of the people doing this analysis, we propose a system that
automates the process of result evaluation and analysis. The
system takes the PDF file provided by the university as input
and saves its contents into a database; once the data is
stored, the required information can be retrieved using
queries.
2. LITERATURE SURVEY
In the existing system, the data is sorted and analyzed by
manual processes. The user has to copy and paste the contents
of the PDF file into Excel sheets and sort them manually to
rank students. The proposed system will automate these
processes. Several researchers have worked on extracting
required data from unstructured sources such as PDF. This
section describes the tools most closely related to the
proposed system. In reference [1] the authors use the PDFBox
technique to extract references from PDF documents: the PDF
data is converted into text and the required information is
then obtained from that text. In reference [2] the authors use
the LA-PDFText technique, a command-line utility that extracts
text from a PDF given only the path of the PDF file. In
reference [3] the authors present a technique for extracting
data from structured web pages. In reference [4] the authors
use a technique called tag injection, which inserts format
information into a text
document in the form of tags. This helps to transform the text
into semi-structured data; the reference discusses the details
of the data extraction.
3. PROPOSED SYSTEM
The following figure shows a detailed view of the proposed
system:
Fig: Detailed View
3.1 PDF Box :
A PDF file is the input to the system, so the system first has
to extract data from PDF files. Here the PDF file is the result
gazette provided by the university, so it does not contain any
diagrams or images. To extract data from PDF files, we use the
PDFBox technique. PDFBox is a PDF processing library that
supports the creation and conversion of PDF documents and also
provides a command-line utility for performing various
operations. PDFBox can extract the contents of PDF documents
quickly and accurately. For the implementation we include the
iTextSharp package, which is the .NET port of iText and a
separate library from PDFBox. iText provides an API for
platforms such as .NET, Android, JAE and Java that lets
developers enhance their applications with PDF functionality.
It provides functionality such as PDF generation, PDF
manipulation and PDF form filling. After the package is
included, PdfReader is used to read the PDF file and
PdfTextExtractor is used to extract the text of the document.
3.2 Sorting of data :
The text extracted from the PDF files is stored in a text file.
The proposed system categorizes the data by department. This
separation is done with string manipulation operations.
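
As an illustration of this step, the following is a minimal Python sketch of splitting extracted text into departments with plain string operations. The "BRANCH:" heading marker, the function name split_by_department and the sample records are assumptions made for the example; the actual layout of the result PDF is not specified here.

```python
from collections import defaultdict

# Hypothetical marker: we assume each department block in the extracted
# text starts with a line such as "BRANCH: COMPUTER ENGINEERING".
DEPARTMENT_PREFIX = "BRANCH:"


def split_by_department(extracted_text):
    """Group the non-empty lines of the text under their department heading."""
    sections = defaultdict(list)
    current = None
    for raw_line in extracted_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.upper().startswith(DEPARTMENT_PREFIX):
            # A new department block begins; remember its name.
            current = line[len(DEPARTMENT_PREFIX):].strip().title()
            continue
        if current is not None:
            sections[current].append(line)
    return dict(sections)


# Made-up sample text standing in for the extracted PDF contents.
sample = """
BRANCH: COMPUTER ENGINEERING
S1001 ASHA PATIL
S1002 RAVI KULKARNI
BRANCH: MECHANICAL ENGINEERING
S2001 NEHA JOSHI
"""
departments = split_by_department(sample)
for name, lines in departments.items():
    print(name, len(lines))
```

Lines that appear before the first department heading are ignored in this sketch; a real result file may need them (for example, for the exam name).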
3.3 Remove Noisy/Redundant Data :
After the essential data has been obtained from the extracted
PDF text, the data that is not required is filtered out. For
this purpose we use a parsing technique that processes the text
line by line. The PDF also contains a lot of redundant data;
e.g. the input PDF file repeats the same subject list for every
student of a particular department. Such redundant data is
removed, and only a single copy is stored in the database.
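
The line-by-line filtering described above might look like the following Python sketch. The noise patterns, the six-digit subject-code layout and the name clean_lines are hypothetical choices for the example, not the format of the actual result file.

```python
import re

# Hypothetical noise: page footers and separator rules.
NOISE_PATTERNS = [
    re.compile(r"^page\s+\d+", re.IGNORECASE),
    re.compile(r"^[-=*_\s]+$"),
]
# Hypothetical subject line: a six-digit subject code followed by its name.
SUBJECT_LINE = re.compile(r"^(\d{6})\s+(.+)$")


def clean_lines(lines):
    """Drop noise lines and keep a single copy of each repeated subject."""
    subjects = {}   # subject code -> subject name, first occurrence wins
    records = []    # remaining lines, e.g. student records
    for raw_line in lines:
        line = " ".join(raw_line.split())  # normalize whitespace
        if not line or any(p.match(line) for p in NOISE_PATTERNS):
            continue
        match = SUBJECT_LINE.match(line)
        if match:
            subjects.setdefault(match.group(1), match.group(2))
            continue
        records.append(line)
    return subjects, records


# Made-up sample lines.
lines = [
    "-----------",
    "S1001 ASHA PATIL",
    "210241  DISCRETE MATHEMATICS",
    "210242 DATA STRUCTURES",
    "Page 3 of 40",
    "S1002 RAVI KULKARNI",
    "210241 DISCRETE MATHEMATICS",
    "210242 DATA STRUCTURES",
]
subjects, records = clean_lines(lines)
print(subjects)
print(records)
```

Only subject lines are deduplicated here, so two students who happen to have identical mark lines are not merged by mistake.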
3.4 WEB Extraction :
The web extractor recognizes the relevant data on the web page
and extracts two different types of data from it: the source
code, and the plain text displayed on the web page.
3.5. DOM Parser for Web Mining:
DOM is the Document Object Model, which is used to organize the
nodes extracted from web pages into a tree structure.
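
To make the idea concrete, here is a small Python sketch that parses a page into a DOM tree and walks it to recover the displayed text of each table row. It uses the standard-library xml.dom.minidom parser and therefore assumes a well-formed (XHTML-like) page; real result pages may need a more tolerant HTML parser. The page content and the helper names are invented for the example.

```python
from xml.dom import minidom

# Made-up, well-formed page source standing in for an online result page.
page_source = """<html><body>
<table id="result">
  <tr><th>Seat No</th><th>Name</th><th>Total</th></tr>
  <tr><td>D1001</td><td>Asha Patil</td><td>512</td></tr>
  <tr><td>D1002</td><td>Ravi Kulkarni</td><td>478</td></tr>
</table>
</body></html>"""


def node_text(node):
    """Concatenate the text of all text nodes below `node`."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(node_text(child) for child in node.childNodes)


def table_rows(source):
    """Build the DOM tree and return each table row as a list of cell texts."""
    document = minidom.parseString(source)
    rows = []
    for tr in document.getElementsByTagName("tr"):
        cells = [
            node_text(cell).strip()
            for cell in tr.childNodes
            if cell.nodeType == cell.ELEMENT_NODE and cell.tagName in ("th", "td")
        ]
        rows.append(cells)
    return rows


rows = table_rows(page_source)
for row in rows:
    print(row)
```

The source code (page_source) and the plain text (the cell strings) correspond to the two kinds of data the web extractor works with.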
3.6 Pattern Mining :
The system uses a pattern mining method to find the essential
data in the extracted document. The plain text extracted by the
web extractor is checked against the specified pattern, and the
data is mined accordingly.
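
One simple way to realize such pattern checking is a regular expression applied to each line of the plain text, as in the Python sketch below. The line layout "seat number, name, total/maximum" and the names RESULT_PATTERN and mine_results are assumptions for illustration only.

```python
import re

# Hypothetical line layout: "<seat no> <NAME> <total>/<maximum>".
RESULT_PATTERN = re.compile(
    r"^(?P<seat_no>[A-Z]\d{4})\s+"
    r"(?P<name>[A-Z][A-Z ]+?)\s+"
    r"(?P<total>\d+)/(?P<out_of>\d+)$"
)


def mine_results(plain_text):
    """Return one record per line that matches the result pattern."""
    results = []
    for line in plain_text.splitlines():
        match = RESULT_PATTERN.match(line.strip())
        if match is None:
            continue  # not a result line, skip it
        record = match.groupdict()
        record["total"] = int(record["total"])
        record["out_of"] = int(record["out_of"])
        results.append(record)
    return results


# Made-up plain text as it might be displayed on a result page.
plain_text = """
University Result, First Year
D1001 ASHA PATIL 512/750
D1002 RAVI KULKARNI 478/750
For queries contact the exam cell
"""
mined = mine_results(plain_text)
for record in mined:
    print(record)
```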
3.7 Read and Analyze required data :
After the noisy and redundant data has been eliminated, the
system holds only the data it actually needs. This data is then
accessed for each student, and the system analyzes each
student's data. The system first separates the departments,
then reads the subject list of each department, separates the
subjects into theory, practical, term-work, oral, online exam
and in-sem exam heads, and generates the final result of every
individual. The system also reads the personal information of
each student from the text extracted from the PDF.
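
The per-student analysis could be sketched in Python as follows. The HeadMark record, the final_result function and, in particular, the 40% pass rule per head are assumptions for the example; the actual passing rules are set by the university and are not given here.

```python
from dataclasses import dataclass

# Hypothetical pass rule: at least 40% of the maximum in every head.
PASS_FRACTION = 0.40


@dataclass
class HeadMark:
    subject: str
    head: str       # e.g. "theory", "practical", "term-work", "oral"
    obtained: int
    maximum: int


def final_result(marks):
    """Summarize one student's marks by head and decide pass or fail."""
    by_head = {}
    for m in marks:
        by_head[m.head] = by_head.get(m.head, 0) + m.obtained
    failed = [
        (m.subject, m.head)
        for m in marks
        if m.obtained < PASS_FRACTION * m.maximum
    ]
    return {
        "total": sum(m.obtained for m in marks),
        "maximum": sum(m.maximum for m in marks),
        "by_head": by_head,
        "failed_heads": failed,
        "status": "PASS" if not failed else "FAIL",
    }


# Made-up marks for one student.
marks = [
    HeadMark("Data Structures", "theory", 62, 100),
    HeadMark("Data Structures", "practical", 40, 50),
    HeadMark("Discrete Mathematics", "theory", 35, 100),
    HeadMark("Discrete Mathematics", "term-work", 20, 25),
]
result = final_result(marks)
print(result["status"], result["total"], "/", result["maximum"])
```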
3.8. Database design and storage of extracted data :
All useful data that has been gathered needs to be stored in
the database. The system therefore designs the database
dynamically by reading the contents of the PDF file. After the
database is designed, department-wise tables are generated, and
the analyzed data is stored in these tables.
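
A minimal sketch of creating department-wise tables at run time is shown below. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the MySQL database named in the abstract; the column set and the names table_name and store_results are assumptions for the example.

```python
import re
import sqlite3


def table_name(department):
    """Turn a department name into a safe SQL identifier."""
    return "dept_" + re.sub(r"[^a-z0-9]+", "_", department.lower()).strip("_")


def store_results(connection, results_by_department):
    """Create one table per department found in the data and fill it."""
    for department, rows in results_by_department.items():
        table = table_name(department)
        connection.execute(
            f'CREATE TABLE IF NOT EXISTS "{table}" ('
            "seat_no TEXT PRIMARY KEY, name TEXT NOT NULL, total INTEGER NOT NULL)"
        )
        connection.executemany(
            f'INSERT OR REPLACE INTO "{table}" (seat_no, name, total) '
            "VALUES (?, ?, ?)",
            rows,
        )
    connection.commit()


# Made-up analyzed data, keyed by department.
connection = sqlite3.connect(":memory:")
store_results(connection, {
    "Computer Engineering": [
        ("S1001", "Asha Patil", 512),
        ("S1002", "Ravi Kulkarni", 478),
    ],
    "Mechanical Engineering": [("S2001", "Neha Joshi", 530)],
})
tables = sorted(
    row[0]
    for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
)
print(tables)
```

Table names are derived from the data, so they are sanitized before being placed in the SQL text, while the row values are always passed as bound parameters.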
3.9. Reports generation:
Reports are generated from the data stored in the database.
The result reports will be generated according to the
requirements. Reports include college topper, department-wise
topper, subject-wise topper, ATKT students, dropper students,
etc. The system will generate result reports, which are sent by
e-mail to the respective departments and students.
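
As a rough illustration of such reports, the Python sketch below derives a college topper, department-wise toppers and an ATKT list from in-memory records. The record fields (including the backlogs count used to flag ATKT students) and the function names are hypothetical; in the proposed system the same information would come from database queries.

```python
def college_topper(records):
    """Return the record with the highest total across all departments."""
    return max(records, key=lambda r: r["total"])


def department_toppers(records):
    """Return the highest-scoring record of each department."""
    toppers = {}
    for record in records:
        best = toppers.get(record["department"])
        if best is None or record["total"] > best["total"]:
            toppers[record["department"]] = record
    return toppers


def atkt_students(records):
    """Return seat numbers of students with at least one backlog subject."""
    return [r["seat_no"] for r in records if r["backlogs"] > 0]


# Made-up records standing in for rows read from the database.
records = [
    {"seat_no": "S1001", "department": "Computer", "total": 512, "backlogs": 0},
    {"seat_no": "S1002", "department": "Computer", "total": 478, "backlogs": 1},
    {"seat_no": "S2001", "department": "Mechanical", "total": 530, "backlogs": 0},
    {"seat_no": "S2002", "department": "Mechanical", "total": 401, "backlogs": 2},
]
print(college_topper(records)["seat_no"])
print({d: r["seat_no"] for d, r in department_toppers(records).items()})
print(atkt_students(records))
```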
4. CONCLUSIONS
The system will sort all the data according to students' marks
and grades when requested by the user. For this we use data
mining techniques, PDF extraction, and data fetching and
sorting techniques, which let the user simplify the data and
prepare result reports along with graphical representations
(pie charts and graphs). It will become convenient for students
to receive results through SMS and e-mail gateways. In this way
the result data will be well organized, which makes the result
records easy to manage.
ACKNOWLEDGEMENT
We express our sincere gratitude to Prof. Mr. Tusharsaheb
Patil (Assistant Professor, MET BKC IOE) for his support and
guidance. We would also like to thank Prof. Mr. Pankaj Deore
(Asst. Professor, MET BKC IOE) for his valuable words of
advice. We are also extremely grateful to our respected
H.O.D. Dr. M. U. Kharat and Principal Dr. V. P. Wani for
providing all facilities and every help for the smooth progress
of the project work. We are thankful to our family members and
friends for motivating us.
REFERENCES
[1] A Strategy for Automatically Extracting References from
PDF Documents. Neide Ferreira Alves, Universidade do Estado do
Amazonas, Manaus, Brazil; Rafael Dueire Lins, Universidade
Federal de Pernambuco, Recife, Brazil; Maria Lencastre,
Universidade de Pernambuco, Recife.
[2] Automatic classification of scientific papers in PDF for
populating ontologies. Juan C. Redon-Miranda, Julia Y.
Arana-Llanes, Juan G. González-Serna and Nimrod González-Franco,
Department of Computer Science, National Center for Research
and Technological Development, CENIDET, Cuernavaca, México.
[3] HWPDE: Novel Approach for Data Extraction from Structured
Web Pages. Manpreet Singh Sehgal, Department of Information
Technology, Apeejay College of Engineering, Sohna, Gurgaon;
Anuradha, PhD, Department of Computer Engineering, YMCA
University of Sc. & Technology, Faridabad.
[4] A new method of information extraction from PDF files.
Fang Yuan, Bo Liu, College of Mathematics and Computer
Science, Hebei University, Baoding, 071002, P.R. China; College
of Information Science and Engineering, Northeastern
University, Shenyang, 110004, P.R. China.
BIOGRAPHIES
Darshana Jadhav: Pursuing her computer degree course in
MET's Institute of Engineering, Nashik. Her interests
include database systems.
Dhanashree Jadhav: Pursuing her computer degree course in
MET's Institute of Engineering, Nashik. Her interests
include database systems.
Pooja More: Pursuing her computer degree course in MET's
Institute of Engineering, Nashik. Her interests include
database systems, data mining and web mining.
Harshali Nikam: Pursuing her computer degree course in
MET's Institute of Engineering, Nashik. Her interests
include database systems and web mining.
