IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 
__________________________________________________________________________________________ 
Volume: 03 Special Issue: 07 | May-2014, Available @ http://www.ijret.org 513 
RETRIEVAL OF TEXTUAL AND NON-TEXTUAL INFORMATION IN CLOUD
Anbarasi M S1, Divya R2, Buvaneswari P3, Illakkiya M4
1Assistant Professor, Department of Information Technology, Pondicherry Engineering College, Puducherry, India
2Student, Department of Information Technology, Pondicherry Engineering College, Puducherry, India
3Student, Department of Information Technology, Pondicherry Engineering College, Puducherry, India
4Student, Department of Information Technology, Pondicherry Engineering College, Puducherry, India
Abstract
With the advent of the Internet, multimedia content in databases has grown exponentially, which makes effective access to and retrieval of both textual and non-textual resources a major problem. To address this, information is stored in a cloud environment, because the large and complex data gathered from the Internet cannot be stored and processed with traditional data processing applications. The proposed method parses web pages to extract textual data and images. Text is retrieved through keyword extraction, whereas images are retrieved through feature extraction. The K-means algorithm is used for clustering. Based on ranking, the textual and non-textual data are retrieved together.
Keywords: Retrieval, Feature Extraction, K-means Clustering, Cloud
----------------------------------------------------------------------***-------------------------------------------------------------------- 
1. INTRODUCTION
The World Wide Web is a very large distributed digital information space, and images are a major source of its content. Digital cameras and camera-equipped mobile telephones generate huge amounts of non-textual information such as images. Searching and retrieving information from the Web efficiently and effectively is an active topic in information retrieval. Existing systems retrieve too many documents, of which only a small fraction is relevant to the user query. The proposed work retrieves relevant textual and non-textual information together for a user query in the cloud, and is therefore named Retrieval of Textual and Non-Textual Information in Cloud (RTNIC). For example, if the user submits the query "animal", the retrieval process returns relevant non-textual information (i.e. images) of animals together with the textual description of each image. The main aim is to obtain the correspondences between an image and its associated text for easy understanding. The cloud [3] is used to improve data management and the storage of huge amounts of information, since the amount of information to be stored grows day by day. The cloud is also used because the information is accessed from centralized storage [3], and no user needs to be in a specific place to access it. In this way information is delivered and resources are retrieved by a web-based system rather than through a direct connection to a server. The main advantages of using the cloud are:
• It is highly elastic.
• Everything is provided as a service.
• Less power is consumed by hardware and software.
• High availability and scalability.
• No data loss.
Thus information retrieval in the cloud is a popular research area. The proposed system is shown in Figure 1. In recent years, web document analysis has been used to filter useful information from web documents effectively. Multimedia documents consist of different components (text, images, sounds, videos), but we concentrate on images and text. The main aim is to obtain the correspondences between an image and its associated text to improve retrieval accuracy. Research in web page processing focuses only on the textual analysis of the segments around the non-textual information.
2. RELATED WORK
Current research in image retrieval uses both textual and visual features to retrieve images. Ontology-based image retrieval is used to reduce the semantic gap; it focuses on the semantic content that relates to the user's intent. The work in [10] also concentrates on tag refinement. It overcomes the problem of visual words that are too sparse to match, and maps the textual cluster and the visual cluster to each other to retrieve the image [10].
The second approach extracts the text and images from the web page and stores them in different databases. The image analysis [8] consists of segmentation followed by Scale Invariant Feature Transform (SIFT) feature extraction. Segmenting the images reduces the number of SIFT points, which improves the matching between the database images and the query. A Support Vector Machine (SVM) classifier assigns the images to classes. The text and the images are then subjected to semantic inclusion.
3. ARCHITECTURE OF THE PROPOSED SYSTEM
Large and complex datasets of text and images cannot be processed using traditional database applications. The proposed work retrieves relevant textual and non-textual information in the cloud.
Fig 1: RTNIC High Level System Architecture
The proposed approach does not concentrate on image retrieval alone but also retrieves the relevant textual information. Existing retrieval processes retrieve either images alone or text alone. They use content-based retrieval methods, in which it is difficult to match the image contents with the existing images in the database [5]. Because the proposed system retrieves from the cloud, it concentrates on retrieval methods that are specific to the cloud environment.
3.1. Preprocessing Stage
The proposed system consists of two phases. The first is the preprocessing stage, which collects text and images from the web page and stores them in the database. The second is the Common Retrieval phase, which applies the ranking algorithm to retrieve relevant results from the database. The preprocessing stage includes three modules: the Parser module, the Image Processing module and the Text Processing module. The following sections describe the modules and their functions.
3.1.1. HTML Parsing Module
The HTML parsing module converts the HTML page into a DOM tree [1,6]. A DOM-tree-based web page segmentation algorithm segments the web page into sections, and the text and images contained in each section are extracted. Tags such as <TD>, <TR>, <TABLE> and <HR> are used to separate the different content passages [9].
3.1.2. Image Processing Module
This module performs the image feature extraction.
3.1.2.1 Feature Extraction Process
The color histogram is the feature extracted from the images. Color histograms are popular and simple to compute. The method extracts three histograms, one for each of the R, G and B channels, by counting the occurrences of each color value. The histograms are then normalized, because the images are collected from different sites.
3.1.2.2 Image Clustering
Image clustering is done using the k-means algorithm, an unsupervised partitioning algorithm that produces k clusters, where k is fixed a priori. K-means treats each observation in the data as an object in a feature space and groups the objects so that those within a cluster are close to each other. The initial centroids are chosen as the first k points. Every point in the dataset is then assigned to the nearest centroid. This is repeated over many iterations: in each one, every data point is assigned to its nearest centroid and each centroid is recomputed as the mean of its assigned points.
3.1.3 Text Processing
Structured text is extracted from the web pages and subjected to stop-word removal, stemming and finally keyword extraction.
3.1.3.1 Stop Words Removal
Non-informative words must be removed from the sentences. A predefined list of commonly occurring non-informative words is used to eliminate them.
3.1.3.2 Stemming
Stemming reduces words to their base form. For example, "walking" and "walks" are both converted to "walk".
3.1.3.3 Keywords Extraction
A set of meaningful keywords is extracted to be used as tags. The extracted sentence and its associated keywords are stored in the database.
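The text-processing steps of Section 3.1.3 can be illustrated with a minimal sketch. The paper does not specify its stop-word list, stemmer or keyword-selection rule, so everything here is an assumption made for illustration: the small `STOP_WORDS` set, the crude suffix stripping in `stem`, and the choice of the most frequent stems as keywords in `extract_keywords`.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; the paper's list is not given.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

# Assumption: crude suffix stripping stands in for the unspecified stemmer.
SUFFIXES = ("ing", "es", "s")


def tokenize(sentence):
    """Lower-case the sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def extract_keywords(sentence, top_n=3):
    """Return the top_n most frequent stems after stop-word removal."""
    stems = [stem(t) for t in remove_stop_words(tokenize(sentence))]
    return [word for word, _ in Counter(stems).most_common(top_n)]


keywords = extract_keywords(
    "The tiger is walking and the tiger walks in the forest", top_n=2
)
```

A production system would more likely use an established stemmer and a weighting scheme such as TF-IDF; the sketch only shows the order of the three steps.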
Fig 2: Preprocessing of Data.
3.2 Ranking Phase
Let r(z(k)) [5] be the rank of image z(k), and let A(r) be any non-increasing function of the rank. Since we care only about the ranks of the top K images, we can define A(r) as:
A(r) = max(K + 1 - r, 0)
Thus the top-ranked images (those with the lowest rank numbers) are assigned higher weights, and since A(r) = 0 for r > K, only the top K images of the ranking are considered. The top-ranked images are then mapped to the associated text stored in the text database, using the same tag under which the image is stored.
3.3 Retrieval Phase
In the preprocessing stage, the text and the images are stored and indexed. In the retrieval phase, the images and text are retrieved based on the user's query.
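The rank weighting of Section 3.2 can be sketched as follows. The function names and the `text_by_tag` dictionary are hypothetical stand-ins for the paper's image ranking and text database; only the formula A(r) = max(K + 1 - r, 0) is taken from the text.

```python
def rank_weight(r, K):
    """A(r) = max(K + 1 - r, 0) for a 1-based rank r."""
    return max(K + 1 - r, 0)


def top_k_with_text(ranked_tags, text_by_tag, K):
    """Pair each of the top K image tags with its weight and stored text.

    ranked_tags: image tags ordered from best (rank 1) to worst.
    text_by_tag: hypothetical text store keyed by the same tag as the image.
    """
    results = []
    for r, tag in enumerate(ranked_tags, start=1):
        weight = rank_weight(r, K)
        if weight == 0:
            break  # every later rank also has weight 0
        # None marks an image with no associated text.
        results.append((tag, weight, text_by_tag.get(tag)))
    return results


texts = {"tiger": "A tiger in the forest.", "zebra": "A zebra on the plain."}
top = top_k_with_text(["tiger", "lion", "zebra"], texts, K=2)
```

With K = 2, the image at rank 1 receives weight 2, rank 2 receives weight 1, and rank 3 is dropped.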
Fig 3: The Retrieval phase 
The visual features are computed for the images in the database in the preprocessing stage and the mean of the visual features is matched to the textual term. 
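As a rough illustration of the color-histogram feature and the k-means clustering described in Sections 3.1.2.1 and 3.1.2.2, the sketch below uses plain Python rather than an image library. It assumes an image is given as a list of (R, G, B) tuples with values from 0 to 255; the bin count and the function names are illustrative choices, not details from the paper. The initial centroids are the first k points, as the paper states.

```python
def rgb_histogram(pixels, bins=4):
    """Concatenated R, G and B histograms, each normalized to sum to 1."""
    hist = [0.0] * (3 * bins)
    width = 256 // bins
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Clamp so that the value 255 falls in the last bin.
            hist[channel * bins + min(value // width, bins - 1)] += 1
    total = len(pixels)
    return [count / total for count in hist] if total else hist


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; returns (assignment, centroids)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]  # first k points, as in the paper
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


red_hist = rgb_histogram([(255, 0, 0)] * 4, bins=2)
labels, centers = kmeans(
    [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)], k=2
)
```

In practice the points passed to `kmeans` would be the histogram vectors of the stored images, so that images with similar color distributions fall into the same cluster.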
The visual feature used here is the color histogram of the red, green and blue values, so many feature values are computed for a single image. When a query is given, the keyword is obtained and matched with the visual features of the images in the database. If the values match, the images and the corresponding text are retrieved [7].
3.3.1 Text and Image Query
The result is based on the aggregation of the scores of the image and text retrieval [2]. Different aggregation operators can be used, and they behave differently [4]. If it is maximum, we get the highly relevant text and image. If it is minimum, then it can be either the best image or the best text and not both. Another very common approach is to aggregate the results using a mean.
3.3.1.1 The Output of the RTNIC System for Combined Text and Image Retrieval
Fig 4: The Output of RTNIC combined image and text retrieval
The user interface lets the user enter a textual query [7]. When the query is given, the related textual and non-textual data are retrieved, and the top k ranked pairs of text and image are returned. For example, when the query "World Wonders" is given, the related text and images are retrieved. This is the case when both text content and image content are in the database.
3.3.1.2 The Output of the RTNIC System for the Image Retrieval
In some cases there may be images with no related text associated with them. In that case the images alone are retrieved. For example, for the query "Book" only images are available, so they are displayed according to rank.
Fig 5: The Output of the RTNIC image retrieval
3.3.1.3 The Output of the RTNIC for Text Retrieval
There are also cases in which text data is available for the query but no related images. In that case RTNIC displays the text associated with the query. For example, when the query "Heaven" is given, there are no images associated with it, so the textual information alone is retrieved based on the ranking.
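The score aggregation of Section 3.3.1, together with the three output cases above (text and image, image only, text only), might be sketched as below. The paper does not give a scoring formula, so the `aggregate` function, the use of `None` for a missing modality, and the numeric scores are all assumptions made for illustration.

```python
from statistics import mean

# The three aggregation operators named in Section 3.3.1.
AGGREGATORS = {"max": max, "min": min, "mean": mean}


def aggregate(text_score, image_score, operator="mean"):
    """Combine text and image scores with the chosen operator.

    A score of None means that modality is missing for the query, in which
    case the remaining score is used alone (image-only or text-only output).
    """
    scores = [s for s in (text_score, image_score) if s is not None]
    if not scores:
        return None  # nothing was retrieved for either modality
    return AGGREGATORS[operator](scores)


both = aggregate(0.5, 0.25, "mean")       # text and image both available
image_only = aggregate(None, 0.25, "max") # e.g. a query with images but no text
text_only = aggregate(0.5, None, "min")   # e.g. a query with text but no images
```

Keeping the operator as a parameter makes it easy to compare how maximum, minimum and mean change the final ranking.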
Fig 6: The Output of the RTNIC for the text retrieval
4. EXPERIMENTAL RESULT OF RTNIC
4.1. Performance Evaluation
The following two measurements quantify the quality of the search:
Recall = R / M = Number of retrieved images and text that are also relevant / Total number of relevant images and text.
Fig 7: Recall 
Precision = R / N = Number of retrieved images and text that are also relevant / Total number of retrieved images and text.
Recall answers the question: how close am I to getting all good matches? Precision answers the question: how close am I to getting only good matches?
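The two measures can be computed as in the following sketch, where R is the number of retrieved items that are relevant, M the number of relevant items and N the number of retrieved items. The item identifiers are made up for the example, and returning 0.0 for an empty set is a convention chosen here, not something the paper specifies.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) = (R / N, R / M) for two collections of ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # R: retrieved items that are relevant
    precision = hits / len(retrieved) if retrieved else 0.0  # R / N
    recall = hits / len(relevant) if relevant else 0.0       # R / M
    return precision, recall


# Four items retrieved, three relevant overall, two in common.
precision, recall = precision_recall(
    ["img1", "img2", "txt1", "txt2"], ["img1", "txt1", "img9"]
)
```

Here precision is 2/4 and recall is 2/3, since two of the four retrieved items are relevant and two of the three relevant items were found.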
Fig 8: Precision
5. CONCLUSIONS AND FUTURE WORK
Information on the Internet must be archived, maintained and effectively managed for retrieval. The proposed work concentrates on retrieving relevant textual and non-textual information, so that text retrieval and image retrieval complement each other. The cloud environment addresses the data security and storage issues of large datasets. Future work will address combined video and text retrieval, including effective retrieval of video from databases and efficient indexing of videos.
REFERENCES
[1]. Alaa Riad, Hamdy Elminir and Sameh Abd-Elghany, "Web Image Retrieval Search Engine based on Semantically Shared Annotation", IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 2, No 3, March 2012.
[2]. A. BalaSubramanium, "Information Retrieval Techniques for non textual media".
[3]. Cong Wang, "Toward Secure and Dependable Storage Services in Cloud Computing", IEEE Transactions on Services Computing, 2012.
[4]. L. P. Florence, "Image and Text Mining Based on Contextual Exploration from Multiple Points of View", Twenty-Fourth International FLAIRS Conference, Palm Beach, Florida, 18-20 May 2011.
[5]. N. Haque, "Image Ranking for Multimedia Retrieval", Ph.D. thesis, School of Computer Science and Information Technology, Royal Melbourne Institute of Technology, 2003.
[6]. Martina Zachariasova, Robert Hudec, Miroslav Benco and Patrik Kamencay, "Automatic Extraction of Non-Textual Information in Web Document and Their Classification", IEEE, 2012.
[7]. Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R.G. Lanckriet, Roger Levy and Nuno Vasconcelos, "A New Approach to Cross-Modal Multimedia Retrieval", MM'10.
[8]. G. Tryfou and N. Tsapatsoulis, "Web Image context extraction based on Semantic representation of web page visual segments", 7th International Workshop on Semantic and Social Media Adaptation and Personalization, 2012.
[9]. M.J. Parag and I. Sam, "Web document text and images extraction using DOM analysis and natural language processing", in Proceedings of the 9th ACM Symposium on Document Engineering (DocEng 09).
[10]. Yin-Hsi Kuo, Wen-Huang Cheng, Hsuan-Tien Lin and Winston H. Hsu, "Unsupervised Semantic Feature Discovery for Image Object Retrieval and Tag Refinement", IEEE Transactions on Multimedia, Vol. 14, No. 4, August 2012.
BIOGRAPHIES
Dr. M. S. Anbarasi has completed B.E. (Comp Sc & Tech), M.E. (SE) and Ph.D. in Data Mining from Anna University, CEG Campus, Chennai 25. She has 15 years of teaching experience and 7 years of research experience in the areas of Data Mining, Software Engineering and Cloud Computing.
Divya R is a final-year student in the Department of Information Technology, Pondicherry Engineering College, Puducherry, India. Her areas of interest are Big Data and Information Retrieval.
Illakkiya M is a final-year student in the Department of Information Technology, Pondicherry Engineering College, Puducherry, India. Her areas of interest are Data Mining and Warehousing.
Buvaneswari P is a final-year student in the Department of Information Technology, Pondicherry Engineering College, Puducherry, India. Her areas of interest are Web Mining and Image Retrieval.

Retrieval of textual and non textual information in

  • 1. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 03 Special Issue: 07 | May-2014, Available @ http://www.ijret.org 513 RETRIEVAL OF TEXTUAL AND NON-TEXTUAL INFORMATION IN CLOUD Anbarasi M S1, Divya R2, Buvaneswari P3, Illakkiya M4 1Assistant Professor, Department of Information Technology, Pondicherry Engineering College, Puducherry, India 2Student, Department of Information Technology, Pondicherry Engineering College, Puducherry, India 3Student, Department of Information Technology, Pondicherry Engineering College, Puducherry, India 4Student, Department of Information Technology, Pondicherry Engineering College, Puducherry, India Abstract With the advent of the Internet there is an exponential growth in multimedia content in various databases, which has a major issue in effective access and retrieval of both textual and non-textual resources. To resolve this problem information is stored in the cloud environment. From Internet the large and complex data cannot be stored and processed using traditional data processing applications. The idea in this proposed method involves parsing of the web page for the extraction of textual data and images. Textual retrieval is done through keyword extraction whereas feature extraction technique is done for the image retrieval. K-means algorithm is used to perform clustering. Based on ranking both the textual data and non-textual data are retrieved together. Keywords— Retrieval, Feature Extraction, K- means Clustering, Cloud ----------------------------------------------------------------------***-------------------------------------------------------------------- 1. INTRODUCTION The World Wide Web is a very large distributed digital information space. Images are a major source of content on the Internet. 
The development of technology such as digital cameras and mobile telephones equipped with such devices generates huge amounts of non-textual information, such as images. The ability to search and retrieve information from the Web efficiently and effectively is an emerging technology in information retrieval. The existing system retrieves too many documents, of which only a small chunk are relevant to the user query. The idea of the proposed work is to retrieve relevant textual and non textual information together for the user query in cloud, so it is entitled as Retrieval of Textual and Non-Textual Information in Cloud (RTNIC). For example, if the user needs to retrieve information for the query animal then this retrieval process provides the relevant non textual information (i.e. images) of animals along with its textual description about the image. The main aim is to obtain the correspondences between the image and its associated text for the easy understanding. Cloud [3] is used to enhance data management and storage of huge amount of information since the growth of information to be stored increases day by day. Cloud is used since the information being accessed from a centralized storage [3], and does not need any user to be in a specific place to access it. By this way information is delivered and resources are retrieved by web-based system, rather than a direct connection to a server. The main advantages of using cloud are:  It is highly elastic.  Everything is provided as service.  Less power consumed on hardware and software.  High availability and scalability.  No data loss. Thus the information retrieval in the cloud is a popular research area. It is shown in figure 1.In recent years the web document analysis has been done to effective filtering of useful information from them. The multimedia documents consist of different components (texts, images, sounds, videos).But we concentrate on image and text. 
The main aim is to obtain the correspondences between an image and its associated text in order to improve retrieval accuracy. Research in web page processing focuses only on the textual analysis of the segments around the non-textual information.

2. RELATED WORK
Current research in image retrieval uses both textual and visual features to retrieve images. Ontology-based image retrieval is used to reduce the semantic gap. It focuses on the semantic content that relates to the user’s intent. That work also concentrates on tag refinement. It overcomes the problem that visual words are too sparse to match, and it maps the textual cluster to the visual cluster in order to retrieve the image [10].
The second approach extracts the text and images from the web page and stores them in different databases. The image analysis [8] includes segmentation followed by Scale Invariant Feature Transform (SIFT) feature extraction. Segmenting the images reduces the number of SIFT points, which improves the matching between the images in the database and the query. A Support Vector Machine (SVM) classifier is used to classify the images into various classes. The text and the images are then subjected to semantic inclusion.

3. ARCHITECTURE OF THE PROPOSED SYSTEM
The large and complex dataset of texts and images cannot be processed using traditional database applications. The proposed work is the retrieval of relevant textual and non-textual information in the cloud.

Fig 1: RTNIC High Level System Architecture

The proposed approach does not concentrate on image retrieval alone but also retrieves the relevant textual information. Existing retrieval processes retrieve either images alone or text alone. They use content-based retrieval methods, in which it is difficult to match the image contents with the existing images in the database [5]. Because the system retrieves from the cloud, it concentrates on retrieval methods that are specific to the cloud environment.

3.1. Preprocessing Stage
The proposed system consists of two phases. The first phase is the preprocessing stage. The second phase is the Common Retrieval phase. The preprocessing stage collects the text and images from the web page and stores them in the database. The second phase applies the ranking algorithm to retrieve the relevant items from the database.
The preprocessing phase includes three modules: the Parser module, the Image Processing module and the Text Processing module. The following sections describe the modules and their functions.

3.1.1. HTML Parsing Module
The HTML parsing module converts the HTML page into a DOM tree [1,6]. A DOM-tree-based web page segmentation algorithm is used to segment web pages into sections, and the text and images contained in each section are extracted. Tags such as <TD>, <TR>, <TABLE> and <HR> are used to separate the different content passages [9].

3.1.2. Image Processing Module
This module covers the image feature extraction process.

3.1.2.1 Feature Extraction Process
The color histogram is the feature extracted from the images. Color histograms are popular and simple to compute. The color histogram method extracts three histograms, one for each of the RGB color channels, by counting the occurrences of each color value. Once computed, the histograms are normalized because the images are collected from different sites.

3.1.2.2 Image Clustering
Image clustering is done using the k-means algorithm, an unsupervised clustering process. K-means is a partitioning algorithm that produces k clusters, where k is fixed a priori. It treats each observation in the data as an object in space, so that the objects within each cluster are close to each other. The initial centroids are chosen as the first k points. Every point in the dataset is then assigned to the nearest centroid. This assignment is repeated over many iterations, with the centroids updated after each one.

3.1.3 Text Processing
Structured text is extracted from the web pages and subjected to stop word removal, stemming and finally keyword extraction.

3.1.3.1 Stop Words Removal
Non-informative words need to be removed from the sentences.
A predefined frequency list stores the commonly occurring non-informative words and is used to eliminate them.

3.1.3.2 Stemming
Stemming reduces words to their base form. For example, “walking” and “walks” are both converted to “walk”.

3.1.3.3 Keywords Extraction
A set of meaningful keywords is extracted to be used as tags. The extracted sentence and its associated keywords are stored in the database.
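The preprocessing steps of Sections 3.1.2 and 3.1.3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `STOP_WORDS`, `stem`, `extract_keywords`, `color_histogram` and `k_means` are hypothetical, the stop-word list is a tiny stand-in for the predefined frequency list, suffix stripping stands in for a real stemmer, keywords are chosen by simple frequency (the paper does not say how they are selected), and the histogram uses 4 bins per channel for brevity.

```python
from collections import Counter

# Illustrative stop-word list; the paper's predefined frequency list is not given.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def stem(word):
    """Naive suffix stripping, e.g. 'walking' and 'walks' both become 'walk'."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_keywords(sentence, top_n=3):
    """Stop-word removal, stemming, then the most frequent stems as keywords."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in sentence.split()]
    stems = [stem(t) for t in tokens if t and t not in STOP_WORDS]
    return [word for word, _ in Counter(stems).most_common(top_n)]


def color_histogram(pixels, bins=4):
    """Normalized R, G and B histograms of (r, g, b) pixels in 0..255, concatenated."""
    hist = [0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            hist[channel * bins + value * bins // 256] += 1
    # Dividing by the pixel count makes images of different sizes comparable.
    return [count / len(pixels) for count in hist]


def k_means(points, k, iterations=10):
    """Plain k-means with the first k points as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Three toy "images" given as pixel lists: two reddish, one bluish.
reddish = color_histogram([(250, 10, 10), (240, 20, 5)])
bluish = color_histogram([(10, 10, 250), (5, 20, 240)])
also_red = color_histogram([(255, 0, 0)])
labels, _ = k_means([reddish, bluish, also_red], k=2)
keywords = extract_keywords("The dog is walking and the dog walks in the park")
```

In this toy run the two reddish images fall into the same cluster, and the keywords could serve as the tags that link a stored sentence to its image. A production system would likely use an image library for decoding and an established stemmer.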
Fig 2: Preprocessing of Data

3.2 Ranking Phase
Let A(r) be any non-increasing function and let r(z(k)) [5] be the rank of image z(k). Since we care only about the ranks of the top K images, we can define A(r) as:

A(r) = max(K + 1 - r, 0)

Thus the images nearer the top of the ranking (lower r) are assigned higher weights and, since A(r) = 0 for r > K, only the top K images of the ranking are considered. The top-ranked images are then mapped to the associated text stored in the text database by using the same tag that was used to store the image.

3.3 Retrieval Phase
In the preprocessing stage, the text and the images are stored and indexed. In the retrieval phase, the images and text are retrieved based on the user’s query.

Fig 3: The Retrieval Phase

The visual features of the images in the database are computed in the preprocessing stage, and the mean of the visual features is matched to the textual term. The visual feature used here is the color histogram of the red, green and blue values, so many feature values are computed for a single image. When a query is given, the keyword is obtained and matched against the visual features of the images in the database. If the values match, the images and the corresponding text are retrieved [7].

3.3.1 Text and Image Query
The result is based on the aggregation of the scores of the image and text retrieval [2]. Different aggregation operators can be used, and each behaves differently [4]. If it is maximum, we get the highly relevant text and image. If it is minimum, then it can be either the best image or the best text and not both.
Another very common approach is to aggregate the results using a mean.

3.3.1.1 The Output of the RTNIC System for Combined Text and Image Retrieval

Fig 4: The Output of RTNIC Combined Image and Text Retrieval
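The rank weighting of Section 3.2 and the score aggregation of Section 3.3.1 can be illustrated with a short sketch. It rests on several assumptions: the function names, the candidate tuples and the score values are hypothetical, and the paper does not specify how the weight A(r) is combined with the aggregated score, so here the aggregated score only orders the pairs and A(r) is reported alongside each of the top K.

```python
from statistics import mean

# Aggregation operators mentioned in the text: maximum, minimum and mean.
AGGREGATORS = {"max": max, "min": min, "mean": mean}


def rank_weight(rank, top_k):
    """A(r) = max(K + 1 - r, 0) for a 1-based rank r."""
    return max(top_k + 1 - rank, 0)


def aggregate(text_score, image_score, operator="mean"):
    """Combine the text and image retrieval scores with the chosen operator."""
    return AGGREGATORS[operator]([text_score, image_score])


def top_k_pairs(candidates, top_k, operator="mean"):
    """Rank (tag, text_score, image_score) tuples and keep the top K with their weights."""
    ranked = sorted(
        candidates,
        key=lambda c: aggregate(c[1], c[2], operator),
        reverse=True,
    )
    return [
        (tag, rank_weight(rank, top_k))
        for rank, (tag, _, _) in enumerate(ranked, start=1)
        if rank <= top_k
    ]


# Hypothetical scores for three tagged text/image pairs.
candidates = [
    ("taj_mahal", 0.9, 0.7),
    ("pyramid", 0.6, 0.8),
    ("book", 0.2, 0.1),
]
result = top_k_pairs(candidates, top_k=2)
```

Passing `operator="max"` or `operator="min"` instead of the default changes only how the two scores are combined before sorting, which makes it easy to compare the behavior of the operators on the same candidates.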
The user interface lets the user enter a textual query [7]. When the query is given, the related textual and non-textual data are retrieved as the top k ranked pairs of text and image. For example, when the query “World Wonders” is given, the related text and images are retrieved. This is the case when both the text content and the image content are available in the database.

3.3.1.2 The Output of the RTNIC System for the Image Retrieval
There may be cases where images are available but no related text is associated with them. In that case the images alone are retrieved. For example, when the query “Book” is given as input, only images are available for it, so they are displayed according to their rank.

Fig 5: The Output of the RTNIC Image Retrieval

3.3.1.3 The Output of the RTNIC for Text Retrieval
There are also cases where text data is available for the query but no images relate to it. In that case RTNIC displays the text associated with the query. For example, when the query “Heaven” is given, there are no images associated with it, so the textual information alone is retrieved based on the ranking.

Fig 6: The Output of the RTNIC for the Text Retrieval

4. EXPERIMENTAL RESULT OF RTNIC
4.1. Performance Evaluation
The following two measurements quantify the quality of the search, where R is the number of retrieved images and text that are also relevant, M is the total number of relevant images and text, and N is the total number of retrieved images and text:

Recall = R / M = Number of retrieved images and text that are also relevant / Total number of relevant images and text

Fig 7: Recall

Precision = R / N = Number of retrieved images and text that are also relevant / Total number of retrieved images and text

Recall is the answer to the question: How close am I to getting
all good matches? Precision is the answer to the question: How close am I to getting only good matches?

Fig 8: Precision

5. CONCLUSIONS AND FUTURE WORK
The information on the Internet must be archived, maintained and effectively managed for retrieval. The proposed work concentrates on the retrieval of relevant textual and non-textual information, so that text retrieval and image retrieval complement each other. The cloud environment addresses the data security and storage issues of large datasets. Future work will address video and text retrieval, including effective retrieval of video from databases and efficient indexing of videos.

REFERENCES
[1]. Alaa Riad, Hamdy Elminir and Sameh Abd-Elghany, “Web Image Retrieval Search Engine based on Semantically Shared Annotation”, IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 2, No 3, March 2012.
[2]. A. BalaSubramanium, “Information Retrieval Techniques for non textual media”.
[3]. Cong Wang, “Toward Secure and Dependable Storage Services in Cloud Computing”, IEEE Transactions on Services Computing, 2012.
[4]. L. P. Florence, “Image and Text Mining Based on Contextual Exploration from Multiple Points of View”, Twenty-Fourth International FLAIRS Conference, Palm Beach, Florida, 18-20 May 2011.
[5]. N. Haque, “Image Ranking for Multimedia Retrieval”, Ph.D. thesis, School of Computer Science and Information Technology, Royal Melbourne Institute of Technology, 2003.
[6]. Martina Zachariasova, Robert Hudec, Miroslav Benco and Patrik Kamencay, “Automatic Extraction of Non-Textual Information in Web Document and Their Classification”, IEEE, 2012.
[7]. Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R.G. Lanckriet, Roger Levy and Nuno Vasconcelos, “A New Approach to Cross-Modal Multimedia Retrieval”, MM’10.
[8]. G. Tryfou and N. Tsapatsoulis, “Web Image context extraction based on Semantic representation of web page visual segments”, 7th International Workshop on Semantic and Social Media Adaptation and Personalization, 2012.
[9]. M.J. Parag and I. Sam, “Web document text and images extraction using DOM analysis and natural language processing”, in Proceedings of the 9th ACM Symposium on Document Engineering, DocEng 09.
[10]. Yin-Hsi Kuo, Wen-Huang Cheng, Hsuan-Tien Lin and Winston H. Hsu, “Unsupervised Semantic Feature Discovery for Image Object Retrieval and Tag Refinement”, IEEE Transactions on Multimedia, Vol. 14, No. 4, August 2012.

BIOGRAPHIES
Dr. M. S. Anbarasi completed her B.E. (Comp Sc & Tech), M.E. (SE) and Ph.D. in Data Mining at Anna University, CEG Campus, Chennai 25. She has 15 years of teaching experience and 7 years of research experience in the areas of Data Mining, Software Engineering and Cloud Computing.

Divya R is a final-year student in the Department of Information Technology at Pondicherry Engineering College, Puducherry, India. Her areas of interest are Big Data and Information Retrieval.

Illakkiya M is a final-year student in the Department of Information Technology at Pondicherry Engineering College, Puducherry, India. Her areas of interest are Data Mining and Warehousing.

Buvaneswari P is a final-year student in the Department of Information Technology at Pondicherry Engineering College, Puducherry, India. Her areas of interest are Web Mining and Image Retrieval.