Advances in Intelligent Systems
and Computing 182
Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: kacprzyk@ibspan.waw.pl
For further volumes:
http://www.springer.com/series/11156
Ajith Abraham and Sabu M. Thampi (Eds.)
Intelligent Informatics
Proceedings of the International Symposium
on Intelligent Informatics ISI’12
Held August 4–5, 2012, Chennai, India
Editors
Dr. Ajith Abraham
Machine Intelligence Research Labs
(MIR Labs)
Scientific Network for Innovation and
Research Excellence
Auburn
Washington
USA
Dr. Sabu M. Thampi
Indian Institute of Information Technology
and Management - Kerala (IIITM-K)
Technopark Campus
Trivandrum
Kerala
India
ISSN 2194-5357 e-ISSN 2194-5365
ISBN 978-3-642-32062-0 e-ISBN 978-3-642-32063-7
DOI 10.1007/978-3-642-32063-7
Springer Heidelberg New York Dordrecht London
Library of Congress Control Number: 2012942843
© Springer-Verlag Berlin Heidelberg 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publisher’s location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
This book contains a selection of refereed and revised papers originally presented
at the First International Symposium on Intelligent Informatics (ISI'12), August
4–5, 2012, Chennai, India. ISI'12 provided an international forum for sharing
original research results and practical development experiences among experts in
the emerging areas of intelligent informatics. ISI'12 was co-located with the
International Conference on Advances in Computing, Communications and Informatics
(ICACCI-2012).
Credit for the quality of the conference proceedings goes first and foremost to the
authors. They contributed a great deal of effort and creativity to produce this work,
and we are very thankful that they chose ISI'12 as the place to present it. All of the
authors who submitted papers, both accepted and rejected, are responsible for keeping
the ISI papers program vital. A total of 165 papers from 17 countries, touching a wide
spectrum of topics related to both theory and applications, were submitted to ISI'12.
Of these, 54 papers were selected for regular presentation.
An event like this can only succeed as a team effort. We would like to acknowledge
the contribution of the program committee members and thank the reviewers for their
efforts. Many thanks to the honorary chair Lotfi Asker Zadeh, the general chair Axel
Sikora, as well as the program chairs Adel M. Alimi, Juan Manuel Corchado and Michal
Wozniak. Their involvement and support have added greatly to the quality of the
symposium.
We wish to express our thanks to Thomas Ditzinger, Senior Editor, Engineering/Applied
Sciences, Springer-Verlag, for his help and cooperation.
August 2012 Ajith Abraham
Sabu M. Thampi
ISI’12 Conference Committee
Honorary Chair
Prof. Lotfi Asker Zadeh Founder of Fuzzy Logic, University of California
Berkeley, USA
General Chair
Axel Sikora University of Applied Sciences Offenburg,
Germany
Program Chairs
Adel M. Alimi University of Sfax, Tunisia
Juan Manuel Corchado Rodriguez University of Salamanca, Spain
Michal Wozniak Wroclaw University of Technology, Poland
Publication Chairs
Ajith Abraham Machine Intelligence Research Labs (MIR Labs),
USA
Sabu M. Thampi Indian Institute of Information Technology and
Management - Kerala, India
TPC Members
Abdelmajid Khelil TU Darmstadt, Germany
Aboul Ella Hassanien University of Cairo, Egypt
Agostino Bruzzone University of Genoa, Italy
Ajay Singh Multimedia University, Malaysia
Ajay Jangra UIET Kurukshetra University, Kurukshetra, India
Algirdas Pakaitas London Metropolitan University - North Campus,
United Kingdom
Amer Dawoud University of Southern Mississippi, USA
Anirban Kundu Kuang-Chi Institute of Advanced Technology,
P.R. China
Anjana Gosain Indraprastha University, India
Ash Mohammad Abbas Aligarh Muslim University, India
Asrul Adam Universiti Teknologi Malaysia, Malaysia
Athanasios Pantelous University of Liverpool, United Kingdom
Atul Negi University of Hyderabad, India
Avinash Keskar V.N.I.T., India
Azian Azamimi Abdullah Universiti Malaysia Perlis, Malaysia
Belal Abuhaija University of Tabuk, Saudi Arabia
Brajesh Kumar Kaushik Indian Institute of Technology, Roorkee, India
Cecilia Nugraheni Parahyangan Catholic University, Indonesia
Crina Grosan Norwegian University of Science and Technology,
Norway
Dayang Jawawi Universiti Teknologi Malaysia, Malaysia
Dhiya Al-Jumeily Liverpool John Moores University,
United Kingdom
Edara Reddy Nagarjuna University, India
Edward Williams PMcorp, USA
Emilio Jiménez Macías University of La Rioja, Spain
Eri Shimokawara Tokyo Metropolitan University, Japan
Farrah Wong Universiti Malaysia Sabah, Malaysia
Fatos Xhafa UPC, Barcelona Tech, Spain
G. Ganesan Adikavi Nannaya University, India
Gancho Vachkov Yamaguchi University, Japan
Georgi Dimirovski Dogus University of Istanbul, Turkey
Ghulam Abbas Liverpool Hope University, United Kingdom
Gregorio Romero Universidad Politecnica de Madrid, Spain
Gurvinder-Singh Baicher University of Wales Newport, United Kingdom
Habib Kammoun University of Sfax, Tunisia
Hanif Ibrahim Universiti Teknologi Malaysia, Malaysia
Ida Giriantari Udayana University, Bali, Indonesia
Imran Bajwa The Islamia University of Bahawalpur, Pakistan
Issam Kouatli Lebanese American University, Lebanon
J. Mailen Kootsey Simulation Resources, Inc., USA
Jasmy Yunus University of Technology Malaysia, Malaysia
Javier Bajo University of Salamanca, Spain
Jayashree Padmanabhan Anna University, India
Jeng-Shyang Pan National Kaohsiung University of Applied
Sciences, Taiwan
Jiri Dvorsky Technical University of Ostrava, Czech Republic
Josip Lorincz University of Split, Croatia
K. Thangavel Periyar University, Salem, India
Kambiz Badie Iran Telecom Research Center, Iran
Kenneth Nwizege University of SWANSEA, United Kingdom
Kumar R. SRM University, India
Kumar Rajamani GE Global Research, India
Lee Tian Soon Lt Multi Media University, Malaysia
Mahendra Dixit SDMCET, India
Manjunath Aradhya Dayananda Sagar College of Engineering, India
Manu Sood Himachal Pradesh University, India
Mario Kappen Kyushu Institute of Technology, Japan
Martin Tunnicliffe Kingston University, United Kingdom
Mikulas Alexik University of Zilina, Slovakia
Mohamad Noh Ahmad Universiti Teknologi Malaysia, Malaysia
Mohamed Baqer University of Bahrain, Bahrain
Mohamed Dahmane University of Montreal, Canada
Mohand Lagha Saad Dahlab University of Blida, Algeria
Mohsen Askari University of Technology, Sydney, Iran
Mokhtar Beldjehem Sainte-Anne’s University, Canada
Monica Chis Siemens PSE Romania, Romania
Muhammad Nadzir Marsono Universiti Teknologi Malaysia, Malaysia
Narendra Bawane RTM, Nagpur University, India
Nico Saputro Southern Illinois University Carbondale, USA
Nor Hisham Khamis Universiti Teknologi Malaysia, Malaysia
Obinna Anya Liverpool Hope University, United Kingdom
Otavio Teixeira Centro Universitário do Estado do Pará (CESUPA),
Brazil
Petia Koprinkova-Hristova Bulgarian Academy of Sciences, Bulgaria
Praveen Srivastava Bits Pilani, India
Raees Khan B.B.A. University, Lucknow, India
Rajan Vaish University of California, USA
Rubita Sudirman Universiti Teknologi Malaysia, Malaysia
S.D. Katebi Shiraz University, Shiraz, Iran
Sami Habib Kuwait University, Kuwait
Sasanko Gantayat GMR Institute of Technology, India
Satish Chandra Jaypee University of Information Technology, India
Sattar Sadkhan University of Babylon, Iraq
Satya Kumara Udayana University, Bali, Indonesia
Satya Ghrera Jaypee University of Information Technology, India
Sayed A. Hadei Tarbiat Modares University, Iran
Shubhalaxmi Kher Arkansas State University, USA
Shuliang Sun Fuqing Branch of Fujian Normal University,
P.R. China
Siti Mariyam Shamsuddin Universiti Teknologi Malaysia, Malaysia
Smriti Srivastava Netaji Subhas Institute of Technology, India
Sotiris Kotsiantis University of Patras, Greece
Sriparna Saha IIT Patna, India
Suchitra Sueeprasan Chulalongkorn University, Thailand
Sung-Bae Cho Yonsei University, Korea
Teruaki Ito University of Tokushima, Japan
Theodoros Kostis University of the Aegean, Greece
Usha Banerjee College of Engineering Roorkee, India
Vaclav Satek Brno University of Technology, Czech Republic
Vatsavayi Valli Kumari Andhra University, India
Veronica Moertini Parahyangan Catholic University, Bandung,
Indonesia
Viranjay Srivastava Jaypee University of Information Technology,
Shimla, India
Visvasuresh Victor
Govindaswamy Texas A&M University-Texarkana, USA
Vivek Sehgal Jaypee University of Information Technology,
India
Wan Hussain Wan Ishak Universiti Utara Malaysia, Malaysia
Xu Han Univ. of Rochester, USA
Yahya Elhadj Al-Imam Muhammad Ibn Saud Islamic University,
Saudi Arabia
Yu-N Cheah Universiti Sains Malaysia, Malaysia
Zaliman Sauli Universiti Malaysia Perlis, Malaysia
Organising Committee
(RMK Engineering College)
Chief Patron
R.S. Munirathinam
Patrons
Manjula Munirathinam
R.M. Kishore
R. Jothi Naidu
Yalamanchi Pradeep
DurgaDevi Pradeep
Sowmya Kishore
Advisory Committee
T. Pitchandi
M.S. Palanisamy
Elwin Chandra Monie
N.V. Balasubramanian
Sheshadri
K.A. Mohamed Junaid
K.K. Sivagnana Prabu
Convener
K. Chandrasekaran
Secretary
K.L. Shunmuganathan
Contents

Data Mining, Clustering and Intelligent Information Systems

Mining Top-K Frequent Correlated Subgraph Pairs in Graph Databases (p. 1)
Li Shang, Yujiao Jian

Evolutionary Approach for Classifier Ensemble: An Application to Bio-molecular Event Extraction (p. 9)
Asif Ekbal, Sriparna Saha, Sachin Girdhar

A Novel Clustering Approach Using Shape Based Similarity (p. 17)
Smriti Srivastava, Saurabh Bhardwaj, J.R.P. Gupta

Knowledge Discovery Using Associative Classification for Heart Disease Prediction (p. 29)
M.A. Jabbar, B.L. Deekshatulu, Priti Chandra

An Application of K-Means Clustering for Improving Video Text Detection (p. 41)
V.N. Manjunath Aradhya, M.S. Pavithra

Integrating Global and Local Application of Discriminative Multinomial Bayesian Classifier for Text Classification (p. 49)
Emmanuel Pappas, Sotiris Kotsiantis

Protein Secondary Structure Prediction Using Machine Learning (p. 57)
Sriparna Saha, Asif Ekbal, Sidharth Sharma, Sanghamitra Bandyopadhyay, Ujjwal Maulik

Refining Research Citations through Context Analysis (p. 65)
G.S. Mahalakshmi, S. Sendhilkumar, S. Dilip Sam

Assessing Novelty of Research Articles Using Fuzzy Cognitive Maps (p. 73)
S. Sendhilkumar, G.S. Mahalakshmi, S. Harish, R. Karthik, M. Jagadish, S. Dilip Sam

Towards an Intelligent Decision Making Support (p. 81)
Nesrine Ben Yahia, Narjès Bellamine, Henda Ben Ghezala

An Approach to Identify n-wMVD for Eliminating Data Redundancy (p. 89)
Sangeeta Viswanadham, Vatsavayi Valli Kumari

Comparison of Question Answering Systems (p. 99)
Tripti Dodiya, Sonal Jain

Transform for Simplified Weight Computations in the Fuzzy Analytic Hierarchy Process (p. 109)
Manju Pandey, Nilay Khare, S. Shrivastava

Parameterizable Decision Tree Classifier on NetFPGA (p. 119)
Alireza Monemi, Roozbeh Zarei, Muhammad Nadzir Marsono, Mohamed Khalil-Hani

Diagnosing Multiple Faults in Dynamic Hybrid Systems (p. 129)
Imtiez Fliss, Moncef Tagina

Investigation of Short Base Line Lightning Detection System by Using Time of Arrival Method (p. 141)
Behnam Salimi, Zulkurnain Abdul-Malek, S.J. Mirazimi, Kamyar MehranZamir

Investigation on the Probability of Ferroresonance Phenomenon Occurrence in Distribution Voltage Transformers Using ATP Simulation (p. 149)
Zulkurnain Abdul-Malek, Kamyar MehranZamir, Behnam Salimi, S.J. Mirazimi

Design of SCFDMA System Using MIMO (p. 157)
Kaushik Kapadia, Anshul Tyagi
Multi Agent Systems

Testing an Agent Based E-Novel System – Role Based Approach (p. 165)
N. Sivakumar, K. Vivekanandan

Comparative Genomics with Multi-agent Systems (p. 175)
Juan F. De Paz, Carolina Zato, Fernando de la Prieta, Javier Bajo, Juan M. Corchado, Jesús M. Hernández

Causal Maps for Explanation in Multi-Agent System (p. 183)
Aroua Hedhili, Wided Lejouad Chaari, Khaled Ghédira

Hierarchical Particle Swarm Optimization for the Design of Beta Basis Function Neural Network (p. 193)
Habib Dhahri, Adel M. Alimi, Ajith Abraham

Fuzzy Aided Ant Colony Optimization Algorithm to Solve Optimization Problem (p. 207)
Aloysius George, B.R. Rajakumar
Pattern Recognition, Signal and Image Processing

Self-adaptive Gesture Classifier Using Fuzzy Classifiers with Entropy Based Rule Pruning (p. 217)
Riidhei Malhotra, Ritesh Srivastava, Ajeet Kumar Bhartee, Mridula Verma

Speaker Independent Word Recognition Using Cepstral Distance Measurement (p. 225)
Arnab Pramanik, Rajorshee Raha

Wavelet Packet Based Mel Frequency Cepstral Features for Text Independent Speaker Identification (p. 237)
Smriti Srivastava, Saurabh Bhardwaj, Abhishek Bhandari, Krit Gupta, Hitesh Bahl, J.R.P. Gupta

Optimised Computational Visual Attention Model for Robotic Cognition (p. 249)
J. Amudha, Ravi Kiran Chadalawada, V. Subashini, B. Barath Kumar

A Rule-Based Approach for Extraction of Link-Context from Anchor-Text Structure (p. 261)
Suresh Kumar, Naresh Kumar, Manjeet Singh, Asok De

Malayalam Offline Handwritten Recognition Using Probabilistic Simplified Fuzzy ARTMAP (p. 273)
V. Vidya, T.R. Indhu, V.K. Bhadran, R. Ravindra Kumar

Development of a Bilingual Parallel Corpus of Arabic and Saudi Sign Language: Part I (p. 285)
Yahya O. Mohamed Elhadj, Zouhir Zemirli, Kamel Ayyadi

Software-Based Malaysian Sign Language Recognition (p. 297)
Farrah Wong, G. Sainarayanan, Wan Mahani Abdullah, Ali Chekima, Faysal Ezwen Jupirin, Yona Falinie Abdul Gaus

An Algorithm for Headline and Column Separation in Bangla Documents (p. 307)
Farjana Yeasmin Omee, Md. Shiam Shabbir Himel, Md. Abu Naser Bikas

A Probabilistic Model for Sign Language Translation Memory (p. 317)
Achraf Othman, Mohamed Jemni

Selective Parameters Based Image Denoising Method (p. 325)
Mantosh Biswas, Hari Om

A Novel Approach to Build Image Ontology Using Texton (p. 333)
R.I. Minu, K.K. Thyagarajan

Cloud Extraction and Removal in Aerial and Satellite Images (p. 341)
Lizy Abraham, M. Sasikumar

3D360: Automated Construction of Navigable 3D Models from Surrounding Real Environments (p. 349)
Shreya Agarwal

Real Time Animated Map Viewer (AMV) (p. 357)
Neeraj Gangwal, P.K. Garg
Computer Networks and Distributed Systems

A Novel Fuzzy Sensing Model for Sensor Nodes in Wireless Sensor Network (p. 365)
Suman Bhowmik, Chandan Giri

Retraining Mechanism for On-Line Peer-to-Peer Traffic Classification (p. 373)
Roozbeh Zarei, Alireza Monemi, Muhammad Nadzir Marsono

Novel Monitoring Mechanism for Distributed System Software Using Mobile Agents (p. 383)
Rajwinder Singh, Mayank Dave

Investigation on Context-Aware Service Discovery in Pervasive Computing Environments (p. 393)
S. Sreethar, E. Baburaj

A Fuzzy Logic System for Detecting Ping Pong Effect Attack in IEEE 802.15.4 Low Rate Wireless Personal Area Network (p. 405)
C. Balarengadurai, S. Saraswathi

Differentiated Service Based on Reinforcement Learning in Wireless Networks (p. 417)
Malika Bourenane

Multimodal Biometric Authentication Based on Score Normalization Technique (p. 425)
T. Sreenivasa Rao, E. Sreenivasa Reddy

Extracting Extended Web Logs to Identify the Origin of Visits and Search Keywords (p. 435)
Jeeva Jose, P. Sojan Lal

A Novel Community Detection Algorithm for Privacy Preservation in Social Networks (p. 443)
Fatemeh Amiri, Nasser Yazdani, Heshaam Faili, Alireza Rezvanian

Provenance Based Web Search (p. 451)
Ajitha Robert, S. Sendhilkumar

A Filter Tree Approach to Protect Cloud Computing against XML DDoS and HTTP DDoS Attack (p. 459)
Tarun Karnwal, Sivakumar Thandapanii, Aghila Gnanasekaran

Cloud Based Heterogeneous Distributed Framework (p. 471)
Anirban Kundu, Chunlin Ji, Ruopeng Liu

An Enhanced Load Balancing Technique for Efficient Load Distribution in Cloud-Based IT Industries (p. 479)
Rashmi KrishnaIyengar Srinivasan, V. Suma, Vaidehi Nedu

PASA: Privacy-Aware Security Algorithm for Cloud Computing (p. 487)
Ajay Jangra, Renu Bala

A New Approach to Overcome Problem of Congestion in Wireless Networks (p. 499)
Umesh Kumar Lilhore, Praneet Saurabh, Bhupendra Verma

CoreIIScheduler: Scheduling Tasks in a Multi-core-Based Grid Using NSGA-II Technique (p. 507)
Javad Mohebbi Najm Abad, S. Kazem Shekofteh, Hamid Tabatabaee, Maryam Mehrnejad

Author Index (p. 519)
Mining Top-K Frequent Correlated Subgraph
Pairs in Graph Databases
Li Shang and Yujiao Jian
Abstract. In this paper, a novel algorithm called KFCP (top-K Frequent Correlated
subgraph Pairs mining) is proposed to discover the top-k frequent correlated subgraph
pairs in a graph database. The algorithm consists of two steps: co-occurrence
frequency matrix construction and top-k frequent correlated subgraph pair extraction.
We use a matrix to represent the frequencies of all subgraph pairs, compute their
Pearson's correlation coefficients, and then create a list of subgraph pairs sorted
by the absolute value of the correlation coefficient. KFCP can find both positive
and negative correlations without generating any candidate sets; its effectiveness
is assessed through experiments with real-world datasets.
1 Introduction
Graph mining has been a significant research topic in recent years because of its
numerous applications in data analysis, drug discovery, social networking and web link
analysis. In view of this, many traditional mining techniques, such as frequent
pattern mining and correlated pattern mining, have been extended to graph data.
Previous studies mainly focus on mining frequent subgraphs and correlated subgraphs,
while little attention has been paid to finding other interesting patterns involving
frequent correlated subgraphs.
One straightforward solution to the problem mentioned above is FCP-Miner [8]. The
FCP-Miner algorithm employs an effective "filter and verification" framework to find
all frequent correlated graphs whose correlation with a query graph is no less than
a given minimum correlation threshold. However, FCP-Miner
Li Shang · Yujiao Jian
Lanzhou University, P.R. China
e-mail: lishang@lzu.edu.cn, 18993177580@189.cn
A. Abraham and S.M. Thampi (Eds.): Intelligent Informatics, AISC 182, pp. 1–8.
springerlink.com © Springer-Verlag Berlin Heidelberg 2013
has several drawbacks. First, the number of candidates FCP-Miner generates is large,
since it processes each new subgraph f with CGSearch [7] to obtain its candidate set.
Second, it is difficult for users to set an appropriate correlation threshold for each
specific query graph, since different graph databases have different characteristics.
Finally, FCP-Miner is not complete: because of its skipping mechanism, it cannot avoid
missing some subgraph pairs.
To address these problems, in this paper we propose an alternative mining algorithm,
KFCP, for discovering the top-k frequent correlated subgraph pairs. The main
contributions of this paper are briefly summarized as follows.
1. We propose an alternative mining task of finding the top-k positively and
negatively correlated frequent subgraph pairs in graph databases, which allows
users to derive the k most interesting patterns. The resulting patterns are not
only significant but also mutually independent and contain little redundancy.
2. We propose an efficient algorithm, KFCP, based on constructing a co-occurrence
frequency matrix. The method avoids the costly generation of a large number of
candidate sets.
3. We show that KFCP is complete and correct; extensive experiments demonstrate
that the approach is effective and feasible.
The remainder of this paper is organized as follows. Section 2 reports the related
work. Section 3 describes basic concepts. Section 4 introduces our algorithm KFCP
in detail, and Section 5 shows the experimental results on two real datasets.
Finally, we draw conclusions in Section 6.
2 Related Work
Correlation mining has attracted much attention; it plays an essential role in various
types of databases, such as market-basket data [1, 2, 3, 4], multimedia data [5],
stream data [6], and graph data [7, 8]. For market-basket data [1, 2, 3, 4], a number
of correlation measures were proposed to discover all correlated items, including the
chi-square (χ2) test [1], h-confidence [2], and Pearson's correlation coefficient [3].
All of the above works set a threshold for the correlation measure, except [4], which
studied top-k mining. For multimedia data, correlated pattern mining [5] has been
proposed to discover cross-modal correlations. In the context of stream data, lagged
correlation [6] has been presented to investigate the lead-lag relationship between
two time series. Regarding correlation discovery in graph mining, there are many
previous studies: CGS [7] has been proposed for the task of correlation mining between
a subgraph and a given query graph, and the work of [8] aimed to find all frequent
subgraph pairs whose correlation coefficient is at least a given minimum correlation
threshold.
Mining Top-K Frequent Correlated Subgraph Pairs in Graph Databases 3
3 Basic Concepts
Definition 1. (Pearson's correlation coefficient). Pearson's correlation coefficient
for binary variables is also known as the "φ correlation coefficient". Given two
graphs A and B, the Pearson's correlation coefficient of A and B, denoted φ(A,B),
is defined as follows:

φ(A,B) = (sup(A,B) − sup(A)·sup(B)) / √(sup(A)·sup(B)·(1 − sup(A))·(1 − sup(B)))    (1)

The range of φ(A,B) falls within [−1, 1]. If φ(A,B) is positive, then A and B are
positively correlated, meaning that their occurrence distributions are similar;
otherwise, A and B are negatively correlated, in other words, A and B rarely occur
together.
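Equation (1) can be sketched in Python as follows; this is a minimal illustration with our own function and variable names (the supports are fractions of |GD|), not code from the paper:

```python
import math

def phi(sup_ab, sup_a, sup_b):
    """Pearson's (phi) correlation coefficient of two subgraphs,
    given their joint support and marginal supports, as in Eq. (1)."""
    denom = math.sqrt(sup_a * sup_b * (1 - sup_a) * (1 - sup_b))
    if denom == 0:  # a support of 0 or 1 makes the coefficient undefined
        return 0.0
    return (sup_ab - sup_a * sup_b) / denom

# If sup(A) = sup(B) = 0.5 and sup(A,B) = 0.4, then phi = 0.15 / 0.25:
print(round(phi(0.4, 0.5, 0.5), 3))  # 0.6
```

Note that independent occurrence (sup(A,B) = sup(A)·sup(B)) gives φ = 0, and the sign distinguishes the positive and negative correlations discussed above.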
Definition 2. (Top-k frequent correlated subgraph pair discovery). Given a graph
database GD, a minimum support threshold σ and an integer k, find the k frequent
subgraph pairs with the highest absolute values of the correlation coefficient.
Definition 3. (Co-occurrence frequency matrix). Given a frequent subgraph set
F = {g1, g2, ..., gn}, the co-occurrence frequency matrix, denoted X = (xij) for
i = 1, 2, ..., n and j = 1, 2, ..., n, is defined by

xij = freq(gi, gj) if i ≠ j, and xij = freq(gi) if i = j.    (2)

Obviously, X is an n×n symmetric matrix; due to the symmetry, we need to retain
only the upper triangular part of the matrix.
4 KFCP Algorithm
In this section, we describe the details of KFCP which consists of two steps: co-
occurrence frequency matrix construction and top-k frequent correlated subgraph
pairs extraction.
4.1 Co-occurrence Frequency Matrix Construction
In the co-occurrence frequency matrix construction step, KFCP starts by generating the frequent subgraph set F, then counts the frequency of each subgraph pair (gi, gj) of F. When F contains n frequent subgraphs, the co-occurrence frequency matrix is an n × n matrix, where each entry represents the frequency count of a 1- or 2-element subset of F. The matrix is constructed according to Definition 3 by scanning the database once.
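A minimal sketch of this construction step, assuming each transaction graph has already been reduced to the set of frequent subgraphs it contains (i.e., the subgraph-isomorphism testing is done elsewhere; `build_cooccurrence_matrix` is an illustrative name, not the authors' code):

```python
from itertools import combinations

def build_cooccurrence_matrix(transactions, subgraphs):
    """Build the co-occurrence frequency matrix X of Definition 3.

    transactions: list of sets, each holding the frequent subgraphs
                  contained in one transaction graph (a stand-in for
                  real subgraph-isomorphism checks).
    subgraphs:    ordered list of the frequent subgraphs g1..gn.
    Returns an n x n symmetric matrix with X[i][i] = freq(gi) and
    X[i][j] = freq(gi, gj) for i != j.
    """
    n = len(subgraphs)
    index = {g: i for i, g in enumerate(subgraphs)}
    X = [[0] * n for _ in range(n)]
    for t in transactions:                     # single scan of the database
        present = sorted(index[g] for g in t if g in index)
        for i in present:                      # diagonal: single-subgraph counts
            X[i][i] += 1
        for i, j in combinations(present, 2):  # off-diagonal: joint counts
            X[i][j] += 1
            X[j][i] += 1                       # keep the matrix symmetric
    return X
```

In practice only the upper triangle need be stored, as the text notes.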
Example 1. Fig. 1 shows a graph database GD with |GD| = 10. For minimum support threshold σ = 0.4, we obtain the frequent subgraph set F = {g1, g2, g3, g4, g5} shown in Fig. 2. We can then construct the co-occurrence frequency matrix by scanning each of the transaction graphs. Considering graph transaction 4 (G4), we increment its counts in the matrix as depicted in Fig. 3; incrementing the counts of the other transaction graphs in the same way, we obtain the co-occurrence frequency matrix X shown in Fig. 4. Each off-diagonal element holds the joint frequency of a subgraph pair; for example, the element x34 is the joint frequency of the pair (g3, g4). Each diagonal entry holds the occurrence frequency of a single subgraph. When σ is lowered from 0.4 to 0.3, KFCP generates new frequent subgraph pairs such as (g1, g6); we simply increment the count of cell (g1, g6) to keep the co-occurrence frequency matrix consistent, and this can be done without extra cost.
Fig. 1 A Graph Database GD
Fig. 2 Frequent subgraph set
Fig. 3 Frequency matrix of transaction 4
Fig. 4 Co-occurrence frequency matrix X
Table 1 All pairs with their correlation coefficient
Pairs (g1,g2) (g1,g3) (g1,g4) (g1,g5) (g2,g3) (g2,g4) (g2,g5) (g3,g4) (g3,g5) (g4,g5)
Correlation 0.667 0.667 -0.333 0.272 1 -0.5 -0.102 -0.5 -0.102 0.816
4.2 Top-k Frequent Correlated Subgraph Pairs Extraction
Once the co-occurrence frequency matrix has been generated, the frequency counts of all 1- and 2-element subsets can be computed quickly. Using these frequency counts, KFCP computes the φ correlation coefficient of every frequent subgraph pair, then extracts the k most strongly correlated pairs based on |φ|.
Example 2. According to the matrix elements shown in Fig. 4, to compute φ(g3, g4) we note that freq(g3) = 8, freq(g4) = 5, and freq(g3, g4) = 3, so sup(g3) = 8/10, sup(g4) = 5/10, and sup(g3, g4) = 3/10; using Equation (1) above, we get φ(g3, g4) = −0.5. The φ correlation coefficients of the other subgraph pairs are computed similarly; Table 1 shows them for all pairs. Suppose k = 6. From Table 1, the absolute value of the 6-th pair's correlation coefficient, |φ(TL[k])|, is 0.5. By checking each |φ(gi, gj)| against this bound to decide whether the pair can be pruned, the four pairs (g1, g4), (g1, g5), (g2, g5), (g3, g5) are deleted, and we obtain the top-6 list.
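Equation (1) can be sketched directly from the matrix counts; the check below reproduces φ(g3, g4) = −0.5 from Example 2 (|GD| = 10 as in Fig. 1; `phi` is an illustrative function name):

```python
import math

def phi(freq_a, freq_b, freq_ab, n):
    """Pearson's phi correlation coefficient of Equation (1).

    freq_a, freq_b: occurrence frequencies of subgraphs A and B;
    freq_ab: their joint frequency; n: database size |GD|.
    """
    sup_a, sup_b, sup_ab = freq_a / n, freq_b / n, freq_ab / n
    num = sup_ab - sup_a * sup_b
    den = math.sqrt(sup_a * sup_b * (1 - sup_a) * (1 - sup_b))
    return num / den
```

For instance, `phi(8, 5, 3, 10)` evaluates to −0.5, matching Example 2.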
4.3 Algorithm Descriptions
In this subsection, we give the pseudocode of the KFCP algorithm in ALGORITHM 1. KFCP accepts the graph database GD, a minimum support σ and an integer k as input, and outputs TL, the list of the top-k strongly correlated frequent subgraph pairs. Line 1 initializes an empty list TL of size k. Line 2 enumerates all frequent subgraphs by scanning the entire database once. Line 3 constructs the co-occurrence frequency matrix. Lines 4-9 calculate the correlation coefficient of each surviving pair from the frequent subgraph set and push a pair into the top-k list if its correlation coefficient is greater than that of the k-th pair in the current list.
ALGORITHM 1. KFCP Algorithm
Input: GD: a graph database
σ: a given minimum support threshold
k: the number of most highly correlated pairs requested
Output: TL: the sorted list of k frequent correlated subgraph pairs
1. initialize an empty list TL of size k;
2. scan the graph database to generate the frequent subgraph set F (with input GD and σ);
3. construct the co-occurrence frequency matrix;
4. for each subgraph pair (gi, gj) ∈ F do
5.   compute φ(gi, gj);
6.   if |φ(gi, gj)| ≥ |φ(TL[k])| then
7.     add the pair ((gi, gj), |φ|) at the last position of TL;
8.     sort TL in non-increasing order of the absolute value of the correlation coefficient;
9. Return TL;
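The extraction step can be sketched as follows, assuming the matrix X and |GD| are given; `heapq.nlargest` stands in for the explicit insert-and-sort of lines 6-8 of ALGORITHM 1 (an equivalent but assumed implementation choice):

```python
import heapq
import math

def top_k_pairs(X, n_db, k):
    """Extract the k subgraph pairs with the largest |phi| from the
    co-occurrence frequency matrix X.

    X: symmetric matrix with X[i][i] = freq(gi), X[i][j] = freq(gi, gj);
    n_db: number of transaction graphs |GD|; k: requested pair count.
    Returns [(|phi|, phi, (i, j)), ...] in non-increasing order of |phi|.
    """
    n = len(X)
    scored = []
    for i in range(n):
        for j in range(i + 1, n):          # upper triangle only, by symmetry
            sa, sb = X[i][i] / n_db, X[j][j] / n_db
            sab = X[i][j] / n_db
            den = math.sqrt(sa * sb * (1 - sa) * (1 - sb))
            if den == 0:                   # guard: support 0 or 1 makes phi undefined
                continue
            p = (sab - sa * sb) / den
            scored.append((abs(p), p, (i, j)))
    return heapq.nlargest(k, scored, key=lambda t: t[0])
```

On a 2 × 2 matrix holding the counts of g3 and g4 from Example 2, the single extracted pair has |φ| = 0.5 with a negative φ, as in the text.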
Here, we analyze the KFCP algorithm with respect to completeness and correctness.
Theorem 1. The KFCP algorithm is complete and correct.
Proof. KFCP computes the correlation coefficient of all frequent subgraph pairs by exhaustive search; this guarantees that KFCP is complete. KFCP creates a sorted list of subgraph pairs based on the absolute value of the correlation coefficient and prunes every pair whose absolute correlation coefficient is lower than that of the k-th pair; this ensures that KFCP is correct.
5 Experimental Results
Our experiments are performed on a PC with a 2.1 GHz CPU and 3 GB RAM running Windows XP. KFCP and FCP-Miner are implemented in C++. We test two real datasets: PTE¹ and NCI². PTE contains 340 graphs with an average graph size of 27.4. NCI contains about 249000 graphs; we randomly select 10000 graphs for our experiments, with an average graph size of 19.95.
Since FCP-Miner depends on a minimum correlation threshold θ, to make FCP-Miner produce the same result we set θ to the correlation coefficient of the k-th pair in the top-k list generated by KFCP. Fig. 5 compares KFCP and FCP-Miner on the PTE dataset for different values of k; the correlation coefficient of the k-th pair is shown in Table 2. As k increases, the running time of KFCP remains stable, but the performance of FCP-Miner degrades greatly, since for large k FCP-Miner cannot avoid generating a large number of candidates. Fig. 6 shows the performance comparison between KFCP and FCP-Miner on the NCI dataset for different support thresholds; as σ is varied from 0.3 to 0.03, the running time for enumerating all frequent subgraph pairs increases greatly, so the performance of both KFCP and FCP-Miner degrades. We also analyze the completeness of KFCP by recording the following experimental findings, reported in Table 3: (1) % of excess: the percentage of excess pairs found by KFCP, calculated as (total number of pairs obtained by KFCP / total number of pairs obtained by FCP-Miner − 1); (2) avg φ of excess: the average φ value of the excess pairs. We create six NCI datasets with sizes ranging from 1000 to 10000 graphs; the values of σ and k are fixed at 0.05 and 40, respectively. For k = 40, we set θ to the φ of the 40-th pair from the top-k list generated by KFCP, obtaining θ = 0.8. The results verify that FCP-Miner may miss some frequent correlated subgraph pairs, whereas KFCP is complete. The experimental results confirm the superiority of KFCP in all cases.
Table 2 The correlation coefficient of the k-th pair at varying k
K 10 20 30 40 50 60 70 80 90 100
φ of the k-th pair (θ) 0.95 0.92 0.88 0.85 0.82 0.76 0.74 0.7 0.68 0.65
1 http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/PTE/.
2 http://cactus.nci.nih. gov/ncidb2/download.html.
Table 3 The completeness of KFCP compared to FCP-Miner
the size of NCI 1000 2000 4000 6000 8000 10000
% of excess 2.5% 2.1% 1.7% 2.1% 3.3% 4.9%
avg φ of excess 0.82 0.82 0.82 0.82 0.82 0.82
Fig. 5 Runtime comparison on the PTE dataset
Fig. 6 Runtime comparison on the NCI dataset with different support thresholds
6 Conclusions
In this paper, we present an algorithm, KFCP, for the frequent correlated subgraph mining problem. Compared to the existing algorithm FCP-Miner, KFCP avoids generating any candidate sets. Once the co-occurrence frequency matrix is constructed, the correlation coefficients of all subgraph pairs are computed and the k most strongly correlated subgraph pairs are extracted very easily. Extensive experiments on real datasets confirm the efficiency of our algorithm.
Acknowledgements. This work was supported by the NSF of Gansu Province grant
(1010RJZA117).
References
1. Morishita, S., Sese, J.: Traversing itemset lattice with statistical metric pruning. In: Proc.
of PODS, pp. 226–236 (2000)
2. Xiong, H., Tan, P., Kumar, V.: Hyperclique pattern discovery. DMKD 13(2), 219–242
(2006)
3. Xiong, H., Shekhar, S., Tan, P., Kumar, V.: Exploiting a support-based upper bound of
Pearson’s correlation coefficient for efficiently identifying strongly correlated pairs. In:
Proc. ACM SIGKDD Internat. Conf. Knowledge Discovery and Data Mining, pp. 334–
343. ACM Press (2004)
4. Xiong, H., Brodie, M., Ma, S.: Top-cop: Mining top-k strongly correlated pairs in large
databases. In: ICDM, pp. 1162–1166 (2006)
5. Pan, J.Y., Yang, H.J., Faloutsos, C., Duygulu, P.: Automatic multimedia cross-modal cor-
relation discovery. In: Proc. of KDD, pp. 653–658 (2004)
6. Sakurai, Y., Papadimitriou, S., Faloutsos, C.: Braid: Stream mining through group lag
correlations. In: SIGMOD Conference, pp. 599–610 (2005)
7. Ke, Y., Cheng, J., Ng, W.: Correlation search in graph databases. In: Proc. of KDD, pp.
390–399 (2007)
8. Ke, Y., Cheng, J., Yu, J.X.: Efficient Discovery of Frequent Correlated Subgraph Pairs. In:
Proc. of ICDM, pp. 239–248 (2009)
Evolutionary Approach for Classifier Ensemble:
An Application to Bio-molecular Event
Extraction
Asif Ekbal, Sriparna Saha, and Sachin Girdhar
Abstract. The main goal of Biomedical Natural Language Processing (BioNLP)
is to capture biomedical phenomena from textual data by extracting relevant enti-
ties, information and relations between biomedical entities (i.e. proteins and genes).
Most of the previous works focused on extracting binary relations among proteins.
In recent years, the focus is shifted towards extracting more complex relations in
the form of bio-molecular events that may include several entities or other relations.
In this paper we propose a classifier ensemble based on an evolutionary approach,
namely differential evolution that enables extraction, i.e. identification and classi-
fication of relatively complex bio-molecular events. The ensemble is built on the
base classifiers, namely Support Vector Machine, nave-Bayes and IBk. Based on
these individual classifiers, we generate 15 models by considering various subsets
of features. We identify and implement a rich set of statistical and linguistic fea-
tures that represent various morphological, syntactic and contextual information of
the candidate bio-molecular trigger words. Evaluation on the BioNLP 2009 shared
task datasets show the overall recall, precision and F-measure values of 42.76%,
49.21% and 45.76%, respectively for the three-fold cross validation. This is bet-
ter than the best performing SVM based individual classifier by 4.10 F-measure
points.
1 Introduction
The past history of text mining (TM) shows the great success of different evalu-
ation challenges based on carefully curated resources. Relations among biomed-
ical entities (i.e. proteins and genes) are important in understanding biomedical
Asif Ekbal · Sriparna Saha · Sachin Girdhar
Department of Computer Science and Engineering,
Indian Institute of Technology Patna, India
e-mail: {asif,sriparna,sachin}@iitp.ac.in
A. Abraham and S.M. Thampi (Eds.): Intelligent Informatics, AISC 182, pp. 9–15.
springerlink.com © Springer-Verlag Berlin Heidelberg 2013
phenomena and must be extracted automatically from a large number of published
papers. Similarly to previous bio-text mining challenges (e.g., LLL [1] and BioCre-
ative [2]), the BioNLP’09 Shared Task also addressed bio-IE, but it tried to look
one step further toward finer-grained IE. The difference in focus is motivated in
part by different applications envisioned as being supported by the IE methods. For
example, BioCreative aims to support curation of PPI databases such as MINT [3],
for a long time one of the primary tasks of bioinformatics. The BioNLP’09 shared
task contains simple events and complex events. Whereas the simple events consist
of binary relations between proteins and their textual triggers, the complex events
consist of multiple relations among proteins, events, and their textual triggers. The primary goal of the BioNLP'09 shared task [4] was to support the development of more detailed and structured databases, e.g. pathway or Gene Ontology Annotation (GOA) databases, which are gaining increasing interest in bioinformatics research in response to recent advances in molecular biology.
Classifier ensembles are a popular machine learning paradigm.
We assume that, in the case of weighted voting, the weights should vary over the output classes of each classifier. The weight should be high for an output class for which the classifier is reliable, and low for a class for which it is not. Selecting appropriate per-class vote weights for each classifier is therefore a crucial issue. Here, we attempt to quantify the weight of the vote for each output class of each classifier. A Genetic Algorithm (GA) based classifier ensemble technique was proposed in [5] for determining the proper vote weights in each classifier; it was developed for named entity recognition in Indian languages as well as in English. In this paper we propose a single objective
optimization based classifier ensemble technique built on the principle of differential evolution [6], an evolutionary algorithm that has proved superior to GA in many applications. We optimize the F-measure, the harmonic mean of recall and precision. The proposed approach is evaluated on the extraction of events from biomedical texts and their classification into nine predefined categories, namely gene expression, transcription, protein catabolism, phosphorylation, localization, binding, regulation, positive regulation and negative regulation. We identify and implement a very rich feature set that incorporates morphological, orthographic, syntactic, local-context and global-context features. As base classifiers, we use Support Vector Machine, naïve-Bayes and the instance-based learner IBk. Different versions of these diverse classifiers are built from various subsets of features. Differential evolution is then used as the optimization technique to build an ensemble model combining all these classifiers. Evaluation on the BioNLP 2009 shared task datasets yields recall, precision and F-measure values of 42.76%, 49.21% and 45.76%, respectively, for three-fold cross validation. This is better than the best performing SVM based individual classifier by 4.10 F-measure points.
2 Proposed Approach
The proposed differential evolution based classifier ensemble method is described
below.
String Representation and Population Initialization: Suppose there are N available classifiers and M output classes. The length of the chromosome (or vector) is then N × M; this implies D = N × M, where D is the number of real parameters on which the fitness function depends, and also the dimension of the vector xi,G. Each chromosome encodes the weights of votes for the M possible output classes of each classifier; that is, a chromosome represents the available classifiers together with their per-class weights. As an example, the encoding of a particular chromosome with N = 3 and M = 3 (i.e., 9 votes in total) is: 0.59 0.12 0.56 0.09 0.91 0.02 0.76 0.5 0.21. This chromosome represents the following ensemble: the vote weights of the 3 output classes in classifier 1 are 0.59, 0.12 and 0.56, respectively; similarly, the weights are 0.09, 0.91 and 0.02 in classifier 2 and 0.76, 0.5 and 0.21 in classifier 3. We use real encoding and initialize each of the D entries of a chromosome with a random real value r between 0 and 1. If the population size is P, all P chromosomes of the population are initialized in this way.
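The encoding and initialization above can be sketched with Python's random module; `init_population` is an illustrative name, and N, M, P are as defined in the text:

```python
import random

def init_population(P, N, M, seed=None):
    """Randomly initialize P chromosomes for the DE ensemble.

    Each chromosome has D = N * M real entries in [0, 1]: the vote
    weight of each of the M output classes in each of the N classifiers.
    The entry for classifier m and class i sits at position m * M + i.
    """
    rng = random.Random(seed)
    D = N * M
    return [[rng.random() for _ in range(D)] for _ in range(P)]
```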
Fitness Computation: Initially, all the classifiers are trained using the available
training data and evaluated with the development data. The performance of each
classifier is measured in terms of the evaluation metrics, namely recall, precision
and F-measure. Then, we execute the following steps to compute the objective
values.
1) Suppose there are N classifiers in total. Let the overall F-measure values of these N classifiers on the development set be Fi, i = 1...N.
2) The ensemble is constructed by combining all the classifiers. For the ensemble classifier, the output label of each word in the development data is determined by weighted voting of the N classifiers' outputs. The weight of the class provided by the mth classifier is I(m, i), the entry of the chromosome corresponding to the mth classifier and the ith class. The combined score of a class ci for a particular word w is

f(ci) = Σ I(m, i) × Fm, over all m ∈ {1,...,N} with op(w, m) = ci,

where op(w, m) denotes the output class provided by the mth classifier for the word w. The class receiving the maximum combined score is selected as the joint decision.
3) The overall recall, precision and F-measure values of the ensemble classifier are computed on the development set. For the single objective approach, we use the F-measure value as the objective function, i.e. f0 = F-measure. The main goal is to maximize this objective function using the search capability of DE.
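The weighted vote of step 2 can be sketched as below; `joint_decision` is an illustrative name, `outputs[m]` stands for the class predicted by the m-th classifier for one word, and `F[m]` for that classifier's overall F-measure:

```python
def joint_decision(outputs, F, chrom, M):
    """Combine N classifier outputs by weighted voting (step 2).

    outputs: list of length N; outputs[m] is the class index predicted
             by classifier m for one word.
    F:       list of length N; overall F-measure of each classifier.
    chrom:   flat chromosome of length N * M; chrom[m * M + i] = I(m, i).
    M:       number of output classes.
    Returns the class maximizing the combined score
    f(ci) = sum over m with op(w, m) = ci of I(m, i) * F[m].
    """
    scores = [0.0] * M
    for m, ci in enumerate(outputs):
        scores[ci] += chrom[m * M + ci] * F[m]
    return max(range(M), key=lambda c: scores[c])
```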
Mutation: For each target vector xi,G, i = 1,2,...,NP, a mutant vector is generated according to vi,G+1 = xr1,G + F·(xr2,G − xr3,G), where r1, r2, r3 ∈ {1,2,...,NP} are random indexes, mutually different and also chosen to be different from the running index i, so NP must be greater than or equal to four to allow for this condition. F ∈ [0,2] is a real constant factor which controls the amplification of the differential variation (xr2,G − xr3,G). The vector vi,G+1 is termed the donor vector.
Crossover: In order to increase the diversity of the perturbed parameter vectors, crossover is introduced; this is well known as recombination. To this end, the trial vector ui,G+1 = (u1i,G+1, u2i,G+1, ..., uDi,G+1) is formed, where

uj,i,G+1 = vj,i,G+1 if (randb(j) ≤ CR) or j = rnbr(i),   (1)
uj,i,G+1 = xj,i,G if (randb(j) > CR) and j ≠ rnbr(i),   (2)

for j = 1,2,...,D. In Equation 1, randb(j) is the jth evaluation of a uniform random number generator with outcome in [0,1]. CR is the crossover constant in [0,1], which has to be chosen by the user. rnbr(i) is a randomly chosen index in {1,2,...,D} which ensures that ui,G+1 gets at least one parameter from vi,G+1.
Selection: To decide whether or not the trial vector ui,G+1 should become a member of generation G+1, it is compared to the target vector xi,G using the greedy criterion: if ui,G+1 yields a smaller cost function value than xi,G, then xi,G+1 is set to ui,G+1; otherwise, the old value xi,G is retained.
Termination Condition: In this approach, the processes of mutation, crossover
(or, recombination), fitness computation and selection are executed for a maximum
number of generations. The best string seen up to the last generation provides the
solution to the above classifier ensemble problem. Elitism is implemented at each
generation by preserving the best string seen up to that generation in a location out-
side the population. Thus on termination, this location contains the best classifier
ensemble.
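The mutation, crossover and selection steps above can be combined into a single generation update, sketched below for a generic real-valued population. Since the paper maximizes F-measure, the greedy selection here keeps the trial vector when its fitness is no worse; `de_generation` and the toy fitness used in testing are illustrative, not the authors' implementation:

```python
import random

def de_generation(pop, fitness, F=0.8, CR=0.5, rng=None):
    """One generation of classic DE/rand/1/bin over a real-valued population.

    pop:     list of NP vectors (lists of floats), NP >= 4;
    fitness: callable scoring a vector (higher is better, since the
             paper maximizes F-measure);
    F:       amplification factor in [0, 2];
    CR:      crossover constant in [0, 1].
    """
    rng = rng or random.Random()
    NP, D = len(pop), len(pop[0])
    new_pop = []
    for i, target in enumerate(pop):
        # mutation: three mutually distinct indexes, all different from i
        r1, r2, r3 = rng.sample([r for r in range(NP) if r != i], 3)
        donor = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j]) for j in range(D)]
        # binomial crossover: trial takes at least one donor parameter
        jrand = rng.randrange(D)
        trial = [donor[j] if (rng.random() <= CR or j == jrand) else target[j]
                 for j in range(D)]
        # greedy selection (maximization): keep the better of trial and target
        new_pop.append(trial if fitness(trial) >= fitness(target) else target)
    return new_pop
```

Because selection is greedy, the best fitness in the population can never decrease across generations, which is the elitism-like property the termination paragraph relies on.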
3 Features for Event Extraction
We identify and use the following set of features for event extraction. All these
features are automatically extracted from the training datasets without using any
additional domain dependent resources and/or tools.
• Context words: We use preceding and succeeding few words as the features. This
feature is used with the observation that contextual information plays an important
role in identification of event triggers.
• Root words: Stems of the current and/or the surrounding token(s) are used as features of the event extraction module. Stems of the words were provided with the training, development and test datasets.
• Part-of-Speech (PoS) information: PoS information of the current and/or the surrounding token(s) is effective for event trigger identification. PoS labels of the tokens were provided with the datasets.
• Named Entity (NE) information: NE information of the current and/or surrounding token(s) is used as a feature. NE information was provided with the datasets.
• Semantic feature: This feature is semantically motivated and exploits global context information, based on the content words in the surrounding context. We consider all unigrams in the context w(i−3)...w(i+3) of wi (crossing sentence boundaries) over the entire training data. We convert tokens to lower case and remove stopwords, numbers, punctuation and special symbols. We define a feature vector of length 10 using the 10 most frequent content words. Given a classification instance, the feature corresponding to token t is set to 1 if and only if the context w(i−3)...w(i+3) of wi contains t.
• Dependency features: A dependency parse tree captures the semantic predicate-argument dependencies among the words of a sentence. Dependency paths between protein pairs have successfully been used to identify protein interactions. In this work, we use the dependency paths to extract events. We use the McClosky-Charniak parses, which are converted to the Stanford Typed Dependencies format and provided with the datasets. We define a number of features based on the dependency labels of the tokens.
• Dependency path from the nearest protein: Dependency relations of the path
from the nearest protein are used as the features.
• Boolean valued features: Two boolean-valued features are defined using the dependency path information. The first feature checks whether the current token's child is a preposition and the chunk of that child includes a protein. The second feature fires if and only if the current token's child is a protein and its dependency label is OBJ.
• Shortest path: The distance of the nearest protein from the current token is used as a feature. This is an integer-valued feature equal to the number of tokens between the current token and the nearest protein.
• Word prefix and suffix: Fixed length (say, n) word suffixes and prefixes may be
helpful to detect event triggers from the text. Actually, these are the fixed length
character strings stripped either from the rightmost (for suffix) or from the leftmost
(for prefix) positions of the words. If the length of the corresponding word is less
than or equal to n-1 then the feature values are not defined and denoted by ND. The
feature value is also not defined (ND) if the token itself is a punctuation symbol or
contains any special symbol or digit. This feature is included with the observation
that event triggers share some common suffixes and/or prefixes. In this work, we
consider the prefixes and suffixes of length up to four characters.
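The prefix/suffix feature can be sketched as follows, under the reading that a length-ℓ affix is defined only when the token has at least ℓ characters, and that tokens containing punctuation, special symbols or digits get ND throughout; `affix_features` is an illustrative name:

```python
def affix_features(token, n=4):
    """Fixed-length prefix and suffix features of a token (lengths 1..n).

    Returns (prefixes, suffixes), each a list of n values. A position is
    'ND' (not defined) when the token is shorter than the requested
    affix length, or when the token contains any non-alphabetic
    character (punctuation, digit, special symbol).
    """
    if not token.isalpha():
        return ['ND'] * n, ['ND'] * n
    prefixes, suffixes = [], []
    for length in range(1, n + 1):
        if len(token) >= length:
            prefixes.append(token[:length])
            suffixes.append(token[-length:])
        else:
            prefixes.append('ND')
            suffixes.append('ND')
    return prefixes, suffixes
```

For example, a trigger-like token such as "binding" yields the suffixes "g", "ng", "ing", "ding", capturing the shared "-ing" ending the observation in the text refers to.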
4 Datasets and Experimental Results
We use the BioNLP-09 shared task datasets. The events were selected from the GE-
NIA ontology based on their significance and the amount of annotated instances in
the GENIA corpus. The selected event types all concern protein biology, implying
that they take proteins as their theme. The first three event types concern protein
metabolism that actually represents protein production and breakdown. Phosphory-
lation represents protein modification event whereas localization and binding denote
fundamental molecular events. Regulation and its sub-types, positive and negative
regulations are representative of regulatory events and causal relations. The last five
event types are universal but frequently occur on proteins. Detailed biological inter-
pretations of the event types can be found in Gene Ontology (GO) and the GENIA
ontology. From a computational point of view, the event types represent different
levels of complexity.
Training and development datasets were derived from the publicly available event
corpus [7]. The test set was obtained from an unpublished portion of the corpus. The
shared task organizers made some changes to the original GENIA event corpus.
Irrelevant annotations were removed, and some new types of annotation were added
to make the event annotation more appropriate. The training, development and test
datasets have 176,146, 33,937 and 57,367 tokens, respectively.
4.1 Experimental Results
We generate 15 different classifiers by varying the feature combinations of 3 different base classifiers, namely Support Vector Machine (SVM), K-Nearest Neighbour (IBk) and the Naïve Bayesian classifier. We determine the best configuration using the development set. Due to the non-availability of gold annotated test datasets, we report the final results using 3-fold cross validation. The system is evaluated in terms of the standard recall, precision and F-measure. Evaluation shows the highest performance with an SVM-based classifier, which yields overall recall, precision and F-measure values of 33.17%, 56.00% and 41.66%, respectively.
The dimension of the vector in our experiment is 15 × 19 = 285, where 15 is the number of classifiers and 19 the number of output classes. We construct an ensemble from these 15 classifiers. A Differential Evolution (DE) based ensemble technique is developed that determines the appropriate vote weight of each class in each classifier. With population size P = 100, crossover constant CR = 1.0 and number of generations G = 50, varying F over the range [0,2], we obtain the highest recall, precision and F-value of 42.90%, 47.40% and 45.04%, respectively. We observe that for F < 0.5 the solution converges faster; with 30 generations we can reach the optimal solution (when F < 0.5). For F = 0.0, the solution converges at the very beginning. We observe the highest performance with the settings P = 100, F = 2.0, CR = 0.5 and G = 150. This yields overall recall, precision and F-measure values of 42.76%, 49.21% and 45.76%, an improvement of 4.10 F-measure points over the best individual base classifier, i.e. SVM.
5 Conclusion and Future Works
In this paper we have proposed a differential evolution based ensemble technique for biological event extraction, which involves identification and classification of complex bio-molecular events. The proposed approach is evaluated on the benchmark dataset of the BioNLP 2009 shared task, where it achieves an F-measure of 45.76%, an improvement of 4.10 points over the best individual classifier.
Overall evaluation results suggest that there is still room for further improvement. In this work, we have treated identification and classification as a single-step problem; in future work we would like to treat them as separate problems. We would also like to investigate distinct and more effective feature sets for event identification and for classification, and to devise an appropriate feature selection algorithm. In future work, we would also like to identify the arguments of these events.
References
1. Nédellec, C.: Learning Language in Logic -Genic Interaction Extraction Challenge. In:
Cussens, J., Nédellec, C. (eds.) Proceedings of the 4th Learning Language in Logic Work-
shop, LLL 2005, pp. 31–37 (2005)
2. Hirschman, L., Krallinger, M., Valencia, A. (eds.): Proceedings of the Second BioCre-
ative Challenge Evaluation Workshop. CNIO Centro Nacional de Investigaciones
Oncológicas (2007)
3. Chatr-aryamontri, A., Ceol, A., Palazzi, L.M., Nardelli, G., Schneider, M.V., Castag-
noli, L., Cesareni, G.: MINT: the Molecular INTeraction database. Nucleic Acids Re-
search 35(suppl. 1), 572–574 (2007)
4. Kim, J.-D., Ohta, T., Pyysalo, S., Kano, Y., Tsujii, J.: Overview of BioNLP 2009 shared
task on event extraction. In: BioNLP 2009: Proceedings of the Workshop on BioNLP, pp.
1–9 (2009)
5. Ekbal, A., Saha, S.: Weighted Vote-Based Classifier Ensemble for Named Entity Recogni-
tion: A Genetic Algorithm-Based Approach. ACM Trans. Asian Lang. Inf. Process. 10(2),
9 (2011)
6. Storn, R., Price, K.: Differential Evolution - A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. J. of Global Optimization 11(4), 341–359 (1997), doi:10.1023/A:1008202821328
7. Kim, J.-D., Ohta, T., Tsujii, J.: Corpus annotation for mining biomedical events from
literature. BMC Bioinformatics 9, 10 (2008)
A. Abraham and S.M. Thampi (Eds.): Intelligent Informatics, AISC 182, pp. 17–27.
springerlink.com © Springer-Verlag Berlin Heidelberg 2013
A Novel Clustering Approach Using Shape
Based Similarity*
Smriti Srivastava, Saurabh Bhardwaj, and J.R.P. Gupta
Abstract. The present research proposes a paradigm for the clustering of data in which no prior knowledge about the number of clusters is required. Here, shape based similarity is used as the index of similarity for clustering. The paper exploits the pattern identification prowess of the Hidden Markov Model (HMM) and overcomes a few of the problems associated with distance based clustering approaches. In the present research, partitioning of data into clusters is done in two steps: in the first step, an HMM is used for finding the number of clusters; in the second step, the data are classified into clusters according to their shape similarity. Experimental results on synthetic datasets and on the Iris dataset show that the proposed algorithm outperforms several commonly used clustering algorithms.
Keywords: Clustering, Hidden Markov Model, Shape Based similarity.
1 Introduction
Cluster analysis is a method of creating groups of objects or clusters in such a way
that the objects in one cluster are very similar to each other while the objects in dif-
ferent clusters are quite different. Data clustering algorithms could be generally
classified into the following categories [1]: Hierarchical clustering, Fuzzy cluster-
ing, Center based clustering, Search based clustering, Graph based clustering, Grid
based clustering, Density based clustering, Subspace clustering, and Model based
clustering algorithms. Every clustering algorithm is based on the index of similarity
or dissimilarity between data points. Many authors have used the distances as the
index of similarity. Commonly used distances are Euclidean distance, Manhattan
distance, Minkowski distance, and Mahalanobis distance [2] . As shown in [3] the
distance functions are not always adequate for capturing correlations among the ob-
jects. It is also shown that strong correlations may still exist among a set of objects
even if they are far apart from each other. HMMs are the dominant models for sequential data. Although HMMs have been extensively used in speech recognition,
Smriti Srivastava · Saurabh Bhardwaj · J.R.P. Gupta
Netaji Subhas Institute of Technology, New Delhi – 110078, India
e-mail:{bsaurabh2078,jairamprasadgupta}@gmail.com,
ssmriti@yahoo.com
pattern recognition and time series prediction problems but they have not been
widely used for the clustering problems and only few papers can be found in the li-
terature. Many researchers have used single sequences to train the HMMs and pro-
posed different distance measures based on a likelihood matrix obtained from these
trained HMMs. Clustering of sequences using HMMs was introduced in [4], where a log-likelihood (LL) based scheme for automatically determining the number of clusters in the data was proposed. A similarity based clustering of sequences
using HMMs is presented in [5]. In this approach, a new representation space is built in which each object is described by the vector of its similarities with respect to a predetermined set of other objects. These similarities are determined using the LL values of HMMs. A single-HMM based clustering method was proposed in [6], which utilized LL values as the similarity measure between data points. The method was useful for finding the number of clusters in the data set with the help of LL values, but it is hard to actually assign the data elements to the clusters, as the threshold for the clusters was estimated by simply inspecting the graph of LL.
The present research proposes an HMM based unsupervised clustering algorithm which uses shape similarity as a measure to capture the correlation among the objects. It also automatically determines the number of clusters in the data. Here the hidden state information of the HMM is utilized as a tool to obtain the similar patterns among the objects.
The rest of the paper is organized as follows. Sect. 2 briefly describes the HMM. Sect. 3 details the proposed shape based clustering paradigm. In Sect. 4, experimental results are provided to illustrate the effectiveness of the proposed model. Finally, conclusions are drawn in Sect. 5.
2 Hidden Markov Model
Hidden Markov Model (HMM) [7][8] springs forth from Markov processes or Markov chains. It is a canonical probabilistic model for sequential or temporal data. It depends upon the fundamental Markov property: the future is independent of the past given the present. An HMM is a doubly embedded stochastic
process, where the final output of the system at a particular instant of time depends upon the state of the system and the output generated by that state. There are two types of HMMs, distinguished by the type of data they operate upon: Discrete HMMs (DHMMs) operate on quantized data or symbols, whereas Continuous Density HMMs (CDHMMs) operate on continuous data and their emission matrices are
the distribution functions. An HMM consists of the following parameters:
O = {O1, O2, ..., OT} : Observation sequence
Z = {Z1, Z2, ..., ZT} : State sequence
T : Transition matrix
B : Emission matrix/function
π : Initialization matrix
λ(T, B, π) : Model of the system
\mathcal{Z} : Space of all state sequences of length T
m = {m_{q_1}, m_{q_2}, ..., m_{q_T}} : Mixture component for each state at each time
c_{il}, μ_{il}, Σ_{il} : Mixture weight, mean and covariance (state i, component l)
There are three major design problems associated with an HMM. Given the observation sequence {O1, O2, ..., OT} and the model λ(T, B, π), the first problem is the computation of the probability of the observation sequence, P(O|λ). The second is to find the most probable state sequence Z = {Z1, Z2, ..., ZT}. The third problem is the choice of the model parameters λ(T, B, π) such that the probability of the observation sequence, P(O|λ), is maximized.
The solution to the above problems emerges from three algorithms: Forward,
Viterbi and Baum-Welch [7].
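The first problem, evaluating P(O|λ), can be sketched concretely. Below is a minimal forward-algorithm sketch for a discrete HMM; the toy 2-state model and its parameter values are purely illustrative, not taken from the paper:

```python
import numpy as np

def forward(obs, T, B, pi):
    """Forward algorithm: compute P(O | lambda) for a discrete HMM.
    obs: sequence of observation symbol indices
    T: state transition matrix (N x N), B: emission matrix (N x K),
    pi: initial state distribution (N,)."""
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ T) * B[:, o]    # induction: sum over previous states
    return alpha.sum()                   # P(O | lambda)

# Hypothetical 2-state model for illustration.
T = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
p = forward([0, 1, 0], T, B, pi)
```

The same routine, applied row by row, yields the per-row LL values (log of `p`) used later for estimating the number of clusters.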
2.1 Continuous Density HMM
Let O = {O1, O2, ..., OT} be the observation sequence and Z = {Z1, Z2, ..., ZT} be the hidden state sequence. We briefly define the Expectation Maximization (EM) algorithm for finding the maximum-likelihood estimate of the parameters of an HMM given a set of observed feature vectors. The EM algorithm is a method for approximately obtaining the maximum a posteriori estimate when some of the data are missing, as in an HMM, in which the observation sequence is visible but the states are
hidden or missing. The Q function is generally defined as

Q(\lambda, \lambda') = \sum_{z \in \mathcal{Z}} \log P(O, z \mid \lambda) \; P(O, z \mid \lambda')    (1)

To define the Q function for the Gaussian mixtures, we need the hidden variable for the mixture component along with the hidden state sequence. These are provided by the E-step and the M-step of the EM algorithm:

E-step:

Q(\lambda, \lambda') = \sum_{z \in \mathcal{Z}} \sum_{m \in M} \log P(O, z, m \mid \lambda) \; P(O, z, m \mid \lambda')    (2)

M-step:

\lambda' = \arg\max_{\lambda} \, [Q(\lambda, \lambda')] + \text{constraint}    (3)

The optimized equations for the parameters of the mixture density are

\mu_{il} = \frac{\sum_{t=1}^{T} O_t \, P(z_t = i, m_{z_t t} = l \mid O, \lambda')}{\sum_{t=1}^{T} P(z_t = i, m_{z_t t} = l \mid O, \lambda')}    (4)

\Sigma_{il} = \frac{\sum_{t=1}^{T} (O_t - \mu_{il})(O_t - \mu_{il})^{T} \, P(z_t = i, m_{z_t t} = l \mid O, \lambda')}{\sum_{t=1}^{T} P(z_t = i, m_{z_t t} = l \mid O, \lambda')}    (5)

c_{il} = \frac{\sum_{t=1}^{T} P(z_t = i, m_{z_t t} = l \mid O, \lambda')}{\sum_{t=1}^{T} \sum_{l=1}^{M} P(z_t = i, m_{z_t t} = l \mid O, \lambda')}    (6)
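The re-estimation formulas (4)-(6) can be sketched in code, assuming the joint posteriors P(z_t = i, m = l | O, λ') have already been computed by the E-step and are available in a `gamma` array (a hypothetical name; shape T x N x M):

```python
import numpy as np

def mstep_mixture(O, gamma):
    """Re-estimate mixture means, covariances and weights (Eqs. 4-6).
    O: observations (T x D); gamma[t, i, l] = P(z_t = i, m = l | O, lambda')."""
    Tn, D = O.shape
    _, N, M = gamma.shape
    denom = gamma.sum(axis=0)                     # sum_t of the posteriors
    # Eq. (4): posterior-weighted means
    mu = np.einsum('til,td->ild', gamma, O) / denom[..., None]
    # Eq. (5): posterior-weighted covariances
    sigma = np.empty((N, M, D, D))
    for i in range(N):
        for l in range(M):
            diff = O - mu[i, l]                   # (T x D) deviations
            sigma[i, l] = (gamma[:, i, l, None, None]
                           * np.einsum('td,te->tde', diff, diff)).sum(0) / denom[i, l]
    # Eq. (6): mixture weights, normalized over components l
    c = denom / denom.sum(axis=1, keepdims=True)
    return mu, sigma, c
```

This is only the M-step half of one EM iteration; in a full implementation it alternates with an E-step that recomputes `gamma` via the forward-backward recursions.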
3 Shape Based Clustering
Generally, different distance functions such as Euclidean distance, Manhattan distance, and cosine distance are employed for clustering data, but these distance functions are not always effective at capturing the correlations among the objects. In fact, strong correlations may still exist among a set of objects even if their distances are far apart as measured by the distance functions. Fig. 1 shows '4' objects with '5' attributes, among a set of 300 objects, that were allotted to different clusters when segmental k-means was applied to partition them into six clusters. As is clear from Fig. 1, these objects have the same shape pattern and are strongly correlated with each other, as shown by the correlation matrix between the '4' data elements in Table 1. Motivated by this, the present research extends the basic concept of the Shape Based Batching (SBB) procedure introduced in [9],[10]. Earlier it was shown that, by carefully observing the datasets and their corresponding log-likelihoods (LL), it is possible to find the shape of the input variation for a certain value of log-likelihood; however, detecting the shape by simple observation is not always easy. Moreover, in some datasets it is very difficult to determine the threshold for the batch allocation. Although the states are hidden, for many practical applications there is often some physical significance attached to the states of the model. In the present research it is found that the patterns of objects corresponding to any particular state of the HMM are highly correlated, and have a different pattern from (are uncorrelated with) the objects corresponding to any other state. The concept of SBB is therefore modified: in the modified SBB the shape is a function of the state and not of the log-likelihoods.
Table 1 Correlation among different row vectors
Fig. 1 Clustering results with the segmental K-means
Data-1 Data-2 Data-3 Data-4
Data-1 1.000 0.988 0.955 0.905
Data-2 0.988 1.000 0.989 0.959
Data-3 0.955 0.989 1.000 0.990
Data-4 0.905 0.959 0.990 1.000
Here an unsupervised clustering algorithm is proposed; an important point is that the number of clusters is not fixed in advance, and the algorithm automatically decides the number of clusters. The whole procedure is shown in Fig. 2. First, the number of clusters in the data set is obtained, as follows. Estimate the HMM model parameters λ(T, B, π) for the entire input dataset using the Baum-Welch/Expectation Maximization algorithm, for appropriate values of the states 'Z' and mixture components 'm'. Once the HMM has been trained, the forward algorithm is used to compute the value of P(O|λ), which can then be used to calculate the LL of each row of the dataset. Sorting the LL values in ascending (descending) order gives a clear indication of the number of clusters in the dataset.
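The cluster-count step can be sketched as follows: sort the per-row LL values and count the large jumps between consecutive values. The jump heuristic (gap much larger than the median gap) and the `jump_factor` value are assumptions for illustration; the paper itself does not specify a rule:

```python
import numpy as np

def count_clusters(ll, jump_factor=5.0):
    """Estimate the number of clusters from per-row log-likelihood values.
    A cluster boundary is declared wherever the gap between consecutive
    sorted LL values greatly exceeds the median gap (heuristic)."""
    s = np.sort(np.asarray(ll, dtype=float))
    gaps = np.diff(s)
    med = np.median(gaps)
    boundaries = np.sum(gaps > jump_factor * max(med, 1e-12))
    return int(boundaries) + 1        # k boundaries -> k+1 clusters

# Toy LL values: three well-separated bands, as a 3-cluster dataset gives.
ll = [0.1, 0.4, 0.8, 100.2, 101.5, 104.0, 200.1, 203.3, 205.9]
k = count_clusters(ll)
```

Note that when the LL bands of different classes overlap (as in Table 2 of the paper), this sorted-LL view still reveals the number of clusters but cannot cleanly assign elements to them, which is exactly the drawback the shape based step addresses.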
Fig. 2 Procedure for Shape Based Clustering
After obtaining the information about the number of clusters, initialize the parameters of the HMM. This includes initialization of the transition matrix 'T', the initialization matrix 'π', and the mixture component 'm' for each state. Take the number of states equal to the number of clusters. The Continuous Density Hidden Markov Model (CDHMM) is trained using the Baum-Welch/Expectation Maximization algorithm on the entire input dataset. After freezing the HMM parameters, the next step is to find the optimal state sequence with the help of the Viterbi algorithm, taking the entire input dataset as the 'D'-dimensional observation vector sequence. The observation sequence and the corresponding optimal state sequence are now obtained. At this point one important thing is observed: the data vectors associated with the same state have identical shape, while the data vectors with different states have no similarity in their shapes. So once the optimal value of the hidden state sequence is deduced, the next
step is to put the data into clusters according to their state. Each cluster now contains data of almost identical shape, but by simply observing the clusters it is difficult to verify the required shape based similarity, so the appropriate values of 'Z' and 'm' for the required shape based clusters are obtained by calculating the correlation coefficients among the data vectors in the clusters. Here the Pearson R model [11] comes in handy for finding the coherence (correlation) among a set of objects. The correlation between two objects 'x1' and 'x2' is defined as:
\mathrm{corr}(x_1, x_2) = \frac{\sum_{i=1}^{D} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sqrt{\sum_{i=1}^{D} (x_{1i} - \bar{x}_1)^2 \; \sum_{i=1}^{D} (x_{2i} - \bar{x}_2)^2}}    (7)
where \bar{x}_1 and \bar{x}_2 are the means of all attribute values in 'x1' and 'x2', respectively. It may be noted that Pearson R measures the correlation between two objects with respect to all the attribute values. A large positive value indicates a strong positive correlation, while a large negative value indicates a strong negative correlation. The correlation coefficient can therefore be used as a threshold on the similarity between the data vectors in a cluster, and by using this threshold the appropriate values of 'Z' and 'm' can be determined for the shape based clusters. Using these basic criteria, an algorithm was developed which arranges the data into clusters.
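This threshold check can be performed directly with a correlation matrix over the row vectors; `numpy.corrcoef` implements exactly the Pearson R of Eq. (7). The row vectors and the 0.9 tolerance below are illustrative, not the paper's data:

```python
import numpy as np

# Four row vectors with the same rising 'shape' but different scales/offsets.
rows = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],
    [0.5, 1.5, 2.5, 3.5, 4.5],
    [10.0, 20.0, 30.0, 40.0, 50.0],
])
corr = np.corrcoef(rows)                     # pairwise Pearson R (Eq. 7)
min_corr = corr[np.triu_indices_from(corr, k=1)].min()
same_shape = bool(min_corr >= 0.9)           # threshold check from the text
```

Because Pearson R is invariant to shifting and scaling each vector, rows that share a shape but differ in magnitude (as in Fig. 1) still correlate near 1.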
3.1 Steps for Shape Based Clustering Algorithm
Step 1: Take the number of states equal to the number of clusters and estimate the HMM parameters λ(T, B, π) for the entire input dataset, taking an appropriate value of the mixture components 'm'.
Step 2: Calculate the optimal value of the hidden state sequence with the help of the Viterbi algorithm, taking the input as a 'D'-dimensional observation vector.
Step 3: Rearrange the complete dataset according to their state values.
Step 4: Calculate correlation matrix by using the Pearson R model as in (7).
Step 5: Change the value of 'm' and repeat steps 1-4 until the required tolerance of correlation is achieved.
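Steps 3-4 (and the Step 5 stopping test) can be sketched as follows, with the HMM training and Viterbi decode of Steps 1-2 abstracted into a given state sequence. This is a simplification for illustration; in the paper the states come from Baum-Welch training followed by Viterbi decoding:

```python
import numpy as np

def shape_clusters(data, states, tol=0.95):
    """Steps 3-4: group rows by decoded state, then check that each cluster's
    minimum pairwise Pearson correlation meets the tolerance `tol`.
    Returns (clusters, ok); if ok is False, Step 5 says to change 'm'
    and repeat Steps 1-4."""
    clusters = {}
    for row, s in zip(data, states):
        clusters.setdefault(s, []).append(row)
    ok = True
    for s, rows in clusters.items():
        rows = np.asarray(rows)
        if len(rows) >= 2:
            corr = np.corrcoef(rows)                      # Pearson R (Eq. 7)
            if corr[np.triu_indices_from(corr, k=1)].min() < tol:
                ok = False
    return clusters, ok

# Two shape groups: rising rows (state 0) and falling rows (state 1).
data = np.array([[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [8, 6, 4, 2]])
states = [0, 0, 1, 1]
clusters, ok = shape_clusters(data, states)
```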
The effectiveness of the proposed model can be demonstrated on the Iris plants database. The data set contains '3' classes of '50' instances each, where each class refers to a type of Iris plant. Fig. 3 shows the patterns of the Iris data before clustering. As a first step, the entire Iris data is trained with the Baum-Welch/Expectation Maximization algorithm. Once the HMM has been trained, the forward algorithm is used to calculate the LL of each row of the dataset. Fig. 4 shows the graph of LL values sorted in ascending order. As is clear from Fig. 4, the number of clusters can be read off, but choosing the threshold value for allocating the data to the clusters by simply inspecting the LL graph (Fig. 4) is not possible. This is the main drawback of previous approaches, which is removed in the present research. After obtaining the number of clusters, the shape based clustering approach is applied as described earlier. After applying steps 1-5 of the proposed algorithm, Table 2 is obtained. The
Table 2 States and LL values of Iris data
Attributes
5.1 4.9 4.7 4.6 5.0 6.3 5.8 7.1 6.3 6.5 7.0 6.4 6.9 5.5 6.5
3.5 3.0 3.2 3.1 3.6 3.3 2.7 3.0 2.9 3.0 3.2 3.2 3.1 2.3 2.8
1.4 1.4 1.3 1.5 1.4 6.0 5.1 5.9 5.6 5.8 4.7 4.5 4.9 4.0 4.6
0.2 0.2 0.2 0.2 0.2 2.5 1.9 2.1 1.8 2.2 1.4 1.5 1.5 1.3 1.5
No. 1 2 3 4 5 51 52 53 54 55 101 102 103 104 105
LL 0.6 0.1 0.2 0.1 0.5 335.5 214.2 309.7 158.1 298.1 162.4 142.5 181.9 109.3 158.8
States 1 1 1 1 1 3 3 3 3 3 2 2 2 2 2
Table 3 Actual parameters of the model
Sigma(:,:,1) Mean
0.2706 0.0833 0.1788 0.0545 5.936 5.006 6.589
0.0833 0.1064 0.0812 0.0405 2.770 3.428 2.974
0.1788 0.0812 0.2273 0.0723 4.262 1.462 5.553
0.0545 0.0405 0.0723 0.0487 1.327 0.246 2.026
Sigma(:,:,2) Initial Matrix
0.1318 0.0972 0.0160 0.0101 0.000 1.000 0.000
0.0972 0.1508 0.0115 0.0091
0.0160 0.0115 0.0396 0.0059 States =3
0.0101 0.0091 0.0059 0.0209
Sigma(:,:,3) Transition Matrix
0.4061 0.0921 0.2972 0.0479 1.000 0.000 0.000
0.0921 0.1121 0.0701 0.0468 0.000 0.980 0.020
0.2972 0.0701 0.3087 0.0477 0.020 0.000 0.980
0.0479 0.0468 0.0477 0.0840
description of this table is as follows: Rows 1 to 4 show the 4 attribute values of the Iris data. Row 5 shows the data vector number, Row 6 shows the LL values corresponding to the data vectors, and Row 7 shows the optimized state value associated with each particular data vector. Due to the limitation of page width it is not possible to show the complete table, so only '5' values of each class are shown. The values of LL in the table are displayed only to show the effectiveness of our method over the LL based clustering method. As is clear from the table, the LL
value of data element '54' is almost equal to the LL value of data element '105', yet these two elements belong to two different clusters. Hence it can be said that the LL based clustering method is not adequate, while it is clear from Table 2 that the 'states' partition the data accurately; on this dataset (the Iris dataset) the misclassification is zero, meaning we get 100% accuracy.
The plots of the three clusters obtained after applying the proposed algorithm are shown in Fig. 5, and the actual parameters of the model are shown in Table 3.
Fig. 3 Iris Plant Data Patterns
Fig. 4 Iris Data LL Values
4 Experimental Results
To show the effectiveness of the proposed method, it is applied to both synthetic data and real-world data.
Table 4 Parameters for synthetic data generation

Class-1:  T = [1/3 1/3 1/3; 1/3 1/3 1/3; 1/3 1/3 1/3],  π = [1/3 1/3 1/3],
          B: μ1 = 1, μ2 = 3, μ3 = 5,  σ1² = σ2² = σ3² = 0.6
Class-2:  same T and π,  B: μ1 = 1, μ2 = 3, μ3 = 5,  σ1² = σ2² = σ3² = 0.5
Class-3:  same T and π,  B: μ1 = 1, μ2 = 3, μ3 = 5,  σ1² = σ2² = σ3² = 0.4
4.1 Synthetic Data
The description of the synthetic data is given in [5]. The data contains 3 classes. The training set is composed of 30 sequences (of length 400) from each of the three classes, generated by 3 HMMs. The parameters of the synthetic data are shown in Table 4. The comparison of results with the previous approaches is shown in Table 5; the results in the first three rows are taken from [5].
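Sequences of this kind can be generated as follows, a sketch assuming the Class-1 parameters of Table 4 (uniform T and π, Gaussian emissions with means 1, 3, 5 and variance 0.6):

```python
import numpy as np

def generate_sequence(length, T, pi, means, var, rng):
    """Sample one observation sequence from a 3-state Gaussian HMM."""
    std = np.sqrt(var)
    state = rng.choice(len(pi), p=pi)            # draw initial state from pi
    obs = []
    for _ in range(length):
        obs.append(rng.normal(means[state], std))  # Gaussian emission
        state = rng.choice(T.shape[0], p=T[state])  # transition
    return np.array(obs)

rng = np.random.default_rng(0)
T = np.full((3, 3), 1 / 3)
pi = np.full(3, 1 / 3)
seq = generate_sequence(400, T, pi, means=[1.0, 3.0, 5.0], var=0.6, rng=rng)
```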
4.2 Real Data
We have tested the proposed approach on the classical Iris Plants Database. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant (Iris virginica, Iris versicolor, Iris setosa). The dataset consists of the following four attributes: sepal length, sepal width, petal length, and petal width. The comparison of results with the previous approaches is shown in Table 6. The Errors column lists the number of data points that are classified wrongly. The first four rows of the table are taken from [12].
Fig. 5 Iris Data Cluster
Table 5 Comparison of previous methods
Learning Algorithm Accuracy (%)
MLOPS [5] 95.7
1 – NN on S T [5] 98.9
1 – NN on S T [5] 98.9
Shape Based Clustering 98.888
Table 6 Comparison of previous methods
5 Conclusion
The present research proposes a novel clustering approach based on shape similarity. The paper shows that distance functions are not always adequate for clustering data, and that strong correlations may still exist among data points even if they are far apart from each other. The method is applied in a two-phase sequential manner. In the first phase, an HMM is fitted to the dataset to yield HMM parameters, assuming a certain number of states and Gaussian mixtures. Then the log-likelihood values are obtained from the forward algorithm; the sorted log-likelihood values give a clear indication of the number of clusters in the dataset. Next, the shape based clustering algorithm is applied to cluster the dataset. The method overcomes the problem of finding the threshold value in LL based clustering algorithms. The proposed method is tested on real (Iris) data as well as on the synthetic dataset. The simulation results are very encouraging; the method gives 100% accuracy on the Iris dataset and about 99% accuracy on the synthetic test data. Further, the shortcoming of previous HMM based clustering approaches, in which the number of HMMs required was equal to the number of sequences/classes [4],[5], is removed by utilizing only a single HMM for clustering, thereby reducing the computational time and complexity considerably.
References
[1] Gan, G., Ma, C., Wu, J.: Data Clustering: Theory, Algorithms, and Applications.
Society for Industrial and Applied Mathematics, Philadelphia (2007)
[2] Xu, R., Wunsch, D.I.: Survey of clustering algorithms. IEEE Transactions on Neur-
al Networks 16(3), 645–678 (2005)
[3] Wang, H., Pei, J.: Clustering by Pattern Similarity. Journal of Computer Science
and Technology 23(4), 481–496 (2008)
[4] Smyth, P.: Clustering sequences with hidden Markov models. Advances in Neural
Information Processing Systems 9, 648–654 (1997)
[5] Bicego, M., Murino, V., Figueiredo, M.A.: Similarity-based classification of se-
quences using hidden Markov models. Pattern Recognition 37(12), 2281–2291
(2004)
[6] Hassan, R., Nath, B.: Stock market forecasting using hidden markov model. In: Pro-
ceedings of the Fifth International Conference on Intelligent Systems Design and
Application, pp. 192–196 (2005)
Learning Algorithm                                    Errors   Accuracy (%)
FCM (Fuzzy c-means)                                   16       89.33
SWFCM (Sample Weighted Robust Fuzzy c-means)          12       92
PCM (Possibilistic c-means)                           50       66.6
PFCM (Possibilistic Fuzzy c-means)                    14       90.6
Shape Based Clustering                                15       100
[7] Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
[8] Bilmes, J.A.: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. International Computer Science Institute, Berkeley, California, Technical Report ICSI-TR-97-021 (1998)
[9] Srivastava, S., Bhardwaj, S., Madhvan, A., Gupta, J.R.P.: A Novel Shape Based
Batching and Prediction approach for Time series using HMMs and FISs. In: 10th
International Conference on Intelligent Systems Design and Applications, Cairo,
Egypt, pp. 929–934 (2010)
[10] Bhardwaj, S., Srivastava, S., Madhvan, A., Gupta, J.R.P.: A Novel Shape Based
Batching and Prediction approach for Sunspot Data using HMMs and ANNs. In: In-
dia International Conference on Power Electronics, New Delhi, India, pp. 1–5
(2011)
[11] Shardanand, U., Maes, P.: Social information filtering: Algorithms for automating "word of mouth". In: Proceedings of ACM CHI, pp. 210–217 (1995)
[12] Xia, S.-X., Han, X.-D., Liu, B., Zhou, Y.: A Sample-Weighted Robust Fuzzy C-
Means Clustering Algorithm. Energy Procedia (13), 3924–3931 (2011)
A. Abraham and S.M. Thampi (Eds.): Intelligent Informatics, AISC 182, pp. 29–39.
springerlink.com © Springer-Verlag Berlin Heidelberg 2013
Knowledge Discovery Using Associative
Classification for Heart Disease Prediction*
M.A. Jabbar, B.L. Deekshatulu, and Priti Chandra
Abstract. Associative classification is a technique, used in knowledge discovery and decision support systems, which integrates association rule discovery methods and classification into a model for prediction. An important advantage of these classification systems is that, using association rule mining, they are able to examine several features at a time. Associative classifiers are especially fit for applications where the model may assist domain experts in their decisions. Cardiovascular diseases are the number one cause of death globally: an estimated 17.3 million people died from CVD in 2008, representing 30% of all global deaths. India is at risk of more deaths due to CHD, and cardiovascular disease is becoming an increasingly important cause of death in Andhra Pradesh. Hence a decision support system is proposed for predicting heart disease in a patient. In this paper we propose a new associative classification algorithm for predicting heart disease for the Andhra Pradesh population. Experiments show that the accuracy of the resulting rule set is better when compared to existing systems. This approach is expected to help physicians make accurate decisions.
Keywords: Andhra Pradesh, Associative classification, Data mining, Heart
disease.
1 Introduction
The major reason that data mining has attracted a great deal of attention in the information industry in recent years is the wide availability of huge amounts of data and the imminent need for turning such data into useful information
M.A. Jabbar
JNTU Hyderabad
e-mail: jabbar.meerja@gmail.com
B.L. Deekshatulu
Distinguished fellow IDRBT, RBI Govt of India
Priti Chandra
Senior Scientist Advanced System Laboratory, Hyderabad
30 M.A. Jabbar, B.L. Deekshatulu, and P. Chandra
and knowledge. The information gained can be used for applications ranging from business management, production control, and market analysis to emerging design and science exploration and health data analysis. Data mining, also known as knowledge discovery in databases (KDD), is the process of automatically discovering useful information in large data repositories [1]. Association rule mining and classification are analogous tasks in data mining, with the exception that the main aim of classification is to build a classifier using some training instances for predicting the classes of new instances, while association rule mining discovers associations between attribute values in a data set. Association rule mining uses unsupervised learning, whereas classification uses supervised learning. The majority of traditional classification techniques use heuristic-based strategies for building the classifier [2]. In constructing a classification system they look for rules with high accuracy. Once a rule is created, they delete all positive training objects associated with it. Thus these methods often produce a small subset of rules, and may miss detailed rules that might play an important role in some cases. The heuristic methods employed by traditional classification techniques often use domain independent biases to derive a small set of rules, and therefore the rules generated by them are different in nature and more complex than those that users might expect or be able to interpret [3]. Both classification rule mining and association rule mining are indispensable to practical applications. Thus, great savings and convenience to the user could result if the two mining techniques could somehow be integrated.
Associative classification (AC) is a recent and rewarding technique that applies the methodology of association rule mining to classification and achieves higher classification accuracy than traditional classification techniques; many of the rules found by AC methods cannot be discovered by traditional classification algorithms. AC generally involves two stages:
1) Generate class association rules from a training data set.
2) Classify the test data set into predefined class labels.
The various phases in associative classification are rule generation, rule pruning, rule ranking, rule sorting, model construction, and prediction. The rule generation phase in associative classification is a hard step that requires a large amount of computation. A rich rule set is constructed after applying suitable rule pruning and rule ranking strategies. This rule set, generated from the training data set, is used to build a model with which to predict the test cases present in the test data set.
Coronary heart disease (CHD) is epidemic in India and one of the major causes of disease burden and deaths. Mortality data from the Registrar General of India show that cardiovascular diseases are now a major cause of death in India. Studies to determine the precise causes of death in Andhra Pradesh have revealed that cardiovascular diseases cause about 30% of deaths in rural areas [4]. Medical diagnosis is regarded as an important yet complicated task that needs to be executed accurately and efficiently, and its automation would be extremely advantageous. Medical history data comprise a number of tasks essential to diagnosing a particular disease. It is possible to acquire knowledge and information concerning a disease from patient-specific stored measurements as far as
medical data is concerned. Therefore data mining has developed into a vital domain in health care [5]. A classification system can assist the physician in examining a patient. The system can predict whether the patient is likely to have a certain disease or to present incompatibility with some treatments. Associative classification is a better alternative for predictive analysis [6]. This paper proposes a new associative classification method; considering the classification model, the physician can make a better decision.
Basic concepts in associative classification and heart disease are discussed in Sections 2 and 3, and common algorithms are surveyed in Section 4. Section 5 describes our proposed method. Experimental results and comparisons are demonstrated in Section 6. We conclude with final remarks in Section 7.
2 Associative Classification
According to [7], the AC problem is defined as follows. Let a training data set D have M distinct attributes A1, A2, ..., Am, and let C be a list of class labels. The number of rows in D is denoted |D|. Attributes can be categorical or continuous. In the case of categorical attributes, all possible values are mapped to a set of positive integers. For continuous attributes, a discretisation method is first used to transform these attributes into categorical ones.
Definition 1: An item is an attribute name Ai together with a value ai, denoted (Ai, ai).
Definition 2: A row in D is a combination of attribute names Ai and values aij, plus a class denoted by Cj.
Definition 3: An item set is a set of items contained in a training data row.
Definition 4: A rule item r is of the form (itemset → c), where c ∈ C is a class.
Definition 5: The actual occurrence (actoccr) of a rule item r in D is the number of rows in D that match the item set defined in r.
Definition 6: The support count (suppcount) of a rule item r = (itemset, c) is the number of rows in D that match r's item set and belong to class c.
Definition 7: The occurrence of an item set I in D is the number of rows in D that match I.
Definition 8: A rule item r passes the minimum support threshold if suppcount(r) ≥ minsupp.
Definition 9: A rule item r passes the minimum confidence threshold if suppcount(r) / actoccr(r) ≥ minconf.
Definition 10: An item set I that passes the minimum support threshold is said to be a frequent item set.
Definition 11: An associative classification rule is represented in the form (itemset → c), where the antecedent is an item set and the consequent is a class.
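Definitions 5-9 can be sketched concretely, using the rows of Table 1 as the training data D; the `minsupp` and `minconf` values below are illustrative, not the paper's settings:

```python
# Training data D from Table 1: each row is (set of items, class label).
D = [
    ({('A', 'a1'), ('B', 'b2'), ('C', 'c1')}, 'c1'),
    ({('A', 'a2'), ('B', 'b1'), ('C', 'c2')}, 'c0'),
    ({('A', 'a3'), ('B', 'b3'), ('C', 'c3')}, 'c1'),
    ({('A', 'a2'), ('B', 'b2'), ('C', 'c0')}, 'c0'),
]

def actoccr(itemset, D):
    """Definition 5: number of rows of D whose items contain the item set."""
    return sum(1 for items, _ in D if itemset <= items)

def suppcount(itemset, c, D):
    """Definition 6: matching rows that also belong to class c."""
    return sum(1 for items, cls in D if itemset <= items and cls == c)

def passes(itemset, c, D, minsupp=1, minconf=0.5):
    """Definitions 8-9: support and confidence threshold tests."""
    act = actoccr(itemset, D)
    sup = suppcount(itemset, c, D)
    return sup >= minsupp and act > 0 and sup / act >= minconf

rule_ok = passes({('A', 'a2')}, 'c0', D)   # rule item (A = a2 -> c0)
```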
The main task of AC is to discover a subset of rules with significant support and higher confidence. This subset is then used to build an automated classifier that can predict the classes of previously unseen data.
[Flowchart: Training data → Discover frequent item sets → Generate rules → Prune rules → Classifier (output); Test data → Predict → Accuracy]
Fig. 1 Steps in Associative Classification
Table 1 A Training Data set
3 Heart Disease
Coronary heart disease is the single largest cause of death in developed countries and is one of the main contributors to disease burden in developing countries. According to the WHO, an estimated 17.3 million people died from CVD in 2008, representing 30% of all global deaths. Of these deaths, an estimated 7.3 million were due to coronary heart disease and 6.2 million were due to stroke. By 2030, almost 23.6 million people will die from CVDs, mainly from heart disease and stroke [8]. Coronary heart disease (CHD) is epidemic in India and one of the major causes of disease burden and deaths. Mortality data from the Registrar General of India show that CVDs are a major cause of death in India, causing 30% of deaths in rural areas of Andhra Pradesh. The term heart disease encompasses the diverse diseases that affect the heart. Cardiovascular diseases, or heart diseases, are a class of diseases that involve the heart or blood vessels. Cardiovascular disease results in severe illness, disability, and death. Narrowing of the coronary arteries results
Row id  A   B   C   Class label
1       a1  b2  c1  c1
2       a2  b1  c2  c0
3       a3  b3  c3  c1
4       a2  b2  c0  c0
in the reduction of the blood and oxygen supply to the heart and leads to coronary heart disease. Myocardial infarctions, generally known as heart attacks, and angina pectoris, or chest pain, are encompassed by CHD. A sudden blockage of a coronary artery, generally due to a blood clot, results in a heart attack; chest pains arise when the blood received by the heart muscles is inadequate [9]. Over 300 risk factors have been associated with coronary heart disease and stroke. The major established risk factors are 1) modifiable risk factors, 2) non-modifiable risk factors, and 3) novel risk factors [8].
The following features were collected for heart disease prediction in Andhra Pradesh, based on data collected from various corporate hospitals and the opinions of expert doctors:
1) Age 2) Sex 3) Hypertension 4) Diabetic 5) Systolic blood pressure 6) Diastolic blood pressure 7) Rural/Urban.
Comprehensive and integrated action is the means to prevent and control cardiovascular diseases.
4 Related Work
One of the first algorithms to use an association rule mining approach for
classification was proposed in [10] and named CBA. CBA implement the famous
Apriori algorithm [11] in order to discover frequent item sets.
Classification based on multiple association rules (CMAR) adopts the FP-growth
ARM Algorithm [12] for discovering the rules and constructs an FP-Tree to mine
large databases efficiently [13]. It consists of two phases, rule generation and
classification. It Adopts FP-growth algorithm to scan the training data to find
complete set of rules that meet certain support and confidence thresholds.
Classification based on predictive association rules (CPAR) is a greedy method
proposed by [14]. The algorithm inherits the basic idea of FOIL [15] for rule
generation and integrates it with the features of AC. An accurate and effective
multi-class, multi-label associative classification approach was proposed
in [7]. A new approach based on information gain is proposed in [16], where the
attribute values that are more informative are chosen for rule generation.
Numerous works in the literature related to heart disease have motivated our
work; some of them are discussed below.
Cluster-based association rule mining for heart attack prediction was proposed
in [17]. Their method is based on digit sequences and clustering: the entire
database is divided into equal-sized partitions, each called a cluster. Their
approach reduces the main-memory requirement, since it considers only one small
cluster at a time, and is scalable and efficient.
An intelligent and effective heart attack prediction system using data mining
and an artificial neural network was proposed in [18]. They employed a
multilayer perceptron neural network with back-propagation as the training
algorithm. The
problem of identifying constrained association rules for heart disease prediction was
studied in [19]. These constraints are introduced to decrease the number of patterns.
Enhanced prediction of heart disease with feature subset selection using a
genetic algorithm was proposed in [20]. The objective of their work is to
accurately predict the presence of heart disease with a reduced number of
attributes.
34 M.A. Jabbar, B.L. Deekshatulu, and P. Chandra
We propose a better strategy for associative classification that generates a
compact rule set using only positively correlated rules, so that less
significant rules are eliminated from the classifier. Since
informative-attribute-centric rule generation produces a compact rule set, we
adopt an attribute selection approach, using the Gini index as a filter to
reduce the number of item sets ultimately generated. This classifier is then
used to predict heart disease.
5 Proposed Method
Most associative classification algorithms adopt the Apriori candidate
generation step for the discovery of frequent rule items. The main drawback, in
terms of mining efficiency, of almost all AC algorithms is that they generate a
large number of candidate sets and make more than one pass over the training
data set to discover frequent rule items, which causes high I/O overhead. The
search space for enumerating all frequent item sets is 2^m, which is
exponential in m, the number of items.
Two measures, support and confidence, are used to prune the rules. Even after
infrequent items are pruned based on support and confidence, the Apriori [11]
association rule generation procedure produces a huge number of association
rules. If all the rules were used in the classifier, its accuracy would be
high, but building the classifier would be slow.
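For concreteness, the support and confidence of a candidate rule X → c can be
computed directly from the transactions. The following is a minimal sketch; the
data and names are illustrative, not taken from the paper:

```python
def support_confidence(transactions, antecedent, cls):
    """Support and confidence of the rule `antecedent -> cls`.

    transactions: list of (item_set, class_label) pairs.
    """
    n = len(transactions)
    # Class labels of the transactions covered by the antecedent.
    covered = [label for items, label in transactions if antecedent <= items]
    if not covered:
        return 0.0, 0.0
    hits = sum(1 for label in covered if label == cls)
    return hits / n, hits / len(covered)

# Toy transactions in the style of the example table above.
tx = [({"a1", "b2", "c1"}, "c1"),
      ({"a2", "b1", "c2"}, "c0"),
      ({"a3", "b3", "c3"}, "c1"),
      ({"a2", "b2", "c0"}, "c0")]
sup, conf = support_confidence(tx, {"a2"}, "c0")   # rule {a2} -> c0
```

A rule survives pruning only if its support meets min-support and its
confidence meets min-confidence; everything below either threshold is discarded
before the classifier is built.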
Informative-attribute-centered rule generation produces a compact rule set; the
Gini index is used as a filter to reduce the number of candidate item sets. In
the proposed method, instead of considering all combinations of items for rule
generation, the Gini index is used to select the best attribute: the attribute
with the minimum Gini index is selected for rule generation.
Gini(t) = 1 - Σ_{i=0}^{c-1} [p(i|t)]^2                                      (1)
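Equation (1), together with the weighted average used in step 2 of the
algorithm below, can be sketched in Python; function and field names here are
our own illustrative choices:

```python
from collections import Counter, defaultdict

def gini(class_counts):
    # Gini(t) = 1 - sum over classes i of p(i|t)^2, per equation (1).
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def weighted_gini(rows, attr, label):
    # Partition the records on `attr`; weight each partition's node Gini
    # by the fraction of records that fall into it.
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[attr]][row[label]] += 1
    n = len(rows)
    return sum(sum(g.values()) / n * gini(list(g.values()))
               for g in groups.values())
```

A pure node yields Gini 0 and an even two-class split yields 0.5, so the
attribute with the smallest weighted value is the most informative.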
We applied our proposed method to heart disease data to predict the chances of
getting heart disease. Let us consider the sample training data set given in
Table 2.
Table 2 Example Training data
After calculating the Gini index of each attribute, car type has the lowest
Gini index, so car type is the best attribute. Rules like the following are
generated and considered for the classifier:
1) Car type = sports, shirt size = small, gender = male → class C0
2) Car type = sports, shirt size = medium, gender = female → class C0
3) Car type = luxury, shirt size = small, gender = female → class C1
4) Car type = luxury, shirt size = small, gender = male → class C1
Proposed Algorithm:
Input: Training data set T, min-support, min- confidence
Output: Classification Rules.
1) n ← number of attributes, c ← number of classes;
   classification rules have the form X → ci
2) For each attribute Ai calculate Gini weighted average where
Gini(t) = 1 - Σ_{i=0}^{c-1} [p(i|t)]^2
3) Select best attribute
Best Attribute = Minimum (Weighted Average of Gini (attribute))
4) For each t in the training data set:
   i) (X → ci) = candidateGen(Best attribute, T)
   ii) if Support(X → ci) ≥ min-support and Confidence(X → ci) ≥ min-confidence
   iii) Rule set ← Rule set ∪ {X → ci}
5) Test the generated associative classification rules on the test data and
   find the accuracy.
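Steps 1-4 can be sketched end to end on a toy data set in the spirit of the
car-type example above; the attribute names, thresholds, and helper functions
here are illustrative assumptions, not the authors' code:

```python
from collections import Counter, defaultdict

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(rows, attr, label):
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[attr]][row[label]] += 1
    n = len(rows)
    return sum(sum(g.values()) / n * gini(list(g.values()))
               for g in groups.values())

def generate_rules(rows, attrs, label, min_sup, min_conf):
    # Steps 2-3: the attribute with the minimum weighted Gini index wins.
    best = min(attrs, key=lambda a: weighted_gini(rows, a, label))
    n = len(rows)
    ante = Counter(r[best] for r in rows)               # antecedent counts
    pairs = Counter((r[best], r[label]) for r in rows)  # rule counts
    rules = []
    for (value, cls), count in pairs.items():
        sup, conf = count / n, count / ante[value]
        if sup >= min_sup and conf >= min_conf:         # step 4(ii) pruning
            rules.append(((best, value), cls, sup, conf))
    return best, rules

# Toy records loosely modelled on the paper's example rules.
rows = [
    {"car": "sports", "size": "small",  "sex": "M", "cls": "C0"},
    {"car": "sports", "size": "medium", "sex": "F", "cls": "C0"},
    {"car": "luxury", "size": "small",  "sex": "F", "cls": "C1"},
    {"car": "luxury", "size": "small",  "sex": "M", "cls": "C1"},
]
best, rules = generate_rules(rows, ["car", "size", "sex"], "cls",
                             min_sup=0.3, min_conf=0.7)
# On this toy data, "car" has the minimum weighted Gini (its partitions are
# pure), and both surviving rules reach confidence 1.0.
```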
In our proposed method, we have selected the following attributes for heart
disease prediction in Andhra Pradesh: 1) age, 2) sex, 3) hypertension,
4) diabetes, 5) systolic BP, 6) diastolic BP, and 7) rural/urban residence. We
collected the medical data from various corporate hospitals and applied our
proposed approach to analyze the classification of heart disease patients.
6 Results and Discussion
We have evaluated the accuracy of our proposed method on 9 data sets from the
SGI Repository [21]. A brief description of the data sets is presented in
Table 3. Accuracy is obtained by the hold-out approach [22]: 50% of the data
was randomly chosen from the data set and used as the training data set, and
the remaining 50% was used as the testing data set. The training data set is
used to construct a model for classification; after the classifier is
constructed, the test data set is used to estimate its performance. The
class-wise distribution for each data set is presented in Tables 4-10.
Accuracy Computation: Accuracy measures the ability of the classifier to
correctly classify unlabeled data. It is the ratio of the number of correctly
classified records to the total number of transactions in the test data set.
Accuracy = (Number of objects correctly classified) /
           (Total number of objects in the test set)
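The hold-out evaluation and the accuracy ratio above can be sketched as follows
(the 50/50 random split follows the protocol described in the text; the seed
and names are our own):

```python
import random

def holdout_split(rows, seed=1):
    # Randomly assign 50% of the records to training, the rest to testing.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def accuracy(predicted, actual):
    # Correctly classified objects over total objects in the test set.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

The training half is used to build the rule set; the held-out half is only
consulted once, to estimate classifier performance.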
Table 11 and Fig. 2 present the classification rate of the rule sets generated
by our algorithm. Table 12 presents the accuracy of various algorithms on
different data sets. Table 13 shows the size of the rule sets generated by our
algorithm, CBA, and C4.5; it indicates that classification-based association
rule methods often produce larger rule sets than traditional classification
techniques. Table 14 shows the classification rules generated by our method
when applied to the heart disease data sets.
Fig. 2 Accuracy of Various Data sets
Table 3 Data set description

Data Sets         Transactions  Items  Classes
3 of 9 Data       150           9      2
XD6 Data          150           9      2
Parity            100           10     2
Rooth Names       100           4      3
Led7 Data         100           7      10
Lens Data         24            9      3
Multiplexer Data  100           12     2
Weather Data      14            5      2
Baloon Data       36            4      2

Table 4 Class distribution for weather data

Class  Frequency  Probability
Yes    9          9/14 = 0.64
No     5          5/14 = 0.36

Table 5 Class distribution for lens data

Class                Frequency  Probability
Hard contact lenses  4          0.16
Soft contact lenses  5          0.28
No contact lenses    15         0.625
  • 18. Contents XVII Hierarchical Particle Swarm Optimization for the Design of Beta Basis Function Neural Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 Habib Dhahri, Adel M. Alimi, Ajith Abraham Fuzzy Aided Ant Colony Optimization Algorithm to Solve Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 Aloysius George, B.R. Rajakumar Pattern Recognition, Signal and Image Processing Self-adaptive Gesture Classifier Using Fuzzy Classifiers with Entropy Based Rule Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 Riidhei Malhotra, Ritesh Srivastava, Ajeet Kumar Bhartee, Mridula Verma Speaker Independent Word Recognition Using Cepstral Distance Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 Arnab Pramanik, Rajorshee Raha Wavelet Packet Based Mel Frequency Cepstral Features for Text Independent Speaker Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 Smriti Srivastava, Saurabh Bhardwaj, Abhishek Bhandari, Krit Gupta, Hitesh Bahl, J.R.P. Gupta Optimised Computational Visual Attention Model for Robotic Cognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249 J. Amudha, Ravi Kiran Chadalawada, V. Subashini, B. Barath Kumar A Rule-Based Approach for Extraction of Link-Context from Anchor-Text Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261 Suresh Kumar, Naresh Kumar, Manjeet Singh, Asok De Malayalam Offline Handwritten Recognition Using Probabilistic Simplified Fuzzy ARTMAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273 V. Vidya, T.R. Indhu, V.K. Bhadran, R. Ravindra Kumar Development of a Bilingual Parallel Corpus of Arabic and Saudi Sign Language: Part I . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285 Yahya O. Mohamed Elhadj, Zouhir Zemirli, Kamel Ayyadi Software-Based Malaysian Sign Language Recognition . . . . . . . . . . . . . . . 297 Farrah Wong, G. Sainarayanan, Wan Mahani Abdullah, Ali Chekima, Faysal Ezwen Jupirin, Yona Falinie Abdul Gaus An Algorithm for Headline and Column Separation in Bangla Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307 Farjana Yeasmin Omee, Md. Shiam Shabbir Himel, Md. Abu Naser Bikas A Probabilistic Model for Sign Language Translation Memory . . . . . . . . 317 Achraf Othman, Mohamed Jemni
  • 19. XVIII Contents Selective Parameters Based Image Denoising Method . . . . . . . . . . . . . . . . 325 Mantosh Biswas, Hari Om A Novel Approach to Build Image Ontology Using Texton . . . . . . . . . . . . 333 R.I. Minu, K.K. Thyagarajan Cloud Extraction and Removal in Aerial and Satellite Images . . . . . . . . . 341 Lizy Abraham, M. Sasikumar 3D360: Automated Construction of Navigable 3D Models from Surrounding Real Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349 Shreya Agarwal Real Time Animated Map Viewer (AMV) . . . . . . . . . . . . . . . . . . . . . . . . . . 357 Neeraj Gangwal, P.K. Garg Computer Networks and Distributed Systems A Novel Fuzzy Sensing Model for Sensor Nodes in Wireless Sensor Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365 Suman Bhowmik, Chandan Giri Retraining Mechanism for On-Line Peer-to-Peer Traffic Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373 Roozbeh Zarei, Alireza Monemi, Muhammad Nadzir Marsono Novel Monitoring Mechanism for Distributed System Software Using Mobile Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383 Rajwinder Singh, Mayank Dave Investigation on Context-Aware Service Discovery in Pervasive Computing Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393 S. Sreethar, E. Baburaj A Fuzzy Logic System for Detecting Ping Pong Effect Attack in IEEE 802.15.4 Low Rate Wireless Personal Area Network . . . . . . . . . . . . . . . . . 405 C. Balarengadurai, S. Saraswathi Differentiated Service Based on Reinforcement Learning in Wireless Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417 Malika Bourenane Multimodal Biometric Authentication Based on Score Normalization Technique . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425 T. Sreenivasa Rao, E. Sreenivasa Reddy Extracting Extended Web Logs to Identify the Origin of Visits and Search Keywords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435 Jeeva Jose, P. Sojan Lal
  • 20. Contents XIX A Novel Community Detection Algorithm for Privacy Preservation in Social Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443 Fatemeh Amiri, Nasser Yazdani, Heshaam Faili, Alireza Rezvanian Provenance Based Web Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451 Ajitha Robert, S. Sendhilkumar A Filter Tree Approach to Protect Cloud Computing against XML DDoS and HTTP DDoS Attack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459 Tarun Karnwal, Sivakumar Thandapanii, Aghila Gnanasekaran Cloud Based Heterogeneous Distributed Framework . . . . . . . . . . . . . . . . . 471 Anirban Kundu, Chunlin Ji, Ruopeng Liu An Enhanced Load Balancing Technique for Efficient Load Distribution in Cloud-Based IT Industries . . . . . . . . . . . . . . . . . . . . . . . . . . 479 Rashmi KrishnaIyengar Srinivasan, V. Suma, Vaidehi Nedu PASA: Privacy-Aware Security Algorithm for Cloud Computing . . . . . . . 487 Ajay Jangra, Renu Bala A New Approach to Overcome Problem of Congestion in Wireless Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499 Umesh Kumar Lilhore, Praneet Saurabh, Bhupendra Verma CoreIIScheduler: Scheduling Tasks in a Multi-core-Based Grid Using NSGA-II Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507 Javad Mohebbi Najm Abad, S. Kazem Shekofteh, Hamid Tabatabaee, Maryam Mehrnejad Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
Mining Top-K Frequent Correlated Subgraph Pairs in Graph Databases

Li Shang and Yujiao Jian

Abstract. In this paper, a novel algorithm called KFCP (top-K Frequent Correlated subgraph Pairs mining) is proposed to discover top-k frequent correlated subgraph pairs from graph databases. The algorithm is composed of two steps: co-occurrence frequency matrix construction and top-k frequent correlated subgraph pair extraction. We use a matrix to represent the frequencies of all subgraph pairs and compute their Pearson's correlation coefficients, then create a list of subgraph pairs sorted by the absolute value of the correlation coefficient. KFCP can find both positive and negative correlations without generating any candidate sets; the effectiveness of KFCP is assessed through experiments on real-world datasets.

1 Introduction

Graph mining has been a significant research topic in recent years because of numerous applications in data analysis, drug discovery, social networking and web link analysis. In view of this, many traditional mining techniques such as frequent pattern mining and correlated pattern mining have been extended to the case of graph data. Previous studies mainly focus on mining frequent subgraphs and correlated subgraphs, while little attention has been paid to finding other interesting patterns involving frequent correlated subgraphs.

There is one straightforward solution to the problem mentioned above, named FCP-Miner [8]. The FCP-Miner algorithm employs an effective "filter and verification" framework to find all frequent correlated graphs whose correlation with a query graph is no less than a given minimum correlation threshold. However, FCP-Miner

Li Shang · Yujiao Jian
Lanzhou University, P.R. China
e-mail: lishang@lzu.edu.cn, 18993177580@189.cn

A. Abraham and S.M. Thampi (Eds.): Intelligent Informatics, AISC 182, pp. 1–8.
springerlink.com © Springer-Verlag Berlin Heidelberg 2013
has several drawbacks. First, the number of candidates in FCP-Miner is large, since it processes each new subgraph f by CGSearch [7] to obtain its candidate set. Second, it is difficult for users to set an appropriate correlation threshold for each specific query graph, since different graph databases have different characteristics. Finally, FCP-Miner is not complete: due to the use of the skipping mechanism, the method cannot avoid missing some subgraph pairs.

To address these problems, in this paper we propose an alternative mining algorithm, KFCP, for discovering the top-k frequent correlated subgraph pairs. The main contributions of this paper are briefly summarized as follows.

1. We propose an alternative mining task of finding the top-k negatively and positively correlated frequent subgraph pairs from graph databases, which allows users to derive the k most interesting patterns. The resulting patterns are not only significant but also mutually independent, containing little redundancy.
2. We propose an efficient algorithm, KFCP, based on constructing a co-occurrence frequency matrix. The method avoids the costly generation of a large number of candidate sets.
3. We show that KFCP is complete and correct; extensive experiments demonstrate that the approach is effective and feasible.

The remainder of this paper is organized as follows. Section 2 reports the related work. In Section 3, basic concepts are described. Section 4 introduces our algorithm KFCP in detail and Section 5 shows the experimental results on two real datasets. Finally, we draw conclusions in Section 6.

2 Related Work

Correlation mining attracts much attention; it plays an essential role in various types of databases, such as market-basket data [1, 2, 3, 4], multimedia data [5], stream data [6], and graph data [7, 8].
For market-basket data [1, 2, 3, 4], a number of correlation measures were proposed to discover all correlated items, including the chi-square χ2 test [1], h-confidence [2], Pearson's correlation coefficient [3], etc. All of the above works set a threshold for the correlation measure, except [4], which studied top-k mining. For multimedia data, correlated pattern mining [5] has been proposed to discover cross-modal correlations. In the context of stream data, lagged correlation [6] has been presented to investigate the lead-lag relationship between two time series. For graph data, there is also previous research on correlation discovery: CGS [7] has been proposed for the task of correlation mining between a subgraph and a given query graph, and the work of [8] aimed to find all frequent subgraph pairs whose correlation coefficient is at least a given minimum correlation threshold.
3 Basic Concepts

Definition 1 (Pearson's correlation coefficient). Pearson's correlation coefficient for binary variables is also known as the "φ correlation coefficient". Given two graphs A and B, the Pearson's correlation coefficient of A and B, denoted φ(A,B), is defined as follows:

  φ(A,B) = [sup(A,B) − sup(A)sup(B)] / sqrt( sup(A)sup(B)(1 − sup(A))(1 − sup(B)) )   (1)

The range of φ(A,B) falls within [−1,1]. If φ(A,B) is positive, then A and B are positively correlated, meaning that their occurrence distributions are similar; otherwise, A and B are negatively correlated, in other words, A and B rarely occur together.

Definition 2 (Top-k frequent correlated subgraph pair discovery). Given a graph database GD, a minimum support threshold σ and an integer k, find the k frequent subgraph pairs with the highest absolute correlation values.

Definition 3 (Co-occurrence frequency matrix). Given a frequent subgraph set F = {g1, g2, ..., gn}, the co-occurrence frequency matrix, denoted X = (xij) for i = 1,...,n and j = 1,...,n, is defined by

  xij = freq(gi, gj)  if i ≠ j;   xij = freq(gi)  if i = j.   (2)

Obviously, X is an n × n symmetric matrix; due to the symmetry, we need retain only the upper triangular part of the matrix.

4 KFCP Algorithm

In this section, we describe the details of KFCP, which consists of two steps: co-occurrence frequency matrix construction and top-k frequent correlated subgraph pair extraction.

4.1 Co-occurrence Frequency Matrix Construction

In the co-occurrence frequency matrix construction step, KFCP starts by generating the frequent subgraph set F, then counts the frequency of each subgraph pair (gi, gj) of F. When the number of frequent subgraphs in F is n, the co-occurrence frequency matrix is an n × n matrix, where each entry represents the frequency count of a 1-element or 2-element subset of F.
The co-occurrence frequency matrix is constructed based on definition 3 by scanning the database once.
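As an illustration only (not the authors' implementation, which is in C++), the single-scan matrix construction of Definition 3 and the φ coefficient of Equation (1) can be sketched in Python; here each transaction graph is assumed to be already reduced to the set of frequent-subgraph identifiers it contains:

```python
from itertools import combinations
from math import sqrt

def cooccurrence_matrix(transactions, subgraphs):
    """One database scan: count single and pairwise occurrences (Definition 3)."""
    freq = {g: 0 for g in subgraphs}                     # diagonal: freq(gi)
    joint = {p: 0 for p in combinations(subgraphs, 2)}   # upper triangle: freq(gi,gj)
    for t in transactions:                               # t: set of subgraph ids in one graph
        present = [g for g in subgraphs if g in t]
        for g in present:
            freq[g] += 1
        for p in combinations(present, 2):
            joint[p] += 1
    return freq, joint

def phi(freq, joint, n, a, b):
    """Pearson's (phi) correlation coefficient of Equation (1); n = |GD|."""
    sa, sb = freq[a] / n, freq[b] / n
    sab = joint[(a, b) if (a, b) in joint else (b, a)] / n
    return (sab - sa * sb) / sqrt(sa * sb * (1 - sa) * (1 - sb))
```

With the counts that appear later in Example 2 (freq(g3) = 8, freq(g4) = 5, freq(g3,g4) = 3 in a database of 10 graphs), this sketch reproduces φ(g3,g4) = −0.5.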
Example 1. Fig. 1 shows a graph database GD with |GD| = 10. For minimum support threshold σ = 0.4, we obtain the frequent subgraph set F = {g1, g2, g3, g4, g5}, as shown in Fig. 2. Then we can construct the co-occurrence frequency matrix by scanning each of the transaction graphs. Considering graph transaction 4 (G4), we increment its counts in the matrix depicted by Fig. 3. Similarly, we increment the counts of the other transaction graphs; thus we can construct the co-occurrence frequency matrix X shown in Fig. 4. All of the off-diagonal elements are filled with the joint frequency of co-occurrence of subgraph pairs. For example, the element x34 indicates the joint frequency of the subgraph pair (g3, g4). On the other hand, every diagonal entry of the matrix is filled with the occurrence frequency of the single-element set. When σ is varied from 0.4 to 0.3, KFCP generates some new frequent subgraph pairs such as (g1, g6); we then increment the count of cell (g1, g6) to maintain uniformity of the co-occurrence frequency matrix, and this can be done without spending extra cost.

Fig. 1 A graph database GD
Fig. 2 Frequent subgraph set
Fig. 3 Frequency matrix of transaction 4
Fig. 4 Co-occurrence frequency matrix X

Table 1 All pairs with their correlation coefficients

Pairs        (g1,g2) (g1,g3) (g1,g4) (g1,g5) (g2,g3) (g2,g4) (g2,g5) (g3,g4) (g3,g5) (g4,g5)
Correlation   0.667   0.667  -0.333   0.272   1      -0.5    -0.102  -0.5    -0.102   0.816

4.2 Top-k Frequent Correlated Subgraph Pair Extraction

Once the co-occurrence frequency matrix has been generated, the frequency counts of all 1- and 2-element sets can be computed quickly. Using these frequency counts, KFCP computes the φ correlation coefficient of every frequent subgraph pair, then extracts the k most strongly correlated pairs based on |φ|.
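The extraction step can likewise be sketched in a few lines of Python (a hedged illustration, not the paper's code): sort the pairs by |φ| and keep the k strongest, retaining the sign so that both positive and negative correlations survive. The dictionary below simply hard-codes the φ values of Table 1:

```python
def top_k_pairs(phi_values, k):
    """Rank subgraph pairs by |phi| (sign retained) and keep the k strongest."""
    ranked = sorted(phi_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]

# The phi correlation coefficients of Table 1.
table1 = {("g1", "g2"): 0.667, ("g1", "g3"): 0.667, ("g1", "g4"): -0.333,
          ("g1", "g5"): 0.272, ("g2", "g3"): 1.0,   ("g2", "g4"): -0.5,
          ("g2", "g5"): -0.102, ("g3", "g4"): -0.5, ("g3", "g5"): -0.102,
          ("g4", "g5"): 0.816}

top6 = top_k_pairs(table1, 6)
```

For k = 6 the 6-th entry has |φ| = 0.5, and the four pairs (g1,g4), (g1,g5), (g2,g5), (g3,g5) fall outside the list, matching the pruning described in Example 2 below.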
Example 2. According to the matrix elements shown in Fig. 4, to compute φ(g3,g4) we note that freq(g3) = 8, freq(g4) = 5, and freq(g3,g4) = 3, so sup(g3) = 8/10, sup(g4) = 5/10, and sup(g3,g4) = 3/10. Using equation (1) above, we get φ(g3,g4) = −0.5; the φ correlation coefficients of the other subgraph pairs can be computed similarly. Table 1 shows the φ correlation coefficients of all subgraph pairs. Suppose k = 6. With the result in Table 1, we know that the absolute correlation coefficient of the 6-th pair, |φ(TL[k])|, is 0.5. By checking each φ(gi,gj) to determine whether the pair can be pruned or not, four subgraph pairs (g1,g4), (g1,g5), (g2,g5), (g3,g5) are deleted, and we obtain the top-6 list.

4.3 Algorithm Description

In this subsection, we show the pseudocode of KFCP in ALGORITHM 1. KFCP accepts the graph database GD, a minimum support threshold σ and an integer k as input; it generates the list TL of the top-k strongly correlated frequent subgraph pairs as output. Line 1 initializes an empty list TL of size k. Line 2 enumerates all frequent subgraphs by scanning the entire database once. Line 3 constructs the co-occurrence frequency matrix. Lines 4–9 calculate the correlation coefficient of each pair from the frequent subgraph set and push the pair into the top-k list if its absolute correlation coefficient is greater than that of the k-th pair in the current list.

ALGORITHM 1. KFCP Algorithm
Input: GD: a graph database
       σ: a given minimum support threshold
       k: the number of most highly correlated pairs requested
Output: TL: the sorted list of k frequent correlated subgraph pairs
1. initialize an empty list TL of size k;
2. scan the graph database to generate the frequent subgraph set F (with input GD and σ);
3. construct the co-occurrence frequency matrix;
4. for each subgraph pair (gi,gj) ∈ F do
5.   compute φ(gi,gj);
6.   if |φ(gi,gj)| > |φ(TL[k])| then
7.     add the subgraph pair ⟨(gi,gj), |φ|⟩ into the last position of TL;
8.     sort TL in non-increasing order based on the absolute value of the correlation coefficient;
9. Return TL;

Here, we analyze the KFCP algorithm with respect to completeness and correctness.

Theorem 1. The KFCP algorithm is complete and correct.

Proof. KFCP computes the correlation coefficients of all frequent subgraph pairs based on exhaustive search; this fact guarantees that KFCP is complete in all aspects. KFCP creates a sorted list of subgraph pairs based on the absolute value of the correlation coefficient and prunes all those subgraph pairs whose absolute correlation coefficient is lower than that of the k-th pair; this fact ensures KFCP is correct.

5 Experimental Results

Our experiments are performed on a PC with a 2.1 GHz CPU and 3 GB RAM running Windows XP. KFCP and FCP-Miner are implemented in C++. We tested two real datasets: PTE¹ and NCI². PTE contains 340 graphs; the average graph size is 27.4. NCI contains about 249000 graphs, from which we randomly select 10000 graphs for our experiments; the average graph size is 19.95.

Since FCP-Miner depends on a minimum correlation threshold θ, in order to generate the same result by FCP-Miner we set θ to the correlation coefficient of the k-th pair from the top-k list generated by KFCP. Fig. 5 shows the comparison between KFCP and FCP-Miner on the PTE dataset with different values of k; the correlation coefficient of the k-th pair is shown in Table 2. As k increases, KFCP keeps a stable running time, but the performance of FCP-Miner degrades greatly, since when k is large FCP-Miner cannot avoid generating a large number of candidates. Fig. 6 displays the performance comparison between KFCP and FCP-Miner on the NCI dataset with different support thresholds; as we vary σ from 0.3 to 0.03, the running time for enumerating all frequent subgraph pairs increases greatly, so the performance of both KFCP and FCP-Miner decreases greatly.
We also analyze the completeness of KFCP by recording the following experimental findings, as reported in Table 3: (1) % of excess: the percentage of excess pairs found by KFCP, calculated as (total number of pairs obtained by KFCP / total number of pairs obtained by FCP-Miner) − 1; (2) avg φ of excess: the average φ value of the excess pairs. We create six NCI datasets, with sizes ranging from 1000 to 10000 graphs; the values of σ and k are fixed at 0.05 and 40, respectively. For k = 40, we set θ to the φ of the 40-th pair from the top-k list generated by KFCP; thus we obtain θ = 0.8. The results verify that FCP-Miner may miss some frequent correlated subgraph pairs, but KFCP is complete. The experimental results confirm the superiority of KFCP in all cases.

Table 2 The correlation coefficient of the k-th pair at varying k

K                       10   20   30   40   50   60   70   80   90   100
φ of the k-th pair (θ)  0.95 0.92 0.88 0.85 0.82 0.76 0.74 0.70 0.68 0.65

¹ http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/PTE/
² http://cactus.nci.nih.gov/ncidb2/download.html
Table 3 The completeness of KFCP compared to FCP-Miner

size of NCI      1000  2000  4000  6000  8000  10000
% of excess      2.5%  2.1%  1.7%  2.1%  3.3%  4.9%
avg φ of excess  0.82  0.82  0.82  0.82  0.82  0.82

Fig. 5 Runtime comparison on the PTE dataset
Fig. 6 Runtime comparison on the NCI dataset with different support thresholds

6 Conclusions

In this paper, we present an algorithm, KFCP, for the frequent correlated subgraph mining problem. Compared to the existing algorithm FCP-Miner, KFCP avoids generating any candidate sets. Once the co-occurrence frequency matrix is constructed, the correlation coefficients of all subgraph pairs are computed and the k most strongly correlated subgraph pairs are extracted very easily. Extensive experiments on real datasets confirm the efficiency of our algorithm.

Acknowledgements. This work was supported by the NSF of Gansu Province grant (1010RJZA117).

References
1. Morishita, S., Sese, J.: Traversing itemset lattice with statistical metric pruning. In: Proc. of PODS, pp. 226–236 (2000)
2. Xiong, H., Tan, P., Kumar, V.: Hyperclique pattern discovery. DMKD 13(2), 219–242 (2006)
3. Xiong, H., Shekhar, S., Tan, P., Kumar, V.: Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs. In: Proc. ACM SIGKDD Internat. Conf. Knowledge Discovery and Data Mining, pp. 334–343. ACM Press (2004)
4. Xiong, H., Brodie, M., Ma, S.: Top-cop: Mining top-k strongly correlated pairs in large databases. In: ICDM, pp. 1162–1166 (2006)
5. Pan, J.Y., Yang, H.J., Faloutsos, C., Duygulu, P.: Automatic multimedia cross-modal correlation discovery. In: Proc. of KDD, pp. 653–658 (2004)
6. Sakurai, Y., Papadimitriou, S., Faloutsos, C.: Braid: Stream mining through group lag correlations. In: SIGMOD Conference, pp. 599–610 (2005)
7. Ke, Y., Cheng, J., Ng, W.: Correlation search in graph databases. In: Proc. of KDD, pp. 390–399 (2007)
8. Ke, Y., Cheng, J., Yu, J.X.: Efficient Discovery of Frequent Correlated Subgraph Pairs. In: Proc. of ICDM, pp. 239–248 (2009)
Evolutionary Approach for Classifier Ensemble: An Application to Bio-molecular Event Extraction

Asif Ekbal, Sriparna Saha, and Sachin Girdhar

Abstract. The main goal of Biomedical Natural Language Processing (BioNLP) is to capture biomedical phenomena from textual data by extracting relevant entities, information and relations between biomedical entities (i.e. proteins and genes). Most previous works focused on extracting binary relations among proteins. In recent years, the focus has shifted towards extracting more complex relations in the form of bio-molecular events that may include several entities or other relations. In this paper we propose a classifier ensemble based on an evolutionary approach, namely differential evolution, that enables extraction, i.e. identification and classification, of relatively complex bio-molecular events. The ensemble is built on the base classifiers Support Vector Machine, naïve-Bayes and IBk. Based on these individual classifiers, we generate 15 models by considering various subsets of features. We identify and implement a rich set of statistical and linguistic features that represent various morphological, syntactic and contextual information of the candidate bio-molecular trigger words. Evaluation on the BioNLP 2009 shared task datasets shows overall recall, precision and F-measure values of 42.76%, 49.21% and 45.76%, respectively, for three-fold cross validation. This is better than the best performing SVM based individual classifier by 4.10 F-measure points.

1 Introduction

The past history of text mining (TM) shows the great success of different evaluation challenges based on carefully curated resources. Relations among biomedical entities (i.e. proteins and genes) are important in understanding biomedical

Asif Ekbal · Sriparna Saha · Sachin Girdhar
Department of Computer Science and Engineering, Indian Institute of Technology Patna, India
e-mail: {asif,sriparna,sachin}@iitp.ac.in

A. Abraham and S.M. Thampi (Eds.): Intelligent Informatics, AISC 182, pp. 9–15.
springerlink.com © Springer-Verlag Berlin Heidelberg 2013
phenomena and must be extracted automatically from the large number of published papers. Similarly to previous bio-text mining challenges (e.g., LLL [1] and BioCreative [2]), the BioNLP'09 Shared Task also addressed bio-IE, but it tried to look one step further toward finer-grained IE. The difference in focus is motivated in part by the different applications envisioned as being supported by the IE methods. For example, BioCreative aims to support curation of PPI databases such as MINT [3], for a long time one of the primary tasks of bioinformatics. The BioNLP'09 shared task contains simple events and complex events. Whereas the simple events consist of binary relations between proteins and their textual triggers, the complex events consist of multiple relations among proteins, events, and their textual triggers. The primary goal of the BioNLP'09 shared task [4] was to support the development of more detailed and structured databases, e.g. pathway or Gene Ontology Annotation (GOA) databases, which are gaining increasing interest in bioinformatics research in response to recent advances in molecular biology.

Classifier ensembles are a popular machine learning paradigm. We assume that, in the case of weighted voting, the weights of voting should vary among the output classes in each classifier: the weight should be high for an output class for which the classifier is reliable, and low for an output class for which it is not. It is therefore a crucial issue to select the appropriate weights of votes for all the classes in each classifier. Here, we make an attempt to quantify the weights of voting for each output class in each classifier. A Genetic Algorithm (GA) based classifier ensemble technique has been proposed in [5] for determining the proper weights of votes in each classifier.
This was developed for named entity recognition in Indian languages as well as in English. In this paper we propose a single objective optimization based classifier ensemble technique built on the principle of differential evolution [6], an evolutionary algorithm that has proved superior to GA in many applications. We optimize the F-measure value, which is the harmonic mean of recall and precision. The proposed approach is evaluated for the extraction of events from biomedical texts and their classification into nine predefined categories, namely gene expression, transcription, protein catabolism, phosphorylation, localization, binding, regulation, positive regulation and negative regulation. We identify and implement a very rich feature set that incorporates morphological, orthographic, syntactic, local context and global context features. As base classifiers, we use Support Vector Machine, naïve-Bayes and the instance-based learner IBk. Different versions of these diverse classifiers are built based on various subsets of features. Differential evolution is then used as the optimization technique to build an ensemble model by combining all these classifiers. Evaluation on the BioNLP 2009 shared task datasets yields recall, precision and F-measure values of 42.76%, 49.21% and 45.76%, respectively, for three-fold cross validation. This is better than the best performing SVM based individual classifier by 4.10 F-measure points.
2 Proposed Approach

The proposed differential evolution based classifier ensemble method is described below.

String Representation and Population Initialization: Suppose there are N available classifiers and M output classes. The length of the chromosome (or vector) is then N × M. This implies D = N × M, where D represents the number of real parameters on which the optimization or fitness function depends; D is also the dimension of the vector xi,G. Each chromosome encodes the weights of votes for the M possible output classes for each classifier; that is, a chromosome represents the available classifiers along with their weights for each class. As an example, the encoding of a particular chromosome is shown below, where N = 3 and M = 3 (i.e., a total of 9 votes are possible):

0.59 0.12 0.56 0.09 0.91 0.02 0.76 0.5 0.21

This chromosome represents the following ensemble: the weights of votes for the 3 output classes in classifier 1 are 0.59, 0.12 and 0.56, respectively. Similarly, the weights of votes for the 3 output classes are 0.09, 0.91 and 0.02, respectively, in classifier 2, and 0.76, 0.5 and 0.21, respectively, in classifier 3. We use real encoding that initializes each entry of the chromosome with a random real value r between 0 and 1. Each entry of the chromosome, whose size is D, is thus initialized randomly. If the population size is P, then all P chromosomes of the population are initialized in this way.

Fitness Computation: Initially, all the classifiers are trained using the available training data and evaluated on the development data. The performance of each classifier is measured in terms of the evaluation metrics recall, precision and F-measure. Then, we execute the following steps to compute the objective values.

1) Suppose there are M classifiers in total.
Let the overall F-measure values of these N classifiers on the development set be Fm, m = 1...N, respectively. 2) The ensemble is constructed by combining all the classifiers. For the ensemble classifier, the output label for each word in the development data is determined using weighted voting over the N classifiers' outputs. The weight of class i in the vote of the mth classifier is I(m, i), the entry of the chromosome corresponding to the mth classifier and the ith class. The combined score of a particular class ci for a particular word w is:

f(ci) = ∑ I(m, i) × Fm, over all m = 1...N with op(w, m) = ci

Here, op(w, m) denotes the output class provided by the mth classifier for the word w. The class receiving the maximum combined score is selected as the joint decision. 3) The overall recall, precision and F-measure values of the ensemble classifier are computed on the development set. For the single objective approach, we use the F-measure value as the objective function, i.e. f0 = F-measure. The main goal is to maximize this objective function using the search capability of DE.
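As an illustration, the weighted-vote combination of step 2 can be sketched in a few lines of NumPy (the function name, the toy weight matrix and the per-classifier F-measures below are our own illustrative assumptions, not values from the experiments):

```python
import numpy as np

def combined_scores(chromosome, outputs, f_measures, n_classes):
    """Weighted-vote score f(c_i): sum over classifiers m whose output
    op(w, m) equals c_i of I(m, i) * F_m, for one word w.

    chromosome : (n_classifiers, n_classes) weight matrix I(m, i)
    outputs    : per-classifier predicted class for this word, op(w, m)
    f_measures : per-classifier F-measure F_m on the development set
    """
    scores = np.zeros(n_classes)
    for m, c in enumerate(outputs):
        scores[c] += chromosome[m, c] * f_measures[m]
    return scores

# Toy example: 3 classifiers, 3 classes (values are illustrative)
I = np.array([[0.59, 0.12, 0.56],
              [0.09, 0.91, 0.02],
              [0.76, 0.50, 0.21]])
F = np.array([0.45, 0.40, 0.42])
votes = np.array([0, 1, 0])          # classifier outputs for one word
label = int(np.argmax(combined_scores(I, votes, F, 3)))
```

For each word, only the weight a classifier assigns to the class it actually predicted contributes to that class's combined score.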
A. Ekbal, S. Saha, and S. Girdhar

Mutation: For each target vector xi,G; i = 1, 2, 3, ..., NP, a mutant vector is generated according to vi,G+1 = xr1,G + F(xr2,G − xr3,G), where r1, r2, r3 are random indexes belonging to {1, 2, ..., NP}. These are integer values, mutually different, and F > 0. The randomly chosen integers r1, r2 and r3 are also chosen to be different from the running index i, so NP must be greater than or equal to four to allow for this condition. F is a real, constant factor ∈ [0, 2] which controls the amplification of the differential variation (xr2,G − xr3,G). The vector vi,G+1 is termed the donor vector.

Crossover: In order to increase the diversity of the perturbed parameter vectors, crossover is introduced. This is well known as recombination. To this end, the trial vector ui,G+1 = (u1i,G+1, u2i,G+1, ..., uDi,G+1) is formed, where

uj,i,G+1 = vj,i,G+1 if (randb(j) ≤ CR) or j = rnbr(i)   (1)
uj,i,G+1 = xj,i,G if (randb(j) > CR) and j ≠ rnbr(i)   (2)

for j = 1, 2, ..., D. In Equation 1, randb(j) is the jth evaluation of a uniform random number generator with outcome in [0, 1]. CR is the crossover constant in [0, 1], which has to be determined by the user. rnbr(i) is a randomly chosen index belonging to {1, 2, ..., D} which ensures that ui,G+1 gets at least one parameter from vi,G+1.

Selection: To decide whether or not it should become a member of generation G+1, the trial vector ui,G+1 is compared to the target vector xi,G using the greedy criterion. If vector ui,G+1 yields a smaller cost function value than xi,G, then xi,G+1 is set to ui,G+1; otherwise, the old value xi,G is retained.

Termination Condition: In this approach, the processes of mutation, crossover (or recombination), fitness computation and selection are executed for a maximum number of generations. The best string seen up to the last generation provides the solution to the above classifier ensemble problem.
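One generation of the DE/rand/1/bin scheme described above can be sketched as follows (a minimal NumPy sketch; the sphere cost function stands in for the negative F-measure that the paper maximizes, and all names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def de_generation(pop, cost, F=0.5, CR=0.9):
    """One DE/rand/1/bin generation (minimisation of `cost`)."""
    NP, D = pop.shape
    new_pop = pop.copy()
    for i in range(NP):
        # r1, r2, r3 mutually different and different from i
        r1, r2, r3 = rng.choice([j for j in range(NP) if j != i], 3, replace=False)
        donor = pop[r1] + F * (pop[r2] - pop[r3])   # mutation
        cross = rng.random(D) <= CR                  # binomial crossover
        cross[rng.integers(D)] = True                # at least one donor gene
        trial = np.where(cross, donor, pop[i])
        if cost(trial) < cost(pop[i]):               # greedy selection
            new_pop[i] = trial
    return new_pop

# Minimise a simple sphere function as a stand-in objective
pop = rng.random((20, 4))
for _ in range(100):
    pop = de_generation(pop, lambda x: np.sum(x**2))
best = min(np.sum(pop**2, axis=1))
```

Maximizing the F-measure fits this scheme by minimizing its negative; elitism can be added by keeping the best vector seen so far outside the population, as the text describes.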
Elitism is implemented at each generation by preserving the best string seen up to that generation in a location outside the population. Thus, on termination, this location contains the best classifier ensemble.

3 Features for Event Extraction

We identify and use the following set of features for event extraction. All these features are automatically extracted from the training datasets without using any additional domain dependent resources and/or tools.

• Context words: We use a few preceding and succeeding words as features. This feature is used with the observation that contextual information plays an important role in the identification of event triggers.
• Root words: Stems of the current and/or the surrounding token(s) are used as features of the event extraction module. Stems of the words were provided with the evaluation datasets of training, development and test.
• Part-of-Speech (PoS) information: PoS information of the current and/or the surrounding token(s) is effective for event trigger identification. PoS labels of the tokens were provided with the datasets.
• Named Entity (NE) information: NE information of the current and/or surrounding token(s) is used as a feature. NE information was provided with the datasets.
• Semantic feature: This feature is semantically motivated and exploits global context information. It is based on the content words in the surrounding context. We consider all unigrams in the contexts w(i−3)...w(i+3) of wi (crossing sentence boundaries) for the entire training data. We convert tokens to lower case and remove stopwords, numbers, punctuation and special symbols. We define a feature vector of length 10 using the 10 most frequent content words. Given a classification instance, the feature corresponding to token t is set to 1 if and only if the context w(i−3)...w(i+3) of wi contains t.
• Dependency features: A dependency parse tree captures the semantic predicate–argument dependencies among the words of a sentence. Dependency paths between protein pairs have successfully been used to identify protein interactions. In this work, we use the dependency paths to extract events. We use the McClosky–Charniak parses, which are converted to the Stanford Typed Dependencies format and provided with the datasets. We define a number of features based on the dependency labels of the tokens.
• Dependency path from the nearest protein: Dependency relations of the path from the nearest protein are used as features.
• Boolean valued features: Two boolean-valued features are defined using the dependency path information.
The first feature checks whether the current token's child is a preposition and the chunk of the child includes a protein. The second feature fires if and only if the current token's child is a protein and its dependency label is OBJ.
• Shortest path: The distance of the nearest protein from the current token is used as a feature. This is an integer-valued feature that takes the value equal to the number of tokens between the current token and the nearest protein.
• Word prefix and suffix: Fixed length (say, n) word suffixes and prefixes may be helpful to detect event triggers in the text. These are fixed length character strings stripped either from the rightmost (for suffix) or from the leftmost (for prefix) positions of the words. If the length of the corresponding word is less than or equal to n−1, then the feature values are not defined and denoted by ND. The feature value is also not defined (ND) if the token itself is a punctuation symbol or contains any special symbol or digit. This feature is included with the observation that event triggers share some common suffixes and/or prefixes. In this work, we consider prefixes and suffixes of length up to four characters.
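The prefix/suffix feature with the ND convention can be sketched as follows (a hypothetical helper, not the authors' code; here any token containing a digit or special symbol is treated as invalid):

```python
def affix_features(token, n=4):
    """Prefixes/suffixes of length 1..n; 'ND' when the token is too short
    for the affix length or contains digits/special symbols."""
    invalid = not token.isalpha()
    feats = {}
    for k in range(1, n + 1):
        if invalid or len(token) <= k - 1:
            feats[f"prefix{k}"] = feats[f"suffix{k}"] = "ND"
        else:
            feats[f"prefix{k}"] = token[:k]
            feats[f"suffix{k}"] = token[-k:]
    return feats

feats = affix_features("expression")
# e.g. feats["prefix2"] == "ex", feats["suffix4"] == "sion"
```

A token such as "p53" yields ND for every affix length, following the rule that tokens containing digits or special symbols have undefined affix features.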
4 Datasets and Experimental Results

We use the BioNLP-09 shared task datasets. The events were selected from the GENIA ontology based on their significance and the amount of annotated instances in the GENIA corpus. The selected event types all concern protein biology, implying that they take proteins as their theme. The first three event types concern protein metabolism, which actually represents protein production and breakdown. Phosphorylation represents a protein modification event, whereas localization and binding denote fundamental molecular events. Regulation and its sub-types, positive and negative regulation, are representative of regulatory events and causal relations. The last five event types are universal but frequently occur on proteins. Detailed biological interpretations of the event types can be found in the Gene Ontology (GO) and the GENIA ontology. From a computational point of view, the event types represent different levels of complexity. Training and development datasets were derived from the publicly available event corpus [7]. The test set was obtained from an unpublished portion of the corpus. The shared task organizers made some changes to the original GENIA event corpus. Irrelevant annotations were removed, and some new types of annotation were added to make the event annotation more appropriate. The training, development and test datasets have 176,146, 33,937 and 57,367 tokens, respectively.

4.1 Experimental Results

We generate 15 different classifiers by varying the feature combinations of 3 different classifiers, namely Support Vector Machine (SVM), K-Nearest Neighbour (IBk) and the Naïve Bayesian classifier. We determine the best configuration using the development set. Due to the non-availability of gold annotated test datasets, we report the final results on 3-fold cross validation. The system is evaluated in terms of the standard recall, precision and F-measure.
Evaluation shows the highest performance with an SVM-based classifier that yields overall recall, precision and F-measure values of 33.17%, 56.00% and 41.66%, respectively. The dimension of the vector for our experiment is 15 × 19 = 285, where 15 represents the number of classifiers and 19 represents the number of output classes. We construct an ensemble from these 15 classifiers. A Differential Evolution (DE) based ensemble technique is developed that determines the appropriate weights of votes for each class in each classifier. When we set the population size P = 100, cross-over constant CR = 1.0 and number of generations G = 50, and increase F over the range [0, 2], we get the highest recall, precision and F-value of 42.90%, 47.40% and 45.04%, respectively. We observe that for F < 0.5 the solution converges faster; with 30 generations we can reach the optimal solution (for the case when F < 0.5). For F = 0.0, the solution converges at the very beginning. We observe the highest performance with the settings P = 100, F = 2.0, CR = 0.5 and G = 150. This yields overall recall, precision and F-measure values of 42.76%, 49.21% and 45.76%. This is actually an
improvement of 4.10 F-measure points over the best individual base classifier, i.e. SVM.

5 Conclusion and Future Works

In this paper we have proposed a differential evolution based ensemble technique for biological event extraction that involves identification and classification of complex bio-molecular events. The proposed approach is evaluated on the benchmark dataset of the BioNLP 2009 shared task. It shows an F-measure of 45.76%, an improvement of 4.10 points. Overall evaluation results suggest that there is still room for further improvement. In this work, we have considered identification and classification as a one-step problem. In our future work we would like to treat identification and classification as separate problems. We would also like to investigate distinct and more effective sets of features for event identification and classification separately, and to come up with an appropriate feature selection algorithm. In our future work, we would also like to identify the arguments to these events.

References

1. Nédellec, C.: Learning Language in Logic – Genic Interaction Extraction Challenge. In: Cussens, J., Nédellec, C. (eds.) Proceedings of the 4th Learning Language in Logic Workshop, LLL 2005, pp. 31–37 (2005)
2. Hirschman, L., Krallinger, M., Valencia, A. (eds.): Proceedings of the Second BioCreative Challenge Evaluation Workshop. CNIO Centro Nacional de Investigaciones Oncológicas (2007)
3. Chatr-aryamontri, A., Ceol, A., Palazzi, L.M., Nardelli, G., Schneider, M.V., Castagnoli, L., Cesareni, G.: MINT: the Molecular INTeraction database. Nucleic Acids Research 35(suppl. 1), 572–574 (2007)
4. Kim, J.-D., Ohta, T., Pyysalo, S., Kano, Y., Tsujii, J.: Overview of BioNLP 2009 shared task on event extraction. In: BioNLP 2009: Proceedings of the Workshop on BioNLP, pp. 1–9 (2009)
5.
Ekbal, A., Saha, S.: Weighted Vote-Based Classifier Ensemble for Named Entity Recognition: A Genetic Algorithm-Based Approach. ACM Trans. Asian Lang. Inf. Process. 10(2), 9 (2011)
6. Storn, R., Price, K.: Differential Evolution – A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. J. of Global Optimization 11(4), 341–359 (1997), doi:10.1023/A:1008202821328
7. Kim, J.-D., Ohta, T., Tsujii, J.: Corpus annotation for mining biomedical events from literature. BMC Bioinformatics 9, 10 (2008)
A. Abraham and S.M. Thampi (Eds.): Intelligent Informatics, AISC 182, pp. 17–27. springerlink.com © Springer-Verlag Berlin Heidelberg 2013

A Novel Clustering Approach Using Shape Based Similarity*

Smriti Srivastava, Saurabh Bhardwaj, and J.R.P. Gupta

Abstract. The present research proposes a paradigm for the clustering of data in which no prior knowledge about the number of clusters is required. Here, shape based similarity is used as the index of similarity for clustering. The paper exploits the pattern identification prowess of the Hidden Markov Model (HMM) and overcomes some of the problems associated with distance based clustering approaches. In the present research, partitioning of data into clusters is done in two steps. In the first step, HMM is used for finding the number of clusters; in the second step, data are classified into the clusters according to their shape similarity. Experimental results on synthetic datasets and on the Iris dataset show that the proposed algorithm outperforms several commonly used clustering algorithms.

Keywords: Clustering, Hidden Markov Model, Shape Based similarity.

1 Introduction

Cluster analysis is a method of creating groups of objects, or clusters, in such a way that the objects in one cluster are very similar to each other while the objects in different clusters are quite different. Data clustering algorithms can be generally classified into the following categories [1]: hierarchical clustering, fuzzy clustering, center based clustering, search based clustering, graph based clustering, grid based clustering, density based clustering, subspace clustering, and model based clustering algorithms. Every clustering algorithm is based on an index of similarity or dissimilarity between data points. Many authors have used distances as the index of similarity. Commonly used distances are the Euclidean distance, Manhattan distance, Minkowski distance, and Mahalanobis distance [2].
As shown in [3], distance functions are not always adequate for capturing correlations among the objects. It is also shown that strong correlations may still exist among a set of objects even if they are far apart from each other. HMMs are the dominant models for sequential data. Although HMMs have been extensively used in speech recognition,

Smriti Srivastava · Saurabh Bhardwaj · J.R.P. Gupta
Netaji Subhas Institute of Technology, New Delhi – 110078, India
e-mail: {bsaurabh2078,jairamprasadgupta}@gmail.com, ssmriti@yahoo.com
pattern recognition and time series prediction problems, they have not been widely used for clustering problems, and only a few papers can be found in the literature. Many researchers have used single sequences to train the HMMs and proposed different distance measures based on a likelihood matrix obtained from these trained HMMs. Clustering of sequences using HMMs was introduced in [4], where a log-likelihood (LL) based scheme for automatically determining the number of clusters in the data is proposed. A similarity based clustering of sequences using HMMs is presented in [5]. In that work, a new representation space is built in which each object is described by the vector of its similarities with respect to a predetermined set of other objects. These similarities are determined using LL values of HMMs. A single-HMM based clustering method was proposed in [6], which utilized LL values as the similarity measures between data points. The method was useful for finding the number of clusters in the data set with the help of LL values, but it is hard to actually obtain the data elements for the clusters, as the threshold for the clusters was estimated by simply inspecting the graph of LL. The present research proposes an HMM based unsupervised clustering algorithm which uses shape similarity as a measure to capture the correlation among the objects. It also automatically determines the number of clusters in the data. Here, the hidden state information of the HMM is utilized as a tool to obtain the similar patterns among the objects. The rest of the paper is organized as follows. Sect. 2 briefly describes the HMM. Sect. 3 details the proposed shape based clustering paradigm. In Sect. 4, experimental results are provided to illustrate the effectiveness of the proposed model. Finally, conclusions are drawn in Sect. 5.
2 Hidden Markov Model

The Hidden Markov Model (HMM) [7][8] springs forth from Markov processes or Markov chains. It is a canonical probabilistic model for sequential or temporal data. It depends upon the fundamental fact of the real world that "the future is independent of the past and given by the present". An HMM is a doubly embedded stochastic process, where the final output of the system at a particular instant of time depends upon the state of the system and the output generated by that state. There are two types of HMMs: discrete HMMs and continuous density HMMs, distinguished by the type of data that they operate upon. Discrete HMMs (DHMMs) operate on quantized data or symbols; Continuous Density HMMs (CDHMMs), on the other hand, operate on continuous data and their emission matrices are distribution functions. An HMM consists of the following parameters:

O = {O1, O2, ..., OT} : observation sequence
Z = {Z1, Z2, ..., ZT} : state sequence
T : transition matrix
B : emission matrix/function
π : initialization matrix
λ(T, B, π) : model of the system
ρ : space of all state sequences of length T
m = {mq1, mq2, ..., mqT} : mixture component for each state at each time
cil, μil, Σil : mixture components (state i, component l)

There are three major design problems associated with an HMM. Given the observation sequence O = {O1, O2, O3, ..., OT} and the model λ(T, B, π), the first problem is the computation of the probability of the observation sequence, P(O|λ). The second is to find the most probable state sequence Z = {Z1, Z2, ..., ZT}. The third problem is the choice of the model parameters λ(T, B, π) such that the probability of the observation sequence, P(O|λ), is maximum. The solutions to the above problems emerge from three algorithms: Forward, Viterbi and Baum-Welch [7].

2.1 Continuous Density HMM

Let O = {O1, O2, ..., OT} be the observation sequence and Z = {Z1, Z2, ..., ZT} be the hidden state sequence. Now, we briefly define the Expectation Maximization (EM) algorithm for finding the maximum-likelihood estimate of the parameters of an HMM given a set of observed feature vectors. The EM algorithm is a method for approximately obtaining the maximum a posteriori estimate when some of the data is missing, as in an HMM, in which the observation sequence is visible but the states are hidden or missing. The Q function is generally defined as

Q(λ, λ') = ∑_{z∈ρ} log P(O, z | λ) P(O, z | λ')   (1)

To define the Q function for the Gaussian mixtures, we need the hidden variable for the mixture component along with the hidden state sequence.
These are provided by both the E-step and the M-step of the EM algorithm:

E-step:
Q(λ, λ') = ∑_{z∈ρ} ∑_{m∈M} log P(O, z, m | λ) P(O, z, m | λ')   (2)

M-step:
λ' = argmax_λ [Q(λ, λ')] + constraint   (3)

The optimized equations for the parameters of the mixture density are

μil = [ ∑_{t=1}^{T} Ot P(zt = i, m_{zt,t} = l | O, λ') ] / [ ∑_{t=1}^{T} P(zt = i, m_{zt,t} = l | O, λ') ]   (4)

Σil = [ ∑_{t=1}^{T} (Ot − μil)(Ot − μil)^T P(zt = i, m_{zt,t} = l | O, λ') ] / [ ∑_{t=1}^{T} P(zt = i, m_{zt,t} = l | O, λ') ]   (5)
cil = [ ∑_{t=1}^{T} P(zt = i, m_{zt,t} = l | O, λ') ] / [ ∑_{t=1}^{T} ∑_{l=1}^{M} P(zt = i, m_{zt,t} = l | O, λ') ]   (6)

3 Shape Based Clustering

Generally, different distance functions such as the Euclidean distance, Manhattan distance, and cosine distance are employed for clustering data, but these distance functions are not always effective in capturing the correlations among the objects. In fact, strong correlations may still exist among a set of objects even if their distances are far apart as measured by the distance functions. Fig. 1 shows 4 objects with 5 attributes, among a set of 300 objects, which were allotted to different clusters when segmental k-means was applied to partition them into six clusters. As is clear from Fig. 1, these objects physically have the same pattern of shape and also have strong correlations with each other, which is shown with the help of the correlation matrix between the 4 data elements in Table 1. Taking motivation from this, in the present research we have extended the basic concept of the Shape Based Batching (SBB) procedure introduced in [9],[10]. Earlier it was shown that, by carefully observing the datasets and their corresponding log-likelihoods (LL), it is possible to find the shape of the input variation for a certain value of log-likelihood; however, it was further found that detecting the shape by simple observation is not always easy. Moreover, in some datasets it is very difficult to determine the threshold for the batch allocation. Although the states are hidden, for many practical applications there is often some physical significance attached to the states of the model.
In the present research it is found that the patterns of objects corresponding to any particular state of the HMM are highly correlated, and have a different pattern from (or are uncorrelated with) the objects corresponding to any other state. So here the concept of SBB is modified: in the modified SBB, the shape is a function of the state and not of the log-likelihoods.

Table 1 Correlation among different row vectors

        Data-1  Data-2  Data-3  Data-4
Data-1  1.000   0.988   0.955   0.905
Data-2  0.988   1.000   0.989   0.959
Data-3  0.955   0.989   1.000   0.990
Data-4  0.905   0.959   0.990   1.000

Fig. 1 Clustering results with the segmental K-means
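The point of Table 1 — that rows with the same shape stay highly correlated even when they are far apart in Euclidean distance — can be reproduced with NumPy's corrcoef (the row vectors below are made-up illustrations, not the paper's 300-object data):

```python
import numpy as np

# Four row vectors with the same rising-then-falling "shape" but
# different magnitudes (hypothetical values, not the paper's data)
rows = np.array([[1.0, 2.0, 3.0, 2.5, 1.5],
                 [2.0, 4.1, 6.0, 5.1, 3.0],
                 [0.5, 1.1, 1.6, 1.3, 0.8],
                 [3.0, 6.2, 9.0, 7.4, 4.6]])

# Pearson R between every pair of rows: values near 1 despite the
# rows being far apart in Euclidean distance
corr = np.corrcoef(rows)
```

Distance-based clustering would separate these rows, while a correlation (shape) criterion groups them together.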
Here, an unsupervised clustering algorithm is proposed; an important point is that the number of clusters is not fixed, and the algorithm automatically decides the number of clusters. The whole procedure is shown in Fig. 2. First of all, the number of clusters in the data set is obtained. The steps for obtaining the number of clusters are as follows. Estimate the HMM model parameters λ(T, B, π) for the entire input dataset using the Baum–Welch/Expectation Maximization algorithm, for appropriate values of the states 'Z' and mixture components 'm'. Once the HMM has been trained, the forward algorithm is used to compute the value of P(O|λ), which can then be used to calculate the LL of each row of the dataset. Now, by sorting the LL values in ascending (or descending) order, we can get a clear indication regarding the number of clusters in the dataset.

Fig. 2 Procedure for Shape Based Clustering

Now, after getting the information about the number of clusters, initialize the values of the parameters of the HMM. This includes initialization of the transition matrix 'T', the initialization matrix 'π' and the mixture component 'm' for each state. Take the number of states equal to the number of clusters. The Continuous Density Hidden Markov Model (CDHMM) is trained using the Baum–Welch/Expectation Maximization algorithm for the entire input dataset. After freezing the HMM parameters, the next step is to find the optimal state sequence with the help of the Viterbi algorithm, taking the entire input dataset as the D-dimensional observation vector sequence. Now the observation sequence and the corresponding optimal state sequence are obtained. After doing this, one important thing is observed: the data vectors which are associated with the same state have identical shape, while the data vectors associated with different states have no similarity in their shapes.
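The allotment of data vectors to clusters by their decoded state can be sketched as follows (a minimal sketch assuming the Viterbi state sequence is already available; names are our own):

```python
from collections import defaultdict
import numpy as np

def group_by_state(data, states):
    """Allot each row of `data` to the cluster of its decoded HMM state."""
    clusters = defaultdict(list)
    for row, s in zip(data, states):
        clusters[s].append(row)
    return {s: np.array(rows) for s, rows in clusters.items()}

# Toy data: 6 rows, decoded states assumed given by a Viterbi pass
data = np.arange(12).reshape(6, 2)
states = [0, 1, 0, 2, 1, 0]
clusters = group_by_state(data, states)
```

Each key of `clusters` is one state of the trained HMM, and its rows form one shape-based cluster.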
[Fig. 2 flowchart: Dataset → Train with HMM → Sort LL in increasing/decreasing order → Find no. of clusters (K) → States = K → Train with HMM → Optimal state sequence → Sort according to states → Calculate correlation matrix → Change 'm' to get the required correlation → Get shape based clusters]

So once the optimal value of the hidden state sequence is deduced, the next
step is to put the data into clusters according to their state. Now each cluster has an almost identical shape, but by simply observing the clusters it is difficult to find the required shape based similarity, so an attempt is made to get the appropriate values of 'Z' and 'm' for the required shape based clusters by calculating the correlation coefficients among the data vectors in the clusters. Here, the Pearson R model [11] comes in handy for finding the coherence (correlation) among a set of objects. The correlation between two objects x1 and x2 is defined as:

r(x1, x2) = ∑a (x1a − x̄1)(x2a − x̄2) / √[ ∑a (x1a − x̄1)² · ∑a (x2a − x̄2)² ]   (7)

where x̄1 and x̄2 are the means of all attribute values in x1 and x2, respectively. It may be noted that the Pearson R correlation measures the correlation between two objects with respect to all the attribute values. A large positive value indicates a strong positive correlation, while a large negative value indicates a strong negative correlation. Now the correlation coefficients can be used as a threshold value of the similarity between the data vectors in the clusters, and by using this value as a threshold, the appropriate values of 'Z' and 'm' can be determined for the shape based clusters. Using these basic criteria, an algorithm was developed which arranges the data into clusters.

3.1 Steps for Shape Based Clustering Algorithm

Step 1: Take the number of states equal to the number of clusters and estimate the HMM parameters λ(T, B, π) for the entire input dataset, taking an appropriate value of the mixture components 'm'.
Step 2: Calculate the optimal value of the hidden state sequence with the help of the Viterbi algorithm, taking the input as a D-dimensional observation vector.
Step 3: Rearrange the complete dataset according to the state values.
Step 4: Calculate the correlation matrix using the Pearson R model as in (7).
Step 5: Change the value of 'm' and repeat steps 1–4 until the required tolerance of correlation is achieved.

The effectiveness of the proposed model can be demonstrated using the Iris plants database. The data set contains 3 classes of 50 instances each, where each class refers to a type of Iris plant. Fig. 3 shows the patterns of the Iris data before clustering. As a first step, the entire Iris data is trained with the help of Baum–Welch/Expectation Maximization. Once the HMM has been trained, the forward algorithm is used to calculate the LL of each row of the dataset. Fig. 4 shows the graph of LL values sorted in ascending order. As is clear from Fig. 4, we can get the information regarding the number of clusters, but choosing the threshold value for allocating the data to the clusters by simply watching the LL graph (Fig. 4) is not possible. This is the main drawback of previous approaches, which is now removed in the present research. After getting the information regarding the number of clusters, the shape based clustering approach is applied as described earlier. After applying steps 1–5 of the proposed algorithm, Table 2 is obtained. The
Table 2 States and LL values of Iris data (five samples per class shown)

Attributes: 5.1 4.9 4.7 4.6 5.0 | 6.3 5.8 7.1 6.3 6.5 | 7.0 6.4 6.9 5.5 6.5
            3.5 3.0 3.2 3.1 3.6 | 3.3 2.7 3.0 2.9 3.0 | 3.2 3.2 3.1 2.3 2.8
            1.4 1.4 1.3 1.5 1.4 | 6.0 5.1 5.9 5.6 5.8 | 4.7 4.5 4.9 4.0 4.6
            0.2 0.2 0.2 0.2 0.2 | 2.5 1.9 2.1 1.8 2.2 | 1.4 1.5 1.5 1.3 1.5
No.:        1 2 3 4 5 | 51 52 53 54 55 | 101 102 103 104 105
LL:         0.6 0.1 0.2 0.1 0.5 | 335.5 214.2 309.7 158.1 298.1 | 162.4 142.5 181.9 109.3 158.8
States:     1 1 1 1 1 | 3 3 3 3 3 | 2 2 2 2 2

Table 3 Actual parameters of the model

Sigma(:,:,1):                     Mean:
0.2706 0.0833 0.1788 0.0545       5.936 5.006 6.589
0.0833 0.1064 0.0812 0.0405       2.770 3.428 2.974
0.1788 0.0812 0.2273 0.0723       4.262 1.462 5.553
0.0545 0.0405 0.0723 0.0487       1.327 0.246 2.026

Sigma(:,:,2):                     Initial Matrix:
0.1318 0.0972 0.0160 0.0101       0.000 1.000 0.000
0.0972 0.1508 0.0115 0.0091
0.0160 0.0115 0.0396 0.0059       States = 3
0.0101 0.0091 0.0059 0.0209

Sigma(:,:,3):                     Transition Matrix:
0.4061 0.0921 0.2972 0.0479       1.000 0.000 0.000
0.0921 0.1121 0.0701 0.0468       0.000 0.980 0.020
0.2972 0.0701 0.3087 0.0477       0.020 0.000 0.980
0.0479 0.0468 0.0477 0.0840

The description of Table 2 is as follows: rows 1 to 4 show the 4 attribute values of the Iris data, row 5 shows the number of the data vector, row 6 shows the LL value corresponding to each data vector, and row 7 shows the optimized state value associated with that particular data vector. Due to the limitation of page width it is not possible to show the complete table, so only 5 values from each class are shown. The LL values in the table are displayed only to show the effectiveness of our method over the LL based clustering method. As is clear from the table, the LL
value of data element '54' is almost equal to that of data element '105', but these two elements belong to two different clusters. Hence, it can be said that the LL based clustering method is not adequate, while it is clear from Table 2 that the states clearly partition the data accurately; on this dataset (the Iris dataset) the misclassification is zero, meaning we get 100% accuracy. The plots of the three clusters obtained after the application of the proposed algorithm are shown in Fig. 5, and the actual parameters of the model are shown in Table 3.

Fig. 3 Iris Plant Data Patterns    Fig. 4 Iris Data LL Values

4 Experimental Results

To show the effectiveness of the proposed method, it is applied on both synthetic data and real world data.

Table 4 Parameters for synthetic data generation

Class 1: T = 3×3 matrix with all entries 1/3; π = (1/3, 1/3, 1/3); B: μ1 = 1, σ1² = 0.6; μ2 = 3, σ2² = 0.6; μ3 = 5, σ3² = 0.6
Class 2: T = 3×3 matrix with all entries 1/3; π = (1/3, 1/3, 1/3); B: μ1 = 1, σ1² = 0.5; μ2 = 3, σ2² = 0.5; μ3 = 5, σ3² = 0.5
Class 3: T = 3×3 matrix with all entries 1/3; π = (1/3, 1/3, 1/3); B: μ1 = 1, σ1² = 0.4; μ2 = 3, σ2² = 0.4; μ3 = 5, σ3² = 0.4
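A sketch of the generative process in Table 4 (30 sequences of length 400 per class; uniform transition and initial probabilities, Gaussian emissions with means 1, 3, 5 and the class-specific variances; the helper name and fixed seed are our own):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_hmm(length, means, var):
    """Sample one sequence from a 3-state HMM with uniform transition and
    initial probabilities and Gaussian emissions N(means[k], var)."""
    K = len(means)
    z = rng.integers(K)                 # uniform initial state
    obs = []
    for _ in range(length):
        obs.append(rng.normal(means[z], np.sqrt(var)))
        z = rng.integers(K)             # uniform transitions
    return np.array(obs)

# 30 sequences of length 400 per class; classes differ only in variance
class_vars = {1: 0.6, 2: 0.5, 3: 0.4}
data = {c: [sample_hmm(400, [1, 3, 5], v) for _ in range(30)]
        for c, v in class_vars.items()}
```

Because all transition rows are 1/3, each step's state is simply a fresh uniform draw; with non-uniform T, the next state would have to be sampled from the row T[z].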
4.1 Synthetic Data

The description of the synthetic data is given in [5]. The data contain 3 classes. The training set is composed of 30 sequences (of length 400) from each of the three classes, generated by 3 HMMs. The parameters of the synthetic data are shown in Table 4. The comparison of results with previous approaches is shown in Table 5; the results in the first three rows are taken from [5].

4.2 Real Data

We have tested the proposed approach on the classical Iris Plants Database. The data set contains 3 classes of 50 instances each, where each class refers to a type of Iris plant (Iris virginica, Iris versicolor, Iris setosa). The dataset consists of the following four attributes: sepal length, sepal width, petal length, and petal width. The comparison of results with previous approaches is shown in Table 6. The Errors column lists the number of data points which are classified wrongly. The first four rows of the table are taken from [12].

Fig. 5 Iris Data Cluster

Table 5 Comparison of previous methods

Learning Algorithm        Accuracy (%)
MLOPS [5]                 95.7
1-NN on S T [5]           98.9
1-NN on S T [5]           98.9
Shape Based Clustering    98.888
Table 6 Comparison of previous methods

5 Conclusion

The present research proposes a novel clustering approach based on shape similarity. The paper shows that distance functions are not always adequate for the clustering of data, and that strong correlations may still exist among data points even if they are far apart from each other. The method is applied in a two-phase sequential manner. In the first phase, an HMM is applied to the dataset to yield HMM parameters, assuming a certain number of states and Gaussian mixtures. Then the log-likelihood values are obtained from the forward algorithm. The sorted log-likelihood values give a clear indication regarding the number of clusters in the dataset. Next, the shape based clustering algorithm is applied to cluster the dataset. The method overcomes the problem of finding the threshold value in LL based clustering algorithms. The proposed method is tested on real (Iris) as well as synthetic datasets. The results of simulation are very encouraging; the method gives 100% accuracy on the Iris dataset and about 99% accuracy on the synthetic test data. Further, the shortcoming of previous HMM based clustering approaches, in which the number of HMMs required was equal to the number of sequences/classes [4][5], is removed by utilizing only a single HMM for clustering, hence reducing the computational time and the complexity considerably.

References

[1] Gan, G., Ma, C., Wu, J.: Data Clustering: Theory, Algorithms, and Applications. Society for Industrial and Applied Mathematics, Philadelphia (2007)
[2] Xu, R., Wunsch, D.I.: Survey of clustering algorithms. IEEE Transactions on Neural Networks 16(3), 645–678 (2005)
[3] Wang, H., Pei, J.: Clustering by Pattern Similarity. Journal of Computer Science and Technology 23(4), 481–496 (2008)
[4] Smyth, P.: Clustering sequences with hidden Markov models.
Advances in Neural Information Processing Systems 9, 648–654 (1997) [5] Bicego, M., Murino, V., Figueiredo, M.A.: Similarity-based classification of se- quences using hidden Markov models. Pattern Recognition 37(12), 2281–2291 (2004) [6] Hassan, R., Nath, B.: Stock market forecasting using hidden markov model. In: Pro- ceedings of the Fifth International Conference on Intelligent Systems Design and Application, pp. 192–196 (2005) Learning Algorithm Error Accuracy (%) FCM(Fuzzy c -means) 16 89.33 SWFCM(Sample Weighted Robust Fuzzy c -means) 12 92 PCM(Possibilistic c -means) 50 66.6 PFCM(Possibilistic Fuzzy c -means) 14 90.6 Shape Based Clustering 15 100
[7] Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
[8] Bilmes, J.A.: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. International Computer Science Institute, Berkeley, California, Technical Report ICSI-TR-97-021 (1998)
[9] Srivastava, S., Bhardwaj, S., Madhvan, A., Gupta, J.R.P.: A Novel Shape Based Batching and Prediction approach for Time series using HMMs and FISs. In: 10th International Conference on Intelligent Systems Design and Applications, Cairo, Egypt, pp. 929–934 (2010)
[10] Bhardwaj, S., Srivastava, S., Madhvan, A., Gupta, J.R.P.: A Novel Shape Based Batching and Prediction approach for Sunspot Data using HMMs and ANNs. In: India International Conference on Power Electronics, New Delhi, India, pp. 1–5 (2011)
[11] Shardanand, U., Maes, P.: Social Information Filtering: Algorithms for automating word of mouth. In: ACM CHI, pp. 210–217 (1995)
[12] Xia, S.-X., Han, X.-D., Liu, B., Zhou, Y.: A Sample-Weighted Robust Fuzzy C-Means Clustering Algorithm. Energy Procedia 13, 3924–3931 (2011)
A. Abraham and S.M. Thampi (Eds.): Intelligent Informatics, AISC 182, pp. 29–39. springerlink.com © Springer-Verlag Berlin Heidelberg 2013

Knowledge Discovery Using Associative Classification for Heart Disease Prediction

M.A. Jabbar, B.L. Deekshatulu, and Priti Chandra

Abstract. Associative classification, used in knowledge discovery and decision support systems, integrates association rule discovery methods and classification into a model for prediction. An important advantage of these classification systems is that, using association rule mining, they are able to examine several features at a time. Associative classifiers are especially fit for applications where the model may assist domain experts in their decisions. Cardiovascular diseases are the number one cause of death globally: an estimated 17.3 million people died from CVD in 2008, representing 30% of all global deaths. India is at risk of more deaths due to CHD, and cardiovascular disease is becoming an increasingly important cause of death in Andhra Pradesh. Hence a decision support system is proposed for predicting heart disease of a patient. In this paper we propose a new associative classification algorithm for predicting heart disease for the Andhra Pradesh population. Experiments show that the accuracy of the resulting rule set is better when compared to existing systems. This approach is expected to help physicians make accurate decisions.

Keywords: Andhra Pradesh, Associative classification, Data mining, Heart disease.

M.A. Jabbar, JNTU Hyderabad. e-mail: jabbar.meerja@gmail.com
B.L. Deekshatulu, Distinguished Fellow, IDRBT, RBI, Govt. of India
Priti Chandra, Senior Scientist, Advanced System Laboratory, Hyderabad

1 Introduction

The major reason that data mining has attracted a great deal of attention in the information industry in recent years is the wide availability of huge amounts of data and the imminent need for turning such data into useful information
and knowledge. The information gained can be used for applications ranging from business management, production control, and market analysis to engineering design, science exploration, and health data analysis. Data mining, also known as knowledge discovery in databases (KDD), is the process of automatically discovering useful information in large data repositories [1]. Association rule mining and classification are analogous tasks in data mining, with the exception that the main aim of classification is to build a classifier using some training instances for predicting classes for new instances, while association rule mining discovers associations between attribute values in a data set. Association rule mining uses unsupervised learning, whereas classification uses supervised learning. The majority of traditional classification techniques use heuristic-based strategies for building the classifier [2]. In constructing a classification system they look for rules with high accuracy; once a rule is created, they delete all positive training objects associated with it. Thus these methods often produce a small subset of rules, and may miss detailed rules that might play an important role in some cases. The heuristic methods employed by traditional classification techniques often use domain-independent biases to derive a small set of rules, and therefore the rules they generate are different in nature and more complex than those that users might expect or be able to interpret [3]. Both classification rule mining and association rule mining are indispensable to practical applications. Thus, great savings and convenience to the user could result if the two mining techniques could somehow be integrated.
Associative classification (AC) is a recent and rewarding technique that applies the methodology of association to classification and achieves higher classification accuracy than traditional classification techniques; many of the rules found by AC methods cannot be discovered by traditional classification algorithms. It generally involves two stages: 1) generate class association rules from a training data set; 2) classify the test data set into predefined class labels. The various phases in associative classification are rule generation, rule pruning, rule ranking, rule sorting, model construction, and prediction. The rule generation phase in associative classification is a hard step that requires a large amount of computation. A rich rule set is constructed after applying suitable rule pruning and rule ranking strategies. This rule set, which is generated from the training data set, is used to build a model which is used to predict test cases present in the test data set.

Coronary heart disease (CHD) is epidemic in India and one of the major causes of disease burden and deaths. Mortality data from the Registrar General of India shows that cardiovascular diseases are a major cause of death in India now. Studies to determine the precise causes of death in Andhra Pradesh have revealed that cardiovascular diseases cause about 30% of deaths in rural areas [4]. Medical diagnosis is regarded as an important yet complicated task that needs to be executed accurately and efficiently, and the automation of this system would be extremely advantageous. Medical history data comprises a number of tasks essential to diagnosing a particular disease. It is possible to acquire knowledge and information concerning a disease from the patient-specific stored measurements as far as
medical data is concerned. Therefore data mining has developed into a vital domain in health care [5]. A classification system can assist the physician in examining a patient: the system can predict whether the patient is likely to have a certain disease or present incompatibility with some treatments. Associative classification is a better alternative for predictive analysis [6]. This paper proposes a new associative classification method; considering the classification model, the physician can make a better decision. Basic concepts in associative classification and heart disease are discussed in Sections 2 and 3, and common algorithms are surveyed in Section 4. Section 5 describes our proposed method. Experimental results and comparisons are demonstrated in Section 6. We conclude with final remarks in Section 7.

2 Associative Classification

According to [7], the AC problem is defined as follows. Let a training data set T have M distinct attributes A1, A2, ..., Am, and let C be a list of class labels. The number of rows in D is denoted |D|. Attributes can be categorical or continuous. In the case of categorical attributes, all possible values are mapped to a set of positive integers. For continuous attributes, a discretisation method is first used to transform these attributes into categorical ones.

Definition 1: An item is an attribute name Ai together with one of its values ai, denoted (Ai, ai).
Definition 2: A row in D is a combination of attribute names Ai and values aij, plus a class denoted by cj.
Definition 3: An itemset is a set of items contained in the training data.
Definition 4: A rule item r is of the form (itemset → c), where c ∈ C is a class.
Definition 5: The actual occurrence (actoccr) of a rule r in D is the number of rows in D that match the itemset defined in r.
Definition 6: The support count (suppcount) of a rule item r = (itemset, c) is the number
of rows in D that match r's itemset and belong to class c.
Definition 7: The occurrence of an itemset I in T is the number of rows in D that match I.
Definition 8: A rule r passes the min-supp threshold if suppcount(r) ≥ min-supp.
Definition 9: A rule r passes the min-conf threshold if suppcount(r) / actoccr(r) ≥ min-conf.
Definition 10: An itemset I that passes the min-supp threshold is said to be a frequent itemset.
Definition 11: An associative classification rule is represented in the form (itemset → c), where the antecedent is an itemset and the consequent is a class.
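Definitions 5 through 9 translate directly into code. The sketch below is illustrative only: the helper names (actual_occurrence, support_count, passes) are ours, not the paper's, and the rows are the toy training data of Table 1 encoded as (itemset, class) pairs.

```python
def actual_occurrence(rows, itemset):
    """Definition 5: number of rows of D whose items contain the rule's itemset."""
    return sum(1 for items, _cls in rows if itemset <= items)

def support_count(rows, itemset, cls):
    """Definition 6: rows that match the itemset AND belong to class cls."""
    return sum(1 for items, c in rows if itemset <= items and c == cls)

def passes(rows, itemset, cls, min_supp, min_conf):
    """Definitions 8-9: suppcount >= min-supp and suppcount/actoccr >= min-conf."""
    supp = support_count(rows, itemset, cls)
    act = actual_occurrence(rows, itemset)
    return supp >= min_supp and act > 0 and supp / act >= min_conf

# The four rows of Table 1, as (itemset, class) pairs.
rows = [
    (frozenset({("A", "a1"), ("B", "b2"), ("C", "c1")}), "c1"),
    (frozenset({("A", "a2"), ("B", "b1"), ("C", "c2")}), "c0"),
    (frozenset({("A", "a3"), ("B", "b3"), ("C", "c3")}), "c1"),
    (frozenset({("A", "a2"), ("B", "b2"), ("C", "c0")}), "c0"),
]

rule = frozenset({("A", "a2")})
print(actual_occurrence(rows, rule))                        # 2
print(support_count(rows, rule, "c0"))                      # 2
print(passes(rows, rule, "c0", min_supp=2, min_conf=0.8))   # True
```

With this data, the rule item ({(A, a2)} → c0) has actoccr = 2 and suppcount = 2, so its confidence is 1.0 and it passes both thresholds.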
The main task of AC is to discover a subset of rules with significant support and high confidence; this subset is then used to build an automated classifier that can predict the classes of previously unseen data.

Fig. 1 Steps in Associative Classification (training data → discover frequent itemsets → generate rules → prune rules → classifier; test data → predict → accuracy)

Table 1 A Training Data set

Row id   A    B    C    Class label
1        a1   b2   c1   c1
2        a2   b1   c2   c0
3        a3   b3   c3   c1
4        a2   b2   c0   c0

3 Heart Disease

Coronary heart disease is the single largest cause of death in developed countries and is one of the main contributors to disease burden in developing countries. According to WHO, an estimated 17.3 million people died from CVD in 2008, representing 30% of all global deaths. Of these deaths, an estimated 7.3 million were due to coronary heart disease and 6.2 million were due to stroke. By 2030 almost 23.6 million people will die from CVDs, mainly from heart disease and stroke [8]. Coronary heart disease (CHD) is epidemic in India and one of the major causes of disease burden and deaths. Mortality data from the Registrar General of India shows that CVDs are a major cause of death in India, causing 30% of deaths in rural areas of Andhra Pradesh. The term heart disease encompasses the diverse diseases that affect the heart. Cardiovascular diseases, or heart diseases, are a class of diseases that involve the heart or blood vessels. Cardiovascular disease results in severe illness, disability, and death. Narrowing of the coronary arteries results
in the reduction of blood and oxygen supply to the heart and leads to coronary heart disease. Myocardial infarctions, generally known as heart attacks, and angina pectoris, or chest pain, are encompassed in CHD. A sudden blockage of a coronary artery, generally due to a blood clot, results in a heart attack; chest pains arise when the blood received by the heart muscles is inadequate [9]. Over 300 risk factors have been associated with coronary heart disease and stroke. The major established risk factors are 1) modifiable risk factors, 2) non-modifiable risk factors, and 3) novel risk factors [8]. The following features are collected for heart disease prediction in Andhra Pradesh, based on data collected from various corporate hospitals and the opinions of expert doctors: 1) Age 2) Sex 3) Hypertension 4) Diabetic 5) Systolic blood pressure 6) Diastolic blood pressure 7) Rural/Urban. Comprehensive and integrated action is the means to prevent and control cardiovascular diseases.

4 Related Work

One of the first algorithms to use an association rule mining approach for classification was proposed in [10] and named CBA. CBA implements the famous Apriori algorithm [11] in order to discover frequent itemsets. Classification based on multiple association rules (CMAR) adopts the FP-growth ARM algorithm [12] for discovering the rules and constructs an FP-tree to mine large databases efficiently [13]. It consists of two phases, rule generation and classification. It adopts the FP-growth algorithm to scan the training data and find the complete set of rules that meet certain support and confidence thresholds. Classification based on predictive association rules (CPAR) is a greedy method proposed by [14].
The algorithm inherits the basic idea of FOIL [15] in rule generation and integrates it with the features of AC. Accurate and effective multi-class, multi-label associative classification was proposed in [7]. A new approach based on information gain is proposed in [16], where the attribute values that are more informative are chosen for rule generation. Numerous works in the literature related to heart disease have motivated our work; some of them are discussed below. Cluster-based association rule mining for heart attack prediction was proposed in [17]. Their method is based on digit sequences and clustering. The entire database is divided into partitions of equal size, each called a cluster. Their approach reduces the main memory requirement, since it considers only a small cluster at a time, and it is scalable and efficient. An intelligent and effective heart attack prediction system using data mining and artificial neural networks was proposed in [18]. They employed a multilayer perceptron neural network with back-propagation as the training algorithm. The problem of identifying constrained association rules for heart disease prediction was studied in [19]; these constraints are introduced to decrease the number of patterns. Enhanced prediction of heart disease with feature subset selection using a genetic algorithm was proposed in [20]. The objective of their work is to accurately predict the presence of heart disease with a reduced number of attributes.
We propose a better strategy for associative classification: generate a compact rule set using only positively correlated rules, so that the less significant rules are eliminated from the classifier. Informative-attribute-centric rule generation produces a compact rule set, so we adopt an attribute selection approach. We use the Gini index measure as a filter to reduce the number of itemsets ultimately generated. This classifier is then used for predicting heart disease.

5 Proposed Method

Most associative classification algorithms adopt the Apriori candidate generation step for the discovery of frequent rule items. The main drawback, in terms of mining efficiency, of almost all AC algorithms is that they generate a large number of candidate sets and make more than one pass over the training data set to discover frequent rule items, which causes high I/O overheads. The search space for enumeration of all frequent itemsets is 2^m, which is exponential in m, the number of items. Two measures, support and confidence, are used to prune the rules. Even after pruning the infrequent items based on support and confidence, the Apriori [11] association rule generation procedure produces a huge number of association rules. If all the rules were used in the classifier, the accuracy of the classifier would be high, but the building of the classifier would be slow. Informative-attribute-centered rule generation produces a compact rule set; the Gini index is used as a filter to reduce the number of candidate itemsets. In the proposed method, instead of considering all combinations of items for rule generation, the Gini index is used to select the best attribute. The attributes with minimum Gini index are selected for rule generation.

Gini(t) = 1 − Σ_{i=0}^{c−1} [p(i|t)]²    (1)

We applied our proposed method to heart disease data to predict the chances of getting heart disease.
Let us consider the sample training data set given in Table 2.

Table 2 Example Training data
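The Gini filter of Eq. (1) and its weighted average over an attribute's partitions can be sketched as follows. The data set below is a small hypothetical stand-in for Table 2, built from the same attributes (car type, shirt size, gender) that appear in the example rules; the function names are ours, not the paper's.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini(t) = 1 - sum_{i=0}^{c-1} [p(i|t)]^2  (Eq. 1)."""
    n = len(labels)
    return 1.0 - sum((cnt / n) ** 2 for cnt in Counter(labels).values())

def weighted_gini(rows, labels, attr):
    """Weighted average of Gini over the partitions induced by one attribute."""
    groups = defaultdict(list)
    for row, cls in zip(rows, labels):
        groups[row[attr]].append(cls)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())

def best_attribute(rows, labels):
    """Step 3 of the proposed method: minimum weighted-average Gini index."""
    return min(rows[0], key=lambda a: weighted_gini(rows, labels, a))

# Hypothetical training rows in the spirit of Table 2.
rows = [
    {"car type": "sports", "shirt size": "small",  "gender": "male"},
    {"car type": "sports", "shirt size": "medium", "gender": "female"},
    {"car type": "luxury", "shirt size": "small",  "gender": "female"},
    {"car type": "luxury", "shirt size": "small",  "gender": "male"},
]
labels = ["C0", "C0", "C1", "C1"]

print(best_attribute(rows, labels))             # car type
print(weighted_gini(rows, labels, "car type"))  # 0.0 (both partitions are pure)
```

Here car type splits the classes perfectly, so its weighted Gini is 0 and it is selected, mirroring the worked example that follows.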
After calculating the Gini index of each attribute, car type has the lowest Gini index, so car type is the best attribute. Rules like the following are generated and considered for the classifier:

1) Car type = sports, shirt size = small, gender = male → class C0
2) Car type = sports, shirt size = medium, gender = female → class C0
3) Car type = luxury, shirt size = small, gender = female → class C1
4) Car type = luxury, shirt size = small, gender = male → class C1

Proposed Algorithm:

Input: Training data set T, min-support, min-confidence
Output: Classification rules

1) n ← number of attributes, c ← number of classes; a classification rule has the form X → ci
2) For each attribute Ai, calculate the weighted average Gini index, where Gini(t) = 1 − Σ_{i=0}^{c−1} [p(i|t)]²
3) Select the best attribute: Best attribute = minimum(weighted average of Gini(attribute))
4) For each t in the training data set:
   i) (X → ci) = candidateGen(best attribute, T)
   ii) If support(X → ci) ≥ min-support and confidence(X → ci) ≥ min-confidence
   iii) then Rule set ← Rule set ∪ (X → ci)
5) Test the generated associative classification rules on the test data and find the accuracy.

In our proposed method, we have selected the following attributes for heart disease prediction in Andhra Pradesh: 1) Age 2) Sex 3) Hypertension 4) Diabetic 5) Systolic BP 6) Diastolic BP 7) Rural/Urban. We collected medical data from various corporate hospitals and applied our proposed approach to analyze the classification of heart disease patients.

6 Results and Discussion

We have evaluated the accuracy of our proposed method on 9 data sets from the SGI repository [21]. A brief description of the data sets is presented in Table 3. The accuracy is obtained by the hold-out approach [22], where 50% of the data was randomly chosen from the data set and used as the training data set, and the remaining 50% was used as the testing data set.
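Steps 4 and 5 above can be sketched in runnable form. The paper does not spell out the candidateGen routine, so this version makes an assumption: for each training row it tries two antecedents anchored on the best attribute, the bare best-attribute item and the full row (the form of the four example rules above). All names here are ours, and the data is the same hypothetical car-type example.

```python
def generate_rules(rows, labels, best_attr, min_supp, min_conf):
    """Steps 4-5: keep candidate rules X -> ci meeting both thresholds
    (support count and support/actoccr confidence, Definitions 8-9)."""
    rule_set = set()
    for row, cls in zip(rows, labels):
        # Candidate antecedents anchored on the best attribute (assumed form).
        candidates = [frozenset({(best_attr, row[best_attr])}),
                      frozenset(row.items())]
        for x in candidates:
            supp = sum(1 for r, c in zip(rows, labels)
                       if x <= frozenset(r.items()) and c == cls)
            actoccr = sum(1 for r in rows if x <= frozenset(r.items()))
            if supp >= min_supp and supp / actoccr >= min_conf:
                rule_set.add((x, cls))
    return rule_set

def predict(rule_set, row, default=None):
    """Classify with the most specific (longest-antecedent) matching rule."""
    items = frozenset(row.items())
    matches = [(len(x), cls) for x, cls in rule_set if x <= items]
    return max(matches)[1] if matches else default

rows = [
    {"car type": "sports", "shirt size": "small",  "gender": "male"},
    {"car type": "sports", "shirt size": "medium", "gender": "female"},
    {"car type": "luxury", "shirt size": "small",  "gender": "female"},
    {"car type": "luxury", "shirt size": "small",  "gender": "male"},
]
labels = ["C0", "C0", "C1", "C1"]

rules = generate_rules(rows, labels, "car type", min_supp=1, min_conf=1.0)
print(len(rules))  # 6: two one-item rules on car type plus the four full-row rules
print(predict(rules, {"car type": "luxury", "shirt size": "medium", "gender": "male"}))  # C1
```

An unseen row with a medium luxury shirt-size combination still matches the one-item car-type rule, which is exactly the compactness benefit the Gini filter is after.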
The training data set is used to construct a model for classification. After constructing the classifier, the test data set is used to estimate the classifier performance. The class-wise distribution for each data set is presented in Tables 4–10.

Accuracy computation: accuracy measures the ability of the classifier to correctly classify unlabeled data. It is the ratio of the number of correctly classified data points to the total number of transactions in the test data set.
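The hold-out protocol and the accuracy ratio described above can be sketched as follows; the function names and the fixed seed are our own choices, not the paper's.

```python
import random

def holdout_split(data, frac=0.5, seed=7):
    """Hold-out evaluation: frac of the data randomly chosen for training,
    the rest kept for testing (the paper uses a 50/50 split)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    k = int(len(shuffled) * frac)
    return shuffled[:k], shuffled[k:]

def accuracy(predicted, actual):
    """Correctly classified objects / total objects in the test set."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

train, test = holdout_split(list(range(10)))
print(len(train), len(test))                              # 5 5
print(accuracy(["c0", "c1", "c0"], ["c0", "c1", "c1"]))   # 2 of 3 correct
```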
Accuracy = (number of objects correctly classified) / (total number of objects in the test set)

Table 11 and Fig. 2 present the classification rate of the rule sets generated by our algorithm. Table 12 presents the accuracy of various algorithms on different data sets. Table 13 shows the size of the rule sets generated by our algorithm, CBA, and C4.5; the table indicates that classification-based association rule methods often produce larger rule sets than traditional classification techniques. Table 14 shows the classification rules generated by our method when applied to the heart disease data sets.

Fig. 2 Accuracy of Various Data sets

Table 3 Data set Description

Data Sets          Transactions   Items   Classes
3 of 9 Data        150            9       2
XD6 Data           150            9       2
Parity             100            10      2
Rooth Names        100            4       3
Led7 Data          100            7       10
Lens Data          24             9       3
Multiplexer Data   100            12      2
Weather Data       14             5       2
Baloon Data        36             4       2

Table 4 Class distribution for weather Data

Class   Frequency   Probability
Yes     9           9/14 = 0.64
No      5           5/14 = 0.36

Table 5 Class distribution for lens Data

Class                 Frequency   Probability
Hard contact lenses   4           0.16
Soft contact lenses   5           0.28
No contact lenses     15          0.625