SlideShare a Scribd company logo
International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.5, September 2013
DOI : 10.5121/ijist.2013.3501 1
Quantification of Portrayal Concepts using tf-idf
Weighting
S. Florence Vijila 1
and Dr. K. Nirmala 2
1
Research Scholar, Manonmaniam Sundaranar University, Tamil Nadu, India,
Assistant Professor , CSI Ewart Women’s Christian College, Melrosapuram , TamilNadu
2
Associate Prof. of Computer Applications, Quaid-e-Millath Govt. College for Women,
Chennai, Tamil Nadu, India
ABSTRACT
Term frequencies and inverse document frequencies have been successfully applied in determining
weighting for document rankings. However these have been more successful in text mining and in
extraction techniques used in the web. Concept mining has become increasingly popular in the research
and application areas of Computer Science. This paper attempts to demonstrate the limited usage of term
frequency and inverse document frequency for the application of weighting calculations for ranking
documents that are based on concept quantifications. The case study considered for experiment in this
paper, is based on concept terms of David Merrill’s First Principles of Instruction (FPI). Merrill’s FPI
applies cognitive structures explicitly for analyzing instructional materials. Therefore it is justified that the
terms categorized under each cognitive structure (or portrayal) of FPI can be taken as respective concept
of that portrayal. As question papers are representative of cognitive structures in a more clear and logical
way, four question papers on ‘C Language’ have been considered for the experimental study, that are
detailed in this paper. Manual method has been adopted for the computation of quantities of portrayals in
selected documents for the purpose of comparative study. As manual method is accurate, the values
(results) are considered as benchmark values. These benchmark values are considered for comparing with
normalized term frequencies that are derived (experimented) from automated extractions from the same
selected documents. The study is however limited to four documents only. Conclusions are drawn from this
experimental study, which will be of immense use to concept mining researchers as well as for instructional
designers.
KEYWORDS
concept keywords; document ranking; term frequency weighting; cognitive structures.
1. INTRODUCTION
Literatures are aplenty for document ranking using term weighting. Documents are retrieved
using keywords and documents are ranked according to keyword density (Masaru Ohba et al -
2005, SA-Kwang Song et al -2012). According to documented literature, ranking of documents
based on term weighting has been a key research issue in information retrieval, particularly
retrieving concepts in addition to keywords. Applying weighting schemes using term frequency
and inverse document frequency are crucial for ranking of documents (Sa-Kwang Song et al-
2012). A concept keyword is a word that represents a concept. But the nature of concept and the
International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.5, September 2013
2
meaning a particular keyword carries are subjective judgment, as per these authors. The authors
have adapted term frequency for mining concepts. Both human selected concepts and automated
methodologies have been successfully adopted. But research publications point out that concept
mining ultimately depends only upon manual methods, when accuracy is needed to a high level.
In other words, automated methods would only provide approximate results.
The tf-idf weighting (term frequency-inverse document frequency, a.k.a.TF-IDF) is a numerical
statistic which reflects how important a word is to a document in a collection or corpus (Salton et
al - 1988). By convention, the tf-idf value increases proportionally to the number of times a word
appears in a document, but is offset by the frequency of the word in the corpus, which helps to
control for the fact that some words are generally more common than others. In view of this, the
factor itself may be considered as a weighting factor in information retrieval and text mining. This
paper hence attempts to determine the acceptability of tf-idf, through experimentation, on
quantifying concepts from documents. Comparative study will be carried out from the results of
manual methods.
In concept mining techniques, variations of the tf-idf weighting scheme are often used by some
search engines as a central tool in scoring and ranking a document’s relevance given by a user
query. Tf-idf can be successfully used for stop-words filtering in various subject fields including
text summarization and classification. This paper attempts to quantify well proven conceptual
terms, such as cognitive structures from instructional documents.
David Merrill (2007), in his ‘First Principles of Instruction’ (FPI) divides any instructional event
into four phases, which he calls ‘Activation’, ’Demonstration’, ’Application’ and ‘Integration’.
Central to this instructional model is a real-time problem-solving theme, called ‘problem’. Merrill
suggests that fundamental principles of instructional design should be relied on and these apply
regardless of any instructional design model used. Violating this would produce a decrement in
learning and performance. Each phase is a cognitive portrayal for learning a concept according to
this author. Therefore it would be very apt to consider cognitive structure or portrayal terms as
concepts for the purpose of quantification.
We propose to rank order four sets of concept keywords of these four cognitive structures
(portrayals) of Merrill’s FPI. The rank ordering would be carried out using normalized term
frequency (tf) and inverse document frequency (idf) on selected University question paper
documents of “C Language’. To ascertain the degree of accuracy (the objective of this paper), we
propose to compare the results with manual methods. Conclusions have been drawn from this
experimental study. The results will be of immense use to concept mining methodologists and for
instructional designers. The material of this paper forms a part of a whole research programme on
concept based quantification of portrayals by the authors. As these portrayals are taken for
concept quantification, the meaning of each portrayal is essential to comprehend before
attempting to experiment with.
2. ‘ACTIVATION’ PORTRAYAL ON EVALUATION DOCUMENT OF ‘C’
LANGUAGE
The ‘Activation’ portrayal of FPI specifies, “ Does the question direct students to recall,
remember, repeat or recognize basic and fundamental knowledge of ‘C’ from the relevant past
International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.5, September 2013
3
behavior (experience) of the students that can be used as a foundation for the acquisition of new
(and further advanced) knowledge of ‘C’?“.
Based on the above definition and on the past ten year question paper patterns analyzed by the
authors, the following concept keywords have been specifically arrived at for the computation of
term frequency:
“list, define, tell, name, locate, identify, what, give, distinguish, acquire, underline, relate, state,
recall, select, repeat, recognize, reproduce and measure”.
3. ‘DEMONSTRATION’ PORTRAYAL ON EVALUATION DOCUMENT OF ‘C’
LANGUAGE
The ‘Demonstration’ portrayal of FPI specifies, “Does the question directs the students to
demonstrate of what has he learnt rather than merely provide information (unlike ’Activation’)
about what is needed as foundation for further techniques of ‘C’?“. Based on the above definition
and the past ten year question papers pattern analyzed by the authors, the following concept
keywords have been exclusively arrived at for the computation of term frequency:
“demonstrate, summarize, illustrate, interpret, contrast, predict, associate, distinguish,
identify, show, label, collect, experiment, recite, classify, discuss, select, compare,
translate, prepare, change, rephrase, differentiate, draw, explain, estimate, fill in, choose,
operate, perform, organize and write”.
4. ‘APPLICATION’ PORTRAYAL ON EVALUATION DOCUMENT OF ‘C’
LANGUAGE
The ‘Application’ portrayal of FPI specifies, ”Does the question provides an opportunity for
student to apply her acquired knowledge of ‘C’ to solve a problem?”. Based on this definition and
from analysis carried out authors on the past ten year question papers, the following concept
keywords have been exclusively arrived at for the computation of term frequencies:
“apply, calculate, find, solve, illustrate, make, predict, construct, assess, practice,
restrictive, classify, code, develop, generate, write”.
5. ‘INTEGRATION’ PORTRAYAL ON EVALUATION DOCUMENT OF ‘C’
LANGUAGE
The ‘Integration’ portrayal of FPI specifies, ”Does the question triggers and encourages the
student to integrate (transfer) her acquired knowledge of ‘C’ to a new situation in any critical and
complex situation?”. Based on this definition and from the study on past ten year question papers,
the following concept keywords have been arrived at exclusively for the computation of term
frequencies:
“analyze, solve, justify, infer, combine, integrate, plan, generalize, assess, decide,
rank, grade, recommend, contrast, survey, examine, investigate, compose, invent,
improve, imagine, hypothesize, predict, evaluate, rate, how and why”.
International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.5, September 2013
4
Using these keywords, quantified benchmark values on the selected documents have been
arrived at through manual methods. These benchmark values would be compared with
automated term frequencies so as to arrive at conclusions.
6. COMPUTATION OF BENCHMARK VALUES BY MANUAL METHOD
Based on the definitions of the portrayals of FPI (four cognitive structures), quantification of
concept terms in terms of these four cognitive structures on the selected four question paper
documents (samples) were carried out using concept keywords as provided above. Thus, the
addition of all values (in percentages) would yield to 100%, as the method adapted is pure manual
by the authors. Therefore the values need to be exact and hence taken as benchmark values for
comparative studies. The values are provided in Table 1.
Portrayal Documents Average
Q1 Q2 Q3 Q4
Activation 48% 27% 30% 31% 34%
Demonstration 22% 28% 42% 33% 31%
Application 26% 36% 21% 29% 28%
Integration 4% 9% 7% 7% 7%
Table 1. Benchmark values in percentages computed manually
7. RESULTS AND DISCUSSIONS ON WEIGHTING USING TERM FREQUENCIES
The four selected question papers (documents) of the subject ‘C Language’ have been used for
the intended analysis. The normalized term frequency and inverse document frequencies on these
selected four concept keywords, namely ‘Activation’, ’Demonstration’, ’Application’ and
’Integration’ using their respective keywords have been used for the calculation of two
frequencies. These term frequencies are used for determining the weighting of these four
portrayals that we have computed through inverse document frequencies. The total number of
words, the most frequented word and the number of concepts (different keywords of each concept
or portrayal) are used for the calculation of weightings (Wikipedia 2012). They are computed and
presented for each document (i.e. question paper). They are presented in a nut shell in Table 2.
The basic variables for term frequency computations are:
Number of concept terms occurring in one document = i;
Number of most frequently occurred word in one document = n;
Normalized term frequency, tf = i/n ;_______________(1)
The basic data for term frequencies are presented in Table 2.
International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.5, September 2013
5
Document
No.
Total
words
No. of most
frequented
word(n)
No. of concept terms appearing (i)
Activation Demonstration Application Integration
Q1 187 15 11 5 6 1
Q2 206 25 6 7 8 2
Q3 184 16 9 12 6 2
Q4 173 21 6 10 8 2
Total no. of concept terms
appearing
32 34 28 7
Table 2.Basic Data for Term Frequency
The total number of documents considered is N, which is 4.
The normalized term frequencies for each concept term of each document is presented in Table 3
(refer to equation 1).
Concept Normalized term frequency Total Average
Q1 Q2 Q3 Q4
Activation 0.7333 0.2400 0.5625 0.2857 1.8215 0.4554
Demonstration 0.3333 0.2800 0.7500 0.4762 1.8395 0.4599
Application 0.4000 0.3200 0.3750 0.0952 1.1902 0.2976
Integration 0.0667 0.0800 0.1250 0.952 0.3669 0.0917
Table 3. Normalized Term Frequencies
The inverse document frequency is computed as:
idf = log (N/K) ;_______________(2)
Where N = Total number of documents, which is 4 and
K: Number of occurrences of terms in all the documents considered (see total concept terms
appearing in Table 2).
The final results are :-
Activation = -0.903089
Demonstration = -0.92959
Application = -0.845098
Integration = -0.243038
The negative sign is due to the fact that the total number of documents considered is less than the
occurrences of terms in all the documents. The objective of this research work is to compare these
frequency values of concept terms with respect to the benchmarks that were calculated based on
manual methods. As the study is limited to comparisons, the idf is considered with modular
values. Thus weighting would be
tf.│log (N/K)│;_______________(3)
International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.5, September 2013
6
Accordingly, the weighting for each concept term of each document is computed and presented in
Table 4
Concept Weighting of terms Average
Q1 Q2 Q3 Q4
Activation 0.6622 0.2167 0.5080 0.2580 0.4113
Demonstration 0.3098 0.2603 0.6972 0.4427 0.4275
Application 0.3380 0.2704 0.3169 0.0805 0.2515
Integration 0.0162 0.0194 0.0304 0.2313 0.0223
Table 4. Weighting of portrayals in each document
The above values yield important conclusions.
8. CONCLUSIONS
The largest and the least present cognitive portrayals in every document as per benchmark values
tally well with weighting of term frequency of all question paper documents considered. It is thus
concluded that normalized term frequencies may be used for quantifying concept terms in
individualized documents to an acceptable standards. However the benchmark values deviate
from that of weighting term frequency on average values. This clearly shows that unlike
quantification of keywords, quantifying concept words may not be reliable when inverse
document frequency is computed on small number of documents. The work may be extended to
large number of documents for further study.
REFERENCES
[1] Salton G, Buckley C (1988), "Term-weighting approaches in automatic text retrieval". Information
Processing and Management 24 (5): 513–523, 1988.
[2] Sa-kwang Song, and Sung Hyon Myaeng, (2012), “A novel term weighting scheme based on
discrimination power obtained from past retrieval results.”, Information Processing and
Management 48 (2012) 919–930, 2012.
[3] Masaru Ohba and Katsuhiko Gondow, (2005), “Toward Mining ‘Concept Keywords’ from
Identifiers in Large Software Projects”, ACM SIGSOFT Software Engineering Notes 30(4): 1-5,
2005.
[4] Merrill M.D., (2002), “First Principles of Instruction”, Englewood Cliffs, NJ: Educational
Technology Publications, 2002.
[5] Wikipedia (2012), http://en.wikipedia.org/wiki/Tf*idf, 2012
Authors
S.Florence Vijila is working as Assistant Professor in the Department t of Computer
Science at CSI Ewart Women’s Christian College,Tamil Nadu for the past 12 years.Her
qualification is M.CA,M.Phil,Now she is doing Ph.D in the area of “Data Mining”.Her
work have been published in the International conference on ”Innovations in
contemporary IT research “ proceedings.

More Related Content

PPTX
Term weighting
PDF
Testing Different Log Bases for Vector Model Weighting Technique
PDF
Testing Different Log Bases for Vector Model Weighting Technique
PDF
3517 10753-1-pb
PDF
Additional1
PDF
renato_barros_arantes_article
PDF
20433-39028-3-PB.pdf
PDF
Using Class Frequency for Improving Centroid-based Text Classification
Term weighting
Testing Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting Technique
3517 10753-1-pb
Additional1
renato_barros_arantes_article
20433-39028-3-PB.pdf
Using Class Frequency for Improving Centroid-based Text Classification

Similar to Quantification of Portrayal Concepts using tf-idf Weighting (20)

PPTX
Information Retrieval – Dense Vectors – Neural IR for Question Answering
PDF
IRJET - Document Comparison based on TF-IDF Metric
PDF
A novel approach for text extraction using effective pattern matching technique
PDF
Ju3517011704
PDF
Arabic text categorization algorithm using vector evaluation method
PPT
information retrieval term Weighting.ppt
PDF
Cc35451454
PPTX
Course-Adaptive Content Recommender for Course Authoring
PDF
Comparative Study of Different Approaches for Measuring Difficulty Level of Q...
PDF
Comparative Study of Different Approaches for Measuring Difficulty Level of Q...
PDF
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...
PDF
Searching for Interestingness in Wikipedia and Yahoo! Answers
PDF
Exploiting Distributional Semantic Models in Question Answering
PDF
TF-IDF.pdf yyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
PDF
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
PDF
Oversampling vs. undersampling in TF-IDF variations for imbalanced Indonesian...
PDF
Aq35241246
PDF
Automated Question Paper Generator And Answer Checker Using Information Retri...
PDF
PDF
Construction of Keyword Extraction using Statistical Approaches and Document ...
Information Retrieval – Dense Vectors – Neural IR for Question Answering
IRJET - Document Comparison based on TF-IDF Metric
A novel approach for text extraction using effective pattern matching technique
Ju3517011704
Arabic text categorization algorithm using vector evaluation method
information retrieval term Weighting.ppt
Cc35451454
Course-Adaptive Content Recommender for Course Authoring
Comparative Study of Different Approaches for Measuring Difficulty Level of Q...
Comparative Study of Different Approaches for Measuring Difficulty Level of Q...
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...
Searching for Interestingness in Wikipedia and Yahoo! Answers
Exploiting Distributional Semantic Models in Question Answering
TF-IDF.pdf yyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
Oversampling vs. undersampling in TF-IDF variations for imbalanced Indonesian...
Aq35241246
Automated Question Paper Generator And Answer Checker Using Information Retri...
Construction of Keyword Extraction using Statistical Approaches and Document ...
Ad

More from ijistjournal (20)

PDF
MATHEMATICAL EXPLANATION TO SOLUTION FOR EX-NOR PROBLEM USING MLFFN
PPTX
Call for Papers - International Journal of Information Sciences and Technique...
PDF
3rd International Conference on NLP, AI & Information Retrieval (NLAII 2025)
PDF
SURVEY ON LI-FI TECHNOLOGY AND ITS APPLICATIONS
PPTX
Research Article Submission - International Journal of Information Sciences a...
PDF
A BRIEF REVIEW OF SENTIMENT ANALYSIS METHODS
PDF
14th International Conference on Information Technology Convergence and Servi...
PPTX
Online Paper Submission - International Journal of Information Sciences and T...
PDF
New Era of Teaching Learning : 3D Marker Based Augmented Reality
PPTX
Submit Your Research Articles - International Journal of Information Sciences...
PDF
GOOGLE CLOUD MESSAGING (GCM): A LIGHT WEIGHT COMMUNICATION MECHANISM BETWEEN ...
PDF
6th International Conference on Artificial Intelligence and Machine Learning ...
PPTX
Call for Papers - International Journal of Information Sciences and Technique...
PDF
SURVEY OF ANDROID APPS FOR AGRICULTURE SECTOR
PDF
6th International Conference on Machine Learning Techniques and Data Science ...
PDF
International Journal of Information Sciences and Techniques (IJIST)
PPTX
Research Article Submission - International Journal of Information Sciences a...
PDF
SURVEY OF DATA MINING TECHNIQUES USED IN HEALTHCARE DOMAIN
PDF
International Journal of Information Sciences and Techniques (IJIST)
PPTX
Online Paper Submission - International Journal of Information Sciences and T...
MATHEMATICAL EXPLANATION TO SOLUTION FOR EX-NOR PROBLEM USING MLFFN
Call for Papers - International Journal of Information Sciences and Technique...
3rd International Conference on NLP, AI & Information Retrieval (NLAII 2025)
SURVEY ON LI-FI TECHNOLOGY AND ITS APPLICATIONS
Research Article Submission - International Journal of Information Sciences a...
A BRIEF REVIEW OF SENTIMENT ANALYSIS METHODS
14th International Conference on Information Technology Convergence and Servi...
Online Paper Submission - International Journal of Information Sciences and T...
New Era of Teaching Learning : 3D Marker Based Augmented Reality
Submit Your Research Articles - International Journal of Information Sciences...
GOOGLE CLOUD MESSAGING (GCM): A LIGHT WEIGHT COMMUNICATION MECHANISM BETWEEN ...
6th International Conference on Artificial Intelligence and Machine Learning ...
Call for Papers - International Journal of Information Sciences and Technique...
SURVEY OF ANDROID APPS FOR AGRICULTURE SECTOR
6th International Conference on Machine Learning Techniques and Data Science ...
International Journal of Information Sciences and Techniques (IJIST)
Research Article Submission - International Journal of Information Sciences a...
SURVEY OF DATA MINING TECHNIQUES USED IN HEALTHCARE DOMAIN
International Journal of Information Sciences and Techniques (IJIST)
Online Paper Submission - International Journal of Information Sciences and T...
Ad

Recently uploaded (20)

PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
737-MAX_SRG.pdf student reference guides
PDF
composite construction of structures.pdf
PDF
PPT on Performance Review to get promotions
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PPTX
OOP with Java - Java Introduction (Basics)
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PPT
Project quality management in manufacturing
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PPTX
Construction Project Organization Group 2.pptx
PPTX
Geodesy 1.pptx...............................................
PPTX
Fundamentals of safety and accident prevention -final (1).pptx
PPTX
additive manufacturing of ss316l using mig welding
Automation-in-Manufacturing-Chapter-Introduction.pdf
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
Foundation to blockchain - A guide to Blockchain Tech
737-MAX_SRG.pdf student reference guides
composite construction of structures.pdf
PPT on Performance Review to get promotions
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
OOP with Java - Java Introduction (Basics)
Model Code of Practice - Construction Work - 21102022 .pdf
Project quality management in manufacturing
CYBER-CRIMES AND SECURITY A guide to understanding
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Internet of Things (IOT) - A guide to understanding
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
Construction Project Organization Group 2.pptx
Geodesy 1.pptx...............................................
Fundamentals of safety and accident prevention -final (1).pptx
additive manufacturing of ss316l using mig welding

Quantification of Portrayal Concepts using tf-idf Weighting

  • 1. International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.5, September 2013 DOI : 10.5121/ijist.2013.3501 1 Quantification of Portrayal Concepts using tf-idf Weighting S. Florence Vijila 1 and Dr. K. Nirmala 2 1 Research Scholar, Manonmaniam Sundaranar University, Tamil Nadu, India, Assistant Professor , CSI Ewart Women’s Christian College, Melrosapuram , TamilNadu 2 Associate Prof. of Computer Applications, Quaid-e-Millath Govt. College for Women, Chennai, Tamil Nadu, India ABSTRACT Term frequencies and inverse document frequencies have been successfully applied in determining weighting for document rankings. However these have been more successful in text mining and in extraction techniques used in the web. Concept mining has become increasingly popular in the research and application areas of Computer Science. This paper attempts to demonstrate the limited usage of term frequency and inverse document frequency for the application of weighting calculations for ranking documents that are based on concept quantifications. The case study considered for experiment in this paper, is based on concept terms of David Merrill’s First Principles of Instruction (FPI). Merrill’s FPI applies cognitive structures explicitly for analyzing instructional materials. Therefore it is justified that the terms categorized under each cognitive structure (or portrayal) of FPI can be taken as respective concept of that portrayal. As question papers are representative of cognitive structures in a more clear and logical way, four question papers on ‘C Language’ have been considered for the experimental study, that are detailed in this paper. Manual method has been adopted for the computation of quantities of portrayals in selected documents for the purpose of comparative study. As manual method is accurate, the values (results) are considered as benchmark values. These benchmark values are considered for comparing with normalized term frequencies that are derived (experimented) from automated extractions from the same selected documents. The study is however limited to four documents only. Conclusions are drawn from this experimental study, which will be of immense use to concept mining researchers as well as for instructional designers. KEYWORDS concept keywords; document ranking; term frequency weighting; cognitive structures. 1. INTRODUCTION Literatures are aplenty for document ranking using term weighting. Documents are retrieved using keywords and documents are ranked according to keyword density (Masaru Ohba et al - 2005, SA-Kwang Song et al -2012). According to documented literature, ranking of documents based on term weighting has been a key research issue in information retrieval, particularly retrieving concepts in addition to keywords. Applying weighting schemes using term frequency and inverse document frequency are crucial for ranking of documents (Sa-Kwang Song et al- 2012). A concept keyword is a word that represents a concept. But the nature of concept and the
  • 2. International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.5, September 2013 2 meaning a particular keyword carries are subjective judgment, as per these authors. The authors have adapted term frequency for mining concepts. Both human selected concepts and automated methodologies have been successfully adopted. But research publications point out that concept mining ultimately depends only upon manual methods, when accuracy is needed to a high level. In other words, automated methods would only provide approximate results. The tf-idf weighting (term frequency-inverse document frequency, a.k.a.TF-IDF) is a numerical statistic which reflects how important a word is to a document in a collection or corpus (Salton et al - 1988). By convention, the tf-idf value increases proportionally to the number of times a word appears in a document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others. In view of this, the factor itself may be considered as a weighting factor in information retrieval and text mining. This paper hence attempts to determine the acceptability of tf-idf, through experimentation, on quantifying concepts from documents. Comparative study will be carried out from the results of manual methods. In concept mining techniques, variations of the tf-idf weighting scheme are often used by some search engines as a central tool in scoring and ranking a document’s relevance given by a user query. Tf-idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification. This paper attempts to quantify well proven conceptual terms, such as cognitive structures from instructional documents. David Merrill (2007), in his ‘First Principles of Instruction’ (FPI) divides any instructional event into four phases, which he calls ‘Activation’, ’Demonstration’, ’Application’ and ‘Integration’. Central to this instructional model is a real-time problem-solving theme, called ‘problem’. Merrill suggests that fundamental principles of instructional design should be relied on and these apply regardless of any instructional design model used. Violating this would produce a decrement in learning and performance. Each phase is a cognitive portrayal for learning a concept according to this author. Therefore it would be very apt to consider cognitive structure or portrayal terms as concepts for the purpose of quantification. We propose to rank order four sets of concept keywords of these four cognitive structures (portrayals) of Merrill’s FPI. The rank ordering would be carried out using normalized term frequency (tf) and inverse document frequency (idf) on selected University question paper documents of “C Language’. To ascertain the degree of accuracy (the objective of this paper), we propose to compare the results with manual methods. Conclusions have been drawn from this experimental study. The results will be of immense use to concept mining methodologists and for instructional designers. The material of this paper forms a part of a whole research programme on concept based quantification of portrayals by the authors. As these portrayals are taken for concept quantification, the meaning of each portrayal is essential to comprehend before attempting to experiment with. 2. ‘ACTIVATION’ PORTRAYAL ON EVALUATION DOCUMENT OF ‘C’ LANGUAGE The ‘Activation’ portrayal of FPI specifies, “ Does the question direct students to recall, remember, repeat or recognize basic and fundamental knowledge of ‘C’ from the relevant past
  • 3. International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.5, September 2013 3 behavior (experience) of the students that can be used as a foundation for the acquisition of new (and further advanced) knowledge of ‘C’?“. Based on the above definition and on the past ten year question paper patterns analyzed by the authors, the following concept keywords have been specifically arrived at for the computation of term frequency: “list, define, tell, name, locate, identify, what, give, distinguish, acquire, underline, relate, state, recall, select, repeat, recognize, reproduce and measure”. 3. ‘DEMONSTRATION’ PORTRAYAL ON EVALUATION DOCUMENT OF ‘C’ LANGUAGE The ‘Demonstration’ portrayal of FPI specifies, “Does the question directs the students to demonstrate of what has he learnt rather than merely provide information (unlike ’Activation’) about what is needed as foundation for further techniques of ‘C’?“. Based on the above definition and the past ten year question papers pattern analyzed by the authors, the following concept keywords have been exclusively arrived at for the computation of term frequency: “demonstrate, summarize, illustrate, interpret, contrast, predict, associate, distinguish, identify, show, label, collect, experiment, recite, classify, discuss, select, compare, translate, prepare, change, rephrase, differentiate, draw, explain, estimate, fill in, choose, operate, perform, organize and write”. 4. ‘APPLICATION’ PORTRAYAL ON EVALUATION DOCUMENT OF ‘C’ LANGUAGE The ‘Application’ portrayal of FPI specifies, ”Does the question provides an opportunity for student to apply her acquired knowledge of ‘C’ to solve a problem?”. Based on this definition and from analysis carried out authors on the past ten year question papers, the following concept keywords have been exclusively arrived at for the computation of term frequencies: “apply, calculate, find, solve, illustrate, make, predict, construct, assess, practice, restrictive, classify, code, develop, generate, write”. 5. ‘INTEGRATION’ PORTRAYAL ON EVALUATION DOCUMENT OF ‘C’ LANGUAGE The ‘Integration’ portrayal of FPI specifies, ”Does the question triggers and encourages the student to integrate (transfer) her acquired knowledge of ‘C’ to a new situation in any critical and complex situation?”. Based on this definition and from the study on past ten year question papers, the following concept keywords have been arrived at exclusively for the computation of term frequencies: “analyze, solve, justify, infer, combine, integrate, plan, generalize, assess, decide, rank, grade, recommend, contrast, survey, examine, investigate, compose, invent, improve, imagine, hypothesize, predict, evaluate, rate, how and why”.
  • 4. International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.5, September 2013 4 Using these keywords, quantified benchmark values on the selected documents have been arrived at through manual methods. These benchmark values would be compared with automated term frequencies so as to arrive at conclusions. 6. COMPUTATION OF BENCHMARK VALUES BY MANUAL METHOD Based on the definitions of the portrayals of FPI (four cognitive structures), quantification of concept terms in terms of these four cognitive structures on the selected four question paper documents (samples) were carried out using concept keywords as provided above. Thus, the addition of all values (in percentages) would yield to 100%, as the method adapted is pure manual by the authors. Therefore the values need to be exact and hence taken as benchmark values for comparative studies. The values are provided in Table 1. Portrayal Documents Average Q1 Q2 Q3 Q4 Activation 48% 27% 30% 31% 34% Demonstration 22% 28% 42% 33% 31% Application 26% 36% 21% 29% 28% Integration 4% 9% 7% 7% 7% Table 1. Benchmark values in percentages computed manually 7. RESULTS AND DISCUSSIONS ON WEIGHTING USING TERM FREQUENCIES The four selected question papers (documents) of the subject ‘C Language’ have been used for the intended analysis. The normalized term frequency and inverse document frequencies on these selected four concept keywords, namely ‘Activation’, ’Demonstration’, ’Application’ and ’Integration’ using their respective keywords have been used for the calculation of two frequencies. These term frequencies are used for determining the weighting of these four portrayals that we have computed through inverse document frequencies. The total number of words, the most frequented word and the number of concepts (different keywords of each concept or portrayal) are used for the calculation of weightings (Wikipedia 2012). They are computed and presented for each document (i.e. question paper). They are presented in a nut shell in Table 2. The basic variables for term frequency computations are: Number of concept terms occurring in one document = i; Number of most frequently occurred word in one document = n; Normalized term frequency, tf = i/n ;_______________(1) The basic data for term frequencies are presented in Table 2.
  • 5. International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.5, September 2013 5 Document No. Total words No. of most frequented word(n) No. of concept terms appearing (i) Activation Demonstration Application Integration Q1 187 15 11 5 6 1 Q2 206 25 6 7 8 2 Q3 184 16 9 12 6 2 Q4 173 21 6 10 8 2 Total no. of concept terms appearing 32 34 28 7 Table 2.Basic Data for Term Frequency The total number of documents considered is N, which is 4. The normalized term frequencies for each concept term of each document is presented in Table 3 (refer to equation 1). Concept Normalized term frequency Total Average Q1 Q2 Q3 Q4 Activation 0.7333 0.2400 0.5625 0.2857 1.8215 0.4554 Demonstration 0.3333 0.2800 0.7500 0.4762 1.8395 0.4599 Application 0.4000 0.3200 0.3750 0.0952 1.1902 0.2976 Integration 0.0667 0.0800 0.1250 0.952 0.3669 0.0917 Table 3. Normalized Term Frequencies The inverse document frequency is computed as: idf = log (N/K) ;_______________(2) Where N = Total number of documents, which is 4 and K: Number of occurrences of terms in all the documents considered (see total concept terms appearing in Table 2). The final results are :- Activation = -0.903089 Demonstration = -0.92959 Application = -0.845098 Integration = -0.243038 The negative sign is due to the fact that the total number of documents considered is less than the occurrences of terms in all the documents. The objective of this research work is to compare these frequency values of concept terms with respect to the benchmarks that were calculated based on manual methods. As the study is limited to comparisons, the idf is considered with modular values. Thus weighting would be tf.│log (N/K)│;_______________(3)
  • 6. International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.5, September 2013 6 Accordingly, the weighting for each concept term of each document is computed and presented in Table 4 Concept Weighting of terms Average Q1 Q2 Q3 Q4 Activation 0.6622 0.2167 0.5080 0.2580 0.4113 Demonstration 0.3098 0.2603 0.6972 0.4427 0.4275 Application 0.3380 0.2704 0.3169 0.0805 0.2515 Integration 0.0162 0.0194 0.0304 0.2313 0.0223 Table 4. Weighting of portrayals in each document The above values yield important conclusions. 8. CONCLUSIONS The largest and the least present cognitive portrayals in every document as per benchmark values tally well with weighting of term frequency of all question paper documents considered. It is thus concluded that normalized term frequencies may be used for quantifying concept terms in individualized documents to an acceptable standards. However the benchmark values deviate from that of weighting term frequency on average values. This clearly shows that unlike quantification of keywords, quantifying concept words may not be reliable when inverse document frequency is computed on small number of documents. The work may be extended to large number of documents for further study. REFERENCES [1] Salton G, Buckley C (1988), "Term-weighting approaches in automatic text retrieval". Information Processing and Management 24 (5): 513–523, 1988. [2] Sa-kwang Song, and Sung Hyon Myaeng, (2012), “A novel term weighting scheme based on discrimination power obtained from past retrieval results.”, Information Processing and Management 48 (2012) 919–930, 2012. [3] Masaru Ohba and Katsuhiko Gondow, (2005), “Toward Mining ‘Concept Keywords’ from Identifiers in Large Software Projects”, ACM SIGSOFT Software Engineering Notes 30(4): 1-5, 2005. [4] Merrill M.D., (2002), “First Principles of Instruction”, Englewood Cliffs, NJ: Educational Technology Publications, 2002. [5] Wikipedia (2012), http://en.wikipedia.org/wiki/Tf*idf, 2012 Authors S.Florence Vijila is working as Assistant Professor in the Department t of Computer Science at CSI Ewart Women’s Christian College,Tamil Nadu for the past 12 years.Her qualification is M.CA,M.Phil,Now she is doing Ph.D in the area of “Data Mining”.Her work have been published in the International conference on ”Innovations in contemporary IT research “ proceedings.