International Journal of Electrical and Computer Engineering (IJECE)
Vol. 14, No. 1, February 2024, pp. 711~720
ISSN: 2088-8708, DOI: 10.11591/ijece.v14i1.pp711-720  711
Journal homepage: http://ijece.iaescore.com
Sensing complicated meanings from unstructured data: a novel
hybrid approach
Shankarayya Shastri1
, Veeragangadhara Swamy Teligi Math2
, Patil Nagaraja Siddalingappa3
1
Department of Computer Science and Engineering, GM Institute of Technology Davangere, Visvesvaraya Technological University,
Belagavi, India
2
Department of Computer Science and Engineering, RYM College of Engineering Ballari, Visvesvaraya Technological University,
Belagavi, India
3
Department of Information Science and Engineering, Bapuji Institute of Engineering and Technology Davanagere, Visvesvaraya
Technological University, Belagavi, India
Article Info ABSTRACT
Article history:
Received Aug 11, 2023
Revised Sep 13, 2023
Accepted Sep 14, 2023
The majority of data on computers nowadays is in the form of unstructured
data and unstructured text. The inherent ambiguity of natural language
makes it incredibly difficult but also highly profitable to find hidden
information or comprehend complex semantics in unstructured text. In this
paper, we present the combination of natural language processing (NLP) and
convolution neural network (CNN) hybrid architecture called automated
analysis of unstructured text using machine learning (AAUT-ML) for the
detection of complex semantics from unstructured data that enables different
users to make understand formal semantic knowledge to be extracted from
an unstructured text corpus. The AAUT-ML has been evaluated using three
datasets data mining (DM), operating system (OS), and data base (DB), and
compared with the existing models, i.e., YAKE, term frequency-inverse
document frequency (TF-IDF) and text-R. The results show better outcomes
in terms of precision, recall, and macro-averaged F1-score. This work
presents a novel method for identifying complex semantics using
unstructured data.
Keywords:
Convolutional neural network
Natural language processing
Information extraction
Unstructured text
Complex semantics
This is an open access article under the CC BY-SA license.
Corresponding Author:
Shankarayya Shastri
Department of Computer Science and Engineering, GM Institute of Technology Davangere, Visvesvaraya
Technological University
Belagavi, India
Email: shankarayyasresearch@gmail.com
1. INTRODUCTION
Unstructured text poses difficulties for computer programs due to its lack of a clear structure or
defined data model [1]. This makes it unsuitable for conventional database models, causing issues in storage,
management, and indexing due to the absence of a schema and unclear structure. Consequently, search
results are less accurate due to the absence of predefined attributes. In today's digital landscape, the volume
of diverse unstructured text is increasing, generated from sources like web pages, research papers, and
articles [2]. This growth is driven by advancements in web technologies and text extraction tools [3]. While
semantic technologies and text mining systems aid in linking text to knowledge, processing complex
language and extracting intricate insights remains a challenge [4]. Information extraction (IE) algorithms help
extract knowledge by identifying references and relationships between entities [5]. Yet, deeper insights
require natural language processing (NLP) techniques, including convolutional neural networks (CNNs),
 ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 711-720
712
which are widely used in image processing and text classification, to effectively capture unstructured
information from the text [6].
In NLP, CNNs are applied to categorization problems, using sliding windows for data analysis. Soni
et al. [7] have presented a Text ConvoNet which is used for classifying the multi and binary-class text classes
using CNN. This work has used intra-sentence 𝑛-gram characteristics for classifying the text. Song et al. [8]
have presented a TextCNN technique for classifying the text corpus. In study [9], a model has been presented
for classifying the text using news text using CNN. Fesseha et al. [10] used CNN with continuous bag-of-
words, FastText, and word to vector (Word2Vec) for evaluating the text from news. Lu et al. [11] have
proposed a model that uses CNN, bidirectional-encoder-representation using transformers (BERT),
transformer encoder, and four neural-networks, recurrent-neural-network (RNN), long-short term-memory
(LSTM), gated-recurrent-unit (GRU) and Bi-directional LSTM (Bi-LSTM) for classification of summary
notes of the patients. Zulqarnain et al. [12] examined three architectures, CNN, RNN, and deep belief neural
(DBN) for text classification. All these models have utilized CNN and achieved better results, but failed to
classify the complex semantics when the data is unstructured.
Further, by using NLP and CNN techniques, complex semantics within unstructured text documents
can be detected and analyzed. The semantic analysis involves important elements such as hyponymy, which
connects generic terms (hypernyms) to their instances (hyponyms) for instance, "color" and its variations.
Homonymy refers to words with the same spelling but different meanings, like "bat." Polysemy involves
words with distinct yet related meanings, such as "bank." Synonymy pertains to words with similar
meanings, like "author/writer." Antonymy represents symmetrical relationships between opposite words.
Semantic analysis finds applications in chatbots, support systems, sentiment analysis, search engines,
translation, Q&A, and grammar identification. This study uses this knowledge to propose automated analysis
of unstructured text using machine learning (AAUT-ML), a hybrid NLP and CNN approach that can detect
complicated semantics in unstructured data and is evaluated by measures of recall, precision, and F1-scores.
This work makes the following contributions; i) Classify unstructured text into their respective domains using
the NLP method, ii) Using 𝑛-gram detectors, 1-dimensional convolving filters will be employed, each of
which focuses on a certain family of closely related 𝑛-grams, iii) The appropriate 𝑛-grams are extracted for
decision-making through max-pooling over time, and iv) Based on data from Max-pooling, the rest of the
network extracts hidden or complex semantics from unstructured text.
This paper will have the following outline. In Section 2, the existing approaches are discussed. In
Section 3, the proposed methodology has been presented. In Section 4, the presented methodology is
evaluated and compared with the existing works. Finally, in Section 5, the conclusion and future work of the
complete work have been presented.
2. LITERATURE SURVEY
This section presents the different reviews, strategies, architectures, and methodologies used for
classifying the text. There has not been a lot of effort done to examine the drawbacks of information
extraction (IE) across all tasks and data kinds in a unified study. Hence, Adnan and Akbar [13], overcomes
that barrier by providing a comprehensive literature evaluation of cutting-edge methods for handling all
forms of big data and texts. Additionally, current issues with IE are highlighted and briefly discussed.
Solutions are suggested, along with directions for future study in the field of text IE. In light of the current
approaches and difficulties in the analysis of large amounts of data, the study is important. The findings and
suggestions offered here have analyzed large amounts of data more efficiently overall. Gupta et al. [14]
presented a work for extracting the relations using unsupervised machine learning by utilizing the radiology
reports of patients. For implementing this work, they have used a hybrid approach, i.e., dependency-based
parsed trees which will extract the IE. This work achieved an F-score of 94 percent when tested on
mammography data of patients. Chang and Mostafa [15] introduced a novel supervised machine learning
model and utilized the systematized nomenclature of medicine-clinical terms (SNOMED CT) [16] for
evaluating their model. This work was compared with the 2018 N2C2 model [17]. The proposed model
achieved a score of 0.933 when compared with [17]. Adnan and Akbar [18] has surveyed text IE using
unstructured data. First, it provides a consolidated summary of IE methods across different types of
unstructured data (including written content, visual content, audio content, and videos). Furthermore, it
explores how the diversity, dimensions, and amount of unstructured large data pose challenges to the
aforementioned established IE methods. Furthermore, possible approaches to enhance the unstructured large
data IE platforms for future study are offered. Tekli et al. [19] introduced a model, SemIndex+ for classifying
the partly structured, structured, and unstructured data. They have used weight functions for classifying the
text. The SemIndex+ achieved better results in terms of precision when compared with the existing works.
Int J Elec & Comp Eng ISSN: 2088-8708 
Sensing complicated meanings from unstructured data: A novel hybrid approach (Shankarayya Shastri)
713
From the above study, it can be seen that very little work has been done on classifying the
unstructured data directly, without data cleaning and data pre-processing. In addition, comprehensive research
is offered in [13], which discusses methods for data extraction from unstructured data. Furthermore, [14]–[17]
none of these papers have dealt with the issue of data extraction from unstructured sources. Both [18], [19]
discuss methods for extracting information from unstructured sources, but neither paper examines these
methods in relation to other datasets. The models [18], [19] may work properly for the given respective
datasets given in [18], [19]. Hence in this work, we utilize the NLP and CNN and build a model called AAUT-
ML to extract the text from the unstructured data. This work will classify the unstructured complex semantics
into their respective domains using the NLP method. Also, by using 𝑛-gram detectors, 1-dimensional
convolving filters are employed, each of which focuses on a certain family of closely related 𝑛-grams. The
appropriate 𝑛-grams are extracted for decision-making through max-pooling over time. Based on data from
Max-pooling, the rest of the network extracts hidden or complex semantics from unstructured text.
3. METHODOLOGY FOR SENSING COMPLICATED MEANINGS FROM UNSTRUCTURED
TEXT FROM UNSTRUCTURED DATA USING NLP AND CNN TECHNIQUE
In this two-phase proposed work, we are trying to detect complex semantics from unstructured text
using NLP and CNN Techniques. In the first phase (NLP) we are pre-processing, filtering, and classifying
unstructured text to specific data domains. In the second phase (CNN), we try to figure out how CNN handles
text after which we use that knowledge to discover complex semantics in unstructured text. The main aim is
to satisfy the objectives that are listed as follows:
a. To classify unstructured text into their respective domains using the NLP method.
b. As 𝑛-gram detectors, 1-dimensional convolving filters are employed, each of which focuses on a certain
family of closely related 𝑛-grams.
c. The appropriate 𝑛-grams are extracted for decision-making through max-pooling over time.
d. Based on data from Max-pooling, the rest of the network extracts hidden or complex semantics from
unstructured text.
In this experiment setup, we are taking sample input datasets from computer science domains such
as Databases, Operating systems, and Data mining unstructured text files in .txt format. Tokenization, stop
word removal, and rare word removal functions must be applied to the input unstructured documents as part
of the pre-processing step. Pre-processing eliminates missing or inconsistent data values brought on by
human or technological faults. Pre-processing can make a dataset's accuracy and quality more precise,
dependable, and consistent. The proposed work will be carried out in two distinct phases. The detailed
procedures for both Phase 1 along Phase 2 are outlined below. Figure 1 provides a comprehensive block-
diagram of the novel developed architecture.
Figure 1. Novel developed block-diagram for NLP and CNN for Text classification and detection of complex
semantics
 ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 711-720
714
Phase 1. Pre-processing and filtering candidate concepts
Step 1: Given a collection of unstructured text files, denoted as 𝐷 = {𝑑1, 𝑑2, 𝑑3, … 𝑑𝑚}, describing a
collection of concepts denoted as 𝐶 = {𝑐1, 𝑐2, 𝑐3, … . 𝑐𝑚}, where both n and m are greater than 1, the
objective is to identify and categorize the most pertinent concepts from 𝐶. The initial step involves
tokenizing the unstructured textual documents, represented as 𝑑𝑖 ∈ 𝐷, into 𝑛-grams that serve as the
preliminary candidate concepts, denoted as 𝑐𝑖 ∈ 𝐶.
𝑑𝑖 ∈ 𝐷 > 𝑇𝑜𝑘𝑒𝑛𝑖𝑧𝑎𝑡𝑖𝑜𝑛{𝑐1, 𝑐2, 𝑐3, … , 𝑐𝑚}
Step 2: In these 𝑛-grams, common stop words are eliminated based on a predefined stop-word list, and the
occurrences of the remaining candidates 𝑐𝑖 in 𝐷 are tallied to generate a set of tuples denoted as
𝑠 = {(𝑐𝑖, 𝑓𝑖), … . }, comprising each candidate 𝑐𝑖 and its corresponding frequency 𝑓𝑖.
Step 3: To diminish noise, 𝑛-grams with low frequency, short length, and infrequent occurrences are
eliminated from both the set of 𝑛-grams (𝑇) and the concept set (𝐶) by applying a specified frequency
threshold 𝑓𝑡.
Step 4: Only meaningful unigrams (𝑛 >= 1) are retained only if it occurs at-least 𝑓𝑡 times more frequently
than any larger 𝑛-gram containing the same unigram within it (refer to Figure 1).
Step 5: In cases where multiple higher-order 𝑛-grams 𝑐𝑗 encompass 𝑐𝑖, 𝑓𝑖 that makes reference to the highest
frequently encountered 𝑐𝑗, then the remaining 𝑛-grams in 𝐶 are merged based on two rules: firstly,
plural tokens are filtered out if their singular form is also present as shown in Figure 1, and secondly,
current participle of a normal verb is not used if there is an alternative form that does not include it.
Step 6: Among the remaining candidates in 𝐶, those without a corresponding 𝐷𝐵𝑝𝑒𝑑𝑖𝑎 entry is filtered out.
Step 7: Initial context information denoted as 𝐶𝑖𝑛𝑓𝑜, is generated, and the documents 𝑑𝑖 are classified using
specific-domain.
Phase 2. Sensing complicated meanings
Following is how CNN works for text processing. The three-layer CNN framework is used in this
work's implementation. CNN's fundamental capabilities are analogous to those found in the visual-cortex in
the brains of animals. CNN performs admirably in text-classification tasks. Classifying texts follows a
process similar to those of categorizing images, with the exception that words are represented by vectors
within a matrix rather than pixels.
3.1. Target-function
Learning-capable neuron biases and weights are used throughout target-function implementation. To
generate an outcome, neurons receive multiple inputs, carry out a weighted average over those inputs, and
then send that value through an activation mechanism. By sending the network's outcome through the
softmax layer, a loss-function can usually be determined for the entire system. The outcome of a network
with a softmax layer, which has full connectivity, is down-sampled.
3.2. Representation
CNN's initial level consists of an embedding level, which transforms word-indices into three-
dimensional vectors. These vectors are discovered using the equivalent of a lookup-table. When considering
a sentence represented as 𝑁, and words represented as 𝑊, each word is translated through its associated
embedding and the highest sentence size 𝑉𝑤𝑟𝑑
is used to determine the vocabulary-size. Once all the words
have been converted to vectors, they are sent through the convolution-layer.
3.3. System structure
The developed architecture has three distinct stages. The architecture consists of two layers: the
Embedding level, that maps words to embedded vectors, and then the Convolution level, that performs most
part of the approach work. The sentence matrix is processed by a set of preset filters, which reduces its
dimensions. The softmax level, the final one, acts as a downsampling level that can both reduce the sentence-
matrix and compute the loss-function. The sentence's word-embedding can be obtained using the embedded
word lookup-table. To ensure that every single sentence is handled fairly, the matrix produced by the
embedding component remains padded. Once the filters have been established, the matrix continues to be
reduced and convolved-features are going to be generated. These complex characteristics are finally
simplified. As a next step in down-sampling, the resultant data from the convolved-features is distributed
across the maximum pooling level. Various sized and shaped filtration are specified. Three, four, and five are
the filter shapes employed by the suggested approach. Following that, padding is applied to each of the
Int J Elec & Comp Eng ISSN: 2088-8708 
Sensing complicated meanings from unstructured data: A novel hybrid approach (Shankarayya Shastri)
715
embedded sentences so that the resulting sentence matrix possesses an identical shape and size. Word vectors
𝑤1, … , 𝑤𝑛 ∈ 𝑅𝑑
are the result of embedding every symbol in the 𝑛-words text being entered in the form of
d-dimensional vector data. The generated 𝑑 × 𝑛 matrix can be utilized to transmit a sliding-window across
the text within a convolutional level. In accordance with each l-word 𝑛-gram.
𝑢𝑖 = [𝑤𝑖, … , 𝑤𝑖+𝑙−1] ∈ 𝑅𝑑×𝑙
; 0 ≤ 𝑖 ≤ 𝑛 − 𝑙 (1)
where matrix 𝐹 ∈ 𝑅𝑛×𝑚
. Max-pooling applied along the 𝑛-gram dimension yields 𝑝 ∈ 𝑅𝑚
. The non-linearity
of rectified linear unit (ReLU) is used to process 𝑅𝑚
. The distribution across the classes used for
classification is then generated by a linearly fully connected layer 𝑊 ∈ 𝑅𝑐×𝑚
, that then outputs the class with
the highest strength. In execution, we employ a range of window widths, from 𝑙 ∈ 𝐿, 𝐿 ∈ 𝑁, by chaining
together the outputs 𝑝𝑙
vectors of numerous convolution levels. It is important to take into account that the
procedures described here also work for dilated convolutions. This is represented as (2) to (6).
𝑢𝑖 = [𝑤𝑖, … , 𝑤𝑖+𝑙−1] ∈ 𝑅𝑑×𝑙
; 0 ≤ 𝑖 ≤ 𝑛 − 𝑙 (2)
𝑢𝑖 = [𝑤𝑖, … , 𝑤𝑖+𝑙−1] (3)
𝐹𝑖𝑗 = 〈𝑢𝑖,𝑓𝑖〉 (4)
𝑝𝑗 = 𝑅𝑒𝐿𝑈(𝑚𝑎𝑥𝐹𝑖𝑗) (5)
𝑜 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑊
𝑝) (6)
3.4. Identification of important features
According to conventional wisdom, filters can be thought of as 𝑛-gram detectors, with every filter
looking for a unique category of 𝑛-grams and marking them with high-scores. After the max-pooling process,
only the 𝑛-grams with the most favorable scores remain. Once the total number of 𝑛-grams within the max-
pooled vector (which is represented through the collection of matching filters) has been determined, a
conclusion can be drawn. Any filter's high-scoring 𝑛-grams (compared with the way it ranks similar
𝑛-grams) should be considered to be particularly useful for text classification. In this subsection, we enhance
this perspective by posing and aiming to respond to the following issues: what data underlying 𝑛-grams can
be obtained within the max-pooled vector, and in what manner is it utilized in the last classification?
3.5. Informative vs. uninformative 𝒏-grams
The pooled vector 𝑝, that belongs to the 𝑚-dimensional real space 𝑅𝑚
, i.e., 𝑝 ∈ 𝑅𝑚
serves as the
foundation for the classification process. Every value of 𝑝𝑗 is derived through the ReLU applied to the
maximum inner product between the 𝑛-gram 𝑢𝑖 along with the filter 𝑓𝑗. These values could be attributed to
the specific 𝑛-gram 𝑢𝑖, which consists of a sequence of words [𝑤𝑖, … , 𝑤𝑖+𝑙−1] , which activated the filter. The
collection of 𝑛-grams that contribute to the overall probability distribution 𝑝 can be denoted as 𝑆𝑝. 𝑛-grams
that are not present in the collection 𝑆𝑝 cannot have any effect on the decision-making process within the
classifier. However, it is imperative to consider the presence of 𝑛-grams within the set 𝑆𝑝. In prior
investigations into the prediction-based analysis of CNN for text, the focus has been on identifying the
𝑛-grams found within the input sequence, denoted as 𝑆𝑝, and evaluating their respective rankings as a method
of understanding the underlying prediction process. In this context, we adopt an additional complicated
perspective. It is important to highlight the following: the ultimate categorization process does not
necessarily consider the specific 𝑛-gram identities, but rather evaluates these individuals based on the
rankings allocated through the filters. Therefore, it is imperative that the data contained in variable 𝑝 is
contingent upon the allocated rankings. From a conceptual standpoint, the 𝑛-grams within the set 𝑆𝑝 can be
categorized through two distinct classes: accidental and deliberate. The presence of deliberate 𝑛-grams in
𝑆𝑝 can be attributed to their higher rankings assigned by the filtering mechanism. This suggests that these
n-grams possess valuable information that is relevant to the ultimate decision-making process. In contrast, it
is observed that accidental 𝑛-grams, regardless of possessing a relatively low ranking, manage to find their
way into the set 𝑆𝑝. This occurrence can be attributed to the absence of an additional 𝑛-gram that achieved a
higher ranking compared to them. Based on the analysis conducted, it is evident that the 𝑛-grams in question
do not appear to possess significant informational value in relation to the classification selection at hand. Is it
possible to distinguish and differentiate between intentional and unintentional 𝑛-grams? It is postulated that
within the framework for every filter, a discernible threshold exists. Values surpassing this threshold are
 ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 711-720
716
indicative of useful information pertaining to the classification process, whereas values falling beneath the
threshold are deemed inaccurate and are therefore disregarded for classification purposes. Formally, given
threshold dataset (𝑋, 𝑌).
𝑝𝑢𝑟𝑖𝑡𝑦(𝑓, 𝑡) =
| {(𝑥,𝑦) ∈ (𝑋,𝑌)𝑓 | 𝑥 ≥𝑡 & 𝑦=𝑡𝑟𝑢𝑒 }|
| {(𝑥,𝑦) ∈ (𝑋,𝑌)𝑓 | 𝑥 ≥𝑡 }|
(7)
Based on our empirical findings, we determine that an ideal purity-value of 0.75 is optimal when
determining the threshold of a given filter. Further, the results of the proposed work have been evaluated
using three datasets and in terms of recall, precision, and macro averaged F1-score. The results are discussed
in the next section.
4. RESULTS AND DISCUSSION
This section commences by providing a comprehensive overview of the system requirements,
followed by a detailed examination of the dataset employed in the study. Additionally, the performance
metrics utilized to evaluate the system's efficacy are thoroughly examined. The results obtained from the
proposed methodology were subsequently compared to those of previous studies, specifically on the basis of
recall, precision, and macro-averaged F1-score. The inclusion of a comprehensive discussion section within
the present work serves to provide a thorough analysis and interpretation of the obtained results.
4.1. System requirements, datasets and performance metrics
In this section, we conduct a series of experiments on Phase 1 and Phase 2. The code was written in
Python and a system having Windows 10 operating system, 16 GB RAM was used for executing the code. In
this approach three different datasets namely data mining (DM) [20], operating system (OS) [21], and data
base (DB) [22], [23] datasets are used for experiment purposes. The DM, OS, and DB were created in [24].
The performance of the presented approach AAUT-ML is investigated with different existing available
approaches i.e., YAKE [15], TF-IDF [25], and TextR [26]. All the results and datasets have been taken from
[24]. Results from the provided AAUT-ML are analyzed, and metrics such as recall, precision, and F1-score
are used to determine how well the model performs. By dividing the sum of all anticipated sequences within
a positive category by the precision, we can determine how many positive categories were successfully
anticipated. In (9) can calculate recall, that can be described as the proportion of correctly predicted positive
outcomes compared to actual positive outcomes. In machine learning, the F1-score is a crucial evaluation
metric. It neatly summarizes a model's prediction ability through the combination of the precision and recall
measures, which are frequently assessed using (10).
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
𝐾𝑒𝑦𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑒𝑑
𝐾𝑒𝑦𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑒𝑑
(8)
𝑅𝑒𝑐𝑎𝑙𝑙 =
𝐾𝑒𝑦𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑒𝑑
𝐾𝑒𝑦𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑒𝑑
(9)
𝐹1 − 𝑆𝑐𝑜𝑟𝑒 =
2 ×𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ×𝑅𝑒𝑐𝑎𝑙𝑙
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛+𝑅𝑒𝑐𝑎𝑙𝑙
(10)
where 𝐾𝑒𝑦𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑒𝑑 is the sum of all the predicted key-phrases that were found to be a good match with the
standard key-phrases, and 𝐾𝑒𝑦𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑒𝑑 is the sum of all the predicted key-phrases from the document.
4.2. Precision
In Figure 2, the precision has been evaluated and compared with the YAKE, TF-IDF, and Text-R
models. For the DM dataset, the AAUT-ML performed better when compared to YAKE, TF-IDF, and TextR
by 66.21%, 87.66%, and 98.65% respectively. For the OS dataset, the AAUT-ML performed better when
compared to YAKE, TF-IDF, and TextR by 66.74%, 84.64%, and 97.57% respectively. For the DB dataset,
the AAUT-ML performed better when compared to YAKE, TF-IDF, and TextR by 33.20%, 84.52%, and
97.35% respectively. The proposed AAUT-ML has achieved better results for precision in comparison to the
YAKE, TF-IDF, and TextR.
4.3. Recall
In Figure 3, the recall score has been evaluated and compared with the YAKE, TF-IDF, and Text-R
models. For the DM dataset, the AAUT-ML performed better when compared to YAKE, TF-IDF, and TextR
Int J Elec & Comp Eng ISSN: 2088-8708 
Sensing complicated meanings from unstructured data: A novel hybrid approach (Shankarayya Shastri)
717
by 48.17%, 81.70%, and 96.34% respectively. For the OS dataset, the AAUT-ML performed better when
compared to YAKE, TF-IDF, and TextR by 62.91%, 51.65%, and 3.97% respectively. For the DB dataset,
the AAUT-ML performed better when compared to YAKE, TF-IDF, and TextR by 28.67%, 80.51%, and
96.69% respectively. The proposed AAUT-ML has achieved better results for recall in comparison to the
YAKE, TF-IDF, and TextR.
Figure 2. Macro-averaged precision scores of AAUT-ML versus three baseline methods on three different
datasets
Figure 3. Macro-averaged Recall scores of AAUT-ML versus three baseline methods on three different
datasets
4.4. Macro-averaged F1-score
In Figure 4, the macro averaged F1-score has been evaluated and compared with the YAKE, TF-
IDF, and Text-R models. For the DM dataset, the AAUT-ML performed better when compared to YAKE,
TF-IDF, and TextR by 64.93%, 72.22%, and 96.52% respectively. For the OS dataset, the AAUT-ML
performed better when compared to YAKE, TF-IDF, and TextR by 61.88%, 86.06%, and 95.28%
respectively. For the DB dataset, the AAUT-ML performed better when compared to YAKE, TF-IDF, and
TextR by 30.79%, 82.88%, and 99.61% respectively. The proposed AAUT-ML has achieved better results
for macro averaged F1-Score in comparison to the YAKE, TF-IDF, and TextR.
Data Mining
Operating
System
Data Base
AAUT-ML 0.373 0.866 0.265
YAKE 0.126 0.288 0.177
TF-IDF 0.046 0.133 0.041
TextR 0.005 0.021 0.007
0
0.2
0.4
0.6
0.8
1
Precision
Precision
AAUT-ML YAKE TF-IDF TextR
Data Mining
Operating
System
Data Base
AAUT-ML 0.164 0.302 0.272
YAKE 0.085 0.112 0.194
TF-IDF 0.03 0.146 0.053
TextR 0.006 0.29 0.009
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
Recall
Recall
 ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 711-720
718
Figure 4. Macro-averaged F1-scores of AAUT-ML versus three baseline methods on three different datasets
4.5. Discussion
Table 1 shows macro averaged F1-score, recall, and precision values for four different methods
(AAUT-ML, YAKE, TF-IDF, TextR) applied to three different datasets (data mining, operating system, data
base). For the "operating system" dataset, it is seen that AAUT-ML seems to perform the best across all
metrics, with high F1-score, recall, and precision values. Also, in the "data mining" dataset, AAUT-ML
performs relatively well compared to other methods. Finally, for the "data base" dataset, YAKE and TF-IDF
have similar F1-scores, recall, and precision values, with AAUT-ML performing slightly better. This work
also tried to utilize the sparemax and fusedmax but both these failed to achieve higher results in comparison
to the softmax.
Table 1. Comparative study
Macro Averaged F1-Score Recall Precision
AAUT
-ML
YAKE TF-
IDF
TextR AAUT-
ML
YAKE TF-
IDF
TextR AAUT-
ML
YAKE TF-
IDF
TextR
Data mining 0.288 0.101 0.08 0.01 0.164 0.085 0.03 0.006 0.373 0.126 0.046 0.005
Operating
system
0.488 0.186 0.068 0.023 0.302 0.112 0.146 0.29 0.866 0.288 0.133 0.021
Data base 0.263 0.182 0.045 0.001 0.272 0.194 0.053 0.009 0.265 0.177 0.041 0.007
5. CONCLUSION
In this work, sensing complex meanings from unstructured data using natural language processing
and convolution neural network techniques has been presented. In this analysis, different datasets namely
data base, data mining, and operating systems datasets are used. Our study has challenged some
conventional assumptions about how CNNs process and classify text. Firstly, we have demonstrated that
max-pooling over time introduces a thresholding effect on the output of the convolution layer, effectively
distinguishing between relevant and non-relevant features for the final classification. This insight allowed
us to identify the crucial 𝑛-grams for classification, associating each filter with the class it contributes to.
We have also highlighted instances where filters assign negative values to specific word activations,
leading to low scores for 𝑛-grams containing them, despite otherwise having highly activating words.
These findings contribute to enhancing the interpretability of CNNs for text classification. Our approach
effectively categorizes various documents and their respective domains. We evaluated its performance
using metrics such as precision, recall, and F1-score across multiple datasets, demonstrating superior
results compared to existing methods. The performance of the presented work shows better results in
comparison to the existing works. For future work, this work can be used for classifying corpus semantics
in structured data. Different feature extraction processes can be used for structuring the data. Also, along
with NLP, machine learning can be used.
Data Mining
Operating
System
Data Base
AAUT-ML 0.288 0.488 0.263
YAKE 0.101 0.186 0.182
TF-IDF 0.08 0.068 0.045
TextR 0.01 0.023 0.001
0
0.1
0.2
0.3
0.4
0.5
0.6
Macro
Averaged
F1-Score
Macro Averaged F1-Score
Int J Elec & Comp Eng ISSN: 2088-8708 
Sensing complicated meanings from unstructured data: A novel hybrid approach (Shankarayya Shastri)
719
REFERENCES
[1] E. Camilleri and S. J. Miah, “Evaluating latent content within unstructured text: an analytical methodology based on a temporal
network of associated topics,” Journal of Big Data, vol. 8, no. 1, Sep. 2021, doi: 10.1186/s40537-021-00511-0.
[2] D. Antons, E. Grünwald, P. Cichy, and T. O. Salge, “The application of text mining methods in innovation research: current state,
evolution patterns, and development priorities,” R&D Management, vol. 50, no. 3, pp. 329–351, 2020, doi: 10.1111/radm.12408.
[3] S. Sulova, “A conceptual framework for the technological advancement of e-commerce applications,” Businesses, vol. 3, no. 1,
pp. 220–230, Mar. 2023, doi: 10.3390/businesses3010015.
[4] D. Khurana, A. Koli, K. Khatter, and S. Singh, “Natural language processing: state of the art, current trends and challenges,”
Multimedia Tools and Applications, vol. 82, no. 3, pp. 3713–3744, Jul. 2023, doi: 10.1007/s11042-022-13428-4.
[5] M. Y. Landolsi, L. Hlaoua, and L. Ben Romdhane, “Information extraction from electronic medical documents: state of the art
and future research directions,” Knowledge and Information Systems, vol. 65, no. 2, pp. 463–516, Nov. 2023, doi:
10.1007/s10115-022-01779-1.
[6] A. K. Sharma, S. Chaurasia, and D. K. Srivastava, “Sentimental short sentences classification by using CNN deep learning model
with fine tuned Word2Vec,” Procedia Computer Science, vol. 167, pp. 1139–1147, 2020, doi: 10.1016/j.procs.2020.03.416.
[7] S. Soni, S. S. Chouhan, and S. S. Rathore, “TextConvoNet: a convolutional neural network based architecture for text
classification,” Applied Intelligence, vol. 53, no. 11, pp. 14249–14268, Oct. 2023, doi: 10.1007/s10489-022-04221-9.
[8] P. Song, C. Geng, and Z. Li, “Research on text classification based on convolutional neural network,” in Proceedings - 2nd
International Conference on Computer Network, Electronic and Automation, ICCNEA 2019, IEEE, Sep. 2019, pp. 229–232. doi:
10.1109/ICCNEA.2019.00052.
[9] Y. Zhu, “Research on news text classification based on deep learning convolutional neural network,” Wireless Communications
and Mobile Computing, vol. 2021, pp. 1–6, Dec. 2021, doi: 10.1155/2021/1508150.
[10] A. Fesseha, S. Xiong, E. D. Emiru, M. Diallo, and A. Dahou, “Text classification based on convolutional neural networks and
word embedding for low-resource languages: Tigrinya,” Information (Switzerland), vol. 12, no. 2, pp. 1–17, Jan. 2021, doi:
10.3390/info12020052.
[11] H. Lu, L. Ehwerhemuepha, and C. Rakovski, “A comparative study on deep learning models for text classification of unstructured
medical notes with various levels of class imbalance,” BMC Medical Research Methodology, vol. 22, no. 1, Jul. 2022, doi:
10.1186/s12874-022-01665-y.
[12] M. Zulqarnain, R. Ghazali, Y. M. M. Hassim, and M. Rehan, “A comparative review on deep learning models for text
classification,” Indonesian Journal of Electrical Engineering and Computer Science (IJEECS), vol. 19, no. 1, pp. 325–335, Jul.
2020, doi: 10.11591/ijeecs.v19.i1.pp325-335.
[13] K. Adnan and R. Akbar, “An analytical study of information extraction from unstructured and multidimensional big data,”
Journal of Big Data, vol. 6, no. 1, Oct. 2019, doi: 10.1186/s40537-019-0254-8.
[14] A. Gupta, I. Banerjee, and D. L. Rubin, “Automatic information extraction from unstructured mammography reports using
distributed semantics,” Journal of Biomedical Informatics, vol. 78, pp. 78–86, Feb. 2018, doi: 10.1016/j.jbi.2017.12.016.
[15] E. Chang and J. Mostafa, “Cohort identification from free-text clinical notes using SNOMED CT’s hierarchical semantic
relations,” AMIA ... Annual Symposium proceedings. AMIA Symposium, vol. 2022, pp. 349–358, 2022.
[16] E. Chang and J. Mostafa, “The use of SNOMED CT, 2013-2020: A literature review,” Journal of the American Medical
Informatics Association, vol. 28, no. 9, pp. 2017–2026, Jun. 2021, doi: 10.1093/jamia/ocab084.
[17] S. Henry, K. Buchan, M. Filannino, A. Stubbs, and O. Uzuner, “2018 N2C2 Shared task on adverse drug events and medication
extraction in electronic health records,” Journal of the American Medical Informatics Association, vol. 27, no. 1, pp. 3–12, Oct.
2020, doi: 10.1093/jamia/ocz166.
[18] K. Adnan and R. Akbar, “Limitations of information extraction methods and techniques for heterogeneous unstructured big data,”
International Journal of Engineering Business Management, vol. 11, Art. no. 184797901989077, Jan. 2019, doi:
10.1177/1847979019890771.
[19] J. Tekli, R. Chbeir, A. J. M. Traina, and C. Traina, “SemIndex+: A semantic indexing scheme for structured, unstructured, and
partly structured data,” Knowledge-Based Systems, vol. 164, pp. 378–403, Jan. 2019, doi: 10.1016/j.knosys.2018.11.010.
[20] C. C. Aggarwal, Data mining: The textbook. Synthesis Collection of Technology, 2015.
[21] R. Ramakrishnan and J.Gehrke, Database management systems. India, 2014.
[22] VOCW, “Operating systems,” cnx.org. https://cnx.org/contents/epUq7msG@2.1:vLiqr17-@1/Process (accessed Aug. 10, 2023).
[23] OpenStax, “OpenStax | free textbooks online with no catch,” cnx.org. http://cnx.org/content/col10785/1.2/ (accessed Aug. 10, 2023).
[24] S. Gul, S. Räbiger, and Y. Saygın, “Context-based extraction of concepts from unstructured textual documents,” Information
Sciences, vol. 588, pp. 248–264, Apr. 2022, doi: 10.1016/j.ins.2021.12.056.
[25] A. Bougouin, F. Boudin, and B. Daille, “TopicRank: Graph-based topic Ranking for keyphrase extraction,” in 6th International
Joint Conference on Natural Language Processing, IJCNLP 2013 - Proceedings of the Main Conference, Nagoya, Japan: Asian
Federation of Natural Language Processing, Oct. 2013, pp. 543–551. [Online]. Available: https://aclanthology.org/I13-1062
[26] F. Boudin, “Unsupervised keyphrase extraction with multipartite graphs,” in NAACL HLT 2018 - 2018 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the
Conference, Association for Computational Linguistics, 2018, pp. 667–672. doi: 10.18653/v1/n18-2105.
BIOGRAPHIES OF AUTHORS
Shankarayya Shastri is assistant professor in the Department of Computer
Science and Engineering at GM Institute of Technology, Davangere, India with more than 16
years of teaching and research experience. He Holds a Bachelor of Engineering degree in
computer science and engineering and M.Tech. degree in computer science and engineering.
His research areas are text mining, data mining, big data analytics and pattern recognition. He
can be contacted at email: shan.shas@gmail.com.
 ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 711-720
720
Veeragangadhara Swamy Teligi Math received the Bachelor of Engineering
degree in computer science from the University BDT College of Engineering, Davangere and
the M.Tech. degree in computer science and engineering from Dr. A.I.T, Bangalore,
Karnataka. He obtained his PhD from SJJT University, Rajasthan. He used to hold several
administrative posts in various engineering colleges in Karnataka. He has supervised and co-
supervised more than 50 masters and 5 PhD students. He has authored or coauthored more
than 40 publications: 10 proceedings and 40 journals. His research interests include data
mining, web mining, big data and pattern recognition. He can be contacted at email:
swamytm@gmail.com.
Patil Nagaraja Siddalingappa working as associate professor in Department of
Information Science, Bapuji Institute of Engineering Technology, Davangere. His teaching
experience is 13 years and his area of interest is graph database, big data, database
management system, and data mining. He can be contacted at email: patilbathi@gmail.com.

More Related Content

PDF
A simplified classification computational model of opinion mining using deep ...
PDF
Sentimental classification analysis of polarity multi-view textual data using...
PDF
EASESUM: an online abstractive and extractive text summarizer using deep lear...
PDF
An in-depth review on News Classification through NLP
PDF
Smart Solutions for Question Duplication: Deep Learning in Action
PDF
Sentiment analysis on Bangla conversation using machine learning approach
PDF
An optimal unsupervised text data segmentation 3
PDF
A novel ensemble deep network framework for scene text recognition
A simplified classification computational model of opinion mining using deep ...
Sentimental classification analysis of polarity multi-view textual data using...
EASESUM: an online abstractive and extractive text summarizer using deep lear...
An in-depth review on News Classification through NLP
Smart Solutions for Question Duplication: Deep Learning in Action
Sentiment analysis on Bangla conversation using machine learning approach
An optimal unsupervised text data segmentation 3
A novel ensemble deep network framework for scene text recognition

Similar to Sensing complicated meanings from unstructured data: a novel hybrid approach (20)

PDF
Semantics-based clustering approach for similar research area detection
PDF
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...
PDF
Chinese paper classification based on pre-trained language model and hybrid d...
PDF
A semantic framework and software design to enable the transparent integratio...
PDF
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System
PDF
Development of an intelligent information resource model based on modern na...
PDF
Eat it, Review it: A New Approach for Review Prediction
PDF
Extraction and Retrieval of Web based Content in Web Engineering
PDF
Opinion mining on newspaper headlines using SVM and NLP
PDF
A Review on Text Mining in Data Mining
PDF
Generating images using generative adversarial networks based on text descrip...
PDF
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
PDF
Enhancing speaker verification accuracy with deep ensemble learning and inclu...
PDF
IRJET - Text Detection in Natural Scene Images: A Survey
PDF
LSTM Based Sentiment Analysis
PDF
The sarcasm detection with the method of logistic regression
PDF
Classifier Model using Artificial Neural Network
PDF
The Identification of Depressive Moods from Twitter Data by Using Convolution...
PDF
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
Semantics-based clustering approach for similar research area detection
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...
Chinese paper classification based on pre-trained language model and hybrid d...
A semantic framework and software design to enable the transparent integratio...
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System
Development of an intelligent information resource model based on modern na...
Eat it, Review it: A New Approach for Review Prediction
Extraction and Retrieval of Web based Content in Web Engineering
Opinion mining on newspaper headlines using SVM and NLP
A Review on Text Mining in Data Mining
Generating images using generative adversarial networks based on text descrip...
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
Enhancing speaker verification accuracy with deep ensemble learning and inclu...
IRJET - Text Detection in Natural Scene Images: A Survey
LSTM Based Sentiment Analysis
The sarcasm detection with the method of logistic regression
Classifier Model using Artificial Neural Network
The Identification of Depressive Moods from Twitter Data by Using Convolution...
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
Ad

More from IJECEIAES (20)

PDF
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
PDF
Embedded machine learning-based road conditions and driving behavior monitoring
PDF
Advanced control scheme of doubly fed induction generator for wind turbine us...
PDF
Neural network optimizer of proportional-integral-differential controller par...
PDF
An improved modulation technique suitable for a three level flying capacitor ...
PDF
A review on features and methods of potential fishing zone
PDF
Electrical signal interference minimization using appropriate core material f...
PDF
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
PDF
Bibliometric analysis highlighting the role of women in addressing climate ch...
PDF
Voltage and frequency control of microgrid in presence of micro-turbine inter...
PDF
Enhancing battery system identification: nonlinear autoregressive modeling fo...
PDF
Smart grid deployment: from a bibliometric analysis to a survey
PDF
Use of analytical hierarchy process for selecting and prioritizing islanding ...
PDF
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
PDF
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
PDF
Adaptive synchronous sliding control for a robot manipulator based on neural ...
PDF
Remote field-programmable gate array laboratory for signal acquisition and de...
PDF
Detecting and resolving feature envy through automated machine learning and m...
PDF
Smart monitoring technique for solar cell systems using internet of things ba...
PDF
An efficient security framework for intrusion detection and prevention in int...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Embedded machine learning-based road conditions and driving behavior monitoring
Advanced control scheme of doubly fed induction generator for wind turbine us...
Neural network optimizer of proportional-integral-differential controller par...
An improved modulation technique suitable for a three level flying capacitor ...
A review on features and methods of potential fishing zone
Electrical signal interference minimization using appropriate core material f...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Bibliometric analysis highlighting the role of women in addressing climate ch...
Voltage and frequency control of microgrid in presence of micro-turbine inter...
Enhancing battery system identification: nonlinear autoregressive modeling fo...
Smart grid deployment: from a bibliometric analysis to a survey
Use of analytical hierarchy process for selecting and prioritizing islanding ...
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
Adaptive synchronous sliding control for a robot manipulator based on neural ...
Remote field-programmable gate array laboratory for signal acquisition and de...
Detecting and resolving feature envy through automated machine learning and m...
Smart monitoring technique for solar cell systems using internet of things ba...
An efficient security framework for intrusion detection and prevention in int...
Ad

Recently uploaded (20)

PDF
Categorization of Factors Affecting Classification Algorithms Selection
PDF
Accra-Kumasi Expressway - Prefeasibility Report Volume 1 of 7.11.2018.pdf
PPTX
AUTOMOTIVE ENGINE MANAGEMENT (MECHATRONICS).pptx
PDF
Soil Improvement Techniques Note - Rabbi
PDF
737-MAX_SRG.pdf student reference guides
PPT
Total quality management ppt for engineering students
PPTX
6ME3A-Unit-II-Sensors and Actuators_Handouts.pptx
PDF
BIO-INSPIRED HORMONAL MODULATION AND ADAPTIVE ORCHESTRATION IN S-AI-GPT
PDF
BIO-INSPIRED ARCHITECTURE FOR PARSIMONIOUS CONVERSATIONAL INTELLIGENCE : THE ...
PDF
distributed database system" (DDBS) is often used to refer to both the distri...
PPTX
Amdahl’s law is explained in the above power point presentations
PDF
Artificial Superintelligence (ASI) Alliance Vision Paper.pdf
PDF
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
PDF
Improvement effect of pyrolyzed agro-food biochar on the properties of.pdf
PPTX
Management Information system : MIS-e-Business Systems.pptx
PPTX
Feature types and data preprocessing steps
PDF
Visual Aids for Exploratory Data Analysis.pdf
PPTX
"Array and Linked List in Data Structures with Types, Operations, Implementat...
PPTX
Fundamentals of Mechanical Engineering.pptx
PPTX
Information Storage and Retrieval Techniques Unit III
Categorization of Factors Affecting Classification Algorithms Selection
Accra-Kumasi Expressway - Prefeasibility Report Volume 1 of 7.11.2018.pdf
AUTOMOTIVE ENGINE MANAGEMENT (MECHATRONICS).pptx
Soil Improvement Techniques Note - Rabbi
737-MAX_SRG.pdf student reference guides
Total quality management ppt for engineering students
6ME3A-Unit-II-Sensors and Actuators_Handouts.pptx
BIO-INSPIRED HORMONAL MODULATION AND ADAPTIVE ORCHESTRATION IN S-AI-GPT
BIO-INSPIRED ARCHITECTURE FOR PARSIMONIOUS CONVERSATIONAL INTELLIGENCE : THE ...
distributed database system" (DDBS) is often used to refer to both the distri...
Amdahl’s law is explained in the above power point presentations
Artificial Superintelligence (ASI) Alliance Vision Paper.pdf
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
Improvement effect of pyrolyzed agro-food biochar on the properties of.pdf
Management Information system : MIS-e-Business Systems.pptx
Feature types and data preprocessing steps
Visual Aids for Exploratory Data Analysis.pdf
"Array and Linked List in Data Structures with Types, Operations, Implementat...
Fundamentals of Mechanical Engineering.pptx
Information Storage and Retrieval Techniques Unit III

Sensing complicated meanings from unstructured data: a novel hybrid approach

  • 1. International Journal of Electrical and Computer Engineering (IJECE) Vol. 14, No. 1, February 2024, pp. 711~720 ISSN: 2088-8708, DOI: 10.11591/ijece.v14i1.pp711-720  711 Journal homepage: http://ijece.iaescore.com Sensing complicated meanings from unstructured data: a novel hybrid approach Shankarayya Shastri1 , Veeragangadhara Swamy Teligi Math2 , Patil Nagaraja Siddalingappa3 1 Department of Computer Science and Engineering, GM Institute of Technology Davangere, Visvesvaraya Technological University, Belagavi, India 2 Department of Computer Science and Engineering, RYM College of Engineering Ballari, Visvesvaraya Technological University, Belagavi, India 3 Department of Information Science and Engineering, Bapuji Institute of Engineering and Technology Davanagere, Visvesvaraya Technological University, Belagavi, India Article Info ABSTRACT Article history: Received Aug 11, 2023 Revised Sep 13, 2023 Accepted Sep 14, 2023 The majority of data on computers nowadays is in the form of unstructured data and unstructured text. The inherent ambiguity of natural language makes it incredibly difficult but also highly profitable to find hidden information or comprehend complex semantics in unstructured text. In this paper, we present the combination of natural language processing (NLP) and convolution neural network (CNN) hybrid architecture called automated analysis of unstructured text using machine learning (AAUT-ML) for the detection of complex semantics from unstructured data that enables different users to make understand formal semantic knowledge to be extracted from an unstructured text corpus. The AAUT-ML has been evaluated using three datasets data mining (DM), operating system (OS), and data base (DB), and compared with the existing models, i.e., YAKE, term frequency-inverse document frequency (TF-IDF) and text-R. The results show better outcomes in terms of precision, recall, and macro-averaged F1-score. This work presents a novel method for identifying complex semantics using unstructured data. Keywords: Convolutional neural network Natural language processing Information extraction Unstructured text Complex semantics This is an open access article under the CC BY-SA license. Corresponding Author: Shankarayya Shastri Department of Computer Science and Engineering, GM Institute of Technology Davangere, Visvesvaraya Technological University Belagavi, India Email: shankarayyasresearch@gmail.com 1. INTRODUCTION Unstructured text poses difficulties for computer programs due to its lack of a clear structure or defined data model [1]. This makes it unsuitable for conventional database models, causing issues in storage, management, and indexing due to the absence of a schema and unclear structure. Consequently, search results are less accurate due to the absence of predefined attributes. In today's digital landscape, the volume of diverse unstructured text is increasing, generated from sources like web pages, research papers, and articles [2]. This growth is driven by advancements in web technologies and text extraction tools [3]. While semantic technologies and text mining systems aid in linking text to knowledge, processing complex language and extracting intricate insights remains a challenge [4]. Information extraction (IE) algorithms help extract knowledge by identifying references and relationships between entities [5]. Yet, deeper insights require natural language processing (NLP) techniques, including convolutional neural networks (CNNs),
  • 2.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 711-720 712 which are widely used in image processing and text classification, to effectively capture unstructured information from the text [6]. In NLP, CNNs are applied to categorization problems, using sliding windows for data analysis. Soni et al. [7] have presented a Text ConvoNet which is used for classifying the multi and binary-class text classes using CNN. This work has used intra-sentence 𝑛-gram characteristics for classifying the text. Song et al. [8] have presented a TextCNN technique for classifying the text corpus. In study [9], a model has been presented for classifying the text using news text using CNN. Fesseha et al. [10] used CNN with continuous bag-of- words, FastText, and word to vector (Word2Vec) for evaluating the text from news. Lu et al. [11] have proposed a model that uses CNN, bidirectional-encoder-representation using transformers (BERT), transformer encoder, and four neural-networks, recurrent-neural-network (RNN), long-short term-memory (LSTM), gated-recurrent-unit (GRU) and Bi-directional LSTM (Bi-LSTM) for classification of summary notes of the patients. Zulqarnain et al. [12] examined three architectures, CNN, RNN, and deep belief neural (DBN) for text classification. All these models have utilized CNN and achieved better results, but failed to classify the complex semantics when the data is unstructured. Further, by using NLP and CNN techniques, complex semantics within unstructured text documents can be detected and analyzed. The semantic analysis involves important elements such as hyponymy, which connects generic terms (hypernyms) to their instances (hyponyms) for instance, "color" and its variations. Homonymy refers to words with the same spelling but different meanings, like "bat." Polysemy involves words with distinct yet related meanings, such as "bank." Synonymy pertains to words with similar meanings, like "author/writer." Antonymy represents symmetrical relationships between opposite words. Semantic analysis finds applications in chatbots, support systems, sentiment analysis, search engines, translation, Q&A, and grammar identification. This study uses this knowledge to propose automated analysis of unstructured text using machine learning (AAUT-ML), a hybrid NLP and CNN approach that can detect complicated semantics in unstructured data and is evaluated by measures of recall, precision, and F1-scores. This work makes the following contributions; i) Classify unstructured text into their respective domains using the NLP method, ii) Using 𝑛-gram detectors, 1-dimensional convolving filters will be employed, each of which focuses on a certain family of closely related 𝑛-grams, iii) The appropriate 𝑛-grams are extracted for decision-making through max-pooling over time, and iv) Based on data from Max-pooling, the rest of the network extracts hidden or complex semantics from unstructured text. This paper will have the following outline. In Section 2, the existing approaches are discussed. In Section 3, the proposed methodology has been presented. In Section 4, the presented methodology is evaluated and compared with the existing works. Finally, in Section 5, the conclusion and future work of the complete work have been presented. 2. LITERATURE SURVEY This section presents the different reviews, strategies, architectures, and methodologies used for classifying the text. There has not been a lot of effort done to examine the drawbacks of information extraction (IE) across all tasks and data kinds in a unified study. Hence, Adnan and Akbar [13], overcomes that barrier by providing a comprehensive literature evaluation of cutting-edge methods for handling all forms of big data and texts. Additionally, current issues with IE are highlighted and briefly discussed. Solutions are suggested, along with directions for future study in the field of text IE. In light of the current approaches and difficulties in the analysis of large amounts of data, the study is important. The findings and suggestions offered here have analyzed large amounts of data more efficiently overall. Gupta et al. [14] presented a work for extracting the relations using unsupervised machine learning by utilizing the radiology reports of patients. For implementing this work, they have used a hybrid approach, i.e., dependency-based parsed trees which will extract the IE. This work achieved an F-score of 94 percent when tested on mammography data of patients. Chang and Mostafa [15] introduced a novel supervised machine learning model and utilized the systematized nomenclature of medicine-clinical terms (SNOMED CT) [16] for evaluating their model. This work was compared with the 2018 N2C2 model [17]. The proposed model achieved a score of 0.933 when compared with [17]. Adnan and Akbar [18] has surveyed text IE using unstructured data. First, it provides a consolidated summary of IE methods across different types of unstructured data (including written content, visual content, audio content, and videos). Furthermore, it explores how the diversity, dimensions, and amount of unstructured large data pose challenges to the aforementioned established IE methods. Furthermore, possible approaches to enhance the unstructured large data IE platforms for future study are offered. Tekli et al. [19] introduced a model, SemIndex+ for classifying the partly structured, structured, and unstructured data. They have used weight functions for classifying the text. The SemIndex+ achieved better results in terms of precision when compared with the existing works.
  • 3. Int J Elec & Comp Eng ISSN: 2088-8708  Sensing complicated meanings from unstructured data: A novel hybrid approach (Shankarayya Shastri) 713 From the above study, it can be seen that very little work has been done on classifying the unstructured data directly, without data cleaning and data pre-processing. In addition, comprehensive research is offered in [13], which discusses methods for data extraction from unstructured data. Furthermore, [14]–[17] none of these papers have dealt with the issue of data extraction from unstructured sources. Both [18], [19] discuss methods for extracting information from unstructured sources, but neither paper examines these methods in relation to other datasets. The models [18], [19] may work properly for the given respective datasets given in [18], [19]. Hence in this work, we utilize the NLP and CNN and build a model called AAUT- ML to extract the text from the unstructured data. This work will classify the unstructured complex semantics into their respective domains using the NLP method. Also, by using 𝑛-gram detectors, 1-dimensional convolving filters are employed, each of which focuses on a certain family of closely related 𝑛-grams. The appropriate 𝑛-grams are extracted for decision-making through max-pooling over time. Based on data from Max-pooling, the rest of the network extracts hidden or complex semantics from unstructured text. 3. METHODOLOGY FOR SENSING COMPLICATED MEANINGS FROM UNSTRUCTURED TEXT FROM UNSTRUCTURED DATA USING NLP AND CNN TECHNIQUE In this two-phase proposed work, we are trying to detect complex semantics from unstructured text using NLP and CNN Techniques. In the first phase (NLP) we are pre-processing, filtering, and classifying unstructured text to specific data domains. In the second phase (CNN), we try to figure out how CNN handles text after which we use that knowledge to discover complex semantics in unstructured text. The main aim is to satisfy the objectives that are listed as follows: a. To classify unstructured text into their respective domains using the NLP method. b. As 𝑛-gram detectors, 1-dimensional convolving filters are employed, each of which focuses on a certain family of closely related 𝑛-grams. c. The appropriate 𝑛-grams are extracted for decision-making through max-pooling over time. d. Based on data from Max-pooling, the rest of the network extracts hidden or complex semantics from unstructured text. In this experiment setup, we are taking sample input datasets from computer science domains such as Databases, Operating systems, and Data mining unstructured text files in .txt format. Tokenization, stop word removal, and rare word removal functions must be applied to the input unstructured documents as part of the pre-processing step. Pre-processing eliminates missing or inconsistent data values brought on by human or technological faults. Pre-processing can make a dataset's accuracy and quality more precise, dependable, and consistent. The proposed work will be carried out in two distinct phases. The detailed procedures for both Phase 1 along Phase 2 are outlined below. Figure 1 provides a comprehensive block- diagram of the novel developed architecture. Figure 1. Novel developed block-diagram for NLP and CNN for Text classification and detection of complex semantics
  • 4.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 711-720 714 Phase 1. Pre-processing and filtering candidate concepts Step 1: Given a collection of unstructured text files, denoted as 𝐷 = {𝑑1, 𝑑2, 𝑑3, … 𝑑𝑚}, describing a collection of concepts denoted as 𝐶 = {𝑐1, 𝑐2, 𝑐3, … . 𝑐𝑚}, where both n and m are greater than 1, the objective is to identify and categorize the most pertinent concepts from 𝐶. The initial step involves tokenizing the unstructured textual documents, represented as 𝑑𝑖 ∈ 𝐷, into 𝑛-grams that serve as the preliminary candidate concepts, denoted as 𝑐𝑖 ∈ 𝐶. 𝑑𝑖 ∈ 𝐷 > 𝑇𝑜𝑘𝑒𝑛𝑖𝑧𝑎𝑡𝑖𝑜𝑛{𝑐1, 𝑐2, 𝑐3, … , 𝑐𝑚} Step 2: In these 𝑛-grams, common stop words are eliminated based on a predefined stop-word list, and the occurrences of the remaining candidates 𝑐𝑖 in 𝐷 are tallied to generate a set of tuples denoted as 𝑠 = {(𝑐𝑖, 𝑓𝑖), … . }, comprising each candidate 𝑐𝑖 and its corresponding frequency 𝑓𝑖. Step 3: To diminish noise, 𝑛-grams with low frequency, short length, and infrequent occurrences are eliminated from both the set of 𝑛-grams (𝑇) and the concept set (𝐶) by applying a specified frequency threshold 𝑓𝑡. Step 4: Only meaningful unigrams (𝑛 >= 1) are retained only if it occurs at-least 𝑓𝑡 times more frequently than any larger 𝑛-gram containing the same unigram within it (refer to Figure 1). Step 5: In cases where multiple higher-order 𝑛-grams 𝑐𝑗 encompass 𝑐𝑖, 𝑓𝑖 that makes reference to the highest frequently encountered 𝑐𝑗, then the remaining 𝑛-grams in 𝐶 are merged based on two rules: firstly, plural tokens are filtered out if their singular form is also present as shown in Figure 1, and secondly, current participle of a normal verb is not used if there is an alternative form that does not include it. Step 6: Among the remaining candidates in 𝐶, those without a corresponding 𝐷𝐵𝑝𝑒𝑑𝑖𝑎 entry is filtered out. Step 7: Initial context information denoted as 𝐶𝑖𝑛𝑓𝑜, is generated, and the documents 𝑑𝑖 are classified using specific-domain. Phase 2. Sensing complicated meanings Following is how CNN works for text processing. The three-layer CNN framework is used in this work's implementation. CNN's fundamental capabilities are analogous to those found in the visual-cortex in the brains of animals. CNN performs admirably in text-classification tasks. Classifying texts follows a process similar to those of categorizing images, with the exception that words are represented by vectors within a matrix rather than pixels. 3.1. Target-function Learning-capable neuron biases and weights are used throughout target-function implementation. To generate an outcome, neurons receive multiple inputs, carry out a weighted average over those inputs, and then send that value through an activation mechanism. By sending the network's outcome through the softmax layer, a loss-function can usually be determined for the entire system. The outcome of a network with a softmax layer, which has full connectivity, is down-sampled. 3.2. Representation CNN's initial level consists of an embedding level, which transforms word-indices into three- dimensional vectors. These vectors are discovered using the equivalent of a lookup-table. When considering a sentence represented as 𝑁, and words represented as 𝑊, each word is translated through its associated embedding and the highest sentence size 𝑉𝑤𝑟𝑑 is used to determine the vocabulary-size. Once all the words have been converted to vectors, they are sent through the convolution-layer. 3.3. System structure The developed architecture has three distinct stages. The architecture consists of two layers: the Embedding level, that maps words to embedded vectors, and then the Convolution level, that performs most part of the approach work. The sentence matrix is processed by a set of preset filters, which reduces its dimensions. The softmax level, the final one, acts as a downsampling level that can both reduce the sentence- matrix and compute the loss-function. The sentence's word-embedding can be obtained using the embedded word lookup-table. To ensure that every single sentence is handled fairly, the matrix produced by the embedding component remains padded. Once the filters have been established, the matrix continues to be reduced and convolved-features are going to be generated. These complex characteristics are finally simplified. As a next step in down-sampling, the resultant data from the convolved-features is distributed across the maximum pooling level. Various sized and shaped filtration are specified. Three, four, and five are the filter shapes employed by the suggested approach. Following that, padding is applied to each of the
  • 5. Int J Elec & Comp Eng ISSN: 2088-8708  Sensing complicated meanings from unstructured data: A novel hybrid approach (Shankarayya Shastri) 715 embedded sentences so that the resulting sentence matrix possesses an identical shape and size. Word vectors 𝑤1, … , 𝑤𝑛 ∈ 𝑅𝑑 are the result of embedding every symbol in the 𝑛-words text being entered in the form of d-dimensional vector data. The generated 𝑑 × 𝑛 matrix can be utilized to transmit a sliding-window across the text within a convolutional level. In accordance with each l-word 𝑛-gram. 𝑢𝑖 = [𝑤𝑖, … , 𝑤𝑖+𝑙−1] ∈ 𝑅𝑑×𝑙 ; 0 ≤ 𝑖 ≤ 𝑛 − 𝑙 (1) where matrix 𝐹 ∈ 𝑅𝑛×𝑚 . Max-pooling applied along the 𝑛-gram dimension yields 𝑝 ∈ 𝑅𝑚 . The non-linearity of rectified linear unit (ReLU) is used to process 𝑅𝑚 . The distribution across the classes used for classification is then generated by a linearly fully connected layer 𝑊 ∈ 𝑅𝑐×𝑚 , that then outputs the class with the highest strength. In execution, we employ a range of window widths, from 𝑙 ∈ 𝐿, 𝐿 ∈ 𝑁, by chaining together the outputs 𝑝𝑙 vectors of numerous convolution levels. It is important to take into account that the procedures described here also work for dilated convolutions. This is represented as (2) to (6). 𝑢𝑖 = [𝑤𝑖, … , 𝑤𝑖+𝑙−1] ∈ 𝑅𝑑×𝑙 ; 0 ≤ 𝑖 ≤ 𝑛 − 𝑙 (2) 𝑢𝑖 = [𝑤𝑖, … , 𝑤𝑖+𝑙−1] (3) 𝐹𝑖𝑗 = 〈𝑢𝑖,𝑓𝑖〉 (4) 𝑝𝑗 = 𝑅𝑒𝐿𝑈(𝑚𝑎𝑥𝐹𝑖𝑗) (5) 𝑜 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑊 𝑝) (6) 3.4. Identification of important features According to conventional wisdom, filters can be thought of as 𝑛-gram detectors, with every filter looking for a unique category of 𝑛-grams and marking them with high-scores. After the max-pooling process, only the 𝑛-grams with the most favorable scores remain. Once the total number of 𝑛-grams within the max- pooled vector (which is represented through the collection of matching filters) has been determined, a conclusion can be drawn. Any filter's high-scoring 𝑛-grams (compared with the way it ranks similar 𝑛-grams) should be considered to be particularly useful for text classification. In this subsection, we enhance this perspective by posing and aiming to respond to the following issues: what data underlying 𝑛-grams can be obtained within the max-pooled vector, and in what manner is it utilized in the last classification? 3.5. Informative vs. uninformative 𝒏-grams The pooled vector 𝑝, that belongs to the 𝑚-dimensional real space 𝑅𝑚 , i.e., 𝑝 ∈ 𝑅𝑚 serves as the foundation for the classification process. Every value of 𝑝𝑗 is derived through the ReLU applied to the maximum inner product between the 𝑛-gram 𝑢𝑖 along with the filter 𝑓𝑗. These values could be attributed to the specific 𝑛-gram 𝑢𝑖, which consists of a sequence of words [𝑤𝑖, … , 𝑤𝑖+𝑙−1] , which activated the filter. The collection of 𝑛-grams that contribute to the overall probability distribution 𝑝 can be denoted as 𝑆𝑝. 𝑛-grams that are not present in the collection 𝑆𝑝 cannot have any effect on the decision-making process within the classifier. However, it is imperative to consider the presence of 𝑛-grams within the set 𝑆𝑝. In prior investigations into the prediction-based analysis of CNN for text, the focus has been on identifying the 𝑛-grams found within the input sequence, denoted as 𝑆𝑝, and evaluating their respective rankings as a method of understanding the underlying prediction process. In this context, we adopt an additional complicated perspective. It is important to highlight the following: the ultimate categorization process does not necessarily consider the specific 𝑛-gram identities, but rather evaluates these individuals based on the rankings allocated through the filters. Therefore, it is imperative that the data contained in variable 𝑝 is contingent upon the allocated rankings. From a conceptual standpoint, the 𝑛-grams within the set 𝑆𝑝 can be categorized through two distinct classes: accidental and deliberate. The presence of deliberate 𝑛-grams in 𝑆𝑝 can be attributed to their higher rankings assigned by the filtering mechanism. This suggests that these n-grams possess valuable information that is relevant to the ultimate decision-making process. In contrast, it is observed that accidental 𝑛-grams, regardless of possessing a relatively low ranking, manage to find their way into the set 𝑆𝑝. This occurrence can be attributed to the absence of an additional 𝑛-gram that achieved a higher ranking compared to them. Based on the analysis conducted, it is evident that the 𝑛-grams in question do not appear to possess significant informational value in relation to the classification selection at hand. Is it possible to distinguish and differentiate between intentional and unintentional 𝑛-grams? It is postulated that within the framework for every filter, a discernible threshold exists. Values surpassing this threshold are
  • 6.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 711-720 716 indicative of useful information pertaining to the classification process, whereas values falling beneath the threshold are deemed inaccurate and are therefore disregarded for classification purposes. Formally, given threshold dataset (𝑋, 𝑌). 𝑝𝑢𝑟𝑖𝑡𝑦(𝑓, 𝑡) = | {(𝑥,𝑦) ∈ (𝑋,𝑌)𝑓 | 𝑥 ≥𝑡 & 𝑦=𝑡𝑟𝑢𝑒 }| | {(𝑥,𝑦) ∈ (𝑋,𝑌)𝑓 | 𝑥 ≥𝑡 }| (7) Based on our empirical findings, we determine that an ideal purity-value of 0.75 is optimal when determining the threshold of a given filter. Further, the results of the proposed work have been evaluated using three datasets and in terms of recall, precision, and macro averaged F1-score. The results are discussed in the next section. 4. RESULTS AND DISCUSSION This section commences by providing a comprehensive overview of the system requirements, followed by a detailed examination of the dataset employed in the study. Additionally, the performance metrics utilized to evaluate the system's efficacy are thoroughly examined. The results obtained from the proposed methodology were subsequently compared to those of previous studies, specifically on the basis of recall, precision, and macro-averaged F1-score. The inclusion of a comprehensive discussion section within the present work serves to provide a thorough analysis and interpretation of the obtained results. 4.1. System requirements, datasets and performance metrics In this section, we conduct a series of experiments on Phase 1 and Phase 2. The code was written in Python and a system having Windows 10 operating system, 16 GB RAM was used for executing the code. In this approach three different datasets namely data mining (DM) [20], operating system (OS) [21], and data base (DB) [22], [23] datasets are used for experiment purposes. The DM, OS, and DB were created in [24]. The performance of the presented approach AAUT-ML is investigated with different existing available approaches i.e., YAKE [15], TF-IDF [25], and TextR [26]. All the results and datasets have been taken from [24]. Results from the provided AAUT-ML are analyzed, and metrics such as recall, precision, and F1-score are used to determine how well the model performs. By dividing the sum of all anticipated sequences within a positive category by the precision, we can determine how many positive categories were successfully anticipated. In (9) can calculate recall, that can be described as the proportion of correctly predicted positive outcomes compared to actual positive outcomes. In machine learning, the F1-score is a crucial evaluation metric. It neatly summarizes a model's prediction ability through the combination of the precision and recall measures, which are frequently assessed using (10). 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝐾𝑒𝑦𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑒𝑑 𝐾𝑒𝑦𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑒𝑑 (8) 𝑅𝑒𝑐𝑎𝑙𝑙 = 𝐾𝑒𝑦𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑒𝑑 𝐾𝑒𝑦𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑒𝑑 (9) 𝐹1 − 𝑆𝑐𝑜𝑟𝑒 = 2 ×𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ×𝑅𝑒𝑐𝑎𝑙𝑙 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛+𝑅𝑒𝑐𝑎𝑙𝑙 (10) where 𝐾𝑒𝑦𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑒𝑑 is the sum of all the predicted key-phrases that were found to be a good match with the standard key-phrases, and 𝐾𝑒𝑦𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑒𝑑 is the sum of all the predicted key-phrases from the document. 4.2. Precision In Figure 2, the precision has been evaluated and compared with the YAKE, TF-IDF, and Text-R models. For the DM dataset, the AAUT-ML performed better when compared to YAKE, TF-IDF, and TextR by 66.21%, 87.66%, and 98.65% respectively. For the OS dataset, the AAUT-ML performed better when compared to YAKE, TF-IDF, and TextR by 66.74%, 84.64%, and 97.57% respectively. For the DB dataset, the AAUT-ML performed better when compared to YAKE, TF-IDF, and TextR by 33.20%, 84.52%, and 97.35% respectively. The proposed AAUT-ML has achieved better results for precision in comparison to the YAKE, TF-IDF, and TextR. 4.3. Recall In Figure 3, the recall score has been evaluated and compared with the YAKE, TF-IDF, and Text-R models. For the DM dataset, the AAUT-ML performed better when compared to YAKE, TF-IDF, and TextR
  • 7. Int J Elec & Comp Eng ISSN: 2088-8708  Sensing complicated meanings from unstructured data: A novel hybrid approach (Shankarayya Shastri) 717 by 48.17%, 81.70%, and 96.34% respectively. For the OS dataset, the AAUT-ML performed better when compared to YAKE, TF-IDF, and TextR by 62.91%, 51.65%, and 3.97% respectively. For the DB dataset, the AAUT-ML performed better when compared to YAKE, TF-IDF, and TextR by 28.67%, 80.51%, and 96.69% respectively. The proposed AAUT-ML has achieved better results for recall in comparison to the YAKE, TF-IDF, and TextR. Figure 2. Macro-averaged precision scores of AAUT-ML versus three baseline methods on three different datasets Figure 3. Macro-averaged Recall scores of AAUT-ML versus three baseline methods on three different datasets 4.4. Macro-averaged F1-score In Figure 4, the macro averaged F1-score has been evaluated and compared with the YAKE, TF- IDF, and Text-R models. For the DM dataset, the AAUT-ML performed better when compared to YAKE, TF-IDF, and TextR by 64.93%, 72.22%, and 96.52% respectively. For the OS dataset, the AAUT-ML performed better when compared to YAKE, TF-IDF, and TextR by 61.88%, 86.06%, and 95.28% respectively. For the DB dataset, the AAUT-ML performed better when compared to YAKE, TF-IDF, and TextR by 30.79%, 82.88%, and 99.61% respectively. The proposed AAUT-ML has achieved better results for macro averaged F1-Score in comparison to the YAKE, TF-IDF, and TextR. Data Mining Operating System Data Base AAUT-ML 0.373 0.866 0.265 YAKE 0.126 0.288 0.177 TF-IDF 0.046 0.133 0.041 TextR 0.005 0.021 0.007 0 0.2 0.4 0.6 0.8 1 Precision Precision AAUT-ML YAKE TF-IDF TextR Data Mining Operating System Data Base AAUT-ML 0.164 0.302 0.272 YAKE 0.085 0.112 0.194 TF-IDF 0.03 0.146 0.053 TextR 0.006 0.29 0.009 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 Recall Recall
  • 8.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 711-720 718 Figure 4. Macro-averaged F1-scores of AAUT-ML versus three baseline methods on three different datasets 4.5. Discussion Table 1 shows macro averaged F1-score, recall, and precision values for four different methods (AAUT-ML, YAKE, TF-IDF, TextR) applied to three different datasets (data mining, operating system, data base). For the "operating system" dataset, it is seen that AAUT-ML seems to perform the best across all metrics, with high F1-score, recall, and precision values. Also, in the "data mining" dataset, AAUT-ML performs relatively well compared to other methods. Finally, for the "data base" dataset, YAKE and TF-IDF have similar F1-scores, recall, and precision values, with AAUT-ML performing slightly better. This work also tried to utilize the sparemax and fusedmax but both these failed to achieve higher results in comparison to the softmax. Table 1. Comparative study Macro Averaged F1-Score Recall Precision AAUT -ML YAKE TF- IDF TextR AAUT- ML YAKE TF- IDF TextR AAUT- ML YAKE TF- IDF TextR Data mining 0.288 0.101 0.08 0.01 0.164 0.085 0.03 0.006 0.373 0.126 0.046 0.005 Operating system 0.488 0.186 0.068 0.023 0.302 0.112 0.146 0.29 0.866 0.288 0.133 0.021 Data base 0.263 0.182 0.045 0.001 0.272 0.194 0.053 0.009 0.265 0.177 0.041 0.007 5. CONCLUSION In this work, sensing complex meanings from unstructured data using natural language processing and convolution neural network techniques has been presented. In this analysis, different datasets namely data base, data mining, and operating systems datasets are used. Our study has challenged some conventional assumptions about how CNNs process and classify text. Firstly, we have demonstrated that max-pooling over time introduces a thresholding effect on the output of the convolution layer, effectively distinguishing between relevant and non-relevant features for the final classification. This insight allowed us to identify the crucial 𝑛-grams for classification, associating each filter with the class it contributes to. We have also highlighted instances where filters assign negative values to specific word activations, leading to low scores for 𝑛-grams containing them, despite otherwise having highly activating words. These findings contribute to enhancing the interpretability of CNNs for text classification. Our approach effectively categorizes various documents and their respective domains. We evaluated its performance using metrics such as precision, recall, and F1-score across multiple datasets, demonstrating superior results compared to existing methods. The performance of the presented work shows better results in comparison to the existing works. For future work, this work can be used for classifying corpus semantics in structured data. Different feature extraction processes can be used for structuring the data. Also, along with NLP, machine learning can be used. Data Mining Operating System Data Base AAUT-ML 0.288 0.488 0.263 YAKE 0.101 0.186 0.182 TF-IDF 0.08 0.068 0.045 TextR 0.01 0.023 0.001 0 0.1 0.2 0.3 0.4 0.5 0.6 Macro Averaged F1-Score Macro Averaged F1-Score
  • 9. Int J Elec & Comp Eng ISSN: 2088-8708  Sensing complicated meanings from unstructured data: A novel hybrid approach (Shankarayya Shastri) 719 REFERENCES [1] E. Camilleri and S. J. Miah, “Evaluating latent content within unstructured text: an analytical methodology based on a temporal network of associated topics,” Journal of Big Data, vol. 8, no. 1, Sep. 2021, doi: 10.1186/s40537-021-00511-0. [2] D. Antons, E. Grünwald, P. Cichy, and T. O. Salge, “The application of text mining methods in innovation research: current state, evolution patterns, and development priorities,” R&D Management, vol. 50, no. 3, pp. 329–351, 2020, doi: 10.1111/radm.12408. [3] S. Sulova, “A conceptual framework for the technological advancement of e-commerce applications,” Businesses, vol. 3, no. 1, pp. 220–230, Mar. 2023, doi: 10.3390/businesses3010015. [4] D. Khurana, A. Koli, K. Khatter, and S. Singh, “Natural language processing: state of the art, current trends and challenges,” Multimedia Tools and Applications, vol. 82, no. 3, pp. 3713–3744, Jul. 2023, doi: 10.1007/s11042-022-13428-4. [5] M. Y. Landolsi, L. Hlaoua, and L. Ben Romdhane, “Information extraction from electronic medical documents: state of the art and future research directions,” Knowledge and Information Systems, vol. 65, no. 2, pp. 463–516, Nov. 2023, doi: 10.1007/s10115-022-01779-1. [6] A. K. Sharma, S. Chaurasia, and D. K. Srivastava, “Sentimental short sentences classification by using CNN deep learning model with fine tuned Word2Vec,” Procedia Computer Science, vol. 167, pp. 1139–1147, 2020, doi: 10.1016/j.procs.2020.03.416. [7] S. Soni, S. S. Chouhan, and S. S. Rathore, “TextConvoNet: a convolutional neural network based architecture for text classification,” Applied Intelligence, vol. 53, no. 11, pp. 14249–14268, Oct. 2023, doi: 10.1007/s10489-022-04221-9. [8] P. Song, C. Geng, and Z. Li, “Research on text classification based on convolutional neural network,” in Proceedings - 2nd International Conference on Computer Network, Electronic and Automation, ICCNEA 2019, IEEE, Sep. 2019, pp. 229–232. doi: 10.1109/ICCNEA.2019.00052. [9] Y. Zhu, “Research on news text classification based on deep learning convolutional neural network,” Wireless Communications and Mobile Computing, vol. 2021, pp. 1–6, Dec. 2021, doi: 10.1155/2021/1508150. [10] A. Fesseha, S. Xiong, E. D. Emiru, M. Diallo, and A. Dahou, “Text classification based on convolutional neural networks and word embedding for low-resource languages: Tigrinya,” Information (Switzerland), vol. 12, no. 2, pp. 1–17, Jan. 2021, doi: 10.3390/info12020052. [11] H. Lu, L. Ehwerhemuepha, and C. Rakovski, “A comparative study on deep learning models for text classification of unstructured medical notes with various levels of class imbalance,” BMC Medical Research Methodology, vol. 22, no. 1, Jul. 2022, doi: 10.1186/s12874-022-01665-y. [12] M. Zulqarnain, R. Ghazali, Y. M. M. Hassim, and M. Rehan, “A comparative review on deep learning models for text classification,” Indonesian Journal of Electrical Engineering and Computer Science (IJEECS), vol. 19, no. 1, pp. 325–335, Jul. 2020, doi: 10.11591/ijeecs.v19.i1.pp325-335. [13] K. Adnan and R. Akbar, “An analytical study of information extraction from unstructured and multidimensional big data,” Journal of Big Data, vol. 6, no. 1, Oct. 2019, doi: 10.1186/s40537-019-0254-8. [14] A. Gupta, I. Banerjee, and D. L. Rubin, “Automatic information extraction from unstructured mammography reports using distributed semantics,” Journal of Biomedical Informatics, vol. 78, pp. 78–86, Feb. 2018, doi: 10.1016/j.jbi.2017.12.016. [15] E. Chang and J. Mostafa, “Cohort identification from free-text clinical notes using SNOMED CT’s hierarchical semantic relations,” AMIA ... Annual Symposium proceedings. AMIA Symposium, vol. 2022, pp. 349–358, 2022. [16] E. Chang and J. Mostafa, “The use of SNOMED CT, 2013-2020: A literature review,” Journal of the American Medical Informatics Association, vol. 28, no. 9, pp. 2017–2026, Jun. 2021, doi: 10.1093/jamia/ocab084. [17] S. Henry, K. Buchan, M. Filannino, A. Stubbs, and O. Uzuner, “2018 N2C2 Shared task on adverse drug events and medication extraction in electronic health records,” Journal of the American Medical Informatics Association, vol. 27, no. 1, pp. 3–12, Oct. 2020, doi: 10.1093/jamia/ocz166. [18] K. Adnan and R. Akbar, “Limitations of information extraction methods and techniques for heterogeneous unstructured big data,” International Journal of Engineering Business Management, vol. 11, Art. no. 184797901989077, Jan. 2019, doi: 10.1177/1847979019890771. [19] J. Tekli, R. Chbeir, A. J. M. Traina, and C. Traina, “SemIndex+: A semantic indexing scheme for structured, unstructured, and partly structured data,” Knowledge-Based Systems, vol. 164, pp. 378–403, Jan. 2019, doi: 10.1016/j.knosys.2018.11.010. [20] C. C. Aggarwal, Data mining: The textbook. Synthesis Collection of Technology, 2015. [21] R. Ramakrishnan and J.Gehrke, Database management systems. India, 2014. [22] VOCW, “Operating systems,” cnx.org. https://cnx.org/contents/epUq7msG@2.1:vLiqr17-@1/Process (accessed Aug. 10, 2023). [23] OpenStax, “OpenStax | free textbooks online with no catch,” cnx.org. http://cnx.org/content/col10785/1.2/ (accessed Aug. 10, 2023). [24] S. Gul, S. Räbiger, and Y. Saygın, “Context-based extraction of concepts from unstructured textual documents,” Information Sciences, vol. 588, pp. 248–264, Apr. 2022, doi: 10.1016/j.ins.2021.12.056. [25] A. Bougouin, F. Boudin, and B. Daille, “TopicRank: Graph-based topic Ranking for keyphrase extraction,” in 6th International Joint Conference on Natural Language Processing, IJCNLP 2013 - Proceedings of the Main Conference, Nagoya, Japan: Asian Federation of Natural Language Processing, Oct. 2013, pp. 543–551. [Online]. Available: https://aclanthology.org/I13-1062 [26] F. Boudin, “Unsupervised keyphrase extraction with multipartite graphs,” in NAACL HLT 2018 - 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, Association for Computational Linguistics, 2018, pp. 667–672. doi: 10.18653/v1/n18-2105. BIOGRAPHIES OF AUTHORS Shankarayya Shastri is assistant professor in the Department of Computer Science and Engineering at GM Institute of Technology, Davangere, India with more than 16 years of teaching and research experience. He Holds a Bachelor of Engineering degree in computer science and engineering and M.Tech. degree in computer science and engineering. His research areas are text mining, data mining, big data analytics and pattern recognition. He can be contacted at email: shan.shas@gmail.com.
  • 10.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 711-720 720 Veeragangadhara Swamy Teligi Math received the Bachelor of Engineering degree in computer science from the University BDT College of Engineering, Davangere and the M.Tech. degree in computer science and engineering from Dr. A.I.T, Bangalore, Karnataka. He obtained his PhD from SJJT University, Rajasthan. He used to hold several administrative posts in various engineering colleges in Karnataka. He has supervised and co- supervised more than 50 masters and 5 PhD students. He has authored or coauthored more than 40 publications: 10 proceedings and 40 journals. His research interests include data mining, web mining, big data and pattern recognition. He can be contacted at email: swamytm@gmail.com. Patil Nagaraja Siddalingappa working as associate professor in Department of Information Science, Bapuji Institute of Engineering Technology, Davangere. His teaching experience is 13 years and his area of interest is graph database, big data, database management system, and data mining. He can be contacted at email: patilbathi@gmail.com.