Supervised Papers Classification on Large-Scale
High-Dimensional Data with Apache Spark
Leonidas Akritidis, Panayiotis Bozanis, Athanasios Fevgas
Department of Electrical and Computer Engineering
Data Structuring and Engineering Lab
University of Thessaly
The Fourth IEEE International Conference on Big Data
Intelligence and Computing (DataCom 2018)
12-15 August 2018, Athens, Greece
Supervised papers classification
• Classify a set of unlabeled articles X into one
research field y of a given taxonomy structure Y.
• Supervised learning problem: The algorithm will
exploit a given set of articles with known labels.
• Important problem for academic search engines
& digital libraries - a robust solution allows:
– Results refinement by category.
– Browsing of articles by category.
– Recommendations of similar articles.
L. Akritidis, P. Bozanis, A. Fevgas - IEEE DataCom 2018, 12-15 August 2018, Athens
Large-Scale Dataset
• This work makes use of the Open Academic Graph
https://www.openacademic.ai/oag
• It contains 167 million publications.
• The articles are classified into 19 primary research
areas and thousands of child categories.
• It occupies 300GB in uncompressed form.
• Each paper and its characteristics are represented
by a record in JSON format.
Features
• Keywords: Special words selected by the authors to
highlight the contents of their articles.
Present in some papers.
• Title words: Treated as normal keywords after
removing the stop words. Present in all papers.
• Authors history: The research fields of each
author and the respective frequency for each field.
• Co-authorship information: We record the research
fields of each (author, coauthor) pair.
• Journal history: The research fields of each
journal and the respective frequency for each field.
The original classification algorithm
• L. Akritidis, P. Bozanis. “A supervised machine
learning classification algorithm for research
articles”, ACM SAC, pp. 115-120, 2013.
• Given a set of labeled articles, we build a model
which correlates the keywords, the authors and
the journals with one or more research fields.
• The model contains a dictionary data structure M
with the features F of the dataset.
Original algorithm: Training phase
• For each feature f we store:
– A global frequency value |Xf|
(occurrences of f in dataset)
– A relevance descriptor vector (RDV)
with components (y, |Xf,y|).
– |Xf,y|: number of f and y
co-occurrences (i.e. how many
times y has been correlated
with f).
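As a single-machine sketch of the structure above (the function and variable names are illustrative, not from the original code), the dictionary M can be built like this:

```python
from collections import defaultdict

def build_model(labeled_articles):
    """Build the dictionary M: feature -> [|Xf|, {y: |Xf,y|}].
    |Xf| counts occurrences of feature f in the dataset; the inner
    map holds the RDV components (y, |Xf,y|)."""
    M = defaultdict(lambda: [0, defaultdict(int)])
    for features, label in labeled_articles:
        for f in features:
            M[f][0] += 1          # global frequency |Xf|
            M[f][1][label] += 1   # co-occurrence count |Xf,y|
    return M

def rdv(M, f):
    """Relevance descriptor vector for f, sorted by |Xf,y| descending."""
    return sorted(M[f][1].items(), key=lambda kv: -kv[1])

# Toy dataset: (feature list, label) pairs
M = build_model([
    (["neural", "network"], "ML"),
    (["network", "router"], "Networks"),
    (["network", "deep"], "ML"),
])
assert M["network"][0] == 3
assert rdv(M, "network") == [("ML", 2), ("Networks", 1)]
```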
Original algorithm: Test phase
• For each unlabeled article
x we extract the features
and we search M.
• We retrieve its RDV and we
compute a score Sy for each
label y in the RDV:
  Sy = Σf∈Fx wf · |Xf,y| / |Xf|
  (Fx: the feature set of article x)
• wf: The weight of feature f.
• Good accuracy: wk=0.3, wa=0.2, wj=0.5
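The scoring rule can be sketched in plain Python. In this toy version a single weight w stands in for the per-type weights (wk, wa, wj), and the model data is illustrative:

```python
def classify(M, features, w=0.3):
    """Score each candidate label y as Sy = sum over the article's
    features f of w * |Xf,y| / |Xf| and return the best label.
    A sketch: the real method uses per-type weights wk, wa, wj."""
    scores = {}
    for f in features:
        if f not in M:
            continue            # unknown features contribute nothing
        global_freq, rdv = M[f]
        for y, co_freq in rdv.items():
            scores[y] = scores.get(y, 0.0) + w * co_freq / global_freq
    return max(scores, key=scores.get) if scores else None

# Toy model: feature -> (|Xf|, {label: |Xf,y|})
M = {
    "neural":  (3, {"ML": 3}),
    "network": (4, {"ML": 2, "Networks": 2}),
}
assert classify(M, ["neural", "network"]) == "ML"
```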
Apache Spark
• Spark is a fault-tolerant parallelization framework.
• In contrast to MapReduce, it has been designed to
allow storage of intermediate data in the main
memory of the cluster nodes.
• In contrast to MapReduce which forces a linear
dataflow, it is based on a DAG Scheduler which
enhances the job execution performance.
• The core element is its Resilient Distributed Datasets
(RDDs), i.e. fault-tolerant & read-only abstractions for
data handling with various persistence levels.
Apache MLlib
• Spark powers a set of libraries including Spark
SQL, GraphX, Spark Streaming and MLlib.
• MLlib is Spark’s scalable machine learning library.
• It implements a series of classification and
regression algorithms.
• We are interested in comparing our model with
the multi-class classification algorithms of MLlib:
– i.e., Logistic Regression, Decision Trees and Random Forests.
The LIBSVM Format
• MLlib algorithms accept their input data in
LIBSVM format.
• Each row is represented by a LabeledPoint,
an abstraction which contains:
– The label of the data sample (double).
– A sparse vector of feature-weight pairs (int, double).
• We converted the dataset to LIBSVM format in
order to compare our method with the competing
classifiers of MLlib.
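For illustration, the LIBSVM text format itself is easy to emit; this minimal sketch (not MLlib code) shows the `<label> <index>:<value>` layout with ascending 1-based indices:

```python
def to_libsvm_line(label, sparse_features):
    """Render one sample in LIBSVM text format: '<label> <idx>:<value> ...'.
    LIBSVM requires feature indices in ascending order."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(sparse_features.items()))
    return f"{label} {pairs}"

# A label (double) plus a sparse vector of (index, weight) pairs
line = to_libsvm_line(3.0, {12: 0.5, 4: 1.0})
assert line == "3.0 4:1.0 12:0.5"
```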
Dataset Preprocessing
• Spark offers a powerful SQL-like
filtering mechanism.
– We discard all the unlabeled
samples and all the samples
with irrelevant labels.
– We discard all the non-English
articles.
– We convert the dataset to the
LIBSVM format by applying the
hashing trick.
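The hashing trick can be illustrated in a few lines of plain Python. The dimension and the CRC32 hash here are illustrative assumptions, not MLlib's internals:

```python
import zlib

def hash_features(tokens, dim=2**10):
    """Map arbitrary string features to a fixed-size sparse vector
    by hashing each token to an index in [0, dim)."""
    vec = {}
    for t in tokens:
        idx = zlib.crc32(t.encode("utf-8")) % dim
        vec[idx] = vec.get(idx, 0.0) + 1.0   # collisions simply add up
    return vec

v = hash_features(["spark", "mllib", "spark"])
assert sum(v.values()) == 3.0
assert all(0 <= i < 2**10 for i in v)
```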
Dimensionality Reduction
• After preprocessing, our dataset consisted of
about 75 million articles and 83 million features.
• Our method executes normally on this huge
feature space, but the MLlib algorithms do not:
– They crash with out-of-memory errors.
• Therefore, we applied Sparse Random Projection
to reduce the dimensionality of the feature space.
– The built-in dimensionality reduction algorithms of
MLlib, SVD and PCA, failed to complete the task.
– The final projected space included only 4181 features.
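A didactic, single-machine sketch of sparse random projection (an Achlioptas-style construction; the dimensions and density are illustrative, and this is not the implementation used in the paper):

```python
import math
import random

def sparse_random_matrix(d, k, s=3, seed=0):
    """d x k projection matrix with entries +sqrt(s/k) or -sqrt(s/k),
    each with probability 1/(2s), and 0 otherwise (so it is sparse)."""
    rng = random.Random(seed)
    scale = math.sqrt(s / k)
    R = [[0.0] * k for _ in range(d)]
    for i in range(d):
        for j in range(k):
            u = rng.random()
            if u < 1 / (2 * s):
                R[i][j] = scale
            elif u < 1 / s:
                R[i][j] = -scale
    return R

def project(x, R):
    """Project a dense d-dimensional vector x down to k dimensions."""
    k = len(R[0])
    return [sum(x[i] * R[i][j] for i in range(len(x))) for j in range(k)]

R = sparse_random_matrix(d=100, k=8)
y = project([1.0] * 100, R)
assert len(y) == 8
```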
Algorithm parallelization on Spark (1)
• The Driver program controls
the execution flow of the job.
• The dataset has been
converted to LIBSVM format.
• The split method automatically shuffles the
dataset and splits it into the training and test sets,
based on the parameter N (we set N=0.6).
• After the model M has been trained, it is
transmitted to all cluster nodes via a special
broadcast call.
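A single-machine sketch of that shuffle-and-split step, with the same 0.6 fraction (function name and seed are illustrative):

```python
import random

def shuffle_split(samples, train_fraction=0.6, seed=42):
    """Shuffle the dataset and split it into training and test sets."""
    data = list(samples)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

train_set, test_set = shuffle_split(range(100))
assert len(train_set) == 60 and len(test_set) == 40
```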
Algorithm parallelization on Spark (2)
• Our parallel model implementation includes:
– The features dictionary M.
– The training (fitting) function.
– The classification function.
• The training phase is a flatMap
function which operates in two
phases:
Algorithm parallelization on Spark (3)
• Stage 1: For each LabeledPoint of the input return
a local list λ of (f,y,wf) tuples.
• Collect all local lists λ and merge
them into a list l.
• Stage 2: Traverse the list l and
build the model M (insert the
features, compute frequencies,
build the RDVs).
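The two stages above can be mimicked in plain Python (a sketch: stage 1 plays the role of the flatMap over LabeledPoints, stage 2 folds the merged tuple list into M; the tuple layout and weight value are illustrative):

```python
from collections import defaultdict
from itertools import chain

def stage1(point, w=0.3):
    """Emit the local list λ of (f, y, wf) tuples for one sample."""
    features, label = point
    return [(f, label, w) for f in features]

def stage2(merged):
    """Traverse the merged list l and build the model M: insert the
    features, compute frequencies, build the per-label RDV counts.
    The weights wf are carried along for use at classification time."""
    M = defaultdict(lambda: [0, defaultdict(int)])
    for f, y, w in merged:
        M[f][0] += 1
        M[f][1][y] += 1
    return M

points = [(["spark", "rdd"], "Systems"), (["rdd", "graphs"], "Theory")]
merged = list(chain.from_iterable(stage1(p) for p in points))  # collect + merge
M = stage2(merged)
assert M["rdd"][0] == 2
assert M["rdd"][1] == {"Systems": 1, "Theory": 1}
```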
Algorithm parallelization on Spark (4)
• Test Phase: For each unlabeled article, establish a
list of candidate labels Y.
• For each feature, search M.
• In case it is found, retrieve
its RDV and for each entry in
the RDV, compute the scores
and update Y.
• Sort Y in decreasing score order.
• Assign the highest-scoring label to the article.
Experiments
• We used the cluster of our department: 8 nodes
with 16 CPUs and 64GB of RAM each.
• Java 1.8, Hadoop 2.9.0, HDFS, YARN, Spark 2.3.0.
• 15 executors (2 executors per node) with 28 GB of
available RAM each.
– One node runs only 1 executor plus the Application Master.
• Our method: Paper Classifier (PC)
• MLlib adversaries: Logistic Regression, Decision
Trees, Random Forests.
Accuracy Measurements
• In the original feature space (83.3M features)
only our method managed to complete, achieving
an accuracy of 79.1%.
• In the reduced feature space, our method lost a
portion of its accuracy (52.1%); however, it still
outperformed all the classifiers of MLlib.
Efficiency Measurements
• Our method was much faster than the algorithms
of MLlib. Even in the original feature space (83M
features), it was twice as fast as Logistic
Regression in the reduced feature space. In the
reduced feature space, it was 11 times faster.
• The dashes in the results indicate that SVD and
PCA failed to reduce the feature space.
Conclusions
• We presented a parallel supervised learning
algorithm for classifying research articles on
Apache Spark.
• Our method takes into consideration multiple
features including keywords, title words, authors,
and publishing journals of the articles.
• Our method operates effectively and efficiently
on large, high-dimensional datasets.
• It outperforms the built-in Spark MLlib classifiers
by a significant margin.