Java in production for Data Mining Research projects
Alexey Zinovyev, Java Trainer at EPAM
JDD conference
About
I am a <graph theory, machine learning, traffic jam prediction, BigData algorithms> scientist
But I'm a <Java, NoSQL, Hadoop, Spark> programmer
In this talk …
A lot of strange pictures and technologies from a crazy zoo
We will talk about
• Data Mining
• Hadoop ecosystem
• Spark and its friends
• Machine Learning libraries
Are you a Hadoop developer?
Let’s do THIS!
The Good Old Days
One of these fine days...
We need a Python dev 'cause Data Mining
No, you are only a JavaEE developer, carry on …
Write your backends, dude!
Let’s talk about it, Java-boy...
Can a Java programmer be a Data Scientist?
Sexy Data Scientist
Real Data Scientist
And here is what I tell you, young man
WHAT IS DATA MINING?
Statistics?
Tag cloud from B2B conference?
Not OLAP, 100%
Hey, man, predict something!
Man or sofa?
SUBJECT AREA
Typical questions for DM
• Which loan applicants are high-risk?
• How do we detect phone card fraud?
• What is the revenue prediction for next year?
• Can you recommend music for users?
It’s Time for Java Superhero, yeah!
Before discovering patterns you should …
• Select small samples of the data
• Define default values for missing data
• Remove strange signals (noise) from the data
• Merge several tables into one if required
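The preparation steps above can be sketched in plain Java. This is only an illustration under stated assumptions, not code from the talk: the Applicant class, its fields, the default income and the age bounds are hypothetical, and the table-merge step is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical preprocessing sketch: sample, fill defaults, drop noisy rows. */
final class PreprocessingSketch {

    /** Hypothetical input row; a null income means the value is missing. */
    static final class Applicant {
        final int age;
        final Double income;

        Applicant(int age, Double income) {
            this.age = age;
            this.income = income;
        }
    }

    /** Keeps every step-th row: a simple way to select a small piece of the data. */
    static List<Applicant> sample(List<Applicant> rows, int step) {
        List<Applicant> result = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += step) {
            result.add(rows.get(i));
        }
        return result;
    }

    /** Replaces missing incomes with the given default value. */
    static List<Applicant> fillMissingIncome(List<Applicant> rows, double defaultIncome) {
        List<Applicant> result = new ArrayList<>();
        for (Applicant a : rows) {
            result.add(a.income == null ? new Applicant(a.age, defaultIncome) : a);
        }
        return result;
    }

    /** Removes rows whose age is outside a plausible range (the "strange signals"). */
    static List<Applicant> removeOutliers(List<Applicant> rows, int minAge, int maxAge) {
        List<Applicant> result = new ArrayList<>();
        for (Applicant a : rows) {
            if (a.age >= minAge && a.age <= maxAge) {
                result.add(a);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Applicant> raw = Arrays.asList(
                new Applicant(25, 1000.0),
                new Applicant(340, 2000.0), // impossible age: noise
                new Applicant(41, null),    // missing income
                new Applicant(33, 1500.0));
        List<Applicant> clean = removeOutliers(fillMissingIncome(raw, 0.0), 18, 100);
        System.out.println("Rows after cleaning: " + clean.size());
    }
}
```

Each step returns a new list, so the raw data stays untouched and the steps can be reordered or skipped per dataset.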
TARGET DATA &
PERSONAL DATA
Targeting
by …
Pay with your personal data
All your personal data (PD) are
being deeply mined
The industry of collecting, aggregating, and brokering PD is called “database marketing.”
Acxiom holds 1.1 billion browser cookies, 200 million mobile profiles, and an average of 1,500 pieces of data per consumer
RTB
DATA
Datasets
• Facebook users, tweets
• Trade transactions
• Government
• Medicine (genomic data)
• Telecommunications
Data Sources
• Relational Databases
• Data warehouses (Historical data)
• Files in CSV or in binary format
• Internet or e-mail
• Scientific and research tools (R, Octave, Matlab)
PATTERN MINING
Association rule learning
What is Cluster Analysis?
It is the process of grouping objects into clusters so that objects in one cluster are more similar to each other than to objects in other clusters.
Unlike classification, no class labels are known in advance.
Different algorithms – different results
Regression
Classification
• Training set of classified examples (supervised learning)
• Test set of non-classified items
• Main goal: find a function (classifier) that maps input data to a category (class)
Decision trees
Cruel Tree
kNN (k-nearest neighbors)
Is the green circle a blue square or a red triangle? Let’s ask its neighbors!
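"Ask the neighbors" can be sketched as a tiny classifier in plain Java. This is a minimal illustration, assuming 2D points and Euclidean distance; the Point class and the label strings are hypothetical, and ties between labels are broken arbitrarily.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal k-nearest-neighbors classifier for 2D points (illustrative sketch). */
final class KnnSketch {

    /** A labeled training point; the labels are hypothetical. */
    static final class Point {
        final double x;
        final double y;
        final String label;

        Point(double x, double y, String label) {
            this.x = x;
            this.y = y;
            this.label = label;
        }
    }

    /** Squared Euclidean distance; the square root is not needed for ranking. */
    static double squaredDistance(Point p, double x, double y) {
        double dx = p.x - x;
        double dy = p.y - y;
        return dx * dx + dy * dy;
    }

    /** Returns the majority label among the k training points closest to (x, y). */
    static String classify(List<Point> training, double x, double y, int k) {
        // Sort a copy of the training set by distance to the query point.
        List<Point> sorted = new ArrayList<>(training);
        sorted.sort(Comparator.comparingDouble(p -> squaredDistance(p, x, y)));

        // Count the labels of the k closest points.
        Map<String, Integer> votes = new HashMap<>();
        for (Point p : sorted.subList(0, Math.min(k, sorted.size()))) {
            votes.merge(p.label, 1, Integer::sum);
        }

        // Pick the label with the most votes.
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

Sorting the whole training set costs O(n log n) per query, which is fine for a demo; larger datasets usually need a spatial index or a distributed implementation.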
Hit parade of algorithms
FASHION LANGUAGES
Octave
Why not Octave?
• A small number of ML algorithms
• All your matrixes are belong to us!
• Single-threaded model
• Java support
• Octave in Java?
Do you like
this GUI?
Why not R?
• 25% of R packages are written in Java
• Syntax is too sweet
• You should read 1000 lines of docs to write 1 line of code
• Single-threaded model for 95% of algorithms
Why not Python?
Python is now an idol for young scientists due to its low barrier to entry
• High-level language
• Have you ever heard of Jython?
• A long way to real high-load production
• We are not Python developers
DM libraries
Let’s run on
JVM!
JAVA ECOSYSTEM
Family
Spring Data
HADOOP
Hadoop
Hadoop and Data Knights
MapReduce for WordCount
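The WordCount flow can be imitated in a single JVM without Hadoop. The sketch below is not the Hadoop API: the map, shuffle and reduce methods are hypothetical stand-ins for the Mapper, the framework's grouping step and the Reducer, and they only show how data moves between the phases.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Single-process imitation of the MapReduce WordCount flow (no Hadoop involved). */
final class WordCountSketch {

    /** Map phase: emit a (word, 1) pair for every word of one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group all emitted values by key, as the framework would. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values collected for one word. */
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    /** Runs map, shuffle and reduce in sequence over all input lines. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(wordCount(Arrays.asList("to be or not to be", "to do")));
    }
}
```

On a real cluster the map calls run in parallel on input splits and the shuffle moves data over the network, which is where most of the cost sits.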
How to build features on a Hadoop cluster?
Pig & Hive
Hive
PIG
PIG (Triangle count)
Pig
• User Defined Functions (UDF)
• FOREACH
• Pipeline-style
• Easy parallelization
Pig
Why do we need a special graph approach?
HOW TO MAKE GRAPH
FEATURES
SNA
MapReduce for iterative calculations
• Reducing a graph problem to the key-value model is highly complex
• Graph algorithms are iterative, but in M/R each iteration is a separate chained job that fully saves and re-reads the state
Think like a vertex…
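"Think like a vertex" can be illustrated with a toy superstep loop in plain Java. This is a sketch in the spirit of Pregel-style systems, not the API of any particular framework: the hopDistances method and the adjacency-list input are assumptions made for the example. Each vertex reads its incoming messages, updates its own state and sends messages to its neighbors, and the state stays in memory between iterations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy vertex-centric computation run in supersteps (illustrative sketch). */
final class VertexCentricSketch {

    /**
     * Computes hop distances from the source vertex. adjacency.get(v) lists the
     * out-neighbors of vertex v; unreachable vertices keep Integer.MAX_VALUE.
     */
    static int[] hopDistances(List<List<Integer>> adjacency, int source) {
        int[] distance = new int[adjacency.size()];
        Arrays.fill(distance, Integer.MAX_VALUE);

        // Inbox of the current superstep: vertex id -> candidate distances.
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        inbox.put(source, new ArrayList<>(Arrays.asList(0)));

        // Run supersteps until no vertex receives a message.
        while (!inbox.isEmpty()) {
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> entry : inbox.entrySet()) {
                int vertex = entry.getKey();
                // The vertex program: take the smallest incoming candidate.
                int best = Integer.MAX_VALUE;
                for (int candidate : entry.getValue()) {
                    best = Math.min(best, candidate);
                }
                // Only a vertex whose state improved messages its neighbors.
                if (best < distance[vertex]) {
                    distance[vertex] = best;
                    for (int neighbor : adjacency.get(vertex)) {
                        outbox.computeIfAbsent(neighbor, k -> new ArrayList<>()).add(best + 1);
                    }
                }
            }
            inbox = outbox; // messages are delivered in the next superstep
        }
        return distance;
    }

    public static void main(String[] args) {
        List<List<Integer>> graph = Arrays.asList(
                Arrays.asList(1, 2),
                Arrays.asList(3),
                Arrays.asList(3),
                Arrays.asList(4),
                Collections.<Integer>emptyList());
        System.out.println(Arrays.toString(hopDistances(graph, 0)));
    }
}
```

Compared with chained M/R jobs, nothing is written to disk between supersteps; only messages move between vertices.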
Data vs
Graph
Messaging
TRAIN
MODEL
JDM
Java API for Data Mining, JSR 73 and JSR 247
• javax.datamining.supervised defines the supervised function-related interfaces
• javax.datamining.algorithm contains all mining algorithm subclass packages
• JDM 2.0 adds Text Mining, Time Series and so on
Who knows Weka?
Weka
• Connectors to R, Octave, Matlab, Hadoop, NoSQL/SQL databases
• Source code of all algorithms in Java
• Preprocessing tools: discretization, normalization, resampling, attribute selection, transforming and combining
Weka
Weka +
Hadoop
SPMF
• It’s a codebase of algorithms from the pattern mining field
• It has cool examples and implementations of 109 algorithms
• Cool performance results in its specific area
• The codebase grows very fast
• Not many classification algorithms are covered
Mahout
• Scalable machine learning with Samsara
• Advanced implementations of Java’s Collections Framework for better performance
• New algorithms will be built on the Spark platform
• Collaborative Filtering, Classification, Clustering,
Dimensionality Reduction, Miscellaneous are supported
Collaborative Filtering
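Mahout code sample (K-Means)
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.math.Vector;

// Read the point values and generate vectors from the input data
// (vectorize and the write/read helpers below are the speaker's own methods, not shown)
final List<Vector> vectors = vectorize(points);
// Write the vectors to Hadoop sequence files
writePointsToFile(configuration, vectors);
// Write the initial cluster centers
writeClusterInitialCenters(configuration, vectors);
// Run the K-means algorithm
final Path inputPath = new Path(POINTS_PATH);
final Path clustersPath = new Path(CLUSTERS_PATH);
final Path outputPath = new Path(OUTPUT_PATH);
HadoopUtil.delete(configuration, outputPath);
// Arguments after the paths: convergence delta, max iterations, run clustering,
// cluster classification threshold, run sequentially
KMeansDriver.run(configuration, inputPath, clustersPath, outputPath, 0.001, 10, true, 0, false);
// Read and print the output values
readAndPrintOutputValues(configuration);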
Hadoop
ecosystem
HADOOP IS NOT SEXY
Whaaaat?
Map Reduce Job Writing
Hadoop
Jobs
YARN?
SPARK: the bloody son of MR
• MapReduce in memory
• Up to 50x faster than Hadoop
• RDD is the basic building block (an immutable distributed collection of objects)
• Pipeline-style API (no need for Pig)
GC & Spark
• Heaps > 100 GB for Spark apps
• Long GC pauses as a result
• Try the Garbage-First (G1) GC
• Tune spark.storage.memoryFraction (share of heap for cached data vs. heap left for transformations)
• There is no single recipe
Mahout’s killer?
MLlib supports
• Classification and regression
• Collaborative filtering
• Clustering
• Dimensionality reduction
• Optimization
MLlib code sample (K-Means)
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;

// Assumed context: parsedData is a JavaRDD<Vector> of feature vectors, sc is a JavaSparkContext
// Cluster the data into two classes using KMeans
int numClusters = 2;
int numIterations = 20;
KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
// Evaluate clustering by computing Within Set Sum of Squared Errors
double WSSSE = clusters.computeCost(parsedData.rdd());
System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
// Save and load model
clusters.save(sc.sc(), "myModelPath");
KMeansModel sameModel = KMeansModel.load(sc.sc(), "myModelPath");
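What the WSSSE number means can be shown without Spark. The sketch below is a plain-Java illustration of the metric (for every point, the squared Euclidean distance to its nearest center, summed up); the wssse method and the array-based inputs are hypothetical and are not part of MLlib.

```java
/** Plain-Java illustration of the Within Set Sum of Squared Errors metric. */
final class WssseSketch {

    /** Sums, over all points, the squared Euclidean distance to the nearest center. */
    static double wssse(double[][] points, double[][] centers) {
        double total = 0.0;
        for (double[] point : points) {
            double nearest = Double.POSITIVE_INFINITY;
            for (double[] center : centers) {
                double squared = 0.0;
                for (int d = 0; d < point.length; d++) {
                    double diff = point[d] - center[d];
                    squared += diff * diff;
                }
                nearest = Math.min(nearest, squared);
            }
            total += nearest;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {2, 0}, {10, 0}, {12, 0}};
        double[][] centers = {{1, 0}, {11, 0}};
        System.out.println("WSSSE = " + wssse(points, centers));
    }
}
```

A lower value means tighter clusters, but it always drops as the number of clusters grows, so it is better used to compare runs with the same k or to look for an "elbow".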
MLlib
• .. borrows ideas from scikit-learn (Python lib) and Mahout
• .. runs fully on Spark and supports Spark’s Pipeline API
• .. represents a dataset as Spark SQL’s SchemaRDD (renamed DataFrame in Spark 1.3)
• .. supports Hive as an external data source
• .. works well for large datasets and parallelized algorithms
It solves all problems!
In conclusion
• Think about your data
• Make friends with a DevOps engineer
• Run Spark
• Learn algorithms
• Write Java code
Think Java
Contacts
E-mail : Alexey_Zinovyev@epam.com
Twitter : @zaleslaw @BigDataRussia
LinkedIn: https://www.linkedin.com/in/zaleslaw

JavaDayKiev'15 Java in production for Data Mining Research projects