NYC Data Science Academy
Hadoop Application Development with Real Cases
Multi-layer Model
Data Pyramid and Roles
 Business personnel
 ETL engineer
 Data warehouse engineer
 Analyst
 Data visualization engineer
 IT support: operations and maintenance, programmers
Data Analysis
 Analyze collected data with statistical methods for a defined purpose, then interpret
the results and act on them
Data Mining
 Data mining is a technique for retrieving hidden information from data. It is a process that applies
knowledge-discovery algorithms to large databases and presents the discovered associations to users.
 Origins: hypothesis testing, pattern recognition, artificial intelligence, machine learning
 Common data mining tasks: association rules, clustering, outlier analysis
 Case: beer and diapers
 Science: Detecting Novel Associations in Large Data Sets
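The beer-and-diapers case is usually told in terms of association-rule measures. As a minimal sketch, assuming a handful of invented shopping baskets (not data from the course), the following Python computes support, confidence, and lift for the rule diapers -> beer. The function names are illustrative and do not come from any library.

```python
# Toy market-basket data; every transaction here is invented for illustration.
transactions = [
    {"beer", "diapers", "chips"},
    {"diapers", "milk"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"diapers", "beer", "milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) = support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence divided by the baseline frequency of the consequent."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
rule_lift = lift({"diapers"}, {"beer"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f} lift={rule_lift:.2f}")
```

A lift above 1 would suggest the two items appear together more often than chance; on this toy data the lift is slightly below 1, a reminder that a high confidence alone does not prove an association.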
Business Intelligence
 BI = data warehouses (storage) + data analysis and data mining (analysis) +
reporting (presentation)
 Our course
Data Analysis Algorithms
 Popular Algorithms
Regression
Time Series Analysis
Classifier
Clustering
Association Rules
Data Analysis
 Data Analysis Tools
Popular Data Analysis Tools Ranking
Data Analysis Stages
 Stage 1: dominated by business personnel
 Stage 2: dominated jointly by business personnel and analysts
 Stage 3: dominated by analysts
Data Analysis in Stage 1
 Business staff set all the requirements and most of the analysis plans
 Drawing on experience, business staff select the features and set the thresholds;
IT staff search for and integrate the data; analysts produce the reports
 Feature selection and the choice of thresholds rest on experience and
personal knowledge
 Suitable for simple cases; the analysis technique is equivalent to the simplest
decision tree
 Business staff have valuable experience and are hard to replace;
analysts are only there to draw charts and are easily replaced
 This is common in traditional industries
Data Analysis in Stage 2
 More complex. Business staff can analyze a small number of
data records but cannot work out all the features and the
relationships among them. They have no experience with large
numbers of samples.
 Analysts clean the data, select features, and finally build a
suitable model to solve the problem.
 Business staff and analysts can evaluate the results together,
so success is very likely. Analysts prefer this stage because their ability
and value are recognized.
Spammer in Wordpress
Data Analysis in Stage 3
 Business staff have no experience with the
case and cannot offer any useful prior
knowledge
 Analysts use various tools and models to
mine the data, hoping to make interesting
discoveries
 This is the analyst's ideal world, but it is likely to
fail
 Business staff cannot get involved, and they
dislike this stage
Step Forward
 Stage 1 (gold on the ground) -> Stage 2 (gold beneath the ground) -> Stage 3 (gold
deeply buried)
 If analysts are reckless, business staff will be reluctant to
help
 Data analysis is rooted in the business context. The
goal of analysis is to increase profit. Successful analysis
cannot be separated from the business
 An interesting topic is more important than the model
What is Big Data
Features of Big Data
Challenges for Analysts
 Insertion and query both become bottlenecks as the amount of data grows
 The trend toward integrating analysis results into user-facing applications demands faster
real-time computation and shorter response times
 More complex models require more expensive computation
Dilemma of Traditional Data Analysis Tools
 R, SAS, and SPSS are experimental tools
 The data size they can handle is limited by memory size
 An Oracle database can hold large volumes of data but lacks specialized, fast
analysis capabilities
 Sampling is a limited solution; it does not work for clustering or recommendation
systems
 Solution: a Hadoop cluster and Map-Reduce parallel computing
Case 1: Analysis and monitoring for a telecommunications company
 Configuration of the original database server: HP minicomputer, 128 GB memory, 48-core
CPU; RAC with two nodes, one for insertion and the other for queries
 Storage: HP virtual storage, over 1,000 disks
 Architecture: Oracle RAC with two nodes
 Bottlenecks: 1. Insertion 2. Query
Case 2: DNA database
Case 3: Social analysis, activity fingerprint detection

Public voice mail intersect
               IMSI 1   IMSI 2   ……   IMSI n   Total call duration   Public fingerprint (feature vector)
User A IMSI    20%      12%      ……   5%       365                   (0.2, 0.12, …, 0.05)
User B IMSI    15%      13%      ……   2%       310                   (0.15, 0.13, …, 0.02)

Public SMS intersect
               IMSI 1   IMSI 2   ……   IMSI n   Monthly SMS count     Public fingerprint (feature vector)
User A IMSI    50%      10%      ……   5%       200                   (0.5, 0.1, …, 0.05)
User B IMSI    20%      13%      ……   2%       260                   (0.2, 0.13, …, 0.02)

Public base station
               CGI 1    CGI 2    ……   CGI n    Shutdown              Public fingerprint (feature vector)
User A IMSI    20%      12%      ……   5%       20%                   (0.2, 0.12, …, 0.05, 0.2)
User B IMSI    15%      13%      ……   2%       5%                    (0.15, 0.13, …, 0.02, 0.05)
 Let θ be the angle between two fingerprint vectors
 When θ equals 90°, the two vectors are independent
 When θ equals 0°, the two vectors are perfectly dependent
 The closer θ is to 0°, the more dependent the vectors are
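One way to compare two activity fingerprints is the angle θ between them, obtained from the cosine formula cos θ = (u · v) / (|u| |v|). The sketch below assumes short, made-up vectors, since the slide elides the middle components; `angle_degrees` is a hypothetical helper name, not part of any course code.

```python
import math


def angle_degrees(u, v):
    """Angle between two equal-length vectors, in degrees."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


# Hypothetical three-component fingerprints for two IMSIs.
user_a = [0.20, 0.12, 0.05]
user_b = [0.15, 0.13, 0.02]

between_users = angle_degrees(user_a, user_b)
same_direction = angle_degrees(user_a, [2 * x for x in user_a])
orthogonal = angle_degrees([1.0, 0.0], [0.0, 1.0])
print(between_users, same_direction, orthogonal)
```

A small angle between the fingerprints of two IMSIs would hint that they belong to the same person; where to set the cut-off is a business decision that this sketch does not address.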
Case 3: Social analysis, VIP detection
The Solution Analysts Look Forward To
 Eliminates the bottleneck for the foreseeable future
 Migrates existing techniques smoothly, for example SQL and R
 The cost of the new platform: hardware and software, redevelopment, skills training,
maintenance
Path to Big Data
Idea of Hadoop
Map-Reduce Programming
Map-Reduce program for meteorological
data analysis
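The program shown on this slide is not reproduced in the text. As a rough illustration of the Map-Reduce pattern on weather data, the sketch below simulates the map, shuffle, and reduce phases in plain Python to find the maximum temperature per year. The "year,temperature" record format and all function names are assumptions; a real Hadoop job would be written against the Hadoop API and run across a cluster.

```python
from collections import defaultdict

# Hypothetical input records in "year,temperature" form; real weather
# datasets use richer formats that would need their own parser.
records = [
    "1949,111",
    "1950,0",
    "1950,22",
    "1949,78",
    "1950,-11",
]


def map_record(line):
    """Map phase: emit one (year, temperature) pair per input line."""
    year, temperature = line.split(",")
    yield year, int(temperature)


def shuffle(pairs):
    """Shuffle phase: group all values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_max(year, temperatures):
    """Reduce phase: collapse one key's values to their maximum."""
    return year, max(temperatures)


mapped = [pair for line in records for pair in map_record(line)]
max_by_year = dict(reduce_max(year, temps) for year, temps in shuffle(mapped).items())
print(max_by_year)
```

Because each reducer sees only one key's values, the reduce calls are independent of one another, which is what allows a cluster to run them in parallel.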
Map-Reduce implementation for popular
algorithms
Why not Hadoop?
 Java?
 Hard to control?
 Hard to integrate data?
 Hadoop vs Oracle
Analysis on a Hadoop System
 Mainstream: Java programs
 Lightweight scripting language: Pig
 Smooth migration from SQL: Hive
 NoSQL: HBase
Family of Hadoop
Pig
 Pig can be treated as client software
for Hadoop: it connects to a Hadoop cluster
and runs analyses
 Pig is convenient for users unfamiliar
with Java; it uses a SQL-like data-flow
language, Pig Latin
 Pig Latin can sort, filter,
sum, group, and join data, and lets users define
custom functions. It is a lightweight
scripting language for data manipulation and
analysis
 Pig can be treated as a mapping
from Pig Latin to Map-Reduce
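The data-flow style can be pictured without Pig itself. The following is a plain-Python analogy, not Pig Latin and not how Pig executes: each step mirrors one kind of operation mentioned above (filtering, grouping, summing, sorting), applied to a hypothetical list of (user, bytes sent) records.

```python
from collections import defaultdict

# Hypothetical (user, bytes_sent) records standing in for a loaded data set.
visits = [("ann", 120), ("bob", 30), ("ann", 80), ("cid", 0), ("bob", 45)]

# Step 1, filtering: keep only rows with traffic.
non_empty = [row for row in visits if row[1] > 0]

# Step 2, grouping: collect the values of each user together.
by_user = defaultdict(list)
for user, sent in non_empty:
    by_user[user].append(sent)

# Step 3, aggregation: sum each group.
totals = [(user, sum(values)) for user, values in by_user.items()]

# Step 4, sorting: order by total, largest first.
ranked = sorted(totals, key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Each named intermediate result feeds the next step, which is the sense in which such a script describes a flow of data rather than a single query.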
Hive
 A data warehouse tool that turns the
underlying data structures in Hadoop into
tables in Hive
 Supports HiveQL, a language almost
the same as SQL; its functionality matches
SQL except for updating,
indexing and …
 Can be treated as a mapping from
SQL to Map-Reduce
 Offers shell,
JDBC/ODBC, Thrift, and web interfaces
Features of Mahout
 Mahout provides scalable machine learning
algorithms (Map-Reduce implementations), but the
Hadoop platform is not required: the
core library also has efficient algorithms
for a single machine
 Mature and popular algorithms:
1. Frequent itemset mining
2. Clustering
3. Classification
4. Recommendation systems
5. Frequent subgraph mining
Reference Textbooks
Typical Experiment Environment (with server)
 Server: ESXi, which can host multiple virtual machines and should be able to run 3
machines at the same time
 PC: Linux or Windows + Cygwin; Linux can run standalone or in a virtual machine
 SSH: use the ssh command under Linux, or SecureCRT or PuTTY under Windows, to
connect to the remote Linux server
 VMware client: management of ESXi
 Hadoop: use version 1.x or 2.x
Typical Experiment Environment (with only a PC or laptop running Windows)
 At least 4 GB of memory; 64-bit Windows is preferred, because a 32-bit system can use
only a little more than 3 GB of memory
 Install VMware Workstation or VirtualBox
 Deploy 3 virtual machines and run them at the same time. If only two VMs can run,
treat the host as a node (via Cygwin) and use bridged networking for the virtual network
 Install Linux and Java
 Older computers can fall back to a pseudo-distributed environment
Experiment Environment
 Deploy Pig
 Deploy Hive
 Deploy Mahout
List of Cases in the Course
 Analysis of a high-volume website log system; retrieving KPI data (Map-Reduce)
 LBS application for a telecommunications company; analysis of the traces of users' mobile phones (Map-Reduce)
 User analysis for a telecommunications company; labeling duplicate users by their call fingerprints (Map-Reduce)
 Recommendation system for an e-commerce company (Map-Reduce)
 Complex recommendation system application (Mahout)
 Social networks; distance between users; community detection (Pig)
 Importance of nodes in a social network (Map-Reduce)
 Application of clustering algorithms; VIP analysis (Map-Reduce, Mahout)
 Financial data analysis; retrieving reverse repurchase information from historical data (Hive)
 Setting stock strategies with data analysis (Map-Reduce, Hive)
 GPS application; sign-in data analysis (Pig)
 Implementation and optimization of sorting on Map-Reduce
 Middleware development; cooperation among multiple Hadoop clusters

More Related Content

PDF
Big Data Scotland 2017
PDF
1524 how ibm's big data solution can help you gain insight into your data cen...
 
DOC
Sudhir hadoop and Data warehousing resume
PPTX
Use dependency injection to get Hadoop *out* of your application code
PPTX
Making Bank Predictive and Real-Time
PDF
Hadoop Trends
PPTX
Benefits of Transferring Real-Time Data to Hadoop at Scale
Big Data Scotland 2017
1524 how ibm's big data solution can help you gain insight into your data cen...
 
Sudhir hadoop and Data warehousing resume
Use dependency injection to get Hadoop *out* of your application code
Making Bank Predictive and Real-Time
Hadoop Trends
Benefits of Transferring Real-Time Data to Hadoop at Scale

What's hot (20)

PPTX
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
PPTX
Data Science at Speed. At Scale.
PDF
Destroying Data Silos
PPTX
Gov & Private Sector Regulatory Compliance: Using Hadoop to Address Requirements
PPTX
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
PDF
Ibm machine learning for z os
PDF
Privacy-Preserving AI Network - PlatON 2.0
PDF
Driven by data - Why we need a Modern Enterprise Data Analytics Platform
PPTX
Monitizing Big Data at Telecom Service Providers
PPTX
Operational Analytics Using Spark and NoSQL Data Stores
PDF
OpenPOWER Update
PDF
IBM Big Data Analytics Concepts and Use Cases
PDF
Democratizing Data Science on Kubernetes
PPTX
Pouring the Foundation: Data Management in the Energy Industry
PDF
Modern Data Architecture
PDF
Machine Learning Everywhere
PPTX
Software engineering practices for the data science and machine learning life...
PPTX
Deutsche Telekom on Big Data
PPTX
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
PDF
Future of Data Platform in Cloud Native world
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
Data Science at Speed. At Scale.
Destroying Data Silos
Gov & Private Sector Regulatory Compliance: Using Hadoop to Address Requirements
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Ibm machine learning for z os
Privacy-Preserving AI Network - PlatON 2.0
Driven by data - Why we need a Modern Enterprise Data Analytics Platform
Monitizing Big Data at Telecom Service Providers
Operational Analytics Using Spark and NoSQL Data Stores
OpenPOWER Update
IBM Big Data Analytics Concepts and Use Cases
Democratizing Data Science on Kubernetes
Pouring the Foundation: Data Management in the Energy Industry
Modern Data Architecture
Machine Learning Everywhere
Software engineering practices for the data science and machine learning life...
Deutsche Telekom on Big Data
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Future of Data Platform in Cloud Native world
Ad

Viewers also liked (20)

PDF
Learn from HomeAway Hadoop Development and Operations Best Practices
PDF
Ebook 5 steps to speak a new language
PDF
Présentation Igloo - Forum Mind&Market 2016
PDF
Vortrag Anlagensubstanzbewertung zur Konferenz Smart Maintenance 2015
PDF
Siscon Corporate Document
PDF
Informativa bluemoz italiano web
PDF
Whopper lust
PPTX
PPT Del Paseo Museo Interactivo Mirador
PPTX
evaluación distancia grado historia del arte_arte prehistórico
PDF
MBC Group - Magnolia in the Media
PDF
Phase transformation and volume collapse of sm bi under high pressure
PDF
2014 10 zoomsquare fh technikum excerpt
PPSX
Induction session | AIESEC BABEZ
PDF
Audio engineering timeline
PPT
Hadoop at Yahoo! -- University Talks
PPT
Salud enfermedad
PPTX
Geografia de grecia bach
PPT
S2 1 Intro Anva
PDF
LCCS Charity Golf & Gala Dinner
DOCX
El Rubius
Learn from HomeAway Hadoop Development and Operations Best Practices
Ebook 5 steps to speak a new language
Présentation Igloo - Forum Mind&Market 2016
Vortrag Anlagensubstanzbewertung zur Konferenz Smart Maintenance 2015
Siscon Corporate Document
Informativa bluemoz italiano web
Whopper lust
PPT Del Paseo Museo Interactivo Mirador
evaluación distancia grado historia del arte_arte prehistórico
MBC Group - Magnolia in the Media
Phase transformation and volume collapse of sm bi under high pressure
2014 10 zoomsquare fh technikum excerpt
Induction session | AIESEC BABEZ
Audio engineering timeline
Hadoop at Yahoo! -- University Talks
Salud enfermedad
Geografia de grecia bach
S2 1 Intro Anva
LCCS Charity Golf & Gala Dinner
El Rubius
Ad

Similar to Hadoop dev 01 (20)

PDF
Big Data at a Gaming Company: Spil Games
DOC
GauravSriastava
PDF
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
PPTX
Big data and apache hadoop adoption
PPTX
Builiding analytical apps on Hadoop
PDF
50 Shades of SQL
PPTX
Big Data Introduction
PPTX
Hadoop Boosts Profits in Media and Telecom Industry
PDF
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
PDF
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
PPTX
Harvard case studies presentation 09102013
PDF
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
PDF
Capturing big value in big data
PDF
2015 02 12 talend hortonworks webinar challenges to hadoop adoption
PPTX
DataJan27.pptxDataFoundationsPresentation
PPTX
Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev Kumar
PPTX
BIG Data & Hadoop Applications in Finance
PDF
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
PDF
Hortonworks.HadoopPatternsOfUse.201304
PPTX
Big Data, Baby Steps
Big Data at a Gaming Company: Spil Games
GauravSriastava
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Big data and apache hadoop adoption
Builiding analytical apps on Hadoop
50 Shades of SQL
Big Data Introduction
Hadoop Boosts Profits in Media and Telecom Industry
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
Harvard case studies presentation 09102013
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Capturing big value in big data
2015 02 12 talend hortonworks webinar challenges to hadoop adoption
DataJan27.pptxDataFoundationsPresentation
Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev Kumar
BIG Data & Hadoop Applications in Finance
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
Hortonworks.HadoopPatternsOfUse.201304
Big Data, Baby Steps

More from Vivian S. Zhang (20)

PDF
Why NYC DSA.pdf
PPTX
Career services workshop- Roger Ren
PDF
Nycdsa wordpress guide book
PDF
We're so skewed_presentation
PDF
Wikipedia: Tuned Predictions on Big Data
PDF
A Hybrid Recommender with Yelp Challenge Data
PDF
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
PDF
Data mining with caret package
PDF
PPTX
Streaming Python on Hadoop
PDF
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
PDF
PDF
Nyc open-data-2015-andvanced-sklearn-expanded
PDF
Nycdsa ml conference slides march 2015
PDF
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
PDF
Max Kuhn's talk on R machine learning
PDF
Winning data science competitions, presented by Owen Zhang
PDF
Using Machine Learning to aid Journalism at the New York Times
PDF
Introducing natural language processing(NLP) with r
PDF
Bayesian models in r
Why NYC DSA.pdf
Career services workshop- Roger Ren
Nycdsa wordpress guide book
We're so skewed_presentation
Wikipedia: Tuned Predictions on Big Data
A Hybrid Recommender with Yelp Challenge Data
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Data mining with caret package
Streaming Python on Hadoop
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Nyc open-data-2015-andvanced-sklearn-expanded
Nycdsa ml conference slides march 2015
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
Max Kuhn's talk on R machine learning
Winning data science competitions, presented by Owen Zhang
Using Machine Learning to aid Journalism at the New York Times
Introducing natural language processing(NLP) with r
Bayesian models in r

Recently uploaded (20)

PPTX
Cell Types and Its function , kingdom of life
PDF
Classroom Observation Tools for Teachers
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
PDF
TR - Agricultural Crops Production NC III.pdf
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
O7-L3 Supply Chain Operations - ICLT Program
PPTX
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PPTX
Cell Structure & Organelles in detailed.
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
01-Introduction-to-Information-Management.pdf
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
Cell Types and Its function , kingdom of life
Classroom Observation Tools for Teachers
STATICS OF THE RIGID BODIES Hibbelers.pdf
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
TR - Agricultural Crops Production NC III.pdf
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
FourierSeries-QuestionsWithAnswers(Part-A).pdf
O7-L3 Supply Chain Operations - ICLT Program
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Cell Structure & Organelles in detailed.
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
01-Introduction-to-Information-Management.pdf
Final Presentation General Medicine 03-08-2024.pptx
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
Abdominal Access Techniques with Prof. Dr. R K Mishra
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
Module 4: Burden of Disease Tutorial Slides S2 2025

Hadoop dev 01

  • 1. NYC Data Science Academy Hadoop Application Development with Real Cases Hadoop Application Development with Real Cases
  • 2. NYC Data Science Academy Hadoop Application Development with Real Cases Multi-layer Model 2
  • 3. NYC Data Science Academy Hadoop Application Development with Real Cases Data Pyramid and Character  Business personnel  ETL Engineer  Data Warehouse Engineer  Analyzer  Data Visualization Engineer  IT supporter: Operation- Maintanence, Programmer 3
  • 4. NYC Data Science Academy Hadoop Application Development with Real Cases Data Analysis  Analyze collected data with statistical methods on purpose, then understand and implement the result 4
  • 5. NYC Data Science Academy Hadoop Application Development with Real Cases Data Mining  Data Mining is a technique focusing on retrieving hidden information in the data. It is a process that apply knowledge-discovery algorithms to large database and show the associations to the users.  Original Idea: Hypothesis testing, Pattern Recognition, Artificial Intellegence, Machine Learning  Common Data Mining Projects: Association Rules, Clustering, Outlier Analysis  Case: Beer and Diaper  Science: Detecting Novel Associations in Large Data Sets 5
  • 6. NYC Data Science Academy Hadoop Application Development with Real Cases Business Intelligence  BI = Data Warehouses (Storage) + Data Analysis and Data Mining (Analysis) + Report (Demonstration)  Our course 6
  • 7. NYC Data Science Academy Hadoop Application Development with Real Cases Data Analysis Algorithms  Popular Algorithms 7
  • 8. NYC Data Science Academy Hadoop Application Development with Real Cases Regression 8
  • 9. NYC Data Science Academy Hadoop Application Development with Real Cases Time Series Analysis
  • 10. NYC Data Science Academy Hadoop Application Development with Real Cases Classifier 10
  • 11. NYC Data Science Academy Hadoop Application Development with Real Cases Clustering 11
  • 12. NYC Data Science Academy Hadoop Application Development with Real Cases Association Rules 12
  • 13. NYC Data Science Academy Hadoop Application Development with Real Cases Data Analysis  Data Analysis Tools 13
  • 14. NYC Data Science Academy Hadoop Application Development with Real Cases Popular Data Analysis Tools Ranking 14
  • 15. NYC Data Science Academy Hadoop Application Development with Real Cases Data Analysis stages  stage 1: Dominate by Business personnel  stage 2: Dominate by both Business personnel and Analyzer  stage 3: Dominate by Analyzer 15
  • 16. NYC Data Science Academy Hadoop Application Development with Real Cases Data Analysis in stage 1  Business staff set all the requirements and most analysis plans  According to experiences, Business staff select features, set threshold, and IT staff search, integrate data, analyzer make report  Feature selection and choice of threshold is based on experience and personal knowledge  Suitable for simple cases, analysis technique is equivalent to the simplest decision tree  Business staffs has valuable experiences and hard to be replaced, analyzers are just for graphing and is easily replaced  This is common in the traditional industry 16
  • 17. NYC Data Science Academy Hadoop Application Development with Real Cases Data Analysis in stage 2  More complex. Business staffs could analyze a small number of data records while cannot figure out all the features and the relationship among them. They have no experience with large number of samples.  Analyzer come to clean data and select features, and finally build suitable model to solve problem.  Business staffs and analyzer could evaluate the result together, very likely to success. Analyzer prefer this step because their ability and value is confirmed. 17
  • 18. NYC Data Science Academy Hadoop Application Development with Real Cases Spammer in Wordpress
  • 19. NYC Data Science Academy Hadoop Application Development with Real Cases Data Analysis in stage 3  Business staffs have no experience for the case, and cannot offer any useful prior knowledge  Data analyzers use various tools and models to mine the data and trying to have interesting discovery  It is analyzer’s ideal world, while it is likely to fail  Business staffs cannot get involved, and they dislike this stage 19
  • 20. NYC Data Science Academy Hadoop Application Development with Real Cases Step Forward  The first stage(Gold on the ground) -> The second stage(Gold beneath the ground) -> The third stage (Gold deeply buried)  If analyzers are reckless, business staffs will resist to help  Data analysis is rooted in the business background. The goal of analysis is increasing profit. Successful analysis could not be apart from business  Interesting topic is more important than the model 20
  • 21. NYC Data Science Academy Hadoop Application Development with Real Cases What is Big Data
  • 22. NYC Data Science Academy Hadoop Application Development with Real Cases Features of Big Data
  • 23. NYC Data Science Academy Hadoop Application Development with Real Cases Challenges for Analyzers  Bottleneck for both insertion and query due to the increasing amount of data  The trend of integrating users’ application and analysis result is asking for faster real-time computation and response time  More complex models require more expensive computation 23
  • 24. NYC Data Science Academy Hadoop Application Development with Real Cases Dilemma of Traditional Data Analysis Tools  R, SAS, SPSS are experimental tools  Capable data size is restricted by the memory size  Use Oracle database for large volume of data, but lack of professional and fast analyzing ability  Sampling is a limited solution, it is not useful for clustering and recommendation system  Solution: Hadoop cluster and Map-Reduce parallel computing 24
  • 25. NYC Data Science Academy Hadoop Application Development with Real Cases Case 1: analysis and monitor for a telecommunication company 25
  • 26. NYC Data Science Academy Hadoop Application Development with Real Cases Case 1: analysis and monitor for a telecommunication company  Configuration of the original database server: HP minicomputer, 128G memory, 48- core CPU, RAC with two nodes, one node for insertion and the other for query  Storage: HP virtual storage, over 1000 disks  Architecture: Oracle RAC with two nodes  Bottleneck: 1. Insertion 2. Query 26
  • 27. NYC Data Science Academy Hadoop Application Development with Real Cases Case 2: DNA database 27
  • 28. NYC Data Science Academy Hadoop Application Development with Real Cases Case 3: Social analysis, activity fingerprint detection  28| Public Voice mail intersect IMSI 1 IMSI 2 …… IMSI n total call duration User A IMSI 20% 12% …… 5% 365 User B IMSI 15% 13% …… 2% 310 Public SMS intersect IMSI 1 IMSI 2 …… IMSI n Monthly SMS count User A IMSI 50% 10% …… 5% 200 User B IMSI 20% 13% …… 2% 260 Public base station CGI 1 CGI 2 …… CGI n Shutdown User A IMSI 20% 12% …… 5% 20% User B IMSI 15% 13% …… 2% 5% Public Fingerprint (0.2, 0.12, …, 0.05) (0.15, 0.13, …, 0.02) (0.5, 0.1, …, 0.05) (0.2, 0.13, …, 0.02) (0.2, 0.12, …, 0.05, 0.2) (0.15, 0.13, …, 0.02, 0.05 eigenvector
  • 29. NYC Data Science Academy Hadoop Application Development with Real Cases  When equals to , these two vectors are independent When equals to 0 , these two vectors are perfectly dependent The closer is from 0, the more dependent these vectors are 90 Case 3: Social analysis, activity fingerprint detection 29
  • 30. NYC Data Science Academy Hadoop Application Development with Real Cases Case 3: Social analysis, VIP detection 30
  • 31. NYC Data Science Academy Hadoop Application Development with Real Cases Solution that analyzers look forward to  Perfectly eliminate the bottleneck in the foreseeable future  Smoothly transplant available techniques, for example SQL and R.  The cost of new platform: hardware and software, re-development, skill training, maintenance 31
  • 32. NYC Data Science Academy Hadoop Application Development with Real Cases Path to Big Data
  • 33. NYC Data Science Academy Hadoop Application Development with Real Cases Idea of Hadoop 33
  • 34. NYC Data Science Academy Hadoop Application Development with Real Cases Map-Reduce Programming 34
  • 35. NYC Data Science Academy Hadoop Application Development with Real Cases Map-Reduce program for meteorological data analysis 35
  • 36. NYC Data Science Academy Hadoop Application Development with Real Cases Map-Reduce implementation for popular algorithms 36
  • 37. NYC Data Science Academy Hadoop Application Development with Real Cases Map-Reduce implementation for popular algorithms 37
  • 38. NYC Data Science Academy Hadoop Application Development with Real Cases Why not Hadoop?  Java?  Hard to control?  Hard to integrate data?  Hadoop vs Oracle 38
  • 39. Analysis under the Hadoop system  Mainstream: Java programs  Lightweight scripting language: Pig  Smooth migration from SQL: Hive  NoSQL: HBase
  • 40. Family of Hadoop
  • 41. Pig  Pig can be treated as client software for Hadoop: it connects to a Hadoop cluster and runs analyses  Pig is convenient for users unfamiliar with Java; it uses Pig Latin, a SQL-like language for describing data flows  Pig Latin can sort, filter, sum, group, and join data, and supports user-defined functions. It is a lightweight scripting language for data manipulation and analysis  Pig can be treated as the mapping from Pig Latin to Map-Reduce
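To make the "mapping from Pig Latin to Map-Reduce" idea concrete, here is a rough single-process sketch in Python of how a filter, group, and sum data flow breaks down into map, shuffle, and reduce steps. It only imitates the phases in memory; it is not how Pig compiles scripts, and the record layout (user, call duration) and all function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(records, min_duration):
    """Mapper: drop short calls, then emit (user, duration) pairs."""
    for user, duration in records:
        if duration >= min_duration:
            yield user, duration

def shuffle(pairs):
    """Shuffle: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: sum the values of each key."""
    return {key: sum(values) for key, values in sorted(grouped.items())}

def total_duration_per_user(records, min_duration=0):
    return reduce_phase(shuffle(map_phase(records, min_duration)))

# Hypothetical call records: (user, duration).
calls = [("A", 30), ("B", 5), ("A", 12), ("B", 40), ("C", 2)]
print(total_duration_per_user(calls, min_duration=10))  # {'A': 42, 'B': 40}
```

In Pig Latin the same flow would be written with the FILTER, GROUP, and SUM operators, and Pig would generate the Map-Reduce jobs instead of the analyst writing them by hand.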
  • 42. Hive  A data warehouse tool that maps the underlying data structures in Hadoop to tables in Hive  Supports HiveQL, a language almost the same as SQL; its functionality matches SQL except updating, indexing and  Can be treated as the mapping from SQL to Map-Reduce  Offers interfaces for shell, JDBC/ODBC, Thrift, and Web
  • 43. Features of Mahout  Mahout provides scalable machine learning algorithms (Map-Reduce implementations), but the Hadoop platform is not required: the core library also has efficient single-machine algorithms  Mature and popular algorithms are 1. Frequent Itemset Mining 2. Clustering 3. Classification 4. Recommendation System 5. Frequent Subgraph Mining
  • 44. Reference Textbooks
  • 45. Reference Textbooks
  • 46. Reference Textbooks
  • 47. Reference Textbooks
  • 48. Typical Experiment Environment (with server)  Server: ESXi, capable of deploying multiple virtual machines and running 3 of them at the same time  PC: Linux or Windows + Cygwin; Linux can be standalone or a virtual machine  SSH: use the ssh command under Linux, or SecureCRT or PuTTY under Windows, to connect to the remote Linux server  VMware client: management of ESXi  Hadoop: use version 1.x or 2.x
  • 49. Typical Experiment Environment (with only a PC or laptop running Windows)  At least 4 GB of memory; 64-bit Windows is preferred, because a 32-bit machine can use only a little more than 3 GB of memory  Install VMware Workstation or VirtualBox  Deploy 3 virtual machines and run them at the same time. If only two VMs can run, treat the host as a node (via Cygwin) and use bridged networking for the virtual network  Install Linux and Java  Older computers can use a pseudo-distributed environment
  • 50. Experiment Environment  Deploy Pig  Deploy Hive  Deploy Mahout
  • 51. List of Cases of the Course  Analysis of a high-volume website log system; retrieving KPI data (Map-Reduce)  LBS application for a telecommunication company; analysis of the traces of users' mobile phones (Map-Reduce)  User analysis for a telecommunication company; labeling duplicated users by the fingerprint of their calls (Map-Reduce)  Recommendation system for an e-commerce company (Map-Reduce)  Complicated recommendation system application (Mahout)  Social network; distance between users; community detection (Pig)  Importance of nodes in a social network (Map-Reduce)  Application of clustering algorithms; analysis of VIPs (Map-Reduce, Mahout)  Financial data analysis; retrieving reverse repurchase information from historical data (Hive)  Setting stock strategies with data analysis (Map-Reduce, Hive)  GPS application; sign-in data analysis (Pig)  Implementation and optimization of sorting on Map-Reduce  Middleware development; cooperation of multiple Hadoop clusters
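As a taste of the sorting case in the list above, the sketch below shows one common way to get a globally sorted result from several reducers: route keys to partitions by range, sort each partition independently, and concatenate the partitions in order. It is a single-process Python imitation under stated assumptions, not the course's solution; the function names and the hand-picked boundaries are hypothetical, and a real job would usually derive the boundaries by sampling the input.

```python
import bisect

def partition_by_range(keys, boundaries):
    """Map side: send each key to the partition whose range contains it.

    `boundaries` must be sorted; n boundaries define n + 1 partitions.
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        partitions[bisect.bisect_right(boundaries, key)].append(key)
    return partitions

def total_sort(keys, boundaries):
    """Reduce side: sort each partition, then concatenate them in order."""
    result = []
    for part in partition_by_range(keys, boundaries):
        result.extend(sorted(part))
    return result

print(total_sort([5, 1, 9, 3, 7, 2], boundaries=[3, 6]))  # [1, 2, 3, 5, 7, 9]
```

Because every key in one partition is smaller than every key in the next, no merge step is needed after the reducers finish; the quality of the boundaries decides how evenly the work is spread.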