Hadoop dev 01

NYC Data Science Academy
Hadoop Application Development with Real Cases

Multi-layer Model

Data Pyramid and Roles
• Business personnel
• ETL engineer
• Data warehouse engineer
• Analyst
• Data visualization engineer
• IT support: operations and maintenance, programmers

Data Analysis
• Analyze collected data with statistical methods for a defined purpose, then interpret and
act on the results

Data Mining
• Data mining is a technique for retrieving hidden information from data. It is a process that applies
knowledge-discovery algorithms to large databases and presents the associations found to users.
• Origins: hypothesis testing, pattern recognition, artificial intelligence, machine learning
• Common data mining tasks: association rules, clustering, outlier analysis
• Case: beer and diapers
• Science: Detecting Novel Associations in Large Data Sets
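The beer-and-diapers case is usually told in terms of association rules. As a minimal sketch (the baskets below are invented for illustration and are not data from the course), support and confidence for a rule such as diapers -> beer could be computed like this:

```python
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


def frequent_pairs(transactions, min_support):
    """Return every item pair whose support is at least `min_support`."""
    items = sorted(set().union(*transactions))
    result = {}
    for pair in combinations(items, 2):
        s = support(pair, transactions)
        if s >= min_support:
            result[pair] = s
    return result
```

Here `confidence({"diapers"}, {"beer"}, baskets)` estimates how often beer appears in baskets that already contain diapers. Enumerating all pairs only works for tiny examples; on real data an algorithm such as Apriori or FP-Growth would be the usual choice.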

Business Intelligence
• BI = data warehouses (storage) + data analysis and data mining (analysis) +
reports (presentation)
• Our course

Data Analysis Algorithms
• Popular algorithms

Regression

Time Series Analysis

Classifier

Clustering

Association Rules

Data Analysis
• Data analysis tools

Popular Data Analysis Tools Ranking

Data Analysis Stages
• Stage 1: dominated by business personnel
• Stage 2: dominated jointly by business personnel and analysts
• Stage 3: dominated by analysts

Data Analysis in Stage 1
• Business staff set all the requirements and most of the analysis plan
• Drawing on experience, business staff select the features and set the thresholds;
IT staff retrieve and integrate the data, and analysts produce the reports
• Feature selection and the choice of thresholds rest on experience and
personal knowledge
• Suitable for simple cases; the analysis technique is equivalent to the simplest
decision tree
• Business staff have valuable experience and are hard to replace;
analysts only produce charts and are easily replaced
• This is common in traditional industries
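The threshold rules described for stage 1 amount to a one-level decision tree (a decision stump). A minimal sketch, in which the feature name, the threshold, and the customer records are all hypothetical:

```python
def flag_records(records, feature, threshold):
    """One-level decision tree: return the ids of records whose feature exceeds the threshold."""
    return [r["id"] for r in records if r[feature] > threshold]


# Hypothetical customer records; in stage 1 the feature and the threshold
# would come from business staff experience, not from a fitted model.
customers = [
    {"id": "c1", "monthly_spend": 120.0},
    {"id": "c2", "monthly_spend": 35.0},
    {"id": "c3", "monthly_spend": 80.0},
]

high_value = flag_records(customers, "monthly_spend", 75.0)
```

Nothing is learned from the data here: the rule is fixed in advance, which is why this stage depends so heavily on the person who picks the feature and the threshold.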

Data Analysis in Stage 2
• More complex. Business staff can analyze a small number of
data records but cannot identify all the features and the
relationships among them. They have no experience with large
numbers of samples.
• Analysts clean the data, select features, and finally build a
suitable model to solve the problem.
• Business staff and analysts can evaluate the results together,
so success is very likely. Analysts prefer this stage because their ability
and value are recognized.

Spammer in Wordpress

Data Analysis in Stage 3
• Business staff have no experience with the
case and cannot offer any useful prior
knowledge
• Data analysts use various tools and models to
mine the data, hoping to make an interesting
discovery
• This is the analyst's ideal world, but it is likely to
fail
• Business staff cannot get involved, and they
dislike this stage

Step Forward
• The first stage (gold on the ground) -> the second
stage (gold beneath the ground) -> the third stage (gold
deeply buried)
• If analysts are reckless, business staff will be reluctant to
help
• Data analysis is rooted in the business context. The
goal of analysis is to increase profit. Successful analysis
cannot be separated from the business
• An interesting question is more important than the model

What is Big Data

Features of Big Data

Challenges for Analysts
• Both insertion and querying become bottlenecks as the amount of data grows
• The trend toward integrating analysis results into user-facing applications demands faster
real-time computation and shorter response times
• More complex models require more expensive computation

Dilemma of Traditional Data Analysis
Tools
• R, SAS, and SPSS are experimental tools
• The data size they can handle is limited by memory size
• An Oracle database can hold large volumes of data, but lacks specialized, fast
analysis capabilities
• Sampling is a limited solution; it does not work for clustering or recommendation
systems
• Solution: Hadoop cluster and Map-Reduce parallel computing

Case 1: Analysis and monitoring for a
telecommunications company
• Configuration of the original database server: HP minicomputer, 128 GB memory, 48-core
CPU, RAC with two nodes, one node for insertion and the other for queries
• Storage: HP virtual storage, over 1000 disks
• Architecture: Oracle RAC with two nodes
• Bottlenecks: 1. insertion 2. query

Case 2: DNA database

Case 3: Social analysis, activity
fingerprint detection

Public voice mail intersect   IMSI 1   IMSI 2   ……   IMSI n   Total call duration
User A IMSI                   20%      12%      ……   5%       365
User B IMSI                   15%      13%      ……   2%       310

Public SMS intersect          IMSI 1   IMSI 2   ……   IMSI n   Monthly SMS count
User A IMSI                   50%      10%      ……   5%       200
User B IMSI                   20%      13%      ……   2%       260

Public base station           CGI 1    CGI 2    ……   CGI n    Shutdown
User A IMSI                   20%      12%      ……   5%       20%
User B IMSI                   15%      13%      ……   2%       5%

Public fingerprint (feature vectors)
                 User A                       User B
Voice            (0.2, 0.12, …, 0.05)         (0.15, 0.13, …, 0.02)
SMS              (0.5, 0.1, …, 0.05)          (0.2, 0.13, …, 0.02)
Base station     (0.2, 0.12, …, 0.05, 0.2)    (0.15, 0.13, …, 0.02, 0.05)

Let θ be the angle between two fingerprint vectors.
When θ equals 90°, the two vectors are independent
When θ equals 0, the two vectors are perfectly dependent
The closer θ is to 0, the more dependent the vectors are
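One common way to compare two fingerprint vectors is the angle between them, obtained from their cosine similarity. The sketch below assumes that this is the measure intended here. The 3-component vectors use only the entries visible in the tables; the elided middle entries are left out, so the resulting number is illustrative only.

```python
import math


def angle_degrees(u, v):
    """Angle between two equal-length vectors, in degrees."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


# Truncated voice fingerprints for User A and User B (assumption: only the
# first two and the last visible components are used).
user_a = (0.2, 0.12, 0.05)
user_b = (0.15, 0.13, 0.02)

theta = angle_degrees(user_a, user_b)
```

A small `theta` suggests that the two IMSIs share a calling pattern and may belong to the same person; where to draw the line is a business decision and is not specified here.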

Case 3: Social analysis, VIP detection

Solution That Analysts Look Forward To
• Eliminates the bottleneck for the foreseeable future
• Lets existing techniques, for example SQL and R, be ported smoothly
• Cost of the new platform: hardware and software, redevelopment, skills training,
maintenance

Path to Big Data

Idea of Hadoop

Map-Reduce Programming

Map-Reduce program for meteorological
data analysis
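As a rough illustration of the Map-Reduce model on weather data, the following plain-Python sketch simulates the map, shuffle, and reduce phases to find the highest temperature per year. It does not use the Hadoop API, and the `year,temperature` record format is an assumption made for this example.

```python
from collections import defaultdict

# Hypothetical weather records as "year,temperature" lines; the real input
# format used in the course is not shown in this text.
records = [
    "1949,111",
    "1949,78",
    "1950,0",
    "1950,22",
    "1950,-11",
]


def map_record(line):
    """Map phase: emit one (year, temperature) pair per input line."""
    year, temperature = line.split(",")
    yield year, int(temperature)


def shuffle(pairs):
    """Shuffle phase: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_max(year, temperatures):
    """Reduce phase: keep the highest temperature seen for one year."""
    return year, max(temperatures)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on a list of input lines."""
    mapped = (pair for line in lines for pair in map_record(line))
    grouped = shuffle(mapped)
    return dict(reduce_max(year, temps) for year, temps in sorted(grouped.items()))


max_by_year = run_job(records)
```

On a real cluster the map and reduce functions run in parallel on different nodes and the framework performs the shuffle; only the two user-written functions carry over from this sketch.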

Map-Reduce implementation for popular
algorithms

Why Not Hadoop?
• Java?
• Hard to control?
• Hard to integrate data?
• Hadoop vs Oracle

Analysis on the Hadoop System
• Mainstream: Java programs
• Lightweight scripting language: Pig
• Smooth migration from SQL: Hive
• NoSQL: HBase

Family of Hadoop

Pig
• Pig can be treated as client software
for Hadoop: it connects to Hadoop
and runs the analysis
• Pig is convenient for users unfamiliar
with Java; it uses a SQL-like language,
Pig Latin, that operates on data flows
• Pig Latin can sort, filter,
sum, group, and join data, and lets users define
custom functions. It is a lightweight
scripting language for data manipulation and
analysis
• Pig can be viewed as a mapping
from Pig Latin to Map-Reduce
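To make the data-flow idea concrete, here is a plain-Python sketch of a filter -> group -> sum -> order pipeline, the kind of step-by-step flow a Pig Latin script expresses. It is not Pig Latin and does not run on Hadoop; the `(user, page, bytes)` rows and the field layout are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical (user, page, bytes) log rows; the field layout is an assumption.
rows = [
    ("alice", "/home", 512),
    ("bob", "/home", 128),
    ("alice", "/cart", 2048),
    ("carol", "/home", 64),
    ("bob", "/cart", 1024),
]


def filter_rows(data, min_bytes):
    """Filter step: keep rows at or above a size threshold."""
    return [row for row in data if row[2] >= min_bytes]


def group_by_user(data):
    """Group step: collect rows under their user key."""
    groups = defaultdict(list)
    for row in data:
        groups[row[0]].append(row)
    return groups


def sum_bytes(groups):
    """Aggregate step: total bytes per user."""
    return [(user, sum(r[2] for r in members)) for user, members in groups.items()]


def order_desc(totals):
    """Order step: sort users by total bytes, largest first."""
    return sorted(totals, key=lambda item: item[1], reverse=True)


# Each step consumes the previous step's output, like statements in a data-flow script.
result = order_desc(sum_bytes(group_by_user(filter_rows(rows, 100))))
```

Each function corresponds to one named step in the flow, which is what makes this style easier to read than a hand-written Map-Reduce job for users who do not know Java.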

Hive
• A data warehouse tool that turns
raw structured data in Hadoop into
tables in Hive
• Supports HiveQL, a language almost
the same as SQL; it offers the same
functions as SQL except updating,
indexing and
• Can be viewed as a mapping from
SQL to Map-Reduce
• Offers interfaces for the shell,
JDBC/ODBC, Thrift, and the Web
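The idea of putting a table over raw data and querying it with SQL can be tried locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in; it is not Hive and does not generate Map-Reduce jobs. The table name, the columns, and the tab-separated input lines are assumptions made for illustration, and the query sticks to plain `SELECT ... GROUP BY` syntax.

```python
import sqlite3

# Hypothetical tab-separated log lines standing in for raw files on Hadoop.
raw_lines = [
    "alice\t512",
    "bob\t128",
    "alice\t2048",
    "bob\t1024",
]


def total_bytes_per_user(lines):
    """Load raw lines into a table, then aggregate them with a SQL query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE logs (user_name TEXT, bytes INTEGER)")
        parsed = []
        for line in lines:
            name, size = line.split("\t")
            parsed.append((name, int(size)))
        conn.executemany("INSERT INTO logs VALUES (?, ?)", parsed)
        query = (
            "SELECT user_name, SUM(bytes) AS total "
            "FROM logs GROUP BY user_name ORDER BY user_name"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


totals = total_bytes_per_user(raw_lines)
```

In Hive the analyst would write a similar query and the engine would translate the grouping and summing into map and reduce stages behind the scenes.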

Features of Mahout
• Mahout provides scalable machine learning
algorithms (Map-Reduce implementations), but the
Hadoop platform is not required: the
core library also has efficient algorithms
for a single machine
• Mature and popular algorithms include
1. Frequent itemset mining
2. Clustering
3. Classification
4. Recommendation systems
5. Frequent subgraph mining

Reference Textbooks


Typical Experiment Environment (with
server)
• Server: ESXi, capable of hosting multiple virtual machines and running 3
of them at the same time
• PC: Linux or Windows+Cygwin; Linux can be standalone or a virtual machine
• SSH: use the ssh command under Linux, and SecureCRT or PuTTY under Windows, to
connect to the remote Linux server
• VMware client: management of ESXi
• Hadoop: use version 1.x or 2.x

Typical Experiment Environment (with
only a PC or laptop running Windows)
• At least 4 GB of memory; 64-bit Windows is preferred, because a 32-bit machine can use
only a little more than 3 GB of memory.
• Install VMware Workstation or VirtualBox
• Deploy 3 virtual machines and run them at the same time. If only two VMs can run,
treat the host as a node (via Cygwin), and use bridged networking for the virtual network
• Install Linux and Java
• Older computers can use a pseudo-distributed environment

Experiment Environment
• Deploy Pig
• Deploy Hive
• Deploy Mahout

List of Cases in the Course
• Log analysis for a high-traffic website; extracting KPI data (Map-Reduce)
• LBS application for a telecommunications company; analysis of users' mobile phone traces (Map-Reduce)
• User analysis for a telecommunications company; identifying duplicate users by their call
fingerprints (Map-Reduce)
• Recommendation system for an e-commerce company (Map-Reduce)
• A more complex recommendation system application (Mahout)
• Social networks; distance between users; community detection (Pig)
• Importance of nodes in a social network (Map-Reduce)
• Application of clustering algorithms; VIP analysis (Map-Reduce, Mahout)
• Financial data analysis; retrieving reverse repurchase information from historical data (Hive)
• Setting stock strategies with data analysis (Map-Reduce, Hive)
• GPS application; sign-in data analysis (Pig)
• Implementation and optimization of sorting on Map-Reduce
• Middleware development; cooperation among multiple Hadoop clusters
