Introduction to Designing and Building Big Data Applications
Tom Wheeler | Senior Curriculum Developer
April 2014
Agenda
• Cloudera's Learning Path for Developers
• Target Audience and Prerequisites
• Course Outline
• Short Presentation Based on Actual Course Material
• Question and Answer Session
Learning Path: Developers
Create Powerful New Data Processing Tools

Developer Training
• Learn to code and write MapReduce programs for production
• Master advanced API topics required for real-world data analysis

HBase Training
• Design schemas to minimize latency on massive data sets
• Scale hundreds of thousands of operations per second

Intro to Data Science
• Implement recommenders and data experiments
• Draw actionable insights from analysis of disparate data

Big Data Applications
• Build converged applications using multiple processing engines
• Develop enterprise solutions using components across the EDH

Aaron T. Myers, Software Engineer
Hadoop Professionals: Build or Buy?
Professional Certification Decreases Hiring Risk

• An engineer with Hadoop skills commands a minimum salary premium of 25%
• Hadoop developers are now the top paid in tech, starting at $115K
• Compensation for a very senior Data Scientist opens at $300K

Sources: Business Insider, “10 Tech Skills That Will Instantly Net You A $100,000+ Salary,” 11 August 2012.
Business Insider, “30 Tech Skills That Will Instantly Net You A $100,000+ Salary,” 21 February 2013.
GigaOm, “Big Data Skills Bring Big Dough,” 17 February 2012.
Why Cloudera Training?
Aligned to Best Practices and the Pace of Change

1. Broadest Range of Courses: Developer, Admin, Analyst, HBase, Data Science
2. Most Experienced Instructors: More than 20,000 students trained since 2009
3. Leader in Certification: Over 8,000 accredited Cloudera professionals
4. Trusted Source for Training: 100,000+ people have attended online courses
5. State of the Art Curriculum: Courses updated as Hadoop evolves
6. Widest Geographic Coverage: Most classes offered: 50 cities worldwide plus online
7. Most Relevant Platform & Community: CDH deployed more than all other distributions combined
8. Depth of Training Material: Hands-on labs and VMs support live instruction
9. Ongoing Learning: Video tutorials and e-learning complement training
10. Commitment to Big Data Education: University partnerships to teach Hadoop in the classroom
Designing and Building
Big Data Applications
About the Course
Target Audience
• Intended for people who write code, such as
  • Software Engineers
  • Data Engineers
  • ETL Developers
Course Prerequisites
• Successful completion of our Developer course, or equivalent practical experience
• Intermediate-level Java skills
• Basic familiarity with Linux
• Knowledge of SQL or HiveQL is also helpful
Example of Required Java Skill Level
• Do you understand the following code? Could you write something similar?

package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Example extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    // Emit each word in the input line with a count of one
    for (String word : value.toString().split("\\s+")) {
      ctx.write(new Text(word), new IntWritable(1));
    }
  }
}
Example of Required Linux Skill Level
• Are you comfortable editing text files on a Linux system?
• Are you familiar with the following commands?
$ mkdir -p /tmp/incoming/web_logs
$ cd /var/log/web
$ mv *.log /tmp/incoming/web_logs
Course Objectives
• During this course, you will learn to
  • Determine which Hadoop-related tools are appropriate for specific tasks
  • Understand how file formats, serialization, and data compression affect application compatibility and performance
  • Design and evolve schemas in Apache Avro
  • Create, populate, and access data sets with the Kite SDK
  • Integrate with external systems using Apache Sqoop and Apache Flume
  • Integrate Apache Flume with existing applications and develop custom components to extend Flume’s capabilities
Course Objectives (continued)
• Create, package, and deploy Oozie jobs to manage processing workflows
• Develop Java-based data processing pipelines with Apache Crunch
• Implement user-defined functions for use in Apache Hive and Impala
• Index both static and streaming data sets with Cloudera Search
• Use Hue to build a Web-based interface for Search queries
• Integrate results from Impala and Cloudera Search into your applications
Scenario for Hands-On Exercises
• Frequent hands-on exercises
• Based on a hypothetical but realistic scenario: Loudacre Mobile, a fictional telecom company
• Each exercise builds toward a complete working application
Tools Used in Hands-On Exercises
• Ingest and Data Management: HDFS, Sqoop, Flume, Kite SDK / Morphlines
• Interactive Queries: HCatalog, Impala, Search
• Batch Processing: MapReduce, Crunch, Hive
• Data Format: Avro
Data Sources Used in Hands-On Exercises
• Source systems: RDBMS, telecom switches, CRM system, point-of-sale terminals, web servers
• Data flowing into the Enterprise Data Hub:
  • Equipment records
  • Customer records
  • Call detail records (fixed-width)
  • Phone activations (XML)
  • Static documents (HTML)
  • Log files (text)
  • Device status (CSV and TSV)
  • Chat transcripts (JSON)
Development Environment
• Exercises use a real-world development environment
  • IDE (Eclipse)
  • Unit testing library (JUnit)
  • Build and configuration management tool (Maven)
Course Outline
• Introduction
• Application Architecture *
• Designing and Using Data Sets *
• Using the Kite SDK Data Module *
• Importing Relational Data with Apache Sqoop *
• Capturing Data with Apache Flume *

* This chapter contains one or more hands-on exercises
Course Outline (continued)
• Developing Custom Flume Components *
• Managing Workflows with Apache Oozie *
• Processing Data Pipelines with Apache Crunch *
• Working with Tables in Apache Hive *
• Developing User-Defined Functions *
• Executing Interactive Queries with Impala *
Course Outline (continued)
• Understanding Cloudera Search
• Indexing Data with Cloudera Search *
• Presenting Results to Users *
• Conclusion
Course Excerpt
• Based on chapter 3: Designing and Using Data Sets
What is Data Serialization?
• Serialization represents data as a series of bytes
  • Allows us to store and transmit data
• There are many ways of serializing data
  • How do you serialize the number 108125150?
  • 4 bytes when stored as a Java int
  • 9 bytes when stored as text
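The size difference in the last two bullets can be checked with plain JDK calls. A minimal sketch (the class and method names are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SerializationSize {

    // Fixed-width binary encoding: a Java int always occupies 4 bytes
    static int asBinaryInt(int n) {
        return ByteBuffer.allocate(4).putInt(n).array().length;
    }

    // Text encoding: one UTF-8 byte per decimal digit
    static int asText(int n) {
        return String.valueOf(n).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(asBinaryInt(108125150)); // 4
        System.out.println(asText(108125150));      // 9
    }
}
```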
Implications of Data Serialization
• Affects performance and storage space
• Chosen method may limit portability
  • java.io.Serializable is Java-specific
  • Writables are Hadoop-specific
• May also limit backwards compatibility
  • Often depends on specific version of class
• Avro was developed to address these challenges
What is Apache Avro?
• Avro is an open source data serialization framework
  • Widely supported throughout the Hadoop ecosystem
  • Offers compatibility without sacrificing performance
• Data is serialized according to a schema you define
  • Read and write from Java, C, C++, C#, Python, PHP, etc.
  • Optimized binary encoding for efficient storage
  • Defines rules for schema evolution
Avro Schemas
• Avro schemas define the structure of your data
  • Similar to a CREATE TABLE statement in SQL, but more flexible
  • Defined using JSON syntax

id     | name  | title       | bonus
108424 | Alice | Salesperson | 2500
101837 | Bob   | Manager     | 3000
107812 | Chuck | President   | 9000
105476 | Dan   | Accountant  | 3000

(The header row is metadata; the remaining rows are data.)
Simple Types in Avro Schemas
• These are among the simple (scalar) types in Avro

Name    | Description
null    | An absence of a value
boolean | A binary value
int     | 32-bit signed integer
long    | 64-bit signed integer
float   | Single-precision floating point value
double  | Double-precision floating point value
string  | Sequence of Unicode characters
Complex Types in Avro Schemas
• These are the complex types in Avro

Name   | Description
record | A user-defined type composed of one or more named fields
enum   | A specified set of values
array  | Zero or more values of the same type
map    | Set of key-value pairs; keys are strings, values are of the specified type
union  | Exactly one value matching a specified set of types
fixed  | A fixed number of 8-bit unsigned bytes
Schema Example
• SQL CREATE TABLE statement

CREATE TABLE employees
  (id INT,
   name VARCHAR(30),
   title VARCHAR(20),
   bonus INT);
Schema Example (Continued)
• Equivalent Avro schema

{"namespace": "com.loudacre.data",
 "type": "record",
 "name": "Employee",
 "fields": [
   {"name": "id", "type": "int"},
   {"name": "name", "type": "string"},
   {"name": "title", "type": "string"},
   {"name": "bonus", "type": "int"}
 ]}
Mapping Avro Schema to Java Object
• Approaches for mapping a Java object to a schema
  • Generic: Write code to map each field manually
  • Reflect: Generate a schema from an existing class
  • Specific: Generate a Java class from your schema
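As a rough, JDK-only illustration of the reflect approach, the snippet below walks a class's declared fields the way a schema generator would (Avro itself does this via ReflectData; the Employee class and describe method here are stand-ins, not Avro API):

```java
import java.lang.reflect.Field;

public class ReflectSketch {

    // A plain class like one you might hand to a reflection-based schema generator
    static class Employee {
        int id;
        String name;
        String title;
        int bonus;
    }

    // Enumerate field names and types, as schema generation from a class would
    static String describe(Class<?> c) {
        StringBuilder sb = new StringBuilder();
        for (Field f : c.getDeclaredFields()) {
            if (f.isSynthetic()) continue; // skip compiler-generated fields
            sb.append(f.getName()).append(':')
              .append(f.getType().getSimpleName()).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Prints each field as name:type, e.g. id:int and name:String
        System.out.println(describe(Employee.class));
    }
}
```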
Considerations for File Formats
• Hadoop and its ecosystem support many file formats
  • May ingest in one format and convert to another
• Format selection involves several considerations
  • Ingest pattern
  • Tool compatibility
  • Expected lifetime
  • Storage and performance requirements
Data Compression
• Each file format may also support compression
  • Reduces the amount of disk space required to store data
  • Tradeoff between time and space
• Can greatly improve performance
  • Many Hadoop jobs are I/O-bound
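The space savings are easy to demonstrate with GZIP from the JDK, standing in here for the codecs typically used in Hadoop (such as gzip or Snappy); the sample log line is invented:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressionDemo {

    // Compress a byte array with GZIP and return the compressed size in bytes
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        // Log-like data is highly repetitive, so it compresses very well
        String line = "127.0.0.1 - GET /index.html 200\n";
        byte[] logs = line.repeat(1000).getBytes(StandardCharsets.UTF_8);
        System.out.println("raw: " + logs.length
                + " bytes, compressed: " + gzipSize(logs) + " bytes");
    }
}
```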
Data Partitioning
• Refers to organizing data according to access patterns
  • Improves performance by limiting input
• Common partitioning schemes
  • Customers: partition by state, province, or region
  • Events: separate by year, month, and day
Partitioning Example
• Imagine that you store all Web server log files in HDFS
  • Marketing runs monthly jobs for search engine optimization
  • Security runs daily jobs to identify attempted exploits

[Diagram: a 2014 directory tree with March, April, and May; April expands into
daily directories 01 through 13. The monthly job reads the entire April
subtree; the daily job reads a single day's directory.]
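A date-partitioned layout like the one above is just a directory naming convention. A small sketch, assuming a hypothetical /logs prefix and Hive-style key=value directory names (neither is from the slides):

```java
import java.time.LocalDate;

public class PartitionPath {

    // Build a Hive-style partition directory path for a given date
    static String pathFor(LocalDate d) {
        return String.format("/logs/year=%d/month=%02d/day=%02d",
                d.getYear(), d.getMonthValue(), d.getDayOfMonth());
    }

    public static void main(String[] args) {
        // The daily job reads one leaf directory...
        System.out.println(pathFor(LocalDate.of(2014, 4, 7)));
        // ...while the monthly job reads the whole month subtree, e.g. /logs/year=2014/month=04
    }
}
```

Because each day lands in its own directory, a job limits its input simply by pointing at the right subtree, which is exactly how partitioning "improves performance by limiting input".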
Introduction to Designing and Building Big Data Applications
Register for training and certification at
http://university.cloudera.com
Use discount code Apps10 to save 10%
on new enrollments in Big Data
Applications classes delivered by
Cloudera until July 4, 2014*
• Enter questions in the Q&A panel
• Follow Cloudera University: @ClouderaU
• Follow the Developer learning path:
  http://university.cloudera.com/developers
• Learn about the enterprise data hub:
  http://tinyurl.com/edh-webinar
• Join the Cloudera user community:
  http://community.cloudera.com/
• Get Developer Certification:
  http://university.cloudera.com/certification
• Explore Developer resources for Hadoop:
  http://cloudera.com/content/dev-center/en/home.html
* Excludes classes sold or delivered by other partners