Introduction to Designing and Building Big Data Applications
Tom Wheeler | Senior Curriculum Developer
April 2014
Agenda
• Cloudera's Learning Path for Developers
• Target Audience and Prerequisites
• Course Outline
• Short Presentation Based on Actual Course Material
• Question and Answer Session
Learning Path: Developers
Create Powerful New Data Processing Tools

Developer Training
• Learn to code and write MapReduce programs for production
• Master advanced API topics required for real-world data analysis

HBase Training
• Design schemas to minimize latency on massive data sets
• Scale hundreds of thousands of operations per second

Intro to Data Science
• Implement recommenders and data experiments
• Draw actionable insights from analysis of disparate data

Big Data Applications
• Build converged applications using multiple processing engines
• Develop enterprise solutions using components across the EDH

Aaron T. Myers, Software Engineer
Hadoop Professionals: Build or Buy?
Professional Certification Decreases Hiring Risk

• An engineer with Hadoop skills commands a minimum salary premium of 25%
• Hadoop developers are now the top paid in tech, starting at $115K
• Compensation for a very senior Data Scientist opens at $300K

Sources: Business Insider, “10 Tech Skills That Will Instantly Net You A $100,000+ Salary,” 11 August 2012.
Business Insider, “30 Tech Skills That Will Instantly Net You A $100,000+ Salary,” 21 February 2013.
GigaOm, “Big Data Skills Bring Big Dough,” 17 February 2012.
Why Cloudera Training?
Aligned to Best Practices and the Pace of Change

1. Broadest Range of Courses: Developer, Admin, Analyst, HBase, Data Science
2. Most Experienced Instructors: More than 20,000 students trained since 2009
3. Leader in Certification: Over 8,000 accredited Cloudera professionals
4. Trusted Source for Training: 100,000+ people have attended online courses
5. State of the Art Curriculum: Courses updated as Hadoop evolves
6. Widest Geographic Coverage: Most classes offered: 50 cities worldwide plus online
7. Most Relevant Platform & Community: CDH deployed more than all other distributions combined
8. Depth of Training Material: Hands-on labs and VMs support live instruction
9. Ongoing Learning: Video tutorials and e-learning complement training
10. Commitment to Big Data Education: University partnerships to teach Hadoop in the classroom
Designing and Building
Big Data Applications
About the Course
Target Audience
• Intended for people who write code, such as
  • Software Engineers
  • Data Engineers
  • ETL Developers
Course Prerequisites
• Successful completion of our Developer course, or equivalent practical experience
• Intermediate-level Java skills
• Basic familiarity with Linux
• Knowledge of SQL or HiveQL is also helpful
Example of Required Java Skill Level
• Do you understand the following code? Could you write something similar?

package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Example extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    // Emit each word in the input line with a count of one
    for (String word : value.toString().split("\\s+")) {
      ctx.write(new Text(word), new IntWritable(1));
    }
  }
}
Example of Required Linux Skill Level
• Are you comfortable editing text files on a Linux system?
• Are you familiar with the following commands?
$ mkdir -p /tmp/incoming/web_logs
$ cd /var/log/web
$ mv *.log /tmp/incoming/web_logs
Course Objectives
• During this course, you will learn to
  • Determine which Hadoop-related tools are appropriate for specific tasks
  • Understand how file formats, serialization, and data compression affect application compatibility and performance
  • Design and evolve schemas in Apache Avro
  • Create, populate, and access data sets with the Kite SDK
  • Integrate with external systems using Apache Sqoop and Apache Flume
  • Integrate Apache Flume with existing applications and develop custom components to extend Flume’s capabilities
Course Objectives (continued)
• Create, package, and deploy Oozie jobs to manage processing workflows
• Develop Java-based data processing pipelines with Apache Crunch
• Implement user-defined functions for use in Apache Hive and Impala
• Index both static and streaming data sets with Cloudera Search
• Use Hue to build a Web-based interface for Search queries
• Integrate results from Impala and Cloudera Search into your applications
Scenario for Hands-On Exercises
• Frequent hands-on exercises
• Based on a hypothetical but realistic scenario: Loudacre Mobile, a fictional telecom company
• Each exercise builds toward a complete working application
Tools Used in Hands-On Exercises
• Ingest and Data Management: HDFS, Sqoop, Flume, Kite SDK / Morphlines
• Interactive Queries: HCatalog, Impala, Search
• Batch Processing: MapReduce, Crunch, Hive
• Data Format: Avro
Data Sources Used in Hands-On Exercises
• Source systems: RDBMS, telecom switches, CRM system, point-of-sale terminals, web servers
• Data flowing into the Enterprise Data Hub:
  • Equipment records
  • Customer records
  • Call detail records (fixed-width)
  • Phone activations (XML)
  • Static documents (HTML)
  • Log files (text)
  • Device status (CSV and TSV)
  • Chat transcripts (JSON)
Development Environment
• Exercises use a real-world development environment
  • IDE (Eclipse)
  • Unit testing library (JUnit)
  • Build and configuration management tool (Maven)
Course Outline
• Introduction
• Application Architecture *
• Designing and Using Data Sets *
• Using the Kite SDK Data Module *
• Importing Relational Data with Apache Sqoop *
• Capturing Data with Apache Flume *

* This chapter contains one or more hands-on exercises
Course Outline (continued)
• Developing Custom Flume Components *
• Managing Workflows with Apache Oozie *
• Processing Data Pipelines with Apache Crunch *
• Working with Tables in Apache Hive *
• Developing User-Defined Functions *
• Executing Interactive Queries with Impala *
Course Outline (continued)
• Understanding Cloudera Search
• Indexing Data with Cloudera Search *
• Presenting Results to Users *
• Conclusion
Course Excerpt
• Based on chapter 3: Designing and Using Data Sets
What is Data Serialization?
• Serialization represents data as a series of bytes
  • Allows us to store and transmit data
• There are many ways of serializing data
  • How do you serialize the number 108125150?
  • 4 bytes when stored as a Java int
  • 9 bytes when stored as text
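The size difference in the last two bullets can be checked with plain JDK calls. A minimal sketch (the class and method names are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SerializationSize {

    // Fixed-width binary encoding: a Java int always occupies 4 bytes
    static int asBinaryInt(int n) {
        return ByteBuffer.allocate(4).putInt(n).array().length;
    }

    // Text encoding: one UTF-8 byte per decimal digit
    static int asText(int n) {
        return String.valueOf(n).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(asBinaryInt(108125150)); // 4
        System.out.println(asText(108125150));      // 9
    }
}
```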
Implications of Data Serialization
• Affects performance and storage space
• Chosen method may limit portability
  • java.io.Serializable is Java-specific
  • Writables are Hadoop-specific
• May also limit backwards compatibility
  • Often depends on specific version of class
• Avro was developed to address these challenges
What is Apache Avro?
• Avro is an open source data serialization framework
  • Widely supported throughout the Hadoop ecosystem
  • Offers compatibility without sacrificing performance
• Data is serialized according to a schema you define
  • Read and write from Java, C, C++, C#, Python, PHP, etc.
  • Optimized binary encoding for efficient storage
  • Defines rules for schema evolution
Avro Schemas
• Avro schemas define the structure of your data
  • Similar to a CREATE TABLE statement in SQL, but more flexible
  • Defined using JSON syntax

id     | name  | title       | bonus
108424 | Alice | Salesperson | 2500
101837 | Bob   | Manager     | 3000
107812 | Chuck | President   | 9000
105476 | Dan   | Accountant  | 3000

(The header row is metadata; the remaining rows are data.)
Simple Types in Avro Schemas
• These are among the simple (scalar) types in Avro

Name    | Description
null    | An absence of a value
boolean | A binary value
int     | 32-bit signed integer
long    | 64-bit signed integer
float   | Single-precision floating point value
double  | Double-precision floating point value
string  | Sequence of Unicode characters
Complex Types in Avro Schemas
• These are the complex types in Avro

Name   | Description
record | A user-defined type composed of one or more named fields
enum   | A specified set of values
array  | Zero or more values of the same type
map    | Set of key-value pairs; keys are strings, values are of the specified type
union  | Exactly one value matching a specified set of types
fixed  | A fixed number of 8-bit unsigned bytes
Schema Example
• SQL CREATE TABLE statement

CREATE TABLE employees
  (id INT,
   name VARCHAR(30),
   title VARCHAR(20),
   bonus INT);
Schema Example (Continued)
• Equivalent Avro schema

{"namespace": "com.loudacre.data",
 "type": "record",
 "name": "Employee",
 "fields": [
   {"name": "id", "type": "int"},
   {"name": "name", "type": "string"},
   {"name": "title", "type": "string"},
   {"name": "bonus", "type": "int"}
 ]}
Mapping Avro Schema to Java Object
• Approaches for mapping a Java object to a schema
  • Generic: Write code to map each field manually
  • Reflect: Generate a schema from an existing class
  • Specific: Generate a Java class from your schema
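As a rough, JDK-only illustration of the reflect approach, the snippet below walks a class's declared fields the way a schema generator would (Avro itself does this via ReflectData; the Employee class and describe method here are stand-ins, not Avro API):

```java
import java.lang.reflect.Field;

public class ReflectSketch {

    // A plain class like one you might hand to a reflection-based schema generator
    static class Employee {
        int id;
        String name;
        String title;
        int bonus;
    }

    // Enumerate field names and types, as schema generation from a class would
    static String describe(Class<?> c) {
        StringBuilder sb = new StringBuilder();
        for (Field f : c.getDeclaredFields()) {
            if (f.isSynthetic()) continue; // skip compiler-generated fields
            sb.append(f.getName()).append(':')
              .append(f.getType().getSimpleName()).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Prints each field as name:type, e.g. id:int and name:String
        System.out.println(describe(Employee.class));
    }
}
```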
Considerations for File Formats
• Hadoop and its ecosystem support many file formats
  • May ingest in one format and convert to another
• Format selection involves several considerations
  • Ingest pattern
  • Tool compatibility
  • Expected lifetime
  • Storage and performance requirements
Data Compression
• Each file format may also support compression
  • Reduces the amount of disk space required to store data
  • Tradeoff between time and space
• Can greatly improve performance
  • Many Hadoop jobs are I/O-bound
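The space savings are easy to demonstrate with GZIP from the JDK, standing in here for the codecs typically used in Hadoop (such as gzip or Snappy); the sample log line is invented:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressionDemo {

    // Compress a byte array with GZIP and return the compressed size in bytes
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        // Log-like data is highly repetitive, so it compresses very well
        String line = "127.0.0.1 - GET /index.html 200\n";
        byte[] logs = line.repeat(1000).getBytes(StandardCharsets.UTF_8);
        System.out.println("raw: " + logs.length
                + " bytes, compressed: " + gzipSize(logs) + " bytes");
    }
}
```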
Data Partitioning
• Refers to organizing data according to access patterns
  • Improves performance by limiting input
• Common partitioning schemes
  • Customers: partition by state, province, or region
  • Events: separate by year, month, and day
Partitioning Example
• Imagine that you store all Web server log files in HDFS
  • Marketing runs monthly jobs for search engine optimization
  • Security runs daily jobs to identify attempted exploits

[Diagram: a 2014 directory tree with March, April, and May; April expands into
daily directories 01 through 13. The monthly job reads the entire April
subtree; the daily job reads a single day's directory.]
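A date-partitioned layout like the one above is just a directory naming convention. A small sketch, assuming a hypothetical /logs prefix and Hive-style key=value directory names (neither is from the slides):

```java
import java.time.LocalDate;

public class PartitionPath {

    // Build a Hive-style partition directory path for a given date
    static String pathFor(LocalDate d) {
        return String.format("/logs/year=%d/month=%02d/day=%02d",
                d.getYear(), d.getMonthValue(), d.getDayOfMonth());
    }

    public static void main(String[] args) {
        // The daily job reads one leaf directory...
        System.out.println(pathFor(LocalDate.of(2014, 4, 7)));
        // ...while the monthly job reads the whole month subtree, e.g. /logs/year=2014/month=04
    }
}
```

Because each day lands in its own directory, a job limits its input simply by pointing at the right subtree, which is exactly how partitioning "improves performance by limiting input".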
Introduction to Designing and Building Big Data Applications
Register for training and certification at
http://university.cloudera.com
Use discount code Apps10 to save 10%
on new enrollments in Big Data
Applications classes delivered by
Cloudera until July 4, 2014*
• Enter questions in the Q&A panel
• Follow Cloudera University: @ClouderaU
• Follow the Developer learning path:
  http://university.cloudera.com/developers
• Learn about the enterprise data hub:
  http://tinyurl.com/edh-webinar
• Join the Cloudera user community:
  http://community.cloudera.com/
• Get Developer Certification:
  http://university.cloudera.com/certification
• Explore Developer resources for Hadoop:
  http://cloudera.com/content/dev-center/en/home.html
* Excludes classes sold or delivered by other partners