Cloudera Sessions - Clinic 3 - Advanced Steps - Fast-track Development for ETL and Data Integration
Fast-track Development for Big Data Integration
Why do you need ETL?
[Diagram: the end-to-end ETL lifecycle. Sources (databases, XML, flat files, application sources) feed Extract, then Integrate & Transform (ETL, ETL on grid, ELT, Hadoop, cloud, real time, replication), then Load into analytic targets. A Design & Develop cycle runs alongside: Understand, Cleanse, Prototype & Design, Define & Document, Test, Iterate, Report. The whole pipeline is managed: Monitor, Troubleshoot, Secure & Retain.]
Let’s suppose…
    • ABC Bank is rolling out a new service to provide daily stock
      recommendations based on customers' prior transaction history,
      propensity for risk, and stock popularity
    • Input data is
        • Market Data – Bloomberg daily stock price and volume for one year
        • Customer Transactions (i.e. trades) – Stock purchases over last 5 years
        • Twitter – Daily # of tweets for each stock symbol for one year
        • Web Logs – Daily # of stock views for each customer for one year

    • Output is
        • Customer Stock Recommendations – daily stock recommendations for each customer




If you did this on your own
    What would you need to build? What skills are needed?
[Slide collage of hand-written code: an Apex/Visualforce page calling the Google Maps geocoding REST API and parsing the JSON response in Java, a deeply nested SQL query aggregating item sales and purchases, and a Perl script parsing a document-format file, all under the callout "What if something changes?"]
Doing this on your own has challenges
    • Time-consuming
    • Requires specialized skills
    • Hard to maintain, difficult to change
    • No reuse




There are alternative approaches…

           Let’s see how this works with an Informatica demo




Challenges with traditional infrastructure
    • Cannot cost-effectively scale as data volumes grow
    • Not designed to support many new data types
    • Does not support rapid, agile development
    • Analysis is not flexible enough to enable rapid discovery

Maximize your return on big data
[Diagram: data sources (transactions/OLTP/OLAP, documents and email, social media and web logs, machine device and scientific data) pass through five stages (Access & Ingest, Parse & Prepare, Discover & Profile, Transform & Cleanse, Extract & Deliver) into operational systems (OLTP, MDM, ODS) and analytical systems (data warehouse, data marts), which feed reports and analytics. Everything is managed for security, performance, governance, and collaboration.]
If you did this on your own
     What would you need to build? What skills are needed?
[The same hand-written collage as before (Apex/JSON, Java, SQL, Perl), now joined by Hadoop-stack code: a Pig script grouping page views by industry, a Hive query joining page views against users and a breed list, and a Java MapReduce driver wiring up the mapper, input format, and input/output paths.]
Implement a proven path to innovation
Lower Big Data Project Costs
(helps self-fund big data projects)

Minimize Risk of New Technologies
(design once, deploy anywhere)

Innovate Faster With Big Data
(onboard, discover, operationalize)
Informatica + Cloudera: Lower Costs
[Diagram: optimize processing with low-cost commodity hardware. The Access & Ingest, Parse & Prepare, Discover & Profile, Transform & Cleanse, Extract & Deliver pipeline runs on a traditional grid, moving data from transactional, document/email, social media/web log, and machine/scientific sources into data marts and the EDW. Increase productivity up to 5X.]
Informatica + Cloudera: Minimize Risk
Quickly staff projects with trained data integration experts
Informatica + Cloudera: Minimize Risk




Design once and deploy anywhere: on-premise or in the cloud, on a traditional grid, or pushed down to an RDBMS or DW appliance
Informatica + Cloudera: Innovate Faster
• Onboard and analyze any type of data to gain big data insights (analytics & operational dashboards)
• Discover insights faster through rapid development and collaboration (mobile apps)
• Operationalize big data insights to generate new revenue streams (real-time alerts)
How does Informatica + Cloudera do this?




Maximize your return on big data
[Diagram, repeated from earlier: data sources (transactions/OLTP/OLAP, documents and email, social media and web logs, machine device and scientific data) pass through Access & Ingest, Parse & Prepare, Discover & Profile, Transform & Cleanse, and Extract & Deliver into operational systems (OLTP, MDM, ODS) and analytical systems (data warehouse, data marts), which feed reports and analytics, all managed for security, performance, governance, and collaboration.]
Data Ingestion and Extraction

[Diagram: four data-movement styles (batch, replication, streaming, archiving) deliver data from transactional, document/email, social media/web log, and machine/scientific sources into applications, the data warehouse, and data marts. A minimal batch-ingest sketch follows.]
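As an illustration of the batch style, here is a minimal sketch of landing a file in HDFS with the Hadoop FileSystem API; the NameNode URI and paths are assumptions, and in practice an integration tool (or utilities such as Flume or Sqoop) would drive this step rather than hand-written code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BatchIngest {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally picked up from core-site.xml
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);
            // Land one day's web log extract in HDFS for downstream processing
            fs.copyFromLocalFile(new Path("/data/weblogs/2012-06-01.log"),
                                 new Path("/user/etl/weblogs/2012-06-01.log"));
            fs.close();
        }
    }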
Integrate All Data: High Performance Data Access
Messaging & Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
Packaged Applications: JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel
Relational & Flat Files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC
SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
Mainframe & Midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, binary flat files, tape formats…
Industry Standards: FIX, SWIFT, EDI–X12, EDI-Fact, HL7, HIPAA, NACHA, AST, RosettaNet, Cargo IMP, MVR
Unstructured Data & Files: Word, Excel, PDF, StarOffice, WordPerfect, email (POP, IMAP), HTTP, flat files, ASCII reports, HTML, RPG, ANSI, LDAP
XML Standards: XML, LegalXML, IFX, cXML, ebXML, HL7 v3.0, ACORD (AL3, XML)
MPP Appliances: Teradata, AsterData, EMC/Greenplum, Vertica
Social Media: Facebook, Twitter, LinkedIn, Kapow, Datasift
Informatica ETL Execution on Hadoop
1. The mapping is translated and optimized into Hive HQL and user-defined functions (UDFs)
2. The optimized HQL is translated to MapReduce
3. The MapReduce jobs and UDFs are executed on Cloudera

[Diagram: the Informatica data transformation engine runs on each data node; Hive HQL drives the UDFs and MapReduce jobs.]

Generated Hive HQL, with a custom Informatica transformation invoked via TRANSFORM:

    SELECT
        T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
        customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
    FROM (
        SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
        FROM lineitem
        GROUP BY L_ORDERKEY
        ) T1
    JOIN orders ON (customer.C_ORDERKEY = orders.O_ORDERKEY)
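To make step 1 concrete, here is a minimal sketch of a Hive UDF using the classic org.apache.hadoop.hive.ql.exec.UDF API; the class name and logic are illustrative only, not the functions Informatica actually generates.

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public final class NormalizeSymbol extends UDF {
        // Trim and upper-case a stock ticker; Hive calls evaluate() once per row
        public Text evaluate(Text symbol) {
            if (symbol == null) return null;
            return new Text(symbol.toString().trim().toUpperCase());
        }
    }

Once packaged in a JAR, a query can register and call it with ADD JAR plus CREATE TEMPORARY FUNCTION normalize_symbol AS 'NormalizeSymbol' (the function and class names here are assumed).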
Data Profiling & Discovery on Hadoop
• Value and pattern frequency to isolate data quality issues
• Discover data domains and relationships, including PII data

A small profiling sketch follows.
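As a rough illustration of value-frequency profiling (not Informatica's engine, which runs this at scale as MapReduce on the cluster), a single-column profile can be sketched as below; the file name and delimiter are assumptions.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    public class ValueFrequencyProfiler {
        public static void main(String[] args) throws Exception {
            Map<String, Integer> freq = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader("trades.csv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String symbol = line.split(",")[0]; // profile the first column
                    freq.merge(symbol, 1, Integer::sum);
                }
            }
            // Rare values and unexpected patterns usually flag quality issues
            freq.forEach((value, count) -> System.out.println(value + "\t" + count));
        }
    }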
Informatica + Cloudera Demo




Informatica + Cloudera Demo Scenario
      • ABC Bank is rolling out a new service to provide daily stock
        recommendations based on customers' prior transaction history,
        propensity for risk, and stock popularity
     • Input data is
         • Market Data – Bloomberg daily stock price and volume for 2012
         • Customer Transactions (i.e. trades) – Stock purchases over last 5 years
         • Twitter – Daily # of tweets for each stock symbol for 2012
         • Web Logs – Daily # of stock views for each customer for 2012
     • Output is
          • Customer Stock Recommendations – daily stock recommendations for each
            customer, available in a relational data warehouse (a hypothetical
            sketch of this step follows)
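Purely to make the scenario concrete, here is a hypothetical sketch of how the three signals might be blended per customer; the record fields, weights, and scoring are invented for illustration and are not the demo's actual logic.

    import java.util.Comparator;
    import java.util.List;

    public class RecommendStocks {
        // One day's signals for a (customer, symbol) pair; fields are invented
        record Signal(String customer, String symbol, double priceChange,
                      long tweets, long views, long pastTrades) {}

        // Rank a customer's symbols by a toy blend of popularity and familiarity
        static List<String> topPicks(String customer, List<Signal> signals, int k) {
            return signals.stream()
                .filter(s -> s.customer().equals(customer))
                .sorted(Comparator.comparingDouble((Signal s) ->
                        s.priceChange() + 0.001 * s.tweets()
                        + 0.01 * s.views() + 0.1 * s.pastTrades()).reversed())
                .limit(k)
                .map(Signal::symbol)
                .toList();
        }
    }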


[Architecture diagram: Informatica Services sit between the data sources (transactions/OLTP/OLAP, documents/email, social media/web logs, machine device/scientific) and the Hadoop cluster, keep metadata in an RDBMS repository, and are driven by INFA clients. On Hadoop they provide:
• Data integration on Hadoop
• Data quality and profiling on Hadoop
• Data parsing on Hadoop
• NLP & entity extraction on Hadoop
• Replication to Hadoop
• Archiving on Hadoop
The services connect to HDFS and to Hive; the cluster comprises a NameNode, a JobTracker, and DataNodes, each running MapReduce over HDFS. A sketch of the Hive connection follows.]



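To illustrate the "connect to Hive" path, here is a minimal sketch using the Hive JDBC driver against HiveServer2; the host, port, and table are assumptions, and the Informatica integration uses its own connectors rather than a hand-written client like this.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn =
                     DriverManager.getConnection("jdbc:hive2://namenode:10000/default");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT symbol, COUNT(*) FROM trades GROUP BY symbol")) {
                // Stream aggregated results back from the cluster
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }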
Next Steps




Archive • Profile • Parse • Transform • Cleanse • Match

1. Lower costs
   • Optimized end-to-end data management performance on Hadoop
   • Rich pre-built library of ETL transforms, data quality rules, complex file parsing, and data profiling on Hadoop

2. Increase productivity
   • Up to 5X productivity gains with no-code visual development and management

3. Accelerate adoption
   • 500+ partners and 100,000+ trained Informatica developers
   • 360+ partners and 15,000+ trained on Cloudera annually, on 6 continents
Discover • Archive • Profile • Parse • Transform • Cleanse • Match

[Diagram: the data governance cycle of Define, Apply, and Measure and Monitor.]
What is the plan forward?
     • Tomorrow
         • Identify a business opportunity where data can have a significant impact
         • Identify the skills you need to build a team with big data competencies
     • 3 months
         • Identify and prioritize the data you need to improve the business (both internal and external)
         • Determine what data to store in Cloudera to lower and control cost
         • Put a business plan together to optimize your DW/BI infrastructure
         • Execute a quick-win big data project with demonstrable ROI
     • 1 year
         • Extend data governance to include more data, and more types of data, that impact the business
         • Consider a shared-services model to promote best practices and further lower infrastructure and labor costs
Thank You!
     cloudera.com/clouderasessions

