Cloudera Sessions - Clinic 3 - Advanced Steps - Fast-track Development for ETL and Data Integration
Fast-track Development for Big Data Integration
Why do you need ETL?
[Diagram: the end-to-end ETL lifecycle. Sources (databases, XML, flat files, application sources) feed Extract, then Integrate & Transform (ETL, ETL on grid, ELT, Hadoop, cloud, real time, replication), then Load into analytic targets. A Design & Develop cycle runs alongside: Understand, Cleanse, Prototype & Design, Define & Document, Test, Iterate, Report. The whole pipeline is managed: Monitor, Troubleshoot, Secure & Retain.]
Let’s suppose…
    • ABC Bank is rolling out a new service to provide daily stock
      recommendations based on customers' prior transaction history,
      propensity for risk, and stock popularity
    • Input data is
        • Market Data – Bloomberg daily stock price and volume for one year
        • Customer Transactions (i.e. trades) – Stock purchases over last 5 years
        • Twitter – Daily # of tweets for each stock symbol for one year
        • Web Logs – Daily # of stock views for each customer for one year

    • Output is
        • Customer Stock Recommendations – daily stock recommendations for each customer




If you did this on your own
    What would you need to build? What skills are needed?
[Slide collage of hand-written code: an Apex/Visualforce page calling the Google Maps geocoding REST API and parsing the JSON response in Java, a deeply nested SQL query aggregating item sales and purchases, and a Perl script parsing a document-format file, all under the callout "What if something changes?"]
Doing this on your own has challenges
    • Time-consuming
    • Requires specialized skills
    • Hard to maintain, difficult to change
    • No reuse




There are alternative approaches…

           Let’s see how this works with an Informatica demo




Challenges with traditional infrastructure
    • Cannot cost-effectively scale as data volumes grow
    • Not designed to support many new data types
    • Does not support rapid, agile development
    • Analysis is not flexible enough to enable rapid discovery

Maximize your return on big data
[Diagram: data sources (transactions/OLTP/OLAP, documents and email, social media and web logs, machine device and scientific data) pass through five stages (Access & Ingest, Parse & Prepare, Discover & Profile, Transform & Cleanse, Extract & Deliver) into operational systems (OLTP, MDM, ODS) and analytical systems (data warehouse, data marts), which feed reports and analytics. Everything is managed for security, performance, governance, and collaboration.]
If you did this on your own
     What would you need to build? What skills are needed?
[The same hand-written collage as before (Apex/JSON, Java, SQL, Perl), now joined by Hadoop-stack code: a Pig script grouping page views by industry, a Hive query joining page views against users and a breed list, and a Java MapReduce driver wiring up the mapper, input format, and input/output paths.]
Implement a proven path to innovation
Lower Big Data Project Costs
(helps self-fund big data projects)

Minimize Risk of New Technologies
(design once, deploy anywhere)

Innovate Faster With Big Data
(onboard, discover, operationalize)
Informatica + Cloudera: Lower Costs
[Diagram: optimize processing with low-cost commodity hardware. The Access & Ingest, Parse & Prepare, Discover & Profile, Transform & Cleanse, Extract & Deliver pipeline runs on a traditional grid, moving data from transactional, document/email, social media/web log, and machine/scientific sources into data marts and the EDW. Increase productivity up to 5X.]
Informatica + Cloudera: Minimize Risk
Quickly staff projects with trained data integration experts
Informatica + Cloudera: Minimize Risk




Design once and deploy anywhere: on-premise or in the cloud, on a traditional grid, or pushed down to an RDBMS or DW appliance
Informatica + Cloudera: Innovate Faster
• Onboard and analyze any type of data to gain big data insights (analytics & operational dashboards)
• Discover insights faster through rapid development and collaboration (mobile apps)
• Operationalize big data insights to generate new revenue streams (real-time alerts)
How does Informatica + Cloudera do this?




Maximize your return on big data
[Diagram, repeated from earlier: data sources (transactions/OLTP/OLAP, documents and email, social media and web logs, machine device and scientific data) pass through Access & Ingest, Parse & Prepare, Discover & Profile, Transform & Cleanse, and Extract & Deliver into operational systems (OLTP, MDM, ODS) and analytical systems (data warehouse, data marts), which feed reports and analytics, all managed for security, performance, governance, and collaboration.]
Data Ingestion and Extraction

[Diagram: four data-movement styles (batch, replication, streaming, archiving) deliver data from transactional, document/email, social media/web log, and machine/scientific sources into applications, the data warehouse, and data marts. A minimal batch-ingest sketch follows.]
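As an illustration of the batch style, here is a minimal sketch of landing a file in HDFS with the Hadoop FileSystem API; the NameNode URI and paths are assumptions, and in practice an integration tool (or utilities such as Flume or Sqoop) would drive this step rather than hand-written code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BatchIngest {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally picked up from core-site.xml
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);
            // Land one day's web log extract in HDFS for downstream processing
            fs.copyFromLocalFile(new Path("/data/weblogs/2012-06-01.log"),
                                 new Path("/user/etl/weblogs/2012-06-01.log"));
            fs.close();
        }
    }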
Integrate All Data: High Performance Data Access
Messaging & Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
Packaged Applications: JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel
Relational & Flat Files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC
SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
Mainframe & Midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, binary flat files, tape formats…
Industry Standards: FIX, SWIFT, EDI–X12, EDI-Fact, HL7, HIPAA, NACHA, AST, RosettaNet, Cargo IMP, MVR
Unstructured Data & Files: Word, Excel, PDF, StarOffice, WordPerfect, email (POP, IMAP), HTTP, flat files, ASCII reports, HTML, RPG, ANSI, LDAP
XML Standards: XML, LegalXML, IFX, cXML, ebXML, HL7 v3.0, ACORD (AL3, XML)
MPP Appliances: Teradata, AsterData, EMC/Greenplum, Vertica
Social Media: Facebook, Twitter, LinkedIn, Kapow, Datasift
Informatica ETL Execution on Hadoop
1. The mapping is translated and optimized into Hive HQL and user-defined functions (UDFs)
2. The optimized HQL is translated to MapReduce
3. The MapReduce jobs and UDFs are executed on Cloudera

[Diagram: the Informatica data transformation engine runs on each data node; Hive HQL drives the UDFs and MapReduce jobs.]

Generated Hive HQL, with a custom Informatica transformation invoked via TRANSFORM:

    SELECT
        T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
        customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
    FROM (
        SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
        FROM lineitem
        GROUP BY L_ORDERKEY
        ) T1
    JOIN orders ON (customer.C_ORDERKEY = orders.O_ORDERKEY)
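To make step 1 concrete, here is a minimal sketch of a Hive UDF using the classic org.apache.hadoop.hive.ql.exec.UDF API; the class name and logic are illustrative only, not the functions Informatica actually generates.

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public final class NormalizeSymbol extends UDF {
        // Trim and upper-case a stock ticker; Hive calls evaluate() once per row
        public Text evaluate(Text symbol) {
            if (symbol == null) return null;
            return new Text(symbol.toString().trim().toUpperCase());
        }
    }

Once packaged in a JAR, a query can register and call it with ADD JAR plus CREATE TEMPORARY FUNCTION normalize_symbol AS 'NormalizeSymbol' (the function and class names here are assumed).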
Data Profiling & Discovery on Hadoop
• Value and pattern frequency to isolate data quality issues
• Discover data domains and relationships, including PII data

A small profiling sketch follows.
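As a rough illustration of value-frequency profiling (not Informatica's engine, which runs this at scale as MapReduce on the cluster), a single-column profile can be sketched as below; the file name and delimiter are assumptions.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    public class ValueFrequencyProfiler {
        public static void main(String[] args) throws Exception {
            Map<String, Integer> freq = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader("trades.csv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String symbol = line.split(",")[0]; // profile the first column
                    freq.merge(symbol, 1, Integer::sum);
                }
            }
            // Rare values and unexpected patterns usually flag quality issues
            freq.forEach((value, count) -> System.out.println(value + "\t" + count));
        }
    }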
Informatica + Cloudera Demo




Informatica + Cloudera Demo Scenario
      • ABC Bank is rolling out a new service to provide daily stock
        recommendations based on customers' prior transaction history,
        propensity for risk, and stock popularity
     • Input data is
         • Market Data – Bloomberg daily stock price and volume for 2012
         • Customer Transactions (i.e. trades) – Stock purchases over last 5 years
         • Twitter – Daily # of tweets for each stock symbol for 2012
         • Web Logs – Daily # of stock views for each customer for 2012
     • Output is
          • Customer Stock Recommendations – daily stock recommendations for each
            customer, available in a relational data warehouse (a hypothetical
            sketch of this step follows)
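Purely to make the scenario concrete, here is a hypothetical sketch of how the three signals might be blended per customer; the record fields, weights, and scoring are invented for illustration and are not the demo's actual logic.

    import java.util.Comparator;
    import java.util.List;

    public class RecommendStocks {
        // One day's signals for a (customer, symbol) pair; fields are invented
        record Signal(String customer, String symbol, double priceChange,
                      long tweets, long views, long pastTrades) {}

        // Rank a customer's symbols by a toy blend of popularity and familiarity
        static List<String> topPicks(String customer, List<Signal> signals, int k) {
            return signals.stream()
                .filter(s -> s.customer().equals(customer))
                .sorted(Comparator.comparingDouble((Signal s) ->
                        s.priceChange() + 0.001 * s.tweets()
                        + 0.01 * s.views() + 0.1 * s.pastTrades()).reversed())
                .limit(k)
                .map(Signal::symbol)
                .toList();
        }
    }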


[Architecture diagram: Informatica Services sit between the data sources (transactions/OLTP/OLAP, documents/email, social media/web logs, machine device/scientific) and the Hadoop cluster, keep metadata in an RDBMS repository, and are driven by INFA clients. On Hadoop they provide:
• Data integration on Hadoop
• Data quality and profiling on Hadoop
• Data parsing on Hadoop
• NLP & entity extraction on Hadoop
• Replication to Hadoop
• Archiving on Hadoop
The services connect to HDFS and to Hive; the cluster comprises a NameNode, a JobTracker, and DataNodes, each running MapReduce over HDFS. A sketch of the Hive connection follows.]



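To illustrate the "connect to Hive" path, here is a minimal sketch using the Hive JDBC driver against HiveServer2; the host, port, and table are assumptions, and the Informatica integration uses its own connectors rather than a hand-written client like this.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn =
                     DriverManager.getConnection("jdbc:hive2://namenode:10000/default");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT symbol, COUNT(*) FROM trades GROUP BY symbol")) {
                // Stream aggregated results back from the cluster
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }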
Next Steps




Archive • Profile • Parse • Transform • Cleanse • Match

1. Lower costs
   • Optimized end-to-end data management performance on Hadoop
   • Rich pre-built library of ETL transforms, data quality rules, complex file parsing, and data profiling on Hadoop

2. Increase productivity
   • Up to 5X productivity gains with no-code visual development and management

3. Accelerate adoption
   • 500+ partners and 100,000+ trained Informatica developers
   • 360+ partners and 15,000+ trained on Cloudera annually, on 6 continents
Discover • Archive • Profile • Parse • Transform • Cleanse • Match

[Diagram: the data governance cycle of Define, Apply, and Measure and Monitor.]
What is the plan forward?
     • Tomorrow
         • Identify a business opportunity where data can have a significant impact
         • Identify the skills you need to build a team with big data competencies
     • 3 months
         • Identify and prioritize the data you need to improve the business (both internal and external)
         • Determine what data to store in Cloudera to lower and control cost
         • Put a business plan together to optimize your DW/BI infrastructure
         • Execute a quick-win big data project with demonstrable ROI
     • 1 year
         • Extend data governance to include more data, and more types of data, that impact the business
         • Consider a shared-services model to promote best practices and further lower infrastructure and labor costs
Thank You!
     cloudera.com/clouderasessions

