A New Generation of Data Transfer
    Tools for Hadoop: Sqoop 2
  Bilung Lee (blee at cloudera dot com)
  Kathleen Ting (kathleen at cloudera dot com)



                Hadoop Summit 2012. 6/13/12 Apache Sqoop
               Copyright 2012 The Apache Software Foundation
Who Are We?
• Bilung Lee
  – Apache Sqoop Committer
  – Software Engineer, Cloudera


• Kathleen Ting
  – Apache Sqoop Committer
  – Support Manager, Cloudera


What is Sqoop?
• Bulk data transfer tool
    – Import/Export from/to relational databases,
      enterprise data warehouses, and NoSQL systems
    – Populate tables in HDFS, Hive, and HBase
    – Integrate with Oozie as an action
    – Support plugins via a connector-based architecture
    – May ‘09: First version (HADOOP-5815)
    – March ‘10: Moved to GitHub
    – August ‘11: Moved to Apache
    – April ‘12: Apache Top Level Project

Sqoop 1 Architecture
(Diagram: a command drives the Sqoop client, which launches Map Tasks in Hadoop to transfer data between external systems, i.e. enterprise data warehouses, document-based systems, and relational databases, and HDFS/HBase/Hive.)
Sqoop 1 Challenges
• Cryptic, contextual command line arguments
• Tight coupling between data transfer and
  output format
• Security concerns with openly shared
  credentials
• Not easy to manage installation/configuration
• Connectors are forced to follow the JDBC model

Sqoop 2 Architecture




Sqoop 2 Themes
• Ease of Use

• Ease of Extension

• Security




Ease of Use
Sqoop 1                            Sqoop 2
Client-only architecture           Client/server architecture
CLI based                          CLI + web based
Client access to Hive, HBase       Server access to Hive, HBase
Oozie and Sqoop tightly coupled    Oozie uses the REST API




Sqoop 1: Client-side Tool
• Client-side installation + configuration
  – Connectors are installed/configured locally
  – Local installation requires root privileges
  – JDBC drivers are needed locally
  – Database connectivity is needed locally




Sqoop 2: Sqoop as a Service
• Server-side installation + configuration
  – Connectors are installed/configured in one place
  – Managed by administrator and run by operator
  – JDBC drivers are needed in one place
  – Database connectivity is needed on the server




Client Interface
• Sqoop 1 client interface:
  – Command line interface (CLI) based
  – Can be automated via scripting


• Sqoop 2 client interface:
  – CLI based (in either interactive or script mode)
  – Web based (remotely accessible)
  – REST API is exposed for external tool integration
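A minimal sketch of what calling that REST API could look like. The host, port, and endpoint path here are placeholders (Sqoop 2's actual server URL and resource paths vary by version), and the request is composed but never sent:

```python
from urllib.request import Request

# Hypothetical server address; Sqoop 2's real default port and base path
# may differ, so treat this value as an assumption.
SQOOP_SERVER = "http://sqoop-server.example.com:12000/sqoop"

def build_version_request(base_url=SQOOP_SERVER):
    """Compose (but do not send) a GET request asking the server its version."""
    req = Request(base_url + "/version", method="GET")
    req.add_header("Accept", "application/json")
    return req

req = build_version_request()
print(req.full_url)  # http://sqoop-server.example.com:12000/sqoop/version
```

Because the API is plain HTTP + JSON, any external tool (Oozie included) can drive Sqoop 2 without linking against its client libraries.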

Sqoop 1: Service Level Integration
• Hive, HBase
  – Require local installation
• Oozie
  – von Neumann(esque) integration:
     • Package Sqoop as an action
     • Then run Sqoop from node machines, causing one MR
       job to be dependent on another MR job
     • Error-prone, difficult to debug


Sqoop 2: Service Level Integration
• Hive, HBase
  – Server-side integration
• Oozie
  – REST API integration




Ease of Extension
Sqoop 1                                   Sqoop 2
Connector forced to follow JDBC model     Connector given free rein
Connectors must implement functionality   Connectors benefit from a common framework of functionality
Connector selection is implicit           Connector selection is explicit




Sqoop 1: Implementing Connectors
• Connectors are forced to follow JDBC model
  – Connectors are limited/required to use common
    JDBC vocabulary (URL, database, table, etc)
• Connectors must implement all Sqoop
  functionality they want to support
  – New functionality may not be available for
    previously implemented connectors



Sqoop 2: Implementing Connectors
• Connectors are not restricted to the JDBC model
  – Connectors can define their own domain
• Common functionality is abstracted out of
  connectors
  – Connectors are only responsible for data transfer
  – Common Reduce phase implements data
    transformation and system integration
  – Connectors can benefit from future development
    of common functionality
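The division of labor above can be sketched in a few lines. The class and function names here are illustrative, not Sqoop 2's actual connector API: the connector knows only how to extract records, while framework code applies the common downstream transformation:

```python
# Illustrative sketch: connectors handle only data transfer; a common
# framework applies downstream transformations on their behalf.

class Connector:
    """A connector only knows how to extract records from its source."""
    def extract(self):
        raise NotImplementedError

class CsvSourceConnector(Connector):
    """Hypothetical connector that reads comma-separated lines."""
    def __init__(self, lines):
        self.lines = lines

    def extract(self):
        for line in self.lines:
            yield line.split(",")

def run_transfer(connector, transform):
    """Framework code: every connector benefits from the shared transform."""
    return [transform(record) for record in connector.extract()]

records = run_transfer(
    CsvSourceConnector(["1,alice", "2,bob"]),
    transform=lambda r: {"id": int(r[0]), "name": r[1]},
)
print(records)  # [{'id': 1, 'name': 'alice'}, {'id': 2, 'name': 'bob'}]
```

When a new output format is added to `run_transfer`-style framework code, every existing connector picks it up for free, which is the point of the last bullet.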
Different Options, Different Results
Which is running MySQL?
$ sqoop import --connect jdbc:mysql://localhost/db \
    --username foo --table TEST

$ sqoop import --connect jdbc:mysql://localhost/db \
    --driver com.mysql.jdbc.Driver --username foo --table TEST


• Different options may lead to unpredictable results:
  both commands target MySQL, but specifying --driver in
  the second causes fallback to the generic JDBC connector
  instead of the MySQL-specific one
   – Sqoop 2 requires explicit selection of a connector,
     thus disambiguating the process
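The surprise can be modeled with a simplified chooser. This is not Sqoop's actual code, just a sketch mimicking the documented Sqoop 1 behavior of falling back to the generic JDBC connector whenever --driver is supplied:

```python
# Simplified model of Sqoop 1's implicit connector selection (illustrative,
# not the real implementation).

def choose_connector(connect_url, driver=None):
    if driver is not None:
        return "generic-jdbc"  # an explicit driver forces the generic path
    if connect_url.startswith("jdbc:mysql:"):
        return "mysql-specific"
    return "generic-jdbc"

# Same database, different connector chosen:
print(choose_connector("jdbc:mysql://localhost/db"))
# mysql-specific
print(choose_connector("jdbc:mysql://localhost/db",
                       driver="com.mysql.jdbc.Driver"))
# generic-jdbc
```

In Sqoop 2 the user names the connector outright, so no such inference (and no such surprise) occurs.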
Sqoop 1: Using Connectors
• Choice of connector is implicit
  – In a simple case, based on the URL in --connect
    string to access the database
  – Specification of different options can lead to
    different connector selection
  – Error-prone but good for power users




Sqoop 1: Using Connectors
• Require knowledge of database idiosyncrasies
  – e.g. Couchbase has no notion of a table, yet --table
    is required, so the option is overloaded to indicate a
    backfill or dump operation
  – e.g. the --null-string representation is not supported
    by all connectors

• Functionality is limited to what the implicitly
  chosen connector supports


Sqoop 2: Using Connectors
• Users make explicit connector choice
  – Less error-prone, more predictable
• Users need not be aware of the functionality
  of all connectors
  – Couchbase users need not care that other
    connectors use tables




Sqoop 2: Using Connectors
• Common functionality is available to all
  connectors
  – Connectors need not worry about common
    downstream functionality, such as transformation
    into various formats and integration with other
    systems




Security
Sqoop 1                                           Sqoop 2
Support only for Hadoop security                  Support for Hadoop security and role-based access control to external systems
High risk of abusing access to external systems   Reduced risk of abusing access to external systems
No resource management policy                     Resource management policy




Sqoop 1: Security
• Inherit/Propagate Kerberos principal for the
  jobs it launches
• Access to files on HDFS can be controlled via
  HDFS security
• Limited support (user/password) for secure
  access to external systems




Sqoop 2: Security
• Inherit/Propagate Kerberos principal for the
  jobs it launches
• Access to files on HDFS can be controlled via
  HDFS security
• Support for secure access to external systems
  via role-based access to connection objects
  – Administrators create/edit/delete connections
  – Operators use connections



Sqoop 1: External System Access
• Every invocation requires necessary
  credentials to access external systems (e.g.
  relational database)
  – Workaround: create a user with limited access in
    lieu of giving out password
     • Does not scale
     • Permission granularity is hard to obtain
• Hard to prevent misuse once credentials are
  given
Sqoop 2: External System Access
• Connections are enabled as first-class objects
  – Connections encompass credentials
  – Connections are created once and then used
    many times for various import/export jobs
  – Connections are created by administrator and
    used by operator
     • Safeguard credential access from end users
• Connections can be restricted in scope based
  on operation (import/export)
  – Operators cannot abuse credentials
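A toy model of the "connections as first-class objects" idea: an administrator registers a connection (with its credentials) once, and operators run jobs against it without ever seeing the password or exceeding its allowed operations. The classes below are illustrative, not Sqoop 2's actual API:

```python
# Toy model of role-based, first-class connection objects (illustrative).

class Connection:
    def __init__(self, name, url, username, password, allowed_ops):
        self.name = name
        self.url = url
        self._credentials = (username, password)  # never exposed to operators
        self.allowed_ops = set(allowed_ops)       # e.g. {"import"}

class ConnectionRegistry:
    def __init__(self):
        self._connections = {}

    def create(self, role, conn):
        # Only administrators may create/edit/delete connections.
        if role != "admin":
            raise PermissionError("only administrators create connections")
        self._connections[conn.name] = conn

    def run_job(self, name, operation):
        # Operators use connections, restricted to the allowed operations.
        conn = self._connections[name]
        if operation not in conn.allowed_ops:
            raise PermissionError(f"{operation} not allowed on {name}")
        return f"{operation} via {conn.url}"

reg = ConnectionRegistry()
reg.create("admin", Connection("prod-db", "jdbc:mysql://db/prod",
                               "etl", "s3cret", allowed_ops={"import"}))
print(reg.run_job("prod-db", "import"))  # import via jdbc:mysql://db/prod
# reg.run_job("prod-db", "export") would raise PermissionError
```

Scoping the connection to import-only is what prevents an operator from turning a sanctioned ingest credential into an export channel.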
Sqoop 1: Resource Management
• No explicit resource management policy
  – Users specify the number of map jobs to run
  – Cannot throttle load on external systems




Sqoop 2: Resource Management
• Connections allow specification of resource
  management policy
  – Administrators can limit the total number of
    physical connections open at one time
  – Connections can also be disabled
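One simple way to model the policy above is a semaphore capping concurrently open physical connections, plus an enabled flag. This is a sketch of the idea, not Sqoop 2's actual implementation:

```python
# Sketch of a resource-management policy: cap concurrently open physical
# connections and allow an administrator to disable a connection entirely.
import threading

class ThrottledConnectionPool:
    def __init__(self, max_open, enabled=True):
        self._slots = threading.BoundedSemaphore(max_open)
        self.enabled = enabled  # administrators can flip this off

    def open(self):
        if not self.enabled:
            raise RuntimeError("connection disabled by administrator")
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("connection limit reached")
        return "open"

    def close(self):
        self._slots.release()

pool = ThrottledConnectionPool(max_open=2)
pool.open()
pool.open()
try:
    pool.open()  # third open exceeds the cap
except RuntimeError as e:
    print(e)  # connection limit reached
```

Centralizing the cap on the server is what lets administrators throttle the load Sqoop places on an external database, something Sqoop 1's client-side model could not enforce.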




Demo Screenshots




Takeaway
Sqoop 2 Highlights:
  – Ease of Use: Sqoop as a Service
  – Ease of Extension: Connectors benefit from
    shared functionality
  – Security: Connections as first-class objects and
    role-based security




Current Status: work-in-progress
• Sqoop2 Development:
 http://issues.apache.org/jira/browse/SQOOP-365

• Sqoop2 Blog Post:
 http://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop

• Sqoop2 Design:
 http://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2




Current Status: work-in-progress
• Sqoop2 Quickstart:
 http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Quickstart

• Sqoop2 Resource Layout:
 http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+-+Resource+Layout

• Sqoop2 Feature Requests:
 http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Feature+Requests





More Related Content

PDF
New Data Transfer Tools for Hadoop: Sqoop 2
PPTX
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
PDF
Sqoop2 refactoring for generic data transfer - NYC Sqoop Meetup
PDF
Hive on kafka
PDF
Apache Ambari BOF - OpenStack - Hadoop Summit 2013
PPTX
Streamline Hadoop DevOps with Apache Ambari
PDF
HiveServer2 for Apache Hive
PPTX
Apache Ambari BOF - Blueprints + Azure - Hadoop Summit 2013
New Data Transfer Tools for Hadoop: Sqoop 2
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
Sqoop2 refactoring for generic data transfer - NYC Sqoop Meetup
Hive on kafka
Apache Ambari BOF - OpenStack - Hadoop Summit 2013
Streamline Hadoop DevOps with Apache Ambari
HiveServer2 for Apache Hive
Apache Ambari BOF - Blueprints + Azure - Hadoop Summit 2013

What's hot (20)

PPTX
Tuning Apache Ambari performance for Big Data at scale with 3000 agents
PPTX
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
PDF
Hive on spark berlin buzzwords
PPTX
Get most out of Spark on YARN
PPTX
Simplified Cluster Operation & Troubleshooting
PPTX
YARN and the Docker container runtime
PPTX
Apache Ambari BOF - APIs - Hadoop Summit 2013
PPTX
Apache Hive on ACID
PPT
State of Security: Apache Spark & Apache Zeppelin
PPTX
Effective Spark on Multi-Tenant Clusters
PPTX
Hive analytic workloads hadoop summit san jose 2014
PPTX
Running Enterprise Workloads in the Cloud
PPTX
Apache Hadoop YARN: Past, Present and Future
PDF
Strata Stinger Talk October 2013
PPTX
Apache Slider
PPTX
Apache HBase: State of the Union
PDF
SQOOP PPT
PDF
The Heterogeneous Data lake
PPTX
Flexible and Real-Time Stream Processing with Apache Flink
PPTX
Apache Hive 2.0: SQL, Speed, Scale
Tuning Apache Ambari performance for Big Data at scale with 3000 agents
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Hive on spark berlin buzzwords
Get most out of Spark on YARN
Simplified Cluster Operation & Troubleshooting
YARN and the Docker container runtime
Apache Ambari BOF - APIs - Hadoop Summit 2013
Apache Hive on ACID
State of Security: Apache Spark & Apache Zeppelin
Effective Spark on Multi-Tenant Clusters
Hive analytic workloads hadoop summit san jose 2014
Running Enterprise Workloads in the Cloud
Apache Hadoop YARN: Past, Present and Future
Strata Stinger Talk October 2013
Apache Slider
Apache HBase: State of the Union
SQOOP PPT
The Heterogeneous Data lake
Flexible and Real-Time Stream Processing with Apache Flink
Apache Hive 2.0: SQL, Speed, Scale
Ad

Viewers also liked (6)

PDF
Habits of Effective Sqoop Users
PPTX
Apache sqoop with an use case
PPTX
Big data components - Introduction to Flume, Pig and Sqoop
PPTX
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
PPTX
Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
PDF
Sqoop on Spark for Data Ingestion
Habits of Effective Sqoop Users
Apache sqoop with an use case
Big data components - Introduction to Flume, Pig and Sqoop
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
Sqoop on Spark for Data Ingestion
Ad

Similar to Hadoop Summit 2012 | A New Generation of Data Transfer Tools for Hadoop: Sqoop 2 (20)

PPT
Data Science Day New York: The Platform for Big Data
PPTX
Introduction to the Hadoop EcoSystem
PDF
Running Hadoop as Service in AltiScale Platform
PDF
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
PDF
Big SQL Competitive Summary - Vendor Landscape
PDF
Dallas TDWI Meeting Dec. 2012: Hadoop
PPTX
Hadoop: today and tomorrow
ODP
The power of hadoop in cloud computing
PDF
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
PDF
Hortonworks tech workshop in-memory processing with spark
PPTX
Apache Hadoop Now Next and Beyond
PDF
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
PDF
Transitioning Compute Models: Hadoop MapReduce to Spark
PDF
Big Data Hoopla Simplified - TDWI Memphis 2014
PDF
Applications on Hadoop
PPTX
PPTX
Cloudera Manager Webinar | Cloudera Enterprise 3.7
PDF
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
PDF
Webinar: The Future of Hadoop
PPTX
Hadoopppt.pptx
Data Science Day New York: The Platform for Big Data
Introduction to the Hadoop EcoSystem
Running Hadoop as Service in AltiScale Platform
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Big SQL Competitive Summary - Vendor Landscape
Dallas TDWI Meeting Dec. 2012: Hadoop
Hadoop: today and tomorrow
The power of hadoop in cloud computing
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Hortonworks tech workshop in-memory processing with spark
Apache Hadoop Now Next and Beyond
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Transitioning Compute Models: Hadoop MapReduce to Spark
Big Data Hoopla Simplified - TDWI Memphis 2014
Applications on Hadoop
Cloudera Manager Webinar | Cloudera Enterprise 3.7
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Webinar: The Future of Hadoop
Hadoopppt.pptx

More from Cloudera, Inc. (20)

PPTX
Partner Briefing_January 25 (FINAL).pptx
PPTX
Cloudera Data Impact Awards 2021 - Finalists
PPTX
2020 Cloudera Data Impact Awards Finalists
PPTX
Edc event vienna presentation 1 oct 2019
PPTX
Machine Learning with Limited Labeled Data 4/3/19
PPTX
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
PPTX
Introducing Cloudera DataFlow (CDF) 2.13.19
PPTX
Introducing Cloudera Data Science Workbench for HDP 2.12.19
PPTX
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
PPTX
Leveraging the cloud for analytics and machine learning 1.29.19
PPTX
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
PPTX
Leveraging the Cloud for Big Data Analytics 12.11.18
PPTX
Modern Data Warehouse Fundamentals Part 3
PPTX
Modern Data Warehouse Fundamentals Part 2
PPTX
Modern Data Warehouse Fundamentals Part 1
PPTX
Extending Cloudera SDX beyond the Platform
PPTX
Federated Learning: ML with Privacy on the Edge 11.15.18
PPTX
Analyst Webinar: Doing a 180 on Customer 360
PPTX
Build a modern platform for anti-money laundering 9.19.18
PPTX
Introducing the data science sandbox as a service 8.30.18
Partner Briefing_January 25 (FINAL).pptx
Cloudera Data Impact Awards 2021 - Finalists
2020 Cloudera Data Impact Awards Finalists
Edc event vienna presentation 1 oct 2019
Machine Learning with Limited Labeled Data 4/3/19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Leveraging the cloud for analytics and machine learning 1.29.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Leveraging the Cloud for Big Data Analytics 12.11.18
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 1
Extending Cloudera SDX beyond the Platform
Federated Learning: ML with Privacy on the Edge 11.15.18
Analyst Webinar: Doing a 180 on Customer 360
Build a modern platform for anti-money laundering 9.19.18
Introducing the data science sandbox as a service 8.30.18

Recently uploaded (20)

DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Electronic commerce courselecture one. Pdf
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
Empathic Computing: Creating Shared Understanding
PPTX
MYSQL Presentation for SQL database connectivity
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PPTX
A Presentation on Artificial Intelligence
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
The AUB Centre for AI in Media Proposal.docx
Building Integrated photovoltaic BIPV_UPV.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Electronic commerce courselecture one. Pdf
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
Empathic Computing: Creating Shared Understanding
MYSQL Presentation for SQL database connectivity
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Chapter 3 Spatial Domain Image Processing.pdf
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
“AI and Expert System Decision Support & Business Intelligence Systems”
Advanced methodologies resolving dimensionality complications for autism neur...
The Rise and Fall of 3GPP – Time for a Sabbatical?
Digital-Transformation-Roadmap-for-Companies.pptx
NewMind AI Weekly Chronicles - August'25 Week I
A Presentation on Artificial Intelligence
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...

Hadoop Summit 2012 | A New Generation of Data Transfer Tools for Hadoop: Sqoop 2

  • 1. A New Generation of Data Transfer Tools for Hadoop: Sqoop 2 Bilung Lee (blee at cloudera dot com) Kathleen Ting (kathleen at cloudera dot com) Hadoop Summit 2012. 6/13/12 Apache Sqoop Copyright 2012 The Apache Software Foundation
  • 2. Who Are We? • Bilung Lee – Apache Sqoop Committer – Software Engineer, Cloudera • Kathleen Ting – Apache Sqoop Committer – Support Manager, Cloudera Hadoop Summit 2012. 6/13/12 Apache Sqoop 2 Copyright 2012 The Apache Software Foundation
  • 3. What is Sqoop? • Bulk data transfer tool – Import/Export from/to relational databases, enterprise data warehouses, and NoSQL systems – Populate tables in HDFS, Hive, and HBase – Integrate with Oozie as an action – Support plugins via connector based architecture May ‘09 March ‘10 August ‘11 April ‘12 First version Moved to Moved to Apache (HADOOP-5815) GitHub Apache Top Level Project Hadoop Summit 2012. 6/13/12 Apache Sqoop 3 Copyright 2012 The Apache Software Foundation
  • 4. Sqoop 1 Architecture Document Enterprise Based Data Systems Warehouse Relational Database command Hadoop Map Task Sqoop HDFS/HBase/ Hive Hadoop Summit 2012. 6/13/12 Apache Sqoop 4 Copyright 2012 The Apache Software Foundation
  • 5. Sqoop 1 Challenges • Cryptic, contextual command line arguments • Tight coupling between data transfer and output format • Security concerns with openly shared credentials • Not easy to manage installation/configuration • Connectors are forced to follow JDBC model Hadoop Summit 2012. 6/13/12 Apache Sqoop 5 Copyright 2012 The Apache Software Foundation
  • 6. Sqoop 2 Architecture Hadoop Summit 2012. 6/13/12 Apache Sqoop 6 Copyright 2012 The Apache Software Foundation
  • 7. Sqoop 2 Themes • Ease of Use • Ease of Extension • Security Hadoop Summit 2012. 6/13/12 Apache Sqoop 7 Copyright 2012 The Apache Software Foundation
  • 8. Sqoop 2 Themes • Ease of Use • Ease of Extension • Security Hadoop Summit 2012. 6/13/12 Apache Sqoop 8 Copyright 2012 The Apache Software Foundation
  • 9. Ease of Use Sqoop 1 Sqoop 2 Client-only Architecture Client/Server Architecture CLI based CLI + Web based Client access to Hive, HBase Server access to Hive, HBase Oozie and Sqoop tightly coupled Oozie finds REST API Hadoop Summit 2012. 6/13/12 Apache Sqoop 9 Copyright 2012 The Apache Software Foundation
  • 10. Sqoop 1: Client-side Tool • Client-side installation + configuration – Connectors are installed/configured locally – Local requires root privileges – JDBC drivers are needed locally – Database connectivity is needed locally Hadoop Summit 2012. 6/13/12 Apache Sqoop 10 Copyright 2012 The Apache Software Foundation
  • 11. Sqoop 2: Sqoop as a Service • Server-side installation + configuration – Connectors are installed/configured in one place – Managed by administrator and run by operator – JDBC drivers are needed in one place – Database connectivity is needed on the server Hadoop Summit 2012. 6/13/12 Apache Sqoop 11 Copyright 2012 The Apache Software Foundation
  • 12. Client Interface • Sqoop 1 client interface: – Command line interface (CLI) based – Can be automated via scripting • Sqoop 2 client interface: – CLI based (in either interactive or script mode) – Web based (remotely accessible) – REST API is exposed for external tool integration Hadoop Summit 2012. 6/13/12 Apache Sqoop 12 Copyright 2012 The Apache Software Foundation
13. Sqoop 1: Service-Level Integration
• Hive, HBase
  – Require local installation
• Oozie
  – von Neumann(esque) integration:
    • Package Sqoop as an action
    • Then run Sqoop from node machines, causing one MR job to be dependent on another MR job
  – Error-prone, difficult to debug
14. Sqoop 2: Service-Level Integration
• Hive, HBase
  – Server-side integration
• Oozie
  – REST API integration
15. Ease of Use

  Sqoop 1                          Sqoop 2
  Client-only Architecture         Client/Server Architecture
  CLI based                        CLI + Web based
  Client access to Hive, HBase     Server access to Hive, HBase
  Oozie and Sqoop tightly coupled  Oozie uses the REST API
16. Sqoop 2 Themes
• Ease of Use
• Ease of Extension
• Security
17. Ease of Extension

  Sqoop 1                                  Sqoop 2
  Connector forced to follow JDBC model    Connector given free rein
  Connectors must implement functionality  Connectors benefit from common framework of functionality
  Connector selection is implicit          Connector selection is explicit
18. Sqoop 1: Implementing Connectors
• Connectors are forced to follow the JDBC model
  – Connectors are limited/required to use common JDBC vocabulary (URL, database, table, etc.)
• Connectors must implement all Sqoop functionality they want to support
  – New functionality may not be available for previously implemented connectors
19. Sqoop 2: Implementing Connectors
• Connectors are not restricted to the JDBC model
  – Connectors can define their own domain
• Common functionality is abstracted out of connectors
  – Connectors are only responsible for data transfer
  – A common reduce phase implements data transformation and system integration
  – Connectors can benefit from future development of common functionality
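The transfer/transform split described above can be sketched in a few lines. This is an illustrative sketch only; every name below is invented and does not correspond to a real Sqoop 2 API.

```python
# Illustrative sketch of the Sqoop 2 split: a connector handles only data
# transfer (the map side), while a common framework phase (the reduce side)
# handles transformation. All names here are invented for illustration.
def mysql_connector_extract():
    # Connector's sole job: pull raw records out of the source system.
    return [(1, "alice"), (2, "bob")]

def framework_transform(records, output_format):
    # Shared downstream logic: identical for every connector, so a new
    # output format automatically benefits all of them.
    if output_format == "csv":
        return ["%s,%s" % rec for rec in records]
    raise ValueError("unsupported format: " + output_format)

rows = framework_transform(mysql_connector_extract(), "csv")
print(rows)  # ['1,alice', '2,bob']
```

Because the transform lives in the framework rather than in each connector, supporting a new output format would require no connector changes.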
20. Different Options, Different Results
Which is running MySQL?

  $ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST

  $ sqoop import --connect jdbc:mysql://localhost/db --driver com.mysql.jdbc.Driver --username foo --table TEST

• Different options may lead to unpredictable results
  – Sqoop 2 requires explicit selection of a connector, thus disambiguating the process
21. Sqoop 1: Using Connectors
• Choice of connector is implicit
  – In a simple case, based on the URL in the --connect string used to access the database
  – Specifying different options can lead to different connector selection
  – Error-prone but good for power users
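The implicit selection described here can be modeled roughly as follows. This is a simplified sketch, not Sqoop's actual resolution logic; the connector names are placeholders.

```python
# Simplified sketch (not actual Sqoop code) of Sqoop 1's implicit connector
# choice: the JDBC URL scheme picks a specialized connector, but an explicit
# --driver option silently falls back to the generic JDBC connector.
def pick_connector(connect_url, driver=None):
    if driver is not None:
        # Supplying --driver bypasses specialized connector matching.
        return "generic-jdbc"
    scheme = connect_url.split(":")[1]  # "mysql" from "jdbc:mysql://..."
    specialized = {"mysql": "mysql-connector",
                   "postgresql": "postgresql-connector"}
    return specialized.get(scheme, "generic-jdbc")

print(pick_connector("jdbc:mysql://localhost/db"))
# -> mysql-connector
print(pick_connector("jdbc:mysql://localhost/db",
                     driver="com.mysql.jdbc.Driver"))
# -> generic-jdbc
```

Two invocations against the same database end up on different code paths, which is exactly the ambiguity that Sqoop 2's explicit connector selection removes.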
22. Sqoop 1: Using Connectors
• Requires knowledge of database idiosyncrasies
  – e.g., Couchbase does not need a table name, yet --table is required, so it gets overloaded as a backfill or dump operation
  – e.g., the --null-string representation is not supported by all connectors
• Functionality is limited to what the implicitly chosen connector supports
23. Sqoop 2: Using Connectors
• Users make an explicit connector choice
  – Less error-prone, more predictable
• Users need not be aware of the functionality of all connectors
  – Couchbase users need not care that other connectors use tables
24. Sqoop 2: Using Connectors
• Common functionality is available to all connectors
  – Connectors need not worry about common downstream functionality, such as transformation into various formats and integration with other systems
25. Ease of Extension

  Sqoop 1                                  Sqoop 2
  Connector forced to follow JDBC model    Connector given free rein
  Connectors must implement functionality  Connectors benefit from common framework of functionality
  Connector selection is implicit          Connector selection is explicit
26. Sqoop 2 Themes
• Ease of Use
• Ease of Extension
• Security
27. Security

  Sqoop 1                                          Sqoop 2
  Support only for Hadoop security                 Support for Hadoop security and role-based access control to external systems
  High risk of abusing access to external systems  Reduced risk of abusing access to external systems
  No resource management policy                    Resource management policy
28. Sqoop 1: Security
• Inherits/propagates the Kerberos principal for the jobs it launches
• Access to files on HDFS can be controlled via HDFS security
• Limited support (user/password) for secure access to external systems
29. Sqoop 2: Security
• Inherits/propagates the Kerberos principal for the jobs it launches
• Access to files on HDFS can be controlled via HDFS security
• Support for secure access to external systems via role-based access to connection objects
  – Administrators create/edit/delete connections
  – Operators use connections
30. Sqoop 1: External System Access
• Every invocation requires the necessary credentials to access external systems (e.g., a relational database)
  – Workaround: create a user with limited access in lieu of giving out the password
    • Does not scale
    • Permission granularity is hard to obtain
• Hard to prevent misuse once credentials are given
31. Sqoop 2: External System Access
• Connections are enabled as first-class objects
  – Connections encompass credentials
  – Connections are created once and then used many times for various import/export jobs
  – Connections are created by administrators and used by operators
    • Safeguards credential access from end users
• Connections can be restricted in scope based on operation (import/export)
  – Operators cannot abuse credentials
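A connection store with role-based access could look roughly like the sketch below. The class and method names are invented for illustration and are not Sqoop 2's actual API.

```python
# Illustrative sketch: connections as first-class objects. An administrator
# registers credentials once; operators run jobs by connection name and
# never see the password. Invented names, not Sqoop 2's real API.
class ConnectionStore:
    def __init__(self):
        self._connections = {}

    def create(self, role, name, url, user, password, allowed_ops):
        if role != "admin":
            raise PermissionError("only administrators manage connections")
        self._connections[name] = {"url": url, "user": user,
                                   "password": password,
                                   "ops": set(allowed_ops)}

    def run_job(self, role, name, op):
        conn = self._connections[name]
        if op not in conn["ops"]:
            # Scope restriction: e.g. an import-only connection.
            raise PermissionError(op + " not permitted on " + name)
        return "running %s via %s as %s" % (op, conn["url"], conn["user"])

store = ConnectionStore()
store.create("admin", "sales-db", "jdbc:mysql://dbhost/sales", "etl",
             "s3cret", allowed_ops={"import"})
print(store.run_job("operator", "sales-db", "import"))
```

Note that the operator supplies only the connection name and operation; an export attempt on this import-only connection would be rejected.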
32. Sqoop 1: Resource Management
• No explicit resource management policy
  – Users specify the number of map tasks to run
  – Cannot throttle load on external systems
33. Sqoop 2: Resource Management
• Connections allow specification of a resource management policy
  – Administrators can limit the total number of physical connections open at one time
  – Connections can also be disabled
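One way to picture such a policy is a cap on concurrently open physical connections. The sketch below is invented for illustration and is not Sqoop 2's implementation.

```python
# Illustrative sketch of a per-connection resource policy: an administrator
# caps concurrent physical connections so jobs cannot overwhelm the external
# database, and can disable a connection entirely. Invented names only.
class ManagedConnection:
    def __init__(self, max_open):
        self.max_open = max_open
        self.open_count = 0
        self.enabled = True

    def acquire(self):
        if not self.enabled:
            raise RuntimeError("connection disabled by administrator")
        if self.open_count >= self.max_open:
            raise RuntimeError("connection limit reached")
        self.open_count += 1

    def release(self):
        self.open_count -= 1

conn = ManagedConnection(max_open=2)
conn.acquire()
conn.acquire()          # two jobs hold physical connections
try:
    conn.acquire()      # a third job is throttled
except RuntimeError as e:
    print(e)            # connection limit reached
```

Disabling the connection (`enabled = False`) rejects all new jobs, which matches the slide's point that no database-side user change is needed.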
34. Security

  Sqoop 1                                          Sqoop 2
  Support only for Hadoop security                 Support for Hadoop security and role-based access control to external systems
  High risk of abusing access to external systems  Reduced risk of abusing access to external systems
  No resource management policy                    Resource management policy
35–39. Demo Screenshots (screenshots not included in this transcript)
40. Takeaway
• Sqoop 2 Highlights:
  – Ease of Use: Sqoop as a Service
  – Ease of Extension: connectors benefit from shared functionality
  – Security: connections as first-class objects and role-based security
41. Current Status: work-in-progress
• Sqoop2 Development: http://issues.apache.org/jira/browse/SQOOP-365
• Sqoop2 Blog Post: http://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop
• Sqoop2 Design: http://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2
42. Current Status: work-in-progress
• Sqoop2 Quickstart: http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Quickstart
• Sqoop2 Resource Layout: http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+-+Resource+Layout
• Sqoop2 Feature Requests: http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Feature+Requests

Editor's Notes

  • #2: Apache Sqoop was created to efficiently transfer bulk data between Hadoop and external structured datastores because databases are not easily accessible by Hadoop. The popularity of Sqoop in enterprise systems confirms that Sqoop does bulk transfer admirably. That said, to enhance its functionality, Sqoop needs to fulfill data integration use-cases as well as become easier to manage and operate.
• #4: It’s an Apache TLP now
• #6: Different connectors interpret these options differently. Some options are not understood for the same operation by different connectors, while some connectors have custom options that do not apply to others. This is confusing for users and detrimental to effective use. Some connectors may support a certain data format while others don’t. Connectors are required to use common JDBC vocabulary (URL, database, table, etc.). Cryptic and contextual command-line arguments can lead to error-prone connector matching, resulting in user errors. Due to tight coupling between data transfer and the serialization format, some connectors may support a certain data format that others don’t (e.g., the direct MySQL connector can’t support sequence files). There are security concerns with openly shared credentials. By requiring root privileges, local configuration and installation are not easy to manage. Debugging the map job is limited to turning on the verbose flag. Connectors are forced to follow the JDBC model and are required to use common JDBC vocabulary (URL, database, table, etc.), regardless of whether it is applicable.
• #7: Sqoop 1’s challenges will be addressed by the Sqoop 2 architecture. The connector only focuses on connectivity; serialization, format conversion, and Hive/HBase integration should be uniformly available via the framework. Repository Manager: creates/edits connectors, connections, jobs. Connector Manager: registers new connectors, enables/disables connectors. Job Manager: submits new jobs to MR, gets job progress, kills specific jobs.
• #8: Kathleen to take over. Sqoop 2 is a work in progress and you are welcome to join our weekly conference calls discussing its design and implementation. Please see me afterwards for more details if interested. During our discussions we’ve identified three pain points for Sqoop 2 to address. Those of you who have used Sqoop will find those points apparent: ease of use, ease of extension, and security.
• #9: Pause for Q. Kathleen to take over. Sqoop 2 is a work in progress and you are welcome to join our weekly conference calls discussing its design and implementation. Please see me afterwards for more details if interested. During our discussions we’ve identified three pain points for Sqoop 2 to address. Those of you who have used Sqoop will find those points apparent: ease of use, ease of extension, and security.
  • #10: There are 4 points we want to address with Sqoop 2’s ease of use.
  • #11: Like other client-side tools, Sqoop 1 requires everything – connectors, root privileges, drivers, db connectivity – to be installed and configured locally.
• #12: Sqoop 2 will be a service, and as such you can install once and then run everywhere. This means that connectors will be configured in one place, managed by the Admin role and run by the Operator role, which will be discussed in detail later. Likewise, JDBC drivers will be in one place and database connectivity will only be needed on the server. Sqoop, as a web-based service, exposes the REST API; it is front-ended by the CLI and browser and back-ended by a metadata repository. An example of a document-based system is Couchbase. Sqoop 1 has something called a sqoop metastore, which is similar to a repository for metadata but not quite. That said, the model of operation for Sqoop 1 and Sqoop 2 is very different: Sqoop 1 was a limited-vocabulary tool while Sqoop 2 is more metadata driven. The design of Sqoop 2’s metadata repository is such that it can be replaced by other providers.
• #13: Sqoop 1 was intended for power users, as evidenced by its CLI. Sqoop 2 has two modes: one for the power user and one for the newbie. Those new to Sqoop will appreciate the interactive UI, which walks you through import/export setup, eliminating redundant/incorrect options. Various connectors are added in one place, with connectors exposing necessary options to the Sqoop framework and the user only required to provide info relevant to their use-case. Not bound by a terminal; well-documented return codes.
  • #14: In Sqoop 1, Hive and HBase require local installation. Currently Oozie launches Sqoop by bundling it and running it on the cluster, which is error-prone and difficult to debug.
• #15: With Sqoop 2, Hive and HBase integration happens not from the client but from the backend. Hive does not need to be installed on Sqoop at all; Sqoop will submit requests to the HiveServer over the wire. Exposing a REST API for operation and management will help Sqoop integrate better with external systems such as Oozie. Oozie and Sqoop will be decoupled: if you install a new Sqoop connector, you don’t need to install it in Oozie as well. Hive will not invoke anything in Sqoop, while Oozie does invoke Sqoop, so the REST API does not benefit Hive in any way, but it does benefit Oozie. Which Hive/HBase server the data will be put into is the responsibility of the reduce phase, which will have its own configuration; and since both these systems are on Hadoop, we don’t need any added security besides passing down the Kerberos principal.
  • #16: 4 pts
  • #17: Pause for questions.
• #19: Because Sqoop is heavily JDBC-centric, it’s not easy to work with non-relational databases. The Couchbase implementation required a different interpretation. There are inconsistencies between connectors.
• #20: Two phases: first, transfer; second, transform/integration with other components. There is an option to opt out of downstream processing (i.e., revert to Sqoop 1). There is a trade-off between ease of connector/tooling development and faster performance. Separating data transfer (map) from data transform (reduce) allows connectors to specialize. Connectors benefit from a common framework of functionality and don’t have to worry about forward compatibility. Functionally, Sqoop 2 is a superset of Sqoop 1 but does it in a different way. It is too early in the design process to tell if the same CLI commands could be used, but most likely not, primarily because it is a fundamentally incompatible change. The reduce phase is limited to stream transformations (no aggregation to start with).
• #21: The former is running the MySQL connector, because specifying the driver option prevents the MySQL connector from working; i.e., the latter would end up using the generic JDBC connector.
• #22: Based on the URL (which is JDBC-centric and something Sqoop 2 is moving away from) in the connect string used to access the database, Sqoop attempts to predict which driver it should load. What are connectors? They are plugin components based on Sqoop’s extension framework that efficiently transfer data between Hadoop and external stores. They are meant for optimized import/export or for stores that don’t support native JDBC. Bundled connectors: MySQL, PostgreSQL, Oracle, SQL Server, JDBC. High-performance data transfer: Direct MySQL, Direct PostgreSQL.
  • #23: Cryptic and contextual command line arguments can lead to error-prone connector matching, resulting in user errors. Due to tight coupling between data transfer and the serialization format, some connectors may support a certain data format that others don't (e.g. direct MySQL connector can't support sequence files).
• #24: With the user making an explicit connector choice in Sqoop 2, it will be less error-prone and more predictable. Connectors are no longer forced to follow the JDBC model and are no longer required to use common JDBC vocabulary (URL, database, table, etc.), regardless of whether it is applicable.
  • #25: Common functionality will be abstracted out of connectors, holding them responsible only for data transport. The reduce phase will implement common functionality, ensuring that connectors benefit from future development of functionality.
  • #27: Pause for questions. Bilung to take over.
• #30: No code generation and no compilation allow Sqoop to run where there are no compilers, which makes it more secure by preventing bad code from running. Previously, direct access to Hive/HBase was required. It is more secure because requests are routed through the Sqoop server rather than opening up access to all clients to perform jobs.
  • #32: Connection is only for external systems
  • #34: No need to disable user in database