Vertica Integration with Apache Hadoop
Hadoop World NYC 2009
[Diagram: HDFS and a Hadoop compute cluster (Map, Map, Map, Reduce)]

Vertica® Analytic Database
  • MPP columnar architecture
  • Second to sub-second queries
  • 300GB/node load times
  • Scales to hundreds of TBs
  • Standard ETL & Reporting Tools
  • www.vertica.com

What do people do with Hadoop?
  • Transform data
  • Archive data
  • Look for patterns
  • Parse logs

Big Data comes in Three Forms
  • Unstructured: images, sound, video
  • Semi-structured: logs, data feeds, event streams
  • Fully structured: relational tables

Availability, Scalability and Efficiency … how fast can you go from data to answers?
  • Unstructured data needs to be analyzed to make sense.
  • Semi-structured data is parsed based on a spec (or by brute force).
  • Structured data can be optimized for ad-hoc analysis.

Hadoop / Vertica
  • Hadoop provides a distributed processing framework (MapReduce) and a distributed storage layer (HDFS)
  • Vertica can be used as a data source and target for MapReduce
  • Data can also be moved between Vertica and HDFS (Sqoop)
  • Hadoop talks to Vertica via custom Input and Output Formatters

Hadoop / Vertica
  • Vertica serves as a structured data repository for Hadoop
[Diagram: Hadoop compute cluster (Map, Map, Map, Reduce)]

Hadoop / Vertica
  • Vertica’s input formatter takes a parameterized query
  • Relational Map operations can be pushed down to the database
  • Vertica’s output formatter takes an existing table name or a description
  • Vertica output tables can be optimized directly from Hadoop

Hadoop / Vertica
  • Federate multiple Vertica database clusters with Hadoop
[Diagram: four Hadoop compute clusters, each with Map, Map, Map, Reduce]

What is the Interface?
  • Input Formatter
      • Query specifies which data to read
      • Query can be parameterized (map push down)
      • Each input split gets one parameter
      • OR, input can be spliced with ORDER BY and LIMIT (slower)
  • Output Formatter
      • Job specifies format for output table
      • Vertica converts reduced output into trickle loads
      • Vertica can optimize new tables

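The two ways of producing input splits can be sketched in plain Python. This is only an illustration of the idea, not the formatter's actual API: the function names `splits_by_parameter` and `splits_by_limit`, the `ticks` table, and the use of `OFFSET` alongside `ORDER BY` and `LIMIT` are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def splits_by_parameter(query: str, params: Sequence[Sequence]) -> List[Tuple[str, tuple]]:
    """One split per parameter set (hypothetical helper).

    Every map task runs the same query text with its own bound values,
    so the filtering work is pushed down to the database.
    """
    return [(query, tuple(p)) for p in params]


def splits_by_limit(query: str, order_by: str, total_rows: int, num_splits: int) -> List[str]:
    """Slice an unparameterized query into ordered windows (hypothetical helper).

    Each split re-runs the full ordered query and keeps one window,
    which is why this route is the slower of the two.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    chunk = -(-total_rows // num_splits)  # ceiling division
    return [
        f"SELECT * FROM ({query}) q ORDER BY {order_by} LIMIT {chunk} OFFSET {i * chunk}"
        for i in range(num_splits)
        if i * chunk < total_rows
    ]


# Example: one split per ticker symbol (table and column names are made up).
by_symbol = splits_by_parameter(
    "SELECT * FROM ticks WHERE symbol = ?", [("AAPL",), ("IBM",)]
)
```

With parameters, the number of splits equals the number of parameter sets; with the windowed variant, the caller has to know or estimate the row count up front.
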
Some Hadoop / Vertica Applications
  • Elastic Map Reduce parsing and loading CloudFront logs
  • Tickstore algorithm with map push down
  • Analyze time series
  • Sessionize click streams
  • Parse and load logs

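As an illustration of one of these applications, sessionizing a click stream usually means grouping each user's clicks into sessions separated by a period of inactivity. The sketch below is a generic version of that technique, not code from the talk; the function name `sessionize`, the `(user, timestamp)` record shape, and the 30-minute gap are assumptions.

```python
from typing import Dict, Iterable, Iterator, Tuple


def sessionize(clicks: Iterable[Tuple[str, int]], gap: int = 1800) -> Iterator[Tuple[str, int, int]]:
    """Turn (user, timestamp) clicks into (user, timestamp, session_id).

    A new session starts whenever a user has been idle for more than
    `gap` seconds (1800 s is an assumed default, not from the slides).
    """
    last_seen: Dict[str, int] = {}
    session: Dict[str, int] = {}
    # Sorting by user and then time stands in for the shuffle/sort phase.
    for user, ts in sorted(clicks, key=lambda c: (c[0], c[1])):
        if user not in last_seen:
            session[user] = 0
        elif ts - last_seen[user] > gap:
            session[user] += 1
        last_seen[user] = ts
        yield (user, ts, session[user])
```
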
Basic Example
Elastic Map Reduce parsing and loading CloudFront logs
  • Mapper reads from S3 CloudFront logs
  • Parses into records, transmits to reducer
  • Reducer loads into Vertica
  • All done with streaming API
  • ~10 lines of Python
  • Limitless SQL

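The mapper half of this job might look roughly like the sketch below. It is not the code from the talk: the function name `map_cloudfront`, the choice of the `date` column as the key, and the pipe-joined output value are assumptions. It relies only on the log files carrying `#`-prefixed header lines, including a `#Fields:` line, followed by tab-separated records.

```python
from typing import Iterable, Iterator, List


def map_cloudfront(lines: Iterable[str]) -> Iterator[str]:
    """Parse CloudFront access-log lines into 'key<TAB>value' records.

    The key is the request date (assumed choice); the value is the
    record with its fields joined by '|', ready for a bulk load.
    """
    fields: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # The header names the columns, separated by spaces.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines and other header lines
        values = line.split("\t")
        if fields and len(values) != len(fields):
            continue  # skip records that do not match the header
        record = dict(zip(fields, values))
        key = record.get("date", values[0])
        yield key + "\t" + "|".join(values)


# In a streaming job the function would be driven by standard input:
#     import sys
#     for out in map_cloudfront(sys.stdin):
#         print(out)
```

The reducer would then read these records and load them into Vertica; that step is left out here because it depends on the database connection details.
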
Advanced Example
Tickstore algorithm with map push down
  • Input formatter queries Vertica using map push down
  • Identity Mapper passes through to reducer
  • Reducer runs proprietary algorithm
      • moving average, correlations, secret sauce
  • Results are stored in a new table for further analysis
  • Vertica optimizes the new table

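The shape of this job can be sketched with a simple moving average standing in for the proprietary reducer logic. Everything here is illustrative: the `(symbol, timestamp, price)` record layout, the function names, and the window size are assumptions, and the in-memory sort only imitates what the shuffle phase would do.

```python
from collections import deque
from itertools import groupby
from typing import Iterable, Iterator, Tuple

Tick = Tuple[str, int, float]  # (symbol, timestamp, price), assumed layout


def identity_mapper(rows: Iterable[Tick]) -> Iterator[Tick]:
    """Pass rows through unchanged; filtering already happened in the query."""
    yield from rows


def moving_average_reducer(rows: Iterable[Tick], window: int = 3) -> Iterator[Tick]:
    """Emit (symbol, timestamp, average of the last `window` prices)."""
    ordered = sorted(rows)  # stand-in for the framework's shuffle and sort
    for symbol, ticks in groupby(ordered, key=lambda t: t[0]):
        recent = deque(maxlen=window)  # keeps only the newest prices
        for _, ts, price in ticks:
            recent.append(price)
            yield (symbol, ts, sum(recent) / len(recent))
```

Until the window fills, the average covers only the prices seen so far for that symbol.
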
How to get started
  • Get a copy of Hadoop from Apache or Cloudera
  • Get Vertica from www.vertica.com, via Amazon or RightScale, or as a VM
  • Grab the formatter and Vertica JDBC drivers from vertica.com/MapReduce
  • Included in contrib from Hadoop 0.21.0 (MR-775)
  • Put the jars in hadoop/lib
  • Run your Hadoop/Vertica job

Future Directions and Questions
  • Archiving information lifecycle (Sqoop)
  • Invoking Hadoop jobs from Vertica
  • Joining Vertica data mid job
  • Using Vertica for (structured) transient job data
  • Vertica.com/MapReduce
