Public Information
Lessons Learned Migrating from IBM BigInsights
to Hortonworks Data Platform
DataWorks Summit 2018
Lisa Coleman – Data and Analytics Infrastructure
Robert Tucker – HDP Platform Administrator
Mica Glover – HDP Platform Administrator
May 2018
Public Information 2
OUR MISSION
The mission of the association is to
facilitate the financial security of its
members, associates and their families
through provision of a full range of highly
competitive financial products and services;
in so doing, USAA seeks to be the provider
of choice for the military community.
THE USAA STANDARD
• Keep our membership and mission first
• Live our core values: Service, Loyalty,
Honesty, Integrity
• Be authentic and build trust
• Create conditions for people to succeed
• Purposefully include diverse perspectives
for superior results
• Innovate and build for the future
Public Information 3
Agenda
 Challenges & Drivers
 Scope of Work
 Lessons Learned
Public Information 4
Platform Challenges and Drivers

Drivers: Interoperability, Support Model, Velocity of Change, Compatibility, IBM Strategy & Guidance

 In-place upgrades not possible without GPFS
 Stack integration – Platform Symphony, WLM, BigSQL
 Common BI tools in the industry not certified for connectivity via GPFS
 Limited ability to leverage streaming data platforms, as none were certified on our version of IBM BigInsights
 Difficult resolution process
 Evolving documentation to support version upgrades
 Potential added expenses for professional services
 Inconsistent stability of enhancements and break-fix releases
 Difficulty securing the proper resources for assistance
 Limits on our ability to innovate and take advantage of new capabilities
 Slow adoption of new ODPi projects
Public Information 5
Apache Project Support

| Apache Project | Capability | Hortonworks 2.5 | BigInsights 4.2 |
|---|---|---|---|
| Hadoop | Core components | 2.7.3 | 2.7.2 |
| Hive | SQL on Hadoop | 1.2.1 | 1.2.1 |
| Sqoop | Data ingestion | 1.4.6 | 1.4.6 |
| Ambari | Cluster management | 2.4 | 2.2.0 |
| Spark | IoT, streaming | 1.6.2 and 2.0* | 1.6.1 |
| Zeppelin | Data science | 0.6.0 | N/A |
| Atlas | Governance | 0.7.0 | N/A |
| Ranger | Security | 0.6.0 | 0.5.2 |
| Knox | Security | 0.9.0 | 0.7.0 |
| Tez | Improve Hive performance | 0.7.0 | N/A |
| NiFi | IoT | 1 | N/A |
| Flume | Data ingestion | 1.5.2 | 1.6.0 |
| Kafka | IoT, streaming | 0.10.0.1 | 0.9.0.1 |
| Pig | ETL | 1.2.0 | 0.15.0 |
| Storm | IoT, streaming | 1.0.1 | N/A |
| HBase | NoSQL | 1.1.2 | 1.2.0 |
| Phoenix | SQL interface for HBase | 4.7.0 | 4.6.1 |
| Solr | Social media / NLP | 5.2.2 | 5.5 |
| Accumulo | NoSQL | 1.7.0 | N/A |

(The original slide also carried per-project "Usage" and "Required Capability" check marks; those marks did not survive extraction.)
Public Information 6
Key Innovations

| Focus | Capability | Hortonworks 2.5 | BigInsights 4.2 |
|---|---|---|---|
| Operations | Rolling Upgrade – zero cluster downtime / full cluster HA | Yes. Proven. GA multiple releases | No equivalent offering |
| Operations | Express Upgrade – controlled cluster downtime | Yes. Proven. GA multiple releases | New, unproven |
| Operations | Cluster preventative maintenance | Yes, Hortonworks SmartSense | No equivalent offering |
| Security | Role Based Access Control (RBAC) | Yes, via Ranger | Yes, via Ranger |
| Security / Governance | Basic tag policy – access and entitlements can be based on attributes | Yes, via Atlas and Ranger | No, Atlas not part of distribution |
| Security / Governance | Geo-based policy – access policy based on location | Yes, via Atlas and Ranger | No, Atlas not part of distribution |
| Security / Governance | Time-based policy – access policy based on time windows | Yes, via Atlas and Ranger | No, Atlas not part of distribution |
| Security / Governance | Prohibitions – restrictions on combining two data sets that may be in compliance individually, but not when combined | Yes, via Atlas and Ranger | No, Atlas not part of distribution |
Public Information 7
Scope of Work

Environment Set Up – Provision for HDP Readiness
 AD integration
 Enable Kerberos
 Standard HDFS/local filesystem layout
 Establish DB / FTP connections

Component Migration – Batch Jobs & Scripts
 DDL replication
 Convert POSIX to HDFS commands
 Convert Hive CLI to Beeline
 Convert existing ingest utilities

Data Migration – Historic & Incremental Loads
 Re-evaluated all data assets
 Bulk data loads required specialized configuration
 Data validation

Transition Cut-Over & Retirement – Phased Releases
 Planning with data support teams
 Parallel runs
 Execute phased turn-off
 Convert clients to use Knox and Hive

Hand-Off to Support Teams
 Documentation
 Access provisioning
 Knowledge transfer
 Sign-off
Public Information 8
In-Scope Component Summary
 Hive tables: 4,700
 Data volume: 500 TB
 Env files: 233
 Python scripts: 499
 Linux scripts: 4,000
 Pig scripts: 243
 Prod jobs: 7,356
Public Information 9
Workload Transition
Public Information 10
Lessons Learned

Table Data Migration
 Modified pathing/directory structure on HDFS
 Quality checks
 Networked an additional set of nodes and leveraged HDFS client copy from local to move data (due to GPFS → HDFS)

Operational Maturity
 Enterprise monitoring
 Code repository
 Automated code deploy
 Managed asset provisioning
 HA services

Stakeholder Buy-In
 Lack of dedicated resources across the enterprise required third-party assistance
 Extensive knowledge transfer sessions to aid in transition
 Developed training plan

Ingest Framework
 Re-write all ingest utilities for HDFS
 Standardize metadata delivery to Atlas
 Standard asset request
 Adoption of data stewardship

Security and Access
 LDAP/AD integration
 Kerberos
 Ranger
 Knox – non-Kerberos connectivity to Hive

Code Management
 Transition from Hive CLI to Beeline
 Optimized file format (ORC/Parquet/Avro)
 Conversion of all scripts to acquire a Kerberos ticket
 Transition from GPFS to HDFS
Editor's Notes
  • #5: ODPi – Open Data Platform initiative: the goal was to contribute to open source collaboratively and provide interoperability across different distributions.
  • #8: Prework to evaluate all data prior to migration removed the need to migrate some of it:
    • Delete any data you no longer need
    • Delete duplicate data
    • Compress all data with Bzip2
    • Identify which Hive tables do not need to be migrated
    • Delete any temporary tables
    • Understand who is consuming your data and how they are consuming it
    • Understand whether your project conforms to the Ingest Framework methodology
    Additional notes: an HDFS client was needed on the BigInsights cluster; SAS clients, big consumers of BigSQL, had to be converted; a dedicated team addressed the majority of the script conversions; post-migration security model implementation was required for some assets that did not have an assigned owner. Knowledge transfer covered office hours, data consumption through Hive, and searching available assets.
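The prework list above (delete data you no longer need, delete duplicate data) is worth automating before moving 500 TB. A small sketch of duplicate detection by content hash — the approach is an illustration, not the team's actual utility:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root: str) -> list:
    """Group files under `root` by SHA-256 of their contents; any group with
    more than one member is a candidate for deletion before migration."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(str(path))
    return [paths for paths in groups.values() if len(paths) > 1]
```

Hashing whole files is fine for a one-off audit; for very large trees, comparing sizes first and hashing only size-collisions is the usual shortcut.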
  • #11: Add a Code section, e.g. transition from Hive CLI to Beeline; add Ranger in the Security section; removed the Code Management section. GPFS shielded USAA from having to implement Kerberos. Memory management – JVMs, Spark. Case mismatch between Unix accounts.
    Code migration: conversion of all scripts to acquire Kerberos tickets; Hive CLI to Beeline; GPFS to HDFS; Python libraries.
    Table data: pathing changed; quality checks; networked additional set of nodes and leveraged HDFS client copy from local to move data.
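The table-data note above (extra networked nodes running an HDFS client copy from the local GPFS mount) amounts to fanning out one copy command per source directory. A sketch with placeholder paths; the actual migration tooling is not shown in the deck:

```python
def copy_commands(local_dirs: list, hdfs_base: str) -> list:
    """Build one `hdfs dfs -copyFromLocal` per GPFS-mounted source directory.
    The command lists can be distributed across the extra nodes and run in
    parallel; `-f` overwrites partial targets on retry."""
    commands = []
    for src in local_dirs:
        name = src.rstrip("/").split("/")[-1]  # last path component as target dir
        commands.append(
            ["hdfs", "dfs", "-copyFromLocal", "-f", src, f"{hdfs_base}/{name}"]
        )
    return commands

for cmd in copy_commands(["/gpfs/warehouse/sales", "/gpfs/warehouse/claims"],
                         "/data/migrated"):
    print(" ".join(cmd))
```

Going through the local filesystem was necessary here because DistCp needs HDFS on both sides, and the source was GPFS.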