©2017 Discover Financial Services - Confidential and Proprietary - Do not copy or distribute
Next-gen Data Flow Platform for the Enterprise
Santosh Bardwaj
Vice President, Advanced Analytics
The opinions expressed in this presentation are those of the presenters,
in their individual capacities, and not necessarily those of Discover.
Agenda
1. What it takes to build an enterprise-ready platform
2. Discover’s next-gen data ingestion platform built on NiFi
3. Challenges and how we overcame them
4. Next steps with the platform
Discover is a leading U.S. direct bank & payments partner
Deposits & Lending
• $37Bn Consumer Deposits
• $9Bn Private Student Loans
• $7Bn Personal Loans
• 1 in 4 Households (see note 1)
• $60Bn in Credit Card Receivables
• Leading Cash Rewards
• $183Bn Payment Services Volume
• 185+ Countries/Territories
Note(s)
Balances as of March 31, 2017; volume based on the trailing four quarters ending 1Q17; direct-to-consumer deposits includes affinity deposits
1. TNS’ Consumer Payment Strategies Study
Advancing our data-analytic capabilities
Advanced Analytics Capabilities
• Ingest, classify and transform data from “source to insight” in minutes
• Centralize data, next-generation analytic tools and reporting on the Hadoop Data Lake
• Extend the Data Lake and advanced analytic stack on the Cloud to enable speed to market
• Operationalize business use cases leveraging advanced analytic capabilities
• Provide real-time customer insight and rapid deployment of new strategies into the decision engines
From hours to minutes
Built around a foundation of a continuous data pipeline and hybrid data-analytic lake
Unified data ingestion platform built on NiFi
Unified data ingestion platform
• Ingest data from source systems
• Push to the Enterprise Data Lake
• Governed process leveraging common, reusable templates
What is NiFi?
• Enables automated data flow management
• Acquires data from producers
• Delivers to consumers while orchestrating the flow
Why we chose NiFi to build our data ingestion platform
• Scalable and Customizable
• Provenance
• Promotes reuse
• Secure
• User Interface (drag & drop)
The next-gen platform built on NiFi and Spark is designed to streamline our data pipeline into a near real-time paradigm
[Diagram: Nightly batch to near real-time]
• Flow labeled ~24 hours: Operational Database, Database file extracts, SFTP, Raw Data Lake (flat file; limited user access and tools), ETL Grid, Source of Truth, ETL Grid, Enterprise DW
• Flow labeled Minutes: Enterprise Data Lake with Raw data, Source of truth (Phase 1, “True Sourcing”) and Source of truth - Enriched (Phase 2, “Enriched Sourcing”)
• Technologies shown: NiFi, Spark, Hortonworks
We are also extending the capability into the cloud
[Diagram: on-premise and cloud data lakes]
• Inputs: Batch sources, Event Bus
• Delivery modes: Mini-batch, Real-time
• On-premise Data Lake: Model scoring/decisioning, Real-time analytics, History, Operational Data Store
• AWS Data Lake
• Technologies shown: Kafka, Hortonworks, Amazon S3, Spark
Data Flow Categorization within the Hadoop Data Lake
• System of Record (SOR)
• Source of Truth (SOT)
• Source of Truth - Enriched (SOT-E)
Detail flow and foundational components
Stages: SOURCES, RAW, SOR, SOT, SOT-E
Flow steps:
• Source files
• Landing area
• File Catalog
• Convert to standard format
• Schema evolution
• Apply schema changes
• Raw data consumable
• Technical metadata
• Business metadata
• DQ checks
• Data enrichment (Business transformation)
• Ability to export data out of Lake
Foundational components:
• Continuous integration
• Monitoring
• Data lineage
• Data governance
• Exception handling
• Security
• Data reconciliation
Ingesting complex data - How complex?
File formats vary: some are easy to consume, others are hard.
Example: Records with dynamic arrays/vectors of primitives or strings
Schema: First Name, Last Name, Array_size of Sibling_Name[], Sibling_Name[0-N], City
Data:
John, Doe, 2, Susie, Chris, Chicago
Mary, Johnston, 3, Ashley, Tom, Mike, Atlanta
Frank, Smith, 1, Ralph, Toronto
Example: Records with an array of Struct data types
Schema: First Name, Array_size of CompanyStruct[], CompanyStruct.Name, CompanyStruct.City, CompanyStruct.YearsWorked, Age
Data:
John, 1, Discover, Chicago, 3, 44
Mary, 3, Sales Unlimited, Dallas, 2, Auditors R’ Us, Atlanta, 5, Discover, Chicago, 4, 35
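The two record layouts above share one idea: a count field says how many array elements follow before the fixed fields resume. A minimal Python sketch of that parsing logic is given here for illustration only. It is not the custom NiFi processor or Spark converter from the talk; the function names and output keys are hypothetical, and it assumes every field is comma-separated with no embedded commas.

```python
def split_fields(line: str) -> list[str]:
    """Split a comma-delimited record and trim whitespace around each field."""
    return [field.strip() for field in line.split(",")]


def _finish(fields, record: dict) -> dict:
    """Reject records that carry more fields than the schema accounts for."""
    leftover = list(fields)
    if leftover:
        raise ValueError(f"unexpected trailing fields: {leftover}")
    return record


def parse_sibling_record(line: str) -> dict:
    """Layout: First Name, Last Name, Array_size, Sibling_Name[0..N-1], City."""
    fields = iter(split_fields(line))
    first, last = next(fields), next(fields)
    count = int(next(fields))  # the array size precedes the array itself
    siblings = [next(fields) for _ in range(count)]
    city = next(fields)
    return _finish(fields, {"first_name": first, "last_name": last,
                            "siblings": siblings, "city": city})


def parse_company_record(line: str) -> dict:
    """Layout: First Name, Array_size, (Name, City, YearsWorked) x N, Age."""
    fields = iter(split_fields(line))
    first = next(fields)
    count = int(next(fields))
    # Each struct consumes three consecutive fields, read left to right.
    companies = [
        {"name": next(fields), "city": next(fields), "years_worked": int(next(fields))}
        for _ in range(count)
    ]
    age = int(next(fields))
    return _finish(fields, {"first_name": first, "companies": companies, "age": age})


if __name__ == "__main__":
    print(parse_sibling_record("Mary, Johnston, 3, Ashley, Tom, Mike, Atlanta"))
    print(parse_company_record("John, 1, Discover, Chicago, 3 , 44"))
```

Because the record length depends on the count field, a fixed column mapping cannot describe these files, which is why a schema-driven parser is needed before the data can be loaded into a tabular store.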
Our solution – A custom NiFi processor to handle complex data types
[Diagram: inside a NiFi Process Group, a Spark Converter takes Data File.001 and Discover schema.json and produces Data File 001.avro and Data File.avsc, which are handed to the Ingestion Pipeline. Other labels shown: System of Record; Source of Truth - Source]
Continuous improvement of real-time data ingestion using NiFi
• NiFi Ingestion Flow Version I: Source: Flat File; Destination: Hadoop. Latency: 24 hours
• NiFi Ingestion Flow Version II: Source: Event Bus; Destination: Hadoop. Complex logic, limited scale
• NiFi Ingestion Flow Version III: Source: Event Bus; Destination: Hadoop. Custom NiFi processor developed in-house, reusable and scalable. Latency: seconds
ETL on Hadoop progression
Data enrichment from SOR to SOT (~600 jobs)
• Version I: Traditional ETL tool. Run time: ~18 hours
• Version II: ETL on HiveQL. Run time: ~8 hours
• Version III: ETL on Spark (hand-coded). Run time: ~1 hour
• Coming soon: Automated (flow-based) ETL on Spark
Upcoming enhancements to our data pipeline
• Integrating data quality, catalog into NiFi flow
• Custom processors to parse complex data structures
• Enterprise scale ETL on Hadoop using Spark
• Self-service data pipelines
• Integrating batch and real-time data pipelines
Hiring Data Engineers
Q & A

Continuous Data Ingestion pipeline for the Enterprise


Editor's Notes

  • #5: Discover has a tradition of operating on mature data-analytic platforms such as TD and SAS. These platforms are proprietary and expensive. Since the beginning of this decade, there are 3 key trends that have influenced the future of the industry: Big Data, open source tools, real-time analytics and cloud. Business: reinvent our key decisioning platforms such as Fraud, Credit decisioning and Collections, with faster and richer data, better-quality insights, and faster development and deployment. The technology foundation consists of Hadoop and a new data pipeline. Collectively, these should help improve our speed to market from days or hours to minutes.
  • #11: Multiple record formats within a single file. Records will contain complex data structures (sub-records, dynamic arrays/vectors). Formats include fixed width, single- and multiple-delimited, and mainframe.
  • #12: Systematically convert source files to a standard format with schema information attached. Apply our own “Discover Schema” (stored in JSON) to the raw source file (or use a CopyBook for mainframe files). Feed the source data and our “Discover Schema” into a Spark application. The “Discover Schema” is needed so our converter knows how to parse the incoming data file. The output is an Avro data file along with the corresponding .avsc schema. The Avro data and schema are then passed on to the ingestion pipeline for further Hive loading and processing.
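The conversion described in note #12 ends with an Avro schema (.avsc) that travels alongside the data. As a rough illustration, the Python sketch below derives an Avro record schema from a small field description. The `FIELD_SPEC` layout is a hypothetical stand-in, since the real “Discover Schema” JSON format is not shown in the slides, and the record name `Person` is likewise an assumption. Only the standard library is used; writing the binary .avro file itself would need an Avro library and is not shown.

```python
import json

# Hypothetical stand-in for a "Discover Schema" entry; the real layout is not
# given in the source. It describes the sibling-array record from slide 10.
FIELD_SPEC = [
    {"name": "first_name", "kind": "string"},
    {"name": "last_name", "kind": "string"},
    {"name": "siblings", "kind": "array", "items": "string"},
    {"name": "city", "kind": "string"},
]


def to_avro_schema(record_name: str, field_spec: list[dict]) -> dict:
    """Build an Avro record schema (as a dict) from the field description."""
    fields = []
    for spec in field_spec:
        if spec["kind"] == "array":
            # Avro expresses arrays as a nested type with an "items" entry.
            avro_type = {"type": "array", "items": spec["items"]}
        else:
            avro_type = spec["kind"]  # primitive type name, e.g. "string"
        fields.append({"name": spec["name"], "type": avro_type})
    return {"type": "record", "name": record_name, "fields": fields}


# The .avsc file is simply this schema serialized as JSON.
avsc_text = json.dumps(to_avro_schema("Person", FIELD_SPEC), indent=2)

if __name__ == "__main__":
    print(avsc_text)
```

Keeping the schema next to the data is what lets downstream steps, such as the Hive loading mentioned in the note, interpret variable-length records without re-parsing the original source layout.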