Copyright © Capgemini 2016. All Rights Reserved
Bigdata Architecture Overview
Gartner Hype Cycle – Emerging Technologies
Benefits
Big Data and its Dimensions
Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible.
• Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text.
• Velocity: Streaming data and large-volume data movement.
• Volume: Scale from terabytes to petabytes (1K TB) to zettabytes (1B TB).
• Veracity: A lot of data in different volumes coming in at high speed is worthless if that data is incorrect. Organizations need to ensure that both the data and the analyses performed on it are correct.
• Value: Discovering value from multichannel datasets.
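The volume figures above can be checked with a little arithmetic. The sketch below assumes decimal (SI) units, which is what "1K TB" and "1B TB" imply; the helper name `terabytes_in` is illustrative and not part of the deck.

```python
# Decimal (SI) byte units, assumed because the slide equates
# 1 PB with 1K TB and 1 ZB with 1B TB.
UNIT_EXPONENT = {"TB": 12, "PB": 15, "EB": 18, "ZB": 21}


def terabytes_in(unit: str) -> int:
    """Return how many terabytes make up one `unit` (hypothetical helper)."""
    return 10 ** (UNIT_EXPONENT[unit] - UNIT_EXPONENT["TB"])


if __name__ == "__main__":
    for unit in ("PB", "ZB"):
        print(f"1 {unit} = {terabytes_in(unit):,} TB")
```

Under that assumption, one petabyte is a thousand terabytes and one zettabyte is a billion terabytes, matching the slide.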
Applications for Big Data Analytics
• Homeland Security
• Finance
• Smarter Healthcare
• Multi-channel sales
• Telecom
• Manufacturing
• Traffic Control
• Trading Analytics
• Fraud and Risk
• Log Analysis
• Search Quality
• Retail: Churn
General reference architecture for Big Data Analytics
Stages: Value, Act, Insight, Analyze, Information, Process, Source data
Value
• Customer profitability
• Operational cost cutting
• Risk prevention
• Market share increase
Act
• Business Applications: customer campaign; trigger activity
• Business Processes: trigger event; adjust process
• Decision makers: approve/reject business opportunities; develop new business models and products
Insight
• Customer Experience
• Operational Process Optimization
• Risk, Fraud
• Disruptive Business Model
Analyze
• Search: What is relevant?
• Explorative: How does it work?
• Descriptive: What happened?
• Diagnostic: Why did it happen?
• Predictive: What will happen?
• Prescriptive: How to act next?
Information
• Data asset descriptions
• Processed data: measures, KPIs; dimensions, master data
• Granular data: events; context information
Process
• Ingest
• Catalog
• Stream
• Store
• Prepare
• Refine, blend
• Manage lifecycle
Source data (internal and external)
• IT-managed applications (ERP, SCM, CRM)
• Master and reference data
• Business-owned informal data
• Documents, mail, images, voice, video
• Web and mobile apps
• B2B
• Internet, Social, Internet of Things (machine, sensor)
• Third-party data: market, weather, climate, geolocation
• Open data
Manage
• Data governance and security
• Data privacy
• Compliance
• Collaboration
• Value generation
• Program delivery
• Data-driven culture
• Information strategy
• Skill development
• Master data management
• Metadata management
• Data quality management
• Operations, SLAs
• Orchestration
Additional diagram labels: Business performance; Performance improvement; Mask
The BDL (Business Data Lake) is also aligned with our principles
Principles:
• Unleash Data and Insights as-a-service
• Make Insight-driven Value a Crucial Business KPI
• Empower your People with Insights at the Point of Action
• Develop an Enterprise Data Science Culture
• Master Governance, Security and Privacy of your Data Assets
• Enable your Data Landscape for the Flood coming from Connected People and Things
• Embark on the Journey to Insights within your Business and Technology Context
How the BDL aligns:
• It concerns both Business and (disruptive) Technology
• It works with high volumes of all kinds of data
• It integrates Unified Data Management capabilities to manage governance, security, privacy, MDM, RDM, etc.
• It also comes with a new, specific mindset that has to be addressed at the Enterprise level
• We (Capgemini) intend to offer the BDL as-a-Service
• Bringing Business Value by delivering Insights at the Point of Action is the motto of the BDL
Business Data Lake Reference Architecture - Conceptual
Characteristics
• Store anything; analyze everything
• Blend traditional data elements with new data types
• Manage centrally, govern locally
• Future-proof design
• Highly scalable and available
Diagram labels
• Layers: Data Access Layer; API Layer; Data Dissemination Layer; Data Provisioning Layer; Data Distillation Layer; Data Lake Layer; Data Ingestion Layer; Distributed Compute Layer / Services; Distributed Storage Layer
• Sources: Structured data sources; Transactional Systems (RES/CRM); Un/Semi-Structured Data Sources
• Ingestion: Extract & Load; Streams
• Components: Landing; ODS; Data Quality Governance Framework (Business Rules, Transformation, Aggregation); Customer Master (CRM); Sandbox; SQL-on-Hadoop; In-Memory Grid; Data Virtualization or Blending; Marts (HR Mart 1, HR Mart 2)
• Access: Self-service; Data Visualization and Reporting; Advanced Analytics
• Cross-cutting: Data Governance (Audit, Lineage); Metadata Management; Data Security (Authentication, Authorization, Kerberos); Data Governance Integration
Business Data Lake Reference Architecture - Logical
Diagram labels
• Layers: Data Access Layer; API Layer; Data Dissemination Layer; Data Provisioning Layer; Data Distillation Layer; Data Lake Layer; Data Ingestion Layer
• Sources: Structured data sources; Transactional Systems (RES/CRM); Un/Semi-Structured Data Sources
• Ingestion: Extract & Load; Streams
• Components: Landing; ODS; Data Quality Governance Framework (Business Rules, Transformation, Aggregation); Customer Master (CRM); Sandbox; SQL-on-Hadoop; In-Memory Grid; Data Virtualization or Blending; Marts (HR Mart 1, HR Mart 2)
• Access: Self-service; Data Visualization and Reporting; Advanced Analytics
• Cross-cutting: Data Governance (Audit, Lineage); Metadata Management; Data Security (Authentication, Authorization, Kerberos)
Technologies named in the diagram
• Talend 6.3 or latest
• Hortonworks HDP 2.5 or latest
• Ranger, Knox
• Atlas
• Spark
• HBase, Hive
• HBase / Hive datamarts
• Redshift
• Zeppelin
• RESTful Service
• Spark Streaming / Storm
• Kafka
Detailed layer breakup
Reference architecture for data ingestion - Indicative
Functionality: Ingest Data from a variety of sources and with varying latency, into the Data Lake
Data Sources (Structured, Semi-Structured, Unstructured)
Data State
• Data at Rest (ETL pushdown, batch using standard DI tools or Sqoop)
• Data in Motion (fast data, processed via tools like Flume, Storm, Spark, etc.)
Data Integration Services: Data Sourcing
• S/FTP-based push (logs, text, other file-based)
• Changed Data Management (delta extracts, event mgmt)
• Source Extraction Services (XML, relational, other extracts)
Data Transformation: Transformation Services
• Fast Data Manipulation: sorting, file merges, joins, file splitting, others
• Transform Routines: aggregation, mappings, lookups, calculations, others
• Big Data Transformations: user-defined functions / custom MR code (Java, Python, etc.) for complex logic
• ETL Pushdown Processing (execute mapping jobs on Hadoop cluster on HDFS/Hive/Spark….)
Common Services
• Metadata Management
• Automation Services
• Deployment (job & others)
• Error Handling
• Clustering & Capacity
Data Persistence
Characteristics
• The Data Ingestion design principles are based on integrating raw data characterized by extreme scale and variability, and on making provisions for both 'data at rest' (batch) and 'data in motion' (low latency).
• The framework combines traditional data integration methodologies that leverage the Extract-Transform-Load approach and extends them to also process semi-structured and unstructured data elements.
• The classical model of tracking data elements through their lifecycle and providing for lineage can be added to this framework.
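The split between 'data at rest' and 'data in motion' can be illustrated with a small routing sketch. Everything here is an assumption made for illustration: the `Source` type, the 60-second threshold, and the source names are hypothetical, and the deck does not prescribe how the choice is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    structure: str        # "structured", "semi-structured" or "unstructured"
    max_latency_s: float  # how stale the data may be when it lands in the lake


# Hypothetical cut-off: anything needing sub-minute freshness is "in motion".
STREAMING_THRESHOLD_S = 60.0


def ingestion_path(source: Source) -> str:
    """Pick the batch or the streaming path for one source."""
    if source.max_latency_s < STREAMING_THRESHOLD_S:
        return "data in motion (streaming)"
    return "data at rest (batch)"


def plan(sources):
    """Map every source name to its ingestion path."""
    return {s.name: ingestion_path(s) for s in sources}


if __name__ == "__main__":
    demo = [
        Source("erp_orders", "structured", 86_400.0),   # nightly extract
        Source("clickstream", "semi-structured", 5.0),  # near real time
    ]
    for name, path in plan(demo).items():
        print(f"{name}: {path}")
```

In a real deployment the batch path would hand off to tools such as Sqoop or a DI tool, and the streaming path to tools such as Flume, Storm or Spark, as listed on the slide.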
Data Acquisition and Reconciliation
Data Reconciliation (Optional)
Data reconciliation is part of data quality and ensures data integrity in the data lake. The reconciliation process checks whether the data has been loaded properly, to ensure its accuracy and completeness.
• Master Data: Reconciliation is fairly simple because master data is not subject to frequent changes. The granularity of the data remains the same in the source and the target.
• Transactional Data: Reconciliation of transactional data is instrumental to the success of big data systems. It can run on the entire data set or on the incremental data, depending on the method by which the data is ingested.
Separate metadata tables or files are designed specifically for reconciliation. These tables or files are populated with reconciliation queries, and reconciliation reports are generated after data is loaded into the data lake.
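A reconciliation check of the kind described here might compare row counts and a content checksum between source and target. This is a minimal sketch under stated assumptions: rows are plain dictionaries, and the function names and report fields are hypothetical rather than taken from the deck.

```python
import hashlib


def row_fingerprint(row: dict) -> int:
    """Hash one row; keys are sorted so column order does not matter."""
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return int(hashlib.sha256(canonical.encode("utf-8")).hexdigest(), 16)


def dataset_checksum(rows) -> int:
    """Order-independent checksum: sum of row fingerprints modulo 2**256."""
    return sum(row_fingerprint(r) for r in rows) % (1 << 256)


def reconcile(source_rows, target_rows) -> dict:
    """Build a small reconciliation report for one load."""
    report = {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "counts_match": len(source_rows) == len(target_rows),
        "checksums_match": dataset_checksum(source_rows) == dataset_checksum(target_rows),
    }
    ok = report["counts_match"] and report["checksums_match"]
    report["status"] = "OK" if ok else "MISMATCH"
    return report


if __name__ == "__main__":
    source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
    target = [{"id": 2, "amount": 20}, {"id": 1, "amount": 10}]
    print(reconcile(source, target))
```

Summing fingerprints rather than XOR-ing them keeps duplicate rows from cancelling each other out. The same function can be run on a full data set or on an incremental batch.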
Data Acquisition
Data acquisition can be described as a combination of Landing Zone, Data Validation, Delta Detection and Data Enrichment.
• Landing Zone: The area where data from all source systems across the client's landscape lands for consumption by downstream systems.
• Data Validation: The first checkpoint, where MDM-based checks are applied to the incoming source data files.
• Delta Detection: Applies to data feeds from source systems that can provide incremental (delta) data for regular, ongoing processing into the data lake.
• Data Enrichment: Processes used to enhance, refine or otherwise improve raw data. Data from various enrichment sources is pushed to the data lake via the Landing Zone to enrich existing data.
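For a source that can supply incremental data, one common way to pick up only the changes is a high-water mark on a modification timestamp. The sketch below assumes each record carries a `modified_at` field; that field name and the function are hypothetical and not taken from the deck.

```python
from datetime import datetime


def extract_delta(feed, watermark: datetime):
    """Return the records changed after `watermark`, plus the new watermark."""
    delta = [record for record in feed if record["modified_at"] > watermark]
    # If nothing changed, the watermark stays where it was.
    new_watermark = max((r["modified_at"] for r in delta), default=watermark)
    return delta, new_watermark


if __name__ == "__main__":
    feed = [
        {"id": 1, "modified_at": datetime(2016, 1, 1)},
        {"id": 2, "modified_at": datetime(2016, 1, 3)},
        {"id": 3, "modified_at": datetime(2016, 1, 5)},
    ]
    delta, watermark = extract_delta(feed, datetime(2016, 1, 2))
    print([r["id"] for r in delta], watermark.date())
```

The returned watermark would be stored with the load metadata so the next run continues from where this one stopped.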
Data Distillation in the Data Lake: approach to provisioning for data consumption
Characteristics
• Uniform approach for distillation of information from the data lake
• A centralized Data Quality engine for application of uniform data quality rules across the enterprise
• An integrated Data Quality function to cleanse, standardize, enrich and de-duplicate data
• Console for design, development and validation of rules
• Data Quality Services for integration with operational systems and MDM
• An Exception Management solution for resolving data issues and errors
• Data quality processes running on the data will be translated into MapReduce for faster processing
Diagram labels
• Layers: Data Persistence Layer; Distillation Layer
• Distillation steps: Extract; Transform; Aggregation (Σ); Secure
• Data quality blocks: Data Quality Store; Data Quality Console; Data Quality Engine
• Data quality components: Data Profiling; Data Cleansing; Match & Merge; Data Enrichment; Rule Manager; DQ Metadata; Data Dashboard; Exception Management; Data Quality Configurator; Exception Repository; DQ Mart
Functionality: Ability to ingest data from the storage tier and convert it to structured data for easier analysis by downstream applications.
This is done through a combination of Extraction, transformation and aggregation of high quality data from the Data Lake and making it
available for Analytical and Reporting Applications. Transformation will also involve data quality checks and corrections like profiling,
validating, cleansing structured and unstructured data based on Business rules. Data is distilled (or prepared) on a per-function basis, and
made available for consumption. This is consistent with the design practice of ‘manage data centrally and provision locally’
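The cleanse, standardize, de-duplicate and exception-handling steps described above can be sketched in a few functions. This is an illustration only: the record fields (`id`, `name`, `email`), the two validation rules, and matching on email are assumptions, not rules from the deck.

```python
def standardize(record: dict) -> dict:
    """Trim whitespace and normalise case on the fields used for matching."""
    return {
        "id": record["id"],
        "name": " ".join(record.get("name", "").split()).title(),
        "email": record.get("email", "").strip().lower(),
    }


def validate(record: dict) -> list:
    """Return the list of rule violations; empty means the record passes."""
    errors = []
    if not record["name"]:
        errors.append("name is empty")
    if "@" not in record["email"]:
        errors.append("email is malformed")
    return errors


def distill(records):
    """Split records into clean, de-duplicated rows and an exception list."""
    clean, exceptions, seen_emails = [], [], set()
    for raw in records:
        rec = standardize(raw)
        errors = validate(rec)
        if errors:
            # Failed records go to the exception list for later resolution.
            exceptions.append({"record": raw, "errors": errors})
            continue
        if rec["email"] in seen_emails:
            continue  # duplicate of a record already accepted
        seen_emails.add(rec["email"])
        clean.append(rec)
    return clean, exceptions


if __name__ == "__main__":
    rows = [
        {"id": 1, "name": "  alice  smith", "email": "Alice@Example.com "},
        {"id": 2, "name": "Alice Smith", "email": "alice@example.com"},
        {"id": 3, "name": "", "email": "bob"},
    ]
    clean, exceptions = distill(rows)
    print(clean)
    print(exceptions)
```

The exception list plays the role of the exception repository on the slide: failed records are kept with the reasons they failed instead of being silently dropped.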
Data Persistence Layer : Schema on Read & Distill on Demand
Diagram labels
• Storage: Hadoop Distributed File System (HDFS); Namenode; Datanodes; Replication; Job / Task Tracker; Storage Cluster/Rack
• Data Lake zones: Landing; Staging; Curated
• Supporting services: Audit; Metadata; Search
• Input: Data Ingestion
Characteristics
• Deliver a single, comprehensive view of all data, across functional areas, to conduct deep analysis
• Multi-tiered Data Lake that serves distinct functionalities, e.g., landing, staging and curated stores
• A landing area containing both traditional and non-traditional data, characterized by attributes of value, veracity, volume, velocity and variety
• Eliminate the need for upfront schema design and rigid pre-configured models
• Easy and cost-effective configuration for scale up and scale down
• Store everything, distill on demand
Functionality: Create a single repository for information and deliver a single, silo-less store to handle all types of data for all reporting,
analysis and discovery requirements
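Schema on read means the raw data is stored as it arrives and a schema is applied only when a consumer reads it. The sketch below uses JSON lines as the raw format; the sample records, the schema representation (field mapped to a cast and a default), and the function name are assumptions made for illustration.

```python
import json

# Raw landing data is kept exactly as received, including a bad line.
RAW_LANDING = [
    '{"user": "u1", "amount": "12.50", "ts": "2016-01-01"}',
    '{"user": "u2", "amount": "7", "channel": "web"}',
    "not valid json",
]


def read_with_schema(lines, schema):
    """Yield rows shaped by `schema`, a mapping of field -> (cast, default)."""
    for line in lines:
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue  # unreadable lines stay in the lake but are skipped here
        if not isinstance(doc, dict):
            continue
        yield {
            field: cast(doc[field]) if field in doc else default
            for field, (cast, default) in schema.items()
        }


if __name__ == "__main__":
    finance_view = {"user": (str, None), "amount": (float, 0.0)}
    marketing_view = {"user": (str, None), "channel": (str, "unknown")}
    print(list(read_with_schema(RAW_LANDING, finance_view)))
    print(list(read_with_schema(RAW_LANDING, marketing_view)))
```

Two consumers read the same stored lines with different schemas, which is the point of avoiding an upfront, rigid model.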
Approach to Data Provisioning
Diagram labels
• Data Access Layer
• Data provisioning: Discovery Platform / Sandboxes; Analytical Views; Data Virtualization
• Data Dissemination: HR Mart 1; HR Mart 2; HR Mart 3; HR Mart 4
Characteristics
• The Data Marts & Aggregate Structures layer will include subject-specific data mart structures that various tools can use to retrieve data and information. This layer will also support user-specific sandboxes in which power users can perform activities such as data mining, identifying data patterns, and running analytical and statistical models with various tools.
• If required, there will be multiple versions of the subject areas for different production streams.
• Data marts and aggregate structures such as summary tables will be created based on business and performance requirements. As far as possible, database-managed aggregates such as computed views and indexes will be created to reduce ETL-based data movement.
• Data Virtualization will address combining datasets from multiple data stores across various layers in the data lake stack.
Functionality: Provision datasets to create various combinations of custom views, both for specific functions/departments and for cross-functional access
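A database-managed aggregate can be as simple as a view over a curated table, so the summary is computed by the database instead of being copied by an ETL job. The sketch below uses SQLite from the Python standard library purely as a stand-in engine; the table, view, and column names are hypothetical.

```python
import sqlite3


def build_mart(conn: sqlite3.Connection) -> None:
    """Create a curated table and a summary view on top of it."""
    conn.execute(
        "CREATE TABLE curated_headcount (dept TEXT, employee TEXT, salary REAL)"
    )
    conn.executemany(
        "INSERT INTO curated_headcount VALUES (?, ?, ?)",
        [("HR", "a", 50000.0), ("HR", "b", 60000.0), ("Finance", "c", 70000.0)],
    )
    # The view is the aggregate: no rows are moved, it is computed on read.
    conn.execute(
        """
        CREATE VIEW hr_mart_summary AS
        SELECT dept, COUNT(*) AS employees, AVG(salary) AS avg_salary
        FROM curated_headcount
        GROUP BY dept
        """
    )


def summary(conn: sqlite3.Connection):
    """Read the aggregate the way a reporting tool would."""
    return conn.execute(
        "SELECT dept, employees, avg_salary FROM hr_mart_summary ORDER BY dept"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_mart(connection)
    for row in summary(connection):
        print(row)
```

Because the view reads the curated table directly, new rows show up in the summary without a reload step.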
© David Feinleib
CWIN17 India / Bigdata architecture yashowardhan sowale
