Informatica Intelligent Data Lake
Self Service for Data Analysts
February 2017
Sören Eickhoff
Sales Consultant Central Europe
SEickhoff@informatica.com
Data Security
Cloud Data
Management
Big Data
Management
Data
Integration
Master Data
Management
Data Quality
#1 in 6 Data Categories …
Data Platform
Data Lake
Use Case: Data Lake / Data Platform Reference Architecture
Landing Zone
Structured and unstructured enterprise and external data is landed in its raw form,
normalized and ready for use
Data Analyst, Data Scientist, Business, Data Steward, Data Modeler, Data Engineer
Discovery Zone
User sandbox for self-serve access to data for exploration, data blending, hypothesis
testing, analytics, and collaboration
Production Zone
Sanitized transactional, master, and reference data & enriched data models certified for
enterprise use
Machine, Device, Cloud
Documents and Emails
Relational, Mainframe
Social Media, Web Logs
Improve Predictive Maintenance
Increase Operational Efficiency
Increase Customer Loyalty
Reduce Security Risk
Improve Fraud Detection
Challenges Faced by the Business and IT Today
Data Analysts
• Can’t easily find trusted data
• Limited access to the data
• Frustrated by slow response from IT due to long backlog
• Constrained by disparate desktop tools, manual steps
• No way to collaborate, share, and update curated datasets
IT
• Can’t cope with growing demand from the business
• No visibility into what the business is doing with the data
• Struggling to deliver value to the business
• Losing the ability to govern and manage data as an asset
Informatica Data Lake Management
Data Lake Management
Enterprise Information Catalog
Intelligent Data Lake
Secure@Source
TITAN
Blaze
Big Data Management
Intelligent Streaming
Live Data Map (metadata integration)
Big Data Management (data integration)
Data Architect / Steward
Data Scientist / Analyst
InfoSec Analyst
Data Engineer
Unified view into enterprise information assets
• Business-user oriented solution
• Semantic search with dynamic facets
• Detailed Lineage and Impact Analysis
• Business Glossary Integration
• Relationships discovery
• High level data profiling
• Automatic Classifications with Data domains
• Business classifications with Custom Attributes
• Broad metadata source connectivity
• Big data scale
Enterprise Information Catalog
Self-service data preparation with collaborative data governance
• Collaborative project workspaces
• Automated data ingestion
• Search data asset catalog
• Rapid blend of datasets
• Crowd-sourced data asset tagging & data sharing
• Automated data asset discovery &
Recommendations
• Rapid ‘industrialization’ of preparation
steps into re-usable workflows
• Complete tracking of usage, lineage, and
security
• Easily support Data Discovery Platforms
Intelligent Data Lake
Enterprise-wide visibility into sensitive data risks
• Sensitive data classification & discovery
• Sensitive data proliferation analysis
• Who has access to sensitive data
• User activity on sensitive data
• Sensitive Data policy-based alerting
• Multi-factor risk scoring
• Identification of highest risk areas
• Integrates data security information from 3rd parties:
  - Data stores, owner, classification
  - Protection status
  - User access info (LDAP, IAM) and activity logs (DB, Hadoop, Salesforce, DAM)
Secure@Source
Easily integrate more data faster from more data sources
Big Data Management
Smart Executor
Informatica Big Data Management
ETL/DI Servers
Informatica Data Transformation Engine on dedicated DI servers
Data Connectivity, Data Integration, Data Masking, Data Quality, Data Governance
YARN, HDFS, MapReduce, Hive on MapReduce, Tez, Spark Core
Cluster Aware
Hive on Tez, Spark, Blaze
Hadoop Cluster
• Visual development interface accelerates
developer productivity
• Near universal data connectivity
• Complex data parsing on Hadoop
• Data profiling on Hadoop
• High-speed data ingestion and extraction
• Process and deliver data at scale on
Hadoop
• Dynamic schemas and mapping
templates
• Data Quality and Data Governance on
Hadoop
Take Big Data Management to the Next Level
Improving developer productivity: Dynamic Mappings
Re-use PowerCenter & SQL logic
Automatically profit from new technologies and choose the best option: Smart Optimizer
MapReduce, Spark, Blaze
Generic source, rule-based logic, generic target
Informatica Intelligent Streaming
• Streaming analytics capability in the Intelligent Data Platform
• Unified UI with multiple engines under the covers
• Frictionless conversion or extension of batch mappings into a streaming context
• Abstracted from the runtime framework
Collect, ingest, and process data in real time and streaming
Realtime source, Window transformation, Realtime target
Spark Streaming code generated
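To make the "Window transformation" step concrete, here is a minimal sketch of a tumbling (fixed-size, non-overlapping) window count in plain Python. It is only an illustration of the windowing idea, not the Spark Streaming code that the product generates. The function name `tumbling_window_counts`, the `(timestamp, key)` event shape, and the sensor names are assumptions made for this example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed windows and count per key.

    Hypothetical helper for illustration; not part of any Informatica API.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Each event belongs to exactly one window, identified by its start time.
        start = (timestamp // window_seconds) * window_seconds
        windows[start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Made-up events: (seconds since start, source id)
events = [(0, "sensor-a"), (4, "sensor-b"), (9, "sensor-a"), (12, "sensor-a")]
result = tumbling_window_counts(events, window_seconds=10)
```

With a 10-second window, the first three events fall into the window starting at 0 and the last one into the window starting at 10. A real streaming engine would emit each window as it closes instead of collecting everything first.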
Intelligent Data Lake: Deep Dive
Who?
Data Analyst / Scientist
Prepare & Publish
Search & Discover
Share and Collaborate
Intelligent Data Lake
How?
Applications & Databases, Internet of Things, 3rd Party Data, Data Modeling Tools, BI Tools, Custom, Cloud
Data Access & Metadata Connectivity
Intelligent Metadata Foundation: Catalog, Classify, Index
Data Lineage, Data Relationships, Smart Domains, Data Profile
Data Discovery & Analysis Process: Recommend, Discover, Collaborate, Publish, Operationalize/Monitor, Prepare
Data Analyst / Scientist
Intelligent Data Lake
Terminology
Data Asset - Data you work with as a unit.
Project - A project contains data assets and worksheets.
Recipe - The steps taken to prepare data in a worksheet.
Data Publication - The process of making prepared data available in the data lake.
Data Preparation - The process of combining, cleansing, transforming, and structuring data from one or more data assets so that it is ready for analysis.
Intelligent Data Lake
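The terminology above can be read as a small data model. The following sketch is one possible way to express it with Python dataclasses. The class and field names are assumptions for illustration and do not reflect the product's internal schema. In particular, tying each worksheet to exactly one data asset is a simplification made here.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    """Data you work with as a unit (for example, one table or file)."""
    name: str


@dataclass
class Recipe:
    """The ordered steps taken to prepare data in a worksheet."""
    steps: list[str] = field(default_factory=list)


@dataclass
class Worksheet:
    """Assumed shape: a working view of one data asset plus its recipe."""
    asset: DataAsset
    recipe: Recipe = field(default_factory=Recipe)


@dataclass
class Project:
    """A project contains data assets and worksheets."""
    name: str
    assets: list[DataAsset] = field(default_factory=list)
    worksheets: list[Worksheet] = field(default_factory=list)

    def add_asset(self, asset: DataAsset) -> Worksheet:
        # Adding an asset opens a worksheet for it (an assumption of this sketch).
        self.assets.append(asset)
        sheet = Worksheet(asset)
        self.worksheets.append(sheet)
        return sheet


# Hypothetical project and asset names.
project = Project("churn-analysis")
sheet = project.add_asset(DataAsset("customers"))
sheet.recipe.steps.append("filter: country == 'DE'")
```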
Search and Discovery
Data discovery through a powerful search engine to find relevant data
Semantic search
Facet filtering by asset, resource type, latest, size, custom attributes…
Data Asset Overview
Overview with asset attributes and integrated profiling stats
Asset attributes collected from the source system
Asset attributes enriched by users to add business context
Column profiling stats, including null/unique/duplicate percentages, inferred data types, and data domains
Detailed stats include value and pattern distributions
Add a data asset to a project from any exploration view
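As an illustration of the column profiling statistics listed above, the sketch below computes null, unique, and duplicate percentages together with value and pattern distributions for one column. It is a simplified stand-in, not the catalog's profiling engine. The function names are hypothetical, and the definitions used here are assumptions: "unique" means a value that occurs exactly once, "duplicate" means a non-null value that occurs more than once, and a pattern replaces digits with `9` and letters with `A`.

```python
import re
from collections import Counter


def to_pattern(value: str) -> str:
    """Reduce a value to its shape: digits become 9, letters become A."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", shape)


def profile_column(values):
    """Return null/unique/duplicate percentages plus value and pattern counts."""
    total = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    unique = sum(1 for c in counts.values() if c == 1)
    duplicate = len(present) - unique

    def pct(n):
        return 100.0 * n / total if total else 0.0

    return {
        "null_pct": pct(total - len(present)),
        "unique_pct": pct(unique),
        "duplicate_pct": pct(duplicate),
        "value_distribution": counts,
        "pattern_distribution": Counter(to_pattern(str(v)) for v in present),
    }


# Made-up postal-code column with one null and one odd value.
stats = profile_column(["10115", "80331", "10115", None, "AB-12"])
```

A pattern distribution like this is what makes outliers visible: four values share the shape `99999` or repeat it, while `AA-99` stands out as a likely data quality issue.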
Business Glossary Integration
View Business Glossary assets like Terms, Policies, and Categories in the catalog
View and navigate to related technical and business assets in the catalog
Data Lineage
Interactively trace data origin through summarized lineage views for analysts
Use Lineage and Impact Sliders to drill down to
desired lineage levels on either side of the seed
object.
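The two sliders can be thought of as depth limits on a graph walk around the seed object: one limit for upstream lineage, one for downstream impact. The sketch below shows that idea with a breadth-first search over a small dictionary of data flows. The function names, the `flows` structure, and the asset names are made up for illustration. The catalog itself stores relationships in a graph database, which this sketch does not attempt to model.

```python
from collections import deque


def walk(edges, seed, depth):
    """Breadth-first walk from seed, following edges at most `depth` hops."""
    seen, queue = {seed: 0}, deque([seed])
    while queue:
        node = queue.popleft()
        if seen[node] == depth:
            continue  # slider limit reached on this branch
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    del seen[seed]
    return seen  # asset name -> distance from the seed


def lineage_view(downstream, seed, lineage_depth, impact_depth):
    """Return upstream (lineage) and downstream (impact) assets around a seed."""
    upstream = {}
    for src, targets in downstream.items():
        for tgt in targets:
            upstream.setdefault(tgt, []).append(src)
    return {
        "lineage": walk(upstream, seed, lineage_depth),
        "impact": walk(downstream, seed, impact_depth),
    }


# Hypothetical data flows: source asset -> assets derived from it.
flows = {
    "crm.raw": ["crm.clean"],
    "crm.clean": ["customer_360"],
    "customer_360": ["churn_report", "loyalty_model"],
}
view = lineage_view(flows, "customer_360", lineage_depth=1, impact_depth=1)
deeper = lineage_view(flows, "customer_360", lineage_depth=2, impact_depth=0)
```

Raising `lineage_depth` from 1 to 2 pulls `crm.raw` into the view, which mirrors moving the lineage slider one step further from the seed.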
Relationship View
Shows ecosystem of the asset in the enterprise based on association to other assets
Get a 360-degree view of a data asset using the relationship view. Includes related tables, views, domains, reports, users, etc.
Ability to zoom, find specific assets in the view, and filter by asset type
Expand relationship circles to get more details on relationship types and objects.
Data Preparation continued…
Excel-based data preparation on Sample data
New formula definition with type-ahead
Large number of functions available for all types of data: string, numeric, date, statistical, math, etc.
Advanced functionality such as Join, Merge, Aggregate, Filter, Sort, etc.
New values are calculated and shown right away
Data Preparation continued…
Excel-based data preparation on Sample data
Column-level summary
Column value distributions
Column-level suggestions
Data preparation steps captured as a “Recipe”
Data Publication
Execution of data preparation steps on actual data using an Informatica mapping
Publish the output of data preparation steps back to the lake
Recipe steps are translated into an Informatica mapping
The Informatica mapping is handed over to the BDM platform for execution on the actual data sources
The BDM platform uses MapReduce, Blaze, or Spark to execute the mapping
The mapping is available to ETL specialists to open in the Informatica Developer tool to operationalize
The user's credentials are used to access the underlying database.
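The core idea of publication, recording preparation steps on a sample and then replaying the same steps on the full data, can be sketched in a few lines of plain Python. This is an analogy only: in the product the recipe is translated into an Informatica mapping and executed by BDM on MapReduce, Blaze, or Spark. The helper names `filter_rows`, `add_column`, and `run_recipe` and the sample rows are assumptions made for this example.

```python
def filter_rows(column, value):
    """Recipe step: keep only rows where `column` equals `value`."""
    return lambda rows: [r for r in rows if r[column] == value]


def add_column(name, fn):
    """Recipe step: add a derived column computed from each row."""
    return lambda rows: [{**r, name: fn(r)} for r in rows]


def run_recipe(recipe, rows):
    """Apply each recorded step in order and return the prepared rows."""
    for step in recipe:
        rows = step(rows)
    return rows


# Steps captured while the analyst works on sample data.
recipe = [
    filter_rows("country", "DE"),
    add_column("revenue", lambda r: r["units"] * r["price"]),
]

# Made-up data: the sample is just the first two rows of the full set.
full_data = [
    {"country": "DE", "units": 3, "price": 10},
    {"country": "FR", "units": 1, "price": 20},
    {"country": "DE", "units": 2, "price": 5},
]
sample = full_data[:2]

preview = run_recipe(recipe, sample)       # interactive preparation on a sample
published = run_recipe(recipe, full_data)  # same steps replayed on all rows
```

Because the recipe is data (an ordered list of steps) and not a one-off manual edit, it can be handed to a different execution engine without the analyst redoing the work.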
Organizations need ONE solution that helps them…
Easily Find & Catalog Data & Discover Relationships
Rapidly Prepare & Share Data Exactly When It Is Needed
Get Instant Access to Trusted & Secure Data for Advanced Analytics
Ingest, Cleanse, Integrate & Protect Data at Scale
Forrester Wave™: Big Data Fabric, Q4 ’16
Questions?

Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self Service for Data Analysts"


Editor's Notes

  • #3: If your customer thinks of Informatica as an ETL company, this is a chance to change their perception. We are the #1 leader in 6 important data categories:
      - Cloud data management: we have a full portfolio of data management services for all the major cloud ecosystems, either cloud-only or hybrid.
      - Data integration: our bread and butter. We have been the best at it for a long time and we continue to set the bar.
      - Big Data Management: we are the leader in data management for Big Data platforms. We work closely with all the major Hadoop and NoSQL ecosystems and with the latest Big Data technologies like Spark.
      - Master Data Management: we are the leader in MDM for customer data and any other data that is important to their business. Our secret sauce is our matching engine, ability to discover relationships, and scalability. We can do this on any data platform, either on-premise or in the cloud.
      - Data Quality: we are setting the bar in DQ, whether it is for stand-alone initiatives like data governance or for embedding data quality into their business processes.
      - Data Security: we are pioneering a new approach in security. Security remains an unsolved problem, and we can address it at the data level.
  • #4: Most organizations are building out some version of a data lake or enterprise data hub concept. They want to get all their data into one place for the next generation of analytics and to give everyone access to information. These lakes are usually divided into multiple types of zones.
  • #6: To serve these market trends, Informatica developed a Big Data solution that addresses each of them.
      - The EIC module helps people understand the data they are looking at by providing context.
      - The IDL module allows the business to be more self-service by providing self-service data preparation capabilities, yet also helps IT operationalize the data preparation steps at scale in a managed and governed way.
      - Secure@Source gives insight into potential risks around privacy-sensitive data by showing where this data is located, how it proliferates across the data lake (and surrounding applications), and what the associated risks are.
      - Big Data Management helps customers ingest, parse, cleanse, integrate, and deliver big data at scale.
      - Intelligent Streaming allows processes to use real-time and streaming data sources.
    All this functionality is built as part of the Intelligent Data Platform, where we try to use as many open source tools as possible, leveraging the power of the ecosystem. We use HBase to store different types of metadata, Titan as a graph database to store the relationship information between data assets, Spark (including Spark Streaming) and Blaze to process data at scale, Kafka as a high-speed data transfer mechanism, and Solr to index metadata so it can be searched using a Google-like search interface.
  • #7: The Enterprise Information Catalog (EIC) application allows business users to quickly find all information about the collection of data assets in their data lake. Since EIC can leverage the metadata provided by Cloudera Navigator, we can even show the Hive/Impala scripts and Pig scripts that are being used to process data.
  • #8: Intelligent data lake provides capabilities to enable business users to do data preparation
  • #9: Secure@Source gives insight into sensitive data risks.
  • #10: Secure@Source gives insight into sensitive data risks.
  • #11: Dynamic Mappings: Build a template once and automate mapping execution for thousands of sources with different schemas. The mapping adjusts itself dynamically to external schema changes and column characteristics. It can process flat files whose column order (a,b,c or c,a,b) and number of columns change. Re-use PowerCenter and SQL logic: Many customers have existing investments in traditional PowerCenter and/or SQL scripts. To allow re-use of these components, Informatica provides capabilities to migrate existing PowerCenter logic to run in Hadoop and to convert existing SQL code to Big Data mapping logic that can be executed at scale. Smart Optimizer: A built-in mapping optimizer automatically tunes and re-arranges the mapping for high performance, using early selection, early projection, mapping pruning, semi-join, join re-ordering, etc. It offers automatic partitioning support based on statistics and other heuristics, and advanced full pushdown optimization support, including data ship join.
  • #12: Intelligent Streaming aims to bring the following capabilities into the Informatica Platform: real-time data ingestion from streaming data sources; rule evaluation and event triggering on a real-time data stream; and real-time data integration (complex transforms, lookups, joins, etc.).
  • #15: Data Stewards are responsible for strategically managing data assets in the data lake and the enterprise, ensuring high levels of data quality, integrity, availability, trustworthiness, and data security while emphasizing the business value of data. By building a catalog, classifying metadata and data definitions, maintaining technical and business rules, and monitoring data quality, data stewards ensure that data in the lake is consistent for use in the discovery zone and enterprise zone. As the inventory of technical and business metadata is established and data sets become available, data architects must design a robust, scalable data lake architecture that meets the business goals of the marketing data lake.
  • #16: Before we dive into the demo, let's look at some terminology. I will be using these terms quite a bit in the demo. Data Lake: A data lake is a centralized repository of large volumes of structured and unstructured data. A data lake can contain different types of data, including raw data, refined data, master data, transactional data, log file data, and machine data. In Intelligent Data Lake, the data lake is a Hadoop cluster. Data Asset: A data asset is data that you work with as a unit. Data assets can include items such as a flat file, table, or view. A data asset can include data stored in or outside the data lake. Project: A project is a container that stores data assets and worksheets. Data Preparation: The process of combining, cleansing, transforming, and structuring data from one or more data assets so that it is ready for analysis. Recipe: A recipe includes the list of input sources and the steps taken to prepare data in a worksheet. Data Publication: Data publication is the process of making prepared data available in the data lake. When you publish prepared data, Intelligent Data Lake writes the transformed input source to a Hive table in the data lake. Other analysts can add the published data to their projects and create new data assets, or they can use a third-party business intelligence or advanced analytics tool to run reports that further analyze the published data.
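The recipe idea from the terminology above (a list of input sources plus the ordered steps applied to them) can be illustrated with a small sketch. This is not Intelligent Data Lake's API: the Recipe class, the step functions, and the customers.csv source name are all hypothetical, and the sample rows are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]


@dataclass
class Recipe:
    """Hypothetical model of a recipe: input sources plus ordered steps."""
    input_sources: List[str]
    steps: List[Step] = field(default_factory=list)

    def add_step(self, step: Step) -> "Recipe":
        # Steps are recorded in order so the preparation can be replayed.
        self.steps.append(step)
        return self

    def run(self, rows: List[Row]) -> List[Row]:
        # Apply each recorded step to the output of the previous one.
        for step in self.steps:
            rows = step(rows)
        return rows


def trim_names(rows: List[Row]) -> List[Row]:
    # Cleansing step: strip surrounding whitespace from the "name" column.
    return [{**r, "name": r["name"].strip()} for r in rows]


def drop_missing_email(rows: List[Row]) -> List[Row]:
    # Filtering step: keep only rows with a non-empty "email" value.
    return [r for r in rows if r.get("email")]


# Illustrative data standing in for a data asset (assumption, not real data).
raw = [
    {"name": "  Ada ", "email": "ada@example.com"},
    {"name": "Bob", "email": ""},
]

recipe = Recipe(input_sources=["customers.csv"])
recipe.add_step(trim_names).add_step(drop_missing_email)
prepared = recipe.run(raw)
```

Because the steps are stored rather than applied once by hand, the same recipe can be replayed on new data. In the product, publication would then write the prepared result to a Hive table; that part is left out of this sketch.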