Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self Service for Data Analysts"

Informatica Intelligent Data Lake
Self Service for Data Analysts
February 2017
Sören Eickhoff
Sales Consultant Central Europe
SEickhoff@informatica.com

#1 in 6 Data Categories …
• Data Security
• Cloud Data Management
• Big Data Management
• Data Integration
• Master Data Management
• Data Quality

Use Case: Data Lake / Data Platform Reference Architecture
Data Platform
Data Lake
Machine, Device, Cloud; Documents and Emails; Relational, Mainframe; Social Media, Web Logs
Landing Zone
Structured and unstructured enterprise and external data is landed in its raw form, normalized and ready for use
Discovery Zone
User sandbox for self-serve access to data for exploration, data blending, hypothesis testing, analytics, and collaboration
Production Zone
Sanitized transactional, master, and reference data & enriched data models certified for enterprise use
Data Analyst, Data Scientist, Business, Data Steward, Data Modeler, Data Engineer
Improve Predictive Maintenance; Increase Operational Efficiency; Increase Customer Loyalty; Reduce Security Risk; Improve Fraud Detection

Challenges Faced by the Business and IT Today
Data Analysts
• Can’t easily find trusted data
• Limited access to the data
• Frustrated by slow response from IT due to long backlog
• Constrained by disparate desktop tools, manual steps
• No way to collaborate, share, and update curated datasets
IT
• Can’t cope with growing demand from the business
• No visibility into what the business is doing with the data
• Struggling to deliver value to the business
• Losing the ability to govern and manage data as an asset

Informatica Data Lake Management
• Enterprise Information Catalog
• Intelligent Data Lake
• Secure@Source
• Big Data Management
• Intelligent Streaming
TITAN, Blaze
Live Data Map (metadata integration)
Big Data Management (data integration)
Data Architect / Steward, Data Scientist / Analyst, InfoSec Analyst, Data Engineer

Enterprise Information Catalog
Unified view into enterprise information assets
• Business-user oriented solution
• Semantic search with dynamic facets
• Detailed Lineage and Impact Analysis
• Business Glossary Integration
• Relationships discovery
• High level data profiling
• Automatic Classifications with Data domains
• Business classifications with Custom Attributes
• Broad metadata source connectivity
• Big data scale

Intelligent Data Lake
Self-service data preparation with collaborative data governance
• Collaborative project workspaces
• Automated data ingestion
• Search data asset catalog
• Rapid blend of datasets
• Crowd-sourced data asset tagging & data sharing
• Automated data asset discovery & recommendations
• Rapid ‘industrialization’ of preparation steps into re-usable workflows
• Complete tracking of usage, lineage, and security
• Easily support Data Discovery Platforms

Secure@Source
Enterprise-wide visibility into sensitive data risks
• Sensitive data classification & discovery
• Sensitive data proliferation analysis
• Who has access to sensitive data
• User activity on sensitive data
• Sensitive Data policy-based alerting
• Multi-factor risk scoring
• Identification of highest risk areas
• Integrates data security information from 3rd parties:
  - Data stores, owner, classification
  - Protection status
  - User access info (LDAP, IAM) and activity logs (DB, Hadoop, Salesforce, DAM)

Informatica Big Data Management
Easily integrate more data faster from more data sources
[Architecture diagram. Labels: Smart Executor; ETL/DI Servers (Informatica Data Transformation Engine on dedicated DI servers); Data Connectivity, Data Integration, Data Masking, Data Quality, Data Governance; Hadoop Cluster (YARN, HDFS, MapReduce, Hive on MapReduce, Tez, Spark Core, Cluster Aware, Hive on Tez, Spark, Blaze)]
• Visual development interface accelerates developer productivity
• Near universal data connectivity
• Complex data parsing on Hadoop
• Data profiling on Hadoop
• High-speed data ingestion and extraction
• Process and deliver data at scale on Hadoop
• Dynamic schemas and mapping templates
• Data Quality and Data Governance on Hadoop

Take Big Data Management to the Next Level
• Improving developer productivity: Dynamic Mappings, re-use of PowerCenter & SQL logic
  - Generic source, generic target, rule-based logic
• Automatically profit from new technologies and choose the best option: Smart Optimizer
  - MapReduce, Spark, Blaze

Informatica Intelligent Streaming
Collect, ingest and process data in real time and streaming
• Streaming analytics capability in the Intelligent Data Platform
• Unified UI with multiple engines under the covers
• Frictionless integration: conversion/extension of batch mappings into streaming context
• Abstracted from runtime framework
Mapping example: Realtime source, Window transformation, Realtime target; Spark Streaming code generated
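
The slide names a window transformation between a realtime source and a realtime target but does not say how the window is defined. As a rough illustration only, the sketch below assumes a fixed-size, non-overlapping (tumbling) window and sums values per key in plain Python. The `Event` type, the field names, and the 10-second window are hypothetical; this is not Informatica or Spark Streaming code.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    timestamp: float  # seconds since stream start (hypothetical unit)
    key: str          # e.g. a device or sensor id (hypothetical)
    value: float


def tumbling_window_sums(events: Iterable[Event], window_seconds: float) -> dict:
    """Sum values per (window start, key) over fixed, non-overlapping windows."""
    sums = defaultdict(float)
    for event in events:
        # Every timestamp falls into exactly one window of length window_seconds.
        window_start = (event.timestamp // window_seconds) * window_seconds
        sums[(window_start, event.key)] += event.value
    return dict(sums)


events = [
    Event(1.0, "sensor-a", 2.0),
    Event(4.0, "sensor-a", 3.0),
    Event(11.0, "sensor-a", 5.0),
    Event(12.0, "sensor-b", 1.0),
]
window_sums = tumbling_window_sums(events, window_seconds=10.0)
```

A real streaming engine would also emit results incrementally and handle late or out-of-order events, which this batch-style sketch leaves out.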

Intelligent Data Lake – Deep Dive

Who?
Data Analyst / Scientist
• Prepare & Publish
• Search & Discover
• Share and Collaborate

How?
Applications & Databases, Internet of Things, 3rd Party Data, Data Modeling Tools, BI Tools, Custom, Cloud
Data Access & Metadata Connectivity
Intelligent Metadata Foundation: Catalog, Classify, Index
Data Lineage, Data Relationships, Smart Domains, Data Profile
Data Discovery & Analysis Process: Recommend, Discover, Collaborate, Publish, Operationalize/Monitor, Prepare
Data Analyst / Scientist

Terminology
Data Asset
- Data you work with as a unit.
Project
- A project contains data assets and worksheets.
Recipe
- The steps taken to prepare data in a worksheet.
Data Publication
- The process of making prepared data available in the data lake.
Data Preparation
- The process of combining, cleansing, transforming, and structuring data from one or more data assets so that it is ready for analysis.

Search and Discovery
Data discovery through a powerful search engine to find relevant data
• Semantic search
• Facet filtering by asset, resource type, latest, size, custom attributes…
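
To make the idea of facet filtering concrete, here is a minimal sketch in plain Python. It stands in a simple substring match for the semantic search the slide mentions, and the `Asset` fields, the catalog entries, and the `search` and `facet_counts` helpers are all hypothetical, not the catalog's API.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    asset_type: str       # e.g. "Table", "Report" (hypothetical values)
    resource_type: str    # e.g. "Hive", "Oracle" (hypothetical values)
    size_mb: float
    custom: dict = field(default_factory=dict)  # user-defined attributes


def search(assets, text, **facets):
    """Substring match on the asset name, narrowed by exact-match facet values."""
    text = text.lower()
    hits = [a for a in assets if text in a.name.lower()]
    for facet, wanted in facets.items():
        # Built-in attributes first, custom attributes as a fallback.
        hits = [a for a in hits if getattr(a, facet, a.custom.get(facet)) == wanted]
    return hits


def facet_counts(assets, facet):
    """Count how many assets in a result set fall under each facet value."""
    return Counter(getattr(a, facet) for a in assets)


catalog = [
    Asset("customer_orders", "Table", "Hive", 120.0, {"owner": "sales"}),
    Asset("customer_master", "Table", "Oracle", 40.0, {"owner": "mdm"}),
    Asset("customer_churn_report", "Report", "Tableau", 2.0, {"owner": "sales"}),
]

tables = search(catalog, "customer", asset_type="Table")
sales_assets = search(catalog, "customer", owner="sales")
type_facets = facet_counts(search(catalog, "customer"), "asset_type")
```

Computing the facet counts from the current result set, rather than from the whole catalog, is what lets the offered facets change with each query.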

Data Asset Overview
Overview with asset attributes and integrated profiling stats
• Asset attributes collected from the source system
• Asset attributes enriched by users to add business context
• Column profiling stats including Null/Unique/Duplicate percentages, inferred data types and data domains
• Detailed stats include value and pattern distributions
• Add data asset to a project from any exploration view
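
The profiling statistics listed on this slide can be illustrated with a small sketch. The definitions used here are assumptions, since the slide does not define them: "unique" counts values that occur exactly once, "duplicate" counts values that occur more than once, all percentages are relative to the total row count, and a pattern replaces digits with 9 and letters with A. Function names and sample data are hypothetical.

```python
import re
from collections import Counter


def value_pattern(value):
    """Reduce a value to its shape: every digit becomes 9, every letter A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def infer_type(values):
    """Return the narrowest type (integer, decimal, string) that fits all values."""
    if not values:
        return "unknown"
    for caster, name in ((int, "integer"), (float, "decimal")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"


def profile_column(values):
    """Compute null/unique/duplicate percentages, inferred type, and distributions."""
    total = len(values)
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    unique = sum(1 for n in counts.values() if n == 1)

    def pct(part):
        return round(100.0 * part / total, 1) if total else 0.0

    return {
        "null_pct": pct(total - len(present)),
        "unique_pct": pct(unique),
        "duplicate_pct": pct(len(present) - unique),
        "inferred_type": infer_type(present),
        "value_distribution": counts.most_common(),
        "pattern_distribution": Counter(value_pattern(v) for v in present).most_common(),
    }


zip_codes = ["10115", "80331", "10115", None, "2000A", ""]
profile = profile_column(zip_codes)
```

In this sample the single value `2000A` is enough to push the inferred type from integer to string, and it shows up as its own entry in the pattern distribution, which is how pattern statistics help spot outliers.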

Business Glossary Integration
• View Business Glossary assets like Terms, Policies and Categories in the catalog
• View and navigate to related technical and business assets in the catalog

Data Lineage
Interactively trace data origin through summarized lineage views for analysts
• Use Lineage and Impact Sliders to drill down to desired lineage levels on either side of the seed object.
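
One way to picture the lineage and impact sliders is as two depth limits on a graph walk starting at the seed object: one walk upstream (lineage) and one downstream (impact). The sketch below assumes that reading; the edge list, asset names, and function names are hypothetical and say nothing about how the product stores or traverses lineage.

```python
from collections import deque

# Hypothetical lineage edges: source asset -> assets derived from it.
EDGES = {
    "crm.customers": ["staging.customers"],
    "web.clickstream": ["staging.sessions"],
    "staging.customers": ["lake.customer_360"],
    "staging.sessions": ["lake.customer_360"],
    "lake.customer_360": ["report.churn"],
    "report.churn": [],
}


def reverse(edges):
    """Flip every edge so that a walk moves towards the origins of the data."""
    flipped = {node: [] for node in edges}
    for source, targets in edges.items():
        for target in targets:
            flipped.setdefault(target, []).append(source)
    return flipped


def walk(edges, seed, max_depth):
    """Breadth-first walk returning {asset: hops from seed}, up to max_depth hops."""
    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distance[node] == max_depth:
            continue  # slider limit reached on this branch
        for neighbour in edges.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[seed]
    return distance


def lineage_view(seed, lineage_levels, impact_levels):
    """Return (upstream, downstream) assets within the two slider settings."""
    return walk(reverse(EDGES), seed, lineage_levels), walk(EDGES, seed, impact_levels)


upstream, downstream = lineage_view("lake.customer_360", lineage_levels=2, impact_levels=1)
```

Raising `lineage_levels` from 1 to 2 here extends the view from the staging tables back to the CRM and clickstream sources, which mirrors dragging the lineage slider one step further out.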

Relationship View
Shows ecosystem of the asset in the enterprise based on association to other assets
• Get a 360 degree view of a data asset using the relationship view. Includes related tables, views, domains and reports, users, etc.
• Ability to zoom, find specific assets in the view and filter by asset types
• Expand relationship circles to get more details on relationship types and objects.

Data Preparation continued…
Excel-based data preparation on sample data
• New formula definition with type-ahead
• Large number of functions available for all types of data: string, numeric, date, statistical, math, etc.
• Advanced functionality such as Join, Merge, Aggregate, Filter, Sort, etc.
• New values are calculated and shown right away

Data Preparation continued…
Excel-based data preparation on sample data
• Column level summary
• Column value distributions
• Column level suggestions
• Data preparation steps captured as “Recipe”

Data Publication
Execution of data preparation steps on actual data using an Informatica mapping
• Publish the output of data preparation steps back to the lake
• Recipe steps are translated into an Informatica mapping
• The Informatica mapping is handed over to the BDM platform for execution on actual data sources
• The BDM platform uses MapReduce, Blaze, or Spark to execute the mapping
• The mapping is available to ETL specialists to open in the Informatica Developer tool to operationalize
• The user's credentials are used to access the underlying database
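
The core idea on this slide, steps recorded on a sample and later executed on the actual data, can be sketched as an ordered list of functions that is replayed on a larger input. This is only an analogy in plain Python: the `Recipe` class, the step helpers, and the sample rows are hypothetical, and nothing here reproduces the translation into an Informatica mapping or execution on MapReduce, Blaze, or Spark.

```python
class Recipe:
    """An ordered list of named preparation steps that can be replayed on any rows."""

    def __init__(self):
        self.steps = []  # list of (description, function) pairs

    def add(self, description, step):
        self.steps.append((description, step))
        return self  # allow chaining

    def run(self, rows):
        for _description, step in self.steps:
            rows = step(rows)
        return rows


def filter_rows(predicate):
    """Build a step that keeps only rows matching the predicate."""
    return lambda rows: [row for row in rows if predicate(row)]


def add_column(name, formula):
    """Build a step that adds a computed column to every row."""
    return lambda rows: [{**row, name: formula(row)} for row in rows]


recipe = (
    Recipe()
    .add("Filter: keep rows with a country", filter_rows(lambda r: r["country"] is not None))
    .add("Formula: revenue = quantity * unit_price",
         add_column("revenue", lambda r: r["quantity"] * r["unit_price"]))
)

full_data = [
    {"country": "DE", "quantity": 2, "unit_price": 10.0},
    {"country": None, "quantity": 1, "unit_price": 5.0},
    {"country": "FR", "quantity": 3, "unit_price": 4.0},
]
sample = full_data[:2]

preview = recipe.run(sample)        # what the analyst sees while preparing
published = recipe.run(full_data)   # the same steps replayed on all rows
```

Because each step carries a description alongside its function, the same list can serve both as an audit trail for the analyst and as the input to whatever later turns the steps into an executable job.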

Organizations need ONE solution that helps them…
• Easily Find & Catalog Data & Discover Relationships
• Rapidly Prepare & Share Data Exactly When it is Needed
• Get Instant Access to Trusted & Secure Data for Advanced Analytics
• Ingest, Cleanse, Integrate & Protect Data at Scale

Forrester Wave™: Big Data Fabric, Q4 ’16
