Populating your Enterprise Data Hub for Next Gen Analytics
Presented by: Sushree Mishra, Senior Sales Engineer
August 2018
Agenda
• Company Overview
• Biggest Implementation Challenges
• Data Integration in Big Data
• Data Quality Functional Examples
• Demonstration
Trusted Industry Leadership
• 500+ experienced and talented data professionals
• >7,000 customers
• Founded 1968: 50 years of market leadership and award-winning customer support
• 84 of the Fortune 100 are customers
• 3x revenue growth in the last 12 months
The global leader in Big Iron to Big Data
Differentiated Product Portfolio & Technical Expertise
• Data Infrastructure Optimization: Best-in-class resource utilization and performance, on premise or in the cloud
• Data Availability: #1 in high availability for IBM i and AIX Power Systems
• Data Integration: Industry-leading mainframe data access and highest-performing ETL
• Data Quality: Market-leading data quality capability
• Trillium Software System
• Trillium Quality for Big Data
• Trillium Precise
• Trillium Cloud
• Trillium Global Locator
• Trillium Quality for Dynamics CRM
• DL/2
• Zen Suite
• MFX® for z/OS
• ZPSaver Suite
• EZ-DB2
• EZ-IDMS
• DMX & DMX-h
• DMX AppMod
• athene®
• athene SaaS®
• MIMIX Availability & DR
• MIMIX Move
• MIMIX Share
• iTera Availability
• Enforcive IBM i Security
• Ironstream®
• Ironstream® Transaction Tracing
• DMX & DMX-h
• DMX Change Data Capture
Big Iron to Big Data
A fast-growing market segment composed of solutions that optimize traditional data systems and
deliver mission-critical data from these systems to next-generation analytic environments.
Biggest Implementation Challenges
1. Data Quality: Assessing and improving the quality of data as it enters and/or resides in the data lake.
2. Skills/Staff: Teams need to learn a new set of skills; Hadoop programmers are difficult to find and/or expensive.
3. Data Governance: Including data lake in governance initiatives and meeting regulatory compliance.
4. Rapid Change: Frameworks and tools evolve fast, and it’s difficult to keep up with the latest tech.
5. Fresh Data (CDC): Difficult to keep data lake up-to-date with changes made on other platforms.
6. Mainframe: Difficult to move mainframe data in and out of Hadoop/Spark.
7. Data Movement: Difficult to move data in and out of Hadoop/Spark.
[Chart: Big Data Challenges – % of respondents who rated each a top challenge (1 or 2), scale 0–50%: Data Quality, Skills, Governance, Rapid Change, CDC, Mainframe, Data Movement, Cost, Connectivity, Uncertainty]
Data Integration
in Big Data
Offload Data and ELT Workloads out of Legacy DW
Before: Data Sources → ETL → Data Warehouse (ETL and ELT workloads run inside the warehouse) → Analytic Query & Reporting → Business Intelligence

After: Data Sources → DMX-h ETL → Data Warehouse → Analytic Query & Reporting → Business Intelligence
Simplify: Design Once, Deploy Anywhere
• Use existing ETL skills
• No need to worry about mappers, reducers, big side or small side of joins, etc.
• Automatic optimization for best performance, load balancing, etc.
• No changes or tuning required, even if you change execution frameworks
• Future-proof job designs for emerging compute frameworks, e.g. Spark 2
• Run multiple execution frameworks in a single job
Single GUI – Execute Anywhere
Syncsort Confidential and Proprietary - do not copy or distribute
Intelligent Execution - insulate your organization from the underlying complexities of Hadoop.
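The deck does not show what "design once, deploy anywhere" looks like in code, so here is an illustrative sketch of the underlying idea (not the DMX-h product API): a job is defined as an ordered list of framework-agnostic transforms, and a pluggable backend decides how to execute them at run time. The `Job` and `run_local` names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

@dataclass
class Job:
    """A framework-agnostic job: an ordered list of transforms over records."""
    steps: List[Callable[[Iterable[dict]], Iterable[dict]]] = field(default_factory=list)

    def then(self, step):
        self.steps.append(step)
        return self

def run_local(job: Job, records: Iterable[dict]) -> list:
    """'Local' backend: apply each transform in-process.

    A second backend (e.g. one targeting Spark or MapReduce) would translate
    the same `job.steps` into that framework's primitives -- the job design
    itself would not change."""
    data = list(records)
    for step in job.steps:
        data = list(step(data))
    return data

# Design once...
job = (Job()
       .then(lambda rs: (r for r in rs if r["amount"] > 0))                    # filter
       .then(lambda rs: ({**r, "amount_doubled": r["amount"] * 2} for r in rs)))  # map

# ...then pick an execution backend at run time.
result = run_local(job, [{"amount": 5}, {"amount": -2}])
```

The point of the pattern is that swapping `run_local` for another backend requires no change to `job`, which mirrors the "no changes or tuning required, even if you change execution frameworks" claim above.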
9
High Performance ETL Architecture (DMX-h)
The DMX-h engine is installed on the workstation, the edge node, and all cluster nodes. The engine is invoked as an executable only when a job is submitted.
The Job editor and Task editor used to design DMX-h jobs are installed only on the Windows workstation. These editors can connect to local or remote DMX-h agents.
The DMX-h agent is a daemon that runs only on the edge node; it serves requests from the DMX-h GUI editors.
Job Execution Choices
Edge Node Single Node in Cluster Cluster
A quick refresher on DMX DataFunnel
DMX DataFunnel™
• Funnels hundreds of tables at once into your data lake or RDBMS
‒ Extract, map and move whole DB schemas in one invocation
‒ Extract from DB2, Oracle, Teradata, Netezza, S3, Redshift …
‒ To SQL Server, Postgres, Hive, Redshift and HDFS
‒ Automatically create target tables
• Process multiple funnels in parallel on edge node or data nodes
‒ Leverages the DMX-h high performance data processing engine
• Filter unwanted data before extraction
‒ Data type filtering
‒ Table, record or column exclusion / inclusion
• In-flight transformations and cleansing
‒ Append strings to target table names
‒ Transform columns based on their data types
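DataFunnel itself is driven through its own interface rather than code, but the behavior the bullets describe can be sketched in plain Python: copy many tables in parallel, apply include/exclude filters before extraction, and append a suffix to target table names. The `funnel` function and the in-memory "tables" are stand-ins for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

def funnel(source: dict, include=None, exclude=(), suffix="", max_workers=4) -> dict:
    """Copy many 'tables' (name -> list of rows) in parallel.

    `include`/`exclude` mimic table-level inclusion/exclusion filters;
    `suffix` mimics appending a string to target table names."""
    names = [n for n in source
             if (include is None or n in include) and n not in exclude]

    def copy_table(name):
        # A real funnel would extract from a source DB and load a target
        # (RDBMS, Hive, HDFS, ...); here the "copy" just materializes rows.
        return name + suffix, list(source[name])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(copy_table, names))

src = {"customers": [{"id": 1}], "orders": [{"id": 10}], "tmp_scratch": []}
target = funnel(src, exclude={"tmp_scratch"}, suffix="_stg")
```

The thread pool here plays the role of the "process multiple funnels in parallel" bullet: each table copy is independent, so they can all run concurrently.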
DMX-h Increases Business Agility at IHG with Up-To-Date Data
Business Challenge
• Create an analytics platform that standardizes data ingestion from over 5,000 properties globally
• Enable real-time updates as inventory changes
• Provide more timely access to room availability, inventory and other hotel data from all global properties
• Regularly update Property Policy information; reports with stale data can lead to incorrect analysis
• Existing processes refreshed data infrequently, at most once a day

Solution
• Property information and house policy data sent via Kafka topics
• Hortonworks Hadoop cluster on Google Cloud Platform to access and integrate property and policy data
• Syncsort DMX-h is the only solution that integrates Kafka, Google Cloud Platform, Spark and the existing EDW
• DMX-h ingests 30 different types of JSON Kafka messages every 30 minutes and writes to HDFS
• DMX-h transforms the dataset and loads it to the EDW as well as ORC files in a Google bucket

Benefit
• Simplicity – the entire process is visually depicted in DMX-h jobs, making it easy to understand
• Time-to-Value – Syncsort DMX-h drastically reduced development and maintenance times
• Future Proofing – DMX-h will allow IHG to move seamlessly to Spark when ready

Business Value
• Insight – up-to-date data results in better business decisions
• Agility – ability to respond quickly based on current and comprehensive information across the portfolio
• Reduced Risk – the modern data architecture allows IHG to easily develop and maintain the data pipeline with minimal effort
IHG is a global organization with a broad portfolio of hotel brands. IHG franchises, leases, manages
or owns more than 5,000 hotels and 742,000 guest rooms in almost 100 countries,
with nearly 1,400 hotels in its development pipeline. IHG uses cutting edge technologies to take
advantage of the value inherent in their data – including inventory, booking and membership details.
Prior to DMX-h, data could only be refreshed once a day – with DMX-h, the Data Warehouse is refreshed every 30 minutes!
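The heart of the IHG pipeline is a recurring micro-batch: read typed JSON messages from Kafka, group them, and land them as files every 30 minutes. A hedged, minimal sketch of that shape follows; it uses plain Python with local files standing in for HDFS, and an iterable of raw strings standing in for a Kafka consumer (the function name and file naming are invented).

```python
import json
from collections import defaultdict
from pathlib import Path

def write_micro_batch(messages, out_dir, batch_id):
    """Group raw JSON messages by their 'type' field and write one
    newline-delimited JSON file per message type for this batch.

    Stand-ins: `messages` would come from Kafka topics, and `out_dir`
    would be an HDFS path in the real pipeline."""
    by_type = defaultdict(list)
    for raw in messages:
        msg = json.loads(raw)
        by_type[msg.get("type", "unknown")].append(msg)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for msg_type, msgs in by_type.items():
        path = out / f"{msg_type}_batch{batch_id}.jsonl"
        path.write_text("\n".join(json.dumps(m) for m in msgs))
        paths.append(path)
    return paths
```

A scheduler (every 30 minutes, per the case study) would call this with the messages consumed since the previous batch; a downstream step would then transform the landed files and load the EDW.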
Data Quality in
Big Data
Trillium Software Product Portfolio
Real-time Applications
Trillium Software System
On Premise or via Trillium Cloud
Deploy any or all products to the cloud
Completely managed SaaS in AWS or Azure deployed in 30 days or less
TS Discovery 15.7
Automated data profiling and discovery tool that
identifies data quality issues, facilitates business
rule management, and provides data quality
metrics
TS Quality 15.7, Series 7
Data quality engine that provides data cleansing,
matching, and enrichment for multi-domain, global
data (including global address validation)
Global Locator 15.7
Geolocation tool that standardizes and validates
address data and assigns corresponding latitude
and longitude coordinates
Trillium Precise
Data enrichment, validation, and verification
services including global postal addresses, email,
phone, and internet connectivity
Trillium Solutions
CRM, ERP, MDM
Customized solutions for leading platforms:
• Trillium for Microsoft Dynamics CRM 2.2
• Trillium for SAP ERP
• Trillium for SAP MDG 1.1
• Trillium for Oracle/Siebel
TS Director 15.7
Enables real-time, secure data quality within any
application
TSI Web Services 15.7
TS Web Services allows you to send data to TSS for cleansing (formatting and enhancing) and matching (identifying potential duplicates) using industry-standard SOAP requests.
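Since the service speaks standard SOAP, a request can be built with nothing beyond the standard library. The sketch below assembles a SOAP 1.1 envelope for a hypothetical cleanse call; the service namespace and element names (`CleanseRequest` and the field elements) are invented placeholders, not the actual TSS schema, which its WSDL defines.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace -- the real TSS WSDL defines its own.
SVC_NS = "http://example.com/tss/cleanse"

def build_cleanse_request(record: dict) -> bytes:
    """Wrap a flat record in a SOAP 1.1 envelope as a cleanse request."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    req = ET.SubElement(body, f"{{{SVC_NS}}}CleanseRequest")
    for field_name, value in record.items():
        field_el = ET.SubElement(req, f"{{{SVC_NS}}}{field_name}")
        field_el.text = value
    return ET.tostring(envelope, xml_declaration=True, encoding="utf-8")

payload = build_cleanse_request({"Name": "jOhn smiTh", "City": "boston"})
```

The resulting bytes would be POSTed to the service endpoint with the appropriate `SOAPAction` header; the matching call would follow the same envelope pattern with a different request element.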
The Data Quality Process Delivers Trusted Data
Data Discovery (Trillium Discovery)
• Data Profiling
• Business Rules & Data Quality Assessment

Data Quality Processing (Trillium Quality; Trillium Quality for Big Data + Global Address Verification)
• Data Validation, Standardization, Matching & more
• Data Verification & Enrichment

Downstream uses:
• Operational Integrations (CRM, Customer 360)
• Analytics & Reporting
• Data Governance
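To make the validate → standardize → match stages concrete, here is a deliberately toy illustration on contact records. The rules below (ZIP format check, street-abbreviation normalization, name+ZIP match key) are invented for the example; Trillium implements these stages with far richer, locale-aware logic.

```python
import re

def validate(record):
    """Flag records that fail basic completeness/format checks."""
    ok = bool(record.get("name")) and bool(re.fullmatch(r"\d{5}", record.get("zip", "")))
    return {**record, "valid": ok}

def standardize(record):
    """Normalize casing and a couple of common street abbreviations."""
    addr = record.get("address", "").upper()
    addr = addr.replace(" STREET", " ST").replace(" AVENUE", " AVE")
    return {**record, "name": record.get("name", "").title(), "address": addr}

def match_key(record):
    """Crude match key: name + ZIP. Real matching is fuzzy and weighted."""
    return (record["name"].lower(), record.get("zip"))

def dedupe(records):
    """Run the pipeline, keeping the first survivor per match key."""
    seen = {}
    for r in map(standardize, map(validate, records)):
        seen.setdefault(match_key(r), r)
    return list(seen.values())

rows = [
    {"name": "jane doe", "address": "1 Main Street", "zip": "02101"},
    {"name": "JANE DOE", "address": "1 Main St",     "zip": "02101"},
]
golden = dedupe(rows)
```

Standardizing before matching is what makes the two variant spellings collapse to one record: matching on the raw values would have missed the duplicate.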
Trillium Data Quality for Big Data:
Run quality processes directly within Hadoop
“Design once, deploy anywhere”
• Visually design data quality jobs once and run anywhere (MapReduce,
Spark, Linux, Unix, Windows; on premise or in the cloud)
• Use-case templates to fast-track development
• Test & debug locally in Windows/Linux; deploy to Big Data
• Intelligent Execution dynamically optimizes data processing at run-time
based on the chosen compute framework; no changes or tuning required
Benefit: Significantly reduce manual data preparation
• Major time sink for data scientists, architects and analysts
• Risk of inconsistent or incomplete data preparation
Benefit: Significantly increase trust in data
• Major time sink for executives
• Risk of poor data-based business decisions
Single GUI – Execute Anywhere!
Trillium Quality for Big Data – Execution
Architecture
TSS Control Center GUI: simply click to publish the project to be run in Hadoop.
The tsqbd utility processes the exported project, generating a TQBD job to run locally on the Linux edge node; local execution is used for Dev and QA.
The tsqbd utility can likewise generate a TQBD job to run on MapReduce or Spark.
Each map and reduce task executes the job by invoking the DMX-h engine (which in turn invokes the TSQ engine) as a child process within the JVM.
The DMX-h engine provides a vertically and horizontally scaled execution environment for the TSQ engine on each data node.
Use Case: Customer 360
360 Degree View of the Customer (or any data entity)
• Bringing everything known about the customer into the
data lake … this is a lot of data!
• Advanced data quality processes are essential to consolidate information associated with a given customer
• Data validation and enrichment to complete customer
record
• Executing these processes requires a lot of resources!
• Insights help reduce customer churn, improve customer
loyalty and campaign effectiveness
• Leveraging the massive scalability of Big Data
frameworks like Hadoop and Spark makes it possible!
• ROI = estimated increase in sales due to reduced churn and better campaign performance, including better up-selling/cross-selling
Internal Data
 Customer Master Data
 Point-of-Sale Data
 Contact Form Data
 Loyalty Program Data
 ecommerce Data
 Customer Service Data
Global Data
 Postal data for 230 countries,
regions, principalities
 Single/Double-Byte language
support
Third-Party Data
 Age
 Occupation
 Education
 Gender
 Income
 Geographic
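The consolidation step at the core of Customer 360 can be sketched as a simple match-and-merge across source systems. The sketch below keys profiles on a normalized email address and applies a naive "first value wins" survivorship rule; both choices are illustrative assumptions, since real consolidation uses probabilistic matching across many fields and configurable survivorship.

```python
def consolidate(sources: dict) -> dict:
    """Merge records from multiple source systems into one profile per
    customer, keyed on normalized email. Later sources fill gaps but do
    not overwrite fields already populated."""
    profiles = {}
    for source_name, records in sources.items():
        for rec in records:
            key = rec.get("email", "").strip().lower()
            if not key:
                continue  # unmatched records would go to a review queue
            profile = profiles.setdefault(key, {"sources": []})
            profile["sources"].append(source_name)
            for field_name, value in rec.items():
                profile.setdefault(field_name, value)  # first value wins
    return profiles

sources = {
    "crm":     [{"email": "A@x.com", "name": "Ann"}],
    "loyalty": [{"email": "a@x.com", "tier": "gold"}],
}
customer_360 = consolidate(sources)
```

At data-lake scale this merge is exactly the shuffle-heavy workload the slide alludes to: grouping every record for a customer onto one node is what Hadoop/Spark's partition-by-key machinery does cheaply.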
Use Case: Advanced Analytics
Enabling predictive analytics/machine learning
• Algorithms and/or machine learning models to
detect anomalies, predict behaviors, such as:
• Customer behavior analysis
• Root cause analysis
• Predictive maintenance/Optimizing downtime
• Requires huge volumes of customer, product and/or
equipment profile data, real-time sensor data,
complex event processing data, geolocation,
weather/operating conditions
• Leveraging the massive scalability of Big Data frameworks like Hadoop and Spark makes it possible!
• ROI = Estimated reductions in downtime,
breakdowns, lost revenue and savings in parts,
labor and other costs
Internal Data
 Customer Master Data
 Customer Service Data
 Sales/eCommerce Data
 Product Master Data
 Fleet/Machinery
Maintenance Data
 Field Service Notes
Mobile Data
 Field Worker Devices
 Location
 Sensor Data
Third-Party Data
 Weather/Local Operating
Conditions
 Fleet/Machinery Maintenance
Schedules
 Warranty Data
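The anomaly-detection piece mentioned above can be illustrated with the simplest possible statistical check: flag sensor readings whose z-score exceeds a threshold. This is a stand-in for the trained models a real predictive-maintenance pipeline would use, and the readings and threshold are invented for the example.

```python
from statistics import mean, stdev

def anomalies(readings, threshold=3.0):
    """Return (index, value) pairs whose z-score exceeds the threshold.

    A z-score check is only a stand-in for the machine-learning models a
    real pipeline would train over many signals (sensor data, maintenance
    history, operating conditions)."""
    if len(readings) < 2:
        return []
    mu, sigma = mean(readings), stdev(readings)
    if sigma == 0:
        return []
    return [(i, x) for i, x in enumerate(readings)
            if abs(x - mu) / sigma > threshold]

temps = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2, 95.0]  # one suspect reading
flagged = anomalies(temps, threshold=2.0)
```

The ROI framing above follows directly: each flagged reading that triggers maintenance before a breakdown converts unplanned downtime into a scheduled repair.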
Demo
Data Con LA 2018 - Populating your Enterprise Data Hub for Next Gen Analytics by Sushree Mishra
Editor's Notes
• #6: Source: Syncsort Annual Big Data Survey 2017