Populating your Enterprise Data Hub for Next Gen Analytics
Presented by: Sushree Mishra, Senior Sales Engineer
August 2018
Agenda
• Company Overview
• Biggest Implementation Challenges
• Data Integration in Big Data
• Data Quality Functional Examples
• Demonstration
Trusted Industry Leadership
• 500+ experienced and talented data professionals
• >7,000 customers
• Founded 1968: 50 years of market leadership and award-winning customer support
• 84 of the Fortune 100 are customers
• 3x revenue growth in the last 12 months
The global leader in Big Iron to Big Data
Differentiated Product Portfolio & Technical Expertise
• Data Infrastructure Optimization: Best-in-class resource utilization and performance, on premise or in the cloud
• Data Availability: #1 in high availability for IBM i and AIX Power Systems
• Data Integration: Industry-leading mainframe data access and highest-performing ETL
• Data Quality: Market-leading data quality capability
• Trillium Software System
• Trillium Quality for Big Data
• Trillium Precise
• Trillium Cloud
• Trillium Global Locator
• Trillium Quality for Dynamics CRM
• DL/2
• Zen Suite
• MFX® for z/OS
• ZPSaver Suite
• EZ-DB2
• EZ-IDMS
• DMX & DMX-h
• DMX AppMod
• athene®
• athene SaaS®
• MIMIX Availability & DR
• MIMIX Move
• MIMIX Share
• iTera Availability
• Enforcive IBM i Security
• Ironstream®
• Ironstream® Transaction Tracing
• DMX & DMX-h
• DMX Change Data Capture
Big Iron to Big Data
A fast-growing market segment composed of solutions that optimize traditional data systems and
deliver mission-critical data from these systems to next-generation analytic environments.
Biggest Implementation Challenges
1. Data Quality: Assessing and improving the quality of data as it enters and/or resides in the data lake.
2. Skills/Staff: Teams need to learn a new set of skills; Hadoop programmers are difficult to find and/or expensive.
3. Data Governance: Including data lake in governance initiatives and meeting regulatory compliance.
4. Rapid Change: Frameworks and tools evolve fast, and it’s difficult to keep up with the latest tech.
5. Fresh Data (CDC): Difficult to keep data lake up-to-date with changes made on other platforms.
6. Mainframe: Difficult to move mainframe data in and out of Hadoop/Spark.
7. Data Movement: Difficult to move data in and out of Hadoop/Spark.
[Chart: Big Data Challenges – % of respondents who rated each a top challenge (1 or 2), scale 0–50%: Data Quality, Skills, Governance, Rapid Change, CDC, Mainframe, Data Movement, Cost, Connectivity, Uncertainty]
Data Integration
in Big Data
Offload Data and ELT Workloads out of Legacy DW
Before: Data Sources → ETL → Data Warehouse (ETL and ELT workloads run inside the warehouse) → Analytic Query & Reporting → Business Intelligence

After: Data Sources → DMX-h ETL → Data Warehouse → Analytic Query & Reporting → Business Intelligence
Simplify: Design Once, Deploy Anywhere
• Use existing ETL skills
• No need to worry about mappers, reducers, big side or small side of joins, etc.
• Automatic optimization for best performance, load balancing, etc.
• No changes or tuning required, even if you change execution frameworks
• Future-proof job designs for emerging compute frameworks, e.g. Spark 2
• Run multiple execution frameworks in a single job
Single GUI – Execute Anywhere
Syncsort Confidential and Proprietary - do not copy or distribute
Intelligent Execution - insulate your organization from the underlying complexities of Hadoop.
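The deck does not show what "design once, deploy anywhere" looks like in code, so here is an illustrative sketch of the underlying idea (not the DMX-h product API): a job is defined as an ordered list of framework-agnostic transforms, and a pluggable backend decides how to execute them at run time. The `Job` and `run_local` names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

@dataclass
class Job:
    """A framework-agnostic job: an ordered list of transforms over records."""
    steps: List[Callable[[Iterable[dict]], Iterable[dict]]] = field(default_factory=list)

    def then(self, step):
        self.steps.append(step)
        return self

def run_local(job: Job, records: Iterable[dict]) -> list:
    """'Local' backend: apply each transform in-process.

    A second backend (e.g. one targeting Spark or MapReduce) would translate
    the same `job.steps` into that framework's primitives -- the job design
    itself would not change."""
    data = list(records)
    for step in job.steps:
        data = list(step(data))
    return data

# Design once...
job = (Job()
       .then(lambda rs: (r for r in rs if r["amount"] > 0))                    # filter
       .then(lambda rs: ({**r, "amount_doubled": r["amount"] * 2} for r in rs)))  # map

# ...then pick an execution backend at run time.
result = run_local(job, [{"amount": 5}, {"amount": -2}])
```

The point of the pattern is that swapping `run_local` for another backend requires no change to `job`, which mirrors the "no changes or tuning required, even if you change execution frameworks" claim above.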
9
High Performance ETL Architecture (DMX-h)
The DMX-h engine is installed on the workstation, the edge node, and all cluster nodes. The engine is invoked as an executable only when a job is submitted.
The Job editor and Task editor used to design DMX-h jobs are installed only on the Windows workstation. These editors can connect to local or remote DMX-h agents.
The DMX-h agent is a daemon that runs only on the edge node; it serves requests from the DMX-h GUI editors.
Job Execution Choices
Edge Node Single Node in Cluster Cluster
A quick refresher on DMX DataFunnel
DMX DataFunnel™
• Funnels hundreds of tables at once into your data lake or RDBMS
‒ Extract, map and move whole DB schemas in one invocation
‒ Extract from DB2, Oracle, Teradata, Netezza, S3, Redshift …
‒ To SQL Server, Postgres, Hive, Redshift and HDFS
‒ Automatically create target tables
• Process multiple funnels in parallel on edge node or data nodes
‒ Leverages the DMX-h high performance data processing engine
• Filter unwanted data before extraction
‒ Data type filtering
‒ Table, record or column exclusion / inclusion
• In-flight transformations and cleansing
‒ Append strings to target table names
‒ Transform columns based on their data types
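DataFunnel itself is driven through its own interface rather than code, but the behavior the bullets describe can be sketched in plain Python: copy many tables in parallel, apply include/exclude filters before extraction, and append a suffix to target table names. The `funnel` function and the in-memory "tables" are stand-ins for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

def funnel(source: dict, include=None, exclude=(), suffix="", max_workers=4) -> dict:
    """Copy many 'tables' (name -> list of rows) in parallel.

    `include`/`exclude` mimic table-level inclusion/exclusion filters;
    `suffix` mimics appending a string to target table names."""
    names = [n for n in source
             if (include is None or n in include) and n not in exclude]

    def copy_table(name):
        # A real funnel would extract from a source DB and load a target
        # (RDBMS, Hive, HDFS, ...); here the "copy" just materializes rows.
        return name + suffix, list(source[name])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(copy_table, names))

src = {"customers": [{"id": 1}], "orders": [{"id": 10}], "tmp_scratch": []}
target = funnel(src, exclude={"tmp_scratch"}, suffix="_stg")
```

The thread pool here plays the role of the "process multiple funnels in parallel" bullet: each table copy is independent, so they can all run concurrently.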
DMX-h Increases Business Agility at IHG with Up-To-Date Data
Business Challenge
• Create an analytics platform that standardizes data ingestion from over 5,000 properties globally
• Enable real-time updates as inventory changes
• Provide more timely access to room availability, inventory and other hotel data from all global properties
• Regularly update Property Policy information; reports with stale data can lead to incorrect analysis
• Existing processes refreshed data infrequently, at most once a day

Solution
• Property information and house policy data sent via Kafka topics
• Hortonworks Hadoop cluster on Google Cloud Platform to access and integrate property and policy data
• Syncsort DMX-h is the only solution that integrates Kafka, Google Cloud Platform, Spark and the existing EDW
• DMX-h ingests 30 different types of JSON Kafka messages every 30 minutes and writes to HDFS
• DMX-h transforms the dataset and loads it to the EDW as well as ORC files in a Google bucket

Benefit
• Simplicity – the entire process is visually depicted in DMX-h jobs, making it easy to understand
• Time-to-Value – Syncsort DMX-h drastically reduced development and maintenance times
• Future Proofing – DMX-h will allow IHG to move seamlessly to Spark when ready

Business Value
• Insight – up-to-date data results in better business decisions
• Agility – ability to respond quickly based on current and comprehensive information across the portfolio
• Reduced Risk – the modern data architecture allows IHG to easily develop and maintain the data pipeline with minimal effort
IHG is a global organization with a broad portfolio of hotel brands. IHG franchises, leases, manages
or owns more than 5,000 hotels and 742,000 guest rooms in almost 100 countries,
with nearly 1,400 hotels in its development pipeline. IHG uses cutting edge technologies to take
advantage of the value inherent in their data – including inventory, booking and membership details.
Prior to DMX-h, data could only be refreshed once a day – with DMX-h, the Data Warehouse is refreshed every 30 minutes!
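The heart of the IHG pipeline is a recurring micro-batch: read typed JSON messages from Kafka, group them, and land them as files every 30 minutes. A hedged, minimal sketch of that shape follows; it uses plain Python with local files standing in for HDFS, and an iterable of raw strings standing in for a Kafka consumer (the function name and file naming are invented).

```python
import json
from collections import defaultdict
from pathlib import Path

def write_micro_batch(messages, out_dir, batch_id):
    """Group raw JSON messages by their 'type' field and write one
    newline-delimited JSON file per message type for this batch.

    Stand-ins: `messages` would come from Kafka topics, and `out_dir`
    would be an HDFS path in the real pipeline."""
    by_type = defaultdict(list)
    for raw in messages:
        msg = json.loads(raw)
        by_type[msg.get("type", "unknown")].append(msg)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for msg_type, msgs in by_type.items():
        path = out / f"{msg_type}_batch{batch_id}.jsonl"
        path.write_text("\n".join(json.dumps(m) for m in msgs))
        paths.append(path)
    return paths
```

A scheduler (every 30 minutes, per the case study) would call this with the messages consumed since the previous batch; a downstream step would then transform the landed files and load the EDW.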
Data Quality in
Big Data
Trillium Software Product Portfolio
Real-time Applications
Trillium Software System
On Premise or via Trillium Cloud
Deploy any or all products to the cloud
Completely managed SaaS in AWS or Azure deployed in 30 days or less
TS Discovery 15.7
Automated data profiling and discovery tool that
identifies data quality issues, facilitates business
rule management, and provides data quality
metrics
TS Quality 15.7, Series 7
Data quality engine that provides data cleansing,
matching, and enrichment for multi-domain, global
data (including global address validation)
Global Locator 15.7
Geolocation tool that standardizes and validates
address data and assigns corresponding latitude
and longitude coordinates
Trillium Precise
Data enrichment, validation, and verification
services including global postal addresses, email,
phone, and internet connectivity
Trillium Solutions
CRM, ERP, MDM
Customized solutions for leading platforms:
• Trillium for Microsoft Dynamics CRM 2.2
• Trillium for SAP ERP
• Trillium for SAP MDG 1.1
• Trillium for Oracle/Siebel
TS Director 15.7
Enables real-time, secure data quality within any
application
TSI Web Services 15.7
TS Web Services allows you to send data to TSS for cleansing (formatting and enhancing) and matching (identifying potential duplicates) using industry-standard SOAP requests.
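Since the service speaks standard SOAP, a request can be built with nothing beyond the standard library. The sketch below assembles a SOAP 1.1 envelope for a hypothetical cleanse call; the service namespace and element names (`CleanseRequest` and the field elements) are invented placeholders, not the actual TSS schema, which its WSDL defines.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace -- the real TSS WSDL defines its own.
SVC_NS = "http://example.com/tss/cleanse"

def build_cleanse_request(record: dict) -> bytes:
    """Wrap a flat record in a SOAP 1.1 envelope as a cleanse request."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    req = ET.SubElement(body, f"{{{SVC_NS}}}CleanseRequest")
    for field_name, value in record.items():
        field_el = ET.SubElement(req, f"{{{SVC_NS}}}{field_name}")
        field_el.text = value
    return ET.tostring(envelope, xml_declaration=True, encoding="utf-8")

payload = build_cleanse_request({"Name": "jOhn smiTh", "City": "boston"})
```

The resulting bytes would be POSTed to the service endpoint with the appropriate `SOAPAction` header; the matching call would follow the same envelope pattern with a different request element.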
The Data Quality Process Delivers Trusted Data
Data Discovery (Trillium Discovery)
• Data Profiling
• Business Rules & Data Quality Assessment

Data Quality Processing (Trillium Quality; Trillium Quality for Big Data + Global Address Verification)
• Data Validation, Standardization, Matching & more
• Data Verification & Enrichment

Downstream uses:
• Operational Integrations (CRM, Customer 360)
• Analytics & Reporting
• Data Governance
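To make the validate → standardize → match stages concrete, here is a deliberately toy illustration on contact records. The rules below (ZIP format check, street-abbreviation normalization, name+ZIP match key) are invented for the example; Trillium implements these stages with far richer, locale-aware logic.

```python
import re

def validate(record):
    """Flag records that fail basic completeness/format checks."""
    ok = bool(record.get("name")) and bool(re.fullmatch(r"\d{5}", record.get("zip", "")))
    return {**record, "valid": ok}

def standardize(record):
    """Normalize casing and a couple of common street abbreviations."""
    addr = record.get("address", "").upper()
    addr = addr.replace(" STREET", " ST").replace(" AVENUE", " AVE")
    return {**record, "name": record.get("name", "").title(), "address": addr}

def match_key(record):
    """Crude match key: name + ZIP. Real matching is fuzzy and weighted."""
    return (record["name"].lower(), record.get("zip"))

def dedupe(records):
    """Run the pipeline, keeping the first survivor per match key."""
    seen = {}
    for r in map(standardize, map(validate, records)):
        seen.setdefault(match_key(r), r)
    return list(seen.values())

rows = [
    {"name": "jane doe", "address": "1 Main Street", "zip": "02101"},
    {"name": "JANE DOE", "address": "1 Main St",     "zip": "02101"},
]
golden = dedupe(rows)
```

Standardizing before matching is what makes the two variant spellings collapse to one record: matching on the raw values would have missed the duplicate.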
Trillium Data Quality for Big Data:
Run quality processes directly within Hadoop
“Design once, deploy anywhere”
• Visually design data quality jobs once and run anywhere (MapReduce,
Spark, Linux, Unix, Windows; on premise or in the cloud)
• Use-case templates to fast-track development
• Test & debug locally in Windows/Linux; deploy to Big Data
• Intelligent Execution dynamically optimizes data processing at run-time
based on the chosen compute framework; no changes or tuning required
Benefit: Significantly reduce manual data preparation
• Major time sink for data scientists, architects and analysts
• Risk of inconsistent or incomplete data preparation
Benefit: Significantly increase trust in data
• Major time sink for executives
• Risk of poor data-based business decisions
Single GUI – Execute Anywhere!
Trillium Quality for Big Data – Execution
Architecture
TSS Control Center GUI: simply click to publish the project to be run in Hadoop.
The tsqbd utility processes the exported project, generating a TQBD job to run locally on the Linux edge node; local execution is used for Dev and QA.
The tsqbd utility can likewise generate a TQBD job to run on MapReduce or Spark.
Each map and reduce task executes the job by invoking the DMX-h engine (which in turn invokes the TSQ engine) as a child process within the JVM.
The DMX-h engine provides a vertically and horizontally scaled execution environment for the TSQ engine on each data node.
Use Case: Customer 360
360 Degree View of the Customer (or any data entity)
• Bringing everything known about the customer into the
data lake … this is a lot of data!
• Advanced data quality processes are essential to consolidate information associated with a given customer
• Data validation and enrichment to complete customer
record
• Executing these processes requires a lot of resources!
• Insights help reduce customer churn, improve customer
loyalty and campaign effectiveness
• Leveraging the massive scalability of Big Data
frameworks like Hadoop and Spark makes it possible!
• ROI = estimated increase in sales due to reduced churn and better campaign performance, including better up-selling/cross-selling
Internal Data
 Customer Master Data
 Point-of-Sale Data
 Contact Form Data
 Loyalty Program Data
 ecommerce Data
 Customer Service Data
Global Data
 Postal data for 230 countries,
regions, principalities
 Single/Double-Byte language
support
Third-Party Data
 Age
 Occupation
 Education
 Gender
 Income
 Geographic
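The consolidation step at the core of Customer 360 can be sketched as a simple match-and-merge across source systems. The sketch below keys profiles on a normalized email address and applies a naive "first value wins" survivorship rule; both choices are illustrative assumptions, since real consolidation uses probabilistic matching across many fields and configurable survivorship.

```python
def consolidate(sources: dict) -> dict:
    """Merge records from multiple source systems into one profile per
    customer, keyed on normalized email. Later sources fill gaps but do
    not overwrite fields already populated."""
    profiles = {}
    for source_name, records in sources.items():
        for rec in records:
            key = rec.get("email", "").strip().lower()
            if not key:
                continue  # unmatched records would go to a review queue
            profile = profiles.setdefault(key, {"sources": []})
            profile["sources"].append(source_name)
            for field_name, value in rec.items():
                profile.setdefault(field_name, value)  # first value wins
    return profiles

sources = {
    "crm":     [{"email": "A@x.com", "name": "Ann"}],
    "loyalty": [{"email": "a@x.com", "tier": "gold"}],
}
customer_360 = consolidate(sources)
```

At data-lake scale this merge is exactly the shuffle-heavy workload the slide alludes to: grouping every record for a customer onto one node is what Hadoop/Spark's partition-by-key machinery does cheaply.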
Use Case: Advanced Analytics
Enabling predictive analytics/machine learning
• Algorithms and/or machine learning models to
detect anomalies, predict behaviors, such as:
• Customer behavior analysis
• Root cause analysis
• Predictive maintenance/Optimizing downtime
• Requires huge volumes of customer, product and/or
equipment profile data, real-time sensor data,
complex event processing data, geolocation,
weather/operating conditions
• Leveraging the massive scalability of Big Data frameworks like Hadoop and Spark makes it possible!
• ROI = Estimated reductions in downtime,
breakdowns, lost revenue and savings in parts,
labor and other costs
Internal Data
 Customer Master Data
 Customer Service Data
 Sales/eCommerce Data
 Product Master Data
 Fleet/Machinery
Maintenance Data
 Field Service Notes
Mobile Data
 Field Worker Devices
 Location
 Sensor Data
Third-Party Data
 Weather/Local Operating
Conditions
 Fleet/Machinery Maintenance
Schedules
 Warranty Data
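The anomaly-detection piece mentioned above can be illustrated with the simplest possible statistical check: flag sensor readings whose z-score exceeds a threshold. This is a stand-in for the trained models a real predictive-maintenance pipeline would use, and the readings and threshold are invented for the example.

```python
from statistics import mean, stdev

def anomalies(readings, threshold=3.0):
    """Return (index, value) pairs whose z-score exceeds the threshold.

    A z-score check is only a stand-in for the machine-learning models a
    real pipeline would train over many signals (sensor data, maintenance
    history, operating conditions)."""
    if len(readings) < 2:
        return []
    mu, sigma = mean(readings), stdev(readings)
    if sigma == 0:
        return []
    return [(i, x) for i, x in enumerate(readings)
            if abs(x - mu) / sigma > threshold]

temps = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2, 95.0]  # one suspect reading
flagged = anomalies(temps, threshold=2.0)
```

The ROI framing above follows directly: each flagged reading that triggers maintenance before a breakdown converts unplanned downtime into a scheduled repair.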
Demo
Data Con LA 2018 - Populating your Enterprise Data Hub for Next Gen Analytics by Sushree Mishra
Editor's Notes
• #6: Source: Syncsort Annual Big Data Survey 2017