Introducing:
Trillium DQ for Big Data
Harald Smith, Director of Product Marketing
Housekeeping
Webcast Audio
• Today’s webcast audio is streamed through your computer speakers.
• If you need technical assistance with the web interface or audio,
please reach out to us using the chat window.
Questions Welcome
• Submit your questions at any time during the presentation
using the chat window.
• We will answer them during our Q&A session following the
presentation.
Recording and slides
• This webcast is being recorded. You will receive an
email following the webcast with a link to download
both the recording and the slides.
Speaker
Harald Smith
• Director of Product Marketing, Syncsort
• 20+ years in Information Management with a focus
on data quality, integration, and governance
• Co-author of Patterns of Information Management
• Author of two Redbooks on Information Governance
and Data Integration
• Blog on InfoWorld: “Data Democratized”
Data challenges across the business
Business Leaders
Lack trust in data needed to
make rapid, accurate
decisions that grow business
Business Analysts
Can’t access or understand
data and spend excessive
time on investigating
Information Leaders
Must facilitate business
collaboration and data
transparency and governance
Chief Data Officers
Make data a strategic
business asset utilizing
skills that range from scientific
expertise to basic spreadsheet knowledge
Only 35% of senior
executives have a high
level of trust in the
accuracy of their Big
Data Analytics
92% of executives are
concerned about the
negative impact of data
and analytics on
corporate reputation
New survey indicates
nearly 80% of AI/ML
projects stalling due to
poor data quality
84% of CEOs
are concerned about
the quality of the data
they’re basing
decisions on
Big Data Needs
Data Quality
Data Quality Challenges of Big Data
Profiling Data
• Organizations are storing vast amounts of data in data lakes and the Cloud –
from many different sources – but that data isn’t usable unless it is understood.
To understand it, the business users who work with the data must be able
to access and profile it without constant IT help
Matching Entities Accurately
• Distinguishing matches that indicate a single specific entity across so much data
requires sophisticated multi-field matching algorithms, which must be
understandable by business users to be meaningful
Scalability
• Distinguishing matches across massive datasets requires a lot of compute
power: everything has to be compared to everything else, multiple
times in multiple ways
• Taking advantage of Big Data processing for scalability requires specialized skills
and takes a long time, and it requires tuning and rewriting as technology changes
• Traditional data quality tools are not designed to work on that scale of data
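To make the scalability point concrete, the sketch below counts how many pairs an "everything against everything" comparison produces and shows how grouping records by a shared key (often called blocking) shrinks the candidate set. Blocking is a common general technique used here for illustration; the slides do not say that Trillium works this way, and the sample records, the `all_pairs_count` and `blocked_pairs` helpers, and the choice of postal code as the key are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n: int) -> int:
    """Number of unordered pairs when every record is compared to every other."""
    return n * (n - 1) // 2


def blocked_pairs(records, block_key):
    """Yield candidate pairs only for records that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical sample records: (name, postal_code)
records = [
    ("Acme Corp", "02110"),
    ("ACME Corporation", "02110"),
    ("Apex Ltd", "10001"),
    ("Apex Limited", "10001"),
    ("Zenith Inc", "94105"),
]

naive = all_pairs_count(len(records))
candidates = list(blocked_pairs(records, block_key=lambda r: r[1]))
print(f"all pairs: {naive}, pairs after blocking: {len(candidates)}")
print(f"all pairs for one million records: {all_pairs_count(1_000_000):,}")
```

The pair count grows roughly with the square of the record count, which is why this kind of work is usually distributed across a cluster.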
Trillium DQ for Big Data
Understand, Evaluate, and Resolve Big Data Quality Problems
Trillium Discovery for Big Data
Data Profiling
Gain a complete picture of your data before
use
• Understand the data
• Analyze the data
• Find data quality problems
• Build and evaluate data quality rules
Trillium DQ for Big Data
On Premises or via Trillium Cloud
Deploy any or all products to the cloud - Completely managed SaaS in AWS or Azure
Trillium Quality for Big Data
Data Cleansing and Matching
Cleanse, standardize, and connect
data in accordance with your predefined
standards
• Entity matching and resolution
• Data cleansing and correction
• Data record enrichment
Feature-rich data profiling and data quality processing engines
• Leveraging over two decades of data quality expertise
An efficient orchestration of this engine in Big Data distributed
frameworks
• Powered by an architecture that has been in production with very large
(2000+ node) environments running natively across the cluster
• Close partnership with Cloudera and Hortonworks, with native integration with the stack
• Syncsort has been a major contributor to the Apache Hadoop open source project
• With efficient orchestration, we can process any number of attributes with a handful
of MapReduce jobs
• Same architecture is used for Apache Spark
“Design once, deploy anywhere” architecture
• Native connectivity providing breadth and performance
• “Intelligent Execution” to optimize process execution at run-time
(MapReduce, Spark 1.x, Spark 2.x)
• On premises and in the cloud (e.g. Amazon EMR)
Data Quality for Your Big Data Needs
Key Outcomes
• Reduce the time for business analysts to discover and understand
data on Big Data platforms
• Allow business analysts who understand the data but have little
technical expertise to quickly find data and run data profiling in
three steps
• Let analysts explore results and drilldown to details within 2-5
seconds per view to review and then report on data issues to
business leaders
• Scale to large volumes of data sources & attributes so that business
analysts can understand the contents of any data source needed for
business decisions
• Keep data secured in process and at rest, and available only to
authorized users, to comply with regulations and avoid fines
Trillium Discovery for Big Data
Trillium Discovery for Big Data
• Delivers enterprise-trusted Trillium Discovery on distributed big data
platforms (e.g. Hadoop, Spark) for high-volume, scalable data profiling
• Provides complete Trillium Discovery data profiling for analysis & review
• Attribute metadata, value & pattern frequencies, key & dependency analysis,
cross-source join analysis, drill down to any outlier or issue, and more…
• Provides easily configured native connectivity for Big Data sources
• Provides management and monitoring of task execution
• Integrates with the security frameworks (Kerberos, AD, LDAP) of
Big Data platforms
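As a rough illustration of what "attribute metadata" and "value & pattern frequencies" mean for a single column, here is a minimal single-machine sketch. It is not the product's implementation: the `pattern_of` and `profile_column` helpers, the `A`/`9` pattern notation, and the sample phone values are all assumptions made for the example.

```python
from collections import Counter


def pattern_of(value: str) -> str:
    """Map letters to 'A' and digits to '9'; keep every other character as-is."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)


def profile_column(values):
    """Return basic attribute metadata plus value and pattern frequencies."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "value_freq": Counter(non_null),
        "pattern_freq": Counter(pattern_of(v) for v in non_null),
    }


# Hypothetical phone-number column with inconsistent formats
phones = ["617-555-0101", "617-555-0101", "6175550199", None, "n/a"]
profile = profile_column(phones)
print(profile["count"], profile["nulls"], profile["distinct"])
print(profile["pattern_freq"].most_common())
```

Rare patterns in the frequency list are often the quickest way to spot outliers such as placeholder values or alternate formats.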
Trillium Discovery for Big Data – Data Profiling at Scale
Workflow steps shown in the diagram:
• Select Source
• Run Profiling (1 … n)
• Explore Profiles
• Drilldown to Issues
• Evaluate Business Rules
Native Connectors
▪ HDFS source directories
▪ …
Stored Profiling Results
▪ Metadata & Statistics
▪ Frequency Distributions
▪ Drilldown Indices
Share & Govern Results
▪ Integration (APIs)
▪ Notification
▪ Collaboration
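The idea of evaluating a business rule and then drilling down to the records behind an issue can be sketched as follows. This is only an assumption about the general shape of such a check, not a description of Trillium Discovery: the `evaluate_rule` helper, the sample rows, and the five-digit postal code rule are hypothetical.

```python
def evaluate_rule(rows, name, predicate):
    """Apply one business rule and keep the failing row indices for drilldown."""
    failing = [i for i, row in enumerate(rows) if not predicate(row)]
    return {
        "rule": name,
        "checked": len(rows),
        "passed": len(rows) - len(failing),
        "failing_rows": failing,
    }


def us_zip_is_five_digits(row):
    """Hypothetical rule: a US postal code must be exactly five digits."""
    code = row["postal_code"]
    return len(code) == 5 and code.isdigit()


# Hypothetical source rows
rows = [
    {"id": 1, "country": "US", "postal_code": "02110"},
    {"id": 2, "country": "US", "postal_code": "2110"},
    {"id": 3, "country": "US", "postal_code": ""},
]

result = evaluate_rule(rows, "US postal code has 5 digits", us_zip_is_five_digits)
print(f"{result['passed']} of {result['checked']} rows passed")
for i in result["failing_rows"]:
    print("drilldown:", rows[i])
```

Storing the failing row indices alongside the summary counts is what lets a reviewer move from a statistic straight to the offending records.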
Key Outcomes
• Match and link any data entity – customers, suppliers, products, etc. –
into a trusted single view to support a broad array of business-critical
use cases (e.g. Customer 360, fraud, AML)
• Parse and standardize complex multi-domain data, extended with
enrichment and verification of critical address and geolocation data –
all leveraging out-of-the-box templates
• Use a “design once, deploy anywhere” approach to speed time-to-value
and focus on building data quality business logic while letting the
product handle the technical aspects of framework execution with no
coding or tuning required
• Leverage the high-performance compute power of distributed Big Data
frameworks including Hadoop MapReduce and Spark to process high
volumes within targeted time windows to meet critical Service Level
Agreements (SLAs)
Trillium Quality for Big Data
Trillium Quality for Big Data
• Integrate, parse, standardize, and match new and legacy customer data
from multiple disparate sources.
• Provide high-quality entity resolution through multi-domain deduplication
and matching with the most comprehensive set of match comparisons
available, including fuzzy matching, distance comparisons, and more.
• Standardize, enhance, and match international data sets with postal and
country-code validation.
• Deploy data quality workflows as native, parallel MapReduce or Spark
processes for optimal efficiency.
• Process hundreds of millions of records of data.
• Increase processing efficiency.
• Support failover through Hadoop’s fault-tolerant design; during a node
failure, processing is redirected to another node.
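The multi-field fuzzy matching mentioned in the list above can be illustrated with a small weighted-similarity sketch built on Python's standard `difflib`. This is a generic stand-in, not Trillium's match comparisons: the field names, the weights, the 0.85 threshold, and the sample records are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Illustrative field weights and threshold, not product defaults
WEIGHTS = {"name": 0.5, "street": 0.3, "postal_code": 0.2}
THRESHOLD = 0.85


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of per-field similarities."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


def is_match(rec_a: dict, rec_b: dict) -> bool:
    """Treat two records as the same entity when the score reaches the threshold."""
    return match_score(rec_a, rec_b) >= THRESHOLD


# Hypothetical customer records
a = {"name": "Jon Smith", "street": "12 Main St", "postal_code": "02110"}
b = {"name": "John Smith", "street": "12 Main St.", "postal_code": "02110"}
c = {"name": "Maria Garcia", "street": "98 Oak Ave", "postal_code": "94105"}

print(is_match(a, b), round(match_score(a, b), 3))
print(is_match(a, c), round(match_score(a, c), 3))
```

Scoring several fields together is what keeps a near-identical name from producing a match when the address data disagrees.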
Trillium Quality for Big Data – Data Cleansing at Scale
Boost the effectiveness of machine learning and AI with complete, standardized, matched data.
1. Visually create and test data
quality processes locally
2. Execute in MapReduce or Spark
On premises or in the Cloud
Big Data Platform
Syncsort Trillium Delivers Data You Can Trust
Capabilities:
• Data Profiling
• Business Rules & Data Quality Assessment
• Data Validation, Standardization, Enrichment & more
• Matching, Entity Resolution & Verification
Supported uses:
• Customer 360
• AI/ML
• Operational Integrations
• Analytics & Reporting
• Data Governance
Trillium Discovery for Big Data
Trillium Quality for Big Data
+ Global Address Verification
Trillium DQ for Big Data
Trillium DQ for Big Data
Use Cases
• Turn your Big Data into a trusted view of your customers, products and more
• Power machine learning and advanced analytics with reliable, fit-for-purpose data
• Gain actionable business insights from high-volume disparate data sets from across the enterprise
• Deploy industry-leading data quality processes at massive scale, with no coding or Big Data skills required
Trillium DQ for Big Data evaluates & transforms your Big Data for trusted business insights
Anti-Money Laundering on Hadoop at Global Bank
Bank must monitor transactions to detect money laundering for FCA compliance.
Machine learning can detect patterns, but it requires large amounts of current, clean data.
CHALLENGE
• Must provide highly accurate entity resolution
• Must be secure – Kerberos, LDAP
• Must have lineage – data origin to end point
• Massive data volumes
• Scattered data – Mainframe, RDBMS, Cloud, …
• Must archive unaltered mainframe data
SOLUTION
• Trillium DQ for Big Data
• Connect CDC
• Connect for Big Data
Full Anti-Money Laundering regulatory compliance with financial crimes data lake –
high performance results at massive scale.
• Full end-to-end data lineage supplied to Apache Atlas and ASG Data Intelligence
• Cluster-native data verification, enrichment, and demanding multi-field entity resolution on Spark
• Unmodified mainframe “Golden Records” stored on Hadoop
Trillium DQ for
Big Data Cleanses
Credit Data for
Creditsafe
CHALLENGE
Ensure ALL DATA on each company is
analyzed – and NO DATA from another
company is accidentally included –
to get accurate corporate credit ratings.
• Need to profile, cleanse and enhance
data to evaluate credit ratings for
80 million companies in U.S. alone
• Existing solution lacked flexible
de-dupe matching rules, scalability
• Millions of records to analyze per
company, in multiple inconsistent
data sources, about 800 million/day
total and growing
• Solution must scale!
SOLUTION
• Amazon EMR Cloud
• Trillium DQ for Big Data cleansed,
standardized and matched over
130 million records/hour on a basic
10-node test cluster – met the
business SLA with room to grow
96% Address Matching Accuracy
after Trillium cleansing,
standardization
Saved software costs – Replaced
multiple solutions and tools
Saved Amazon cluster costs and
left room for company growth
“We can’t afford to miss
information, or mix up information
about businesses with similar
names. Companies count on our
highly accurate predictive scoring
to provide fast, accurate ratings
for their potential customers
and vendors.”
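The two volume figures on this slide can be put side by side with a quick calculation. The sketch below uses only the slide's numbers (about 800 million records per day and over 130 million records per hour on the test cluster); the 24-hour processing window is an assumption, since the slide does not state the actual SLA.

```python
# Figures taken from the slide
DAILY_VOLUME = 800_000_000       # records arriving per day
TEST_THROUGHPUT = 130_000_000    # records per hour on the 10-node test cluster

# Assumption: a full day is available as the processing window
WINDOW_HOURS = 24

hours_needed = DAILY_VOLUME / TEST_THROUGHPUT
share_of_window = hours_needed / WINDOW_HOURS
print(f"{hours_needed:.2f} hours of processing per day")
print(f"{share_of_window:.0%} of a {WINDOW_HOURS}-hour window")
```

Because the slide says "over" 130 million records per hour, the computed hours are an upper bound, which is consistent with the stated room to grow.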
Next Steps
For more information on Trillium DQ for Big Data and our other
Syncsort Trillium data quality solutions, please visit:
https://www.syncsort.com/en/products/trillium-dq-for-big-data
And:
https://www.syncsort.com/en/integrate
Q & A
Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for the Data Lake