Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for the Data Lake

Introducing:
Trillium DQ for Big Data
Harald Smith, Director Product Marketing

Housekeeping
Webcast Audio
• Today’s webcast audio is streamed through your computer speakers.
• If you need technical assistance with the web interface or audio,
please reach out to us using the chat window.
Questions Welcome
• Submit your questions at any time during the presentation
using the chat window.
• We will answer them during our Q&A session following the
presentation.
Recording and slides
• This webcast is being recorded. You will receive an
email following the webcast with a link to download
both the recording and the slides.

Speaker
Harald Smith
• Director of Product Marketing, Syncsort
• 20+ years in Information Management with a focus
on data quality, integration, and governance
• Co-author of Patterns of Information Management
• Author of two Redbooks on Information Governance
and Data Integration
• Blog on InfoWorld: “Data Democratized”

Data challenges across the business
Business Leaders
Lack trust in data needed to
make rapid, accurate
decisions that grow business
Business Analysts
Can’t access or understand
data and spend excessive
time investigating it
Information Leaders
Must facilitate business
collaboration and data
transparency and governance
Chief Data Officers
Must make data a strategic
business asset, drawing on
skills that range from basic
spreadsheet knowledge to data science

Big Data Needs Data Quality
• Only 35% of senior executives have a high level of trust in the accuracy of their Big Data Analytics
• 92% of executives are concerned about the negative impact of data and analytics on corporate reputation
• New survey indicates nearly 80% of AI/ML projects stalling due to poor data quality
• 84% of CEOs are concerned about the quality of the data they’re basing decisions on

Data Quality Challenges of Big Data
Profiling Data
• Organizations are storing vast amounts of data in data lakes and the Cloud –
from many different sources – but that data isn’t usable unless it is understood.
To understand it, the business users who work with the data must be able
to access and profile it without constant IT help
Matching Entities Accurately
• Distinguishing matches that indicate a single specific entity across so much data
requires sophisticated multi-field matching algorithms, and those algorithms must be
understandable by business users to be meaningful
Scalability
• Distinguishing matches across massive datasets requires a lot of compute
power - everything has to be compared to everything else, multiple
times in multiple ways
• Taking advantage of Big Data processing for scalability requires specialized skills
and takes a long time – and requires tuning and rewriting as technology changes
• Traditional data quality tools are not designed to work on that scale of data
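
To make the "everything compared to everything else" cost concrete, here is a minimal Python sketch. It counts the record pairs in a naive all-pairs comparison and contrasts that with blocking, a common general technique that only compares records sharing a key. The sample records, the postal-code block key, and the function names are assumptions for illustration; this is not a description of how Trillium DQ for Big Data works internally.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n: int) -> int:
    """Number of unordered record pairs when every record is compared to every other."""
    return n * (n - 1) // 2


def blocked_pairs(records, block_key):
    """Yield candidate pairs only within groups that share the same block key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for group in blocks.values():
        yield from combinations(group, 2)


# Hypothetical sample records: (name, postal_code)
records = [
    ("Acme Corp", "02110"),
    ("ACME Corporation", "02110"),
    ("Apex Ltd", "10001"),
    ("Apex Limited", "10001"),
    ("Zenith Inc", "94105"),
]

naive = all_pairs_count(len(records))
candidates = list(blocked_pairs(records, block_key=lambda r: r[1]))

print(f"naive comparisons: {naive}")              # 10 pairs for 5 records
print(f"blocked comparisons: {len(candidates)}")  # 2 pairs
print(f"pairs for 1 million records: {all_pairs_count(1_000_000):,}")
```

The pair count grows quadratically with the number of records, which is why matching at data lake volumes depends on distributed compute and on limiting which records are compared.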

Trillium DQ for Big Data
Understand, Evaluate, and Resolve Big Data Quality Problems
Trillium Discovery for Big Data
Data Profiling
Gain a complete picture of your data before use
• Understand the data
• Analyze the data
• Find data quality problems
• Build and evaluate data quality rules
Trillium Quality for Big Data
Data Cleansing and Matching
Cleanse, standardize, and connect data in accordance with your predefined standards
• Entity matching and resolution
• Data cleansing and correction
• Data record enrichment
On Premises or via Trillium Cloud
Deploy any or all products to the cloud - Completely managed SaaS in AWS or Azure

Data Quality for Your Big Data Needs
Feature-rich data profiling and data quality processing engines
• Leveraging over two decades of data quality expertise
Efficient orchestration of these engines in Big Data distributed frameworks
• Powered by an architecture that has been in production in very large
(2000+ node) environments, running natively across the cluster
• Close partnership with Cloudera and Hortonworks, with native integration with the stack
• Syncsort has been a major contributor to the Apache Hadoop open source project
• With efficient orchestration, we can process any number of attributes with a handful
of MapReduce jobs
• The same architecture is used for Apache Spark
“Design once, deploy anywhere” architecture
• Native connectivity providing breadth and performance
• “Intelligent Execution” to optimize process execution at run-time
(MapReduce, Spark 1.x, Spark 2.x)
• On premises and in the cloud (e.g. Amazon EMR)

Key Outcomes
• Reduce the time for business analysts to discover and understand
data on Big Data platforms
• Allow business analysts who understand the data but have little
technical expertise to quickly find data and run data profiling in
three steps
• Let analysts explore results and drill down to details within 2-5
seconds per view to review and then report on data issues to
business leaders
• Scale to large volumes of data sources & attributes so that business
analysts can understand the contents of any data source needed for
business decisions
• Keep data secured in process and at rest, and available only to
authorized users, to comply with regulations and avoid fines

• Delivers enterprise-trusted Trillium Discovery on distributed big data
platforms (e.g. Hadoop, Spark) for high-volume, scalable data profiling
• Provides complete Trillium Discovery data profiling for analysis & review
• Attribute metadata, value & pattern frequencies, key & dependency analysis,
cross-source join analysis, drill down to any outlier or issue, and more…
• Provides easily configured native connectivity for Big Data sources
• Provides management and monitoring of task execution
• Integrates with the security frameworks (Kerberos, AD, LDAP) of
Big Data platforms
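
As an illustration of what "attribute metadata, value & pattern frequencies" and key analysis mean in practice, here is a minimal single-machine Python sketch of column profiling. The character-class pattern notation (A for letters, 9 for digits), the function names, and the sample phone values are assumptions for illustration only; they are not the Trillium Discovery API or its output format.

```python
from collections import Counter


def pattern_of(value: str) -> str:
    """Map each character to a class: A for letters, 9 for digits, others kept as-is."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch
        for ch in value
    )


def profile_column(values):
    """Return basic attribute statistics plus value and pattern frequencies."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        # A column can serve as a key only if every row has a distinct value.
        "is_unique_key": len(set(non_null)) == len(values),
        "value_freq": Counter(non_null),
        "pattern_freq": Counter(pattern_of(v) for v in non_null),
    }


# Hypothetical phone-number column with inconsistent formats and a missing value.
phones = ["617-555-0101", "617-555-0101", "6175550102", None, "617 555 0103"]
profile = profile_column(phones)

for name in ("count", "nulls", "distinct", "is_unique_key"):
    print(name, profile[name])
print(profile["pattern_freq"].most_common())
```

A pattern frequency list like this one surfaces format inconsistencies (three phone formats in one column) that a business rule can then standardize or flag.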

Trillium Discovery for Big Data – Data Profiling at Scale
[Diagram. Steps shown: Select Source, Run Profiling (tasks 1 … n), Explore Profiles, Evaluate Business Rules, Drilldown to Issues]
Native Connectors
▪ HDFS source directories
▪ …
Stored Profiling Results
▪ Metadata & Statistics
▪ Frequency Distributions
▪ Drilldown Indices
Share & Govern Results
▪ Integration (APIs)
▪ Notification
▪ Collaboration

Key Outcomes
• Match and link any data entity – customers, suppliers, products, etc. –
into a trusted single view to support a broad array of business-critical
use cases (e.g. Customer 360, fraud, AML)
• Parse and standardize complex multi-domain data, extended with
enrichment and verification of critical address and geolocation data –
all leveraging out-of-the-box templates
• Use a “design once, deploy anywhere” approach to speed time-to-value
and focus on building data quality business logic while letting the
product handle the technical aspects of framework execution with no
coding or tuning required
• Leverage the high-performance compute power of distributed Big Data
frameworks including Hadoop MapReduce and Spark to process high
volumes within targeted time windows to meet critical Service Level
Agreements (SLAs)

• Integrate, parse, standardize, and match new and legacy customer data
from multiple disparate sources.
• Provide high-quality entity resolution through multi-domain deduplication
and matching with the most comprehensive set of match comparisons
available, including fuzzy matching, distance comparisons, and more.
• Standardize, enhance, and match international data sets with postal and
country-code validation.
• Deploy data quality workflows as native, parallel MapReduce or Spark
processes for optimal efficiency.
• Process hundreds of millions of records of data.
• Increase processing efficiency.
• Support failover through Hadoop’s fault-tolerant design; during a node
failure, processing is redirected to another node.
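
To show the general idea behind fuzzy, multi-field matching with distance comparisons, here is a minimal Python sketch. It uses Levenshtein edit distance as the distance comparison and a weighted sum across fields. The field names, weights, and threshold are hypothetical values chosen for illustration; the product's actual match comparisons and rules are not described in these slides.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize edit distance to a 0..1 similarity after basic standardization."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


# Hypothetical field weights and match threshold, for illustration only.
WEIGHTS = {"name": 0.5, "street": 0.3, "postal_code": 0.2}
THRESHOLD = 0.85


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of per-field similarities across all configured fields."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


a = {"name": "Jon Smith", "street": "12 Main St", "postal_code": "02110"}
b = {"name": "John Smith", "street": "12 Main St", "postal_code": "02110"}
c = {"name": "Jane Doe", "street": "99 Oak Ave", "postal_code": "94105"}

print(match_score(a, b) >= THRESHOLD)  # True: one-letter name difference
print(match_score(a, c) >= THRESHOLD)  # False: different person and address
```

Combining several fields lets a small spelling difference in one field be outweighed by exact agreement in the others, which is the reason multi-field rules tend to be more reliable than single-field comparisons.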

Trillium Quality for Big Data – Data Cleansing at Scale
Boost the effectiveness of machine learning and AI with complete, standardized, matched data.
1. Visually create and test data
quality processes locally
2. Execute in MapReduce or Spark
On premises or in the cloud
Big Data Platform

Syncsort Trillium Delivers Data You Can Trust
Data Profiling
Business Rules & Data Quality Assessment
Data Validation, Standardization, Enrichment & more
Matching, Entity Resolution & Verification
• Customer 360
• AI/ML
• Operational Integrations
• Analytics & Reporting
• Data Governance
Trillium Discovery for Big Data
+ Global Address Verification
Trillium DQ for Big Data

Use Cases

Trillium DQ for Big Data evaluates & transforms your Big Data for trusted business insights
• Turn your Big Data into a trusted view of your customers, products and more
• Power machine learning and advanced analytics with reliable, fit-for-purpose data
• Gain actionable business insights from high-volume disparate data sets from across the enterprise
• Deploy industry-leading data quality processes at massive scale, with no coding or Big Data skills required

Anti-Money Laundering on Hadoop at Global Bank
Bank must monitor transactions to detect money laundering for FCA compliance.
Machine learning can detect patterns, but … requires large amounts of current, clean data.
• Trillium DQ for Big Data
• Connect CDC
• Connect for Big Data
CHALLENGE
• Must provide highly accurate entity resolution
• Must be secure – Kerberos, LDAP
• Must have lineage – data origin to end point
• Massive data volumes
• Scattered data – Mainframe, RDBMS, Cloud, …
• Must archive unaltered mainframe data
SOLUTION
Full Anti-Money Laundering regulatory compliance with financial crimes data lake –
high performance results at massive scale.
• Full end-to-end data lineage supplied to Apache Atlas and ASG Data Intelligence
• Cluster-native data verification, enrichment, and demanding multi-field entity resolution on Spark
• Unmodified mainframe “Golden Records” stored on Hadoop

Trillium DQ for Big Data Cleanses Credit Data for Creditsafe
CHALLENGE
Ensure ALL DATA on each company is analyzed – and NO DATA from another
company is accidentally included – to get accurate corporate credit ratings.
• Need to profile, cleanse and enhance data to evaluate credit ratings for
80 million companies in U.S. alone
• Existing solution lacked flexible de-dupe matching rules, scalability
• Millions of records to analyze per company, in multiple inconsistent
data sources, about 800 million/day total and growing
• Solution must scale!
SOLUTION
• Amazon EMR Cloud
• Trillium DQ for Big Data cleansed, standardized and matched over
130 million records/hour on basic 10-node test cluster – met the
business SLA with room to grow
96% Address Matching Accuracy after Trillium cleansing, standardization
Saved software costs – Replaced multiple solutions and tools
Saved Amazon cluster costs and left room for company growth
“We can’t afford to miss
information, or mix up information
about businesses with similar
names. Companies count on our
highly accurate predictive scoring
to provide fast, accurate ratings
for their potential customers
and vendors.”
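
The "room to grow" claim on this slide can be sanity-checked with simple arithmetic. The sketch below assumes that the 800 million/day figure counts records, and that throughput stays at the quoted lower bound of 130 million records per hour; both are assumptions, since the slide does not state them explicitly.

```python
# Figures quoted on the slide (units for the daily volume are assumed to be records).
DAILY_VOLUME = 800_000_000         # records arriving per day
THROUGHPUT_PER_HOUR = 130_000_000  # records processed per hour on the 10-node test cluster

hours_needed = DAILY_VOLUME / THROUGHPUT_PER_HOUR
headroom = 24 / hours_needed  # how many times the daily volume fits into 24 hours

print(f"hours to process one day of data: {hours_needed:.2f}")
print(f"headroom factor within 24 hours: {headroom:.1f}x")
```

Under these assumptions, one day of data takes a little over six hours on the test cluster, which leaves capacity for volume growth before more nodes are needed.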

Next Steps
For more information on Trillium DQ for Big Data and our other
Syncsort Trillium data quality solutions, please visit:
https://www.syncsort.com/en/products/trillium-dq-for-big-data
And:
https://www.syncsort.com/en/integrate
