MDM AND THE DATA UNIFICATION IMPERATIVE
JAMES MARKARIAN | ADVISOR, TAMR
Data Heterogeneity is Inherent in Large Companies
Data sources are bound to applications with idiosyncratic bias
[Diagram: each function (Sales, Marketing, Manufacturing, HR, Support, Finance) with its own apps and data stores]
Aggregation of Data Creates Ambiguity/Complexity
Broad analytics create the need to bring data together from many sources
[Diagram: data from Sales, Marketing, Manufacturing, HR, Support, and Finance being aggregated]
Outside Forces = More Confusion + Complexity
● Leadership Changes
● Mergers & Acquisitions
● Reorganizations
Result: Just 10% of Data is Consumable by Any One Person
And 80% of data scientist time is spent preparing it
[Chart: 90% of data is dark]
Expectations for Global Corporate IT as Data Broker
Increasing quickly -- along with the hype about Big Data/Analytics 3.0
[Diagram: corporate IT brokering data across HR, Sales, Finance, Divisions, Marketing, MFG, and ENG]
Some Options
Option #1 - Deny Variety - use information that is easiest/closest
Option #2 - Manage Variety incrementally - using traditional approaches:
● Standardization
● Aggregation
● Master Data Management
● Rationalize Systems
● Throw Bodies at it
● Improve Individual Productivity
Option #3 - Embrace Variety - using a probabilistic/model-based approach - Tamr
Option #2: “Manage” Variety Using Traditional Approaches
Traditional Data Management Approaches: Necessary but not sufficient
● Standardization
● Aggregation
● Master Data Management
● Rationalize Systems
● Throw Bodies at it
● Improve Individual Productivity
Logical Evolution to Probabilistic/Model-Based Approach
[Chart: today, mostly deterministic with a small probabilistic share; in the future, mostly probabilistic with a smaller deterministic share]
Probabilistic (Tamr) complements, NOT replaces, Deterministic (MDM)
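To make the complement concrete, here is a minimal, illustrative Python sketch (not Tamr's implementation) contrasting a deterministic MDM-style match rule with a probabilistic similarity score; the record fields, weights, and threshold are invented for the example.

```python
# Illustrative only: a deterministic MDM-style rule vs. a probabilistic
# similarity score. Fields, weights, and threshold are invented.
from difflib import SequenceMatcher

def deterministic_match(a: dict, b: dict) -> bool:
    # Rule-based: match only if normalized key fields are exactly equal.
    return (a["tax_id"] == b["tax_id"]
            and a["country"].strip().upper() == b["country"].strip().upper())

def probabilistic_match(a: dict, b: dict, threshold: float = 0.6):
    # Model-style: blend fuzzy field similarities into a confidence score.
    sim = lambda x, y: SequenceMatcher(None, x.lower(), y.lower()).ratio()
    score = 0.6 * sim(a["name"], b["name"]) + 0.4 * sim(a["address"], b["address"])
    return score >= threshold, round(score, 2)

r1 = {"tax_id": "12-345", "country": "US", "name": "Acme Corp.", "address": "1 Main St"}
r2 = {"tax_id": "12345", "country": "us", "name": "ACME Corporation", "address": "One Main Street"}

print(deterministic_match(r1, r2))   # False: brittle to formatting differences
print(probabilistic_match(r1, r2))   # (True, 0.68): graded confidence survives the noise
```

The slide's point holds in miniature here: the rule is precise where data is clean, while the score degrades gracefully where it isn't, which is why the two approaches complement rather than replace each other.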
INTRODUCING TAMR
▪ Founded in 2013 by enterprise database software veterans
▪ World-class engineering team
▪ Top-tier venture backing (Google Ventures, NEA)
Team: Jerry Held, PhD; Andy Palmer; Mike Stonebraker, PhD; Ihab Ilyas, PhD; Kevin Burke; Nidhi Aggarwal, PhD; Min Xiao; Nik Bates-Haus; Kevin Willis
“Embrace” Variety -- Tamr’s NextGen Approach
Managing enterprise information as an asset requires a new, bottom-up design pattern:
● Catalog ALL your metadata and map it to logical entities
● Connect entities and attributes to remove information silos
● Consume unified data in the application of your choice via APIs
Tamr’s Design Pattern: “Back to the Future”
1990’s Web: Yahoo’s top-down organization
2020’s Enterprise: Probabilistic data source cataloging, connection and consumption
ARCHITECTURE
[Diagram: data & metadata sources (DB, ERP, CRM, CSV) feed Tamr’s data connection activities (Data Profiling, Schema Matching, Record Deduplication), supported by Machine Learning, Expert Sourcing, Data Security, and Data Governance, flowing out to data uses: analytics, visualization, and the data warehouse]
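As one illustration of the Schema Matching activity named above, the toy sketch below scores candidate column pairings across two sources by name similarity and keeps the best match. The column names are invented, and real systems (Tamr included) also use value distributions, data types, and trained models, so treat this as a conceptual sketch only.

```python
# Toy schema matching: pair columns across two sources by name similarity.
# Invented column names; name similarity alone is a weak signal in practice.
from difflib import SequenceMatcher

def column_similarity(col_a: str, col_b: str) -> float:
    norm = lambda s: s.lower().replace("_", " ").replace("-", " ")
    return SequenceMatcher(None, norm(col_a), norm(col_b)).ratio()

def match_schemas(source_cols, target_cols, threshold=0.5):
    mapping = {}
    for a in source_cols:
        best = max(target_cols, key=lambda b: column_similarity(a, b))
        score = column_similarity(a, best)
        if score >= threshold:            # below threshold: route to a human expert
            mapping[a] = (best, round(score, 2))
    return mapping

erp_cols = ["SUPPLIER_NAME", "SUPPLIER_ADDR", "PAYMENT_TERMS"]
crm_cols = ["vendor_name", "vendor_address", "pay_terms", "account_owner"]

print(match_schemas(erp_cols, crm_cols))
# {'SUPPLIER_NAME': ('vendor_name', 0.58), 'SUPPLIER_ADDR': ('vendor_address', 0.52),
#  'PAYMENT_TERMS': ('pay_terms', 0.82)}
```

The low scores on the first two pairs hint at why probabilistic systems combine many signals and keep experts in the loop rather than trusting a single string metric.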
TAMR WORKS WITH MDM SYSTEMS TO HANDLE EXTREME DATA VARIETY
[Diagram: MDM and the EDW handle a few well-understood sources (cleansing, consolidation, survivorship, governance); Tamr handles the long tail of disparate data sources, publishing keys, a schema map, and matches & rules back to MDM and on to rapid analytics]
Benefits
● Business agility
● Faster MDM implementations (months -> weeks)
● Significantly lower ongoing maintenance
Fortune 50 company -- Optimized Sourcing Analysis
Benefits
● Massive reductions in supplier list size & number of distinct suppliers
● Automated data maintenance; lower cost of ownership
● Powering strategic sourcing analytics and governance
● Empowering the individual procurement team with a global view of payment terms
Catalog: Tamr helps you catalog metadata across the entire enterprise, providing a logical map of all of your information
Connect: Tamr helps match entities and attributes across the full variety of your sources, leveraging entity relationships for high accuracy
Consume: Tamr provides a consolidated view of entities and records for downstream applications via a set of RESTful APIs
Learn more at tamr.com | Find us at Booth #613
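The Consume step above exposes unified records over RESTful APIs. Below is a hypothetical sketch of what paging through a consolidated entity view might look like from Python; the host, endpoint paths, parameters, field names, and bearer token are all illustrative assumptions, not Tamr's documented API.

```python
# Hypothetical REST consumption sketch; endpoint and fields are invented.
import requests

BASE_URL = "https://unify.example.com/api/v1"   # placeholder host

def fetch_unified_records(entity: str, page_size: int = 100):
    """Page through a consolidated entity view, yielding one record at a time."""
    offset = 0
    while True:
        resp = requests.get(
            f"{BASE_URL}/entities/{entity}/records",
            params={"limit": page_size, "offset": offset},
            headers={"Authorization": "Bearer <token>"},  # placeholder credential
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("records", [])
        if not batch:
            return
        yield from batch
        offset += page_size

# Feed the unified view into the application of your choice.
for record in fetch_unified_records("supplier"):
    print(record.get("canonical_name"), record.get("source_ids"))
```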

Editor's Notes

  • #2: Key Messages: Introduce yourself as James Markarian. I am currently an EIR at Khosla Ventures. Prior to Khosla, I spent 15 years as the CTO of Informatica, a leader in the ETL space, where I focused on <x>. Recently, I joined Tamr, a company focused on unifying and enriching internal and external data for enterprise analytics, to advise on product architecture and strategy. Today I’ll be speaking about how data variety, the natural, siloed nature of data as it’s created, is creating a bottleneck to analytics, and how deterministic data unification approaches alone aren’t sufficient to scale to the hundreds or thousands of data silos found within the enterprise.
  • #3: Heterogeneity of information sources is natural in large companies. Much of the roughly $3-4 trillion invested in enterprise software over the last 20 years has gone toward building and deploying systems and applications that automate and optimize key business processes in the context of specific functions (sales, marketing, manufacturing) and/or geographies (countries, regions, states, etc.). These systems produce data, and they do so in a very idiosyncratic manner: as each idiosyncratic application is deployed, an equally idiosyncratic data source is created. The result is that the data tied to enterprise software investments is extremely heterogeneous and siloed; broad use of the data has been secondary to the primary activity of automating business processes. The data is almost like the idiosyncratic exhaust of all of these applications. It’s not surprising (actually, it’s natural) that information across a large enterprise is disconnected and managed more as the exhaust of 30+ years of business process automation; think of this as a form of enterprise information entropy. The effort to standardize on single-vendor platforms, and to create enterprise-wide data warehouses, has largely been an attempt to compensate for this natural variety/entropy. Ironically, the top-down approaches used to rationalize onto a single platform or implement most warehouses (deterministic ETL, Master Data Management, and waterfall data management methods) created not fewer silos but additional, larger silos that increased the overall variety of data sources within an organization.
  • #5: On top of the historical pull toward application- and organization-specific data sources, these systems get even more complicated and disconnected when you add the confusion and complexity that results from: M&A events every quarter, reorganizations every 6-12 months, and changes in leadership every few years.
  • #6: Objective estimates of the scale of this problem are surprising. Industry analysts estimate that: 90% of big data is dark (not used or cataloged within the enterprise); 90% of collected data isn’t consumable (it requires significant work to be useful); and 80% of data scientist time is spent preparing data for consumption. In short, data is not being managed as an asset.
  • #7: This challenge is only going to become more critical, especially as expectations of Global Corporate IT as data broker increase quickly along with the hype around Big Data/Analytics 3.0. As we look forward to the next 20 years, most companies have begun investing heavily in Big Data analytics: $44 billion in 2014 alone, according to Gartner. << insert reference to Data/Analytics being the top priority for CIOs >> In this context, merely managing all of a company’s data as an asset presents a significant challenge for a globally missioned IT organization. Now enter the trend toward Big Data and Analytics 3.0, and the already difficult problem of managing data variety becomes a strategic imperative for an IT organization that is expected to integrate analytics and data seamlessly and quickly across all of these idiosyncratic silos, so that users armed with great new democratized viz tools can actually put them to work. We’d like to think that our data integration and preparation capabilities are advanced enough to service this great democratization, and that our “plumbing” can treat the massive reserves of siloed, heterogeneous data. However, these aspirations, and the cool new viz tools available to everyone in the enterprise, require clean, unified data that spans all the silos, and most companies are finding this heterogeneity a massive, fundamental roadblock to effectively using state-of-the-art analytics and visualization tools. Big Data variety and heterogeneity is the dirty little secret of most enterprises, and while it’s not sexy to spend time cleaning and preparing data, unified data is as important to enterprise analytics as reliable water treatment is to providing clean drinking water to the population. All of this leaves Corporate IT organizations with several options for addressing the data variety problem as data brokers for their enterprise.
  • #8: Some orgs simply ignore the opportunity to convert variety into value, overwhelmed by the sheer volume of heterogeneous sources and data. So they go ahead and carve out their pile, go to their corner, and work with what they have.
  • #9: Traditional approaches to managing data are necessary but not sufficient to address the broad enterprise data variety problem. To realize the opportunity in variety, IT brokers need to recognize that their existing top-down tools and approaches cannot solve it alone. There is a long list of tools in the enterprise arsenal for tackling data variety, and I’ve tried all of them over the years. Master Data Management in particular: most top-down, deterministic data modeling efforts produce useful taxonomies, controlled vocabularies, and ontologies, but they require you to “tell” the various divisions what they are going to map to, which inevitably degrades into a debate about who is the master and who is the “slave.” These too are necessary but not sufficient to manage the broad variety of tabular data in most enterprises; there are always deviations from whatever the three-star wizards in labcoats responsible for the “master” reference data decide.
  • #10: Multiple approaches have emerged to deal with the data variety problem, with the current state dominated by extreme top-down management (95% deterministic to 5% probabilistic). I predict that the sheer number of data sources and the complexity of change will drive us toward a bottom-up approach (80% probabilistic to 20% deterministic). The only viable way to tame enterprise data variety is bottom-up, collaborative data curation that complements traditional MDM, ETL, data profiling, and data quality methods.
  • #12: A next-gen approach. We believe big companies should start by deploying a fundamentally new design pattern for data management, one that enables their organization to dynamically catalog, connect, and curate ALL of their enterprise information sources from the bottom up using a scalable and agile approach. Note that Tamr operationalizes this approach at scale, across the enterprise, NOT as another idiosyncratic solution, AND works with existing data management and analytics tools. Catalog: at the front end, Tamr now solves a very common problem: what data do I use to solve this problem? Connect: our emphasis has been on connecting diverse data sources across the enterprise, at scale; we are now expanding the platform to bring this level of scalable data unification and use across the enterprise. Consume/Curate: unified data doesn’t live in Tamr; we make it available to any downstream application or analytic tool, including something as simple as spreadsheets, via a set of RESTful APIs.
  • #13: This design pattern is not new; it mimics the design patterns of the modern World Wide Web, but is aimed at connecting the primary information asset of the enterprise: tabular data. In the mid-1990s, the early days of Yahoo!, they used library science professionals and top-down information management practices and tools to organize websites and web content for search. Over time it became clear that Google’s bottom-up, probabilistic approach to matching web content with search terms was far more scalable and effective, so much so that, as most of you know, Yahoo! decided to license Google’s tech. Inside the enterprise, tabular data sources, rather than websites, are the primary assets to be connected, and companies need a new set of tools to register/catalog, connect, and curate tabular data matched to the data and attributes that analytic users want and need. We believe that Tamr’s technology will be incorporated into existing legacy MDM, ETL, and data management tools much in the way that Yahoo! licensed Google.
  • #15: Tamr automates schema mapping using a bottom-up approach, and Tamr is the master for probabilistic keys. MDM provides capabilities for data cleansing, data consolidation, data survivorship, and active and passive data governance. Results: reduced MDM implementation time (months -> weeks) and reduced ongoing maintenance. Use Tamr without MDM for analytical use cases that prioritize velocity of analysis.
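As a minimal sketch of the “published keys” idea in the note above, the snippet below assigns one canonical key per matched cluster and builds the crosswalk an MDM system or EDW could consume; the cluster contents and key scheme are invented for illustration.

```python
# Invented example of a key crosswalk: each matched cluster of source
# records gets one canonical key; MDM consumes the crosswalk to
# consolidate records and apply survivorship rules.
import uuid

# Output of the matching step: each cluster groups source records judged
# to refer to the same real-world entity.
clusters = [
    {("ERP", "S-1001"), ("CRM", "V-77")},   # same supplier seen in two systems
    {("ERP", "S-2042")},                    # singleton: no duplicate found
]

crosswalk = {}
for cluster in clusters:
    canonical_key = str(uuid.uuid4())       # a real system would keep keys stable
    for source, record_id in cluster:
        crosswalk[(source, record_id)] = canonical_key

for (source, rid), key in sorted(crosswalk.items()):
    print(f"{source}:{rid} -> {key}")
```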
  • #16: Challenge: with thousands of suppliers spanning many P&Ls and ERP systems, the company has been challenged to maintain an accurate supplier master file (SMF) to drive strategic sourcing analysis. Solution: create a unified data model that leverages all relevant sources, including address, tax, and government data; machine learning algorithms continuously evaluate and remove potential SMF duplicates; and automated processing incrementally improves as validation is received from SMEs. Benefits: massive reductions in supplier list size and number of distinct suppliers; automated data maintenance and lower cost of ownership in production; powering strategic sourcing analytics and governance at a corporate level; and empowering the individual procurement team with a global view of payment terms. Here’s the link for the long-form write-up the team did, for background: https://docs.google.com/a/tamr.com/document/d/12JvLG4wr_PjpKOGlUyoDx6iVULCAkwm5bhHKMYP7vwU/edit?usp=sharing
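The note above describes a loop in which machine learning scores candidate duplicates and SME validation feeds back into the model. A hypothetical sketch of that expert-in-the-loop pattern, with invented features and thresholds (not the customer's or Tamr's actual pipeline), might look like this:

```python
# Expert-in-the-loop dedup sketch: a model scores candidate duplicate
# pairs; confident pairs are handled automatically, ambiguous ones go to
# SMEs, and SME verdicts become new training labels. All values invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Pairwise features: [name similarity, address similarity, same tax id]
X_labeled = np.array([[0.95, 0.90, 1], [0.20, 0.10, 0], [0.88, 0.75, 1], [0.30, 0.40, 0]])
y_labeled = np.array([1, 0, 1, 0])   # 1 = duplicate, 0 = distinct

model = LogisticRegression().fit(X_labeled, y_labeled)

candidates = np.array([[0.91, 0.85, 1], [0.55, 0.60, 0], [0.15, 0.05, 0]])
for features, p in zip(candidates, model.predict_proba(candidates)[:, 1]):
    if p > 0.9:
        print(features, "auto-merge")        # confident duplicate
    elif p < 0.1:
        print(features, "auto-keep")         # confident non-duplicate
    else:
        # Ambiguous: route to an SME; their answer is appended to the
        # labeled set so the next model fit is a little better.
        print(features, "route to expert")
```

This is how the note's "incrementally improves" claim cashes out: every expert verdict shrinks the ambiguous middle band over time, lowering the ongoing cost of ownership.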