www.kensu.io
DATA SCIENCE GOVERNANCE
Turn GDPR’s accountability principles
into added value for your business
Data Science Meetup - Milan - March 18
KENSU & ME
ANDY PETRELLA
- CEO & Founder -
Mathematics
Computer Science
Started with an enterprise stack for Data Scientists: Agile Data Science Toolkit
Pivot on internal component: Data Science Catalog
Main focus: Data Science Governance
Spark Notebook O’Reilly Training
TOPICS
1. Some thoughts on “Data Science”
2. Data Science Governance: What
3. Data Science Governance: How
4. GDPR: Accountability and Transparency Principles
5. How to leverage GDPR and Data Science to improve
or disrupt the Business
SOME THOUGHTS ON “DATA SCIENCE”
MACHINE LEARNING
Pioneers in the 1950s
AI winter in the 1970s, due to pessimism
Resurgence in the 1980s
Machine learning (and related techniques) has been in use since the 1990s (esp. SVM and RNN)
Deep learning sees widespread commercial use in the 2000s
Machine learning receives great publicity (read: buzz) in the 2010s
ref: https://en.wikipedia.org/wiki/Timeline_of_machine_learning
DATA SCIENCE: +ENGINEERING
Claim: “Data Scientist” coined by DJ Patil in 2008.
Pretty much when machine learning became part of software
In a way, when we added “engineering” to the mix
Also, engineering is even more prominent with Big Data and distributed computing
DATA SCIENCE: +EXPERIMENTATION
So much data available
So many tools, libraries, frameworks, …
So many things we can try
We have distributed computing now, right? => Let’s try everything
Discover new insights (and potentially new businesses)
DATA SCIENCE: RECAP
Maths: stats, machine learning and so on
Engineering: ETL, databases, computing frameworks, software, platforms, …
Creativity: “From business intelligence to intelligent business” (Michael Fergusson)
Data Science is an umbrella over all activities on data
DON’T BELIEVE ME?
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
DON’T WANT TO READ THE PAPER?
What about this 3-minute lecture in the
Google Machine Learning Crash Course,
talking about ML systems in production…
!!!
OR THIS ONE MAYBE?
Okay, it’s a 14-minute lecture (probably as long as reading the paper ^^),
talking about data dependencies
DATA SCIENCE GOVERNANCE: WHAT
DATA PIPELINE
A data pipeline connects activities on data, potentially involving
several technologies.
A pipeline is generally thought of as an end-to-end processing line that
solves one problem.
But parts of pipelines are reused to save computation, storage, time, …
Thus the interdependency between pipeline segments grows with the number of initiatives
GOAL: TAKE DECISION
Data Pipelines, connected together, aren’t created for the beauty of it.
The ultimate goal is always to take decisions.
Decisions are generally taken by, or linked to, humans with responsibilities
(even for self-driving cars, in case of a problem).
Given that pipelines are cut-and-wired, interleaved, …
how can we not be anxious when deploying the last piece used by the decision maker?
SOURCES OF ANXIETY
What if:
• one of the datasets used in the process suddenly shows different patterns?
• one of the tools, projects or similar is modified upstream?
• the insights deviate from reality?
• …
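The first "what if" above (a dataset suddenly showing different patterns) can be caught with a basic statistical check. Here is a minimal sketch, assuming a numeric column and a reference sample kept from development. The function name `mean_shift_alert`, the sample values and the threshold are all hypothetical, chosen for illustration only.

```python
from math import sqrt
from statistics import mean, stdev


def mean_shift_alert(reference, production, z_threshold=3.0):
    """Return True when the production mean drifts away from the reference mean.

    Compares the two sample means with a z-score built on a Welch-style
    standard error. The threshold is an assumption to tune per dataset.
    """
    se = sqrt(stdev(reference) ** 2 / len(reference)
              + stdev(production) ** 2 / len(production))
    if se == 0:
        # Both samples are constant: any difference in means is a shift.
        return mean(reference) != mean(production)
    z = abs(mean(production) - mean(reference)) / se
    return z > z_threshold


# Hypothetical samples of the same column at three points in time.
reference = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7]
same_pattern = [10.0, 10.2, 9.9, 10.1, 9.8, 10.3, 10.0, 9.9]
new_pattern = [12.4, 12.9, 12.1, 12.6, 12.8, 12.2, 12.5, 12.7]

print(mean_shift_alert(reference, same_pattern))
print(mean_shift_alert(reference, new_pattern))
```

A check on the mean alone misses changes in spread or shape, so a real setup would likely track several statistics per column.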
DEBUGGING?
To reduce the anxiety or, actually, the risks, we need ways to debug.
In pure engineering, we have unit, functional, integration tests, … but
what do we do when the problems come from the data itself?
We can’t generate all cases of data variations, right?
How to debug?
Without the big picture, we may try to optimise a model for weeks for nothing.
DATA SCIENCE GOVERNANCE
Data governance:
• checks that data meets precise standards
• involves monitoring against production data.
Data Science Governance:
• checks that data activity meets precise standards
• involves monitoring against production data activity.
A Data Activity is a phenomenon composed of Technologies, Users, Systems, Data and Processing
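To make the definition above concrete, here is one possible way to record a data activity. This is only a sketch: the class and field names are hypothetical and are not taken from any Kensu API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class DataActivity:
    """One observed activity: who ran what, where, and on which data."""
    user: str          # Users
    technology: str    # Technologies, e.g. "spark" or "pandas"
    system: str        # Systems, e.g. a cluster or host name
    process: str       # Processing: the job or notebook that ran
    inputs: tuple      # Data read by the process
    outputs: tuple     # Data written by the process
    observed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


# Hypothetical example values.
activity = DataActivity(
    user="alice",
    technology="spark",
    system="prod-cluster",
    process="churn_features",
    inputs=("crm.customers", "web.sessions"),
    outputs=("ml.churn_features",),
)
print(activity.process, "reads", activity.inputs, "writes", activity.outputs)
```

Keeping the record immutable (`frozen=True`) fits the idea of an audit trail: an observed activity is a fact that should not be edited afterwards.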
GOVERNING DATA SCIENCE
Who does what, on which data, and where?
What is the impact of a process on the global system?
What are the performance metrics (quality, execution, …) of the
processes?
CONTINUOUS INTEGRATION FOR DATA SCIENCE
Data Scientists/Citizens have a holistic view of their data,
systems and processes.
They also have control over their own results in production.
They have the opportunity to analyse and debug any pipeline
involving all activities:
• independently of the technologies
• involving several units in the enterprise
DATA SCIENCE GOVERNANCE: HOW
CHALLENGES
So many tools are using data!
The number of processes is growing impressively.
We have to take care of the legacy…
GET THE DATA
As usual, we have to collect the right data to take the right decisions.
First, run an assessment to create a high-level map of all the tools
used in the company.
For each data tool, do whatever it takes to collect information
describing its activities.
This information includes metadata, lineage, statistics, accuracy measures, …
CONNECT THE DATA
Data Science Governance needs the global picture.
To get it, we need to connect all the data that can be collected,
so that it is possible to create a cartography of all ongoing processes.
This map tracks all data and their descendants.
USE THE DATA
This is where the fun part starts… the map of data activities is an
amazing source of information
Here are a few things you can think of when using this kind of data:
• impact analysis
• dependency analysis
• pipeline optimisation
• data or model recommendation
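The first two items above can be sketched as graph traversals over the map of data activities. The example below assumes the map is available as a list of (upstream, downstream) lineage edges. The dataset names and helper functions are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream data, downstream data).
edges = [
    ("crm.customers", "ml.churn_features"),
    ("web.sessions", "ml.churn_features"),
    ("ml.churn_features", "ml.churn_model"),
    ("ml.churn_model", "dash.churn_kpi"),
    ("crm.customers", "dash.customer_count"),
]

downstream = defaultdict(set)
upstream = defaultdict(set)
for src, dst in edges:
    downstream[src].add(dst)
    upstream[dst].add(src)


def reachable(start, neighbours):
    """Breadth-first traversal returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def impact_of(data):
    """Impact analysis: everything derived, directly or not, from `data`."""
    return reachable(data, downstream)


def depends_on(data):
    """Dependency analysis: everything `data` was derived from."""
    return reachable(data, upstream)


print(sorted(impact_of("web.sessions")))
print(sorted(depends_on("dash.churn_kpi")))
```

The same traversal answers both questions. Only the direction of the edges changes, which is why a single connected map is enough for both analyses.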
GDPR
General Data Protection Regulation
ACCOUNTABILITY PRINCIPLE
Implement appropriate technical and organisational measures that
ensure and demonstrate that you comply. This may include internal
data protection policies such as staff training, internal audits of
processing activities, and reviews of internal HR policies.
TRANSPARENCY
As well as your obligation to provide comprehensive, clear
and transparent privacy policies, if your organisation has
more than 250 employees, you must maintain additional
internal records of your processing activities.
ACCOUNTABILITY: DATA SCIENCE GOVERNANCE
To govern data science, we have to:
• collect activities
• connect activities
Or… build and maintain the audit trails needed to
create measures that demonstrate accountability
TRANSPARENCY: DATA SCIENCE GOVERNANCE
To govern data science, seen as a continuous integration solution,
we have to monitor activities independently of the technologies.
With this information we can reliably and automatically create the
process registry, composed of the goals pursued and all data involved.
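As an illustration of building such a registry automatically, here is a small sketch that groups monitored activities by process. The record layout and field names are assumptions made for the example. It is not a legal template for GDPR records, nor the Kensu data model.

```python
from collections import defaultdict

# Hypothetical monitored activities; in practice these would be collected
# automatically from the tools, whatever the technology.
activities = [
    {"process": "churn_model", "goal": "reduce customer churn",
     "technology": "spark", "user": "alice",
     "data": ["crm.customers", "web.sessions"]},
    {"process": "churn_model", "goal": "reduce customer churn",
     "technology": "tensorflow", "user": "bob",
     "data": ["ml.churn_features"]},
    {"process": "newsletter", "goal": "marketing campaign",
     "technology": "pandas", "user": "carol",
     "data": ["crm.customers"]},
]


def build_registry(activities):
    """Group activities by process into one registry entry per process."""
    registry = defaultdict(lambda: {"goal": None, "data": set(),
                                    "users": set(), "technologies": set()})
    for act in activities:
        entry = registry[act["process"]]
        entry["goal"] = act["goal"]
        entry["data"].update(act["data"])
        entry["users"].add(act["user"])
        entry["technologies"].add(act["technology"])
    return dict(registry)


registry = build_registry(activities)
for name, entry in sorted(registry.items()):
    print(name, "|", entry["goal"], "|", sorted(entry["data"]))
```

Because the entries are derived from observed activities rather than filled in by hand, the registry stays in sync with what actually runs.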
BUSINESS: IMPROVE AND DISRUPT
Connect data and business
Spoiler alert: one-liner ahead
DATA TO BUSINESS
Business KPIs are nothing but data!
BUSINESS TO DATA
Change the business to match the data
ADAPT!
KENSU
Making it real… yet, taking the idea even further
Kinda pitchy I know but meh… :-D
ARTIFICIAL INTELLIGENCE ON DATA SCIENCE
[Diagram. Roles: Scientist / Engineer, Manager, CTO, Business, CDO, DPO, Authority. Other labels: Solution, Activities, API, Governance, Compliance, Transformation, Machine Learning, Performance, Artificial Intelligence, Actionable, Data]
DATA FLOW, USERS, PROCESSES ALL-IN
Data sources, Schemas
Categories of data involved
Transitive lineage
Markers on privacy data involved
Users involved in the processes
Programs used to create/run the flow
INTUITIVE AND COMPREHENSIBLE REST API
Example in Python:

    dam_dependencies = [
        ProcessLineageDepsBuilder(input_schema, output_schema)
            .identity_from_output("data")
            .append('f', ['name', 'last'], "data"),
        ProcessLineageDepsBuilder(input_schema_2, output_schema)
            .identity_from_output("data"),
    ]
    dam_create_process_run_and_lineage(process, user,
                                       code_version, process_name,
                                       dam_dependencies)

High-level integration, like for Spark:

    // Initializing library to hook up to Apache Spark
    import io.kensu.dam.lineage.spark.lineage.Implicits._
    spark.track()

Automatically tracks
- data transformations
- data stats
- machine learning models
- performance of models
MONITORING MACHINE LEARNING PERFORMANCE
DEV / Offline: read cold data -> pick parameters -> <train>
PROD / Online: read hot data -> use parameters -> <train>
Automated Monitoring
- Create data flow: data -> prepared data -> model
- Register parameters
- Compute/Gather performance metrics
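The monitoring steps above can be sketched in a few lines: register the parameters and the metric of each run, then compare PROD against DEV. The `ModelMonitor` class, the tolerance and the toy predictions are hypothetical. This is not how the Kensu product is implemented.

```python
def accuracy(predictions, labels):
    """Fraction of predictions equal to their label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


class ModelMonitor:
    """Keeps the parameters and metric registered for each run of a model."""

    def __init__(self, tolerance=0.05):
        self.tolerance = tolerance
        self.runs = []  # one record per run, in registration order

    def register(self, environment, parameters, metric):
        self.runs.append({"environment": environment,
                          "parameters": parameters,
                          "metric": metric})

    def degraded(self):
        """True when the latest PROD metric is more than `tolerance`
        below the latest DEV metric."""
        dev = [r["metric"] for r in self.runs if r["environment"] == "DEV"]
        prod = [r["metric"] for r in self.runs if r["environment"] == "PROD"]
        if not dev or not prod:
            return False
        return dev[-1] - prod[-1] > self.tolerance


monitor = ModelMonitor(tolerance=0.05)
params = {"max_depth": 4}
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]

# DEV / offline: metric measured on cold data.
monitor.register("DEV", params,
                 accuracy([1, 0, 1, 1, 0, 1, 0, 0, 1, 1], labels))
# PROD / online: same parameters, metric measured on hot data.
monitor.register("PROD", params,
                 accuracy([1, 1, 1, 0, 0, 1, 1, 0, 1, 1], labels))

print(monitor.degraded())
```

Storing the parameters next to the metric means a drop in production can be traced back to the exact configuration that was picked offline.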
EASY TO INTEGRATE IN DEV ENVIRONMENT
Jupyter
Spark (Python, R, Scala)
Even notebooks!
Google Colab
Python, TensorFlow
OUR PRODUCT: KENSU DATA ACTIVITY MANAGEMENT
Data Science Governance
First Governance, Compliance and Performance solution for Data Science

Feature: Connect.Collect.Learn
Benefit: Automatically captures all data science relevant activities related to governance, compliance and performance within a given domain.
Why it matters: Provides end-to-end control and insights into all relevant aspects of data science related activities.
#GDPR

Feature: DPO Dashboard
Benefit: One-stop control center for all potential data privacy violations.
Why it matters: Near-realtime notifications and actionable intelligence on the current state of “compliance health”.
#GDPR

Feature: Compliance Reporting
Benefit: One-click reports for all relevant governance and compliance reports.
Why it matters: Guarantees a good relationship with the authorities in charge by respecting their templates.
#GDPR
BTW
Spark and Machine Learning training in Rome in June:
http://www.technologytransfer.eu/event/1779/Apache_Spark_and_Machine_Learning_Workshop.html

Interested in our way to think about ML and DS?
We have another 3-day training on this (Spark, TensorFlow, H2O, …)
(One in Rome to be scheduled this Fall)
DATA SCIENCE GOVERNANCE
Andy Petrella
CEO and Co-Founder
@noootsab
andy.petrella@kensu.io
@kensuio
Let’s chat after the talk o/
Or, contact me for:
- DAM (demo, pilot, …)
- training!
