Google Cloud Data Fusion
Drag & drop data pipelines
Balkan Misirli
Data Engineer @ Data Runs Deep
● Web Analytics (GA360) agency
● Google Cloud consulting partner
● Lots of BigQuery/Dataflow/Cloud Functions
Agenda (about 20 mins)
• What is Data Fusion?
• How does it compare?
• Demo
• Pricing + other details
• My first impressions
• Questions
A bit of background
● Data startup Cask developed the open source
software CDAP (Cask Data Application Platform)
● Google bought Cask last year
● GCP released Data Fusion in beta as a
managed CDAP service last month
What is Data Fusion / CDAP ?
● A set of tools to
wrangle/explore data and
create pipelines
● Completely drag & drop
interface (no coding)
● Enables sharing of created
pipelines within organisation
How does it run pipelines?
● Converts GUI input into a DAG
to run as a Dataproc job
● Ephemeral Hadoop MR/Spark cluster
● Can also run on existing cluster (Terraform)
● Soon to be available for Dataflow execution
● All of this runs on GKE in the back end
● No AUS in-country option yet
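To make the "GUI input into a DAG" point concrete, here is a minimal Python sketch. It is not CDAP's actual code or pipeline format, and the stage names (gcs_source, wrangler, joiner, bq_lookup, bq_sink) are hypothetical. It only shows that a drawn pipeline boils down to stages plus connections, and that a valid run order falls out of a topological sort.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline as drawn in the GUI: each connection is (from_stage, to_stage).
connections = [
    ("gcs_source", "wrangler"),
    ("wrangler", "joiner"),
    ("bq_lookup", "joiner"),
    ("joiner", "bq_sink"),
]


def execution_order(connections):
    """Return the stages in an order where every stage comes after its inputs."""
    sorter = TopologicalSorter()
    for upstream, downstream in connections:
        # add(node, *predecessors): the downstream stage depends on the upstream one.
        sorter.add(downstream, upstream)
    # static_order() raises graphlib.CycleError if the connections form a loop.
    return list(sorter.static_order())


order = execution_order(connections)
print(order)
```

Several orders can be valid for the same graph (the two sources are independent), so the sketch only guarantees that each stage appears after everything feeding into it.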
Batch or streaming pipelines?
● Only batch for Basic edition
● Both batch and streaming for Enterprise edition
● Batch jobs run either Hadoop MR or Spark
● Streaming jobs run Spark Streaming
Demo !
Hub - Library of existing pipelines / plugins
Dashboard - shows all jobs that ran recently
Upload your own plugins / drivers / libraries
Wrangler - explore your data
Visual Pipeline Builder
My first impressions
● Instance creation takes up to 30 mins - slow!
● Hadoop execution is slow
● Web UI is pretty decent and intuitive
● Good (but maybe excessive) logging capability
● Quirky beta style errors
● Will definitely save labour hours
The good parts
● Pretty intuitive and easy
● Somewhat configurable (Env/CPUs/placeholder vars, etc)
● Stackdriver logging and monitoring available
● Open source, can import/export CDAP jobs - no vendor lock in
● Maybe cheaper than other enterprise alternatives
● Don’t have to operate your own Spark cluster!
The parts that have an exciting
journey of improvement ahead!
● PERMISSIONS!
● Wrangler only shows the first 1000 rows - can be misleading when
filters/aggregations are applied
● Doesn’t do input validation until runtime - annoying
● Java error stacktraces in a GUI-based tool
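Why the 1000-row preview can mislead: a small Python sketch with made-up data, not Wrangler itself. The dataset, the `status` field and the 50/50 split are assumptions chosen to make the effect obvious. An aggregate computed on the first rows only can look very different from the same aggregate on the full data.

```python
SAMPLE_SIZE = 1000  # preview size mentioned in the talk

# Hypothetical dataset: 10,000 rows where the "error" rows only appear late in the file.
rows = [{"id": i, "status": "ok" if i < 5000 else "error"} for i in range(10_000)]


def error_rate(data):
    """Fraction of rows whose status is 'error'."""
    return sum(r["status"] == "error" for r in data) / len(data)


preview = rows[:SAMPLE_SIZE]  # what a first-N-rows preview would see

print("preview error rate:", error_rate(preview))
print("full error rate:   ", error_rate(rows))
```

Here the preview reports no errors at all, while half of the full dataset is errors. Any filter or aggregation checked only against the preview has the same blind spot.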
Random. But at least it looks nice
Thorough Java stacktraces - perfect for GUI users!
Basic vs. Enterprise
Enterprise Only
● Streaming
● Can run in production
● Data lineage tool
● Choice of execution env
● Schedules & Triggers
● Unlimited simultaneous
pipeline execution
Both Editions
● Batch
● Can run in Dev/Sandbox
● Unlimited users
● Wrangler tool
● Visual pipeline builder
● Basic is limited to 2
simultaneous pipelines
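The comparison above can be read as a simple decision rule. A rough Python sketch, using only the differences listed on this slide; the function and parameter names are made up for illustration and are not part of any Data Fusion API.

```python
BASIC_MAX_CONCURRENT = 2  # Basic limit on simultaneous pipelines, from the slide


def required_edition(needs_streaming=False, production=False,
                     needs_schedules=False, concurrent_pipelines=1):
    """Return the cheapest edition that covers the stated needs."""
    enterprise_only = (
        needs_streaming
        or production
        or needs_schedules
        or concurrent_pipelines > BASIC_MAX_CONCURRENT
    )
    return "Enterprise" if enterprise_only else "Basic"


print(required_edition())                        # batch-only dev/sandbox work
print(required_edition(needs_streaming=True))    # streaming pipelines
print(required_edition(concurrent_pipelines=3))  # more than 2 pipelines at once
```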
Pricing
● Priced in two parts: pipeline development + execution
● Development is USD $1.80 per hour (Basic) or
USD $4.20 per hour (Enterprise), billed by the minute
● First 120 hours of development on Basic edition are free
● Roughly USD $1100 per month for Basic, $3000 for Enterprise
● Execution is priced according to Dataproc VM pricing
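Where the monthly figures come from: a back-of-the-envelope Python sketch. The hourly rates are from the slide. The 730 hours per month (instance left running 24/7) and the free hours applying every month are assumptions made here to reproduce the rough totals. Dataproc execution costs are not included.

```python
HOURS_PER_MONTH = 730     # assumption: instance runs 24/7 (about 365 * 24 / 12)
BASIC_RATE = 1.80         # USD per hour, from the slide
ENTERPRISE_RATE = 4.20    # USD per hour, from the slide
BASIC_FREE_HOURS = 120    # from the slide; assumed here to apply every month


def monthly_dev_cost(rate, hours=HOURS_PER_MONTH, free_hours=0):
    """Development cost in USD for one month, after subtracting any free hours."""
    billable_hours = max(hours - free_hours, 0)
    return round(rate * billable_hours, 2)


basic = monthly_dev_cost(BASIC_RATE, free_hours=BASIC_FREE_HOURS)
enterprise = monthly_dev_cost(ENTERPRISE_RATE)

print(f"Basic:      ${basic:,.2f} per month")
print(f"Enterprise: ${enterprise:,.2f} per month")
```

Under these assumptions Basic comes to about $1,098 and Enterprise to about $3,066 per month, which matches the rounded figures above. Deleting the instance when it is not in use lowers the bill, since development is billed by the minute.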
Thanks !
I’ll share the slides on LinkedIn SlideShare
LinkedIn: linkedin.com/in/balkanmisirli
Email: balkan@datarunsdeep.com.au

Balkan - data eng meetup - data fusion
