Embulk Internals
Sadayuki Furuhashi
Founder & Software Architect, Treasure Data, Inc.

Execution overview

Transaction: runs on a single thread
  → Task × taskCount: runs on multiple threads (or machines)

Each task run receives the shared task and its own index:
  { taskIndex: 0, task: {…} }
  { taskIndex: 2, task: {…} }
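
A minimal sketch of this split, written in Python rather than Embulk's own Java. The transaction step runs once and only decides the shared task and taskCount; every unit of work then gets the same task plus its own taskIndex. The function names and the contents of task are hypothetical and are not Embulk's API.

```python
import json

def transaction(config):
    # Runs once on a single thread; decides the shared task and taskCount.
    task = {"files": config["files"]}   # hypothetical task contents
    task_count = len(config["files"])   # assumption: one task per file
    return task, task_count

def build_payloads(task, task_count):
    # Every unit of work gets the same task plus its own taskIndex.
    return [{"taskIndex": i, "task": task} for i in range(task_count)]

task, task_count = transaction({"files": ["a.csv", "b.csv", "c.csv"]})
payloads = build_payloads(task, task_count)

# Payloads are plain data, so they survive a round trip through JSON and
# could be handed to another thread or shipped to another machine.
wire = [json.dumps(p) for p in payloads]
print(wire[2])
```

Keeping the payload as plain data is what lets the same plan run on local threads or on remote workers.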

Parallel execution

Task queue (Task, Task, Task, Task) → Threads
Run tasks in parallel (embulk-executor-local-thread)
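
The queue-and-threads picture can be sketched as follows. This is an illustration of the general pattern only, not the embulk-executor-local-thread implementation; run_tasks_in_parallel and num_threads are hypothetical names.

```python
import queue
import threading

def run_tasks_in_parallel(task_count, run, num_threads=4):
    """Run run(task_index) for every index using a fixed set of threads."""
    task_queue = queue.Queue()
    for task_index in range(task_count):
        task_queue.put(task_index)

    results = [None] * task_count

    def worker():
        while True:
            try:
                task_index = task_queue.get_nowait()
            except queue.Empty:
                return  # queue drained, this thread exits
            # Each index is written by exactly one thread.
            results[task_index] = run(task_index)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

squares = run_tasks_in_parallel(8, lambda i: i * i)
print(squares)
```

Because results are stored by taskIndex, the output order does not depend on which thread picks up which task.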

Distributed execution

Task queue (Task, Task, Task, Task) → Map tasks
Run tasks on Hadoop (embulk-executor-mapreduce)

Distributed execution (w/ partitioning)

Task queue (Task, Task, Task, Task) → Map - Shuffle - Reduce
Run tasks on Hadoop (embulk-executor-mapreduce)
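
A small in-memory sketch of the Map - Shuffle - Reduce shape. It shows the general pattern only and says nothing about how embulk-executor-mapreduce is built; the partition key (a "day" field) and the record layout are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_task(records, partition_key):
    # Map: tag every record with the partition it belongs to.
    return [(partition_key(r), r) for r in records]

def shuffle(mapped_outputs):
    # Shuffle: gather records that share a partition key.
    partitions = defaultdict(list)
    for output in mapped_outputs:
        for key, record in output:
            partitions[key].append(record)
    return partitions

def reduce_task(key, records):
    # Reduce: one output unit per partition (here just a record count).
    return key, len(records)

inputs = [
    [{"day": "2015-01-01", "v": 1}, {"day": "2015-01-02", "v": 2}],
    [{"day": "2015-01-01", "v": 3}],
]
mapped = [map_task(chunk, lambda r: r["day"]) for chunk in inputs]
partitions = shuffle(mapped)
summary = sorted(reduce_task(k, v) for k, v in partitions.items())
print(summary)
```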

Transaction control

fileInput.transaction {                // file input plugin
  parser.transaction {                 // parser plugin
    filters.transaction {              // filter plugins
      formatter.transaction {          // formatter plugin
        fileOutput.transaction {       // file output plugin
          executor.transaction {       // executor plugin
            …                          // Task, Task
          }
        }
      }
    }
  }
}
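
One way to read the nesting: each plugin sets up before the inner plugins run and finishes only after they return, so the outermost plugin starts first and completes last. The sketch below records that order. The Plugin class and the begin/commit/abort labels are hypothetical and are not Embulk's interfaces.

```python
log = []

class Plugin:
    def __init__(self, name):
        self.name = name

    def transaction(self, control):
        # Set up before the inner plugins run, finish after they return.
        log.append(f"begin {self.name}")
        try:
            result = control()
        except Exception:
            log.append(f"abort {self.name}")
            raise
        log.append(f"commit {self.name}")
        return result

def nest(plugins, innermost):
    # Wrap the innermost action in each plugin's transaction, outermost first.
    control = innermost
    for plugin in reversed(plugins):
        control = (lambda p, inner: lambda: p.transaction(inner))(plugin, control)
    return control

order = ["fileInput", "parser", "filters", "formatter", "fileOutput", "executor"]
plugins = [Plugin(n) for n in order]
nest(plugins, lambda: log.append("run tasks"))()
print(log)
```

If the innermost step raises, every enclosing plugin sees the exception on the way out, which gives each one a chance to clean up.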

Task configuration

fileInput.transaction { fileInputTask, taskCount →
  parser.transaction { parserTask, schema →
    filters.transaction { filterTasks, schema →
      formatter.transaction { formatterTask →
        fileOutput.transaction { fileOutputTask →
          executor.transaction { →
            task = {
              fileInputTask,
              parserTask,
              filterTasks,
              formatterTask,
              fileOutputTask,
            }
            taskCount.times.inParallel { taskIndex →
              run(taskIndex, task)
            }
          }
        }
      }
    }
  }
}

taskCount is decided by input.
schema is decided by input, and may be modified by filters.
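
The data flow through the nested callbacks can be sketched like this: each transaction hands its own task (plus taskCount or schema) to the callback it wraps, and the innermost callback assembles everything. The formatter and file output levels are left out to keep the sketch short, and all function names and task contents are hypothetical.

```python
def file_input_transaction(control):
    file_input_task = {"paths": ["a.csv", "b.csv"]}  # hypothetical contents
    task_count = len(file_input_task["paths"])       # decided by the input
    return control(file_input_task, task_count)

def parser_transaction(control):
    parser_task = {"delimiter": ","}
    schema = ["id", "name"]                          # decided by the input side
    return control(parser_task, schema)

def filters_transaction(schema, control):
    filter_tasks = [{"add_column": "loaded_at"}]
    out_schema = schema + ["loaded_at"]              # filters may modify the schema
    return control(filter_tasks, out_schema)

def assemble():
    # Innermost callback sees every value produced by the outer levels.
    return file_input_transaction(lambda file_input_task, task_count:
        parser_transaction(lambda parser_task, schema:
            filters_transaction(schema, lambda filter_tasks, out_schema: {
                "taskCount": task_count,
                "schema": out_schema,
                "task": {
                    "fileInputTask": file_input_task,
                    "parserTask": parser_task,
                    "filterTasks": filter_tasks,
                },
            })))

plan = assemble()
print(plan["taskCount"], plan["schema"])
```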

Task execution

Inside each task:
file input plugin → parser plugin → filter plugins → formatter plugin → file output plugin

fileInput.open()
parser.run(fileInput, pageOutput)
formatter.open(fileOutput)
fileOutput.open()
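
A toy version of this chain, assuming a push-style pipeline in which the parser reads from the file input and pushes records into a pageOutput made of the filters and the formatter. The class names, the add/finish methods, and the CSV and TSV formats are all hypothetical stand-ins, not Embulk's interfaces.

```python
import io

class CsvParser:
    def run(self, file_input, page_output):
        # Read raw lines from the file input and push records downstream.
        for line in file_input:
            page_output.add(line.rstrip("\n").split(","))
        page_output.finish()

class UppercaseFilter:
    """Hypothetical filter sitting between the parser and the formatter."""
    def __init__(self, next_output):
        self.next_output = next_output

    def add(self, record):
        self.next_output.add([field.upper() for field in record])

    def finish(self):
        self.next_output.finish()

class TsvFormatter:
    def __init__(self, file_output):
        self.file_output = file_output

    def add(self, record):
        self.file_output.write("\t".join(record) + "\n")

    def finish(self):
        pass  # nothing is buffered in this sketch

file_input = io.StringIO("1,alice\n2,bob\n")  # stands in for fileInput.open()
file_output = io.StringIO()                    # stands in for fileOutput.open()
page_output = UppercaseFilter(TsvFormatter(file_output))  # formatter opened on fileOutput
CsvParser().run(file_input, page_output)
print(file_output.getvalue())
```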

Type conversion

Input type system (e.g. PostgreSQL):
  boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with zone, …

    ↓ Input plugin (parser plugin if input is file-based)

Embulk type system:
  boolean, long, double, string, timestamp

    ↓ Output plugin (formatter plugin if output is file-based)

Output type system (e.g. Elasticsearch):
  boolean, integer, long, float, double, string, array, geo point, geo shape, …
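
A sketch of the two conversion steps as lookup tables. The type names come from the slide, but the specific pairings are illustrative assumptions: real input and output plugins define their own conversions, and the timestamp to string choice on the output side in particular is a guess.

```python
# The five Embulk types listed on the slide.
EMBULK_TYPES = {"boolean", "long", "double", "string", "timestamp"}

# Input side (assumed pairings, e.g. for a PostgreSQL source).
POSTGRESQL_TO_EMBULK = {
    "boolean": "boolean",
    "integer": "long",
    "bigint": "long",
    "double precision": "double",
    "text": "string",
    "varchar": "string",
    "date": "timestamp",
    "timestamp": "timestamp",
    "timestamp with zone": "timestamp",
}

# Output side (assumed pairings, e.g. for an Elasticsearch destination).
EMBULK_TO_ELASTICSEARCH = {
    "boolean": "boolean",
    "long": "long",
    "double": "double",
    "string": "string",
    "timestamp": "string",  # assumption: written out as text
}

def convert(columns):
    """Map (name, source_type) pairs to (name, embulk_type, output_type)."""
    converted = []
    for name, source_type in columns:
        embulk_type = POSTGRESQL_TO_EMBULK[source_type]
        if embulk_type not in EMBULK_TYPES:
            raise ValueError(f"not an Embulk type: {embulk_type}")
        converted.append((name, embulk_type, EMBULK_TO_ELASTICSEARCH[embulk_type]))
    return converted

rows = convert([("id", "bigint"), ("name", "varchar"), ("born", "date")])
print(rows)
```

Routing everything through the small Embulk type set means an input plugin and an output plugin never need to know about each other's type systems.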
