Fluentd and WebHDFS
      & what makes it possible to write out_webhdfs in 30min.

                            TAGOMORI Satoshi (@tagomoris)
                                                  NHN Japan
                               Fluentd meetup 3 (2012/11/08)




Thursday, November 8, 2012
@tagomoris
              NHN Japan Corp (Web Service Division)

              Fluentd committer, plugin developer
                      fluent-agent-lite, ...


Use cases of Fluentd
         Monitoring, Notification and Visualization
              growthforecast, notifier, ikachan, ....
         Real-time aggregation
              datacounter, numeric-counter, numeric-aggregator, ..
         Real-time processing
              parser, exec_filter, ....



Log Collection !




Fluentd as log collector

         Many output plugins for various storage systems
              file, file-alternative
              mongo, couch, cassandra, redis, s3, ....


         Hadoooooooooooooooooooooooooooooooooooooop




Fluentd with HDFS
         To write data to HDFS:
              Java native protocol: HDFSClient.java
              hadoop fs -put
              libhdfs and its bindings (like scribed)
              Cloudera Hoop (2011/07-)
              +WebHDFS (Apache Hadoop 1.0-), +HttpFs (Apache Hadoop 2.0-)



fluent-plugin-webhdfs

         Output plugin to write data into HDFS


         Supports WebHDFS and HttpFs
         First release: 2012/05/20 by tagomoris
         v0.1.0 bundled with td-agent v1.1.10 (or later)




WebHDFS
         HTTP REST API of HDFS
         Clients communicate with the NameNode and all DataNodes
         (like HDFSClient)

                                     NameNode

                                                DataNode
              Client
                                                DataNode

                                                DataNode
                          HTTP
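
As a rough illustration of the REST API on this slide, the sketch below builds the URLs a client would use to create and append to a file. The `/webhdfs/v1` prefix and the `op=CREATE` / `op=APPEND` parameters come from the WebHDFS REST API; the host name, the file path, and the helper `webhdfs_url` are hypothetical, and no request is actually sent.

```ruby
require 'uri'

# Hypothetical helper: build a WebHDFS REST URL for one operation.
def webhdfs_url(host, port, path, op, params = {})
  query = URI.encode_www_form({ 'op' => op }.merge(params))
  URI::HTTP.build(host: host, port: port, path: "/webhdfs/v1#{path}", query: query)
end

# Step 1 of a write goes to the NameNode, which answers with a redirect
# to a DataNode; step 2 sends the data to that DataNode.
create = webhdfs_url('namenode.example.com', 50070, '/log/access.log',
                     'CREATE', 'overwrite' => 'false')
append = webhdfs_url('namenode.example.com', 50070, '/log/access.log', 'APPEND')

puts "PUT  #{create}"
puts "POST #{append}"
```

Because the second request goes straight to a DataNode, the client needs network access to every DataNode, which is the peer-to-peer topology drawn above.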

HttpFs
         Proxy server 'httpfs', provides REST API for HDFS
         Same method set as WebHDFS (unlike Hoop)
         Clients communicate with httpfs server only
                                            NameNode

                                                       DataNode
                              httpfs
              Client
                              server                   DataNode

                       HTTP            Java Native     DataNode




WebHDFS or HttpFs
         WebHDFS: Peer-to-Peer communication
              Jetty based HTTP server
              High throughput and stability
         HttpFs: Proxied and centralized communication
              Tomcat based HTTP server
              Simple network topology
              Relatively low performance and SPOF

Configuration: WebHDFS
         Use Apache Hadoop 1.0.0 (or later), CDH3u5, or CDH4 (or later)
         On the NameNode/DataNodes
              dfs.webhdfs.enabled=true
              dfs.support.append=true      (only CDH3u5 ?)
              dfs.support.broken.append=true (only CDH3u5 ?)
         In fluent-plugin-webhdfs (type webhdfs)
              host hostname.of.namenode
              port 50070
              path /hdfs/access.%Y%m%d_%H.${hostname}.log
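
To make the `path` setting concrete, here is a minimal sketch of how such a template could expand into a file name. It is not the plugin's actual implementation: `expand_path` and the host name `web01` are hypothetical, and the sketch simply substitutes `${hostname}` and then applies `strftime`.

```ruby
require 'socket'

# Hypothetical helper, not the plugin's real code:
# replace ${hostname} first, then let strftime fill in the time fields.
def expand_path(template, time, hostname = Socket.gethostname)
  time.strftime(template.gsub('${hostname}', hostname))
end

template = '/hdfs/access.%Y%m%d_%H.${hostname}.log'
puts expand_path(template, Time.utc(2012, 11, 8, 15, 0, 0), 'web01')
```

With `%H` in the template each node writes one file per hour, and `${hostname}` keeps nodes from appending to the same file.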

WebHDFS in NHN Japan

         BEFORE: 1400 Timeouts/day with Hoop
         Tue Aug 14 15:04:34 2012 +0900
         "fix to use webhdfs to write into hdfs"
         "2012-08-14 15:08:18 +0900: starting fluentd-0.10.25"
         Wed Aug 15 13:11:04 2012 +0900
         "fix timeouts for busy AM2-5"
         AFTER: 130 Timeouts from 08/16 to 11/07
         1.2-1.5 TB/day from 10 fluentd nodes
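
The figures on this slide can be put on a common per-day basis with a little arithmetic. The sketch below assumes both dates are in 2012, counts the period inclusively, and uses decimal units (1 TB = 10^12 bytes); those choices are assumptions, not statements from the slide.

```ruby
require 'date'

before_per_day = 1400.0 # timeouts/day with Hoop

# Days from 08/16 to 11/07, counted inclusively (year 2012 assumed).
days_after = (Date.new(2012, 11, 7) - Date.new(2012, 8, 16)).to_i + 1
after_per_day = 130.0 / days_after

puts "days observed:      #{days_after}"
puts "timeouts/day after: #{after_per_day.round(2)}"
puts "reduction factor:   #{(before_per_day / after_per_day).round}"

# 1.2-1.5 TB/day across 10 nodes.
[1.2, 1.5].each do |tb|
  per_node_gb = tb * 1000 / 10
  mb_per_sec = tb * 1e12 / 86_400 / 1e6
  puts "#{tb} TB/day: #{per_node_gb.round} GB/day/node, #{mb_per_sec.round(1)} MB/s total"
end
```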

CONCLUSION 1

         WebHDFS is good enough for:
              continuous appending to log files
              daily operations (move/remove/copy/head/tail) via
              client libraries (and your scripts)
         Fluentd and td-agent are good enough as:
              a log collector in front of Hadoop/HDFS



break




fluent-plugin-webhdfs
      commit log
         Thu May 17 18:20:15 2012 on 'fluent-plugin-webhdfs'
              "writing code": in fact, no lines of Ruby code....
         Sun May 20 19:01:26 2012 on 'xxxxx'
         (some commits)

         Sun May 20 19:35:34 2012 on 'fluent-plugin-webhdfs'
              "fix typo": tagged as v0.0.1



30min!?
         fluent-plugin-webhdfs
              120 lines (including blank lines and 'end' lines)
              65 lines of configuration
              very few lines of actual code


              WebHDFS operations by 'webhdfs' gem
              Output formatting by 'PlainTextFormatterMixin'
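
To give a feel for how little code remains once the HDFS operations live in a client library, here is a minimal sketch of an "append, or create on first write" step. `FakeClient` is a hypothetical in-memory stand-in so the example runs without a cluster; it is not the 'webhdfs' gem, and the assumption that a real client offers similarly named `append` and `create` calls should be checked against the gem itself.

```ruby
# Hypothetical in-memory stand-in for an HDFS client (not the webhdfs gem).
class FakeClient
  class FileNotFound < StandardError; end

  attr_reader :files

  def initialize
    @files = {}
  end

  def create(path, data)
    @files[path] = data.dup
  end

  def append(path, data)
    raise FileNotFound, path unless @files.key?(path)

    @files[path] << data
  end
end

# Append to the file, creating it when it does not exist yet.
def write_chunk(client, path, data)
  client.append(path, data)
rescue FakeClient::FileNotFound
  client.create(path, data)
end

client = FakeClient.new
write_chunk(client, '/hdfs/access.log', "line1\n")
write_chunk(client, '/hdfs/access.log', "line2\n")
puts client.files['/hdfs/access.log']
```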


webhdfs gem commit log

         Sun May 20 17:00:57 2012
         (15 commits)
         Sun May 20 19:01:26 2012
         "v0.3: add WebHDFS::Client"




fluent-mixin-*
         fluent-mixin-plaintextformatter
              output text data formatter
              webhdfs, file-alternative, hoop
         fluent-mixin-config-placeholders
              provides placeholders like '${hostname}' and '${uuid}' in
              configuration
              webhdfs, ping-message
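
The idea behind these mixins can be sketched with a plain Ruby module shared by several output classes. Everything named below (`PlainTextFormat`, `MemoryOutput`, `StdoutOutput`, the line layout) is hypothetical and does not reproduce the API of the real fluent-mixin gems.

```ruby
require 'json'

# Hypothetical mixin: formatting logic shared by several output plugins.
module PlainTextFormat
  def format_line(tag, time, record)
    "#{time.utc.strftime('%Y-%m-%dT%H:%M:%SZ')}\t#{tag}\t#{JSON.generate(record)}\n"
  end
end

# Each output reuses the formatter and only adds its own I/O.
class MemoryOutput
  include PlainTextFormat
  attr_reader :buffer

  def initialize
    @buffer = +''
  end

  def emit(tag, time, record)
    @buffer << format_line(tag, time, record)
  end
end

class StdoutOutput
  include PlainTextFormat

  def emit(tag, time, record)
    $stdout.write(format_line(tag, time, record))
  end
end

out = MemoryOutput.new
out.emit('access', Time.utc(2012, 11, 8, 12, 0, 0), { 'path' => '/', 'code' => 200 })
puts out.buffer
```

Because both classes include the same module, a change to the line format reaches every plugin at once, which is the "unified syntax/features" argument made in the conclusion.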


CONCLUSION 2
         Output plugins have many (complex) problems:
              communication, formatting, configuration formats, ...
         We CAN/MUST depend on existing GEMS!
         We SHOULD write fluent-mixin gems for other plugin
         developers!
              many features and much code can be shared by many
              plugins
              unified syntax/features across plugins

Questions?


               Thanks!



                      photo: crouton
                   thanks to @kbysmnr
