Data Analytics with
Hadoop/Hive on
Multiple Data Centers.

               Hirotaka Niisato
               GMO Internet, Inc.
About Myself
●
    Hirotaka Niisato(@hirotakaster)
●
    Programmer
●
    GMO Internet, SIProp Project
●
    Work
    Robotics Kinect Android Networking MAKE: Solr Volunteer ...
Data Analytics System
●
    KPI reporting system for Cloud System
●
    GMO Apps Cloud
●
    Over 500 Titles
    Mobage, GREE, mixi, Hangame, Facebook, niconico, etc.
●
    Data Center
    Japan, US (West Coast)
Analytics Specification
●
    Social Game Data KPI
    DAU/PV, Play Time, Sales
    A/B Testing, Conversion … etc


●
    Hourly, Daily, Weekly, Monthly


●
    In operation since 2010/06
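As a rough illustration of two of the KPIs listed above (DAU and PV), here is a minimal Python sketch. It is not the production Hive job. The row layout (log_date, userid, request) borrows column names from the Data Flow slide, and the sample values and the function name daily_kpi are invented for the example.

```python
from collections import defaultdict

# Hypothetical parsed log rows: (log_date, userid, request).
# The values are invented; only the column names come from the slides.
rows = [
    ("2012-07-26", "u1", "/game/top"),
    ("2012-07-26", "u1", "/game/play"),
    ("2012-07-26", "u2", "/game/top"),
    ("2012-07-27", "u1", "/game/top"),
]

def daily_kpi(rows):
    """Return {log_date: {"dau": distinct users, "pv": request count}}."""
    users = defaultdict(set)   # log_date -> set of user ids seen that day
    views = defaultdict(int)   # log_date -> number of requests that day
    for log_date, userid, _request in rows:
        users[log_date].add(userid)
        views[log_date] += 1
    return {d: {"dau": len(users[d]), "pv": views[d]} for d in views}

kpi = daily_kpi(rows)
```

The same grouping applies to hourly, weekly, and monthly reports; only the grouping key changes.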
System Architecture
[Diagram: SNS User, SNS Platform, Game Master, and the Cloud System with its Management System and Monitoring System. Each data center (Data Center A ... Data Center N) contains Cloud Server (Game Server), Logging Server, Scheduler, Hadoop/Hive, and MySQL (for Hive).]
Specification, Statistics
●
    Multiple NameNodes per data center
●
    Hardware Specification
    CPU: 8~16 CPUs (HT)
    MEM: 12~64 GB
    HD: RAID 1, 5, 1+0
●
    Statistics
    6,000,000 blocks/44,000 jobs/day
    Over 1,000 AP servers logging
Data Flow
[Diagram: Cloud Server (Game Server), Logging Server, Hadoop/Hive, Scheduler, Management System, HiveDriver; Filter → Hourly, Daily, Weekly, Monthly Report (A/B Testing, Conversion, DAU, etc.)]

Load statement:
load data local inpath 'hogehoge-access_log.*.log.gz'
overwrite into table original_logs
partition (log_date='2012-07-26', log_number=13);

Raw log table columns (original_logs):
host        string   from deserializer
identity    string   from deserializer
user        string   from deserializer
time        string   from deserializer
method      string   from deserializer
request     string   from deserializer
status      string   from deserializer
size        string   from deserializer
referer     string   from deserializer
agent       string   from deserializer
log_date    string
log_number  tinyint

Table columns after filtering:
host        string
time        string
method      string
request     string
userid      string
log_date    string
log_number  tinyint
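The load statement above targets one (log_date, log_number) partition. A scheduler could generate that statement per partition roughly as in the sketch below. This is an assumption-laden illustration: the function name load_statement is hypothetical, and treating log_number as the hour of day is a guess, since the slide only shows log_number=13 and does not define it.

```python
from datetime import datetime

def load_statement(ts: datetime, prefix: str = "hogehoge-access_log") -> str:
    """Build the LOAD DATA statement for the partition that covers `ts`.

    Assumption: log_number is the hour of day (0-23). The slide does not
    state this; it only shows log_number=13.
    """
    log_date = ts.strftime("%Y-%m-%d")
    log_number = ts.hour
    return (
        f"load data local inpath '{prefix}.*.log.gz' "
        f"overwrite into table original_logs "
        f"partition (log_date='{log_date}', log_number={log_number})"
    )

stmt = load_statement(datetime(2012, 7, 26, 13, 5))
```

Because the statement uses overwrite, re-running it for the same partition replaces that partition's data rather than appending to it.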
Conversion Count HQL
-- Count conversion clicks per conversion id and A/B flag for one (log_date, log_number) partition.
-- :logDate, :logNumber, :logMonth and :logWeek are placeholders, presumably substituted by the caller.
INSERT OVERWRITE TABLE conversion_click
 PARTITION (log_date = :logDate, log_number = :logNumber)
   SELECT regexp_extract(request, 'convid=([a-zA-Z0-9%]+)', 1),
          regexp_extract(request, 'convflg=(A|B)', 1),
          count(1),
          :logMonth,
          :logWeek
     FROM parsed_log
    WHERE request RLIKE 'convid=[a-zA-Z0-9%]'
      AND request RLIKE 'convflg=(A|B)'
      AND log_date = :logDate
      AND log_number = :logNumber
 GROUP BY regexp_extract(request, 'convid=([a-zA-Z0-9%]+)', 1),
          regexp_extract(request, 'convflg=(A|B)', 1)
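To make the grouping in the query above concrete, here is a small Python sketch that performs the same filter, extract, and count steps on a few invented request strings. It is an illustration, not the Hive execution path. The function name conversion_counts and the sample URLs are hypothetical, and the convid pattern uses `+` so that the whole id is captured, which is an assumption about the intended behavior.

```python
import re
from collections import Counter

# Same character classes as the HQL patterns; `+` captures the full id (assumption).
CONVID = re.compile(r"convid=([a-zA-Z0-9%]+)")
CONVFLG = re.compile(r"convflg=(A|B)")

def conversion_counts(requests):
    """Count requests per (convid, convflg) pair, like the GROUP BY in the query."""
    counts = Counter()
    for request in requests:
        convid = CONVID.search(request)
        convflg = CONVFLG.search(request)
        if convid and convflg:  # mirrors the two RLIKE filters
            counts[(convid.group(1), convflg.group(1))] += 1
    return counts

# Invented sample requests for illustration only.
requests = [
    "/lp?convid=abc123&convflg=A",
    "/lp?convid=abc123&convflg=A",
    "/lp?convid=abc123&convflg=B",
    "/lp?convid=xyz&convflg=C",  # dropped: flag is neither A nor B
    "/top",                      # dropped: no conversion parameters
]
counts = conversion_counts(requests)
```

Comparing the A and B counts for one convid is the basis of the A/B conversion report.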
Monitoring/Management(Zabbix)
Memory Management
●
    NameNode Memory
    File, Block, Directory



●
    Hadoop Archive


●
    Server Memory
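The NameNode keeps every file, block, and directory in memory, so heap use scales with the object count. The sketch below gives a back-of-the-envelope estimate using the commonly quoted rule of thumb of about 150 bytes per object. That figure is an assumption, not a number from the slides, and real usage varies by Hadoop version and path lengths. Only the 6,000,000 blocks figure comes from the Statistics slide; the file and directory counts are invented.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb assumption, not a measured value

def namenode_heap_bytes(files: int, blocks: int, directories: int) -> int:
    """Rough NameNode metadata size: one in-memory object per file, block, directory."""
    return (files + blocks + directories) * BYTES_PER_OBJECT

# blocks=6,000,000 is from the Statistics slide; files and directories are invented.
estimate = namenode_heap_bytes(files=5_000_000, blocks=6_000_000, directories=100_000)
estimate_gib = estimate / 2**30
```

Under these assumptions the metadata alone is on the order of 1.5 GiB. Packing many small files into a Hadoop Archive lowers the file and block counts, which is one way to reduce this figure.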
Trouble
●
    Re-Analytics
●
    Backup and Recovery
●
    NameNode HA
●
    Hive vs MapReduce
Thank you
