Taobao Hadoop Data Analysis in Practice
Taobao Data Platform and Products Department
Zhou Min (Zhou Chen)

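Agenda
- History of data analysis technology selection
- Introduction to Hadoop
- System architecture
- Cluster overview
- Recent modifications to Hadoop
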
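History of Taobao's data analysis technology selection
- webalizer
- AWStats
- 般若 & Oracle
- Atpanel & Oracle RAC
  - Logs peaked at 250 GB/day
  - Up to about 50 jobs
  - Running more than 20 hours per day
  - Oracle RAC cluster of up to 20 nodes
- Hadoop
- Hive
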
What is Hadoop

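Current architecture
- Skynet (天网) scheduling system
- Oracle standby database
- Crawler data
- MySQL standby database
- Log system
- TimeTunnel
- DataExchange
- DataSync
- Gateway Servers
- Hadoop Cluster: Yunti 1 (云梯1)
  - MapReduce Java Jobs
  - Streaming Jobs
  - Hive Jobs
- Data Platform
- Search
- Alipay
- B2B
- Yunti 2 (云梯2)
- Koubei
- Advertising
- BI
- Data Cube (数据魔方)
- Quantum Statistics (量子统计)
- Taoshuju (淘数据)
- Recommendation system
- Search rankings
- …
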
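Scale
- Total capacity 27.79 PB, utilization 51.06%
- More than 1,600 machines in total
- About 66 million files
- 12 TB or 24 TB of disk per machine
- About 40,000 jobs per day
- About 1.7 PB of data scanned per day
- About 255 TB of data produced per day
- 820 users in 67 user groups
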
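The figures on this slide can be combined into a few derived numbers. The sketch below is back-of-the-envelope arithmetic only: it assumes decimal units (1 PB = 1,000 TB) and treats "1,600+" as exactly 1,600 machines, so the per-machine value is an upper bound on the average. The constant and variable names are illustrative.

```python
# Figures taken from the "Scale" slide.
TOTAL_CAPACITY_PB = 27.79
UTILIZATION = 0.5106
MACHINES = 1600            # slide says "1,600+", so this is a lower bound
JOBS_PER_DAY = 40_000
SCANNED_PB_PER_DAY = 1.7
PRODUCED_TB_PER_DAY = 255

# Assumption: decimal units, not binary (PiB/TiB).
TB_PER_PB = 1000
GB_PER_TB = 1000

# Space actually in use across the cluster.
used_pb = TOTAL_CAPACITY_PB * UTILIZATION

# Average raw capacity per machine (upper bound, since MACHINES is a lower bound).
capacity_per_machine_tb = TOTAL_CAPACITY_PB * TB_PER_PB / MACHINES

# Average data scanned and produced by a single job.
scanned_per_job_gb = SCANNED_PB_PER_DAY * TB_PER_PB * GB_PER_TB / JOBS_PER_DAY
produced_per_job_gb = PRODUCED_TB_PER_DAY * GB_PER_TB / JOBS_PER_DAY

print(f"used capacity:        {used_pb:.2f} PB")
print(f"capacity per machine: {capacity_per_machine_tb:.2f} TB")
print(f"scanned per job:      {scanned_per_job_gb:.1f} GB")
print(f"produced per job:     {produced_per_job_gb:.3f} GB")
```

The average capacity per machine falls between the 12 TB and 24 TB configurations listed on the slide, which is consistent with a cluster that mixes both.
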
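JobTracker optimization
- YunTi scheduler
- Finer-grained heartbeat locking
- JobHistory pages split out
- Log4j configuration and usage optimized
- MapReduce simulator

NameNode optimization
- NFS configuration
- Synchronized locks replaced with read-write locks
- Multithreaded RPC readers
- New RPC introduced to speed up job submission
- Optimistic locking
- Throughput improved more than 20-fold; OPS reaches 40,000
- Faster restart: startup time is about 1/3 of the original
- NNThroughputBenchmarkMixed
- HDFS simulator
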
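To illustrate why replacing a single exclusive (synchronized) lock with a read-write lock helps a read-heavy service, here is a minimal sketch. It is not the Taobao patch: Hadoop is written in Java, where the usual tool is `java.util.concurrent.locks.ReentrantReadWriteLock`, and the `ReadWriteLock` and `Namespace` classes below are hypothetical stand-ins written in Python for brevity.

```python
import threading
from contextlib import contextmanager


class ReadWriteLock:
    """Many concurrent readers OR one writer (no fairness guarantees)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of threads currently reading
        self._writer = False   # True while a writer holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writer:            # readers wait only for a writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:         # last reader wakes waiting writers
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

    @contextmanager
    def read_locked(self):
        self.acquire_read()
        try:
            yield
        finally:
            self.release_read()

    @contextmanager
    def write_locked(self):
        self.acquire_write()
        try:
            yield
        finally:
            self.release_write()


class Namespace:
    """Hypothetical in-memory file table guarded by the lock above."""

    def __init__(self):
        self._lock = ReadWriteLock()
        self._files = {}

    def create(self, path, size):
        with self._lock.write_locked():    # mutations are exclusive
            self._files[path] = size

    def stat(self, path):
        with self._lock.read_locked():     # lookups may run concurrently
            return self._files.get(path)


ns = Namespace()
ns.create("/logs/part-00000", 128)
print(ns.stat("/logs/part-00000"))
```

With a single exclusive lock, every `stat` call would serialize behind every other call. Here lookups block only while a mutation is in progress. This simple version can starve writers under constant read load, which a production lock would need to address.
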
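Storage optimization
- Extreme Storage (极限存储)
  - Uses incremental storage
  - Builds a clustered index on table data
  - Locates the snapshot for a given day or time range
  - Shrinks the storage footprint of core tables on Yunti to about 1/30 on average
  - Has already saved 3 PB of space
- Compressing historical data
  - Uses BZip2 compression
  - LZMA2 compression has been developed and is awaiting rollout
- Hadoop RAID
  - Derived from Facebook's version, with Placement Mover added
  - Being rolled out; expected to save a further 3 PB of space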
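
The slide does not describe how Extreme Storage is implemented. One common way to combine incremental storage with per-day snapshot lookup is to store each row version once, tagged with the date range in which it was valid, and to filter on that range at query time. The sketch below shows that idea under this assumption; `RowVersion`, `snapshot`, and `visible_between` are hypothetical names, and the sample rows are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date

OPEN_END = date.max  # marks a row version that is still current


@dataclass(frozen=True)
class RowVersion:
    key: str
    value: str
    valid_from: date   # first day this version is visible
    valid_to: date     # first day it is no longer visible (exclusive)


def snapshot(versions, day):
    """Return {key: value} as the table looked on `day`."""
    return {v.key: v.value
            for v in versions
            if v.valid_from <= day < v.valid_to}


def visible_between(versions, start, end):
    """Return row versions visible on at least one day in [start, end]."""
    return [v for v in versions
            if v.valid_from <= end and v.valid_to > start]


# Three stored versions stand in for three full daily copies of the table.
versions = [
    RowVersion("item-1", "price=10", date(2011, 1, 1), date(2011, 1, 3)),
    RowVersion("item-1", "price=12", date(2011, 1, 3), OPEN_END),
    RowVersion("item-2", "price=99", date(2011, 1, 2), OPEN_END),
]

print(snapshot(versions, date(2011, 1, 2)))
print(snapshot(versions, date(2011, 1, 3)))
```

Rows that do not change between days are stored only once, which is where the space saving would come from in a scheme like this. Sorting or indexing the stored versions by their validity dates lets a query for one day skip most of the data.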
