The Evolution of an Open Data Platform with Alluxio

Alluxio and the Open-Source Big Data Ecosystem:
2021 in Review and the Road Ahead
Bin Fan, VP of Open Source, Alluxio

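About Me
apc999
binfan@alluxio.com
Bin Fan
• Founding member and VP of Open Source at Alluxio
• PhD in Computer Science from Carnegie Mellon University
• Before joining Alluxio, worked on next-generation distributed storage at Google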

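Alluxio Overview
A global first: the world's first distributed, hyperscale data orchestration system.
Rooted in academia and industry: incubated at the UC Berkeley AMPLab as the PhD thesis topic of founder Dr. Haoyuan Li.
Globally developed: committed to an open-source, open vision from the start, the project has been open source worldwide since its incubation. More than 300 organizations and more than 1,100 contributors have taken part in development so far.
Proven in deployment: eight of the world's ten largest internet companies run Alluxio in production, and it has been validated in production by modern, web-scale data services around the world.
Funding: as of November 2021, Alluxio had completed three funding rounds, raising more than 70 million US dollars in total from top-tier global venture capital firms.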

The Big Data Ecosystem and Alluxio:
Yesterday

The Big Data Ecosystem 10 Years Ago

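Project origin: UC Berkeley AMPLab
Prototype: Alluxio (formerly Tachyon) began in the AMPLab as a sister project of Apache Spark. It explored how distributed techniques could manage off-heap memory in a unified way, giving Apache Spark applications memory-speed data access.
Background: in 2013, the UC Berkeley AMPLab focused on big data and had already released two popular open-source projects, Apache Spark and Apache Mesos.
People: the project was led by Haoyuan Li (then a PhD student at the AMPLab), with contributions from other faculty and students in the lab.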

Introducing Alluxio (Tachyon) in 2015
[Screenshot of a Tachyon talk at an AMPLab event]

The Big Data Ecosystem and Alluxio:
Today

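● Open source on GitHub since 2013, with more than 32,000 commits
● More than 1,100 contributors from over 100 organizations worldwide
● A global set of project PMC members from universities and technology companies, including Alluxio, Tencent, Alibaba, Nanjing University, Google, Meta (Facebook), Uber, and IBM
● In 2020, ranked 9th among the most influential Java open-source projects by the open-source criticality metric from Google and OpenSSF [1]
[1] Google Comes Up With A Metric For Gauging Critical Open-Source Projects
Built on the Open-Source Community: Globally Collaborative Development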

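Alluxio Cloud Data Orchestration Platform
Orchestrates data across clusters, regions, and any cloud (private, public, or hybrid), bringing it closer to data analytics and AI/ML applications so that those applications get memory-speed data access.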
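
The idea of keeping hot data close to compute can be illustrated with a read-through cache placed in front of a slow remote store. The following is only a minimal sketch under stated assumptions: it is not Alluxio's implementation or API, and `ReadThroughCache`, `fetch_remote`, the LRU eviction policy, and the `/data/...` paths are hypothetical names and choices made for illustration.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Toy read-through cache with LRU eviction (hypothetical, not Alluxio code)."""

    def __init__(self, fetch_remote: Callable[[str], bytes], capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._fetch_remote = fetch_remote  # stand-in for a slow remote store
        self._capacity = capacity          # maximum number of cached objects
        self._cache: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._cache:
            self.hits += 1
            self._cache.move_to_end(path)    # mark as most recently used
            return self._cache[path]
        self.misses += 1
        data = self._fetch_remote(path)      # slow path: go to the remote store
        self._cache[path] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return data


# Usage: a plain dict stands in for remote object storage.
remote_store = {"/data/a": b"alpha", "/data/b": b"beta", "/data/c": b"gamma"}
cache = ReadThroughCache(remote_store.__getitem__, capacity=2)
for p in ["/data/a", "/data/a", "/data/b", "/data/c", "/data/a"]:
    cache.read(p)
print(cache.hits, cache.misses)
```

The caller only ever asks for a path. Whether the bytes come from the local cache or the remote store is hidden behind `read`, which is the level of indirection that lets the cache stay transparent to applications.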

Unified Data Lake
Composable data and analytics (a pluggable big data software stack): transparent, intelligent data caching makes the data architecture more flexible, decouples compute from storage, and accelerates the modular development of data and analytics.
From Big to Small and Wide Data (fusing data across all domains): enterprise data problems are not only big data problems; data fusion provides a solid technical foundation for engineered decision intelligence.
Primary focal point of access (a unified data entry point): abstract, unified data semantics give data analytics a single data access interface, bringing data integration, sharing, and governance together.
David Wheeler: "There is no problem in computer science that can't be solved using another level of indirection."
Alluxio's Vision

Internet
Public cloud
General
E-commerce
Other
Technology
Financial services
Telecom
Media
Alluxio Users Across Industries Worldwide

Alluxio Open-Source Community Statistics for 2021 [1]
• 8 Alluxio Day 🍕 meetups
• 84 Live community developer meetings and online Office Hours
• 27 Webinars
• 62 blog ✍ posts in multiple languages
• 5 new PMC members and 1 new PMC maintainer
• 2 new Committers promoted
• 983 pull requests ✅ merged 💻 in GitHub with 308 coming from
community contributors
• 3144 new members 👋 and 24531 messages in Slack
• 512 issues 📝 created in GitHub
• 11 Alluxio 🚀 releases published
[1] https://www.alluxio.io/blog/a-year-with-alluxio-community-2021/

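● "Presto + Alluxio" joint development interest group
Example: co-developed RaptorX [2], a core evolution project in the Presto community
● "Machine learning/K8s + Alluxio" joint development interest group
Example: co-founded the CNCF Fluid project with Alibaba and Nanjing University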
[2] https://prestodb.io/blog/2021/02/04/raptorx
Working with Technology Partners and Neighboring Open-Source Communities for Shared Success

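● Optimize the metadata service (Master)
○ A more efficient and more reliable high-availability mode (based on Raft)
○ Efficient management of larger clusters (thousands of Workers)
○ Lower memory consumption in the Master process
● Optimize the data service (Worker)
○ Better Worker storage efficiency
○ Lower memory consumption in the Worker process
● Optimize the Job Service
○ More efficient and more controllable distributedLoad, Persist, and AsyncWrite
Technical Directions for 2022: Alluxio Core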

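● Deeper integration with K8s environments
● Deeper integration with data lake solutions (Hudi, Iceberg)
● Optimizations for AI workloads
○ Very large datasets of small files; file writes
○ Memory consumption of the FUSE process
● Optimizations for OLAP workloads (such as Presto)
○ Estimating and monitoring hot datasets
Technical Directions for 2022: Scenario-Specific Optimizations for Applications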

1. Trend 1: Data is shared
Solution: Abstraction across heterogeneous compute
2. Trend 2: Data Ownership and Governance
Solution: Computation without copies (Data Lake), Performance Acceleration with
Caching ++
3. Trend 3: Elasticity in the Cloud
Solution: Multi-tier strategy for simplified performance
Observations on Long-Term Directions

The Alluxio China team was founded in 2021!
We're hiring!

Open Source Started From UC Berkeley AMPLab in 2014
Join the conversation on Slack: alluxio.io/slack
1,000+ contributors & growing
7,000+ Slack Community Members
Top 10 Most Critical Java Based Open Source Project
GitHub's Top 100 Most Valuable Repositories Out of 96 Million

Trend 1: Data is Shared
1. Between Compute Frameworks
For example, Extract Transform Load (ETL) in a batch processing engine followed by
Presto for interactive queries
2. Between Different Teams
Team A as producer shares data with Team B as consumer

Trend 2: Data is Owned and Processing in Place is Simple
1. Data Ownership and Governance
Although replication provides isolation, security compliance is complex
2. Copies introduce redundancy
Which is error-prone and has high Total Cost of Ownership (TCO)

Trend 3: Elasticity for TCO
1. Elasticity of Compute Instances
To optimize TCO, elasticity is key, but the user experience must also be preserved,
including uninterrupted query execution and performance SLAs

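● Engage and drive broader global community collaboration from a more neutral standpoint
● Collaborate across open-source communities more closely and with faster iteration
● A more effective platform for technical outreach, with richer channels
● More agile technical validation and market feedback
Continuing to Drive an Evolution Roadmap Built on the Open-Source Community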
