State of the Elephant


     Doug Cutting
      Cloudera
2009 Hadoop Milestones
●   A 2009 Sort Champion (O'Malley)
    ●   won 100TB “Gray” sort @ .578TB/minute
    ●   won “Minute” sort with 500GB
●   Split Core Project in Three
    ●   Common, HDFS & MapReduce
●   Released 0.18.3, 0.19.[0-2], 0.20.[0-1]
●   Many meetups, conferences, etc.
●   Yada, yada, yada.
Goals for 2010 and beyond
●   IMHO, YMMV, IANAM, etc.
●   Concrete
    ●   1.0 release
        –   compatible APIs & RPCs for > 1 year
        –   Kerberos-based authentication
●   Abstract
    ●   faster, more reliable, available
    ●   easier sharing
        –   of data & hardware resources
    ●   spreadsheet-like interfaces
        –   provide non-programmers
        –   with powerful, interactive tools
Abstract Requirements
●   security
    ●   facilitate sharing of resources
●   stable cross-language APIs
    ●   facilitate diverse tools & apps
●   expressive, inter-operable data
    ●   facilitates sharing of datasets
    ●   facilitates dynamic analyses
Data Formats
●   today in Hadoop:
    ●   text
        –   pro: inter-operable
        –   con: not expressive, inefficient
    ●   Java Writable
        –   pro: expressive, efficient
        –   con: platform-specific, fragile
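The trade-off on this slide can be sketched in a few lines of Python. This is only an illustration: the record values and the `>qd?` layout are made up for the example, and Python's `struct` module stands in for a hand-written binary format such as a Java Writable.

```python
import struct

# Hypothetical record: (timestamp, score, flag). The values are illustrative only.
record = (1234567890123, 3.14159, True)

# Text encoding: any tool can read it, but every value becomes a string
# and the type information is lost.
text = "\t".join(str(v) for v in record).encode("utf-8")
parsed = text.decode("utf-8").split("\t")

# Fixed binary layout: compact and typed, but only code that already knows
# the exact layout (">qd?" here) can decode it.
LAYOUT = ">qd?"  # big-endian long, double, bool
binary = struct.pack(LAYOUT, *record)
unpacked = struct.unpack(LAYOUT, binary)

print(len(text), "bytes as text;", len(binary), "bytes as binary")
print("text fields come back as:", [type(p).__name__ for p in parsed])
print("binary fields come back as:", [type(u).__name__ for u in unpacked])
```

The binary form is smaller and keeps its types, but the layout string lives only in the code, which is the "fragile" part: change the layout and old data can no longer be read.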
Protocol Buffers & Thrift
●   expressive
●   efficient (small & fast)
●   but not very dynamic
    ●   cannot browse arbitrary data
    ●   no DESCRIBE or SHOW
    ●   viewing a new dataset
        –   requires code generation & load
    ●   writing a new dataset
        –   requires generating schema text
        –   plus code generation & load
Avro Data
●   as expressive
●   smaller and faster
●   dynamic
    ●   schema stored with data
        –   but factored out of instances
    ●   APIs permit reading & creating
        –   arbitrary datatypes
        –   without generating & loading code
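The idea of "schema stored with data, but factored out of instances" can be sketched as follows. This is a toy container written for illustration, not the Avro file format or the Avro API: the function names, the length-prefixed header, and the two supported types are all assumptions of the sketch.

```python
import io
import json
import struct

def _write_value(buf, ftype, value):
    # Only two types are supported in this toy sketch.
    if ftype == "long":
        buf.write(struct.pack(">q", value))
    elif ftype == "string":
        data = value.encode("utf-8")
        buf.write(struct.pack(">I", len(data)))
        buf.write(data)
    else:
        raise ValueError(f"unsupported type: {ftype}")

def _read_value(buf, ftype):
    if ftype == "long":
        return struct.unpack(">q", buf.read(8))[0]
    if ftype == "string":
        (n,) = struct.unpack(">I", buf.read(4))
        return buf.read(n).decode("utf-8")
    raise ValueError(f"unsupported type: {ftype}")

def write_container(schema, records):
    """Write the schema once as a header, then only field values per record."""
    buf = io.BytesIO()
    header = json.dumps(schema).encode("utf-8")
    buf.write(struct.pack(">I", len(header)))
    buf.write(header)
    for rec in records:
        for field in schema["fields"]:
            _write_value(buf, field["type"], rec[field["name"]])
    return buf.getvalue()

def read_container(blob):
    """Read any container using only the schema found in its header."""
    buf = io.BytesIO(blob)
    (hlen,) = struct.unpack(">I", buf.read(4))
    schema = json.loads(buf.read(hlen).decode("utf-8"))
    records = []
    while buf.tell() < len(blob):
        records.append(
            {f["name"]: _read_value(buf, f["type"]) for f in schema["fields"]}
        )
    return schema, records

# Hypothetical schema and data, used only to exercise the sketch.
schema = {
    "type": "record",
    "name": "Click",
    "fields": [
        {"name": "user", "type": "string"},
        {"name": "count", "type": "long"},
    ],
}
rows = [{"user": "alice", "count": 3}, {"user": "bob", "count": 7}]
blob = write_container(schema, rows)
found_schema, found_rows = read_container(blob)
```

The reader needs no generated classes: it discovers the field names and types from the header, and the field names appear in the bytes only once no matter how many records follow.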
Avro Data
●   includes a file format
●   includes a textual encoding
●   handles versioning
    ●   if schema changes
    ●   can still process data
●   hope Hadoop apps will
    ●   upgrade from text; and
    ●   standardize on Avro for data
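A minimal sketch of how data can still be processed after a schema changes, assuming fields are matched by name between the schema the data was written with and the schema the reader expects. The `resolve` function and both schemas are hypothetical; real Avro schema resolution covers more cases (type promotion, unions, aliases) than this toy does.

```python
def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema.

    Fields are matched by name. A field the writer lacks is filled from the
    reader's default; a field the reader lacks is dropped.
    """
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out

# Hypothetical schema versions, used only to exercise the sketch.
v1 = {"fields": [{"name": "user", "type": "string"},
                 {"name": "count", "type": "long"}]}
v2 = {"fields": [{"name": "user", "type": "string"},
                 {"name": "count", "type": "long"},
                 {"name": "country", "type": "string", "default": "unknown"}]}

old_record = {"user": "alice", "count": 3}
new_record = {"user": "bob", "count": 7, "country": "NZ"}

# New code reading old data: the missing field takes its default.
upgraded = resolve(old_record, v1, v2)
# Old code reading new data: the unknown field is ignored.
downgraded = resolve(new_record, v2, v1)
```

Because the writer's schema travels with the data, the reader always has both schemas on hand and can do this kind of reconciliation at read time.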
Avro RPC
●   leverage versioning support
    ●   to permit different versions of services to interoperate
●   for Hadoop services, will
    ●   provide cross-language access
    ●   let apps talk to clusters running different versions
Avro Status
●   Java & Python APIs
    ●   C & C++ APIs making rapid progress
●   1.1 release
    ●   added JSON data and comparators
●   1.2 release
    ●   added HTTP & UDP-based RPC
●   included in Hadoop 0.21
    ●   as format for job history
    ●   in sequence files
Avro Near Future
●   full mapreduce support for Avro data
    ●   enables fast comparators for non-Java apps
●   Avro RPC used in Hadoop 0.22 (1.0)?
    ●   provides compatibility; &
    ●   native access from non-Java
Thanks!




hadoop.apache.org/avro

ApacheCon09: Avro
