Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka
Consulting
Avro
Apache Avro Data Serialization
Apache Avro
❖ Data serialization system
❖ Data structures
❖ Binary data format
❖ Container file format to store persistent data
❖ RPC capabilities
❖ Does not require code generation to use
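To make the "binary data format" bullet concrete, here is a minimal sketch of how a record could be laid out in Avro-style binary form. It relies on two rules from the Avro specification: int and long values are written as zig-zag, variable-length integers, and strings are written as a length followed by UTF-8 bytes. A record is then just its field values in schema order, with no field names or type tags. The class and method names are hypothetical, and this is a hand-rolled illustration rather than the Avro library's encoder. The field order follows the Employee example schema (firstName, lastName, age, phoneNumber).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hand-rolled illustration of Avro-style binary encoding; not the Avro library API.
final class AvroEncodingSketch {

    // Writes a long as a zig-zag, variable-length integer (7 payload bits per byte).
    static void writeLong(ByteArrayOutputStream out, long value) {
        long n = (value << 1) ^ (value >> 63); // zig-zag: small magnitudes get short codes
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80)); // high bit set: more bytes follow
            n >>>= 7;
        }
        out.write((int) n);
    }

    // Writes a string as its UTF-8 byte length followed by the bytes themselves.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Encodes the fields in schema order; no field names or type tags are written.
    static byte[] encodeEmployee(String firstName, String lastName, int age, String phoneNumber) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, firstName);
        writeString(out, lastName);
        writeLong(out, age);
        writeString(out, phoneNumber);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeEmployee("Bob", "Jones", 35, "555-1212");
        System.out.println("Encoded size in bytes: " + bytes.length);
    }
}
```

Because the names and types live only in the schema, a reader needs that same schema (or a compatible one) to make sense of these bytes.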
Avro Schemas
❖ Supports schemas for defining data structure
❖ Serialization and deserialization both use the schema
❖ File schema
  ❖ Avro files store data together with its schema
❖ RPC schema
  ❖ The RPC protocol exchanges schemas as part of the handshake
❖ Schemas written in JSON
Avro compared to…
❖ Similar to Thrift, Protocol Buffers, JSON, etc.
❖ Does not require code generation
❖ Avro data needs less per-record encoding because field names and types are stored in the schema
❖ It supports evolution of schemas.
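Schema evolution for records can be sketched in a few lines. The Avro specification resolves a record written with one schema against a reader's schema by matching fields by name. Fields the reader does not know are ignored, and fields the writer did not supply take the reader's default (it is an error if there is none). The sketch below mimics that rule with plain maps. It does not use the Avro library, and the class name, the `NO_DEFAULT` marker and the `nickName` field are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of record schema resolution; not the Avro library API.
final class SchemaEvolutionSketch {

    // Marker for reader fields that declare no default value.
    static final Object NO_DEFAULT = new Object();

    // readerFields maps each field name in the reader's schema to its default (or NO_DEFAULT).
    static Map<String, Object> resolve(Map<String, Object> writtenRecord,
                                       Map<String, Object> readerFields) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerFields.entrySet()) {
            String name = field.getKey();
            if (writtenRecord.containsKey(name)) {
                result.put(name, writtenRecord.get(name)); // field exists in both schemas
            } else if (field.getValue() != NO_DEFAULT) {
                result.put(name, field.getValue());        // fall back to the reader's default
            } else {
                throw new IllegalStateException("No value or default for field: " + name);
            }
        }
        return result; // fields known only to the writer are dropped
    }

    public static void main(String[] args) {
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("firstName", "Bob");
        written.put("age", 35);

        Map<String, Object> reader = new LinkedHashMap<>();
        reader.put("firstName", NO_DEFAULT);
        reader.put("nickName", ""); // hypothetical field added in a newer schema

        System.out.println(resolve(written, reader));
    }
}
```

The practical takeaway is that a field added to a schema should carry a default, so that data written with the older schema can still be read.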
Avro Schema
Avro schemas are stored in src/main/avro by default.
Code Generation
Employee Code Generation
Using Generated Avro class
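The shape of a generated record class can be approximated by hand. The sketch below is a stand-in written for illustration, not actual Avro output. It assumes the four Employee fields from the example schema and mirrors the `newBuilder()` / `setX(...)` / `build()` naming that Avro's generated builders use. The class name `EmployeeSketch` is hypothetical, and it has a constructor, a builder and getters but no setters on the instance itself.

```java
// Hand-written stand-in that mirrors the shape of a generated class; not Avro output.
final class EmployeeSketch {
    private final String firstName;
    private final String lastName;
    private final int age;
    private final String phoneNumber;

    // All-arguments constructor: one way to create an instance.
    EmployeeSketch(String firstName, String lastName, int age, String phoneNumber) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
        this.phoneNumber = phoneNumber;
    }

    // Getters only: without setters, an instance cannot change after construction.
    String getFirstName() { return firstName; }
    String getLastName() { return lastName; }
    int getAge() { return age; }
    String getPhoneNumber() { return phoneNumber; }

    static Builder newBuilder() { return new Builder(); }

    // Builder: the other way to create an instance, naming each field as it is set.
    static final class Builder {
        private String firstName;
        private String lastName;
        private int age;
        private String phoneNumber;

        Builder setFirstName(String value) { this.firstName = value; return this; }
        Builder setLastName(String value) { this.lastName = value; return this; }
        Builder setAge(int value) { this.age = value; return this; }
        Builder setPhoneNumber(String value) { this.phoneNumber = value; return this; }

        EmployeeSketch build() {
            return new EmployeeSketch(firstName, lastName, age, phoneNumber);
        }
    }

    public static void main(String[] args) {
        EmployeeSketch bob = EmployeeSketch.newBuilder()
                .setFirstName("Bob")
                .setLastName("Jones")
                .setAge(35)
                .setPhoneNumber("555-1212")
                .build();
        System.out.println(bob.getFirstName() + " " + bob.getLastName());
    }
}
```

The builder form tends to read better when a record has many fields, because each value is labeled at the call site.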
Writing employees to an Avro file
Reading employees from a file
Using GenericRecord
Writing Generic Records
Reading using Generic Records
Avro Schema Validation
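The idea of checking a record against its schema can be sketched without the Avro library. In the example below a record is a plain map of field names to values, and the "schema" is a map of field names to expected Java types. Both are simplifications assumed for illustration, and the class and method names are hypothetical. Avro performs its own checks inside its readers and writers, so this is an analogy rather than Avro's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java analogy for schema validation of a generic record; not the Avro library API.
final class RecordValidationSketch {

    // A record is valid when it has exactly the schema's fields, each with the expected type.
    static boolean isValid(Map<String, Class<?>> fieldTypes, Map<String, Object> record) {
        if (!fieldTypes.keySet().containsAll(record.keySet())) {
            return false; // the record carries a field the schema does not declare
        }
        for (Map.Entry<String, Class<?>> field : fieldTypes.entrySet()) {
            Object value = record.get(field.getKey());
            if (value == null || !field.getValue().isInstance(value)) {
                return false; // missing field or wrong type
            }
        }
        return true;
    }

    // Field names and types taken from the Employee example schema.
    static Map<String, Class<?>> employeeFieldTypes() {
        Map<String, Class<?>> types = new LinkedHashMap<>();
        types.put("firstName", String.class);
        types.put("lastName", String.class);
        types.put("age", Integer.class);
        types.put("phoneNumber", String.class);
        return types;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("firstName", "Bob");
        record.put("lastName", "Jones");
        record.put("age", "thirty-five"); // wrong type: the schema expects an int
        record.put("phoneNumber", "555-1212");
        System.out.println("valid = " + isValid(employeeFieldTypes(), record));
    }
}
```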
Avro supported types
❖ Records
❖ Arrays
❖ Enums
❖ Unions
❖ Maps
❖ Primitive and logical types such as string, int, boolean, decimal, timestamp, and date
Fuller example Avro Schema
Avro
❖ Fast data serialization
❖ Supports data structures
❖ Supports records, maps, arrays, and basic types
❖ You can use it directly or with code generation

Avro Tutorial - Records with Schema for Kafka and Hadoop

  • 1. Avro: Apache Avro Data Serialization
  • 2. Apache Avro
  • 3. Avro Schemas
  • 4. Avro compared to…
  • 5. Avro Schema
  • 6. Code Generation
  • 7. Employee Code Generation
  • 8. Using Generated Avro class
  • 9. Writing employees to an Avro File
  • 10. Reading employees From a File
  • 11. Using GenericRecord
  • 12. Writing Generic Records
  • 13. Reading using Generic Records
  • 14. Avro Schema Validation
  • 15. Avro supported types
  • 16. Fuller example Avro Schema
  • 17. Avro (summary)

Editor's Notes

  • #3: Apache Avro™ is a data serialization system. Avro provides data structures, a binary data format, a container file format for storing persistent data, and RPC capabilities. Avro does not require code generation to use. It integrates well with JavaScript, Python, Ruby and Java.
  • #4: The Avro data format is defined by Avro schemas. Data is serialized and deserialized based on the schema, and the schema is sent with the data, so Avro data plus its schema is fully self-describing. Avro files store data together with its schema. Avro RPC is also based on schemas: the client and server exchange schemas in the connection handshake. Avro schemas are written in JSON.
  • #5: Avro is similar to Thrift, Protocol Buffers, JSON, etc. Avro does not require code generation. Avro needs less encoding as part of the data since it stores names and types in the schema. It supports evolution of schemas.
  • #6: Example Schema: {"namespace": "com.cloudurable.phonebook", "type": "record", "name": "Employee", "fields": [ {"name": "firstName", "type": "string"}, {"name": "lastName", "type": "string"}, {"name": "age", "type": "int"}, {"name": "phoneNumber", "type": "string"} ] } An Avro schema is just JSON.
  • #7: There are plugins for Maven and Gradle to generate code based on Avro schemas. The gradle-avro-plugin is a Gradle plugin that uses Avro tools to generate Java code for Apache Avro. The plugin supports Avro schema files (avsc) and Avro RPC IDL files (avdl). For Kafka you only need avsc. Notice that we did not generate setter methods, which makes the instances somewhat immutable.
  • #8: The plugin generates the files and puts them under build/generated-main-avro-java.
  • #9: The Employee class has a constructor and has a builder.
  • #10: The above shows serializing an Employee list to disk. In Kafka, we will not be writing to disk directly. We show it here so you have a way to test Avro serialization, which is helpful when debugging schema incompatibilities. Note that we create a DatumWriter, which converts a Java instance into an in-memory serialized format. SpecificDatumWriter is used with generated classes like Employee. DataFileWriter writes the serialized records to the employee.avro file.
  • #11: The above deserializes employees from the employees.avro file. Deserializing is similar to serializing but in reverse. We create a SpecificDatumReader to convert in-memory serialized items into instances of our generated Employee class. The DataFileReader reads records from the file by calling next. Another way to read is using forEach as follows: final DataFileReader<Employee> dataFileReader = new DataFileReader<>(file, empReader); dataFileReader.forEach(employeeList::add);
  • #12: You can use a generic record instead of using generated code.
  • #13: You can write to Avro files using Generic records as well.
  • #14: You can read from Avro files using generic records as well.
  • #15: Avro will validate the data types when it serializes and deserializes the data.
  • #16: The document https://avro.apache.org/docs/current/spec.html#Protocol+Declaration describes all of the supported types.
  • #17: The above has examples of default values, arrays, primitive types, Records within records, enums, and more.