Speaking the language of Big Data
Ranganathan Balashanmugam, ThoughtWorks
@ran_than
the data interchange protocols
About Me
● Graduated as a civil engineer
● Technology Lead at ThoughtWorks, India
● Organizer of Hyderabad Scalability Meetup with ~3500 members.
“BigData because every byte has a story.”
➔ Write everything (no write schema)
➔ Read required content
Why in big data?
Feature creep
Our assumption of services
How do they communicate?
Expected language of Big Data
language
schema
Text Files
“Written once; Processed many times”
Text Files
➔ Good at writes.
➔ Reads need processing, every time.
Serialized Deserialized
● Locked to language
● Heavy
● Not flexible
Java serialization, Ruby’s marshal, Python’s pickle, etc
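A minimal sketch of the "locked to language" point, using Python's pickle next to JSON. The record below is a made-up example, not data from the talk.

```python
import json
import pickle

# Hypothetical record used only for illustration.
tweet = {"userName": "ran_than", "text": "Hello"}

# pickle output is a Python-specific byte stream; other runtimes have no
# standard way to read it.
pickled = pickle.dumps(tweet)

# JSON is plain text that most languages can parse.
as_json = json.dumps(tweet).encode("utf-8")

print(pickle.loads(pickled) == tweet)  # round-trips inside Python
print(json.loads(as_json) == tweet)    # round-trips through text
print(len(pickled), len(as_json))      # sizes differ by format
```

The same trade-off applies to Java serialization and Ruby's Marshal: convenient inside one runtime, awkward across services written in different languages.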
Demo
CSV, XML, JSON, etc
“Schema defined”
CSV, XML, JSON, etc
➔ CSV - schema not present with data
➔ Parsing efficiency
➔ communication
➔ flexibility
➔ data types
➔ rewrite
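The "data types" and "schema not present with data" points can be sketched as follows. The two-column data set is an assumption made up for this example.

```python
import csv
import io
import json

# Hypothetical data used only for illustration.
csv_text = "userId,text\n42,Hello\n"
json_text = '{"userId": 42, "text": "Hello"}'

csv_row = next(csv.DictReader(io.StringIO(csv_text)))
json_row = json.loads(json_text)

# CSV carries no type information, so every value comes back as a string.
print(type(csv_row["userId"]).__name__)   # str
# JSON keeps a few basic types, but nothing states which fields must exist.
print(type(json_row["userId"]).__name__)  # int
```

Every reader of the CSV has to know, out of band, that userId is meant to be a number.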
Schema change wars
“Language and platform neutral way to store and
exchange structured data in distributed systems.”
Expectations
❏ Simple, readable schema
❏ Rich data structures, easy design to description
❏ Support multiple languages, with simple integration, code generation
❏ Efficiency (time/space)
❏ Easy integrations (availability of libraries)
❏ Tooling
❏ Versioning
   ❏ Old version, new server
   ❏ New version, old server
❏ Reduce boilerplate code
❏ Supports compression
Schema: Python, Java, other languages
Less memory: more data; less memory; faster
Manual: design and describe
Auto: generated boilerplate code
Auto: serialize, deserialize, compress
Agile: flexible schema, no code rewrite
Stay calm and focus
Tooling: simple commands
“Language and platform neutral way to store and
exchange structured data in distributed systems.”
Tada
Data Interchange Protocols
➔ Protocol Buffers
➔ Apache Thrift
➔ Apache Avro
➔ Cap’n Proto
➔ FlatBuffers
➔ Kryo (Java only)
➔ MessagePack
➔ BERT
➔ Apache Etch
➔ Internet Communications Engine
Protocol Buffers
A flexible, efficient, automated mechanism for serializing
structured data – think XML, but smaller, faster, and simpler.
Google
Protobuf
● Google's own serializer
● Cross-language
● Schema evolution
● Compact
● Strongly typed
● Battle tested
“All the data is just a sequence of bytes, which
makes no sense without the encoding.”
0 1 1 0 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 0 1
Base 128 varints
“Varints are a method of serializing integers using one or more bytes.”
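➔ Each byte in a varint, except the last, has its most significant bit (msb) set
➔ A set msb indicates that further bytes follow
➔ Lower 7 bits of each byte - two's complement of the number, in groups of 7 bits, least significant group first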
1 = 0000 0001
300 = 1010 1100 0000 0010
Drop the msb of each byte: 010 1100 000 0010
Reverse the two 7-bit groups: 000 0010 010 1100
Concatenate: 000 0010 ++ 010 1100 = 100101100 = 256 + 32 + 8 + 4 = 300
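The steps above can be sketched as a small encoder and decoder. This is a hand-written illustration for non-negative integers, not the Protocol Buffers library code; the function names are made up for this example.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # lowest 7 bits go out first
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # msb set: more bytes follow
        else:
            out.append(low7)         # msb clear: this is the last byte
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # groups arrive least significant first
        if not byte & 0x80:
            return result, pos
        shift += 7


print(encode_varint(1).hex())                # 01
print(encode_varint(300).hex())              # ac02
print(decode_varint(bytes.fromhex("ac02")))  # (300, 2)
```

The bytes ac 02 are the 1010 1100 0000 0010 pattern from the slide.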
Non-varints
double, fixed64 - a fixed 64-bit lump of data
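For contrast with varints, a short sketch of the fixed-width case using Python's struct module. Protocol Buffers writes these values as 8 bytes in little-endian order; the sample values are arbitrary.

```python
import struct

# fixed64 and double always occupy 8 little-endian bytes,
# no matter how small the value is.
fixed64_bytes = struct.pack("<Q", 300)  # unsigned 64-bit integer
double_bytes = struct.pack("<d", 1.0)   # IEEE 754 double

print(fixed64_bytes.hex())  # 2c01000000000000
print(double_bytes.hex())   # 000000000000f03f
```

The value 300 takes 8 bytes here, against 2 bytes as a varint, which is why varints suit fields that usually hold small numbers.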
Message Structure
“A protocol buffer message is a series of key-value pairs.”
➔ Key is varint
➔ Key = (field_number << 3) | wire_type
Type   Meaning
0      Varint
1      64-bit
2      Length-delimited
….     ….
Key = 000 1000
Wire_type = lowest 3 bits = 000 = 0 (Varint)
Field_number = right_shift (000 1000, 3) = 1
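A minimal sketch of building and splitting a key. The helper names are made up for this example, and the lookup table lists only the three wire types shown on the slide.

```python
# Only the wire types listed on the slide; others exist.
WIRE_TYPES = {0: "varint", 1: "64-bit", 2: "length-delimited"}


def make_key(field_number: int, wire_type: int) -> int:
    """Pack a field number and wire type into one key value."""
    return (field_number << 3) | wire_type


def split_key(key: int):
    """Return (field_number, wire_type) for a decoded key value."""
    return key >> 3, key & 0x07


print(make_key(1, 0))    # 8, i.e. 0000 1000
print(split_key(0x08))   # (1, 0)
print(split_key(0x12))   # (2, 2)
print(WIRE_TYPES[split_key(0x12)[1]])  # length-delimited
```

The key is itself written as a varint, so field numbers 1 through 15 fit in a single byte.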
message Message {
  required int32 a = 1;
}
Strings
message String {
  required string a = 2;
}
a = "testing";
12 07 74 65 73 74 69 6e 67
Key = HexToBinary(12) = 0001 0010 =>
Field number = 2, wire type = 2 (length-delimited)
Length varint = 07 => 7 bytes follow
Payload = 74 65 73 74 69 6e 67 = UTF-8 "testing"
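The byte sequence above can be reproduced by hand. The sketch below is an illustration of the wire layout, not the generated Protocol Buffers code; the function names are assumptions made for this example.

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint for a non-negative integer."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # msb set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Key, then length varint, then the UTF-8 bytes of the string."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | 2          # wire type 2 = length-delimited
    return encode_varint(key) + encode_varint(len(payload)) + payload


encoded = encode_string_field(2, "testing")
print(" ".join(f"{b:02x}" for b in encoded))
# 12 07 74 65 73 74 69 6e 67
```

Note that the field name "a" never appears on the wire; only the field number 2 does, which is what lets fields be renamed without breaking old data.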
Src: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
Demo
Apache Avro
Avro is a remote procedure call and data serialization framework.
Avro
➔ Originally developed by Doug Cutting for use with Hadoop.
➔ RPC framework.
➔ Started as a Hadoop subproject; now a top-level Apache project.
➔ Avro data is always serialized with its schema.
➔ Supports binary and JSON encoding.
Src: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
Compactness of encodings
Schema
Avro data is always serialized with its schema.
Writer
{
  "type": "record",
  "name": "Tweet",
  "namespace": "com.avro.serialization.gen",
  "fields": [
    {
      "name": "userId",
      "type": "int"
    },
    {
      "name": "userName",
      "type": "string"
    },
    {
      "name": "text",
      "type": "string"
    }
  ]
}
Reader
{
  "type": "record",
  "name": "Tweet",
  "namespace": "com.avro.serialization.gen",
  "fields": [
    {
      "name": "userIcon",
      "type": "string"
    },
    {
      "name": "userName",
      "type": "string"
    },
    {
      "name": "text",
      "type": "string"
    }
  ]
}
Name=Nathan,
userName: ran_than,
text: Hello
userName: ran_than,
text: Hello
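A simplified sketch of how a reader schema is matched against a written record by field name. It works on an already-decoded dictionary rather than on Avro binary data, and it is not the Avro library's implementation. Two things here are assumptions: the userId value, and the default on userIcon. The Avro specification signals an error when a reader field is missing from the writer's data and has no default, so the sketch adds one to let resolution succeed.

```python
def resolve(record: dict, reader_fields: list) -> dict:
    """Project a written record onto the reader's fields, matching by name."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]       # present in writer data
        elif "default" in field:
            resolved[name] = field["default"]   # fall back to reader default
        else:
            raise ValueError(f"no value or default for reader field {name!r}")
    return resolved                             # writer-only fields are dropped


# Record as written with the writer schema (userId value is hypothetical).
written = {"userId": 1, "userName": "ran_than", "text": "Hello"}

# Reader schema fields; the default on userIcon is an assumption.
reader_fields = [
    {"name": "userIcon", "type": "string", "default": ""},
    {"name": "userName", "type": "string"},
    {"name": "text", "type": "string"},
]

print(resolve(written, reader_fields))
# {'userIcon': '', 'userName': 'ran_than', 'text': 'Hello'}
```

The writer's userId is skipped because the reader does not ask for it, and userName and text come through unchanged.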
Avro Encoding
➔ Strings: a length followed by UTF-8 bytes.
➔ The parser matches fields in the reader and writer schema by name.
➔ Hadoop files hold millions of records with the same schema. Object Container Files handle this by storing the schema once, at the beginning of the file.
➔ Easy to load schema into Pig.
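A hand-written sketch of the Avro binary encoding for longs and strings, following the Avro specification: integers are zig-zag encoded and then written as a varint, and a string is its byte length (as a long) followed by the UTF-8 bytes. The function names and the userId value are assumptions made for this example.

```python
def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def encode_long(n: int) -> bytes:
    """Avro int/long: zig-zag, then base-128 varint."""
    value = zigzag(n)
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_string(text: str) -> bytes:
    """Avro string: length as a long, then the UTF-8 bytes."""
    payload = text.encode("utf-8")
    return encode_long(len(payload)) + payload


print(encode_long(1).hex())          # 02
print(encode_long(-1).hex())         # 01
print(encode_string("Hello").hex())  # 0a48656c6c6f

# A record is just its fields in schema order, with no tags or names.
record = encode_long(1) + encode_string("ran_than") + encode_string("Hello")
print(len(record))                   # 16
```

Because no field numbers or names are written, the bytes can only be decoded with the writer's schema at hand.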
Avro - Object Container File
File layout: Header | Block 1 | Block 2 | Block ... | Block N
Header:
➔ 4 bytes: ASCII ‘O’, ‘b’, ‘j’, followed by the byte 1
➔ File Metadata: includes avro.schema and avro.codec
➔ 16-byte Sync marker
Each block:
➔ Count of objects in block
➔ Size (in bytes) of serialized objects
➔ Serialized objects, compressed by specified codec
➔ 16-byte Sync marker
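A minimal sketch of the container file layout, written by hand from the Avro specification. Per that specification the sync marker is 16 bytes long and the magic is 'O', 'b', 'j' followed by the byte 1. The sketch uses the "null" codec (no compression) and a plain "string" schema to stay short; the function names are assumptions, and the output has not been checked against the reference Avro implementation.

```python
import io
import json
import os

MAGIC = b"Obj\x01"   # 'O', 'b', 'j', then the version byte 1
SYNC_SIZE = 16       # the sync marker is 16 bytes long


def encode_long(n: int) -> bytes:
    """Avro long: zig-zag, then base-128 varint."""
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_long(stream) -> int:
    result = shift = 0
    while True:
        byte = stream.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (result >> 1) ^ -(result & 1)   # undo zig-zag
        shift += 7


def encode_bytes(data: bytes) -> bytes:
    return encode_long(len(data)) + data


def read_bytes(stream) -> bytes:
    return stream.read(decode_long(stream))


def write_container(schema, blocks, sync: bytes) -> bytes:
    meta = {"avro.schema": json.dumps(schema).encode("utf-8"),
            "avro.codec": b"null"}
    out = io.BytesIO()
    out.write(MAGIC)
    out.write(encode_long(len(meta)))        # one map block holding all entries
    for key, value in meta.items():
        out.write(encode_bytes(key.encode("utf-8")))
        out.write(encode_bytes(value))
    out.write(encode_long(0))                # a zero count ends the map
    out.write(sync)
    for count, payload in blocks:
        out.write(encode_long(count))        # count of objects in block
        out.write(encode_long(len(payload))) # size in bytes of serialized objects
        out.write(payload)
        out.write(sync)
    return out.getvalue()


def read_header(data: bytes):
    stream = io.BytesIO(data)
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = decode_long(stream)
        if count == 0:
            break
        if count < 0:                        # negative count: a byte size follows
            count = -count
            decode_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta, stream.read(SYNC_SIZE)


sync = os.urandom(SYNC_SIZE)
payload = encode_bytes(b"Hello") + encode_bytes(b"World")  # two "string" objects
data = write_container("string", [(2, payload)], sync)
meta, found_sync = read_header(data)
print(meta["avro.schema"], meta["avro.codec"], found_sync == sync)
```

Repeating the sync marker after every block lets a reader that starts in the middle of a large file scan forward to the next block boundary, which is what makes the format splittable for Hadoop jobs.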
Demo
Apache Thrift
Thrift is a remote procedure call and data serialization
framework, developed at Facebook for "scalable cross-language
services development".
Thrift
➔ Designed by an ex-Googler in 2007
➔ Developed internally at Facebook
➔ RPC stack
➔ Many encoding options
Thrift Protocol Stack
Src: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
Thrift Protocols
➔ TBinaryProtocol
➔ TCompactProtocol
➔ TDenseProtocol: Similar to TCompactProtocol but strips off the meta information from what is transmitted, and adds it back in at the receiver.
➔ TJSONProtocol
➔ TSimpleJSONProtocol: A write-only protocol using JSON. Suitable for parsing by scripting languages.
➔ TDebugProtocol: Uses a human-readable text format to aid in debugging.
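To make the difference between encodings concrete, here is a hand-written sketch of how TBinaryProtocol lays out a struct with one string field: a 1-byte type, a 2-byte field id, a 4-byte length, the UTF-8 bytes, and a stop byte ending the struct. It covers only the struct body, not the RPC message envelope, and does not use the Thrift library. The function name and the choice of field id 2 are assumptions made for this example.

```python
import struct

T_STOP = 0      # marks the end of a struct
T_STRING = 11   # Thrift type id for string/binary


def binary_string_field(field_id: int, text: str) -> bytes:
    """Type byte, big-endian i16 field id, big-endian i32 length, UTF-8 bytes."""
    payload = text.encode("utf-8")
    header = struct.pack(">bh", T_STRING, field_id)
    return header + struct.pack(">i", len(payload)) + payload


struct_bytes = binary_string_field(2, "testing") + bytes([T_STOP])
print(" ".join(f"{b:02x}" for b in struct_bytes))
# 0b 00 02 00 00 00 07 74 65 73 74 69 6e 67 00
print(len(struct_bytes))  # 15
```

The fixed-width header and length account for most of the overhead; TCompactProtocol reduces it with variable-length integers, much as Protocol Buffers does.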
Demo
“It is not the strongest of the species that survive,
nor the most intelligent, but the one most
responsive to change.”
- Charles Darwin
THANK YOU
For questions or suggestions:
Ranganathan Balashanmugam
@ran_than

Apache big data 2016 - Speaking the language of Big Data