Hadoop Record Reader in Python
  • HUG: Nov 18 2009
  • Paul Tarjan
  • http://paulisageek.com
  • @ptarjan
  • http://github.com/ptarjan/hadoop_record
Hey Jute…
  • Tabs and newlines are good and all
  • For lots of data, don’t do that
don’t make it bad...
  • Hadoop has a native data storage format called Hadoop Record or “Jute”
  • org.apache.hadoop.record
  • http://en.wikipedia.org/wiki/Jute
take a data structure…
  • There is a Data Definition Language!

    module links {
        class Link {
            ustring URL;
            boolean isRelative;
            ustring anchorText;
        };
    }
and make it better…
  • And a compiler

    $ rcc -lc++ inclrec.jr testrec.jr

    namespace inclrec {
        class RI : public hadoop::Record {
        private:
            int32_t I32;
            double D;
            std::string S;
remember, to only use C++/Java

    $ rcc --help
    Usage: rcc --language [java|c++] ddl-files
then you can start to make it better…
  • I wanted it in Python
  • Need 2 parts: a parsing library and a DDL translator
  • I only did the first part
  • If you need the second part, let me know
Hey Jute don't be afraid…
you were made to go out and get her…
  • http://github.com/ptarjan/hadoop_record
the minute you let her under your skin…
  • I bet you thought I was done with “Hey Jude” references, eh?
  • How I built it
  • Ply == lex and yacc
  • Parser == 234 lines including tests!
  • Outputs generic data types
  • You have to do the class transform yourself
  • You can use my lex and yacc stuff in your language of choice
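Since the parser outputs generic data types and the class transform is left to the caller, here is a minimal sketch of what that transform might look like for the Link class from the DDL slide. It assumes the parser hands back a record's fields as a plain sequence in DDL order; that shape, and the helper name to_link, are illustrative assumptions and not the library's documented API.

```python
from collections import namedtuple

# Field names copied from the "class Link" DDL example.
Link = namedtuple("Link", ["URL", "isRelative", "anchorText"])


def to_link(fields):
    """Turn a generic parsed record into a Link.

    Hypothetical helper: assumes `fields` is a sequence holding the
    three values in the same order as the DDL declares them.
    """
    url, is_relative, anchor_text = fields
    return Link(str(url), bool(is_relative), str(anchor_text))


if __name__ == "__main__":
    link = to_link(["http://example.com/", False, "home"])
    print(link.URL, link.isRelative, link.anchorText)
```

A namedtuple keeps the result lightweight and immutable; a DDL translator, if one existed, could generate definitions like this automatically.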
and any time you feel the pain…
  • Parsing the binary format is hard
  • Vector vs struct???

    struct = "s{" record *("," record) "}"
    vector = "v{" [record *("," record)] "}"

  • LazyString – don’t decode if not needed
  • 99% of my Hadoop time was decoding strings I didn’t need
  • Binary on disk -> CSV -> Python == wasteful
  • Hadoop unpacks zip files – name it .mod
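The struct/vector grammar and the LazyString idea above can be sketched together in a few lines. This is a hand-written recursive-descent parser, not the Ply-based one from the talk, and it covers only a simplified subset: structs, vectors, and bare scalar tokens with no escaping or type prefixes. The names LazyString and parse_record are illustrative and may not match the repository's code.

```python
class LazyString:
    """Keep the raw bytes and decode to str only when someone asks."""

    def __init__(self, raw):
        self.raw = raw
        self._text = None  # filled in on first str() call

    def __str__(self):
        if self._text is None:
            self._text = self.raw.decode("utf-8")
        return self._text


def parse_record(data, pos=0):
    """Parse one record at data[pos:]; return (value, next_position).

    Structs become tuples, vectors become lists, and anything else is
    wrapped in a LazyString without being decoded.
    """
    head = data[pos:pos + 2]
    if head in (b"s{", b"v{"):
        is_struct = head == b"s{"
        pos += 2
        items = []
        if data[pos:pos + 1] == b"}":
            # Only a vector may be empty according to the grammar.
            if is_struct:
                raise ValueError("a struct needs at least one record")
            return items, pos + 1
        while True:
            value, pos = parse_record(data, pos)
            items.append(value)
            sep = data[pos:pos + 1]
            pos += 1
            if sep == b"}":
                break
            if sep != b",":
                raise ValueError("expected ',' or '}' at offset %d" % (pos - 1))
        return (tuple(items) if is_struct else items), pos
    # Scalar: take everything up to the next separator, undecoded.
    end = pos
    while end < len(data) and data[end:end + 1] not in (b",", b"}"):
        end += 1
    return LazyString(data[pos:end]), end


if __name__ == "__main__":
    record, _ = parse_record(b"s{a,v{b,c},v{}}")
    print(str(record[0]))  # only this field is ever decoded
```

Fields that are never passed to str() stay as raw bytes, which is where the saving comes from when a job reads only a few fields per record.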
nanananana
  • Future work
  • DDL Converter
  • Integrate it officially
  • Record writer (should be easy)
  • SequenceFileAsOutputFormat
  • Integrate your feedback
