1
Binary Storage Formats for Sparse
Erik Welch
ewelch@anaconda.com
September 21, 2021
2
Why are we here?
● To try to improve the status quo
○ Matrix Market, FROSTT, and other text-based formats
○ COO: coordinate triples of rows, columns, and values
■ Columnar storage technology is really good!
○ Bespoke, library-specific binary formats
○ TileDB
○ Domain-specific solutions (genomics, bioimaging, neuroscience, etc.)
● To discuss and create a vision for new binary storage formats for sparse
○ “Open standard” formats that are broadly useful and widely accessible
● To determine what is most important to the people here
○ Which languages, libraries, formats, features, technologies, etc.
○ And what do you wish for? Dream big!
● To find partners and team up for the next steps
○ Prototype, analyze, communicate results, and iterate
○ Goals for 3-6 months and beyond
3
Who is this for?
● Library developers and users
● Anybody with large sparse data or graphs
● Hardware creators and vendors
○ For example, to better optimize running workloads on GPUs or other accelerators
● Researchers
● Companies with graph workloads
● Benchmark organizers
● Maintainers of repositories of sparse data
● Virtually anybody who needs to work with sparse data or graphs
4
What is a file format?
● Data written to long-term storage
○ Any technically competent person who discovers the file 10 years later should be able to figure out how to load and use it (even if they don’t recognize the format)
● The logical model of a file format
○ Metadata about metadata
■ File extension (such as .zip or .pdf)
■ “Magic number” at beginning of file (NumPy’s .npy is \x93NUMPY)
■ Version number of format
■ Any additional extensions needed to understand the metadata or data
○ Metadata
■ Data type, endianness of data, shape of sparse array, etc.
■ Info about compression
○ Data
■ Contiguous arrays of raw bytes
■ May be chunked or compressed
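As one concrete instance of "metadata about metadata", the sketch below parses the fixed preamble of a .npy file: the magic string, the format version, and the length of the header that follows. It is only an illustration using the standard library; the function name read_npy_preamble and the hand-built, unpadded sample bytes are assumptions made for this example.

```python
import struct

NPY_MAGIC = b"\x93NUMPY"  # magic string at the start of a .npy file


def read_npy_preamble(buf: bytes):
    """Return (major, minor, header_len, header_start) for .npy bytes."""
    if buf[:6] != NPY_MAGIC:
        raise ValueError("not a .npy file: bad magic number")
    major, minor = buf[6], buf[7]
    if major == 1:
        # Version 1.0 stores the header length as a little-endian uint16.
        (header_len,) = struct.unpack_from("<H", buf, 8)
        header_start = 10
    else:
        # Later versions store it as a little-endian uint32.
        (header_len,) = struct.unpack_from("<I", buf, 8)
        header_start = 12
    return major, minor, header_len, header_start


# Hand-built sample (assumption): a version 1.0 preamble plus a header dict.
header = b"{'descr': '<i8', 'fortran_order': False, 'shape': (3,), }\n"
sample = NPY_MAGIC + bytes([1, 0]) + struct.pack("<H", len(header)) + header
major, minor, header_len, start = read_npy_preamble(sample)
print(major, minor, sample[start:start + header_len].decode("latin1"))
```

A reader that does not recognize the file can still get this far: the magic string identifies the format, and the version tells it how to interpret everything after.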
5
An “open” file format is also about community
● More likely to succeed with more partners
● Right now, a small group (or just me) can make a lot of progress with prototyping and documenting to bootstrap the effort
○ Along with feedback from LAGraph group and others
● We’ll want more partners eventually
● Please think about how/when you can help
○ Student projects
○ Volunteer to be beta users, give feedback
○ Participate in this session
● We’ll need a governance model eventually
○ Hopefully! This is a good “problem” to have
○ We’re happy to have this discussion
“If you want to go quickly, go alone. If you want to go far, go together.” (African proverb)
6
What are our high-level goals?
● Fast enough
● Small enough
● Flexible enough
● Portable enough
● Future-proof enough
● Standardized enough
● “Cloud native” enough
● Simple enough to implement
● Easy enough to use
● Scalable enough
● Able to evolve
● And, of course, able to support whatever optimized format Tim Davis can dream up!
“Pinky, are you pondering what I’m pondering?” (Brain)
7
Essential sparse formats to support: rank 1 and 2

Format              Arrays
Sparse vector       indices, values
COO (ijv triples)   rows, columns, values
CSR                 indptr, col_indices, values
CSC                 indptr, row_indices, values
DCSR (HyperCSR)     rows, indptr, col_indices, values
DCSC (HyperCSC)     columns, indptr, row_indices, values

Many others: BSR, BSRX, DIA, SCS, Skyline, ELL (ITPACK/ELLPACK), hash map, CSB, CSR5, extended row scheme, etc.
Let’s start designing!
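To make the array names in the list of formats concrete, here is a minimal pure-Python sketch that converts COO triples to CSR and then to DCSR. The function names and the tiny example matrix are assumptions for illustration; a real library would use typed arrays rather than lists.

```python
def coo_to_csr(nrows, rows, cols, vals):
    """Convert COO triples to CSR arrays (indptr, col_indices, values)."""
    order = sorted(range(len(rows)), key=lambda n: (rows[n], cols[n]))
    indptr = [0] * (nrows + 1)
    for r in rows:
        indptr[r + 1] += 1          # count the entries in each row
    for r in range(nrows):
        indptr[r + 1] += indptr[r]  # prefix sum gives row start offsets
    col_indices = [cols[n] for n in order]
    values = [vals[n] for n in order]
    return indptr, col_indices, values


def csr_to_dcsr(indptr):
    """Drop empty rows: keep non-empty row ids and a compacted indptr."""
    rows, compact = [], [0]
    for r in range(len(indptr) - 1):
        if indptr[r + 1] > indptr[r]:
            rows.append(r)
            compact.append(indptr[r + 1])
    return rows, compact


# A 5 x 4 matrix with entries in rows 1 and 4 only.
rows, cols, vals = [4, 1, 1], [1, 2, 0], [30.0, 20.0, 10.0]
indptr, col_indices, values = coo_to_csr(5, rows, cols, vals)
dcsr_rows, dcsr_indptr = csr_to_dcsr(indptr)
print(indptr, col_indices, values)
print(dcsr_rows, dcsr_indptr)
```

CSR spends one indptr slot per row even when the row is empty; DCSR (HyperCSR) stores only the non-empty rows, which matters when most rows have no entries.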
8
Essential formats for rank N: COO and CSF
http://shaden.io/pub-files/smith2017knl.pdf
CSF is like DCSR/HyperCSR generalized to higher dimensions
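The relationship between CSF and DCSR can be sketched in code. The function below builds one (indices, pointers) pair per compressed level from sorted COO tuples; for rank 2 its output is exactly the DCSR arrays. The key names fiber_ind<k>, fiber_ptr<k>, and coord<k> follow the naming used later in this deck, but the function itself and its handling of duplicates (kept as separate leaf entries) are assumptions made for illustration.

```python
def coo_to_csf(coords, values):
    """Build CSF arrays from a non-empty list of COO tuples of rank >= 2."""
    order = sorted(range(len(coords)), key=lambda n: coords[n])
    coords = [coords[n] for n in order]
    values = [values[n] for n in order]
    rank = len(coords[0])
    inds = [[] for _ in range(rank - 1)]  # fiber_ind<k>
    ptrs = [[] for _ in range(rank - 1)]  # fiber_ptr<k>
    leaf = []                             # coordinates along the last axis

    def child_count(level):
        # Length of the array that this level's pointers index into.
        return len(inds[level + 1]) if level + 1 < rank - 1 else len(leaf)

    prev = None
    for c in coords:
        # First axis on which this entry differs from the previous entry.
        if prev is None:
            first_diff = 0
        else:
            first_diff = next(
                (k for k in range(rank) if c[k] != prev[k]), rank - 1)
        # Open a new fiber at every compressed level from there down.
        for level in range(first_diff, rank - 1):
            inds[level].append(c[level])
            ptrs[level].append(child_count(level))
        leaf.append(c[-1])
        prev = c
    for level in range(rank - 1):
        ptrs[level].append(child_count(level))  # closing offset

    out = {f"coord{rank - 1}": leaf, "values": values}
    for level in range(rank - 1):
        out[f"fiber_ind{level}"] = inds[level]
        out[f"fiber_ptr{level}"] = ptrs[level]
    return out


csf3 = coo_to_csf([(3, 1, 1), (0, 0, 2), (0, 2, 0), (0, 0, 1)],
                  [4.0, 2.0, 3.0, 1.0])
csf2 = coo_to_csf([(1, 0), (1, 2), (4, 1)], [10.0, 20.0, 30.0])
print(csf3)
print(csf2)  # same arrays as DCSR: rows, indptr, col_indices, values
```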
9
Logical model of a file
● Metadata
○ Let’s use JSON
○ Versatile
○ Human-readable
○ No need to think about endianness of metadata values
● Data
○ Like a key-value store with binary data
○ This is just a logical model
○ We are free to pack the data into bytes however we wish
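A minimal sketch of this logical model, assuming a sparse vector: the metadata is a JSON string, and the data is a dictionary from key to raw bytes. The metadata layout ("arrays", "dtype", "byteorder", "length") is invented for this example and is not a proposed schema.

```python
import json
import struct

indices = [0, 3, 7]
values = [1.5, -2.0, 4.25]

# Data: a key-value store of raw byte buffers.
store = {
    "indices": struct.pack(f"<{len(indices)}q", *indices),  # LE int64
    "values": struct.pack(f"<{len(values)}d", *values),     # LE float64
}

# Metadata: JSON text, so its own values carry no endianness concerns.
metadata = json.dumps({
    "shape": [10],
    "arrays": {
        "indices": {"dtype": "int64", "byteorder": "little", "length": 3},
        "values": {"dtype": "float64", "byteorder": "little", "length": 3},
    },
})

STRUCT_CODES = {"int64": "q", "float64": "d"}


def load_array(metadata_json, store, key):
    """Decode one array using only the JSON metadata and the byte store."""
    info = json.loads(metadata_json)["arrays"][key]
    prefix = "<" if info["byteorder"] == "little" else ">"
    fmt = f"{prefix}{info['length']}{STRUCT_CODES[info['dtype']]}"
    return list(struct.unpack(fmt, store[key]))


print(load_array(metadata, store, "indices"))
print(load_array(metadata, store, "values"))
```

Nothing here says how the store is laid out on disk; the same two pieces could live in one archive file, a directory, or an object store.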
10
Unified Sparse Storage
Minimal, cohesive set of metadata and storage keys

Storage keys:
  double-compressed: fiber_ind0, fiber_ptr0
  compressed: indptr
  sparse: coord0, coord1
Metadata keys: fiber_axes, compressed_axes, coord_axes, axis_order
Optional metadata keys: num_sorted_coords, has_duplicates

Rank  Format  fiber_axes  compressed_axes  coord_axes  axis_order    num_sorted_coords
1     -       -           -                [0]         [0]           0 or 1
2     CSR     -           [0]              [1]         [0, 1]        0 or 1
2     CSC     -           [0]              [1]         [1, 0]        0 or 1
2     DCSR    [0]         -                [1]         [0, 1]        0 or 1
2     DCSC    [0]         -                [1]         [1, 0]        0 or 1
2     COO     -           -                [0, 1]      [0,1], [1,0]  0, 1, or 2
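As a sketch of how this metadata might look as JSON, the snippet below transcribes the CSR, CSC, DCSR, and DCSC columns of the table into dictionaries and identifies a layout from a metadata document. It assumes that fiber_axes belongs to the double-compressed formats and compressed_axes to CSR and CSC, which is how the row labels read; the identify function and the extra "shape" field are illustrative assumptions.

```python
import json

# Metadata for four rank-2 layouts, transcribed from the table.
RANK2_METADATA = {
    "CSR": {"compressed_axes": [0], "coord_axes": [1], "axis_order": [0, 1]},
    "CSC": {"compressed_axes": [0], "coord_axes": [1], "axis_order": [1, 0]},
    "DCSR": {"fiber_axes": [0], "coord_axes": [1], "axis_order": [0, 1]},
    "DCSC": {"fiber_axes": [0], "coord_axes": [1], "axis_order": [1, 0]},
}
LAYOUT_KEYS = ("fiber_axes", "compressed_axes", "coord_axes", "axis_order")


def identify(metadata_json):
    """Name the layout described by a JSON metadata document, if known."""
    meta = json.loads(metadata_json)
    # Ignore optional or unrelated keys; compare only the layout keys.
    core = {k: meta[k] for k in LAYOUT_KEYS if k in meta}
    for name, expected in RANK2_METADATA.items():
        if core == expected:
            return name
    return None


doc = json.dumps({
    "shape": [5, 4],               # hypothetical extra field
    "compressed_axes": [0],
    "coord_axes": [1],
    "axis_order": [1, 0],
    "num_sorted_coords": 1,        # optional key from the table
})
print(identify(doc))
```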
11
Unified Sparse Storage: Rank 3

Storage keys:
  double-compressed: fiber_ind0, fiber_ptr0, fiber_ind1, fiber_ptr1
  compressed: indptr
  sparse: coord0, coord1, coord2
Metadata keys: fiber_axes, compressed_axes, coord_axes, axis_order
Optional metadata keys: num_sorted_coords, has_duplicates

Format      fiber_axes  compressed_axes  coord_axes  num_sorted_coords
CSF         [0, 1]      -                [2]         0 or 1
CSF (alt)   [1]         [0]              [2]         0 or 1
CSR+extra   -           [0]              [1, 2]      0 - 2
COO         -           -                [0, 1, 2]   0 - 3
DCSR+extra  [0]         -                [1, 2]      0 - 2
12
Unified Sparse Storage: Rank 2 combinations

Storage keys:
  double-compressed: fiber_ind0, fiber_ptr0
  compressed: indptr
  sparse: coord0, coord1
Metadata keys: fiber_axes, compressed_axes, coord_axes, axis_order
Optional metadata keys: num_sorted_coords, has_duplicates

Rank 2 combo  fiber_axes  compressed_axes  coord_axes  axis_order  num_sorted_coords
CSR/COO       -           [0]              [0, 1]      [0, 1]      0 or 1
DCSR/CSR      [0]         [0]              [1]         [0, 1]      0 or 1
DCSR/COO      [0]         -                [0, 1]      [0, 1]      0 or 1
DCSR/CSR/COO  [0]         [0]              [0, 1]      [0, 1]      0 or 1
CSC/COO       -           [0]              [0, 1]      [1, 0]      0 or 1
DCSC/CSC      [0]         [0]              [1]         [1, 0]      0 or 1
DCSC/COO      [0]         -                [0, 1]      [1, 0]      0 or 1
DCSC/CSC/COO  [0]         [0]              [0, 1]      [1, 0]      0 or 1
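One way to read the combination table is that a single file can serve several layouts at once, and the metadata says which ones. The sketch below infers the available layouts from the metadata keys; the inference rule (fiber_axes implies DCSR or DCSC, compressed_axes implies CSR or CSC, full coord_axes implies COO, axis_order picks row or column orientation) is an assumption that happens to reproduce every column of the table, not a stated rule of the format.

```python
def available_formats(meta):
    """List the rank-2 layouts a reader could load from one stored file."""
    row_major = meta["axis_order"] == [0, 1]
    found = []
    if meta.get("fiber_axes") == [0]:
        found.append("DCSR" if row_major else "DCSC")
    if meta.get("compressed_axes") == [0]:
        found.append("CSR" if row_major else "CSC")
    if meta.get("coord_axes") == [0, 1]:
        found.append("COO")
    return "/".join(found)


combo = {
    "fiber_axes": [0],
    "compressed_axes": [0],
    "coord_axes": [0, 1],
    "axis_order": [1, 0],
}
print(available_formats(combo))
```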
13
Unified Sparse Storage: Values
● No values?
○ Index-only
● Single values array
○ Typical usage
● Multiple value arrays!
○ Different data types
○ Give them different names?
○ Allows bitmaps
○ Want ability to read only one values array
● Dense array (may be rank N) for each element
○ For example, from embeddings
● Can a fill value be given for the missing elements?
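The options for values can be illustrated with a sparse vector that carries several named value arrays and a per-array fill value. The names "weight" and "mask", the "fill_value" field, and both helper functions are assumptions for this sketch; the questions on the slide remain open.

```python
from bisect import bisect_left

vector = {
    "indices": [2, 5, 9],  # sorted positions of the stored elements
    "values": {
        "weight": [0.5, 1.5, 2.5],    # float values
        "mask": [True, False, True],  # a bitmap-like boolean array
    },
    "fill_value": {"weight": 0.0, "mask": False},
}


def is_stored(vec, i):
    """Index-only query: needs no values array at all."""
    pos = bisect_left(vec["indices"], i)
    return pos < len(vec["indices"]) and vec["indices"][pos] == i


def lookup(vec, name, i):
    """Read element i from one named values array, touching no other."""
    pos = bisect_left(vec["indices"], i)
    if pos < len(vec["indices"]) and vec["indices"][pos] == i:
        return vec["values"][name][pos]
    return vec["fill_value"][name]  # missing element


print(lookup(vector, "weight", 5), lookup(vector, "weight", 6))
print(lookup(vector, "mask", 9), is_stored(vector, 3))
```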
14
How should we pack bytes into a file?
● Straw man proposal: use asar!
○ “asar is a simple extensive archive format, it works like tar that concatenates all files together without compression, while having random access support”
● Support random access
● Use JSON to store files' information
● Very easy to write a parser
| UInt32: header_size | String: header | Bytes: file1 | ... | Bytes: file42 |
Other options
● ZIP file
● LMDB
● HDF4, HDF5, CDF5, NetCDF
● Directory of files on file system
● Directory of files in cloud
○ S3, GCS, Azure, Ceph, etc.
● Zarr
● n5
● ASDF
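The layout line above is simple enough to sketch directly. The code below follows that simplified layout (a UInt32 header size, a JSON header, then concatenated bytes) and is not claimed to be byte-compatible with real asar archives. The little-endian size field and the "files", "offset", and "size" header fields are assumptions for illustration.

```python
import json
import struct


def pack(arrays):
    """Pack named byte buffers as: UInt32 header_size | JSON header | data."""
    files, offset = {}, 0
    for name, data in arrays.items():
        files[name] = {"offset": offset, "size": len(data)}
        offset += len(data)
    header = json.dumps({"files": files}).encode("utf-8")
    body = b"".join(arrays.values())  # same order as the offsets above
    return struct.pack("<I", len(header)) + header + body


def read_one(blob, name):
    """Random access: read a single buffer without scanning the others."""
    (header_size,) = struct.unpack_from("<I", blob, 0)
    header = json.loads(blob[4:4 + header_size].decode("utf-8"))
    entry = header["files"][name]
    start = 4 + header_size + entry["offset"]
    return blob[start:start + entry["size"]]


blob = pack({
    "indptr": struct.pack("<4q", 0, 2, 2, 3),
    "coord1": struct.pack("<3q", 0, 2, 1),
    "values": struct.pack("<3d", 10.0, 20.0, 30.0),
})
print(struct.unpack("<3q", read_one(blob, "coord1")))
```

Because every entry records its own offset and size, a reader can fetch the header first and then request only the byte ranges it needs, which also suits object stores that support ranged reads.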
15
Quick recap: what do we have so far?
● Arbitrary metadata via JSON
○ Able to evolve and support extensions
● Metadata scheme to specify CSR, DCSR, COO, CSF, etc.
○ Indicates what data is available and how to read it
● Single file archive that supports random access
○ Logically behaves like a key-value store
● Great, but what about extensions?
○ Support other sparse formats
○ Specify compression
○ Split single array into chunks
○ Additional data types (such as datetime)
○ Indicate variant of or option for existing format
Design goal: easily show via feature matrix which formats and extensions are supported by each language and library.
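Two of the extensions listed, compression and chunking, can be sketched together. In this hypothetical example each chunk is compressed independently with zlib so that a reader can decompress only the chunks it needs; the metadata field names and the "extensions" list are invented for illustration.

```python
import json
import zlib


def write_chunked(data, chunk_size):
    """Split raw bytes into chunks and compress each chunk independently."""
    chunks = [zlib.compress(data[i:i + chunk_size])
              for i in range(0, len(data), chunk_size)]
    meta = {
        "extensions": ["compression", "chunks"],  # hypothetical names
        "compression": "zlib",
        "chunk_size": chunk_size,
        "num_bytes": len(data),
    }
    return json.dumps(meta), chunks


def read_range(meta_json, chunks, start, stop):
    """Read bytes [start, stop), start < stop, from overlapping chunks only."""
    size = json.loads(meta_json)["chunk_size"]
    first, last = start // size, (stop - 1) // size
    raw = b"".join(zlib.decompress(chunks[k]) for k in range(first, last + 1))
    return raw[start - first * size:stop - first * size]


data = bytes(range(256)) * 4
meta_json, chunks = write_chunked(data, chunk_size=100)
print(len(chunks), read_range(meta_json, chunks, 250, 260))
```

A reader that does not list "chunks" or "compression" among its supported extensions could refuse the file cleanly, which is the kind of information a feature matrix would capture.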
16
File format variations
● CSR
○ “Padded” CSR: add indptr_end so column indices and values can have gaps
■ Cheaper to add or remove values
○ Column indices in a given row can be:
■ unsorted
■ sorted
■ binary tree as a binary heap
○ CSR5
● COO
○ One array per coordinate
■ May have different data types (uint8, uint16, etc.)
○ One array for all coordinates (N x D)
○ Has duplicates or not
○ Lexicographic sort depth
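Two of these variations are easy to show in code. The sketch assumes that in padded CSR, row r owns the slots indptr[r] to indptr[r + 1] and that indptr_end[r] marks the end of the slots in use; it also assumes that "lexicographic sort depth" means the largest number of leading coordinates by which the entries are sorted. Both readings, and all function names, are assumptions.

```python
def padded_csr_row(indptr, indptr_end, col_indices, values, row):
    """Entries of one row; slots past indptr_end[row] are unused padding."""
    lo, hi = indptr[row], indptr_end[row]
    return list(zip(col_indices[lo:hi], values[lo:hi]))


def padded_csr_append(indptr, indptr_end, col_indices, values, row, col, val):
    """Add a value in place if the row still has a free slot."""
    slot = indptr_end[row]
    if slot >= indptr[row + 1]:
        raise ValueError("row is full; arrays would need to be rebuilt")
    col_indices[slot], values[slot] = col, val
    indptr_end[row] += 1


def coo_properties(coords):
    """Return (sort_depth, has_duplicates) for a list of coordinate tuples."""
    depth = 0
    for d in range(1, len(coords[0]) + 1):
        keys = [c[:d] for c in coords]
        if all(a <= b for a, b in zip(keys, keys[1:])):
            depth = d
        else:
            break
    return depth, len(set(coords)) != len(coords)


# Two rows, three slots each; None marks unused padding.
indptr, indptr_end = [0, 3, 6], [2, 4]
col_indices = [0, 2, None, 1, None, None]
values = [1.0, 2.0, None, 3.0, None, None]
padded_csr_append(indptr, indptr_end, col_indices, values, 1, 3, 4.0)
print(padded_csr_row(indptr, indptr_end, col_indices, values, 1))
print(coo_properties([(0, 1), (0, 0), (1, 2)]))
```

The append touches one slot and one counter, with no shifting of later rows, which is the sense in which padding makes adding or removing values cheaper.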
17
File format variations
https://arxiv.org/pdf/1904.03329.pdf
B-CSF
18
What about other properties?
A graph is more than just a sparse matrix
Graph properties
● directed / undirected
● weighted / unweighted
● is adjacency matrix
● is incidence matrix
● is bipartite graph
● is multigraph
● is hypergraph
● has self-edges
● has node attributes
Matrix properties
● is upper triangular
● is lower triangular
● is diagonal
● is symmetric
● is symmetric (only upper triangle)
● is symmetric (only lower triangle)
● is iso-valued
● ndiag
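Some of these properties can be computed from the stored entries, which is one argument for recording them in metadata once rather than recomputing them on every load. The sketch below derives a few of them from COO triples of a square matrix; it considers only explicitly stored entries, assumes no duplicates, and the function name and result keys are illustrative.

```python
def matrix_properties(rows, cols, vals):
    """Derive structural properties from COO triples of a square matrix."""
    entries = dict(zip(zip(rows, cols), vals))  # assumes no duplicates
    return {
        "is_upper_triangular": all(r <= c for r, c in entries),
        "is_lower_triangular": all(r >= c for r, c in entries),
        "is_diagonal": all(r == c for r, c in entries),
        # Symmetric if every stored (r, c) has a matching stored (c, r).
        "is_symmetric": all(entries.get((c, r)) == v
                            for (r, c), v in entries.items()),
        "is_iso_valued": len(set(entries.values())) <= 1,
        "has_self_edges": any(r == c for r, c in entries),
    }


# Undirected path graph 0 - 1 - 2 stored as a full adjacency matrix.
props = matrix_properties([0, 1, 1, 2], [1, 0, 2, 1], [1.0, 1.0, 1.0, 1.0])
print(props)
```

Properties such as "directed", "is bipartite graph", or "has node attributes" cannot be recovered from the matrix entries alone, which is why a graph needs more metadata than a sparse matrix does.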
Thank You!
Erik Welch
ewelch@anaconda.com
About Anaconda
With more than 20 million users, Anaconda is the world’s most
popular data science platform and the foundation of modern
machine learning. We pioneered the use of Python for data
science, champion its vibrant community, and continue to
steward open-source projects that make tomorrow’s
innovations possible. Our enterprise-grade solutions enable
corporate, research, and academic institutions around the world
to harness the power of open-source for competitive advantage,
groundbreaking research, and a better world.
Visit https://www.anaconda.com to learn more.


HPEC 2021 sparse binary format
