Valhalla at Pantheon
A Distributed File System Built on Cassandra, Twisted Python, and FUSE
Pantheon's Requirements
● Density
  ○ Over 50K volumes in a single cluster
  ○ Over 1000 clients on a single application server
● Storage volume
  ○ Over 10TB in a single cluster
  ○ De-duplication of redundant data
● Throughput
  ○ Peaks during the U.S. business day and during site
    imports and backups
● Performance
  ○ Back-end for Drupal web applications; access
    has to be fast enough to not burden a web request
  ○ The applications won't be adapted: code written for local
    disk must run unmodified on Valhalla
Why not off-the-shelf?
● NFS
  ○ UID mapping requires trusted clients and networks
  ○ Standard Kerberos implementations have no HA
  ○ No cloud HA for client/server communication
● GlusterFS
  ○ Cannot scale volume density (though HekaFS can)
  ○ Cannot de-duplicate data
● Ceph
  ○ Security model relies on trusted clients
● MooseFS
  ○ Only primitive security
Valhalla's Design Manifesto
● Drupal applications read and write whole
  files between 10KB and 10MB
   ○ And most reads hit the edge proxy cache
● Drupal tracks files in its database and has
  little need for fstat() or directory listings
● POSIX compliance for locking and
  permissions is unimportant
   ○ But volume-level access control is critical
● Volumes may contain up to 1MM files
● Availability and performance trump
  consistency
Valhalla 1.0
● Schema (diagram): a volumes column family with one row per
  volume; columns are full paths (/d1/, /d1/f1.txt, /d1/d3/,
  /d1/d3/f2.txt, ...) and values are content hashes (ade12...,
  c12bea..., 13a8cd...)
● A content_by_file column family with one row per hash; a single
  content column holds the binary data
● De-duplication falls out of the keying: /dir1/f2.txt in vol2 and
  /dir3/f2.txt in vol3 both point at c12bea...
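A minimal sketch of the write path this schema implies, with in-memory dicts standing in for the two Cassandra column families; the SHA-1 choice and helper names are assumptions, not Pantheon's actual code.

```python
import hashlib

# In-memory stand-ins for the two column families; a real deployment
# would go through a Cassandra client. The hash function is assumed.
volumes = {}          # volume -> {path: content hash}
content_by_file = {}  # content hash -> {"content": binary data}

def write_file(volume, path, data):
    """Store content once under its hash, then point the volume's
    path column at that hash. De-duplication falls out of the keying."""
    digest = hashlib.sha1(data).hexdigest()
    content_by_file.setdefault(digest, {"content": data})
    volumes.setdefault(volume, {})[path] = digest
    return digest

# Identical bytes in two volumes share a single content row.
write_file("vol2", "/dir1/f2.txt", b"same bytes")
write_file("vol3", "/dir3/f2.txt", b"same bytes")
assert len(content_by_file) == 1
```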
Valhalla 1.0 Retrospective
● What worked
  ○ Efficient volume cloning
● What didn't
  ○ Slow computation of directory content when a
    directory is small but contains a large subdirectory
    ■ Fix: Depth prefix for entries
  ○ Slow computation of file size
    ■ Fix: Denormalize metadata into directory entries
  ○ Problems replicating large files
    ■ Fix: Split files into chunks
Valhalla 2.0
● Schema (diagram): volumes columns now carry a depth prefix
  (1:/d1/, 1:/d1/f1.txt, 2:/d1/d3/f2.txt, ...) and their values are
  JSON metadata denormalized from the content, e.g.
  {"size": 1243, "hash": "ade12..."}
● content_by_file rows are split into numbered chunk columns
  (0, 1, 2, ...), each holding one binary chunk of the file
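A sketch of the 2.0 layout covering all three 1.0 fixes, again with dicts standing in for Cassandra. The chunk size is invented and the exact depth numbering may differ from the slide by one; the point is that denormalized metadata makes size lookups local and the depth prefix turns a listing into one bounded scan.

```python
import hashlib, json

CHUNK = 4 * 1024 * 1024  # assumed chunk size; the deck doesn't specify one

volumes = {}             # volume -> {"<depth>:<path>": JSON metadata}
content_by_file = {}     # content hash -> {chunk index: binary chunk}

def depth(path):
    return len([p for p in path.split("/") if p])

def write_file(volume, path, data):
    digest = hashlib.sha1(data).hexdigest()
    # Fix 3: split large content into fixed-size chunks under one row.
    content_by_file[digest] = {
        i: data[off:off + CHUNK]
        for i, off in enumerate(range(0, len(data), CHUNK))
    }
    # Fix 2: denormalize size and hash into the directory entry itself,
    # so stat-style lookups never touch content_by_file.
    meta = json.dumps({"size": len(data), "hash": digest})
    volumes.setdefault(volume, {})[f"{depth(path)}:{path}"] = meta

def list_directory(volume, dirpath):
    # Fix 1: only direct children share this prefix, so a huge
    # subdirectory no longer inflates the scan for its small parent.
    prefix = f"{depth(dirpath) + 1}:{dirpath}"
    return sorted(k for k in volumes.get(volume, {}) if k.startswith(prefix))
```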
Valhalla 2.0 Retrospective
● What worked
  ○ Version 1.0 issues fixed
● Problems to solve
  ○ Directory listings iterate over many columns
    ■ Fix: Cache complete PROPFIND responses
  ○ Single-threaded client bottlenecks
    ■ Fix: "Fast path" with direct HTTP from PHP and
        proxied by Nginx
  ○ File content compaction eats up too much disk
    ■ Fix: "Offloading" cold and large content to S3
        using iterative scripts and real-time decisions
Valhalla 3.0
● Schema (diagram): a new listing_cache column family with one row
  per volume; columns are directories (/dir1/, /d1/d2/, ...) and
  values are the cached binary PROPFIND responses
● content_by_file and volumes are unchanged
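A sketch of how the listing cache might sit in front of the column scan; the rendering helper is a stand-in, not the real WebDAV serializer.

```python
listing_cache = {}  # volume -> {directory: serialized PROPFIND response}

def compute_listing(volume, dirpath):
    # Stand-in for the 2.0-style column scan plus WebDAV XML rendering.
    return f"<multistatus for {volume}{dirpath}>".encode()

def propfind(volume, dirpath):
    """Serve the listing from cache when possible; on a miss, rebuild
    the complete response once and keep it until a write drops it."""
    row = listing_cache.setdefault(volume, {})
    if dirpath not in row:
        row[dirpath] = compute_listing(volume, dirpath)
    return row[dirpath]

def on_write(volume, dirpath):
    # The 3.0 pain point: every change under a directory drops its
    # cached response, forcing the next client back to a full PROPFIND.
    listing_cache.get(volume, {}).pop(dirpath, None)
```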
Valhalla 3.0 Retrospective
● What worked
  ○ Version 2.0 issues fixed
● Problems to solve
  ○ Any change invalidates the cached PROPFIND, forcing clients
    to re-issue a full PROPFIND
    ■ Fix: Extend schema and API to support volume
        and directory event propagation
  ○ Single-threaded client still bottlenecks
    ■ Fix: New, multithreaded client
  ○ Client uses a write-invalidate cache
    ■ Fix: Move to a write-through/write-back model
Meanwhile, in backups
● Stopped using davfs2 file mounts
● New backup preparation algorithm (sketched below)
  a. The backup builder downloads the volume manifest
  b. It streams each file directly from S3 into the tarball
  c. Files not yet on S3 are pushed there by requesting an
     "offload"
● Lower client overhead
● Lower server overhead
● Longer backup preparation time
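A sketch of the backup flow above. The manifest shape ({path: content hash}) and the fetch_from_s3/request_offload helpers are hypothetical stand-ins for Pantheon's internals.

```python
import io, tarfile

def fetch_from_s3(digest):
    """Hypothetical: return the content bytes from S3, or None if the
    content has not been offloaded yet."""
    ...

def request_offload(volume, digest):
    """Hypothetical: ask a Valhalla server to push this content to S3."""
    ...

def build_backup(volume, manifest, tar_path):
    # manifest (assumed shape): {path: content hash} for the volume.
    with tarfile.open(tar_path, "w:gz") as tar:
        for path, digest in manifest.items():
            data = fetch_from_s3(digest)
            if data is None:                      # step (c): not on S3 yet
                request_offload(volume, digest)
                data = fetch_from_s3(digest)
            info = tarfile.TarInfo(name=path.lstrip("/"))
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))   # step (b): S3 -> tarball
```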
Valhalla 4.0
● Schema (diagram): a new events column family keyed by
  volume:directory (vol1:/dir1/, vol3:/d1/d2/, ...); columns are
  timestamps (t=1, t=2, ...) and values are JSON events such as
  {"path": "/dir2/f2.txt", "event": "CREATED"...} and
  {"path": "f3.txt", "event": "DESTROYED"...}
● content_by_file, volumes, and listing_cache are unchanged
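A sketch of the event row in use: clients replace blanket PROPFIND re-fetches with an incremental read of events newer than their last-seen timestamp. Dicts again stand in for the Cassandra row.

```python
import json, time

# Row key is "volume:directory"; columns are timestamps, values are
# JSON events, mirroring the schema above (a sketch, not the real API).
events = {}

def publish(volume, dirpath, path, kind):
    row = events.setdefault(f"{volume}:{dirpath}", {})
    row[time.time()] = json.dumps({"path": path, "event": kind})

def poll(volume, dirpath, since):
    """Return events newer than the caller's last-seen timestamp."""
    row = events.get(f"{volume}:{dirpath}", {})
    return [(t, json.loads(e)) for t, e in sorted(row.items()) if t > since]

publish("vol1", "/dir1/", "/dir2/f2.txt", "CREATED")
print(poll("vol1", "/dir1/", since=0))
```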
Valhalla 4.0 Retrospective
● What worked
  ○ Version 3.0 issues fixed
● Problems to solve
  ○ Cloning volumes breaks the event stream
     ■ Fix: Invalidate events from before the volume
        clone request
  ○ Clients receiving earlier copies of their own events
     ■ Fix: Only send clients events published by other
        clients
  ○ Clients write a file and then have to re-download it
    because of ETag limitations
     ■ Fix: Extend PUT to send the ETag on the response (see the
        sketch after this list)
  ○ Iteration through file content items times out
     ■ Fix: Iterate through local sstable keys
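A hedged sketch of the ETag-on-PUT extension in Twisted, the deck's server stack. The resource layout and hash-as-ETag choice are assumptions; only the header-on-response behavior comes from the slide.

```python
import hashlib
from twisted.web.resource import Resource

class FileResource(Resource):
    isLeaf = True

    def render_PUT(self, request):
        data = request.content.read()
        # Hypothetical store step; Valhalla would write the content and
        # directory entry here. The content hash doubles as the ETag.
        etag = hashlib.sha1(data).hexdigest()
        # The extension: hand the new representation's ETag back on the
        # PUT response, so the client can validate its cache without
        # re-downloading the file it just wrote.
        request.setHeader(b"ETag", b'"' + etag.encode() + b'"')
        request.setResponseCode(201)
        return b""
```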
Valhalla 4.5
● Schema (diagram): a new volume_metadata column family with one row
  per volume; a rewritten column records the timestamp of the last
  clone (vol1 at t=3, vol3 at t=2), invalidating earlier events
● content_by_file, volumes, listing_cache, and events are unchanged
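A sketch of how the rewritten column might combine with the 4.0 event fixes: a clone records a watermark instead of rewriting history, and polls skip both pre-clone events and the caller's own publications. The "client" attribution field and helper names are assumptions.

```python
volume_metadata = {}  # volume -> {"rewritten": timestamp of last clone}
events = {}           # "volume:directory" -> {timestamp: event dict}

def clone_volume(source, target, now):
    # Cloning copies rows cheaply; rather than rewriting the copied
    # event history, record a watermark invalidating everything before
    # the clone request.
    volume_metadata[target] = {"rewritten": now}

def poll(volume, dirpath, since, client_id):
    # Skip events older than the clone watermark, and skip the caller's
    # own publications so it never sees stale copies of its own writes.
    # (Per-event client attribution is an assumed detail.)
    floor = max(since, volume_metadata.get(volume, {}).get("rewritten", 0))
    row = events.get(f"{volume}:{dirpath}", {})
    return [
        (t, e) for t, e in sorted(row.items())
        if t > floor and e.get("client") != client_id
    ]
```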
Implementing the Client Side
● Ditched davfs2
  ○ Single-threaded with only experimental patches to
    multi-thread
  ○ Crufty code base designed to abstract FUSE versus
    Coda
● Based code off of fusedav
  ○ Already multithreaded
  ○ Uses proven Neon WebDAV client
● Gutted cache
  ○ Needed fine-grained update capability for write-through
    and write-back
  ○ Replaced with LevelDB
● Added in high-level FUSE operations
  ○ Atomic open+truncate, atomic create+open, etc.
Caching model
● LevelDB
  ○ Embeddable with low overhead
  ○ Iteration without allocation management
  ○ Data model identical to a single Cassandra row
  ○ Storage model similar to Cassandra sstables
  ○ Similar atomicity to row changes in Cassandra 1.1+
● Mirrored volume row locally
  ○ Including prefixes and metadata
  ○ May move to Merkle-tree-based replication later
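The production client is C, but a short sketch with the plyvel LevelDB binding shows the mirrored-row idea: one local key per Cassandra column, so a local listing is a prefix scan over sorted keys, just like a column slice on the server. Paths and metadata values below are illustrative.

```python
import plyvel  # Python LevelDB binding; the real client uses LevelDB from C

db = plyvel.DB("/tmp/valhalla-cache", create_if_missing=True)

# Mirror the volume's row one key per column: the depth-prefixed entry
# name maps to its denormalized JSON metadata, exactly as on the server.
db.put(b"1:/dir1/", b"{}")
db.put(b"2:/dir1/file.txt", b'{"size": 1243, "hash": "ade12..."}')
db.put(b"2:/dir1/f2.txt", b'{"size": 111, "hash": "c12bea..."}')

# A directory listing is a prefix scan over LevelDB's sorted keys,
# with no result-set allocation for the caller to manage.
for key, meta in db.iterator(prefix=b"2:/dir1/"):
    print(key.decode(), meta.decode())
```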
Benchmarks versus Local and Older Models
What's Next at Pantheon
● Move more toward a pNFS model
  ○ No file content storage in Cassandra (all in S3)
  ○ Peer-to-peer or other non-Cassandra file content
    coordination between clients
● Peer-to-peer cache advisories between
  clients
  ○ Less chatty server communication to poll events
  ○ Smaller window of incoherence (3s to <1s)
● Dropping the "fast path"
  ○ Client is already multithreaded
  ○ Client cache is smarter than direct Valhalla access
  ○ Minimizes incompatibility with Drupal
What's Next for the Community
● Finalize GPL-licensed FuseDAV client
  ○ Already public on GitHub
  ○ Public test suite with bundled server
  ○ Coordinate with existing FuseDAV users to make the
    Pantheon version the official successor
● Publish WebDAV extensions and seek
  standards acceptance
  ○ Progressive PROPFIND
  ○ ETag on PUT
David Strauss
● My groups
  ○ Drupal Association
  ○ Pantheon Systems
  ○ systemd/udev
● Get in touch
  ○ david@davidstrauss.net
  ○ @davidstrauss
  ○ facebook.com/warpforge
● Learn more about Pantheon
  ○ Developer Open House
  ○ Presented by Kyle Mathews and Josh Koenig
  ○ Thursday, February 14th, 12PM PST
  ○ Sign up: http://tinyurl.com/a3ofpc2
