Beyond the File System Designing Large Scale File Storage and Serving Cal Henderson
Hello!
Big file systems? Too vague! What is a file system? What constitutes big? Some requirements would be nice
1. Scalable – looking at storage and serving infrastructures
2. Reliable – looking at redundancy, failure rates, on-the-fly changes
3. Cheap – looking at upfront costs, TCO and lifetimes
Four buckets: Storage, Serving, BCP, Cost
Storage
The storage stack
File protocol – NFS, CIFS, SMB
File system – ext, reiserFS, NTFS
Block protocol – SCSI, SATA, FC
RAID – Mirrors, Stripes
Hardware – Disks and stuff
Hardware overview The storage scale, from lower to higher: Internal → DAS → SAN → NAS
Internal storage A disk in a computer SCSI, IDE, SATA 4 disks in 1U is common 8 for half depth boxes
DAS Direct attached storage Disk shelf, connected by SCSI/SATA HP MSA30 – 14 disks in 3U
SAN Storage Area Network Dumb disk shelves Clients connect via a ‘fabric’ Fibre Channel, iSCSI, InfiniBand Low level protocols
NAS Network Attached Storage Intelligent disk shelf Clients connect via a network NFS, SMB, CIFS High level protocols
Of course, it’s more confusing than that
Meet the LUN Logical Unit Number A slice of storage space Originally for addressing a single drive: c1t2d3 Controller, Target, Disk (Slice) Now means a virtual partition/volume LVM, Logical Volume Management
NAS vs SAN With SAN, a single host (initiator) owns a single LUN/volume With NAS, multiple hosts own a single LUN/volume NAS head – NAS access to a SAN
SAN Advantages Virtualization within a SAN offers some nice features: Real-time LUN replication Transparent backup SAN booting for host replacement
Some Practical Examples There are a lot of vendors Configurations vary, prices vary wildly Let’s look at a couple Ones I happen to have experience with Not an endorsement ;)
NetApp Filers Heads and shelves, up to 500TB in 260U FC SAN with 1 or 2 NAS heads
Isilon IQ 2U Nodes, 3-96 nodes/cluster, 6-600 TB FC/InfiniBand SAN with NAS head on each node
Scaling Vertical vs Horizontal
Vertical scaling Get a bigger box Bigger disk(s), more disks Limited by current tech – size of each disk and total number in appliance
Horizontal scaling Buy more boxes Add more servers/appliances Scales forever* *sort of
Storage scaling approaches Four common models: Huge FS Physical nodes Virtual nodes Chunked space
Huge FS Create one giant volume with growing space Sun’s ZFS Isilon IQ Expandable on-the-fly? Upper limits Always limited somewhere
Huge FS Pluses Simple from the application side Logically simple Low administrative overhead Minuses All your eggs in one basket Hard to expand Has an upper limit
Physical nodes Application handles distribution to multiple physical nodes Disks, Boxes, Appliances, whatever One ‘volume’ per node Each node acts by itself Expandable on-the-fly – add more nodes Scales forever
Physical Nodes Pluses Limitless expansion Easy to expand Unlikely to all fail at once Minuses Many ‘mounts’ to manage More administration
Virtual nodes Application handles distribution to multiple virtual volumes, contained on multiple physical nodes Multiple volumes per node Flexible Expandable on-the-fly – add more nodes Scales forever
Virtual Nodes Pluses Limitless expansion Easy to expand Unlikely to all fail at once Addressing is logical, not physical Flexible volume sizing, consolidation Minuses Many ‘mounts’ to manage More administration
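To make the virtual-node idea concrete, here is a minimal Python sketch with hypothetical names throughout: files are assigned to one of many virtual volumes, and a separate mapping table records which physical host currently holds each volume, so the logical address never changes when a volume is moved or consolidated.

```python
# Minimal sketch of virtual-node addressing (hypothetical names throughout).
# Files are assigned to a virtual volume; a separate mapping table says which
# physical host currently holds each virtual volume, so volumes can be moved
# or consolidated without rewriting per-file metadata.

VVOL_COUNT = 1024  # number of virtual volumes; far more than physical hosts

# virtual volume -> physical host (would live in a database in practice)
vvol_to_host = {v: "storage%02d.example.com" % (v % 8) for v in range(VVOL_COUNT)}

def vvol_for_file(file_id: int) -> int:
    """Pick a virtual volume for a file (simple modulo placement)."""
    return file_id % VVOL_COUNT

def path_for_file(file_id: int) -> tuple[str, str]:
    """Return (physical host, path) for a stored file."""
    vvol = vvol_for_file(file_id)
    return vvol_to_host[vvol], "/vol%04d/%d.dat" % (vvol, file_id)

if __name__ == "__main__":
    # e.g. ('storage00.example.com', '/vol0576/123456.dat')
    print(path_for_file(123456))
```

Moving a virtual volume then only means updating one row in the mapping table, not touching every file it contains.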
Chunked space Storage layer writes parts of files to different physical nodes A higher-level RAID striping High performance for large files – read multiple parts simultaneously
Chunked space Pluses High performance Limitless size Minuses Conceptually complex Can be hard to expand on the fly Can’t manually poke it
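A rough sketch of the chunked model, not modelled on any particular system: a file is cut into fixed-size chunks and each chunk lands on a different node, so a large read can pull several chunks in parallel.

```python
# Sketch of chunked storage (hypothetical layout, not any real system): split a
# file into fixed-size chunks and spread the chunks across nodes.
import os

CHUNK_SIZE = 64 * 1024 * 1024           # 64 MB chunks, as in GFS-style designs
NODES = ["node-a", "node-b", "node-c"]  # placeholder node names

def chunk_placement(file_size: int) -> list[tuple[int, str]]:
    """Return (chunk_index, node) pairs for a file of the given size."""
    chunks = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    return [(i, NODES[i % len(NODES)]) for i in range(chunks)]

def split_file(path: str):
    """Yield (chunk_index, node, data) for each chunk of a local file."""
    placement = dict(chunk_placement(os.path.getsize(path)))
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            yield index, placement[index], data
            index += 1
```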
Real Life Case Studies
GFS – Google File System Developed by … Google Proprietary Everything we know about it is based on talks they’ve given Designed to store huge files for fast access
GFS – Google File System Single ‘Master’ node holds metadata SPF – Shadow master allows warm swap Grid of ‘chunkservers’ 64bit filenames 64 MB file chunks
GFS – Google File System [Diagram: a master node and a grid of chunkservers holding replicated chunks 1(a), 1(b) and 2(a)]
GFS – Google File System Client reads metadata from master then file parts from multiple chunkservers Designed for big files (>100MB) Master server allocates access leases Replication is automatic and self repairing Synchronously for atomicity
GFS – Google File System Reading is fast (parallelizable) But requires a lease Master server is required for all reads and writes
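Since GFS is proprietary, the following is only a rough Python sketch of the read path as publicly described: metadata from the master, then chunk data fetched directly from chunkservers. The `master` and `chunkservers` objects and their methods are made-up stand-ins, not a real API.

```python
# Rough sketch of a GFS-style read path, based only on the publicly described
# design: ask the master which chunkservers hold each 64 MB chunk, then fetch
# the chunk data directly from the chunkservers. All names here are made up.
from concurrent.futures import ThreadPoolExecutor

def read_file(master, chunkservers, filename: str) -> bytes:
    # 1. Metadata from the master: a list of (chunk_handle, [replica locations])
    chunk_index = master.lookup(filename)                    # hypothetical call

    # 2. Data directly from the chunkservers, in parallel where possible
    def fetch(entry):
        handle, replicas = entry
        return chunkservers[replicas[0]].read_chunk(handle)  # hypothetical call

    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(fetch, chunk_index))
    return b"".join(parts)
```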
MogileFS – OMG Files Developed by Danga / SixApart Open source Designed for scalable web app storage
MogileFS – OMG Files Single metadata store (MySQL) MySQL Cluster avoids SPF Multiple ‘tracker’ nodes locate files Multiple ‘storage’ nodes store files
MogileFS – OMG Files [Diagram: multiple tracker nodes in front of a MySQL metadata store]
MogileFS – OMG Files Replication of file ‘classes’ happens transparently Storage nodes are not mirrored – replication is piecemeal Reading and writing go through trackers, but are performed directly upon storage nodes
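A sketch of the MogileFS access pattern rather than its real client API: ask a tracker where a key lives, then read or write directly against the storage nodes over HTTP. `TrackerClient` and its methods are hypothetical stand-ins for the tracker protocol.

```python
# Sketch of tracker-mediated access (hypothetical API, not the real MogileFS
# client): the tracker answers "where does this key live?", the data itself
# moves directly between the client and the storage nodes over HTTP.
import urllib.request

class TrackerClient:
    """Stand-in for a tracker connection; methods left unimplemented."""
    def paths_for(self, key):           # would return replica URLs for a key
        raise NotImplementedError
    def new_destination(self, key):     # would reserve a location for a new file
        raise NotImplementedError
    def confirm_write(self, key, url):  # would record the completed upload
        raise NotImplementedError

def read_key(tracker: TrackerClient, key: str) -> bytes:
    for url in tracker.paths_for(key):            # try replicas in order
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except OSError:
            continue                              # fall through to next replica
    raise IOError("no replica reachable for %r" % key)

def write_key(tracker: TrackerClient, key: str, data: bytes) -> None:
    url = tracker.new_destination(key)
    req = urllib.request.Request(url, data=data, method="PUT")
    urllib.request.urlopen(req).close()
    tracker.confirm_write(key, url)               # replication happens later
```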
Flickr File System Developed by Flickr Proprietary Designed for very large scalable web app storage
Flickr File System No metadata store Deal with it yourself Multiple ‘StorageMaster’ nodes Multiple storage nodes with virtual volumes
Flickr File System [Diagram: multiple StorageMaster (SM) nodes]
Flickr File System Metadata stored by app Just a virtual volume number App chooses a path Virtual nodes are mirrored Locally and remotely Reading is done directly from nodes
Flickr File System StorageMaster nodes only used for write operations Reading and writing can scale separately
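A hypothetical sketch of that split, with invented host names: the application remembers only a virtual volume number, routes writes through a StorageMaster, and reads straight from a storage node serving the volume.

```python
# Sketch of the write-through-StorageMaster / read-direct split (all host names
# and URL layouts are invented for illustration).
import random
import urllib.request

STORAGE_MASTERS = ["sm1.example.com", "sm2.example.com", "sm3.example.com"]
VOLUME_HOSTS = {7: ["node12.example.com", "node31.example.com"]}  # mirrored pair

def write_photo(volume: int, path: str, data: bytes) -> None:
    """Writes are routed through a StorageMaster node."""
    master = random.choice(STORAGE_MASTERS)
    req = urllib.request.Request(
        "http://%s/write/vol%d%s" % (master, volume, path),
        data=data, method="PUT")
    urllib.request.urlopen(req).close()

def read_photo(volume: int, path: str) -> bytes:
    """Reads skip the StorageMasters and hit a storage node directly."""
    host = random.choice(VOLUME_HOSTS[volume])
    with urllib.request.urlopen("http://%s/vol%d%s" % (host, volume, path)) as resp:
        return resp.read()
```

Because reads never touch the StorageMasters, the two paths can be scaled independently, as the slide notes.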
Serving
Serving files Serving files is easy! [Diagram: a single Apache server reading from a disk]
Serving files Scaling is harder [Diagram: many Apache servers, each with its own disk]
Serving files This doesn’t scale well Primary storage is expensive And takes a lot of space In many systems, we only access a small number of files most of the time
Caching Insert caches between the storage and serving nodes Cache frequently accessed content to reduce reads on the storage nodes Software (Squid, mod_cache) Hardware (Netcache, Cacheflow)
Why it works Keep a smaller working set Use faster hardware Lots of RAM SCSI Outer edge of disks (ZCAV) Use more duplicates Cheaper, since they’re smaller
Two models Layer 4 ‘Simple’ balanced cache Objects in multiple caches Good for few objects requested many times Layer 7 URL-balanced cache Objects in a single cache Good for many objects requested a few times
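A small sketch of the difference, assuming four cache hosts: a layer-4 balancer picks any cache, so hot objects get duplicated everywhere, while a layer-7 balancer hashes the URL so each object lives in exactly one cache.

```python
# Layer 4 vs layer 7 cache selection, with four placeholder cache hosts.
import hashlib
import random

CACHES = ["cache1", "cache2", "cache3", "cache4"]

def pick_cache_l4(_url: str) -> str:
    """Layer 4: any cache will do (round robin / least-connections in practice)."""
    return random.choice(CACHES)

def pick_cache_l7(url: str) -> str:
    """Layer 7: hash the URL so the same object always maps to the same cache."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return CACHES[int(digest, 16) % len(CACHES)]

print(pick_cache_l7("/photos/1234_5678.jpg"))  # deterministic for a given URL
```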
Replacement policies LRU – Least recently used GDSF – Greedy dual size frequency LFUDA – Least frequently used with dynamic aging All have advantages and disadvantages Performance varies greatly with each
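For reference, a minimal LRU cache in Python; GDSF and LFUDA refine the same eviction loop by also weighing object size and access frequency.

```python
# Minimal LRU cache to make the replacement idea concrete.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)         # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used item
```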
Cache Churn How long do objects typically stay in cache? If it gets too short, we’re doing badly But it depends on your traffic profile Make the cached object store larger
Problems Caching has some problems: Invalidation is hard Replacement is dumb (even LFUDA) Avoiding caching makes your life (somewhat) easier
CDN – Content Delivery Network Akamai, Savvis, Mirror Image Internet, etc Caches operated by other people Already in-place In lots of places GSLB/DNS balancing
Edge networks [Diagram: a single origin server]
Edge networks [Diagram: the origin surrounded by many edge caches]
CDN Models Simple model You push content to them, they serve it Reverse proxy model You publish content on an origin, they proxy and cache it
CDN Invalidation You don’t control the caches Just like those awful ISP ones Once something is cached by a CDN, assume it can never change Nothing can be deleted Nothing can be modified
Versioning When you start to cache things, you need to care about versioning Invalidation & Expiry Naming & Sync
Cache Invalidation If you control the caches, invalidation is possible But remember ISP and client caches Remove deleted content explicitly Avoid users finding old content Save cache space
Cache versioning Simple rule of thumb: If an item is modified, change its name (URL) This can be independent of the file system!
Virtual versioning Database indicates version 3 of file Web app writes version number into URL Request comes through cache and is cached with the versioned URL mod_rewrite converts versioned URL to path Example: version 3 is requested as example.com/foo_3.jpg, cached as foo_3.jpg, and rewritten on disk to foo.jpg
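A sketch of that flow in Python, using a regex where mod_rewrite would normally do the rewriting; example.com and foo.jpg come from the slide, everything else is illustrative.

```python
# Virtual versioning: the app builds versioned URLs from a version number held
# in the database, and the server maps the versioned name back to the single
# real file on disk (a regex stands in for the mod_rewrite rule).
import re

def versioned_url(name: str, version: int) -> str:
    """What the web app emits: example.com/foo_3.jpg for version 3 of foo.jpg."""
    base, ext = name.rsplit(".", 1)
    return "http://example.com/%s_%d.%s" % (base, version, ext)

def rewrite_to_path(request_path: str) -> str:
    """What the server does on the way to disk: /foo_3.jpg -> /foo.jpg."""
    return re.sub(r"_\d+(\.[a-z0-9]+)$", r"\1", request_path)

assert versioned_url("foo.jpg", 3) == "http://example.com/foo_3.jpg"
assert rewrite_to_path("/foo_3.jpg") == "/foo.jpg"
```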
Authentication Authentication inline layer Apache / perlbal Authentication sideline ICP (CARP/HTCP) Authentication by URL FlickrFS
Auth layer Authenticator sits between client and storage Typically built into the cache software [Diagram: cache → authenticator → origin]
Auth sideline Authenticator sits beside the cache Lightweight protocol used for authenticator [Diagram: cache consulting an authenticator beside it, in front of the origin]
Auth by URL Someone else performs authentication and gives URLs to client (typically the web app) URLs hold the ‘keys’ for accessing files [Diagram: web server hands out URLs; requests flow through the cache to the origin]
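One common way to implement auth by URL, shown here as a generic HMAC scheme rather than whatever FlickrFS actually does: the web app signs the path and an expiry with a secret it shares with the serving layer, so caches and origin can check access without calling back into the app.

```python
# Generic signed-URL sketch (an assumption, not the FlickrFS scheme): the web
# app signs path + expiry; the serving layer verifies the signature.
import hashlib
import hmac
import time

SECRET = b"shared-secret"  # known only to the web app and the serving layer

def signed_url(path: str, ttl: int = 3600) -> str:
    expires = int(time.time()) + ttl
    msg = ("%s:%d" % (path, expires)).encode("utf-8")
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return "%s?expires=%d&sig=%s" % (path, expires, sig)

def verify(path: str, expires: int, sig: str) -> bool:
    if expires < time.time():
        return False                    # link has expired
    msg = ("%s:%d" % (path, expires)).encode("utf-8")
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```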
BCP
Business Continuity Planning How can I deal with the unexpected? The core of BCP Redundancy Replication
Reality On a long enough timescale, anything that can fail, will fail Of course, everything can fail True reliability comes only through redundancy
Reality Define your own SLAs How long can you afford to be down? How manual is the recovery process? How far can you roll back? How many ‘node x’ boxes can fail at once?
Failure scenarios Disk failure Storage array failure Storage head failure Fabric failure Metadata node failure Power outage Routing outage
Reliable by design RAID avoids disk failures, but not head or fabric failures Duplicated nodes avoid host and fabric failures, but not routing or power failures Dual-colo avoids routing and power failures, but may need duplication too
Tend to all points in the stack Going dual-colo: great Taking a whole colo offline because of a single failed disk: bad We need a combination of these
Recovery times BCP is not just about continuing when things fail How can we restore after they come back? Host- and colo-level syncing – replication, queuing Host- and colo-level rebuilding
Reliable Reads & Writes Reliable reads are easy 2 or more copies of files Reliable writes are harder Write 2 copies at once But what do we do when we can’t write to one?
Dual writes Queue up data to be written Where? Needs itself to be reliable Queue up journal of changes And then read data from the disk whose write succeeded Duplicate whole volume after failure Slow!
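A sketch of the dual-write fallback described above, with hypothetical storage-node objects and a local file standing in for a properly reliable journal: write both copies, and if exactly one side fails, journal the miss so a repair job can re-copy from the surviving disk.

```python
# Dual writes with a repair journal. The node objects are hypothetical clients
# with .name and .write(); the journal is a local file only to keep the sketch
# short, since in practice the journal itself must be reliable.
import json
import time

def try_write(node, path: str, data: bytes) -> bool:
    try:
        node.write(path, data)   # hypothetical storage-node client call
        return True
    except OSError:
        return False

def write_both(node_a, node_b, path: str, data: bytes, journal="repair.journal"):
    ok_a = try_write(node_a, path, data)
    ok_b = try_write(node_b, path, data)
    if ok_a and ok_b:
        return
    if not ok_a and not ok_b:
        raise IOError("both writes failed for %s" % path)
    # One copy made it; journal the other so it can be replayed after recovery.
    failed = node_b if ok_a else node_a
    with open(journal, "a") as j:
        j.write(json.dumps({"node": failed.name, "path": path,
                            "when": time.time()}) + "\n")
```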
Cost
Judging cost Per GB? Per GB upfront and per year Not as simple as you’d hope How about an example
Hardware costs Cost of hardware ÷ usable GB [single cost]
Power costs Cost of power per year ÷ usable GB [recurring cost]
Power costs Power installation cost ÷ usable GB [single cost]
Space costs (Cost per U × U’s needed, inc. network) ÷ usable GB [recurring cost]
Network costs Cost of network gear ÷ usable GB [single cost]
Misc costs (Support contracts + spare disks + bus adaptors + cables) ÷ usable GB [single & recurring costs]
Human costs (Admin cost per node × node count) ÷ usable GB [recurring cost]
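A worked example of those formulas with made-up numbers, normalising everything to cost per usable GB and splitting upfront from recurring costs.

```python
# Worked cost-per-GB example; every figure below is invented for illustration.
usable_gb = 20_000                    # usable capacity of one storage unit

single = {
    "hardware":      60_000,          # shelves, disks, heads
    "power_install":  2_000,
    "network_gear":   4_000,
    "misc":           3_000,          # spares, bus adaptors, cables, support buy-in
}
recurring = {
    "power_per_year": 3_500,
    "space_per_year": 6 * 1_200,      # U's needed (inc. network) x cost per U
    "admin_per_year": 0.25 * 90_000,  # fraction of an admin's time x salary
}

upfront_per_gb = sum(single.values()) / usable_gb
yearly_per_gb = sum(recurring.values()) / usable_gb

print("upfront: $%.2f/GB, ongoing: $%.2f/GB/year" % (upfront_per_gb, yearly_per_gb))
# Over a 3-year lifetime: upfront + 3 x yearly gives a comparable total per GB.
```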
TCO Total cost of ownership in two parts Upfront Ongoing Architecture plays a huge part in costing Don’t get tied to hardware Allow heterogeneity Move with the market
(fin)
Photo credits flickr.com/photos/ebright/260823954/ flickr.com/photos/thomashawk/243477905/ flickr.com/photos/tom-carden/116315962/ flickr.com/photos/sillydog/287354869/ flickr.com/photos/foreversouls/131972916/ flickr.com/photos/julianb/324897/ flickr.com/photos/primejunta/140957047/ flickr.com/photos/whatknot/28973703/ flickr.com/photos/dcjohn/85504455/
You can find these slides online: iamcal.com/talks/
