Deduplication and Single
                                   Instance Storage

                                    Practical Applications for Backups,
                                     Archiving, and Primary Storage




                                       Presented by:

                                              Jacob Farmer
                                              Cambridge Computer


© Copyright 2009-2010, Cambridge Computer Services, Inc. – All Rights Reserved
www.CambridgeComputer.com – 781-250-3000
About Your Lecturer

       Jacob Farmer, CTO, Cambridge Computer
         • Cambridge Computer, founded in 1991, provides training, integration,
           sales, and consulting in the fields of storage management, data
           protection, and digital archiving.
       Been working in data protection and storage management for
       almost 20 years.
         • Lecturer on storage technologies for Usenix for the past 10 years.
       Hybrid of industry analyst and consultant to end-users.
         • Spend 25% of my time working in the industry, going to conferences,
           meeting with vendors.
         • 75% of my time customer-facing, helping the sales and services
           departments design solutions for end users.
       Email: jfarmer@CambridgeComputer.com


Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010
Follow Me on Twitter


        My personal activities:
         •@JacobAFarmer
                –Note the “A” – my middle initial
        My educational activities
         •@Cambridge_EDU

Agenda / Topics

       Dedupe basics
         • What is it, how does it work, and what is all the fuss about?
         • Hashing, segmenting, indexing, etc.
       Dedupe for backup systems
         • Basic benefits
         • Different approaches for scaling backups and how they relate back to dedupe
                – Front end bottlenecks
                – Backup data-movers
                – Back-end bottlenecks and scalable deduping
       Dedupe for primary storage
         • Virtual servers, physical servers, VDI
         • Rich media dedupe
       WAN Accelerators
       Questions as time permits

What is Deduplication?

       A term that refers to a number of different methods
       and techniques for reducing multiple instances of
       identical data down to a single (or at least fewer)
       instances.
         • Common data is replaced with pointers or tokens that refer
           back to the actual data.
       Other terms for deduplication
         • Data Reduction
         • Commonality Factoring
         • Capacity Optimization
         • Single Instancing or Single-Instance Storage (SIS)
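The idea of replacing common data with pointers can be sketched in a few lines. This is a toy illustration only, not any vendor's design: the class name `DedupeStore`, the fixed 4096-byte segment size, and the use of SHA-256 as the pointer are all assumptions made for the example.

```python
import hashlib

class DedupeStore:
    """Toy single-instance store: each unique segment is kept exactly once."""

    def __init__(self):
        self.segments = {}   # fingerprint -> segment bytes (the "actual data")

    def write(self, data, segment_size=4096):
        """Store data and return the list of fingerprints (pointers) describing it."""
        pointers = []
        for i in range(0, len(data), segment_size):
            segment = data[i:i + segment_size]
            fingerprint = hashlib.sha256(segment).hexdigest()
            self.segments.setdefault(fingerprint, segment)  # keep only the first copy
            pointers.append(fingerprint)
        return pointers

    def read(self, pointers):
        """Rebuild the original data by following the pointers."""
        return b"".join(self.segments[p] for p in pointers)

    def stored_bytes(self):
        return sum(len(s) for s in self.segments.values())

store = DedupeStore()
report = b"A" * 4096 + b"B" * 4096
copy_1 = store.write(report)
copy_2 = store.write(report)   # the second instance adds no new segments
print(store.stored_bytes(), "bytes stored for", 2 * len(report), "bytes written")
```

Both writes return the same pointer list, and reading either one back yields the original bytes.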

Is Deduplication a Form of
 Compression?
       Yes, and No.

       YES – Deduplication results in data taking up less
       storage space or consuming less bandwidth on a
       network circuit.
         • Note that dedupe is often used in conjunction with
           conventional compression.
       NO – Deduplication could work on data types that
       are not compressible.
         • If you have 10 identical JPEG files stored in an
           uncompressible format, they could be reduced to a single
           instance, thus freeing up 90% of your capacity.
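The JPEG example can be checked with a small experiment. In this sketch, random bytes stand in for an already-compressed image (an assumption; no real JPEG is used), `zlib` represents conventional compression, and SHA-256 is used to spot identical files.

```python
import hashlib
import random
import zlib

rng = random.Random(0)
# Stand-in for an already-compressed JPEG: random bytes do not compress further.
jpeg_like = bytes(rng.getrandbits(8) for _ in range(4096))
files = [jpeg_like] * 10   # ten identical copies

logical = sum(len(f) for f in files)
compressed = sum(len(zlib.compress(f)) for f in files)   # compress each file on its own
unique = {hashlib.sha256(f).digest(): f for f in files}  # single-instance by fingerprint
deduped = sum(len(f) for f in unique.values())
savings = 1 - deduped / logical

print(f"logical {logical} B, compressed {compressed} B, deduped {deduped} B")
print(f"dedupe savings: {savings:.0%}")
```

Per-file compression gains nothing here, while single instancing keeps one copy out of ten.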

Where Do You Find Dedupe
 Solutions?
       Deduplication solutions come to market whenever costs or
       efficiencies can be achieved by eliminating redundancy.
         • Backups
                – Conventional backup systems generate tons of redundant data
         • Email systems (at rest and in flight)
                – I send an email with the same attachment to everyone in the company.
                – Then everyone stores it in his/her personal home directory
                – Everyone in the branch offices pulls it over the WAN
         • File traffic over a WAN
         • Application and O.S. binaries across multiple systems
                – Virtual Servers and Virtual Desktops
                – Backups over a WAN
         • Very large collections of rich media files

Hashing / Fingerprinting

       Hashing (aka fingerprints, digests, signatures)
         • Generates an effectively unique number (typically 160+ bits; MD5 is only 128) based on content
         • Hash acts as a proxy for content
         • Given a hash, not computationally feasible to generate
           content
       Common Hashing Algorithms
         • MD5
         • SHA-1
         • SHA-256
         • AES-based constructions (AES itself is a block cipher, not a hash function)
       Hash Size and the Birthday Paradox
         • The size of the hash needs to be suitable to the task at
           hand
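To make the birthday paradox concrete, here is a rough estimate using the standard approximation P ≈ 1 − e^(−n(n−1)/2N) for n items drawn from N possible hash values. The figure of 2^40 stored segments is a hypothetical workload chosen for illustration, not a number from this talk.

```python
import hashlib
import math

def collision_probability(n_items, space_size):
    """Birthday approximation: chance that any two of n_items random values collide."""
    exponent = -n_items * (n_items - 1) / (2 * space_size)
    return -math.expm1(exponent)   # 1 - e^x, computed accurately for tiny x

# Digest sizes in bits, read from hashlib rather than typed in by hand.
digest_bits = {name: hashlib.new(name).digest_size * 8
               for name in ("md5", "sha1", "sha256")}

segments = 2 ** 40   # hypothetical: about a trillion stored segments
for name, bits in digest_bits.items():
    p = collision_probability(segments, 2 ** bits)
    print(f"{name:7s} {bits:3d} bits  P(any collision) ~ {p:.1e}")
```

The same function reproduces the classic result: `collision_probability(23, 365)` is about one half.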

Hash Collisions - Are they real?


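       [Figure: probability scale with ticks at 10^-10, 10^-20, and 10^-30.
        More likely than 10^-10: hit by lightning; win the lottery.
        Between 10^-10 and 10^-20: Fibre Channel bit error rate; simultaneous triple disk fault on RAID-6.
        Between 10^-20 and 10^-30: Cretaceous extinction meteor hitting in the next second.
        Around 10^-30: cryptographic hash collision.]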

What Makes Deduplication and SIS
 Technology Difficult to Engineer?
       Hash Processing
         • Modern CPUs make this much easier
                – 100+ MB/sec/core
         • Hardware co-processor cards can hash at rates north of
           1.5GB/Sec.
       Disk performance
         • Deduped data often ends up getting fragmented on disk
                – This can hurt performance especially for backup systems

       Alignment of de-dupe segments
       Indexing

Indexing Can Be Hard
       [Figure: log-log chart of # lookups/sec (10^0 to 10^6) against # records (10^1 to 10^9).
        Bands, from bottom to top: human lookup rates; software database technology; purpose-built hardware.
        Plotted examples: iPod, NYC phonebook, large database, router, fine-grained content tracking.]
Parsing / Segmenting / Chunking

       Data needs to be “chopped up” in a consistent way in order to get optimal
       dedupe ratios
       Without any kind of special segmenting strategies backup streams and
       complex file types do not dedupe effectively

       Large files are almost always changed with overstrike semantics
         • Databases, structured data, .vmdk, .pst files
       Small files are almost always changed with insert semantics
         • Office apps, editors etc
       If there are large files (e.g. database tables, virtual machine images) in the
       backup mix, their treatment usually will dominate any data reduction
       strategy.
         • Don’t sweat the small stuff!
       Different vendors may have strengths with one type or another

Change Types: Insert v. Overstrike



        Insert:
             The quick brown fox jumped over the lazy dog.
             The quick brown horse jumped over the lazy dog.
             Identical data (may be) misaligned

        Overstrike:
             Joe | (empty) | Sue
             Joe | Fred    | Sue
             “Fred” added to employee database; identical data doesn’t move
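The difference between the two change types shows up clearly when the same data is segmented two ways. The sketch below compares fixed-size segments with content-defined segments. The window size, the divisor, and the use of `zlib.crc32` to pick cut points are all illustrative assumptions; real products use their own (usually rolling) hash functions and tuning.

```python
import hashlib
import random
import zlib

def fixed_chunks(data, size=256):
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data, window=16, divisor=256):
    """Cut wherever a checksum of the trailing `window` bytes hits a target value.
    Recomputing crc32 at every position is slow; it only keeps the sketch short."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) % divisor == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def reuse_fraction(old_chunks, new_chunks):
    """Share of new chunks whose fingerprint already exists in the old set."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    hits = sum(hashlib.sha256(c).digest() in known for c in new_chunks)
    return hits / len(new_chunks)

rng = random.Random(1)
original = bytes(rng.getrandbits(8) for _ in range(32 * 1024))
inserted = original[:1000] + b"horse" + original[1000:]     # insert: tail shifts by 5 bytes
overstruck = original[:1000] + b"horse" + original[1005:]   # overstrike: length unchanged

results = {
    "fixed / overstrike": reuse_fraction(fixed_chunks(original), fixed_chunks(overstruck)),
    "fixed / insert": reuse_fraction(fixed_chunks(original), fixed_chunks(inserted)),
    "content-defined / insert": reuse_fraction(
        content_defined_chunks(original), content_defined_chunks(inserted)),
}
for label, fraction in results.items():
    print(f"{label:26s} {fraction:.0%} of chunks reused")
```

Fixed-size segments cope with an overstrike but lose alignment after an insert, while content-defined cut points resynchronize shortly after the inserted bytes.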

NetBackup OST (open storage option)

       API and framework that makes it very easy for a dedupe
       target device vendor to parse the data stream.
         • Pre-segments content
         • Enables more efficient dedupe solutions
         • Allows for smart copy between systems of only changed
           data


       [Diagram: segment sets labeled PQZ and PQR, with segments R and Z shown between them.]

Deduplication for Backups




Backup Systems Have a Lot of
 Redundant Data
       Conventional backup solutions generate a ton of redundant
       data
         • Assuming weekly full backups, a file that has not changed in 5 years,
           still gets backed up 260 times!
         • Assuming daily full backups of email, a message you received 5 years
           ago gets backed up 1825 times.
                – Similarly, a record in a database from 5 years ago might be backed up
                  1825 times!
       There are really two problems to solve:
         • Minimizing the amount of redundant data that gets repeatedly
           transferred
         • Minimizing the amount of redundant data that gets stored.



Most of the Buzz on Dedupe is from
 Backup Target Vendors
       A “target” is a backup storage device
       Dedupe disk targets generally come packaged as
         • NAS
                – File server (NFS or CIFS) interface
         • Virtual tape library
                – A disk device that emulates a tape library
                – Fibre Channel or iSCSI interface
                – NAS vs VTL outside the scope of this talk




When Do Dedupe Disk Targets Shine?

       When you are backing up a lot of redundant data
         • Files that never or seldom change between backups
         • Duplicated files
         • Databases and email repositories that are receptive to
           commonality factoring
       When you are retaining backup data for a decent
       amount of time
         • Ideally you are keeping several weeks of backups
       When you seek to replicate a conventional backup
       system over a WAN.

Example: NYC Law Firm with
 NetBackup
       6+ Terabytes
       Full backups every day!
         • Why? Because someone had a bad experience in the past
           with incremental backups and has trust issues
       90 day retention period
       Most files seldom change
         • Many files are scanned images that never change
       Several large databases
       Several TB of MS Exchange
       Average result – 102x capacity optimization!
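A back-of-the-envelope view of what those numbers imply. This sketch treats the "6+ Terabytes" as exactly 6 TB per daily full and applies the 102x figure uniformly, both simplifying assumptions; the function names are made up for the example.

```python
def logical_tb(full_size_tb, retention_days, fulls_per_day=1):
    """Total backup data the software believes it is retaining."""
    return full_size_tb * retention_days * fulls_per_day

def stored_tb(logical, ratio):
    """Physical capacity consumed after capacity optimization."""
    return logical / ratio

logical = logical_tb(6, 90)          # daily 6 TB fulls kept for 90 days
physical = stored_tb(logical, 102)   # at the reported 102x average
print(f"{logical} TB of retained backups in roughly {physical:.1f} TB of disk")
```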
Try to Visualize 100x Capacity
 Optimization




                 OR

   One 3U cabinet v. 7 full racks full of gear!

Backup Vaulting – Another Use Case
 for Dedupe




Why Replicate the Backup System?

       Relatively easy DR solution
         • Does not require additional software for the hosts
         • Does not require storage devices with replication
           capabilities
         • One system that replicates all of your hosts
                – Platform-independent

       Eliminate the need to ship tapes off site
         • Eliminate the need to encrypt tapes




Example: Defense Contractor
 Replicating ERP System
       Problem: CIO does not want employee data being
       sent off site without encrypting the tapes.
         • IT staff wants to avoid tape encryption.

       Solution:
         • Full backup of 800GB+ Oracle database to deduplicating disk target
           every day.
         • Retain backups for 60 days on disk.
                – 60 x 800 GB = 48 TB
         • Vault backups to remote site over T1

       Outcome
         • Dedupe ratio of about 70:1
         • 800GB backup job traverses the T1 in a few hours
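A rough calculation shows why the data has to be reduced before it crosses the T1. The sketch assumes the nominal T1 line rate of 1.544 Mbit/s, ignores protocol overhead, and reads "a few hours" as a 4-hour window; those are assumptions for illustration. The slide gives the 70:1 ratio for the stored data but does not state how much new data a single nightly job actually sends, so the per-job figure below is only an estimate of what would be required.

```python
T1_BITS_PER_SEC = 1.544e6   # nominal T1 line rate, no protocol overhead

def transfer_hours(gigabytes, bits_per_sec=T1_BITS_PER_SEC):
    """Hours needed to push the given amount of data over the link."""
    return gigabytes * 1e9 * 8 / bits_per_sec / 3600

def reduction_needed(gigabytes, window_hours, bits_per_sec=T1_BITS_PER_SEC):
    """Data reduction factor required to fit the job into the window."""
    return transfer_hours(gigabytes, bits_per_sec) / window_hours

raw_days = transfer_hours(800) / 24
at_70_to_1 = transfer_hours(800 / 70)
needed_for_4h = reduction_needed(800, 4)   # 4 hours is an assumed window

print(f"800 GB sent raw: about {raw_days:.0f} days")
print(f"800 GB reduced 70:1: about {at_70_to_1:.1f} hours")
print(f"reduction needed for a 4-hour window: about {needed_for_4h:.0f}:1")
```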

But, Before We Get all Hot and
 Bothered . . .


 Let’s review how backup systems
 actually work!



Common Backup Bottlenecks

       Backup Clients
         • You have to get data off the host and transfer it
       Network
         • Seldom the real bottleneck, except over a WAN
       Backup Servers
         • I/O processing is the most common bottleneck
       Storage Devices
         • Storage devices can be a bottleneck, but are seldom the whole problem.


Front-end and Network – Minimize
 Duplication in the First Place
       Backups generate a lot of redundant data, so what if
       we had smarter client software that did not generate
       redundant data?
         • Incremental Forever
                – After the first full backup, only do incremental backups
                – This is what IBM TSM does, for instance
         • Synthetic Full Backup
                – Last week’s full backup is merged with this week’s incremental
                  backups to “synthesize” this week’s full backup.
                – No need to transfer redundant data
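The merge step behind a synthetic full can be sketched as a catalog operation. This is a simplified model, not how any particular backup product stores its catalog: backups are plain dictionaries mapping a path to a version label, and `DELETED` is a hypothetical marker for files removed since the last full.

```python
DELETED = None   # hypothetical marker in an incremental for a file that was removed

def synthesize_full(last_full, incrementals):
    """Merge the previous full with later incrementals, applied oldest first."""
    full = dict(last_full)
    for incremental in incrementals:
        for path, version in incremental.items():
            if version is DELETED:
                full.pop(path, None)    # file no longer exists on the client
            else:
                full[path] = version    # newer version replaces the older one
    return full

last_week_full = {"a.doc": "v1", "b.xls": "v1", "c.jpg": "v1"}
monday = {"a.doc": "v2"}
tuesday = {"b.xls": DELETED, "d.ppt": "v1"}

this_week_full = synthesize_full(last_week_full, [monday, tuesday])
print(this_week_full)
```

The merge happens entirely on the backup side, so the unchanged `c.jpg` never crosses the network again.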




Example: Energy Firm using IBM
 Tivoli Storage Manager
       TSM only backs up files that have changed.
         • It does not generate a lot of duplicate files
       Most of the 15TB of capacity are documentation and images
       that do not change – ever.
         • Relatively little of it is database.
         • Images don’t compress
         • Utilizing compression on TSM client for compressible files
       Overall deduplication ratio: about 2:1
         • Can’t justify the cost of dedupe across the board
         • Resolution: Set up dedupe tier for database and email
                – Do the file backups to conventional disk and tape




Synthetic Full Backups – An Approach
 that Creates a Need for Dedupe
       Synthetic Full Backups
         • “Poor man’s incremental forever”
         • Combine subsequent incremental backups with the
           previous full backup to “synthesize” the next week’s full
           backup.
         • Great technique for minimizing network traffic from
           backups.
       Synthetic full backups require that at least two weeks
       of backups be available on disk.
         • Dedupe disk targets tend to be a big win for synthetic full
           backups

Example: Research Firm with 6 Week
 Retention and Synthetic Fulls
       60TB+
         • Mix of large file systems, content management systems,
           email, and database
       Using Commvault with heavy use of synthetic full
       backups
       6 week retention on disk
       Dedupe ratios between 8x and 16x
         • NOTE: Their backup data could not fit in one dedupe box,
           so they are managing 4 separate dedupe appliances in
           each of their locations.


Theoretical v. Actual Capacity
 Your Mileage May Vary
       YMMV – one customer’s mileage
         • 48 TB raw disk
         • 36 TB with RAID-6
         • 35 +/- TB for unique capacity
         • 3-5 TB deliberately left empty for headroom
       Might hold
         • As much as 500 TB of backups
         • Or as little as 50 TB.




Dedupe on the Backup Client

       Host-side dedupe is a form of sub-file-level incremental
         • Instead of catching block-level changes, the file system changes are
           hashed and compared with the back-end storage repository.
         • Alternative to block-based CDP
         • Unique data segments are then transferred to the backup service.
       Host-side deduping is very valuable over the WAN.
         • Minimizes data that needs to be transferred
         • Typically it will dedupe across hosts, reducing files that are common to
           multiple hosts
                – Such as application and operating system binaries
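The client-side exchange described above can be sketched as follows. The `BackupServer` class, its `missing` and `store` methods, and the fixed 4096-byte segments are hypothetical names and choices for this illustration; real products use their own protocols and segmenting, and also record which segments make up each file, which is omitted here.

```python
import hashlib
import random

class BackupServer:
    """Toy back-end repository shared by all clients."""

    def __init__(self):
        self.repository = {}   # fingerprint -> segment

    def missing(self, fingerprints):
        """Tell the client which fingerprints the repository has never seen."""
        return [fp for fp in fingerprints if fp not in self.repository]

    def store(self, fingerprint, segment):
        self.repository[fingerprint] = segment

def backup_host(server, files, segment_size=4096):
    """Hash locally, then send only the segments the server does not hold.
    Returns the number of payload bytes sent over the wire."""
    segments = {}
    for data in files.values():
        for i in range(0, len(data), segment_size):
            segment = data[i:i + segment_size]
            segments[hashlib.sha256(segment).hexdigest()] = segment
    sent = 0
    for fp in server.missing(list(segments)):
        server.store(fp, segments[fp])
        sent += len(segments[fp])
    return sent

rng = random.Random(7)
os_binary = bytes(rng.getrandbits(8) for _ in range(16 * 1024))  # same file on both hosts

server = BackupServer()
sent_london = backup_host(server, {"/bin/os": os_binary, "/home/a.txt": b"london notes"})
sent_hong_kong = backup_host(server, {"/bin/os": os_binary, "/home/b.txt": b"hong kong notes"})
print(f"London sent {sent_london} bytes, Hong Kong sent {sent_hong_kong} bytes")
```

The second host sends only its own unique file, because the shared binary is already in the repository.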




WAN Backup Software with Dedupe in
 the Client
       [Diagram: dedupe clients (London, Hong Kong, local USB) connect over the WAN to backup server(s) (New York, Jersey City), with a LAN segment labeled “Shared Client & Local Recovery”.]
Backup System Network I/O
 Processing Bottlenecks


 Moving Backup Data Through the
 Network



Backup Server I/O Processing is a
 Major Bottleneck
       In most enterprise backup systems a single backup
       server would be a major performance bottleneck
         • Unless you were doing incremental forever or sub-file-level
           backups
         • Add a dedupe process to that and it becomes that much
           harder
       A common practice for scaling out backup server
       performance is to add network “data movers”
         • Also known as: storage nodes, media servers, media
           agents, etc.


Interesting Idea – Add Deduplication
 to the Network Data Movers




       [Diagram labels: Network; Dedicated Storage Network]




I/O Processing Bottlenecks
 Network Data Movers and “LAN-Free”



       [Diagram labels: Network data movers; “LAN-Free” Clients; Dedicated Storage Network]




LAN-Free Backup Clients and NDMP
 Backups
       In larger enterprise-class backup systems it is
       common to have larger servers move data directly to
       storage devices over Fibre Channel.
       The fastest way to back up a large NAS server is to do
       NDMP dumps over Fibre Channel.




“LAN-FREE” – End-run Around the
 Backup Server



       [Diagram labels: GigE; Storage Area Network; Tape Robot Arm]
       SAN clients work like slave servers. They back up directly to the
       storage media, while reporting metadata over the LAN to the backup server.
       Presumably all of these tape drives are part of a tape library.

Dedupe with LAN-Free Backup Clients

       With LAN-Free backup you get no benefit from
       dedupe processing residing on the data movers.
         • The dedupe logic needs to sit on the target storage device
       This is where VTLs shine
         • VTL works just like tape
                – Network data movers work fine
                – LAN-Free clients work fine
         • VTLs offer higher throughput than CIFS or NFS
                – Common to see total throughput in excess of 1GB/Sec
         • VTLs might offer tighter integration with tape
       Many VTLs do dedupe as a post-process


Back End Bottlenecks


 Can your dedupe appliance keep
 pace with the backup system?



Back-End Bottlenecks: Can the
 Dedupe Storage Devices Hack It?
       If you open up the flood gates, you might find that a
       single dedupe box on the LAN cannot hack it.
       Some solutions:
         • Buy lots of individual dedupe devices
         • Maybe use a VTL implementation of dedupe
                – Sorry, out of the scope of this lecture
         • Post-process deduping instead of deduping on-the-fly
                – Less efficient from a capacity standpoint, but should be able to
                  achieve considerably better performance
         • New grid-based architectures that offer parallel processing
           for deduplication
         • Newer dedupe devices that are up to the task
Stand-alone Deduplication Servers

       Single server dedupe solutions are often constrained
       by:
         • RAM and processing power
         • The size of the index they can manage
         • Disk performance
       When you max out the box, you need to buy another
       one
         • Very painful incremental upgrade
         • No dedupe across multiple boxes
         • Make sure that you buy a big enough box!
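To see why the index is a sizing constraint, here is a rough estimate of how large a fingerprint index could get. Every input is an assumption made for illustration: 35 TB of unique data (borrowed from the capacity example earlier in this talk), an 8 KiB average segment, a 20-byte fingerprint (160 bits), and an 8-byte disk locator per entry. Real products differ in all of these.

```python
TB = 10 ** 12

def index_entries(unique_bytes, avg_segment_bytes):
    """One index entry per unique segment."""
    return unique_bytes // avg_segment_bytes

def index_bytes(entries, fingerprint_bytes=20, locator_bytes=8):
    """Raw index size, ignoring any data-structure overhead."""
    return entries * (fingerprint_bytes + locator_bytes)

entries = index_entries(35 * TB, 8 * 1024)   # assumed 8 KiB average segment
size_gb = index_bytes(entries) / 10 ** 9

print(f"{entries / 1e9:.1f} billion entries, about {size_gb:.0f} GB of raw index")
```

Under these assumptions the raw index is far larger than the RAM of a typical single server, which is one reason index handling limits how far a stand-alone box can scale.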

Object-Based File System with Grid
 Architecture and Global Dedupe
                                                                            CIFS/NFS Clients
                                                                       Backup System Data Movers
                                                                    Conventional File System Consumers




      Front-End Nodes
      Export File Systems
      Scale-out performance into GBs/Sec




       Back-End Nodes
       Manage disk, dedupe, and redundancy
       Scale-deep to Petabytes of capacity
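
One way a grid can dedupe globally is to let the hash of a segment decide which back-end node owns it, so identical segments always land on the same node no matter which front-end node received them. The sketch below is a minimal illustration under that assumption. The class names are hypothetical, and real products may place data differently.

```python
import hashlib

class BackEndNode:
    """Hypothetical back-end node: stores each unique segment once."""
    def __init__(self, name):
        self.name = name
        self.segments = {}  # hash -> segment bytes

    def store(self, digest, segment):
        """Return True if the segment was new, False if it was a duplicate."""
        if digest in self.segments:
            return False
        self.segments[digest] = segment
        return True

class Grid:
    def __init__(self, node_count):
        self.nodes = [BackEndNode(f"node-{i}") for i in range(node_count)]

    def write(self, segment):
        digest = hashlib.sha256(segment).digest()
        # The hash picks the owner node, so identical segments always meet
        owner = self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]
        return owner.store(digest, segment)

grid = Grid(node_count=4)
results = [grid.write(s) for s in (b"alpha", b"beta", b"alpha")]
stored = sum(len(node.segments) for node in grid.nodes)
print(results, stored)
```

Simple modulo placement reshuffles ownership whenever a node is added. A production design would more likely use something like consistent hashing, which is outside the scope of this sketch.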




VTL with Scalable Deduplication



                                             GigE




                     Storage Network




                                                      De-Duplication Processors     Single Instance
                               VTL                                                    Repository

Summary: Alternative Technologies
 to Dedupe Disk Targets
       Don’t Duplicate in the First Place
         • Incremental Forever Backups
         • WAN-enabled backups, perhaps with dedupe on the client

       Throw disk at it
         • Bulk SATA arrays typically cost less than $1K per TB
                – Capacities up to 2PB
                – Densities on the order of 1PB / rack
                – MAID – power management to spin down inactive drives

       Replicate Your SAN or NAS
         • Use an optimized file backup or archive solution to provide file recovery
           and to meet retention requirements



Other Examples of Dedupe
 Technology


 Primary Storage, VDI, Rich Media
 Archiving



Block-Level Dedupe for Primary
 Storage
       Most dedupe solutions are designed specifically for backup
       and archival data.
       A limited number of products can dedupe on live data.
         • One day perhaps, dedupe for primary storage will be a way of life
       Great applications – those with redundant data!
         • Desktop virtualization (VDI)
                – A number of very interesting solutions are coming to market
         • VMDK backup, dedupe, and fail-over on one platform
         • Boot image servers
       Reclamation of empty disk space
         • Blank space deduplicates very nicely!
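
A small sketch of fixed-block dedupe shows why blank space reclaims so well: every all-zero block hashes to the same value and is stored once. The 4 KB block size and the function name are assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size

def dedupe_blocks(volume):
    """Split a volume image into fixed-size blocks and keep one copy of each."""
    store = {}      # hash -> block contents
    block_map = []  # one hash per logical block, in order
    for offset in range(0, len(volume), BLOCK_SIZE):
        block = volume[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        block_map.append(digest)
    return store, block_map

# A mostly empty 1 MiB "disk": two blocks of data, the rest zeros
volume = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + bytes(1024 * 1024 - 2 * BLOCK_SIZE)
store, block_map = dedupe_blocks(volume)
print(len(block_map), "logical blocks ->", len(store), "stored blocks")
```

Here 256 logical blocks collapse to 3 stored blocks: the two data blocks plus a single zero block shared by all the empty space.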



Single-Instance Storage for Virtual
 Desktops
       Storage is a big deal-breaker for many VDI use cases
         • Replaces desktop storage and desktop personnel with SAN storage and
           highly specialized storage managers
       New techniques for VDI storage break the desktop down into
       elements and find commonality across all desktops
       Virtual desktop file systems are “stitched together” from common
       elements:
         • Operating system
         • Applications or sets of applications
         • Variable elements
                – For example: anti-virus signatures
         • Personal elements
                –   Screen savers and background images
                –   Google toolbar
                –   Personal applications
                –   Personal files


Dedupe Across Large Collections of
 Rich Media Files
       Many types of files have content-level commonality across a
       large collection of files.
         • TIFF
         • JPG
         • PNG
         • OpenEXR
         • DICOM
         • MS Office Documents
         • PDFs
       A high level of commonality can be detected and
       de-duplicated, assuming a large enough sample set of data.
         • Capacity optimization (depending on file type) on the order of 2x to 10x and
           beyond.



Dedupe in WAN Accelerators




MS Exchange Branch Office: Example
 of the Need for Dedupe over the WAN


          Chicago                                                                    New York

                                                                                     Atlanta
                                                      WAN

  MS Exchange Server
  Message with attachment sent to all staff.                                         San Fran
  Single instance message storage, but
  the same message crosses the WAN multiple times
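
The problem above is what dedupe across the WAN addresses: once a segment has crossed the link, later transfers send only its hash. The sketch below is a simplified model with hypothetical names. It assumes the two ends keep their segment caches in sync and that SHA-256 is the fingerprint.

```python
import hashlib

class Accelerator:
    """Hypothetical WAN accelerator with a local segment cache."""
    def __init__(self):
        self.cache = {}  # hash -> segment

def send(segments, sender, receiver):
    """Send segments across the WAN; return (bytes on the wire, data rebuilt)."""
    wire_bytes = 0
    rebuilt = []
    for segment in segments:
        digest = hashlib.sha256(segment).digest()
        if digest in sender.cache:
            # Seen before, so the far side already holds it: send only the hash
            wire_bytes += len(digest)
        else:
            wire_bytes += len(digest) + len(segment)
            sender.cache[digest] = segment
            receiver.cache[digest] = segment
        rebuilt.append(receiver.cache[digest])
    return wire_bytes, b"".join(rebuilt)

chicago, new_york = Accelerator(), Accelerator()
attachment = [b"x" * 100_000, b"y" * 100_000]  # one attachment as two segments

first, data1 = send(attachment, chicago, new_york)
second, data2 = send(attachment, chicago, new_york)  # same message, next recipient
print("first transfer:", first, "bytes; repeat transfer:", second, "bytes")
```

The first copy of the attachment pays full price. Every repeat costs only the hashes, while the receiving side still reconstructs the full data from its cache.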

WAN Accelerator with Inline Dedupe



                                                                                             Site B



  Site A

                                   WAN Accelerators / WAFS Gateways
                                                                                                 File Servers or
                                                                                                 NAS Appliance




Questions – If Time Permits






Deduplication and single instance storage

  • 1. Deduplication and Single Instance Storage Practical Applications for Backups, Archiving, and Primary Storage Presented by: Jacob Farmer Cambridge Computer © Copyright 2009-2010, Cambridge Computer Services, Inc. – All Rights Reserved www.CambridgeComputer.com – 781-250-3000
  • 2. About Your Lecturer Jacob Farmer, CTO, Cambridge Computer • Cambridge Computer, founded in 1991, provides training, integration, sales, and consulting in the fields of storage management, data protection, and digital archiving. Been working in data protection and storage management for almost 20 years. • Lecturer on storage technologies for Usenix for the past 10 years. Hybrid of industry analyst and consultant to end-users. • Spend 25% of my time working in the industry, going to conferences, meeting with vendors. • 75% of my time customer-facing, helping the sales and services departments design solutions for end users. Email: jfarmer@CambridgeComputer.com Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 2
  • 3. Follow Me on Twitter My personal activities: •@JacobAFarmer –Note the “A” – my middle initial My educational activities •@Cambridge_EDU Usenix-On-The-Road: The Latest Trends in Storage Networking © Copyright 2009-2010-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 3
  • 4. Agenda / Topics Dedupe basics • What is it, how does it work, and what is all the fuss about? • Hashing, segmenting, indexing, etc. Dedupe for backup systems • Basic benefits • Different approaches for scaling backups and how they relate back to dedupe – Front end bottlenecks – Backup data-movers – Back-end bottlenecks and scalable deduping Dedupe for primary storage • Virtual servers, physical servers, VDI • Rich media dedupe WAN Accelerators Questions as time permits Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 4
  • 5. What is Deduplication? A term that refers to a number of different methods and techniques for reducing multiple instances of identical data down to a single (or at least fewer) instances. • Common data is replaced with pointers or tokens that refer back to the actual data. Other terms for deduplication • Data Reduction • Commonality Factoring • Capacity Optimization • Single Instancing or Single-Instance Storage (SIS) Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 5
  • 6. Is Deduplication a Form of Compression? Yes, and No. YES – Deduplication results in data taking up less storage space or consuming less bandwidth on a network circuit. • Note that dedupe is often used in conjunction with conventional compression. NO – Deduplication could work on data types that are not compressible. • If you have 10 identical JPEG files stored in an uncompressible format, they could be reduced to a single instance, thus freeing up 90% of your capacity. Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 6
  • 7. Where Do You Find Dedupe Solutions? Deduplication solutions come to market whenever costs or efficiencies can be achieved by eliminating redundancy. • Backups – Conventional backup systems generate tons of redundant data • Email systems (at rest and in flight) – I send an email with the same attachment to everyone in the company. – Then everyone stores it in his/her personal home directory – Everyone in the branch offices pulls it over the WAN • File traffic over a WAN • Application and O.S. binaries across multiple systems – Virtual Servers and Virtual Desktops – Backups over a WAN • Very large collections of rich media files Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 7
  • 8. Hashing / Fingerprinting Hashing (aka fingerprints, digests, signatures) • Generates a unique number (160+ bits) based on content • Hash acts as a proxy for content • Given a hash, not computationally feasible to generate content Common Hashing Algorithms • MD5 • SHA-1 • SHA-256 • AES Hash Size and the Birthday Paradox • The size of the hash needs to be suitable to the task at hand Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 8
  • 9. Hash Collisions - Are they real? Fibre Channel Bit Error Rate 10-10 10-20 10-30 Probability Hit by lightning Simultaneous triple disk fault Cryptographic on RAID-6 hash collision Win the Cretaceous extinction meteor lottery hitting in the next second Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 9
  • 10. What Makes Deduplication and SIS Technology Difficult to Engineer? Hash Processing • Modern CPUs make this much easier – 100+ MB/sec/core • Hardware co-processor cards can hash at rates north of 1.5GB/Sec. Disk performance • Deduped data often ends up getting fragmented on disk – This can hurt performance especially for backup systems Alignment of de-dupe segments Indexing Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 10
  • 11. Indexing Can Be Hard # lookups/sec 106 Router Fine grained 105 Purpose built hardware content tracking 104 103 large Software Database database 102 Technology 101 iPod NYC 100 Human Lookup Rates phonebook 101 102 103 104 105 106 107 108 109 # records Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 11
  • 12. Parsing / Segmenting / Chunking Data needs to be “chopped up” in a consistent way in order to get optimal dedupe ratios Without any kind of special segmenting strategies backup streams and complex file types do not dedupe effectively Large files are almost always changed with overstrike semantics • Databases, structured data, .vmdk, .pst files Small files are almost changed with insert semantics • Office apps, editors etc If there are large files (e.g. database tables, virtual machine images) in the backup mix, their treatment usually will dominate any data reduction strategy. • Don’t sweat the small stuff! Different vendors may have strengths with one type or another Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 12
  • 13. Change Types: Insert v. Overstrike Insert: The quick brown fox jumped over the lazy dog. The quick brown horse jumped over the lazy dog. Identical data (may be) misaligned Overstrike: “Fred” added to Joe Sue employee database Joe Fred Sue Identical data doesn’t move Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 13
  • 14. NetBackup OST (open storage option) API and framework that makes it very easy for a dedupe target device vendor to parse the data stream. • Pre segments content • Enables more efficient dedup solutions • Allows for smart copy between systems of only changed data PQZ R PQR Z Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 14
  • 16. Backup Systems Have a Lot of Redundant Data Conventional backup solutions generate a ton of redundant data • Assuming weekly full backups, a file that has not changed in 5 years, still gets backed up 260 times! • Assuming daily full backups of email, a message you received 5 years ago gets backed up 1825 times. – Similarly, a record in a database from 5 years ago might be backed up 1825 times! There are really two problems to solve: • Minimizing the amount of redundant data that gets repeatedly transferred • Minimizing the amount of redundant data that gets stored. Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 16
  • 17. Most of the Buzz on Dedupe is from Backup Target Vendors A “target” is a backup storage device Dedupe disk targets generally come packaged as • NAS – File server (NFS or CIFS) interface • Virtual tape library – A disk device that emulates a tape library – Fibre Channel or iSCSI interface – NAS vs VTL outside the scope of this talk Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 17
  • 18. When Do Dedupe Disk Targets Shine? When you are backing up a lot of redundant data • Files that never or seldom change between backups • Duplicated files • Databases and email repositories that are receptive to commonality factoring When you are retaining backup data for a decent amount of time • Ideally you are keeping several weeks of backups When you seek to replicate a conventional backup system over a WAN. Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 18
  • 19. Example: NYC Law Firm with NetBackup 6+ Terabytes Full backups every day! • Why? Because someone had a bad experience in the past with incremental backups and has trust issues 90 day retention period Most files seldom change • Many files are scanned images that never change Several large databases Several TB of MS Exchange Average result – 102x capacity optimization ! Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 19
  • 20. Try to Visualize 100x Capacity Optimization OR One 3U cabinet v. 7 full racks full of gear! Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 20
  • 21. Backup Vaulting – Another Use Case for Dedupe Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 21
  • 22. Why Replicate the Backup System? Relatively easy DR solution • Does not require additional software for the hosts • Does not require storage devices with replication capabilities • One system that replicate all of your hosts – Platform-independent Eliminate the need to ship tapes off site • Eliminate the need to encrypt tapes Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 22
  • 23. Example: Defense Contractor Replicating ERP System Problem: CIO does not want employee data being sent off site without encrypting the tapes. • IT staff wants to avoid tape encryption. Solution: • Full backup of 800GB+ Oracle database to deduplicating disk target every day. • Retain backups for 60 days on disk. – 60 x 800 = 48TB • Vault backups to remote site over T1 Outcome • Dedupe ration of about 70:1 • 800GB backup job traverses the T1 in a few hours Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 23
  • 24. But, Before We Get all Hot and Bothered . . . Let’s review how backup systems actually work! www.CambridgeComputer.com 24
  • 25. Common Backup Bottlenecks Backup Clients You have to get data off the host and transfer it Network Network Seldom the real bottleneck, except over a WAN Backup Servers I/O processing is the most common bottleneck Storage Devices Storage devices can be a bottleneck, but are seldom the whole problem. Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 25
  • 26. Front-end and Network – Minimize Duplication in the First Place Backups generate a lot of redundant data, so what if we had smarter client software that did not generate redundant data? • Incremental Forever – After the first full backup, only do incremental backups – This is what IBM TSM does, for instance • Synthetic Full Backup – Last weeks full backup is merged with this weeks incremental backups to “synthesize” this week’s full backup. – No need to transfer redundant data Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 26
  • 27. Example: Energy Firm using IBM Tivoli Storage Manager TSM only backs up files that have changed. • It does not generate a lot of duplicate files Most of the 15TB of capacity are documentation and images that do not change – ever. • Relatively little of it is database. • Images don’t compress • Utilizing compression on TSM client for compressible files Over all deduplication ratio: about 2:1 • Can’t justify the cost of dedupe across the board • Resolution: Set up dedupe tier for database and email – Do the file backups to conventional disk and tape Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 27
  • 28. Synthetic Full Backups – An Approach that Creates a Need for Dedupe Synthetic Full Backups • “Poor man’s incremental forever” • Combine subsequent incremental backups with the previous full backup to “synthesize” the next week’s full backup. • Great technique for minimizing networking traffic from backups. Synthetic full backups require that at least two weeks of backups be available on disk. • Dedupe disk targets tend to be a big win for synthetic full backups Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 28
  • 29. Example: Research Firm with 6 Week Retention and Synthetic Fulls 60TB+ • Mix of large file systems, content management systems, email, and database Using Commvault with heavy use of synthetic full backups 6 week retention on disk Dedupe ratios between 8x and 16x • NOTE: Their backup data could not fit in one dedupe box, so they are managing 4 separate dedupe appliances in each of their locations. Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 29
  • 30. Theoretical v. Actual Capacity Your Mileage May Vary YMMV – one customer’s mileage • 48 TB raw disk • 36 TB with RAID-6 • 35 +/ TB for unique capacity • 3-5 TB deliberately left empty for headroom Might hold • As much as 500 TB of backups • Or as little as 50 TB. Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 30
  • 31. Dedupe on the Backup Client Host-side dedupe is a form of sub-file-level incremental • Instead of catching block-level changes, the file system changes are hashed and compared with the back-end storage repository. • Alternative to block-based CDP • Unique data segments are then transferred to the backup service. Host-side deduping is very valuable over the WAN. • Minimizes data that needs to be transferred • Typically it will dedupe across hosts, reducing files that are common to multiple hosts – Such as application and operating system binaries Deduplication and Single Instance Storage – Interop – Las Vegas – April 27, 2010 © Copyright 2009-2010, Cambridge Computer Services, Inc. All rights reserved. www.CambridgeComputer.com 31
  • 32. WAN Backup Software with Dedupe in the Client
      [Diagram labels: Dedupe Client; London; LAN; New York; Shared Client & Local Recovery; Local USB; WAN; Hong Kong; Backup Server(s); Jersey City]
  • 33. Backup System Network I/O Processing Bottlenecks
      Moving Backup Data Through the Network
  • 34. Backup Server I/O Processing is a Major Bottleneck
      In most enterprise backup systems a single backup server would be a major performance bottleneck
        • Unless you were doing incremental forever or sub-file-level backups
        • Add a dedupe process to that and it becomes that much harder
      A common practice for scaling out backup server performance is to add network “data movers”
        • Also known as: storage nodes, media servers, media agents, etc.
  • 35. Interesting Idea – Add Deduplication to the Network Data Movers
      [Diagram labels: Network; Dedicated Storage Network]
  • 36. I/O Processing Bottlenecks: Network Data Movers and “LAN-Free”
      [Diagram labels: Network data movers; “LAN-Free” Clients; Dedicated Storage Network]
  • 37. LAN-Free Backup Clients and NDMP Backups
      In larger enterprise-class backup systems it is common to have larger servers move data directly to storage devices over Fibre Channel.
      The fastest way to back up a large NAS server is to do NDMP dumps over Fibre Channel.
  • 38. “LAN-FREE” – End-run Around the Backup Server
      SAN clients work like slave servers. They back up directly to the storage media, while reporting metadata over the LAN to the backup server.
      Presumably all of these tape drives are part of a tape library.
      [Diagram labels: GigE; Storage Area Network; Tape Robot Arm]
  • 39. Dedupe with LAN-Free Backup Clients
      With LAN-Free backup you get no benefit from dedupe processing residing on the data movers.
        • The dedupe logic needs to sit on the target storage device
      This is where VTLs shine
        • VTL works just like tape
          – Network data movers work fine
          – LAN-Free clients work fine
        • VTLs offer higher throughput than CIFS or NFS
          – Common to see total throughput in excess of 1 GB/sec
        • VTLs might offer tighter integration with tape
      Many VTLs do dedupe as a post-process
  • 40. Back End Bottlenecks
      Can your dedupe appliance keep pace with the backup system?
  • 41. Back-End Bottlenecks: Can the Dedupe Storage Devices Hack It?
      If you open up the flood gates, you might find that a single dedupe box on the LAN cannot hack it.
      Some solutions:
        • Buy lots of individual dedupe devices
        • Maybe use a VTL implementation of dedupe
          – Sorry, out of the scope of this lecture
        • Post-process deduping instead of deduping on-the-fly
          – Less efficient from a capacity standpoint, but should be able to achieve considerably better performance
        • New grid-based architectures that offer parallel processing for deduplication
        • Newer dedupe devices that are up to the task
  • 42. Stand-alone Deduplication Servers
      Single server dedupe solutions are often constrained by:
        • RAM and processing power
        • The size of the index they can manage
        • Disk performance
      When you max out the box, you need to buy another one
        • Very painful incremental upgrade
        • No dedupe across multiple boxes
        • Make sure that you buy a big enough box!
  • 43. Object-Based File System with Grid Architecture and Global Dedupe
      [Diagram, top to bottom]
      Consumers: CIFS/NFS Clients; Backup System Data Movers; Conventional File System Consumers
      Front-End Nodes
        • Export file systems
        • Scale-out performance into GBs/sec
      Back-End Nodes
        • Manage disk, dedupe, and redundancy
        • Scale-deep to petabytes of capacity
  • 44. VTL with Scalable Deduplication
      [Diagram labels: GigE; Storage Network; De-Duplication Processors; Single Instance VTL Repository]
  • 45. Summary: Alternative Technologies to Dedupe Disk Targets
      Don’t Duplicate in the First Place
        • Incremental Forever Backups
        • WAN-enabled backups, perhaps with dedupe on the client
      Throw disk at it
        • Bulk SATA arrays typically cost less than $1K per TB
          – Capacities up to 2 PB
          – Densities on the order of 1 PB / rack
          – MAID – power management to spin down inactive drives
      Replicate Your SAN or NAS
        • Use an optimized file backup or archive solution to provide file recovery and to meet retention requirements
  • 46. Other Examples of Dedupe Technology
      Primary Storage, VDI, Rich Media Archiving
  • 47. Block-Level Dedupe for Primary Storage
      Most dedupe solutions are designed specifically for backup and archival data.
      A limited number of products can dedupe on live data.
        • One day perhaps, dedupe for primary storage will be a way of life
      Great applications – those with redundant data!
        • Desktop virtualization (VDI)
          – A number of very interesting solutions are coming to market
        • VMDK backup, dedupe, and fail-over on one platform
        • Boot image servers
      Reclamation of empty disk space
        • Blank space deduplicates very nicely!
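Why blank space deduplicates so well can be shown with a small block-level sketch. The 4096-byte block size, the dedupe_volume function, and the 100-block test volume are assumptions chosen for the illustration, not taken from any product.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4096  # assumed block size for the illustration


def dedupe_volume(volume: bytes) -> Tuple[Dict[str, bytes], List[str]]:
    """Return the unique blocks and the per-block map of digests for a volume image."""
    unique: Dict[str, bytes] = {}
    block_map: List[str] = []
    for offset in range(0, len(volume), BLOCK_SIZE):
        block = volume[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        unique.setdefault(digest, block)  # store each distinct block once
        block_map.append(digest)
    return unique, block_map


# Hypothetical volume: 10 blocks of distinct data followed by 90 zero-filled blocks.
data_blocks = b"".join(bytes([i]) * BLOCK_SIZE for i in range(1, 11))
volume = data_blocks + bytes(90 * BLOCK_SIZE)

unique, block_map = dedupe_volume(volume)
rebuilt = b"".join(unique[d] for d in block_map)
print(len(block_map), len(unique), rebuilt == volume)
```

All 90 empty blocks hash to the same digest, so they occupy the space of a single stored block while the block map still allows the full volume to be rebuilt.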
  • 48. Single-Instance Storage for Virtual Desktops
      Storage is a big deal-breaker for many VDI use cases
        • Replaces desktop storage and desktop personnel with SAN storage and highly specialized storage managers
      New techniques for VDI storage break the desktop down into elements and find commonality across all desktops
      Virtual desktop file systems are “stitched together” from common elements:
        • Operating system
        • Applications or sets of applications
        • Variable elements
          – For example: anti-virus signatures
        • Personal elements
          – Screen savers and background images
          – Google toolbar
          – Personal applications
          – Personal files
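One way to picture a desktop “stitched together” from shared elements is as a stack of layers searched in order. The sketch below uses Python's collections.ChainMap for that lookup. The layer contents, file paths, and the stitch_desktop helper are hypothetical, and no claim is made that any VDI product works this way internally.

```python
from collections import ChainMap
from typing import Dict

# Hypothetical shared elements, each stored once for all desktops.
operating_system = {"C:/Windows/system32/kernel32.dll": "os-build-1"}
applications = {"C:/Program Files/Office/winword.exe": "office-1"}
antivirus_signatures = {"C:/ProgramData/AV/signatures.dat": "sig-set-1"}


def stitch_desktop(personal: Dict[str, str]) -> ChainMap:
    """Compose one desktop's file system view; earlier layers take precedence."""
    return ChainMap(personal, antivirus_signatures, applications, operating_system)


alice = stitch_desktop({"C:/Users/alice/wallpaper.jpg": "alice-image"})
bob = stitch_desktop({"C:/Users/bob/notes.txt": "bob-notes"})

print(len(alice), len(bob))
print(alice.maps[3] is bob.maps[3])  # both desktops reference the same OS layer
```

Only the personal layer differs per desktop, so adding a desktop adds storage for personal elements alone, while the operating system and application layers stay single-instance.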
  • 49. Dedupe Across Large Collections of Rich Media Files
      Many types of files have content-level commonality across a large collection of files.
        • TIFF
        • JPG
        • PNG
        • OpenEXR
        • DICOM
        • MS Office Documents
        • PDFs
      A high level of commonality can be detected and deduplicated, assuming a large enough sample set of data.
        • Capacity optimization (depending on file type) on the order of 2x to 10x and beyond.
  • 50. Dedupe in WAN Accelerators
  • 51. MS Exchange Branch Office: Example of the Need for Dedupe over the WAN
      [Diagram labels: Chicago; New York; Atlanta; San Fran; WAN; MS Exchange Server]
      Message with attachment sent to all staff.
      Single instance message storage, but the same message crosses the WAN multiple times
  • 52. WAN Accelerator with Inline Dedupe
      [Diagram labels: Site A; Site B; WAN Accelerators / WAFS Gateways; File Servers or NAS Appliance]
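The idea of a pair of accelerators deduping traffic inline can be sketched as two endpoints that each keep a cache of payloads they have already seen. The Accelerator class, the frame format, and the choice to dedupe whole payloads are assumptions made to keep the example short; real devices work on finer-grained segments.

```python
import hashlib
from typing import Dict, List, Tuple

Frame = Tuple[str, bytes]  # ("data", payload) or ("ref", digest)


class Accelerator:
    """Hypothetical WAN accelerator endpoint with a payload cache."""

    def __init__(self) -> None:
        self.cache: Dict[bytes, bytes] = {}

    def encode(self, payload: bytes) -> Frame:
        """Sending side: replace a payload seen before with a short reference."""
        digest = hashlib.sha256(payload).digest()
        if digest in self.cache:
            return ("ref", digest)
        self.cache[digest] = payload
        return ("data", payload)

    def decode(self, frame: Frame) -> bytes:
        """Receiving side: expand references from the local cache."""
        kind, body = frame
        if kind == "ref":
            return self.cache[body]
        self.cache[hashlib.sha256(body).digest()] = body
        return body


site_a, site_b = Accelerator(), Accelerator()
attachment = b"x" * 1_000_000  # the same attachment sent to three recipients at Site B

wan_bytes = 0
delivered: List[bytes] = []
for _ in range(3):
    frame = site_a.encode(attachment)
    wan_bytes += len(frame[1])  # bytes that actually cross the WAN
    delivered.append(site_b.decode(frame))

print(wan_bytes)
```

The attachment crosses the WAN in full once; the second and third sends carry only a 32-byte SHA-256 digest each, yet every recipient still receives the complete payload.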
  • 53. Questions – If Time Permits