Comparing HDFS & ASM
Jason Arneil
Copyright © 2017 Accenture All rights reserved.
"Platform little more than 'skunkworks' outside tech industries"
– John Mertic, Open Data Platform Initiative
Data Growth
Explosive data growth well known
Many Exabytes of data created every day
Disk Size Increases
HDD size has increased
Cost per GB decreased
Disk Read Speed
Sequential read speed has not improved at same rate
Time to read entire 8TB drive is circa 12 hours!
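A rough sanity check, assuming ~200 MB/s sustained sequential throughput: 8 TB ÷ 200 MB/s ≈ 40,000 seconds ≈ 11 hours, so the best part of half a day just to scan a single drive.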
Space Consumption
Lots of Drives = More drive failures
Need to store redundant copies of data
Rebalancing
Failure Rate
At a roughly 5% annual failure rate, 1000 drives means a drive failure about every week
Data Protection
Could use hardware RAID
HDFS & ASM give protection via software
HDFS History
HDFS part of Hadoop
HDFS is 10 years old
Hadoop ecosystem builds on HDFS
ASM History
ASM released in 2003
Commodity
HDFS is designed to run on a large number of commodity nodes
HDFS is a distributed filesystem written in Java
HDFS Goals
Designed for very large files
Designed for sequential access
Non-HDFS Use Cases
Low latency access
Lots of small files
Metadata
File system metadata critical
ASM Instance
ASM Instance manages metadata
ASM Architecture
Node Types
[Diagram: an HDFS client (running in a client JVM) talks to the NameNode and to DataNodes dn-1 through dn-9]
Flex ASM Architecture
NameNode RAM
[Diagram: NameNode RAM holds the hierarchical namespace (/, apps, users, jfa, spark, hive), the block manager mapping blocks to DataNodes (blk_123 on dn-1, dn-2, dn-3; blk_456 on dn-7, dn-8, dn-9), and the list of live DataNodes, each reporting heartbeat, disk used and disk free]
Namespace Durability
[Diagram: the NameNode persists the namespace as an image file, updated by periodic checkpoints, plus an edit log of changes]
Formatting
hdfs namenode -format
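A minimal sketch of the two "create the filesystem" operations; the disk group name and disk paths are illustrative, not from the deck:

  # HDFS: format the namespace (run once on the NameNode host; DataNodes are not involved)
  hdfs namenode -format

  # ASM: creating a disk group implicitly creates the storage, sized by its member disks
  sqlplus / as sysasm <<'EOF'
  CREATE DISKGROUP data NORMAL REDUNDANCY
    DISK '/dev/asm-disk1', '/dev/asm-disk2', '/dev/asm-disk3';
  EOF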
Blocks
HDFS block size 128MB by default
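A hedged illustration of working with the block size; the file and directory names are hypothetical:

  hdfs getconf -confKey dfs.blocksize                       # 134217728 bytes = 128 MB default
  hdfs dfs -D dfs.blocksize=268435456 -put big.dat /data/   # write one file with 256 MB blocks
  hdfs fsck /data/big.dat -files -blocks -locations         # list its blocks and their replica locations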
Client Access
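Two common ways a client reaches the filesystem, sketched with a placeholder NameNode host; the Java API is the primary interface, with the command-line FsShell and the WebHDFS REST API layered on top:

  hdfs dfs -ls /user/jfa                                              # FsShell, built on the Java client API
  curl -s "http://namenode:50070/webhdfs/v1/user/jfa?op=LISTSTATUS"   # WebHDFS REST (default NameNode HTTP port in Hadoop 2)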
DataNode
NameNode Resilience
ASM Resilience
[Diagram: a Flex ASM cluster pool of storage with shared Disk Groups A and B, wide file striping, and databases sharing ASM instances; database instances, ASM instances and ASM disks are spread across Node1 through Node5, with, for example, Node1 running as ASM client to Node2 or Node4, Node2 to Node3, and Node5 to Node4]
NameNode Backup
Preventing Namenode SPOF
[Diagram: the active NameNode writes its edit log and image both locally and to an NFS share; a standby NameNode reads the same NFS copy]
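The property involved, shown via getconf; the directory values are site-specific and purely illustrative (the usual choice is one local directory plus an NFS mount):

  hdfs getconf -confKey dfs.namenode.name.dir   # e.g. file:///data/nn,file:///mnt/nfs/nn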
Secondary NameNode
NameNode HA
Quorum Journal Manager
NameNode Failover & Fencing
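With HA configured, the failover state can be inspected and driven from the command line; the NameNode service IDs nn1 and nn2 are placeholders:

  hdfs haadmin -getServiceState nn1   # reports active or standby
  hdfs haadmin -failover nn1 nn2      # graceful failover, fencing the old active if required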
File Permissions
POSIX like
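The familiar POSIX-style tools carry straight over; the paths, user and group here are made up:

  hdfs dfs -chmod 750 /user/jfa/data
  hdfs dfs -chown jfa:analysts /user/jfa/data
  hdfs dfs -ls /user/jfa              # shows owner, group and mode for each entry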
Replica Placement
[Diagram: block replicas placed across Rack 1, Rack 2 and Rack 3]
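To see how rack awareness is actually configured on a cluster (a sketch; output depends entirely on the site's topology script):

  hdfs dfsadmin -printTopology        # lists each rack and the DataNodes registered under it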
Database I/O
Reading Data
[Diagram: an HDFS client on the client node opens the file through the DistributedFileSystem, gets block locations from the NameNode (image, checkpoint, edit log), and an FSDataInputStream then reads the blocks directly from DataNodes dn-1, dn-2 and dn-3, which report heartbeat, disk used and disk free]
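From the command line the whole read path is hidden behind a single call; the path is illustrative:

  hdfs dfs -cat /data/sales.csv | head    # client asks the NameNode for block locations, then streams from the closest replicas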
Writing Data
[Diagram: the HDFS client writes through the DistributedFileSystem and an FSDataOutputStream; the NameNode allocates blocks (image, checkpoint, journal), and data is pipelined dn-1 to dn-2 to dn-3 with acks flowing back, while DataNodes heartbeat and report disk used and disk free]
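Again one command drives the whole write pipeline; the replication override is only there to make the behaviour visible:

  hdfs dfs -D dfs.replication=3 -put results.csv /data/results.csv
  hdfs fsck /data/results.csv -blocks -locations    # confirm each block landed on three DataNodes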
Rebalancing
ASM Rebalance
alter diskgroup rebalance
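A hedged example; the disk group name and power level are illustrative:

  sqlplus / as sysasm <<'EOF'
  ALTER DISKGROUP data REBALANCE POWER 8;
  SELECT operation, state, est_minutes FROM v$asm_operation;  -- rebalance runs asynchronously; progress shows here
  EOF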
HDFS Balancer
start-balancer.sh
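The balancer can also be invoked directly, with a threshold on how far any DataNode's utilisation may drift from the cluster average; 10 percent is just an example value:

  hdfs balancer -threshold 10    # move blocks until every DataNode is within 10% of cluster-wide utilisation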
dfs.datanode.balance.bandwidthPerSec
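The cap can be set in hdfs-site.xml or adjusted on a running cluster; the 100 MB/s figure is only an example:

  hdfs dfsadmin -setBalancerBandwidth 104857600   # bytes/sec per DataNode allowed for balancer traffic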
Bit Rot
DataBlockScanner
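Both sides expose a handle on this; a sketch with an illustrative disk group name:

  hdfs fsck / -list-corruptfileblocks    # blocks the background scanner has flagged as corrupt
  sqlplus / as sysasm <<'EOF'
  ALTER DISKGROUP data SET ATTRIBUTE 'content.check' = 'TRUE';  -- enable logical content checking during rebalance
  EOF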
CheckSum
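Checksums are created and verified transparently, but they can be inspected; the path is hypothetical:

  hdfs getconf -confKey dfs.bytes-per-checksum   # 512 bytes per checksum by default
  hdfs dfs -checksum /data/results.csv           # prints the file's composite checksum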
What’s Coming
Hadoop 3.0
Erasure Coding
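In Hadoop 3 erasure coding is enabled per directory; the path and the Reed-Solomon 6+3 policy below are illustrative:

  hdfs ec -listPolicies
  hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k   # ~1.5x storage overhead instead of 3x replication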
Intra-datanode balancer
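The new intra-DataNode disk balancer is driven per node; the hostname and plan file name are placeholders:

  hdfs diskbalancer -plan dn-1.example.com               # generate a plan to even out that DataNode's disks
  hdfs diskbalancer -execute dn-1.example.com.plan.json  # apply the generated plan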
More than 2 NameNodes
Conclusion
Questions?
http://hadoop.apache.org/docs/current/
Hadoop: The Definitive Guide
Tom White

Editor's Notes

  • #2: BIG DATA is all the rage. Almost as popular as Cloud. This is where we are dealing with datasets in the hundreds of TBs to Petabytes, and using 100s to 1000s of CPUs in parallel to process this data, aggregating the power of many servers as a single resource. The idea of this presentation is to show how concepts you are familiar with (in ASM) carry over to the world of HDFS. As DBAs & Systems folks you are in prime position to manage this coming wave
  • #3: My name is Jason Arneil. Been in IT for around 18 years, both as an Oracle DBA and a System Administrator. The last 4 ½ years exclusively worked on the Exadata platform. I'm really just dipping my toes in the Big Data world – it is all the rage though! Blogged a bit in the past. You can find me on twitter. Became an Oracle ACE a couple of years ago. Now work in the Accenture Enkitec Group
  • #5: I was quite struck when I saw this quote last month. To me that smells of opportunity
  • #6: Exponential data growth is a well-known phenomenon. Many exabytes are stored every day worldwide. This creates a storage problem
  • #7: This does help us store more data. We now have 8TB as fairly standard enterprise HDDs
  • #8: Speed at very best roughly 200MB/s. So to be able to run analysis on 10s or 100s of TBs of data in a reasonable time frame, you are going to need LOTS of drives – 100s or 1000s of drives. The more concurrency you have the more drives you will need
  • #9: More drives leads to more drive failure So we need a mechanism to protect our data from drive failure Storing redundant copies of data actually leads to even more drives being used And more drive failures
  • #10: Cloud storage company Backblaze have over 50,000 drives in their datacenters. They publish drive reliability stats from this real world situation – in a proper air-conditioned DC. While drive failure varies with age, their average failure rate was going on 5%. Source: https://www.backblaze.com/blog/hard-drive-reliability-q3-2015/
  • #11: We have to have some way of protecting our data. Hardware RAID is an expensive solution – particularly at 100s of TB. Doesn't provide data locality for analysing the data. Transfer of huge quantities of data to servers would be a massive bottleneck. Analysis of huge amounts of data is more efficient if executed near the data it is operating on
  • #12: Hadoop Distributed File System (HDFS™). Hadoop is an open source project from the Apache Software Foundation. Has had a reasonable amount of time to develop, evolve and mature – but filesystems generally have a long (multi-decade) lifespan. Has its roots at Google – though the elephant logo is from a toy owned by the son of a Yahoo engineer, Doug Cutting. Note the Distributed part – a filesystem that manages storage across a range of machines. That is, the storage of those individual machines is presented as an aggregate
  • #13: You can think of various layers in the hadoop world With storage as the base layer Followed by a method of allocating resources and scheduling tasks across the cluster – Yet Another resource Negotiator Then various applications used for data analysis that can take advantage of these Hadoop scales computation, storage and I/O bandwidth
  • #14: ASM has its genesis all the way back in 1996 – the initial problem that led to it was related to video streaming! Took 7 years from initial idea to released product. ASM is a clustered filesystem – not a distributed filesystem. Design goal was to be able to stripe data across 1000s of disks. It would also be fault tolerant
  • #15: HDFS is designed to be portable from one platform to another. It is designed to run on commodity hardware. A key goal is linear scalability both on data size and compute resources: doubling the number of nodes should halve processing time on the same volume of data. Likewise, doubling the data volume and the number of nodes should result in constant processing time. Essentially it uses a divide and conquer approach. You can buy it from Oracle – it runs on the Big Data Appliance
  • #16: “Very large” here means files that are hundreds of megabytes, gigabytes, or terabytes in size. Petabyte-sized clusters are not unheard of. Makes it easy to store large files: optimises sequential reading of data over latency. It's likely on HDFS that analysis will read a large percentage of the entire dataset – very different from typical RDBMS usage. Reading most of the dataset efficiently is more important than the latency of reading the first record. HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed except for appends and truncates. Can append to a file, but cannot update at an arbitrary point
  • #17: HDFS not designed for low latency access Lots of small files does not scale well on HDFS
  • #18: metadata is critical to the operation of a filesystem Essentially you can’t access the files stored without the metadata
  • #19: When using an oracle Database with ASM We have an ASM instance in addition to the database instance This ASM instance has a small portion of the RDBMS code ASM that manages the metadata for the Datafiles Note metadata is stored (and protected) with the data in the diskgroups
  • #20: ASM architecture up to 12c looks like this. Database on every node, ASM instance on every node. All accessing the same underlying drives where the data is. There is only 1 type of node and all nodes are identical
  • #21: When it comes to the world of HDFS we have 2 types of nodes. Namenode: minimum of 1, mostly 2 for redundancy – we'll come on to that. Namenode manages the filesystem namespace. Maintains the filesystem tree and metadata for all files and directories. Regulates access to files by clients. The other type of node we have is the datanode. Many datanodes in a cluster – these are where the data is stored and where the computations and analysis are executed. These are all just standard servers – likely to be spread across multiple racks in the datacenter because you have so many of them. Datanodes are responsible for serving read/write operations from HDFS clients. Datanodes also perform block creation/deletion and replication upon instruction from the namenode
  • #22: But how different is it really from the 12c Flex ASM architecture? Here you no longer have ASM instances running on all nodes, and DB instances can run on nodes that don’t have ASM instances. Think of ASM as the “namenodes” – managing the metadata – and the databases as being clients of the ASM instances. Analogy even better if you think about Exadata, where the storage is on standard servers running Linux and where even some computation normally done at the database is offloaded to the storage servers. Hadoop is extending this idea all the way – all computation done where the storage resides
  • #23: NameNode is critical in HDFS. Metadata is stored persistently on disk on the namenode in 2 files: the namespace image – the namespace is the hierarchy of files and directories – plus the edit log. The metadata is decoupled from the data. Namenode also knows on which datanodes all blocks for a given file reside – remember the same block will exist on multiple datanodes. Block locations are not stored permanently on the namenode – this info can be reconstructed from the periodic block reports datanodes provide. This is stored in memory and with many files this can become the limiting factor for scalability. Though we can federate the namespace – so multiple namenodes each manage a portion of the filesystem – this is NOT HA though. DataNodes send heartbeats every 3 secs. No heartbeat in 10 mins – node presumed dead and the namenode schedules re-replication of lost replicas
  • #24: Durability of namespace maintained by write-ahead journal and checkpoints Journal transactions persisted into edit log before replying to client This records every change that occurs to file system metadata The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. Checkpoints periodically written to image file Block locations discovered from DataNodes via block reports – these are NOT persisted on NameNode This can lead to slow startup times of namenode
  • #25: Creating a diskgroup in ASM implicitly creates the filesystem. Size is not specified and data is spread evenly across all disks. A new HDFS installation needs to be formatted. The formatting process creates an empty filesystem by creating the storage directories and the initial versions of the namenode’s persistent data structures. Datanodes are not involved in the initial formatting process as the namenode manages the filesystem metadata. You don’t need to say how large a filesystem to create as it’s determined by the number of members of the cluster. So filesystem size can be increased with additional cluster members long after creation
  • #26: A disk has a block size, 512 bytes typically or 4K (modern) – the minimum amount of data that can be read/written. A filesystem data block is a multiple of the disk block size, typically a few KB in size. In ASM, files are written as a collection of extents. Extents are multiples of Allocation Units, typically going from 1MB up to 64MB but can be set higher. HDFS as a filesystem has the concept of a block – 128MB by default but it is configurable. Files in HDFS are broken into block-sized chunks stored as independent units. A file smaller than a full block DOES NOT occupy a full block of space. The reason for such a large block size is to minimise seek costs
  • #27: Having a block abstraction enables a file to span multiple disks. Nothing requires all blocks from the same file to be on the same drive. Blocks are fixed size which simplifies metadata management – metadata doesn’t need to be stored with blocks. Easy to calculate how many blocks can fit on a disk. The block concept is also useful when it comes to replication and fault tolerance
  • #28: A Client accesses the filesystem on behalf of a User by communicating with namenode and datanodes The client can present a POSIX like filesystem to user – user code does not need to know about namenode/datanodes to function HDFS interaction mediated through a JAVA API Can interact with filesystem via HTTP REST API – but slower than java Also a C library
  • #29: Datanode is workhorse of HDFS Store and retrieve blocks when told by clients or namenode Report back periodically to namenode with lists of blocks they are storing
  • #30: Filesystem cannot function without NAMENODE If the namenode were destroyed all files on the filesystem would be lost as No way of reconstructing files from blocks on datanodes Vital to ensure resilience of namenode There are a number of different options for ensuring NameNode resilience ASM way ahead in terms of resilience
  • #31: With non-Flex ASM, if we lose an ASM instance we only lose the DBs on that node. Other nodes keep working – a definite advantage of cluster technology. And it’s even better with Flex ASM: if we lose an ASM instance all databases can carry on processing. The ASM instance can also relocate if a node fails
  • #32: What we need to protect is the edit log and the image file. Hadoop can be configured to ensure the namenode writes persistent metadata to multiple filesystems. These are synchronous and atomic writes. Usual choice is to write to local disk and an NFS mount
  • #33: The active name node writes updates both locally and to the NFS share. The standby name node also has access to the NFS share. Even with a secondary namenode it won’t be able to service requests until (1) the namespace image is loaded into memory, (2) the edit log is replayed, and (3) it has received enough block reports from datanodes. On a decent sized cluster this could be 30 mins! – Not really high availability
  • #34: One step up the availability ladder is to run a secondary namenode. Does NOT act as a namenode – does not serve requests. Its job is to merge the namespace image with the edit log. A copy of this merged namenode image can be used if the primary namenode fails. Note this has a data lag so some data can be lost. This is still not high availability. And causes problems for routine maintenance and planned downtime
  • #35: Previous options does not provide high availability of the filesystem HA can be accomplished with a pair of namenodes in Active-Standby configuration Standby can take over from Active node without significant delay Namenodes MUST have highly available shared storage Datanodes MUST send block reports to both nodes – Remember block mappings stored in MEMORY on namenode Clients must be enabled to handle namenode failover If active namenode fails Standby can take over quickly because it has latest state available in memory Both edit log and block mappings Can use NFS filer or a Quorum Journal Manager (QJM) QJM is recommended choice
  • #36: QJM is a dedicated HDFS implementation Solely designed for purpose of providing HA for edit log QJM runs a group of journal nodes Each edit must be written to a majority of these
  • #37: Transition managed by a failover controller Default implementation uses Zookeeper to ensure only 1 namenode active Each namenode runs a heartbeat process Can’t have active-active as we don’t have a cluster filesystem – can’t have multiple nodes writing to same file Previously active namenode can be fenced – can use STONITH
  • #38: HDFS has a permission model for files and directories that is very POSIX like. 3 types of perms: r, w, x. X is ignored for a file (no concept of executing a file) but is needed for directory access. Each file has an owner, group and mode (mode is the perms for owner, group, and others). Note by default Hadoop runs with security disabled
  • #39: Placement of replicas is critical to HDFS reliability and performance. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in a different rack, and the last on a different node in the same rack as the previous. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure – so this doesn’t reduce data availability. It does reduce the aggregate network bandwidth used when writing data since a block is placed in only two unique racks rather than three. As long as you have an even chance of starting with a node in each rack, the data will be evenly distributed across all racks
  • #40: I/O from Database goes direct to Disks does not go via ASM
  • #41: To read a block client requests the list of replica locations from NameNode For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Client caches replica locations Datanode Locations sorted by proximity to client Data read from the dataNodes
  • #42: A client request to create a file does not reach the NameNode immediately HDFS client caches the file data into a temp local file Application writes are transparently redirected to this temp local file Once local file accumulates data worth over one HDFS block size, client contacts NameNode NameNode inserts the file name into the file system hierarchy and allocates a data block for it client flushes the block of data from the local temporary file to the first DataNode in small portions First Datanode sends the portions to the second datanode Second datanode sends to third Data is pipelined from one DataNode to the next. Data nodes tell namenode which blocks they have via block reports
  • #43: HDFS & ASM BOTH work best when blocks of a file are spread evenly across all disks This gives best I/O performance
  • #44: In ASM if new disks are added (or dropped) A rebalance can ensure the data is evenly spread across all the disks
  • #45: The balancer program is a Hadoop daemon that redistributes blocks Moves blocks from overutilized datanodes to underutilized ones Still adheres to block replica placement policies cluster is deemed to be balanced, which means that the utilization of every datanode (ratio of used space on the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space on the cluster to total capacity of the cluster) by no more than a given threshold percentage”
  • #46: Only one balancer operation may run on cluster at one time Balancer designed to run in background Limits bandwidth used to move blocks around
  • #47: Explain Bit rot As organisations store more data the possibility of silent disk corruptions grows Can set the CONTENT.CHECK attribute on a diskgroup to ensure a rebalance will perform this logical content checking “each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode. This is to guard against corruption due to “bit rot” in the physical storage media” “Because HDFS stores replicas of blocks, it can “heal” corrupted blocks by copying one of the good replicas to produce a new, uncorrupt replica”
  • #48: HDFS checksums all data written to it, and by default when reading data A separate checksum is created for every 512 bytes (by default) CRC is 4 bytes long less than 1% storage overhead When clients read data from datanodes checksum verified
  • #49: One thing about Hadoop, in addition to all the whacky names for things, is that the pace of change is phenomenal in comparison to the old school RDBMS world. I wanted to show a couple of snazzy things that are coming with the next HDFS release
  • #50: It’s pretty inefficient space wise having to store 3 copies of the same data Just to guarantee protection for your data Erasure Coding is a way of encoding data so that the original data can be recovered with just a subset of the original It sounds awfully similar to RAID 5/6 but with parity stored with the data not a separate device Should consume way less space than triple mirroring with similar failure rates However this will trade CPU cycles for space gains
  • #51: A single DataNode manages multiple disks. During normal write operation, disks will be filled up evenly. However, adding or replacing disks can lead to significant skew within a DataNode. This situation is not handled by the existing HDFS balancer, which concerns itself with inter-DataNode skew – that is, BETWEEN different data nodes – not intra-DataNode skew, i.e. between disks within a data node!
  • #52: With Hadoop 3 you can increase the availability of your cluster by having an increased number of namenodes
  • #53: It might even be the case that the Big Data world evolves so fast that HDFS begins to be superseded. A new kid on the storage block is KUDU, which takes the best of HDFS sequential performance along with low latency random access
  • #54: You may have heard enough by now As DBAs and Systems Folks HDFS is likely to feature in your organisations And we are likely to be the folks managing that infrastructure So best to be prepared!
  • #55: Put a link and a book recommendation Questions?
  • #56: “DEFLATE is a compression algorithm whose standard implementation is zlib.” Gzip normally used to produce deflate format files Concept of splittable is very important – splittable format allows you to seek to any point in the stream A non splittable file format will have to have all it’s blocks processed by the same process – rather than by distributed processes