Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - Jacob Rapp, Cisco
Cisco and Big Data – Hadoop World 2011
Unified Fabric, optimized for Big Data infrastructures, with seamless integration with current data models. The diagram shows one Cisco Unified Fabric connecting the traditional database (RDBMS), storage (SAN/NAS), the "Big Data" store-and-analyze tier, the "Big Data" real-time capture/read/update tier (NoSQL), and the application tier (virtualized, bare-metal, or cloud), fed by sources such as sensor data, logs, social media, click streams, mobility trends, and event data.
 
The Setup: lab environment overview. 128 nodes of UCS C200 M2, 1RU rack-mount servers (4x2TB drives, dual Xeon 5670 @ 2.93GHz, 96GB RAM), and 16 nodes of UCS C210 M2, 2RU rack-mount servers (16x SFF drives, dual Xeon 5670 @ 2.93GHz, 96GB RAM).
Unified Fabric, L2 and/or L3, for SAN/NAS, RDBMS, UCS, and Big Data, built on an L2/L3 top-of-rack infrastructure. Note: Two topologies were tested to examine the benefits of providing an integrated solution that can support multiple technologies, such as traditional RDBMS, SAN/NAS, virtualization, etc. The diagrams show one topology with Nexus 7000 (N7K) aggregation and Nexus 3000 (N3k) top-of-rack switches, and a second with N7K aggregation, Nexus 5000 (N5k) switches, and Nexus 2000 (N2k) fabric extenders connecting to UCS servers.
Factors that affect Hadoop cluster behavior:
Cluster Size: number of data nodes.
Data Model: the MapReduce functions used.
Input Data Size: total starting dataset.
Characteristics of the Data Node: I/O, CPU, memory, etc.
Data Locality in HDFS: the ability to process data where it is already located.
Background Activity: number of jobs running, type of jobs, importing, exporting.
Networking Characteristics: availability, buffering, 10GE vs. 1GE, oversubscription, latency.
A general characteristic of an optimally configured cluster is the ability to decrease job completion times by scaling out the nodes. Test results are from an ETL-like workload (Yahoo Terasort) using a 1TB data set.
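For reference, the ETL-like workload is the TeraGen/TeraSort pair that ships with the Hadoop examples. A rough sketch of a driver follows, assuming the Hadoop 0.20-era example classes and placeholder HDFS paths; it is an illustration, not the exact harness used in these tests.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.util.ToolRunner;

// Generates a 1TB dataset (10 billion 100-byte rows) and then sorts it,
// mirroring the ETL-like workload described on this slide.
public class TerasortWorkload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // 10,000,000,000 rows x 100 bytes/row = 1TB of input data
    int rc = ToolRunner.run(conf, new TeraGen(),
        new String[] { "10000000000", "/user/hpc/terasort-input" });
    if (rc != 0) System.exit(rc);
    rc = ToolRunner.run(conf, new TeraSort(),
        new String[] { "/user/hpc/terasort-input", "/user/hpc/terasort-output" });
    System.exit(rc);
  }
}

Varying the TeraGen row count is a simple way to produce the different dataset sizes discussed further on.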
The complexity of the functions used in Map and/or Reduce has a large impact on the type of job and its network traffic. Note: Yahoo Terasort has more balanced Map and Reduce functions and intermediate and final data the same size as its input (1TB input, shuffle, and output). Shakespeare WordCount does most of its processing in the map functions, with smaller intermediate data and even smaller final data (1TB input, 10MB shuffle, 1MB output).
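The contrast follows directly from what the map and reduce functions emit. A minimal WordCount in the standard MapReduce Java API, roughly in the spirit of the Shakespeare test (not the exact code used here), shows why: mappers emit only (word, 1) pairs and a combiner collapses them per node, so very little data ever reaches the shuffle, while Terasort must push every record through it.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map does almost all of the work: tokenize the input text and emit
  // (word, 1) pairs. With a combiner, only per-node partial sums are shuffled.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce (also usable as the combiner) just sums counts, so the final
  // output is a small list of distinct words and their totals.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
}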
Network graph of all traffic received on a single node (80-node run). The chart marks when maps start, reducers start, maps finish, and the job completes; each symbol represents a node sending traffic to HPC064, and the red line is the total amount of traffic received by hpc064. Note: Shortly after the reducers start, map tasks begin finishing and their data is shuffled to the reducers. Once the maps completely finish, the network is no longer used, as the reducers already have all the data they need to finish the job.
Network graph of all traffic received on a single node (80-node run), with output data replication enabled. With a replication factor of 3 (1 copy stored locally, 2 stored remotely), each reduce output is now replicated instead of just stored locally. Note: If output replication is enabled, the end of the Terasort must store the additional copies; for a 1TB sort, 2TB of data will be replicated across the network.
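Whether this extra replication wave occurs is controlled by the HDFS replication factor applied to the job output. A minimal sketch, assuming the standard dfs.replication property and FileSystem API (the output path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Keep 3 copies of everything the job writes (1 local + 2 remote);
    // this is what drives the replication traffic at the end of the sort.
    conf.setInt("dfs.replication", 3);

    // Alternatively, raise replication on existing output after the job;
    // the NameNode then copies the extra replicas across the network.
    FileSystem fs = FileSystem.get(conf);
    fs.setReplication(new Path("/user/hpc/terasort-output/part-00000"), (short) 3);
  }
}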
Network graph of all traffic received on a single node (80-node run), again annotated with maps start, reducers start, maps finish, and job complete; the red line is the total traffic received by hpc064, and each symbol represents a node sending traffic to HPC064. Note: Due to the combination of the long map phase and the much smaller data set being shuffled, the network is used throughout the job, but only lightly.
Given the same MapReduce job, the larger the input dataset, the longer the job will take. Note: As dataset sizes increase, completion times may not scale linearly, since many jobs hit the ceiling of I/O and/or compute capacity. Test results are from an ETL-like workload (Yahoo Terasort) using varying data set sizes.
The I/O capacity, CPU, and memory of the data node have a direct impact on the performance of a cluster. Note: A 2RU server with 16 disks gives the node more storage but trades off CPU per RU; a 1RU server, on the other hand, gives more CPU per rack.
Data locality – the ability to process data where it is locally stored. Note: During the map phase, the JobTracker attempts to schedule map tasks on the data nodes where the data is locally stored. This is not perfect, since it depends on which data nodes hold the data, and it is a consideration when choosing the replication factor: more replicas create a higher probability of data locality. Map tasks show an initial spike of traffic for non-local data, because a task may sometimes be scheduled on a node that does not have the data available locally.
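Locality is visible directly from HDFS: each block reports which hosts hold a replica, and that is the information the JobTracker consults when placing map tasks. A small sketch using the standard FileSystem API (the file path is a placeholder):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus stat = fs.getFileStatus(new Path("/user/hpc/terasort-input/part-00000"));
    // Each block lists the data nodes holding a replica; more replicas mean
    // more candidate nodes on which a map task can run against local data.
    for (BlockLocation block : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
      System.out.printf("offset %d length %d hosts %s%n",
          block.getOffset(), block.getLength(), Arrays.toString(block.getHosts()));
    }
  }
}

A map task scheduled on one of the listed hosts reads its block from local disk; otherwise the block crosses the network, which is the initial traffic spike noted above.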
Hadoop clusters are generally multi-use, and background use can affect any single job's completion time. Note: A given cluster is generally running many different types of jobs, importing into HDFS, etc. Example view of 24-hour cluster use: a large ETL job overlaps with medium and small ETL jobs, many small BI jobs, and data being imported into HDFS (blue lines are ETL jobs, purple lines are BI jobs).
The relative impact of various network characteristics on Hadoop cluster job completion times.
The failure of a networking device can affect multiple data nodes of a Hadoop cluster, with a range of effects. Note: Tasks on the affected nodes must be rescheduled, and maintenance activities such as data rebalancing are triggered, increasing load on the cluster. It is important to evaluate the overall availability of the system: Hadoop was designed with failure in mind, so any single node failure is not a huge issue, but a network failure can span many nodes, causing rebalancing and reducing overall resources. Redundant paths and load-sharing schemes help, and general redundancy mechanisms can also increase bandwidth. Ease of management and a consistent operating system matter as well, since human error is a main source of outages; ease of management and consistency are general best practices.
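One standard way to contain the blast radius of a switch or fabric-extender failure, which the deck does not go into, is to make HDFS replica placement aware of the network topology so that all replicas of a block never sit behind the same failure domain. As a hedged sketch, the classic topology-script property (renamed net.topology.script.file.name in later releases) would normally live in core-site.xml; it is shown here on a Configuration object, with a placeholder script path.

import org.apache.hadoop.conf.Configuration;

public class RackAwareConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Point Hadoop at a script that maps each data node address to a rack
    // path such as "/pod1/rack07"; the NameNode then avoids placing all
    // replicas of a block behind one top-of-rack switch or fabric extender.
    // Property name is the Hadoop 0.20/1.x one; the script path is a placeholder.
    conf.set("topology.script.file.name", "/etc/hadoop/conf/rack-topology.sh");
    System.out.println("topology script: " + conf.get("topology.script.file.name"));
  }
}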
Several HDFS operations and phases of MapReduce jobs are very bursty in nature. Note: The extent of the bursts largely depends on the type of job (ETL vs. BI). Bursty phases include replication of data (either importing into HDFS or output replication) and the output of the mappers during the shuffle phase. A network that cannot handle bursts effectively will drop packets, so optimal buffering is needed in network devices to absorb them. On optimal buffering: given a large enough incast, TCP will collapse at some point no matter how large the buffer; this has been well studied by multiple universities, and alternate solutions that change TCP behavior (e.g. DCTCP) have been proposed rather than huge-buffer switches (http://simula.stanford.edu/sedcl/files/dctcp-final.pdf).
Buffer is used during the shuffle phase and during output replication; buffer utilization is highest during these two phases. Optimized buffer sizes are required to avoid the packet loss that leads to slower job completion times. Note: The aggregation switch buffer remained flat, as the bursts were absorbed at the top-of-rack layer.
The same pattern holds here: buffer is used during the shuffle phase and during output replication, utilization is highest during those phases, and optimized buffer sizes are required to avoid the packet loss that leads to slower job completion times. Note: Fabric Extender buffer utilization was roughly equivalent to that of the N3k, though the Fabric Extender has 32MB of buffer versus the N3k's 9MB.
In the multi-use cluster described previously, multiple job types (ETL, BI, etc.) and importing of data into HDFS can be happening at the same time. Note: Usage may vary depending on job scheduling options.
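How much these jobs overlap depends on how the scheduler is configured. As a hedged illustration only (the pool property shown is the classic pre-YARN Fair Scheduler one and is an assumption, not something documented in these tests; Capacity Scheduler deployments would use mapred.job.queue.name instead), a job can be steered into a dedicated pool when it is submitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PooledSubmit {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed classic Fair Scheduler property (Hadoop 0.20/1.x): place this
    // job in an "etl" pool so large ETL runs and small BI queries get
    // predictable shares of the cluster's map/reduce slots.
    conf.set("mapred.fairscheduler.pool", "etl");
    Job job = new Job(conf, "nightly-etl"); // illustrative job name
    // ... configure mapper, reducer, input and output paths as usual ...
    // job.waitForCompletion(true);
  }
}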
In the largest workloads, multiple terabytes can be transmitted across the network. Note: Data taken from the multi-use workload (multi-ETL + multi-BI + HDFS import).
Generally, 1GE is used, largely due to cost/performance trade-offs, though 10GE can provide benefits depending on the workload. Note: Multiple 1GE links can be bonded together to increase bandwidth.
Moving from 1GE to 10GE actually lowers the buffer requirement at the switching layer. Note: With 10GE, each data node has a larger pipe on which to receive data, lessening the need for buffering in the network, since the total aggregate speed and amount of data do not increase substantially. This is due, in part, to the limits of the nodes' I/O and compute capabilities.
Network latency, while consistency of latency is important, is generally not a significant factor for Hadoop clusters. Note: There is a difference between network latency and application latency; optimization in the application stack can decrease application latency, which can potentially have a significant benefit.
For more information: www.cisco.com/go/bigdata
 

