3. 为什么选择 Hadoop ? Need to process huge datasets on large clusters of computers Very expensive to build reliability into each application. Nodes fail every day f ailure is expected, rather than exceptional. The number of nodes in a cluster is not constant. Need common infrastructure Efficient, reliable, easy to use Open Source, Apache License
15. HDFS :数据本地化 Data Data data data data data Data data data data data Data data data data data Data data data data data Data data data data data Data data data data data Data data data data data Data data data data data Data data data data data Data data data data data Data data data data data Data data data data data Results Data data data data Data data data data Data data data data Data data data data Data data data data Data data data data Data data data data Data data data data Data data data data Hadoop Cluster Block 1 Block 1 Block 2 Block 2 Block 2 Block 1 MAP MAP MAP Reduce Block 3 Block 3 Block 3
#7:按照当前各公司公布的数据来看,百度日处理规模居全球主要互联网公司第 2 名,仅次于 Google 的每日 30PB 左右的输入数据处理量。
#15:– Chooses new DataNodes for new replicas – Balances disk usage – Balances communication traffic to DataNodes
#21:Block (Object) Storage Subsystem Shared storage provided as pools of blocks Namespaces (HDFS, others) use one or more block-pools Note: HDFS has 2 layers today – we are generalizing/extending it.