SlideShare a Scribd company logo
Today we measure available data in zettabytes

IN 2011, THE AMOUNT
OF DATA SURPASSED

1.8

90% OF THE DATA IN THE WORLD TODAY

has been created in the last two years
alone

ZETTABYTES
COMBINED GDP OF:

1.8

ZETTABYTES

=

57.5
BILLION
32 GB iPads

**IDC Digital Universe Study Extracting Value from Chaos
© 2013 SAP AG. All rights reserved.

=

$34.4

•
•
•
•

=

TRILLION

US
• France
Japan
• UK
China
• Italy
Germany

1

Confidential

1
Where is this data?
Types and Volumes of Data …

Traditional content types,
Including unstructured data,

…have grown dramatically

are growing by up to 80% per year

CRM Systems
M2M data
Transactions

Sales
Order

Mobile

ERP Systems

Instant Messages

Transactions

Planning

Email
Things

Sales Order

Things
Demand

Legacy EDW
© 2013 SAP AG. All rights reserved. 2013 SAP AG. All rights reserved.
©

Planning

Legacy ERP
Structured data grew by

Inventory

more than 40% per year

Mobile

Customer
2

2
What can’t we see?
WHAT CRITICAL “NEW SIGNALS”

MIGHT WE BE MISSING?
Is it in our ERP Systems?
Our M2M data?
Social?

© 2013 SAP AG. All rights reserved.

Confidential

3
Big Data - Definition
“Big Data” refers to the problems of capturing,
storing, managing, and analyzing massive
amounts of various types of data
Big Data Challenge: turn raw data into insights that drive
business value and manage in a cost effective manner;

Most commonly this refers to terabytes or petabytes of data, stored
in multiple formats, from different internal and external sources, with
strict demands for speed and complexity of analysis

© 2013 SAP AG. All rights reserved.

4
The SAP you need to know
System of Engagement
“Newer SAP”

SAP Cloud
Maintenance & Operations
24/7, SLA’s, DR & HA, Elasticity

mobile

System of Record

Business
Suite
(ERP)

Business
Analytics

“Foundational SAP”

Data Logistics/Quality ETL
In Memory Database Platform
In Memory / Columnar/ MPP/ Federation

© 2013 SAP AG. All rights reserved.

Confidential

5
In Memory Database Platform

Digging Deeper

In Memory / Columnar/ MPP/ Federation

SAP Business Suite

Text

Core
PLM

OLAP

SRM

OLTP

SCM

ERP

Apps

CRM

Custom

Predictive

BI

HANA

SAP
BW

HTTP

Native

Apps

Geospatial

Models

Engines
Logical

memory

HOT

disk

WARM

cached

Bulk/Streaming/Real-time

User Interface
& Applications

COLD

Physical Table(s)
Virtual Tables
Ingest Engines

Federation

Data Logistics

(Data Services , SLT, CEP)
COLD

100101
011010
100101
© 2013 SAP AG. All rights reserved.

Other
DB

Other
ERP

Other Data …
Confidential

6
Open Hadoop Strategy

© 2013 SAP AG. All rights reserved.

Confidential

7
Accelerated BI with SAP BusinessObjects and SAP HANA
One unified and complete BI Suite addressing the full spectrum of BI on SAP HANA

Discovery and Analysis

Dashboards and Apps

Reporting

Discover. Predict. Create.

Build Engaging Experiences

Share Information

 Discover areas to optimize your
business

 Deliver engaging information to
users where they need it

 Securely distribute information
across your organization

 Adapt data to business needs

 Track key performance indicators
and summary data

 Give users the ability to ask and
answer their own questions

 Tell your story with beautiful
visualizations

 Build custom experiences so users
get what they need quickly

 Build printable reports for
operational efficiency

© 2013 SAP AG. All rights reserved.

Confidential

8
Data Logistics
SAP Business
Suite

Trigger
Based, Real
Time

SAP LT
Replication
Server

SAP
BusinessObjects
tools

DB
Connection

SQL

ETL, Batch

SAP BW

Other query
tools

BICS

SQL

MDX

HANA Studio
ODBC
SAP BOBJ
Data Services

Log Based
Non SAP
Data Sources

SAP In-Memory Database
ECDA/ODBC

Sybase Replication
Server

In Memory
Models
Column Store

Event Streams
M2M

SAP Event Stream
Processor *

ODBC

SAP HANA

Data Sources
© 2013 SAP AG. All rights reserved.

* SAP HANA Roadmap
** SAP ERP & BW Extractors
Confidential

9
SAP Big Data Apps

•

Customer Engagement
Intelligence

•

Predictive Analytics RDS

See overview https://community.wdf.sap.corp/docs/DOC-222087
© 2013 SAP AG. All rights reserved.

Confidential

10
Delivering On Your Business Imperatives
Data Science Services
Forecasting Sales and Demand
 Forecast demand and managing
inventory levels in perishable CPG
 Model variant cannibalization and
impact on manufacturer forecasts
 Utility load demand forecasting

Check and Compliance
 Deliver faster response time and
higher throughput of compliance
checks to enable competitive
advantage
 Tackling public fraud waste and abuse
by analyzing records for tax discovery

Optimization

Performance and Insights

 Optimize transport and logistics recover from unforeseen disruptions

 Maximizing guest / customer
experience

 Optimize depth and timing of retail
markdowns to boost sales

 Assess the impact of promotions, and
improve profitability

 Grow deposits not excessive interest
costs

 Directional insight on growing
revenues and basket sizes

Contact “DL BigDataSalesSupport” for more information about SAP Data Science Services
© 2013 SAP AG. All rights reserved.

Confidential

11
HANA + Hadoop
What is Hadoop
 Open source project inspired by Google/Yahoo
 Used at Yahoo, Facebook, eBay, LinkedIn, startups, Fortune 500
enterprises to store and process Petabytes of data on thousands of servers
 Hadoop components
– Cluster of commodity servers
– Distributed storage layer (Hadoop Distributed File System, or HDFS)
– Distributed processing infrastructure (MapReduce programming model)
Cluster of Commodity Servers

Hadoop




NameNode

10s to 1000s DataNode(s)

© 2013 SAP AG. All rights reserved.

Hadoop Software Architecture

Hadoop
Computation Engines
Hive
HBase
Mahout
Pig

Sqoop

…

Map-Reduce

Data storage (Hadoop
Distributed File system)
Confidential

13
Apache Hadoop
Software framework for distributed data processing

 Hadoop Distributed File System
(HDFS) – reliable data storage on
commodity hardware

HDFS
Name Node

(stores metadata)
Data Node

Data Node

 HIVE -- data warehousing solution on
top of Hadoop with direct access to
HDFS and Hbase

(stores actual
data in blocks)

replication

(stores actual
data in blocks)

client

 MapReduce – programing model for
parallel data processing and query
execution

© 2013 SAP AG. All rights reserved.

HDFS

Input

MapReduce

process

HDFS

output

Confidential

14
Why Hadoop?
Pros
 Free software

 Cheap hardware - commodity servers
 Scalable to thousands of nodes and petabytes of data
 Highly fault-tolerant storage and processing
 Flexible – write Java MapReduce programs to do any kind of processing; any
data- no fixed schema needed
 Open source libraries & tools
Cons
 Specialized skillset to administer and develop – Hadoop is not free!
 Require more development (programming MapReduce & other NoSQL tools)
than relational technologies (SQL, stored procedure)
 HIVE/PIG/Impala not as performant nor as mature as relational tech
 Batch-oriented jobs, not real-time
 Less mature in enterprise readiness – security, ETL, management, monitoring,
etc
© 2013 SAP AG. All rights reserved.

Confidential

15
SAP HANA + Hadoop Provides Real-Time on BIG DATA
Combine INSTANT Results with INFINITE Storage

HADOOP

8

SAP HANA

1.0sec

Infinite storage

Instant Results
•

Modern in-memory
platform

•

Distributed disk
platform

•

Transact/analyze in
real-time

•

Store infinite amounts
of unstructured data

•

Native predictive, text,
and spatial algorithms

•

No-SQL access

© 2013 SAP AG. All rights reserved.

Confidential

16

More Related Content

DOC
PrashantDixit_SrAnalyst_4.5yrs_Exp
PDF
Talent Opportunities - September 2021
PDF
Ridhi handa resume_1
DOCX
Raju Goke - Experience
PDF
~a.myCV (2-13) [sql]
PDF
Lithesh Anargha Resume Final 1.0
ODT
David Kuang(word)
DOC
Samarendra - CV
PrashantDixit_SrAnalyst_4.5yrs_Exp
Talent Opportunities - September 2021
Ridhi handa resume_1
Raju Goke - Experience
~a.myCV (2-13) [sql]
Lithesh Anargha Resume Final 1.0
David Kuang(word)
Samarendra - CV

What's hot (20)

DOC
Hadoop Big Data Resume
DOC
Amit Porwal_resume-Latest
DOC
Wael Abdeen Resume
PDF
Hopper services
PDF
PDF
SAP PI online training course content
DOCX
Leonard CV 2016 june
DOC
Soundarya Reddy Resume
PDF
SAP Staffing Practice
PDF
Yumasoft An Outsourcing Software Development Services
PDF
TAG17 - O'Zapft is - Daten zapfen leicht gemacht?
PDF
Oracle Staffing Practice
PDF
Supriya Pandeti Resume
DOC
Rick TongHyun An_ PM
DOC
Rajendra kori it_project lead_9_cv
PDF
Sharique Khan Resume
PPTX
Ojas it services
DOC
Resume_Feb_2016
DOC
Resume-NuwanAmarasighe - NP
DOCX
Alex Bagdonas Resume 20160313
Hadoop Big Data Resume
Amit Porwal_resume-Latest
Wael Abdeen Resume
Hopper services
SAP PI online training course content
Leonard CV 2016 june
Soundarya Reddy Resume
SAP Staffing Practice
Yumasoft An Outsourcing Software Development Services
TAG17 - O'Zapft is - Daten zapfen leicht gemacht?
Oracle Staffing Practice
Supriya Pandeti Resume
Rick TongHyun An_ PM
Rajendra kori it_project lead_9_cv
Sharique Khan Resume
Ojas it services
Resume_Feb_2016
Resume-NuwanAmarasighe - NP
Alex Bagdonas Resume 20160313
Ad

Similar to Big data tim (20)

PDF
GITEX Big Data Conference 2014 – SAP Presentation
PDF
Exclusive Verizon Employee Webinar: Getting More From Your CDR Data
PDF
Business intelligence in the era of big data
PPTX
In-Memory Database Platform for Big Data
PDF
2015 02 12 talend hortonworks webinar challenges to hadoop adoption
PPTX
Harnessing Big Data in Real-Time
PPTX
The SAP Startup Focus Program – Tackling Big Data With the Power of Small by ...
PDF
PPTX
Expand a Data warehouse with Hadoop and Big Data
PDF
Solving Big Data Problems using Hortonworks
PDF
IoT Crash Course Hadoop Summit SJ
PDF
Big Data Tools: A Deep Dive into Essential Tools
PDF
IMCSummit 2015 - Day 1 IT Business Track - In-memory computing with SAP HANA:...
PDF
Apache Hadoop and its role in Big Data architecture - Himanshu Bari
PDF
TDWI Roundtable: The HANA EDW
PDF
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
PPTX
Hadoop Reporting and Analysis - Jaspersoft
PDF
What's New with SAP BusinessObjects Business Intelligence 4.1?
PPTX
Big Data Performance and Capacity Management
PDF
SAP IQ 16 Product Annoucement
GITEX Big Data Conference 2014 – SAP Presentation
Exclusive Verizon Employee Webinar: Getting More From Your CDR Data
Business intelligence in the era of big data
In-Memory Database Platform for Big Data
2015 02 12 talend hortonworks webinar challenges to hadoop adoption
Harnessing Big Data in Real-Time
The SAP Startup Focus Program – Tackling Big Data With the Power of Small by ...
Expand a Data warehouse with Hadoop and Big Data
Solving Big Data Problems using Hortonworks
IoT Crash Course Hadoop Summit SJ
Big Data Tools: A Deep Dive into Essential Tools
IMCSummit 2015 - Day 1 IT Business Track - In-memory computing with SAP HANA:...
Apache Hadoop and its role in Big Data architecture - Himanshu Bari
TDWI Roundtable: The HANA EDW
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Hadoop Reporting and Analysis - Jaspersoft
What's New with SAP BusinessObjects Business Intelligence 4.1?
Big Data Performance and Capacity Management
SAP IQ 16 Product Annoucement
Ad

Recently uploaded (20)

PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Electronic commerce courselecture one. Pdf
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PPTX
Cloud computing and distributed systems.
PPTX
A Presentation on Artificial Intelligence
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Encapsulation theory and applications.pdf
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Machine learning based COVID-19 study performance prediction
“AI and Expert System Decision Support & Business Intelligence Systems”
Per capita expenditure prediction using model stacking based on satellite ima...
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Encapsulation_ Review paper, used for researhc scholars
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
Spectral efficient network and resource selection model in 5G networks
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Electronic commerce courselecture one. Pdf
Chapter 3 Spatial Domain Image Processing.pdf
NewMind AI Weekly Chronicles - August'25 Week I
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Cloud computing and distributed systems.
A Presentation on Artificial Intelligence
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Encapsulation theory and applications.pdf
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Machine learning based COVID-19 study performance prediction

Big data tim

  • 1. Today we measure available data in zettabytes IN 2011, THE AMOUNT OF DATA SURPASSED 1.8 90% OF THE DATA IN THE WORLD TODAY has been created in the last two years alone ZETTABYTES COMBINED GDP OF: 1.8 ZETTABYTES = 57.5 BILLION 32 GB iPads **IDC Digital Universe Study Extracting Value from Chaos © 2013 SAP AG. All rights reserved. = $34.4 • • • • = TRILLION US • France Japan • UK China • Italy Germany 1 Confidential 1
  • 2. Where is this data? Types and Volumes of Data … Traditional content types, Including unstructured data, …have grown dramatically are growing by up to 80% per year CRM Systems M2M data Transactions Sales Order Mobile ERP Systems Instant Messages Transactions Planning Email Things Sales Order Things Demand Legacy EDW © 2013 SAP AG. All rights reserved. 2013 SAP AG. All rights reserved. © Planning Legacy ERP Structured data grew by Inventory more than 40% per year Mobile Customer 2 2
  • 3. What can’t we see? WHAT CRITICAL “NEW SIGNALS” MIGHT WE BE MISSING? Is it in our ERP Systems? Our M2M data? Social? © 2013 SAP AG. All rights reserved. Confidential 3
  • 4. Big Data - Definition “Big Data” refers to the problems of capturing, storing, managing, and analyzing massive amounts of various types of data Big Data Challenge: turn raw data into insights that drive business value and manage in a cost effective manner; Most commonly this refers to terabytes or petabytes of data, stored in multiple formats, from different internal and external sources, with strict demands for speed and complexity of analysis © 2013 SAP AG. All rights reserved. 4
  • 5. The SAP you need to know System of Engagement “Newer SAP” SAP Cloud Maintenance & Operations 24/7, SLA’s, DR & HA, Elasticity mobile System of Record Business Suite (ERP) Business Analytics “Foundational SAP” Data Logistics/Quality ETL In Memory Database Platform In Memory / Columnar/ MPP/ Federation © 2013 SAP AG. All rights reserved. Confidential 5
  • 6. In Memory Database Platform Digging Deeper In Memory / Columnar/ MPP/ Federation SAP Business Suite Text Core PLM OLAP SRM OLTP SCM ERP Apps CRM Custom Predictive BI HANA SAP BW HTTP Native Apps Geospatial Models Engines Logical memory HOT disk WARM cached Bulk/Streaming/Real-time User Interface & Applications COLD Physical Table(s) Virtual Tables Ingest Engines Federation Data Logistics (Data Services , SLT, CEP) COLD 100101 011010 100101 © 2013 SAP AG. All rights reserved. Other DB Other ERP Other Data … Confidential 6
  • 7. Open Hadoop Strategy © 2013 SAP AG. All rights reserved. Confidential 7
  • 8. Accelerated BI with SAP BusinessObjects and SAP HANA One unified and complete BI Suite addressing the full spectrum of BI on SAP HANA Discovery and Analysis Dashboards and Apps Reporting Discover. Predict. Create. Build Engaging Experiences Share Information  Discover areas to optimize your business  Deliver engaging information to users where they need it  Securely distribute information across your organization  Adapt data to business needs  Track key performance indicators and summary data  Give users the ability to ask and answer their own questions  Tell your story with beautiful visualizations  Build custom experiences so users get what they need quickly  Build printable reports for operational efficiency © 2013 SAP AG. All rights reserved. Confidential 8
  • 9. Data Logistics SAP Business Suite Trigger Based, Real Time SAP LT Replication Server SAP BusinessObjects tools DB Connection SQL ETL, Batch SAP BW Other query tools BICS SQL MDX HANA Studio ODBC SAP BOBJ Data Services Log Based Non SAP Data Sources SAP In-Memory Database ECDA/ODBC Sybase Replication Server In Memory Models Column Store Event Streams M2M SAP Event Stream Processor * ODBC SAP HANA Data Sources © 2013 SAP AG. All rights reserved. * SAP HANA Roadmap ** SAP ERP & BW Extractors Confidential 9
  • 10. SAP Big Data Apps • Customer Engagement Intelligence • Predictive Analytics RDS See overview https://community.wdf.sap.corp/docs/DOC-222087 © 2013 SAP AG. All rights reserved. Confidential 10
  • 11. Delivering On Your Business Imperatives Data Science Services Forecasting Sales and Demand  Forecast demand and managing inventory levels in perishable CPG  Model variant cannibalization and impact on manufacturer forecasts  Utility load demand forecasting Check and Compliance  Deliver faster response time and higher throughput of compliance checks to enable competitive advantage  Tackling public fraud waste and abuse by analyzing records for tax discovery Optimization Performance and Insights  Optimize transport and logistics recover from unforeseen disruptions  Maximizing guest / customer experience  Optimize depth and timing of retail markdowns to boost sales  Assess the impact of promotions, and improve profitability  Grow deposits not excessive interest costs  Directional insight on growing revenues and basket sizes Contact “DL BigDataSalesSupport” for more information about SAP Data Science Services © 2013 SAP AG. All rights reserved. Confidential 11
  • 13. What is Hadoop  Open source project inspired by Google/Yahoo  Used at Yahoo, Facebook, eBay, LinkedIn, startups, Fortune 500 enterprises to store and process Petabytes of data on thousands of servers  Hadoop components – Cluster of commodity servers – Distributed storage layer (Hadoop Distributed File System, or HDFS) – Distributed processing infrastructure (MapReduce programming model) Cluster of Commodity Servers Hadoop    NameNode 10s to 1000s DataNode(s) © 2013 SAP AG. All rights reserved. Hadoop Software Architecture Hadoop Computation Engines Hive HBase Mahout Pig Sqoop … Map-Reduce Data storage (Hadoop Distributed File system) Confidential 13
  • 14. Apache Hadoop Software framework for distributed data processing  Hadoop Distributed File System (HDFS) – reliable data storage on commodity hardware HDFS Name Node (stores metadata) Data Node Data Node  HIVE -- data warehousing solution on top of Hadoop with direct access to HDFS and Hbase (stores actual data in blocks) replication (stores actual data in blocks) client  MapReduce – programing model for parallel data processing and query execution © 2013 SAP AG. All rights reserved. HDFS Input MapReduce process HDFS output Confidential 14
  • 15. Why Hadoop? Pros  Free software  Cheap hardware - commodity servers  Scalable to thousands of nodes and petabytes of data  Highly fault-tolerant storage and processing  Flexible – write Java MapReduce programs to do any kind of processing; any data- no fixed schema needed  Open source libraries & tools Cons  Specialized skillset to administer and develop – Hadoop is not free!  Require more development (programming MapReduce & other NoSQL tools) than relational technologies (SQL, stored procedure)  HIVE/PIG/Impala not as performant nor as mature as relational tech  Batch-oriented jobs, not real-time  Less mature in enterprise readiness – security, ETL, management, monitoring, etc © 2013 SAP AG. All rights reserved. Confidential 15
  • 16. SAP HANA + Hadoop Provides Real-Time on BIG DATA Combine INSTANT Results with INFINITE Storage HADOOP 8 SAP HANA 1.0sec Infinite storage Instant Results • Modern in-memory platform • Distributed disk platform • Transact/analyze in real-time • Store infinite amounts of unstructured data • Native predictive, text, and spatial algorithms • No-SQL access © 2013 SAP AG. All rights reserved. Confidential 16

Editor's Notes

  • #5: Big Data has existed for a while and is far more than just massive volumes of data from novel sources. As customers note it is about how you manage, analyze and use and combine the data. SAP makes Big Data Real.what big data is not –e.g. SOH, BWoH.What would make them big data –e.g. sales forecasting on soh/bwoh; mashing up corporate data with social media data or with call center notes.
  • #6: What I say: HANA is tremendously important to SAP’s vision. We are no longer the “ERP company”, as you may think. Following the acquisitions of Business Objects, Sybase and recently SuccessFactors, SAP now leads several important enterprise business categories. Furthermore, by investing in in-house innovation, we have now assembled a vertically integrated business data management stack, all the way from data management appliances to applications to on-demand application services, providing increased customer value. And HANA is at the heart of this strategy!
  • #8: SAP big data platform: Open StrategyWe are planning on SAP formally reselling a HANA+ Hadoop bundle on SAP’s price list. vendors HortonWorks, Intel, Mappr, cloudera, Hadoop distribution are in every account and we work with them allStrategic announcement coming
  • #11: SAP offers the applications and analytic tools that help you to infuse Big Data insights directly into your business processes, and equipping your employees, partners and customers with access to data in order to uncover and monetize insights
  • #13: Big Data is a big opportunity for most companies and is something that they must embrace to competitive. It has existed for a while and is far more than just massive volumes of data from novel sources. As customers note it is about how you manage, analyze and use and combine the data. SAP makes Big Data Real.
  • #15: Hadoop runs on the Hadoop Distributed File System (HDFS), a distributed file system that scales out on commodity servers. Since Hadoop is file-based, developers don’t need to create a data model to store or process data, which makes Hadoop ideal for managing semi-structured Web data, which comes in many shapes and sizes. Because it is “schema-less,” Hadoop can be used to store and process any kind of data, including structured transactional data and unstructured audio and video data. However, the biggest advantage of Hadoop is that is open source, which means that the up-front costs of implementing a system to process large volumes of data are lower than for commercial systems. However, Hadoop does require companies to purchase and manage dozens, if not hundreds, of servers and train developers and administrators to use this new technology.Apache Hadoop enables applications to work with thousands of independent computers (nodes) which are collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). The main components used in Hadoop to run a job include: Client: submits the MapReduce job.Jobtracker: coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.Task trackers: Run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.Hadoop distributed file system (HDSF)HDFS is file system that sits on top of native file systemDifferent blocks of a file stored in different nodesName node keeps tracks of which blocks are make up a file and where those blocks are locatedAuto rebalances, auto replicationsUniform namespaceMapReduceHadoopMapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured).Map: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-leve tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.Reduce: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.MapReduce allows for distributed processing of the map and reduction operations. Since each mapping operation is independent of the others, all maps can be performed in parallel – though in practice it is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase - provided all outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data is still available.
  • #17: SAP HANAIn-memory platformStore billions of recordsAnalyze in real-timeBuilt-in predictive, text, and spatial algorithmsHADOOPDistributed disk platformStore infinite amounts of unstructured data Search in batchNon-relational data storeSpecialized skills to implement and codeMany add-on libraries and packages