Big Data Practice
A Planning guide to set up Big Things..
Knowing the Big Data Fundamentals
 Big Data describes a holistic information management strategy that includes
and integrates many new types of data and data management alongside
traditional data.
 Big Data has also been defined by the four “V”s: Volume, Velocity, Variety,
and Value.
 The massive proliferation of data and lower-cost computing models have
encouraged broader adoption.
 Big Data has popularized two foundational storage and processing
technologies: Apache Hadoop and the NoSQL database.
The Big Questions on Big Data
 The good news is that everyone has questions about Big Data! Both business
and IT are taking risks and experimenting, and there is a healthy bias by all to
learn.
 Big Data is an enterprise asset, and it needs to be managed, from business
alignment to governance, as an integrated element of your current
information management architecture.
 As you move from proof of concept to running at scale, you will run into the
same issues as other information management challenges: skill-set
requirements, governance, performance, scalability, management,
integration, security, and access.
THE BIG DATA QUESTIONS

Business Context
 Business Context: How will we make use of the data?
» Sell new products and services » Personalize customer experiences
 Business Usage: Which business processes can benefit?
» Operational ERP/CRM systems » BI and reporting systems » Predictive analytics, modeling, data mining
 Data Ownership: Do we need to own (and archive) the data?
» Proprietary » Require historical data » Ensure lineage » Governance

Architecture Vision
 Data Storage: What storage technologies are best for our data reservoir?
» HDFS (Hadoop plus others) » File system » Data warehouse » RDBMS » NoSQL database
 Data Processing: What strategy is practical for my application?
» Leave it at the point of capture » Add minor transformations » ETL data to analytical platform » Export data to desktops
 Performance: How do we maximize the speed of ad hoc queries, data transformations, and analytical modeling?
» Analyze and transform data in real time » Use parallel processing » Increase hardware and memory
Continued…

Current State
 Open Source Experience: What experience do we have in open-source Apache projects (Hadoop, NoSQL, etc.)?
» Scattered experiments » Proofs of concept » Production experience » Contributor
 Analytics Skills: To what extent do we employ data scientists and analysts familiar with advanced and predictive analytics tools and techniques?
» Yes » No

Future State
 Best Practices: What are the best resources to guide decisions in building my future state?
» Reference architecture » Development patterns » Operational processes » Governance structures and policies » Conferences and communities of interest » Vendor best practices
 Data Types: How much transformation is required for raw unstructured data in the data reservoir?
» None » Derive a fundamental understanding with schema or key-value pairs » Enrich data
 Data Sources: How frequently do sources or content structure change?
» Frequently » Unpredictably » Never
 Data Quality: When should we apply transformations?
» In the network » In the reservoir » In the data warehouse » By the user at point of use » At run time
What’s Different about Big Data?
 Big Data introduces new technology, processes, and skills to your information architecture
and to the people who design, operate, and use them.
 The “V”s define the attributes of Big Data.
 Big Data approaches data structure and analytics differently than traditional information
architectures do.
 “Schema on write”:
 In the traditional data warehouse approach, data undergoes standardized ETL processes
and is eventually mapped into predefined schemas.
 Amending a predefined schema is a lengthy process.
 “Schema on read”:
 Data can be captured without requiring a defined structure; the structure is derived either
from the data itself or through other algorithmic processing.
 This is made practical by new low-cost, in-memory, parallel-processing hardware/software
architectures such as HDFS/Hadoop and Spark.
 It eliminates the high cost of moving data through an ETL process and brings analytical
capabilities to the data in place (see the sketch below).
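To make the contrast concrete, here is a minimal PySpark sketch; the file paths and field names are illustrative assumptions, not part of the original deck.

```python
# A minimal sketch contrasting schema on write and schema on read in PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Schema on write: the structure is fixed up front; data that does not fit
# the declared schema must be transformed (ETL) before it lands.
fixed_schema = StructType([
    StructField("customer_id", IntegerType()),   # hypothetical fields
    StructField("event", StringType()),
])
curated_df = spark.read.schema(fixed_schema).json("events/curated/")

# Schema on read: raw JSON is captured as-is; Spark derives the structure
# from the data itself at the time it is read.
raw_df = spark.read.json("events/raw/")   # schema inferred on read
raw_df.printSchema()
```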
Reference architecture sample
What are the main tools/technologies
that you employ in Big Data?
 Current big data engagements in the industry are centered on Hadoop.
 For the wide variety, types, and volumes of data, the tools in use are:
 Large unstructured data is managed by technologies like Hadoop.
 Large structured data is managed by technologies such as Teradata and Netezza.
 Real-time data is managed by the likes of SAP HANA and NoSQL databases such as MongoDB and Cassandra.
 Ultimately, the aim is to let business and IT users work with the variety of
technologies that best suits their data analytics requirements.
Big Data Marketplace
Big Data Technology Types &
Specialised Vendors
Apache Hadoop* Stack
Apache Hadoop*: a community development effort

Key development subprojects
 Apache Hadoop* Distributed File System (HDFS): The primary storage system; it keeps multiple replicas of data blocks, distributes them on nodes throughout a cluster, and provides high-throughput access to application data.
 Apache Hadoop MapReduce: A programming model and software framework for applications that perform distributed processing of large datasets on compute clusters (see the word-count sketch after this table).
 Apache Hadoop Common: Utilities that support the Hadoop framework, including file system, remote procedure call (RPC), and serialization libraries.

Other related Hadoop projects
 Apache Avro*: A data serialization system.
 Apache Cassandra*: A scalable multi-master database with no single point of failure.
 Apache Chukwa*: A data collection system for monitoring large distributed systems, built on HDFS and MapReduce; includes a toolkit for displaying, monitoring, and analyzing results.
 Apache HBase*: A scalable, distributed database that supports structured data storage for large tables, used for random, real-time read/write access to big data.
 Apache Hive*: A data warehouse infrastructure that provides data summarization, ad hoc querying, and analysis of large datasets in Hadoop-compatible file systems.
 Apache Mahout*: A scalable machine learning and data mining library with implementations of a wide range of algorithms, including clustering, classification, collaborative filtering, and frequent-pattern mining.
 Apache Pig*: A high-level data-flow language and execution framework for expressing parallel data analytics.
 Apache ZooKeeper*: A high-performance, centralized service that maintains configuration information and naming, and provides distributed synchronization and group services for distributed applications.
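The MapReduce model is easiest to see in code. Below is a minimal word-count sketch in Python for Hadoop Streaming; the script names and sample invocation are illustrative assumptions, not part of the original deck.

```python
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming sorts mapper output by key, so counts for
# the same word arrive consecutively and can be summed with a running total.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, _, value = line.rstrip("\n").partition("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

Such a job would typically be submitted with the hadoop-streaming JAR, passing these scripts via -mapper and -reducer; the framework handles distribution, sorting, and fault tolerance.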
Hardware requirements
 Here are the recommended specifications for DataNodes/TaskTrackers in a
balanced Hadoop cluster (see the sizing sketch below):
 12-24 hard disks of 1-4 TB each in a JBOD (Just a Bunch Of Disks) configuration
 2 quad-, hex-, or octo-core CPUs, running at least 2-2.5 GHz
 64-512 GB of RAM
 Bonded Gigabit Ethernet or 10 Gigabit Ethernet (the higher the storage density,
the more network throughput is needed)
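As a rough guide to what these specifications yield, here is a back-of-the-envelope sizing helper. The replication factor reflects the HDFS default of 3x block replication; the overhead reserve is an illustrative assumption, not vendor guidance.

```python
# Estimate usable HDFS capacity for a cluster built from the specs above.
def usable_hdfs_tb(nodes, disks_per_node, tb_per_disk,
                   replication=3, overhead=0.25):
    """Usable HDFS capacity in TB, reserving `overhead` for temp/OS space."""
    raw = nodes * disks_per_node * tb_per_disk
    return raw * (1 - overhead) / replication

# e.g. 10 DataNodes at the low end of the recommended range:
print(usable_hdfs_tb(nodes=10, disks_per_node=12, tb_per_disk=2))  # 60.0 TB
```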
Start a Big Data Practice
 Identify, Integrate, Collaborate & Solve (IICS):
1. Establish the potential value and objectives.
2. Secure commitment from senior management.
3. Launch a dedicated Center of Excellence (CoE) for Big Data.
 Key aspects of the Big Data CoE function:
 1. Big Data team: set up teams with the right mix of technical and domain experts.
 2. Big Data lab: evaluate tools and frameworks, and build solutions.
 3. Business domains: apply big data across various business verticals.
 4. POCs: prepare use cases, models, and pilot projects as testimonials.
 5. Knowledge banks: blogs, community learning, artifacts, white papers.
 The lesson to learn is that you will go further, faster if you leverage prior
investments and training.
Big Data Architecture Patterns with
Use Cases
 Use Case #1: Retail Web Log Analysis. In this first example, a leading retailer reported disappointing results
from its web channels during the Christmas season and is looking to improve the customer experience on its online
shopping site. Analysts at the retailer will investigate website navigation patterns, especially abandoned shopping
carts. The architecture challenge is to implement a solution quickly, using mostly existing tools, skills, and
infrastructure, in order to minimize cost and deliver value to the business fast. The staff includes very few skilled
Hadoop programmers but has deep SQL expertise. Loading all of the data into the existing Oracle data warehouse,
so that the SQL programmers could access it there, was rejected: the data movement would be extensive, and the
processing power and storage required would not make economic sense. The second option was to load the data
into Hadoop and access it directly in HDFS using SQL. The conceptual architecture, as shown in Figure 19, provides
direct access to the Hadoop Distributed File System by simply associating it with an Oracle Database external table.
Once connected, Oracle Big Data SQL enables traditional SQL tools to explore the data set. A sketch of the
exploratory analysis itself follows below.
 The key benefits of this architecture include: » Low-cost Hadoop storage » Ability to leverage existing investments
and skills in Oracle SQL and BI tools » No client-side software installation » Leverage of Oracle data warehouse
security » No data movement into the relational database » Fast ingestion and integration of structured and
unstructured data sets
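For illustration, here is a minimal sketch of the web-log exploration in PySpark, reading the files directly from HDFS. The path and column names (session_id, event_type) are hypothetical; in the deck's actual architecture, the same HDFS files would instead be exposed to SQL tools through an Oracle external table.

```python
# A sketch of the retailer's abandoned-cart analysis over raw clickstream
# logs stored in HDFS. All names here are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("weblog-carts").getOrCreate()

logs = spark.read.json("hdfs:///weblogs/2015/12/")  # raw clickstream

# Per session, flag whether anything was added to the cart and whether
# a checkout happened.
sessions = logs.groupBy("session_id").agg(
    F.max((F.col("event_type") == "add_to_cart").cast("int")).alias("carted"),
    F.max((F.col("event_type") == "checkout").cast("int")).alias("purchased"),
)

# Abandoned carts: items added but no checkout in the same session.
abandoned = sessions.filter((F.col("carted") == 1) & (F.col("purchased") == 0))
print(abandoned.count())
```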
References
 Step-by-step Tableau big data visualization projects:
http://blog.albatrosa.com/visualisation/step-by-step-guide-to-building-your-tableau-visualisation-big-data-project/
 Google search
 http://www.oracle.com/technetwork/
 www.intel.in/content/dam/..
Finally…
Thank you
Editor's Notes

  • #8: Courtesy: Reference Google - www.directbi.com
  • #15: Big Data Team: while setting up a team for Big Data, the skills to look for are, for data processing, (1) open-source frameworks such as Apache Hadoop, HBase, Hive, Pig, Scala, Spark, and Python, and (2) big data vendor stacks such as Cloudera, MongoDB, and Cassandra; for data analytics, the team needs expertise in advanced analytical models, statistical methods, machine learning, and data science. Big Data Lab: set up clusters on hardware with sufficiently larger RAM than usual for big data processing. Big Data Proofs of Concept (Case Studies): develop proof-of-concept (POC) projects and use-case models that demonstrate your capabilities, to showcase to potential customers as testimonials.