INTRODUCTION TO BIG DATA
(UNIT 1)
Dr. P. Rambabu, M. Tech., Ph.D., F.I.E.
15-July-2024
Big Data Analytics and Applications
UNIT-I
Introduction to Big Data: Defining Big Data, Big Data Types, Analytics, examples, Technologies, The
evolution of Big Data Architecture.
Basics of Hadoop: Hadoop Architecture, Main Components of Hadoop Framework, Analyzing Big Data
using Hadoop, Hadoop clustering.
UNIT-II:
MapReduce: Analyzing the data with Unix Tools & Hadoop, Hadoop streaming, Hadoop Pipes.
Hadoop Distributed File System: Design of HDFS, Concepts, Basic File system Operations,
Interfaces, Data Flow.
Hadoop I/O: Data Integrity, Compression, Serialization, File-Based Data Structures.
UNIT-III:
Developing a MapReduce Application: Unit Tests with MRUnit, Running Locally on Test Data.
How MapReduce Works: Anatomy of a MapReduce Job Run, Classic MapReduce, YARN, Failures
in Classic MapReduce and YARN, Job Scheduling, Shuffle and Sort, Task Execution.
MapReduce Types and Formats: MapReduce types, Input Formats, Output Formats.
UNIT-IV:
NoSQL Data Management: Types of NoSQL, Query Model for Big Data, Benefits of NoSQL, MongoDB.
HBase: Data Model and Implementations, HBase Clients, HBase Examples, Praxis.
Hive: Comparison with Traditional Databases, HiveQL, Tables, Querying Data, User Defined Functions.
Sqoop: Sqoop Connectors, Text and Binary File Formats, Imports, Working with Imported Data.
Flume: Apache Flume, Data Sources for Flume, Components of Flume Architecture.
UNIT-V:
Pig: Grunt, Comparison with Databases, Pig Latin, User Defined Functions, Data Processing Operators.
Spark: Installation Steps, Distributed Datasets, Shared Variables, Anatomy of a Spark Job Run.
Scala: Environment Setup, Basic Syntax, Data Types, Functions, Pattern Matching.
Defining Big Data
Big Data refers to extremely large datasets that are difficult to manage, process, and analyze using
traditional data processing tools. The primary characteristics of Big Data are often described by the "3 Vs":
1. Volume: The amount of data generated is vast and continuously growing.
2. Velocity: The speed at which new data is generated and needs to be processed.
3. Variety: The different types of data (structured, semi-structured, and unstructured).
Additional characteristics sometimes included are:
4. Veracity: The quality and accuracy of the data.
5. Value: The potential insights and benefits derived from analyzing the data.
Big Data Types
Big Data can be categorized into three main types:
1. Structured Data: Organized in a fixed schema, usually in tabular form.
Examples include relational database tables and spreadsheets.
2. Semi-structured Data: Does not conform to a rigid schema but contains
tags or markers that separate data elements. Examples include JSON and XML files.
3. Unstructured Data: Has no predefined format or structure. Examples include
text documents, images, videos, and social media posts.
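Example (illustrative sketch): The three data types above can be contrasted with Python's standard library. The sample values and variable names are made up for illustration and are not part of the course material.

```python
import csv
import io
import json

# Structured: fixed schema, every row has the same columns.
csv_text = "id,name,amount\n1,Asha,250\n2,Ravi,400\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total_amount = sum(int(row["amount"]) for row in rows)

# Semi-structured: keys act as markers, but records may differ in shape.
json_text = '[{"id": 1, "tags": ["new"]}, {"id": 2, "email": "r@example.com"}]'
records = json.loads(json_text)
keys_per_record = [sorted(record.keys()) for record in records]

# Unstructured: free text with no schema; any structure must be derived.
post = "Loved the new phone, battery life is great!"
word_count = len(post.split())

print(total_amount)     # 650
print(keys_per_record)  # [['id', 'tags'], ['email', 'id']]
print(word_count)       # 8
```

Note that the structured rows can be summed directly by column name, the JSON records must be inspected one by one because their keys differ, and the free text yields only what the program chooses to extract from it.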
Big Data Technologies
To manage and analyze Big Data, several technologies and tools are used, including:
1. Hadoop: An open-source framework that allows for the distributed processing of large
datasets across clusters of computers.
2. HDFS (Hadoop Distributed File System): A scalable, fault-tolerant storage system.
3. MapReduce: A programming model for processing large datasets with a distributed
algorithm.
4. Spark: An open-source unified analytics engine for large-scale data processing, known for
its speed and ease of use.
5. NoSQL Databases: Designed to handle large volumes of varied data. Examples include
MongoDB, Cassandra, HBase.
6. Kafka: A distributed streaming platform used for building real-time data pipelines and
streaming applications.
7. Hive: A data warehousing tool built on top of Hadoop for querying and analyzing large
datasets with SQL-like queries.
8. Pig: A high-level platform, scripted in Pig Latin, for creating MapReduce programs that run on Hadoop.
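Example (illustrative sketch): The MapReduce programming model listed above can be imitated on a single machine in plain Python. This is not Hadoop's API; the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical and only mirror the three stages of a word-count job.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Run the map step on every line, then group and reduce per key.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

On a real cluster the map and reduce calls would run in parallel on different nodes and the shuffle would move data across the network; here all three stages run in one process.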
Examples of Big Data
Big Data is used in various industries and applications:
Healthcare: Analyzing patient data to improve treatment outcomes, predict epidemics, and
reduce costs.
Finance: Detecting fraud, managing risk, and personalizing customer services.
Retail: Optimizing supply chain management, enhancing customer experience, and
improving inventory management.
Telecommunications: Managing network traffic, improving customer service, and preventing
churn.
Social Media: Analyzing user behavior, gauging sentiment, and targeting advertising.
The Evolution of Big Data Architecture
The architecture of Big Data systems has evolved to handle the growing complexity and
demands of data processing. Key stages include:
Batch Processing: Initial systems focused on batch processing large volumes of data using
tools like Hadoop and MapReduce. Data is processed in large chunks at scheduled intervals.
Real-time Processing: The need for real-time data analysis led to the development of
technologies like Apache Storm and Apache Spark Streaming. These systems process data in
real-time or near real-time.
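Example (illustrative sketch): The difference between the two processing styles can be shown with a simple average. The batch function needs the complete dataset before it can answer, while the streaming class updates its answer with every new event. The names `batch_average` and `RunningAverage` are hypothetical, and the sketch does not use Hadoop, Storm, or Spark.

```python
def batch_average(readings):
    """Batch: wait for the full dataset, then compute in one pass."""
    return sum(readings) / len(readings)


class RunningAverage:
    """Streaming: keep small state and update the result per event."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count


readings = [10, 20, 30, 40]

batch_result = batch_average(readings)

stream = RunningAverage()
stream_results = [stream.update(r) for r in readings]

print(batch_result)    # 25.0
print(stream_results)  # [10.0, 15.0, 20.0, 25.0]
```

Both approaches agree once all data has arrived; the streaming version simply makes an up-to-date answer available after every event.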
Lambda Architecture: A hybrid approach combining batch and real-time processing to
provide comprehensive data analysis. The Lambda architecture consists of:
- Batch Layer: Stores all historical data and periodically processes it using batch processing.
- Speed Layer: Processes real-time data streams to provide immediate results.
- Serving Layer: Merges results from the batch and speed layers to deliver a unified view.
Kappa Architecture: Simplifies the Lambda architecture by using a single processing pipeline
for both batch and real-time data, typically leveraging stream processing systems.
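Example (illustrative sketch): The layers described above can be modelled with page-view counts. The event format and the functions `batch_layer`, `speed_layer`, and `serving_layer` are assumptions made for this sketch, not part of any framework.

```python
from collections import Counter


def batch_layer(master_dataset):
    """Recompute a complete view from all historical events."""
    return Counter(event["page"] for event in master_dataset)


def speed_layer(recent_events):
    """Count only the events that arrived after the last batch run."""
    return Counter(event["page"] for event in recent_events)


def serving_layer(batch_view, realtime_view):
    """Merge both views into one answer per key."""
    return batch_view + realtime_view


historical = [{"page": "home"}, {"page": "home"}, {"page": "cart"}]
recent = [{"page": "home"}, {"page": "checkout"}]

# Lambda: two code paths, merged at query time.
merged = serving_layer(batch_layer(historical), speed_layer(recent))
print(merged["home"], merged["cart"], merged["checkout"])  # 3 1 1

# Kappa-style alternative: a single streaming code path over the whole log.
kappa_view = speed_layer(historical + recent)
print(kappa_view == merged)  # True
```

The Kappa-style lines show the simplification: because one pipeline handles both old and new events, there is no second code path to keep consistent with the first.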
Evolution of Big Data and its ecosystem:
Big Data and its ecosystem have changed significantly over the years.
A brief overview:
Early 2000s:
- Big Data emerges as a term to describe large, complex datasets.
- MapReduce (2004) and Hadoop (2005) are developed to process large datasets.
2005-2010:
- Hadoop becomes the foundation for Big Data processing.
- NoSQL databases like Cassandra (2008), MongoDB (2009), and Couchbase (2010) emerge.
- Data warehousing and business intelligence tools adapt to Big Data.
2010-2015:
- The Hadoop ecosystem matures as Pig, Hive, and HBase become top-level Apache projects (2010).
- Spark (2010) and Flink (2014) emerge as in-memory processing engines.
- Data science and machine learning gain prominence.
2015-2020:
- Cloud-based Big Data services like AWS EMR (2009), Azure HDInsight (2013), and Google Cloud
Dataproc (2015) become popular.
- Containers and orchestration tools like Docker (2013) and Kubernetes (2014) simplify deployment.
- Streaming data processing with Kafka (2011), Storm (2011), and Flink gains traction.
2020-present:
- AI and machine learning continue to drive Big Data innovation.
- Cloud-native architectures and serverless computing gain popularity.
- Data governance, security, and ethics become increasingly important.
- Emerging trends include edge computing, IoT, and Explainable AI (XAI).
The Big Data ecosystem has expanded to include:
1. Data ingestion tools, e.g., Flume (2011), NiFi (2014)
2. Data processing frameworks, e.g., Hadoop (2005), Spark (2010), Flink (2014)
3. NoSQL databases, e.g., HBase (2008), Cassandra (2008), MongoDB (2009),
Couchbase (2011)
4. Data warehousing and BI tools, e.g., Hive (2008), Impala (2012), Tableau (2003),
Presto (2013), Spark SQL (2014), Power BI (2015)
5. Streaming data processing, e.g., Kafka (2011), Storm (2011), Spark
Streaming (2013), Flink (2014)
6. Machine learning and AI frameworks, e.g., Scikit-learn (2010), TensorFlow (2015),
PyTorch (2016)
7. Cloud-based Big Data services, e.g., AWS EMR (2009), Azure HDInsight (2013),
Google Cloud Dataproc (2015)
8. Containers and orchestration tools e.g., Docker (2013), Kubernetes (2014)
Dr. Rambabu Palaka
Professor
School of Engineering
Malla Reddy University, Hyderabad
Mobile: +91-9652665840
Email: drrambabu@mallareddyuniversity.ac.in
