Introduction to Big Data Analytics
About Me:
• Sangamesh Kalyan
• From Kalaburagi, Karnataka
• Graduated from PDA College of Eng. Kalaburagi - 2016
• Senior Data Engineer at Synchronoss Technologies
• Delivered analytics on 2 major applications for 12+ customers
Agenda:
• Importance of Analytics
• Big Data Analytics
• Big Data Pipeline
• ELK Stack
• Demo
Analytics:
• Systematic computational analysis of data or statistics
• Uses data and math to answer business questions, discover relationships, predict unknown outcomes and automate decisions
• Uncovers information such as hidden patterns, correlations, market trends and customer preferences
Types of Analytics:
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?
[Chart: the four types plotted against two axes, Difficulty and Value]
Importance of Analytics:
• New revenue opportunities
• Improve Customer Experience
• Product Development and Innovation
• Better Decision Making
• Risk Management
• More Effective Marketing
• Early Detection of Problems
• Performance Analysis
What is data?
• Information
• Graphs
• Measurements
• Observations
• Facts
• Numbers
• Quantities
Types of data
• Structured: formatted and transformed into a well-defined data model. Example: databases
• Semi-structured: has some consistent and definite characteristics. Examples: XML, email, JSON
• Unstructured: absolute raw form, with complex arrangement and formatting. Examples: analog data, GPS, video/audio streams
Who generates data?
Data Sources:
• Web
• Cloud
• Internet of things
• Social Media
• Log Files
• Smart Phones
What is big data?
• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications
• Big data is data whose scale, diversity and complexity require new architecture, techniques, algorithms and analytics to manage it and extract the value and hidden knowledge from it
• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization
• Provides opportunities to find new insights from existing data and also guidelines to capture and analyze future data
Five Vs of big data
• Volume: huge amount of data
• Velocity: high speed of accumulation of data
• Variety: different formats of data from various sources
• Veracity: inconsistency and uncertainty in data
• Value: extract useful data
Big Data Analytics Pipeline
• Data Sources: Logs, IoT, RDBMS, Cloud, Web
• Data Storage
• Data Processing
• Data Visualization
Data Storage: HDFS
Features of HDFS:
• Fault Tolerance
• High Availability
• Reliability
• Economic
• Replication
• Scalability
• Flexibility
• Distributed Storage
• Easy to Use
• Data Locality
HDFS Block Replication
[Diagram: a file split into HDFS blocks 1 to 5; each block is stored three times across Node 1 to Node 5]
Block size = 64 MB
Replication Factor = 3
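The block size and replication factor above determine how a file is laid out. The following Python sketch is an illustration only: it splits a hypothetical 300 MB file into 64 MB blocks and assigns each block to three of five nodes in a simple round-robin. Real HDFS uses its own placement policy; the function names, the file size and the round-robin rule are assumptions made for this example.

```python
BLOCK_SIZE_MB = 64        # block size from the slide
REPLICATION_FACTOR = 3    # replication factor from the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block; only the last block may be partial."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin.

    Toy rule for illustration, not the HDFS placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(1, num_blocks + 1):
        start = (block_id - 1) % len(nodes)
        placement[block_id] = [
            nodes[(start + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


nodes = ["Node 1", "Node 2", "Node 3", "Node 4", "Node 5"]
block_sizes = split_into_blocks(300)          # hypothetical 300 MB file
placement = place_replicas(len(block_sizes), nodes)

for block_id, holders in placement.items():
    print(f"Block {block_id} ({block_sizes[block_id - 1]} MB): {holders}")
```

With five nodes and a replication factor of 3, this toy placement puts three block copies on every node, so losing any single node still leaves at least two copies of each block.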
HDFS High Availability
[Diagram: a Client sends a Read File request. The Name Node holds the metadata (file name, no. of replicas, block id, …). Blocks 1 to 5 are spread across Data Node - 1 to Data Node - 8, and one Data Node gives No Response.]
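The HDFS read path above can be sketched in a few lines. This is a simplified Python illustration, not the HDFS client API: the metadata dictionary, the `read_block` function and the set of unresponsive nodes are hypothetical names. The client looks up which Data Nodes hold a block and moves on to the next replica when a node gives no response.

```python
# Hypothetical Name Node metadata: block id -> Data Nodes holding a replica.
BLOCK_LOCATIONS = {
    1: ["Data Node - 1", "Data Node - 4", "Data Node - 6"],
    2: ["Data Node - 2", "Data Node - 3", "Data Node - 7"],
}


def read_block(block_id, locations, unresponsive):
    """Return the first Data Node that can serve `block_id`.

    `unresponsive` is the set of nodes that give no response.
    Raises RuntimeError when every replica is unavailable.
    """
    for node in locations[block_id]:
        if node in unresponsive:
            continue                      # no response: try the next replica
        return node
    raise RuntimeError(f"no replica of block {block_id} is reachable")


served_by = read_block(1, BLOCK_LOCATIONS, unresponsive={"Data Node - 1"})
print(f"Block 1 served by {served_by}")
```

Because every block has more than one replica, a single unresponsive Data Node does not stop the read.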
Data Processing: Apache Spark
Features of Spark:
• Speed
• Real Time Processing
• Lazy Evaluation
• Powerful Caching
• Reusability
• Dynamic in Nature
• In Memory Computation
• Fault Tolerance
Parallel data processing with Apache Spark
// Creates a range of numbers
val r = (1 to 1000).toArray
// Divides the range into 5 partitions
val numbers = sc.parallelize(r, 5)
// Shows the number of partitions the RDD contains
numbers.getNumPartitions
// The driver serializes the map and filter code and sends it to all computers
numbers.map(num => (num, num > 1 && (2 to num / 2).forall(num % _ != 0))) // map: flags each number as prime or not (1 is not prime)
  .filter(_._2)      // filter: keeps only the numbers flagged as prime
  .map(_._1)         // drops the flag again
  .collect()         // the Spark driver collects all prime numbers from each computer
  .foreach(println)  // executes only on the driver computer and displays all prime numbers
Contd..
[Diagram: the Spark Driver distributes the numbers to 5 computers, one partition each: 1 - 200, 201 - 400, 401 - 600, 601 - 800, 801 - 1000]
Each partition/computer executes the map function in parallel on its set of numbers.
Contd..
[Diagram: the Spark Driver and the same five partitions, 1 - 200 to 801 - 1000]
The Spark Driver serializes the map code and sends it to all partitions or computers for execution.
Each partition/computer sends its set of prime numbers back to the driver program. For example, Computer 1 sends back all prime numbers in the range 1 - 200, and Computer 2 sends back all prime numbers in the range 201 - 400.
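The partition, map and collect steps above can be imitated without a cluster. The sketch below is plain Python, not Spark: worker threads stand in for the five computers, and the helper names (`make_partitions`, `primes_in_partition`) are made up for this illustration. It uses the same trial division up to num/2 as the slide.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(num):
    """Trial division up to num // 2; 1 is not prime."""
    if num < 2:
        return False
    return all(num % divisor != 0 for divisor in range(2, num // 2 + 1))


def make_partitions(numbers, num_partitions):
    """Split a list into `num_partitions` contiguous, near-equal slices."""
    size, extra = divmod(len(numbers), num_partitions)
    partitions, start = [], 0
    for index in range(num_partitions):
        end = start + size + (1 if index < extra else 0)
        partitions.append(numbers[start:end])
        start = end
    return partitions


def primes_in_partition(partition):
    """The work each 'computer' does on its own slice."""
    return [num for num in partition if is_prime(num)]


numbers = list(range(1, 1001))
partitions = make_partitions(numbers, 5)      # 1-200, 201-400, ..., 801-1000

# Worker threads stand in for the five computers in the diagram.
with ThreadPoolExecutor(max_workers=5) as pool:
    per_partition = list(pool.map(primes_in_partition, partitions))

# "collect": the driver gathers every partition's result into one list.
primes = [num for part in per_partition for num in part]
print(len(primes), primes[:10])
```

Each partition is processed independently, which is what lets Spark run the same function on several machines at once; only the final gathering step happens in one place.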
Data Visualization:
• Data visualization is the graphical representation of information and data.
• By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
big-data-anallytics.pptx
Importance of Data Visualization
• Solves data inefficiencies and helps absorb vast amounts of data presented in visual formats
• Identifies errors and inaccuracies in data quickly
• Helps in quick decision making
• Helps explore business insights and achieve business goals in the right direction
• Promotes storytelling and conveys the right message to the audience
• Gives access to real-time information and assists in management functions
• Optimizes and instantly retrieves data via tailored reports
• Helps stay on top of the game by discovering the latest trends
Big Data Analytics application areas
• Healthcare & research
• Manufacturing
• Mobile telecom
• Oil and Gas
• Finance and banking
• Retail and E-Commerce
Demo: Block Diagram
[Diagram: Application Logs (Data Source) -> Ingest and Transform (Data Processing) -> Database (Data Storage) -> Data Visualization]
ELK Stack Dataflow
[Diagram: several Log sources -> Logstash -> Elasticsearch]
Application Logging
Logs: Access logs
LOGTIMESTAMP|REMOTEIPADDRESS|CLIENTPLATFORM|CLIENTIDENTIFIER|USERAGENT|REQUESTMETHOD|REQUESTURL|QUERYSTRING|STATUSCODE|BYTESSENT|PROCESSTIME|CONTENTLENGTH|SERVERNAME|ACCEPT|APPLICATIONIDENTIFIER
[24/Sep/2019:11:00:05 +0000]|92.18.121.125%1|Desktop|Windows/Windows 10 64-bit|SyncDrive/17.3.0.46 (gb_GB; Desktop) Windows/Windows 10 64-bit|GET|/dv/api/user/145ba57910fa4182944ff3b270744120/repository|200|1854|60|-|10.181.4.169|application/vnd.newbay.dv-1.14+xml|SyncDrive.6b49a597-d2f1-4841-93b8-7d833f789980
[24/Sep/2019:11:04:22 +0000]|82.132.247.126%1|HANDSET|Apple/iPhone iPhone11,6|SyncDrive/17.3.420 (en_GB; iOS 13.0) Apple/iPhone iPhone11,6|GET|/dv/api/user/4b4b1aec1e054f66a301db142b6f93fe/repository|200|3508|17|-|10.181.4.44|application/vnd.newbay.dv-1.15+json|syncdrive.4983c941
KPI logs
TSL=20190924:02:17:02.763 +0000,EID=11,IP=10.181.0.67;10.181.4.67,TS1=1569291422243,TS2=1569291422763,TT=FILE,TI=fcbba657946367b3e24378e9dea9d0fa109842d90190b32ec8f9e685ce62a4fa,SI=DV,TE=310420,ST=0,EC=,CP=DesktopMac,CI=Mac/Mac OSX 10.14.6,CID=89620267-3acf-4f20-a2d2-c8d7cc498d2e,SRC=DV,PT=/dv/api/user/00e35fe011e14553b1a919511abb8d88/repository/SyncDrive/file/content,FS=3199894,FT=image/jpeg,FK=fcbba657946367b3e24378e9dea9d0fa109842d90190b32ec8f9e685ce62a4fa,FN=btcc-19.jpg,RM=GET,RCL=-1,RSC=image/jpeg,XMID=dv.user.repository.file.content.get,LCID=00e35fe011e14553b1a919511abb8d88,UA=SyncDrive/17.3.21.17 (gb_GB; DesktopMac) Mac/Mac OSX 10.14.6,
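A KPI log line like the one above is a comma-separated list of KEY=value pairs, which makes it straightforward to turn into a dictionary. The Python sketch below is an illustration under two assumptions: that no value contains a comma, and that TS1 and TS2 are start and end timestamps in epoch milliseconds (the deck does not define the fields). The sample line is an excerpt of the log above with several fields left out for brevity.

```python
def parse_kpi_line(line):
    """Split a KPI log line into a dict of KEY -> value strings.

    Assumes values never contain a comma.
    """
    fields = {}
    for pair in line.strip().split(","):
        if not pair:
            continue                      # empty piece after the trailing comma
        key, _, value = pair.partition("=")
        fields[key.strip()] = value
    return fields


# Excerpt of the KPI log line shown above (some fields omitted).
SAMPLE = (
    "TSL=20190924:02:17:02.763 +0000,EID=11,IP=10.181.0.67;10.181.4.67,"
    "TS1=1569291422243,TS2=1569291422763,TT=FILE,ST=0,EC=,CP=DesktopMac,"
    "FS=3199894,FT=image/jpeg,RM=GET,"
)

fields = parse_kpi_line(SAMPLE)

# Assumption: TS1 and TS2 are start and end times in epoch milliseconds.
elapsed_ms = int(fields["TS2"]) - int(fields["TS1"])
print(fields["RM"], fields["FT"], fields["FS"], elapsed_ms)
```

Keys with an empty value, such as EC, are kept with an empty string so that a missing value can be told apart from a missing key.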
Simulation of Application logging
• Analysis
• Archiving
• Alerting
• Monitoring
Logstash
Logstash Pipeline Block Diagram
[Diagram: Input (Files) -> Filter -> Output (Files)]
Logstash Demo
SN  Input  Filter                  Output
1   CL     None                    CL
2   File   None                    File + CL
3   File   MUTATE + GROK           File + CL
4   File   MUTATE + GROK + DROP    File
5   File   MUTATE + GROK + DROP    ES single index
6   File   MUTATE + GROK + DROP    ES multi index
7   File   Ruby                    File + CL
Flowchart
1. Log Files
2. Grok Filter (align data with columns)
3. if [REQUESTURL] = "/dv/api/status": Yes -> drop the document; No -> continue
4. if [STATUS] = "ERROR": Yes -> Demo Error; No -> Demo Access
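The decisions in the flowchart above can be written as a small routing function. This is a Python sketch of the logic only, not Logstash configuration: it assumes each log line has already been parsed into a dictionary of fields, and it treats "Demo Access" and "Demo Error" as destination labels, which is an interpretation of the flowchart boxes.

```python
def route_event(event):
    """Return None to drop the event, otherwise the name of its destination."""
    if event.get("REQUESTURL") == "/dv/api/status":
        return None                       # status checks are dropped
    if event.get("STATUS") == "ERROR":
        return "Demo Error"
    return "Demo Access"


# Hypothetical, already-parsed events for illustration.
events = [
    {"REQUESTURL": "/dv/api/status", "STATUS": "OK"},
    {"REQUESTURL": "/dv/api/user/145ba57910fa4182944ff3b270744120/repository", "STATUS": "OK"},
    {"REQUESTURL": "/dv/api/user/145ba57910fa4182944ff3b270744120/repository", "STATUS": "ERROR"},
]

routed = [route_event(event) for event in events]
print(routed)
```

Checking the drop condition first means dropped events never reach the second test, matching the order of the two decisions in the flowchart.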
Elasticsearch
• Distributed full-text search engine
• NoSQL database
• Analytics engine
• Written in Java
• Lucene based (like Solr)
• Real-time
• Schema less
• Open source
• RESTful interface
• Easy to scale
• Failover mechanism
Relational DB || Elasticsearch
Relational DB   Elasticsearch
Database        Index
Table           Type
Row             Document
Column          Field
HTTP Methods
• PUT: Create Index, Mapping
• GET: Search Index/Type
• POST: Search Index/Type with body, Query Index/Type
• DELETE: Delete Index
Demo: Elasticsearch
1. Create index
2. Post JSON document
3. Retrieve document
4. Schema
5. Search
6. Filter
7. Query
8. Aggregation
Create an Index
PUT /apartment
Response:
{
"acknowledged": true,
"shards_acknowledged": true,
"index": "apartment"
}
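The create-index call above is an ordinary HTTP PUT, so it can be prepared with the Python standard library. The sketch below only builds the request and checks a response body; it does not contact a cluster. The base URL `http://localhost:9200` is an assumed address for a local node, and both helper functions are hypothetical names for this example.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"   # assumed address of a local Elasticsearch node


def create_index_request(index_name, base_url=ES_URL):
    """Build (but do not send) the PUT request that creates an index."""
    return Request(f"{base_url}/{index_name}", method="PUT")


def index_created(response_body, index_name):
    """Check the JSON body returned by a create-index call."""
    body = json.loads(response_body)
    return body.get("acknowledged") is True and body.get("index") == index_name


request = create_index_request("apartment")
print(request.get_method(), request.full_url)

# Response body copied from the slide above.
sample_response = (
    '{"acknowledged": true, "shards_acknowledged": true, "index": "apartment"}'
)
print(index_created(sample_response, "apartment"))
```

To run it against a live cluster, the request object could be passed to `urllib.request.urlopen`, and the returned body handed to `index_created`.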
Questions:
Mail ID : sangameshkalyan@gmai.com
LinkedIn ID : https://www.linkedin.com/in/sangamesh-kalyan-612269a6/

Editor's Notes

  • #5: https://www.datapine.com/dashboard-examples-and-templates/facebook
  • #7: https://www.simplilearn.com/what-is-big-data-analytics-article
  • #9: https://www.astera.com/type/blog/structured-semi-structured-and-unstructured-data/
  • #11: The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."
  • #16: https://data-flair.training/blogs/hadoop-high-availability-tutorial/
  • #18: https://www.linkedin.com/pulse/parallel-data-processing-apache-spark-shahzad-aslam/
  • #22, #24, #29, #31, #32, #34, #35, #36, #37, #38, #40: https://www.oracle.com/a/ocom/docs/top-22-use-cases-for-big-data.pdf?source=:ad:pas:go:dg:a_apac:71700000084253891-58700007130462449-p64171651466:RC_WWMK210119P00066C0003:
  • #23: https://splashbi.com/importance-purpose-benefit-of-data-visualization-tools/