big-data-anallytics.pptx

Introduction to Big Data Analytics

About Me:
• Sangamesh Kalyan
• From Kalaburagi, Karnataka
• Graduated from PDA College of Engineering, Kalaburagi, in 2016
• Senior Data Engineer at Synchronoss Technologies
• Delivered analytics on 2 major applications for 12+ customers

Agenda:
• Analytics
• Importance of Analytics
• Big Data
• Big Data Pipeline
• ELK Stack
• Demo

Analytics:
• Systematic computational analysis of data or statistics
• Uses data and math to answer business questions, discover relationships, predict unknown outcomes and automate decisions
• Uncovers information such as hidden patterns, correlations, market trends and customer preferences

Types of Analytics:
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?
Difficulty and value both increase from descriptive to prescriptive analytics.

Importance of Analytics:
• New revenue opportunities
• Improve Customer Experience
• Product Development and Innovation
• Better Decision Making
• Risk Management
• More Effective Marketing
• Early Detection of Problems
• Performance Analysis

What is data?
• Information
• Graphs
• Measurements
• Observations
• Facts
• Numbers
• Quantities

Types of data
• Structured: formatted, transformed into a well-defined data model. Example: databases
• Semi-structured: some consistent and definite characteristics. Examples: XML, Email, JSON
• Unstructured: absolute raw form, complex arrangement and formatting. Examples: analog data, GPS, video/audio streams

Who generates data?
Data Sources:
• Web
• Cloud
• Internet of things
• Social Media
• Log Files
• Smart Phones

What is big data?
• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications
• Big data is data whose scale, diversity and complexity require new architecture, techniques, algorithms and analytics to manage it and extract the value and hidden knowledge from it
• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization
• Provides opportunities to find new insights from existing data and also guidelines to capture and analyze future data

Five Vs of big data
• Volume: Huge Amount of Data
• Velocity: High Speed of Accumulation of Data
• Variety: Different Formats of Data from Various Sources
• Veracity: Inconsistency and uncertainty in Data
• Value: Extract Useful Data

Big Data Analytics Pipeline
1. Data Sources: Logs, IoT, RDBMS, Cloud, Web
2. Data Storage
3. Data Processing
4. Data Visualization

Data Storage: HDFS
Features of HDFS
• Fault Tolerance
• High Availability
• Reliability
• Economic
• Replication
• Scalability
• Flexibility
• Distributed Storage
• Easy to Use
• Data Locality

HDFS Block Replication
Diagram: a file is split into HDFS blocks 1 to 5, and each block is stored on three of the five nodes (Node 1 to Node 5).
Block size = 64 MB
Replication Factor = 3
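
The arithmetic behind the diagram can be sketched in plain Scala. This is only an illustration: the block size and replication factor come from the slide, while the 300 MB file size and all names are hypothetical.

```scala
object HdfsBlockSketch {
  val BlockSizeMb = 64       // block size from the slide
  val ReplicationFactor = 3  // replication factor from the slide

  /** Number of blocks needed for a file: ceiling of fileSize / blockSize. */
  def blockCount(fileSizeMb: Long): Long =
    (fileSizeMb + BlockSizeMb - 1) / BlockSizeMb

  /** Total number of block replicas stored across the cluster. */
  def replicaCount(fileSizeMb: Long): Long =
    blockCount(fileSizeMb) * ReplicationFactor

  /** Raw storage consumed; a partially filled last block only occupies the data it holds. */
  def rawStorageMb(fileSizeMb: Long): Long =
    fileSizeMb * ReplicationFactor

  def main(args: Array[String]): Unit = {
    val fileSizeMb = 300L // hypothetical file size
    println(s"blocks   = ${blockCount(fileSizeMb)}")
    println(s"replicas = ${replicaCount(fileSizeMb)}")
    println(s"raw MB   = ${rawStorageMb(fileSizeMb)}")
  }
}
```

With these assumed numbers, a 300 MB file needs 5 blocks and 15 replicas, which matches the five blocks shown in the diagram.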

HDFS High Availability
• Name Node: holds the metadata (file name, no. of replicas, block id, …)
• Data Nodes: Data Node 1 to Data Node 8 hold the block replicas
• Client: sends a Read File request; when a Data Node gives no response, the block is read from another replica

Data Processing: Apache Spark
Features of Spark
• Speed
• Real Time Processing
• Lazy Evaluation
• Powerful Caching
• Reusability
• Dynamic in Nature
• In Memory Computation
• Fault Tolerance

Parallel data processing with Apache Spark
// Creates an array of the numbers 1 to 1000
val r = (1 to 1000).toArray
// Distributes the array as an RDD with 5 partitions
val numbers = sc.parallelize(r, 5)
// Shows the number of partitions the RDD contains
numbers.getNumPartitions
// Applies the map function to every element, partition by partition, in parallel;
// primes are kept and every other number becomes ()
val marked = numbers.map { num =>
  var isPrime = num > 1 // 1 is not a prime number
  for (tmp <- 2 to num / 2) { if (num % tmp == 0) isPrime = false }
  if (isPrime) num else ()
}
// Driver serializes the filter code and sends it to all computers; it drops the () placeholders
val primes = marked.filter(_ != ())
// Spark driver collects all prime numbers from each computer
val collected = primes.collect()
// Executes only on the driver computer and displays all prime numbers
collected.foreach(println)

Contd..
Spark Driver distributes the numbers to 5 computers:
• 1 - 200
• 201 - 400
• 401 - 600
• 601 - 800
• 801 - 1000
Each partition/computer executes the map function in parallel on its set of numbers
Contd..
Spark Driver serializes the map code and sends it to all partitions or computers for execution:
• 1 - 200
• 201 - 400
• 401 - 600
• 601 - 800
• 801 - 1000
Each partition/computer sends its set of prime numbers back to the driver program.
For example, Computer 1 sends back all prime numbers in the range 1 - 200, and Computer 2 sends back all prime numbers in the range 201 - 400.
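
The split-and-collect flow can be imitated without a cluster. The sketch below uses plain Scala collections in place of Spark, so the "computers" are just groups of a Vector. All names are hypothetical, and it assumes the range divides evenly into the partitions.

```scala
object PartitionedPrimesSketch {
  /** Trial division as in the Spark example; numbers below 2 are not prime. */
  def isPrime(num: Int): Boolean =
    num > 1 && (2 to num / 2).forall(num % _ != 0)

  /** Split 1..upper into contiguous partitions (assumes upper is divisible by parts). */
  def partition(upper: Int, parts: Int): Vector[Vector[Int]] =
    (1 to upper).toVector.grouped(upper / parts).toVector

  /** Each "computer" keeps the primes of its own partition. */
  def primesByPartition(upper: Int, parts: Int): Vector[Vector[Int]] =
    partition(upper, parts).map(_.filter(isPrime))

  def main(args: Array[String]): Unit = {
    val perPartition = primesByPartition(1000, 5)
    perPartition.zipWithIndex.foreach { case (primes, i) =>
      println(s"computer ${i + 1}: ${primes.size} primes")
    }
    // The "driver" concatenates what every computer sent back
    println(s"driver collected ${perPartition.flatten.size} primes")
  }
}
```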

Data Visualization:
• Data visualization is the graphical representation of information and data.
• By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

Importance of Data Visualization
• Resolves data inefficiencies and helps absorb vast amounts of data presented in visual formats
• Identifies errors and inaccuracies in data quickly
• Helps in quick decision making
• Helps explore business insights and achieve business goals in the right direction
• Promotes storytelling and conveys the right message to the audience
• Gives access to real-time information and assists in management functions
• Optimizes and instantly retrieves data via tailored reports
• Helps stay on top of the game by discovering the latest trends

Big Data Analytics application areas
• Healthcare & research
• Manufacturing
• Mobile telecom
• Oil and Gas
• Finance and banking
• Retail and E-Commerce

Demo: Block Diagram
1. Data Source: Application Logs
2. Data Processing: Ingest, Transform
3. Data Storage: Database
4. Data Visualization

Dataflow
1. Log files (three sources)
2. Logstash
3. Elasticsearch

KPI logs
TSL=20190924:02:17:02.763 +0000,EID=11,IP=10.181.0.67;10.181.4.67,TS1=1569291422243,TS2=1569291422763,TT=FILE,TI=fcbba657946367b3e24378e9dea9d0fa109842d90190b32ec8f9e685ce62a4fa,SI=DV,TE=310420,ST=0,EC=,CP=DesktopMac,CI=Mac/Mac OSX 10.14.6,CID=89620267-3acf-4f20-a2d2-c8d7cc498d2e,SRC=DV,PT=/dv/api/user/00e35fe011e14553b1a919511abb8d88/repository/SyncDrive/file/content,FS=3199894,FT=image/jpeg,FK=fcbba657946367b3e24378e9dea9d0fa109842d90190b32ec8f9e685ce62a4fa,FN=btcc-19.jpg,RM=GET,RCL=-1,RSC=image/jpeg,XMID=dv.user.repository.file.content.get,LCID=00e35fe011e14553b1a919511abb8d88,UA=SyncDrive/17.3.21.17 (gb_GB; DesktopMac) Mac/Mac OSX 10.14.6,
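
A KPI log line of this shape is a comma-separated list of KEY=value pairs, so it can be split into fields before any analysis. The sketch below is plain Scala and assumes that values never contain commas, which holds for the sample line. The object and method names are hypothetical, and the field names (FN, FS, EC) are taken from the sample without assuming what they mean.

```scala
object KpiLogSketch {
  /** Parse "KEY=value,KEY=value,..." into a map; assumes values contain no commas. */
  def parse(line: String): Map[String, String] =
    line.split(",").iterator
      .map(_.trim)
      .filter(_.contains("="))
      .map { field =>
        val idx = field.indexOf('=') // split on the first '=' only
        field.substring(0, idx) -> field.substring(idx + 1)
      }
      .toMap

  def main(args: Array[String]): Unit = {
    // Shortened excerpt of the sample KPI line
    val sample = "EID=11,TT=FILE,ST=0,EC=,CP=DesktopMac,FS=3199894,FT=image/jpeg,FN=btcc-19.jpg,RM=GET,"
    val fields = parse(sample)
    println(fields("FN"))
    println(fields("FS").toLong)
  }
}
```

Fields with an empty value, such as EC in the sample, are kept with an empty string.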

Simulation of Application logging

Logstash
• Analysis
• Archiving
• Alerting
• Monitoring

Logstash Pipeline
Block Diagram
1. Files (source)
2. Input
3. Filter
4. Output
5. Files (destination)

Logstash Demo
SN  Input  Filter                  Output
1   CL     None                    CL
2   File   None                    File + CL
3   File   MUTATE + GROK           File + CL
4   File   MUTATE + GROK + DROP    File
5   File   MUTATE + GROK + DROP    ES single index
6   File   MUTATE + GROK + DROP    ES multi index
7   File   Ruby                    File + CL

Flowchart
1. Log Files
2. Grok Filter (align data with columns)
3. If [REQUESTURL] = "/dv/api/status": Yes, drop the document
4. Otherwise, if [STATUS] = "ERROR": Yes, send to Demo Error; No, send to Demo Access
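
The branching in the flowchart can be sketched in plain Scala. This is not a Logstash configuration: the field names REQUESTURL and STATUS and the two destinations come from the slide, while representing an event as a Map and all type names are assumptions made for illustration.

```scala
object LogRoutingSketch {
  sealed trait Route
  case object Drop extends Route
  case object DemoError extends Route
  case object DemoAccess extends Route

  /** Mirrors the flowchart: status-check requests are dropped, then errors are separated from the rest. */
  def route(event: Map[String, String]): Route =
    if (event.get("REQUESTURL").contains("/dv/api/status")) Drop
    else if (event.get("STATUS").contains("ERROR")) DemoError
    else DemoAccess

  def main(args: Array[String]): Unit = {
    // Hypothetical events, already split into fields by the grok step
    val events = Seq(
      Map("REQUESTURL" -> "/dv/api/status", "STATUS" -> "OK"),
      Map("REQUESTURL" -> "/dv/api/user", "STATUS" -> "ERROR"),
      Map("REQUESTURL" -> "/dv/api/user", "STATUS" -> "OK")
    )
    events.foreach(e => println(s"${e("REQUESTURL")} -> ${route(e)}"))
  }
}
```

The drop check comes first, so a status-check request is discarded even when its STATUS is "ERROR".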

Elasticsearch
• Distributed full-text search engine
• NoSQL database
• Analytics engine
• Written in Java
• Lucene based (like Solr)
• Real-time
• Schema less
• Open source
• Restful Interface
• Easy to scale
• Failover mechanism

Relational DB || Elasticsearch
Relational DB    Elasticsearch
Database         Index
Table            Type
Row              Document
Column           Field

HTTP Methods
• PUT: Create Index, Mapping
• GET: Search Index/Type
• POST: Search Index/Type with body, Query Index/Type
• DELETE: Delete Index

Demo: Elasticsearch
1. Create index
2. Post JSON document
3. Retrieve document
4. Schema
5. Search
6. Filter
7. Query
8. Aggregation

Create an Index
PUT /apartment
Response:
{
"acknowledged": true,
"shards_acknowledged": true,
"index": "apartment"
}
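
The same request can be built from code. The sketch below uses the JDK's java.net.http API (Java 11 or later) from Scala and only constructs the request without sending it. The base URL is an assumption (a local node on Elasticsearch's default HTTP port 9200), and the object and method names are hypothetical.

```scala
import java.net.URI
import java.net.http.HttpRequest

object CreateIndexSketch {
  // Assumption: a local Elasticsearch node on its default HTTP port
  val BaseUrl = "http://localhost:9200"

  /** Build (but do not send) the request equivalent to `PUT /<index>`. */
  def createIndexRequest(index: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(s"$BaseUrl/$index"))
      .PUT(HttpRequest.BodyPublishers.noBody())
      .build()

  def main(args: Array[String]): Unit = {
    val request = createIndexRequest("apartment")
    println(s"${request.method()} ${request.uri()}")
    // To send it against a running node:
    //   java.net.http.HttpClient.newHttpClient()
    //     .send(request, java.net.http.HttpResponse.BodyHandlers.ofString())
  }
}
```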

Questions:
Mail ID: sangameshkalyan@gmai.com
LinkedIn ID: https://www.linkedin.com/in/sangamesh-kalyan-612269a6/
