Q 1. (a) Explain the Stream Data Model Architecture with a neat diagram.
In analogy to a database-management system, we can view a stream processor as a kind of
data-management system, the high-level organization of which is suggested in Fig.
Any number of streams can enter the system. Each stream can provide elements at its own
schedule; they need not have the same data rates or data types, and the time between elements
of one stream need not be uniform. The fact that the rate of arrival of stream elements is not
under the control of the system distinguishes stream processing from the processing of data
that goes on within a database-management system. The latter system controls the rate at
which data is read from the disk, and therefore never has to worry about data getting lost as it
attempts to execute queries. Streams may be archived in a large archival store, but we assume
it is not possible to answer queries from the archival store. It could be examined only under
special circumstances using time-consuming retrieval processes. There is also a working store,
into which summaries or parts of streams may be placed, and which can be used for answering
queries. The working store might be disk, or it might be main memory, depending on how fast
we need to process queries. But either way, it is of sufficiently limited capacity that it cannot
store all the data from all the streams.
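The split between the two stores can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `StreamProcessor`, the `capacity` parameter, and the running-average query are hypothetical and do not come from the text.

```python
from collections import deque


class StreamProcessor:
    """Toy model of the architecture: a bounded working store plus an archive."""

    def __init__(self, capacity):
        # Working store: limited capacity, so old elements are dropped.
        self.working_store = deque(maxlen=capacity)
        # Archival store: keeps everything, but is never used to answer queries.
        self.archive = []

    def ingest(self, element):
        # Elements arrive at a rate the system does not control.
        self.archive.append(element)
        self.working_store.append(element)

    def query_average(self):
        # Queries are answered from the working store only.
        if not self.working_store:
            return None
        return sum(self.working_store) / len(self.working_store)


processor = StreamProcessor(capacity=3)
for value in [1, 2, 3, 4, 5]:
    processor.ingest(value)
recent_average = processor.query_average()
```

Here the query sees only the last three elements, while the archive still holds all five.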
2. What is a Bloom filter? Determine the probability of a false positive in a Bloom filter.
A Bloom filter consists of:
1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps “key” values
to n buckets, corresponding to the n bits of the bit-array.
3. A set S of m key values.
The purpose of the Bloom filter is to allow through all stream elements whose keys are in S,
while rejecting most of the stream elements whose keys are not in S.
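The model to use is throwing darts at targets. Suppose we have x targets and y darts. Any dart
is equally likely to hit any target. After throwing the darts, how many targets can we expect to
be hit at least once?
• The probability that a given dart will not hit a given target is $(x-1)/x$.
• The probability that none of the y darts will hit a given target is $\left(\frac{x-1}{x}\right)^{y}$.
• We can write this expression as $\left(1-\frac{1}{x}\right)^{x\cdot\frac{y}{x}}$.
• Using the approximation $(1-\epsilon)^{1/\epsilon} \approx 1/e$ for small $\epsilon$, we conclude that the
probability that none of the y darts hit a given target is $e^{-y/x}$.
• For the Bloom filter, the targets are the n bits (x = n) and the darts are the k hash values of
each of the m keys in S (y = km). A given bit stays 0 with probability $e^{-km/n}$, so it is 1 with
probability $1-e^{-km/n}$. A key not in S is a false positive only if all k of its bits are 1, so
the false-positive probability is $\left(1-e^{-km/n}\right)^{k}$.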
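The structure and the dart estimate can be tried out with a minimal sketch. It assumes salted SHA-256 digests as the k hash functions (the text does not name any), and uses the standard result of the dart model with x = n bits and y = km darts, namely a false-positive probability of about (1 − e^(−km/n))^k. The names `BloomFilter` and `false_positive_rate` are illustrative.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, n_bits, k_hashes):
        self.n = n_bits
        self.k = k_hashes
        self.bits = [0] * n_bits  # array of n bits, initially all 0

    def _positions(self, key):
        # Assumption: k hash functions simulated by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # True for every key in S; occasionally true for keys not in S.
        return all(self.bits[pos] for pos in self._positions(key))


def false_positive_rate(n_bits, m_keys, k_hashes):
    # Dart model: fraction of bits set is about 1 - e^(-km/n).
    return (1 - math.exp(-k_hashes * m_keys / n_bits)) ** k_hashes


members = ["alice@example.com", "bob@example.com", "carol@example.com"]
bloom = BloomFilter(n_bits=1000, k_hashes=3)
for key in members:
    bloom.add(key)

# One hash function, one billion keys, eight billion bits.
estimate = false_positive_rate(8_000_000_000, 1_000_000_000, 1)
```

Every key that was added passes the filter; the estimate for the last configuration is roughly 0.1175.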
3. Explain the Girvan-Newman algorithm. Detect communities for the following graph using the
Girvan-Newman algorithm (edge betweenness is marked in the graph).
• To find the betweenness of the edges, we need to count the shortest paths that go
through each of the edges.
• The Girvan-Newman algorithm visits each node X once and computes the number of
shortest paths from X to each of the other nodes that go through each of the edges.
• The algorithm begins by performing a breadth-first search (BFS) of the graph, starting
at the node X.
• Edges that go between nodes at the same level can never be part of a shortest path
from X.
• Edges between levels are called DAG edges. Each DAG edge will be part of at least one
shortest path from root X.
• To complete the betweenness calculation, we repeat this calculation with every
node as the root and sum the contributions. Since every shortest path is found twice,
once from each endpoint, the sum is divided by 2.
• After the calculations, the following graph shows the final betweenness values:
• We can cluster by taking the edges in order of increasing betweenness and adding them to the
graph one at a time.
• Alternatively, we can remove the edge with the highest betweenness to split the graph into clusters.
• In the example graph we remove edge BD to get two communities as follows:
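The BFS-and-credit procedure can be sketched as follows. The figure is not reproduced in the text, so the seven-node edge list below is an assumed example graph, chosen so that BD is the edge with the highest betweenness; the function names are illustrative.

```python
from collections import defaultdict, deque


def edge_betweenness(adj):
    between = defaultdict(float)
    for root in adj:
        # BFS from the root, counting shortest paths to every node.
        dist = {root: 0}
        paths = defaultdict(int)
        paths[root] = 1
        parents = defaultdict(list)
        order = []
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:  # DAG edge from level d to d+1
                    paths[v] += paths[u]
                    parents[v].append(u)
        # Bottom-up credit: each node starts with 1 and passes its credit to
        # its parents in proportion to their shortest-path counts.
        credit = {v: 1.0 for v in order}
        for v in reversed(order):
            for p in parents[v]:
                share = credit[v] * paths[p] / paths[v]
                between[frozenset((p, v))] += share
                credit[p] += share
    # Every shortest path was found from both endpoints, so halve the sums.
    return {edge: total / 2 for edge, total in between.items()}


def components(adj):
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        group, queue = set(), deque([start])
        while queue:
            u = queue.popleft()
            group.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        groups.append(group)
    return groups


# Assumed example graph (not taken from the missing figure).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

scores = edge_betweenness(adj)
top_edge = max(scores, key=scores.get)
a, b = tuple(top_edge)
adj[a].discard(b)
adj[b].discard(a)
communities = components(adj)
```

On this graph the highest-scoring edge is BD, and removing it leaves the communities {A, B, C} and {D, E, F, G}.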
4. Define PageRank. Calculate the PageRank for the following graph.
5. Explain the Flajolet-Martin algorithm. Perform FM for the stream 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1, ...
The Flajolet-Martin algorithm approximates the number of unique objects in a stream or a
database in one pass. If the stream contains n elements with m of them unique, this algorithm
runs in O(n) time and needs O(log(m)) memory.
Algorithm:
1. Create a bit vector (bit array) of sufficient length L, such that $2^L > n$, the number
of elements in the stream. Usually a 64-bit vector is sufficient since $2^{64}$ is quite
large for most purposes.
2. The i-th bit in this vector/array represents whether we have seen a hash function value
whose binary representation ends in $0^i$ (i trailing zeros). So initialize each bit to 0.
3. For each stream element a, compute the hash value h(a) and its tail length r(a), the number
of trailing zeros in its binary representation, and set bit r(a) to 1.
4. Let R = max(r(a)) be the largest tail length seen. The estimate of the number of distinct
elements is $2^R$.
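Example: S = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
h(x) = (6x + 1) mod 5
Assume |b| = 5, that is, hash values are written as 5-bit strings.
h(1) = 2 = 00010, r = 1
h(2) = 3 = 00011, r = 0
h(3) = 4 = 00100, r = 2
h(4) = 0 = 00000, r = 5
R = max(r(a)) = 5
So the estimated number of distinct elements is $N = 2^R = 2^5 = 32$.
The overestimate comes from h(4) = 0, whose 5-bit form is all zeros. A common convention takes
r = 0 for a hash value of 0, which would give R = 2 and N = 4, the true count for this stream.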
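The example can be checked with a short sketch. It assumes, as the result R = 5 with |b| = 5 implies, that a hash value of 0 is treated as a string of 5 zero bits; the function names are illustrative.

```python
def trailing_zeros(value, bits):
    """Tail length r: number of trailing zeros in the `bits`-bit binary form."""
    if value == 0:
        return bits  # assumption: an all-zero hash counts as `bits` zeros
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count


def fm_estimate(stream, hash_fn, bits):
    # R is the longest tail seen over the stream; the estimate is 2^R.
    longest_tail = max(trailing_zeros(hash_fn(a), bits) for a in stream)
    return 2 ** longest_tail


def h(x):
    return (6 * x + 1) % 5


stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
tails = {a: trailing_zeros(h(a), 5) for a in sorted(set(stream))}
estimate = fm_estimate(stream, h, bits=5)
```

A single hash function gives a coarse estimate, always a power of 2; in practice many hash functions are used and their estimates combined.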
6. Write pseudocode for the PageRank calculation using MapReduce. What is the role of combiners
in performing the PageRank calculation?
Combiners: (2 Marks)
There are two reasons the basic MapReduce iteration might not be adequate:
1. We might wish to add terms for $v'_i$, the i-th component of the result vector $v'$, at the
Map tasks. This improvement is the same as using a combiner, since the Reduce
function simply adds terms with a common key. Recall that for a MapReduce
implementation of matrix–vector multiplication, the key is the value of i for which a
term $m_{ij}v_j$ is intended.
2. We might not be using MapReduce at all, but rather executing the iteration step at a
single machine or a collection of machines.
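One PageRank iteration with a combiner can be sketched as a single-process simulation. This is not a real MapReduce job: the function names, the way the matrix is split into chunks, and the damping factor `beta` with its teleport term are assumptions added for illustration.

```python
from collections import defaultdict


def map_task(entries, v):
    # entries: (i, j, m_ij) triples of the transition matrix M.
    # Emits (i, m_ij * v_j): the key is the row index i.
    return [(i, m_ij * v[j]) for i, j, m_ij in entries]


def combine(pairs):
    # Combiner: adds terms with the same key i at the Map task,
    # so fewer pairs are sent to the Reduce tasks.
    local = defaultdict(float)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def reduce_task(pairs, n, beta):
    # Reduce: adds the partial sums for each i, then applies the
    # assumed update v'_i = beta * sum + (1 - beta) / n.
    sums = defaultdict(float)
    for key, value in pairs:
        sums[key] += value
    return [beta * sums[i] + (1 - beta) / n for i in range(n)]


def pagerank_step(chunks, v, beta=0.85):
    shuffled = []
    for chunk in chunks:  # one chunk per Map task
        shuffled.extend(combine(map_task(chunk, v)))
    return reduce_task(shuffled, len(v), beta)


# Hypothetical 3-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
chunks = [
    [(1, 0, 0.5), (2, 0, 0.5)],  # column 0: page 0 has two out-links
    [(2, 1, 1.0), (0, 2, 1.0)],  # columns 1 and 2: one out-link each
]
v0 = [1 / 3, 1 / 3, 1 / 3]
v1 = pagerank_step(chunks, v0, beta=1.0)
v1_damped = pagerank_step(chunks, v0)
```

The step is repeated, feeding each result back in as `v`, until the vector stops changing appreciably.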
7. Explain the CURE clustering algorithm with an example.
The CURE (Clustering Using Representatives) algorithm is a large-scale clustering algorithm
in the point-assignment class that assumes a Euclidean space. It does not assume anything
about the shape of clusters; they need not be normally distributed, and can even have strange
bends, S-shapes, or even rings.
Instead of representing clusters by their centroid, it uses a collection of representative points,
as the name implies.
The CURE algorithm is divided into two phases:
1. Initialization in CURE
2. Completion of the CURE Algorithm
Initialization in CURE:
1. Take a small sample of the data and cluster it in main memory. In principle, any
clustering method could be used, but as CURE is designed to handle oddly shaped
clusters, it is often advisable to use a hierarchical method in which clusters are merged
when they have a close pair of points.
2. Select a small set of points from each cluster to be representative points. These points
should be chosen to be as far from one another as possible, for example with the farthest-point
selection that is also used to pick initial centroids for k-means.
3. Move each of the representative points a fixed fraction of the distance between its
location and the centroid of its cluster. Perhaps 20% is a good fraction to choose. Note
that this step requires a Euclidean space, since otherwise, there might not be any notion
of a line between two points.
Completion of the CURE Algorithm:
The next phase of CURE is to merge two clusters if they have a pair of representative points,
one from each cluster, that are sufficiently close. The user may pick the distance that defines
“close.” This merging step can repeat, until there are no more sufficiently close clusters.
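The representative-point steps can be sketched as below. The sketch starts from clusters that are assumed to be already formed, so the initial hierarchical clustering of the sample is not shown, and the sample points, the merge threshold of 2.5, and the function names are illustrative.

```python
import math


def centroid(points):
    count = len(points)
    return tuple(sum(p[d] for p in points) / count for d in range(len(points[0])))


def pick_representatives(points, count):
    # Farthest-first selection: start with the point farthest from the
    # centroid, then keep adding the point farthest from those chosen.
    center = centroid(points)
    reps = [max(points, key=lambda p: math.dist(p, center))]
    while len(reps) < min(count, len(points)):
        reps.append(max(points, key=lambda p: min(math.dist(p, r) for r in reps)))
    return reps


def shrink(reps, center, fraction=0.2):
    # Move each representative a fixed fraction of the way to the centroid.
    return [tuple(r[d] + fraction * (center[d] - r[d]) for d in range(len(r)))
            for r in reps]


def merge_close(rep_sets, threshold):
    # Merge two clusters whenever some pair of their representatives is
    # within `threshold`; repeat until no such pair remains.
    rep_sets = [list(reps) for reps in rep_sets]
    merged = True
    while merged:
        merged = False
        for a in range(len(rep_sets)):
            for b in range(a + 1, len(rep_sets)):
                if any(math.dist(p, q) <= threshold
                       for p in rep_sets[a] for q in rep_sets[b]):
                    rep_sets[a].extend(rep_sets.pop(b))
                    merged = True
                    break
            if merged:
                break
    return rep_sets


# Assumed initial clusters from a small sample.
clusters = [
    [(0, 0), (0, 2), (2, 0), (2, 2)],
    [(3, 3), (5, 5)],
    [(20, 20), (22, 22)],
]
rep_sets = [shrink(pick_representatives(c, 2), centroid(c)) for c in clusters]
final = merge_close(rep_sets, threshold=2.5)
```

The first two clusters have representatives about 1.98 apart and are merged, while the third stays separate.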


Bigdata analytics
