DISTRIBUTED AND STREAMING
GRAPH PROCESSING TECHNIQUES
P. N. LIAKOS
University of Athens, June 15th, 2018
For the Fat Lady
Say you wish to study . . .
. . . and now imagine that the data is available . . .
UoA Panagiotis Liakos Motivation 2/63
. . . but it’s more than you can handle!
[Figure: a wall of raw binary data]
2 billion users!
500 million tweets per day!
1 trillion URLs!
Large Scale Graph Processing
Active Research Areas:
Distributed Graph Processing
Graph Streams
[Figure: a Pregel-like graph processing system with the graph split across Worker 1, Worker 2, and Worker 3; vertex labels 2, 9, 10, 11, 12, 14, 17, 18, 20, 127]
Our Contribution
Distributed Graph Processing (Pregel paradigm)
Memory Optimization
Opinion Formation
Local Community Detection
Graph Streams
Streaming Community Detection
Sampling High Quality Content
Outline
1 Realizing Memory-Optimized Distributed Graph Processing
2 COEUS: Community detection via seed-set expansion on
graph streams
3 Scalable Link Community Detection: A Local
Dispersion-aware Approach
4 Rhea: Adaptively Sampling High Quality Content from
Social Activity Streams
5 On the Impact of Social Cost in Opinion Dynamics
Apache Giraph Memory Usage [CEK+15]
• Ineffective memory usage
• Partitioning makes compression harder
Related memory optimization approaches
Apache Giraph [CEK+15]:
• does not exploit the redundancy in real-world graphs
Ligra+ [SDB15]:
• compression techniques on a shared-memory system
• halved space usage at the cost of slower execution
GBASE [KTS+12]:
• does not follow the vertex-centric model
• requires decompression
GraphChi [KBG], FlashGraph [ZMB+15] & Graphene [LH17]:
• maintain graph data on disks
• reasonable performance with very modest requirements
Contribution
We present a number of novel techniques that:
1 offer space-efficient representations of out-edges,
2 allow fast in-situ mining of the graph elements without the need for decompression,
3 enable the execution of graph algorithms in memory-constrained settings, and
4 ease the task of memory management, thus allowing faster execution.
– PL, Katia Papakonstantinopoulou, Alex Delis: Memory-Optimized Distributed Graph Processing through Novel Compression Techniques. ACM CIKM 2016 & ACAC 2016
– PL, Katia Papakonstantinopoulou, Alex Delis: Realizing Memory-Optimized Distributed Graph Processing. IEEE Trans. Knowl. Data Eng. 30(4), 2018
Properties of real-world graphs
Locality of reference: the majority of the edges of a graph link vertices that are close to each other in the order
Similarity (or copy property): vertices that are close to each other in the order tend to have many common out-neighbors
http://www.di.uoa.gr/
http://www.di.uoa.gr/css/foostyle.css
http://www.di.uoa.gr/images/logo.gif
http://www.di.uoa.gr/about/
http://www.di.uoa.gr/staff/
http://www.di.uoa.gr/about/
http://www.di.uoa.gr/css/foostyle.css
http://www.di.uoa.gr/images/logo.gif
http://www.di.uoa.gr/directions.html
http://www.di.uoa.gr/about/
http://www.di.uoa.gr/staff/
BVEdges (based on [BV04])
2, 9, 10, 11, 12, 14, 17, 18, 20, 127
⇓
9 − 12, 2, 14, 17, 18, 20, 127
⇓
number of intervals: (1)₂, 4 bytes
interval: (9)₂, 4 bytes, followed by γ(0), 1 bit
residuals: ζ(13) ζ(11) ζ(2) ζ(0) ζ(1) ζ(106), taking 7, 7, 4, 3, 4, and 11 bits
• bit-encoding involves significant overhead
IntervalResidualEdges
2, 9, 10, 11, 12, 14, 17, 18, 20, 127
⇓
9 − 12, 17 − 18, 2, 14, 20, 127
⇓
number of intervals: (2)₂, 4 bytes
1st interval: (9)₂ (4)₂, 4 + 1 bytes
2nd interval: (17)₂ (2)₂, 4 + 1 bytes
residuals: (2)₂ (14)₂ (20)₂ (127)₂, 4 bytes each
• avoids expensive encodings & bit streams
• significant compression due to locality of reference
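The interval/residual split on this slide can be sketched in a few lines of Python. This is only an illustration, not the Giraph implementation: the function names are hypothetical, the minimum interval length of 2 is inferred from the 17 − 18 interval in the example, and interval lengths are assumed to fit in the single length byte.

```python
def split_intervals_residuals(neighbors, min_interval_length=2):
    """Split a sorted list of neighbor ids into (start, length) intervals and residuals."""
    intervals, residuals = [], []
    i = 0
    while i < len(neighbors):
        j = i
        # Extend the run for as long as the ids stay consecutive.
        while j + 1 < len(neighbors) and neighbors[j + 1] == neighbors[j] + 1:
            j += 1
        run_length = j - i + 1
        if run_length >= min_interval_length:
            intervals.append((neighbors[i], run_length))
        else:
            residuals.extend(neighbors[i:j + 1])
        i = j + 1
    return intervals, residuals


def encoded_size_in_bytes(intervals, residuals):
    # 4 bytes for the interval count, 4 + 1 bytes per interval, 4 bytes per residual,
    # following the byte sizes annotated on the slide.
    return 4 + 5 * len(intervals) + 4 * len(residuals)


neighbors = [2, 9, 10, 11, 12, 14, 17, 18, 20, 127]
intervals, residuals = split_intervals_residuals(neighbors)
print(intervals, residuals, encoded_size_in_bytes(intervals, residuals))
```

For the slide's adjacency list this yields the intervals (9, 4) and (17, 2) and the residuals 2, 14, 20, 127, for 30 bytes in total under the sizes above.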
IndexedBitArrayEdges
2, 9, 10, 11, 12, 14, 17, 18, 20, 127
⇓
{2}, {9, 10, 11, 12, 14}, {17, 18, 20}, {127}
⇓
indexes (0)₂, (1)₂, (2)₂, ..., (15)₂: 4 bytes per index plus a 1-byte bit array
• avoids expensive encodings & bit streams
• significant compression due to locality of reference
• memory-efficient retrieval of out-edges
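A minimal Python sketch of the grouping shown above, assuming each neighbor id n falls into the group with index n // 8 and sets bit n % 8 of that group's byte. The bit order (lowest id in the least significant bit) and the function names are assumptions for illustration only.

```python
def to_indexed_bit_arrays(neighbors):
    """Group neighbor ids into (index, 8-bit array) pairs."""
    blocks = {}
    for n in neighbors:
        index, bit = divmod(n, 8)
        blocks[index] = blocks.get(index, 0) | (1 << bit)
    return sorted(blocks.items())


def from_indexed_bit_arrays(blocks):
    """Recover the sorted neighbor ids without materializing extra structures."""
    return [index * 8 + bit
            for index, bits in blocks
            for bit in range(8)
            if bits & (1 << bit)]


neighbors = [2, 9, 10, 11, 12, 14, 17, 18, 20, 127]
blocks = to_indexed_bit_arrays(neighbors)
print([index for index, _ in blocks])  # group indexes, as on the slide
print(5 * len(blocks), "bytes")        # 4-byte index + 1-byte bit array per group
```

The four groups carry the indexes 0, 1, 2, and 15, matching the slide, and occupy 20 bytes under the stated sizes.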
VariableByteWeights
32, 378
⇓
{32}, {256 + 122}
⇓
1st weight: 1 byte
2nd weight: 2 bytes
[Figure: bit layout of the three encoded bytes]
• weights of edges exhibit right-skewed distributions [BBPSV04, MAF]
• ⌊log₁₂₈(n)⌋ + 1 bytes to represent an integer n ≥ 1
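The byte count above follows from storing 7 payload bits per byte. A hedged Python sketch of such a variable-byte code is given below; the exact bit layout on the slide is not recoverable, so the convention used here (most significant group first, high bit set on every byte except the last) is an assumption, as are the function names.

```python
def encode_varint(n):
    """Encode a non-negative integer using 7 payload bits per byte."""
    if n < 0:
        raise ValueError("only non-negative weights are supported")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()  # most significant 7-bit group first (assumed order)
    # Set the high bit on every byte that is followed by another byte.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def decode_varints(data):
    """Decode a concatenation of encoded integers."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if not byte & 0x80:  # high bit clear: last byte of this integer
            values.append(current)
            current = 0
    return values


encoded = encode_varint(32) + encode_varint(378)
print(len(encode_varint(32)), len(encode_varint(378)), decode_varints(encoded))
```

Here 32 takes one byte and 378 = 2 · 128 + 122 takes two, in line with the slide's three-byte example.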
RedBlackTreeEdges
2, 9, 10, 11, 12, 14, 17, 18, 20, 127
⇓
[Figure: red-black tree holding the ten neighbor ids]
key: 8 bytes (long)
left: 4 bytes (compressed oop)
right: 4 bytes (compressed oop)
color: 1 byte (boolean)
depth: 2 bytes (short)
• does not waste space for empty buckets
• significant savings using primitive data types
• cost-free iterations with regard to space
Experimental Evaluation
Space-efficiency
Performance:
when the available memory is not constrained
when the available memory is constrained
when adding/removing edges
Space Efficiency Comparison
graph                ByteArrayEdges   BVEdges                     IntervalResidualEdges   IndexedBitArrayEdges
uk-2007-05@100000    22.61 MB         6.41 MB (0.96 MB)           7.92 MB                 8.91 MB
uk-2007-05@1000000   279.16 MB        67.36 MB (10.54 MB)         82.7 MB                 97.79 MB
ljournal-2008        866.36 MB        386.73 MB (117.68 MB)       497.52 MB               648.52 MB
indochina-2004       1,511.67 MB      442.34 MB (48.03 MB)        646.03 MB               554.23 MB
hollywood-2011       1,381.91 MB      287.53 MB (145.85 MB)       613.52 MB               676.88 MB
uk-2002              2,733.6 MB       1,092.82 MB (116.39 MB)     1,224.07 MB             1,255.67 MB
arabic-2005          4,820.09 MB      1,428.97 MB (187.58 MB)     1,674.75 MB             1,849.83 MB
uk-2005              7,401.88 MB      2,383.54 MB (279.45 MB)     2,728.74 MB             2,928.81 MB
twitter-2010         11,189.88 MB     4,628.48 MB (2,600.07 MB)   7,127.76 MB             8,888.50 MB
sk-2005              14,829.64 MB     4,889.85 MB (607.92 MB)     5,657.79 MB             6,354.17 MB
BVEdges < 40% ByteArrayEdges
impressive savings with all techniques
Performance / small-scale graphs
[Figure: execution time (in minutes) with 8, 4, and 2 workers for ByteArrayEdges, BVEdges, IntervalResidualEdges, and IndexedBitArrayEdges]
BVEdges is inferior speedwise
IntervalResidualEdges and IndexedBitArrayEdges already show signs of improvement
Performance / large-scale graphs
[Figure: execution time (in minutes) per superstep of a 30-superstep PageRank execution for ByteArrayEdges, BVEdges, IntervalResidualEdges, and IndexedBitArrayEdges]
ByteArrayEdges’s performance fluctuates due to excessive garbage collection.
Performance / large-scale graphs
[Figure: execution time (in minutes) on uk-2005 with 5 workers and with 4 workers for ByteArrayEdges, BVEdges, IntervalResidualEdges, and IndexedBitArrayEdges; a FAILED marker appears in the 4-worker panel]
IntervalResidualEdges is faster than ByteArrayEdges
IndexedBitArrayEdges outperforms all
Performance / algorithms involving mutations
[Figure: execution time (in minutes) against the maximum number of mutations allowed (10 to 100) for HashMapEdges and RedBlackTreeEdges]
RedBlackTreeEdges requires less than half of the space that HashMapEdges needs
the performance of HashMapEdges deteriorates significantly as the number of allowed mutations grows
2 COEUS: Community detection via seed-set expansion on graph streams
Climate change conversation on Twitter
[Figure: network of the climate change conversation on Twitter, from carbonbrief.org]
real-world networks
are massive!
change rapidly!
exhibit community structure!
Motivation
We want to extract the community structure of nodes
in a network that changes rapidly.
Many useful applications:
we can provide more informative & engaging social
network feeds
we can enhance the efficiency of recommender systems
Size of graph data appears to be ever-increasing:
Facebook has more than 2 billion registered users
Google indexes more than 1 trillion unique URLs
Our context
[Figure: a graph with nodes 1 to 8 whose communities are initialized with seed-sets, next to a graph stream with entries 9 8, 2 3, ...]
Related Work - Static Graph
Non-overlapping Algorithms:
[GN02, NG04, BGLL08, CNM04, PL05, RB11]
Edge Betweenness
Modularity maximization
Random-walks
Overlapping Algorithms: [PDFV05, ABL10, EL09]
Clique Percolation
Hierarchical Link Clustering
More Scalable Overlapping Algorithms:
[CRGP12, YL13, GS12, WGD13]
Egonets
Matrix Factorization
Seed-set Expansion
Local Algorithms: [KG14, LHBH15, HSB+15]
Focus is shifted to local structure
Seed-set Expansion
Related Work - Graph Stream / Dynamic Graph
Yun et al. [YLP14]:
rows of the adjacency matrix of the graph are revealed
sequentially
Zakrzewska and Bader [ZB15]:
dynamic graphs
seed set expansion
incrementally adjust to dynamic changes
Hollocou et al. [HMBL17]:
if we pick uniformly at random an edge of the graph, this
edge is more likely to link nodes of the same community,
than nodes from distinct communities
if we process edges in a random order we expect many
intra-community edges to arrive before the
inter-community edges
Contribution
We propose COEUS:
A novel community detection algorithm that operates on a graph stream, using space sublinear in the number of edges.
We also suggest:
A PageRank-like Edge Quality Variation
A Novel Clustering Technique for Community Size Determination
We are extremely competitive with non-streaming approaches and our execution time and space requirements are astonishingly low.
– PL, Alexandros Ntoulas, Alex Delis: COEUS: Community detection via seed-set expansion on graph streams. IEEE BigData 2017 & HDMS 2018
COEUS*
* the axis of heaven around which the constellations revolved
C O E U S: Community detection via seed-set Expansion on graph Streams
Community Participation Value
No universal definition of what a community is!
We define the community participation of node u in community C:
$$cp(u) = \frac{|\{(u, v) \in E : v \in C\}|}{|\{(u, v) \in E\}|},$$
the fraction of its adjacent nodes in the graph that are part of the community.
Our evaluation does not consider a particular quality function.
Effectiveness is measured using ground-truth communities.
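As a small worked example of the definition above, the following Python sketch computes cp(u) exactly from an edge list. It assumes an undirected graph given as (u, v) pairs; the function name is illustrative and not part of COEUS, which estimates these quantities on a stream instead.

```python
def community_participation(edges, community):
    """Return cp(u) for every node u: neighbors inside the community / all neighbors."""
    degree, inside = {}, {}
    for u, v in edges:
        # Treat each edge as undirected and count it for both endpoints.
        for node, other in ((u, v), (v, u)):
            degree[node] = degree.get(node, 0) + 1
            if other in community:
                inside[node] = inside.get(node, 0) + 1
    return {node: inside.get(node, 0) / degree[node] for node in degree}


edges = [(1, 2), (1, 3), (1, 4), (2, 3)]
cp = community_participation(edges, community={1, 2, 3})
print(cp)
```

Node 1 has three neighbors, two of which belong to the community, so its value is 2/3; node 4 has a single neighbor inside the community and gets 1.0 even though it is not a member itself.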
Space Complexity
We maintain for every community c:
the set of nodes that constitute community c,
the degree of each node u ∈ V, and
the community degree of each node u ∈ c.
Number of communities might be large!
COUNT-MIN SKETCH:
Sublinear space data structure providing strong accuracy guarantees.
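For readers unfamiliar with the COUNT-MIN sketch, a minimal Python version is sketched below. It is a generic illustration of the data structure, not the implementation used in COEUS; the class name, the choice of BLAKE2b as the hash family, and the parameters are assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate counters in width x depth cells; estimates never undercount."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One salted hash per row stands in for a family of independent hash functions.
        digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, amount=1):
        for row in range(self.depth):
            self.rows[row][self._index(key, row)] += amount

    def estimate(self, key):
        # Collisions can only inflate a cell, so the minimum is the tightest estimate.
        return min(self.rows[row][self._index(key, row)] for row in range(self.depth))


sketch = CountMinSketch(width=64, depth=4)
for node in ["a", "b", "a", "c", "a"]:
    sketch.add(node)
print(sketch.estimate("a"), sketch.estimate("b"))
```

The memory footprint is fixed by width and depth, independently of how many distinct nodes are counted, which is what makes the structure attractive for node degrees on a stream.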
Our method at a glance
Initialize the communities using the seed-sets
Process the edge stream and populate the communities
Prune the communities
Termination: COEUS handles both finite & infinite streams and can be stopped at will.
Algorithm 1: COEUS
input : A set of community seed-sets 𝒦, and a graph stream S
output : A set of communities 𝒞
begin
  foreach K ∈ 𝒦 do
    C ← {};
    foreach k ∈ K do
      C[k] = 1;
    𝒞.put(C);
  while ∃(u, v) ∈ S do
    degreeV[u] += 1;
    degreeV[v] += 1;
    foreach C ∈ 𝒞 do
      if u ∈ C then
        degreeC[v] += 1;
      if v ∈ C then
        degreeC[u] += 1;
      if u ∈ C then
        C.put(v);
      if v ∈ C then
        C.put(u);
    processedElements += 1;
    if processedElements mod W == 0 then
      𝒞 ← prune(𝒞, s, degreeV, degreeC);
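The loop of Algorithm 1 can be approximated in Python as follows. This is a sketch under stated assumptions, not the authors' code: plain dictionaries replace the COUNT-MIN sketches, and the `prune` rule (keep the seeds plus the non-seed nodes with the highest degreeC / degreeV ratio, up to `max_size` nodes) is a hypothetical stand-in, since the slides do not spell pruning out.

```python
from collections import defaultdict


def prune(members, seeds, degree, community_degree, max_size):
    """Hypothetical pruning rule: keep the seeds plus the best-ranked other nodes."""
    def participation(node):
        return community_degree[node] / degree[node] if degree[node] else 0.0

    others = sorted(members - seeds, key=participation, reverse=True)
    budget = max(max_size - len(seeds), 0)
    return seeds | set(others[:budget])


def coeus(seed_sets, stream, max_size, window):
    degree = defaultdict(int)                                 # degreeV
    seeds = [set(s) for s in seed_sets]
    members = [set(s) for s in seed_sets]
    community_degree = [defaultdict(int) for _ in seed_sets]  # degreeC, per community

    for processed, (u, v) in enumerate(stream, start=1):
        degree[u] += 1
        degree[v] += 1
        for c, community in enumerate(members):
            # Membership is tested before the edge's endpoints are added.
            u_in, v_in = u in community, v in community
            if u_in:
                community_degree[c][v] += 1
            if v_in:
                community_degree[c][u] += 1
            if u_in or v_in:
                community.update((u, v))
        if processed % window == 0:  # prune every `window` stream elements
            members = [prune(m, s, degree, cd, max_size)
                       for m, s, cd in zip(members, seeds, community_degree)]
    return members


print(coeus([{1}], [(1, 2), (2, 3), (4, 5)], max_size=10, window=100))
```

With seed-set {1}, the edges (1, 2) and (2, 3) pull nodes 2 and 3 into the community, while the unrelated edge (4, 5) leaves it untouched.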
Reckoning in edge quality w.r.t. each community
PageRank-like Edge Quality variation
Updating the community degrees:
We do not consider the level of involvement of
the adjacent nodes in the community.
All nodes included in a community provide increments of 1
to all of their adjacent nodes.
Reckoning in edge quality:
We improve over a simple community degree measure
by considering the edge quality of nodes w.r.t. each community.
COEUScp
The increment for each node grows with its involvement in the community.
If this value is high, then the probability that an adjacent node is a member of the community is also high.
Algorithm 2: COEUScp
input : A set of community seed-sets 𝒦, and a graph stream S
output : A set of communities 𝒞
begin
  foreach K ∈ 𝒦 do
    C ← {};
    foreach k ∈ K do
      C[k] = 1;
    𝒞.put(C);
  while ∃(u, v) ∈ S do
    degreeV[u] += 1;
    degreeV[v] += 1;
    foreach C ∈ 𝒞 do
      if u ∈ C then
        degreeC[v] += degreeC[u] / degreeV[u];
      if v ∈ C then
        degreeC[u] += degreeC[v] / degreeV[v];
      if u ∈ C then
        C.put(v);
      if v ∈ C then
        C.put(u);
    processedElements += 1;
    if processedElements mod W == 0 then
      𝒞 ← prune(𝒞, s, degreeV, degreeC);
COEUS maintains its focus on each community
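The only change with respect to Algorithm 1 is the size of the increment. A minimal Python sketch of that single update step is shown below for one community; the function name is hypothetical, and treating the initialization C[k] = 1 as a starting community degree of 1 for each seed is an assumption.

```python
from collections import defaultdict


def process_edge(u, v, members, community_degree, degree):
    """Apply the weighted update of Algorithm 2 for one edge and one community."""
    degree[u] += 1
    degree[v] += 1
    u_in, v_in = u in members, v in members
    if u_in:
        # A neighbor's increment equals u's own participation in the community.
        community_degree[v] += community_degree[u] / degree[u]
    if v_in:
        community_degree[u] += community_degree[v] / degree[v]
    if u_in or v_in:
        members.update((u, v))


members = {1}
community_degree = defaultdict(float, {1: 1.0})  # assumed seed initialization
degree = defaultdict(int)
for edge in [(1, 2), (2, 3)]:
    process_edge(*edge, members, community_degree, degree)
print(dict(community_degree))
```

Node 2 is adjacent to the seed and receives a full increment of 1.0, whereas node 3, reached only through node 2, receives 0.5, so the influence fades with distance from the seed-set.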
Determining the size of each community
Community size
Nodes are associated with community participation values
The size of the community may be smaller
than the one COEUS examines
We need to derive automatically the size of a community!
cp values for a random COEUS community
[Figure: cp (0 to 0.04) against node rank (1 to 100), distinguishing community nodes from tail nodes]
clearly visible tail!
constant threshold value won’t work!
COEUScp
Sort the nodes with regard to their community participation value
Calculate the average distance between two consecutive nodes
Remove nodes until the distance becomes larger than the average
Algorithm 3: DROPTAIL
input : A community C and the cp values ∀u ∈ C
output : The community C after irrelevant nodes are removed
begin
  Ĉ ← reverseSort(C);
  totalDifference ← 0;
  previous ← 0;
  foreach c ∈ Ĉ do
    if previous > 0 then
      totalDifference ← totalDifference + cp(c) − previous;
    previous ← cp(c);
  averageDifference ← totalDifference / (Ĉ.size() − 1);
  previous ← 0;
  foreach c ∈ Ĉ do
    if previous > 0 then
      difference ← cp(c) − previous;
    previous ← cp(c);
    if difference < averageDifference then
      Ĉ.remove(c);
    else
      break;
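One way to read the tail-dropping idea in Python is sketched below. It rests on two assumptions that the slide does not state explicitly: `reverseSort` is taken to order nodes by increasing cp, so the scan starts at the tail, and every node below the first gap that reaches the average gap is discarded.

```python
def drop_tail(cp):
    """Remove the low-cp tail of a community given a dict node -> cp value."""
    ordered = sorted(cp, key=cp.get)  # assumed order: lowest cp first
    if len(ordered) < 2:
        return set(ordered)
    gaps = [cp[b] - cp[a] for a, b in zip(ordered, ordered[1:])]
    average_gap = sum(gaps) / len(gaps)
    for position, gap in enumerate(gaps):
        if gap >= average_gap:
            # Everything up to and including ordered[position] forms the tail.
            return set(ordered[position + 1:])
    return set(ordered)  # defensive fallback against rounding effects


cp = {"a": 0.001, "b": 0.002, "c": 0.003, "d": 0.03, "e": 0.035, "f": 0.04}
print(sorted(drop_tail(cp)))
```

In this toy input the jump from 0.003 to 0.03 is the first gap above the average, so nodes a, b, and c are dropped as the tail. Because the cut depends on the gaps rather than on a fixed cp value, it adapts to each community, which is the point made against a constant threshold.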
Experimental Evaluation
Dataset
Graphs        Type            Nodes        Edges           Avg. Degree   Avg. Community Size
DBLP          Co-authorship   317,080      1,049,866       3.31          22.45
Amazon        Co-purchasing   334,863      925,872         2.76          13.49
Youtube       Social          1,134,890    2,987,624       2.63          14.59
LiveJournal   Social          3,997,962    34,681,189      8.67          27.80
Orkut         Social          3,072,441    117,185,083     38.14         215.72
Friendster    Social          65,608,366   1,806,067,135   27.53         46.81
Networks with up to 1.8 billion links
Accompanying ground-truth communities allow for the evaluation of accuracy
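Accuracy against a ground-truth community is reported later as an F1-score. A minimal Python sketch of the standard set-based F1 between a detected and a ground-truth community follows; how the presentation aggregates scores over many communities is not stated, so only the pairwise score is shown and the function name is illustrative.

```python
def f1_score(detected, truth):
    """Harmonic mean of precision and recall between two node sets."""
    detected, truth = set(detected), set(truth)
    common = len(detected & truth)
    if common == 0:
        return 0.0
    precision = common / len(detected)  # share of detected nodes that are correct
    recall = common / len(truth)        # share of true members that were found
    return 2 * precision * recall / (precision + recall)


print(f1_score({1, 2, 3}, {2, 3, 4}))
```

Two of the three detected nodes are correct and two of the three true members are found, so precision, recall, and F1 are all 2/3.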
Impact of reckoning in edge quality
[Figure: F1-score (0 to 1) of CoEuS1 and CoEuScp on Amazon, DBLP, Youtube, LiveJournal, Orkut, and Friendster]
our variation heavily impacts the effectiveness of COEUS
Effectiveness of dropTail algorithm
[Figure: F1-score (0 to 1) of CoEuScp, CoEuScp-auto, and LEMON on Amazon, DBLP, Youtube, LiveJournal, Orkut, and Friendster]
CoEuS handles a graph stream
CoEuS is extremely competitive w.r.t. accuracy!
Execution Time Comparison
Graphs        COEUS          LEMON
Amazon        0.0458 sec     3.1197 sec
DBLP          0.0575 sec     7.2756 sec
Youtube       0.176 sec      11.3834 sec
LiveJournal   1.573 sec      28.14 sec
Orkut         7.5171 sec     −
Friendster    158.6547 sec   −
COEUS is considerably faster than previous approaches
Not indicative of COEUS speed in a streaming setting
COEUS is able to derive the communities on demand!
Space Requirements Comparison
Graphs        COEUS      LEMON
Amazon        21.36 MB   155.74 MB
DBLP          21.36 MB   156.49 MB
Youtube       21.36 MB   457.62 MB
LiveJournal   21.36 MB   2,652.99 MB
Orkut         21.36 MB   −
Friendster    21.36 MB   −
COEUS uses two COUNT-MIN sketches to hold a graph
its requirements depend only on the desired approximation quality
LEMON maintains the adjacency lists of a graph
thus, it requires significantly more space
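The constant footprint follows from how a COUNT-MIN sketch is dimensioned: with the standard bounds, an additive error of at most ε times the total count with probability at least 1 − δ needs ⌈e/ε⌉ columns and ⌈ln(1/δ)⌉ rows, regardless of the graph. The Python sketch below only illustrates that relationship; the ε, δ, and counter width used here are arbitrary examples, not the settings behind the 21.36 MB figure.

```python
import math


def count_min_dimensions(epsilon, delta):
    """Standard COUNT-MIN sizing for error epsilon * N with probability 1 - delta."""
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1 / delta))
    return width, depth


def sketch_bytes(epsilon, delta, bytes_per_counter=4):
    # bytes_per_counter is an assumed counter size, e.g. a 32-bit integer.
    width, depth = count_min_dimensions(epsilon, delta)
    return width * depth * bytes_per_counter


print(count_min_dimensions(0.001, 0.01), sketch_bytes(0.001, 0.01))
```

Tightening ε or δ grows the sketch, while feeding it a larger graph does not.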
3 Scalable Link Community Detection: A Local Dispersion-aware Approach
Community Detection
Can we extract the community structure
of a node in a network?
Contribution
We focus on the neighbors of a single node in the network to achieve efficiency and scalability
We build on:
Hierarchical Link Clustering
Dispersion-based measures
We produce a more accurate and intuitive community structure around a node for numerous real-world networks
– PL, Alexandros Ntoulas, Alex Delis: Scalable link community detection: A local dispersion-aware approach. IEEE BigData 2016 & ACAC 2016 & HDMS 2017
– PL, Alexandros Ntoulas, Alex Delis: Uncovering Local Hierarchical Overlapping Communities at Scale. Extended version undergoing review
4 Rhea: Adaptively Sampling High Quality Content from Social Activity Streams
500 million tweets sent each day!
Contribution
We study the problem of extracting high-quality samples
of a social activity stream.
Related work:
White-lists of users [GSB+12, WLP+12, GZB+13, ZBG+16].
Authoritative users through network attributes
[ZAA07, JA07, ACD+08, PC11, BBC+13] (not streams).
We propose RHEA:
A high quality content sampling algorithm that forms a
network of authorities as it processes a social activity stream,
and samples only the activity of the top-K authoritative users.
– PL, Alexandros Ntoulas, Alex Delis: Rhea: Adaptively Sampling Authoritative Content from Social Activity
Streams. IEEE BigData 2017
5 On the Impact of Social Cost in Opinion Dynamics
Formation of opinions in a social context
intrinsic belief + friends’ expressed opinions ⇒ expressed opinion
Basic notions of the model
We use:
a variation of the DeGroot model due to Friedkin and Johnsen [FJ90] and the corresponding game of [BKO11].
Each user i maintains:
An intrinsic belief s_i: remains constant
An expressed opinion z_i: updated iteratively through averaging
The cost a user suffers emanates from:
Suppressing her intrinsic belief
Disagreeing with her friends
Contribution
1 We analyze user activity in [social network logo] and verify that social interaction results in influence on opinions among the participants.
2 We implement over Spark (GraphX) a distributed algorithm
At each time step user i updates z_i to minimize her cost:
$$z_i = \frac{s_i + \sum_{j \in N(i)} w_{ij} z_j}{1 + \sum_{j \in N(i)} w_{ij}}$$
N(i): the set of nodes that i follows
w_ij: the strength of the influence of j on i
3 The algorithm terminates when z converges to the unique Nash equilibrium, where no user can further reduce her own cost
4 The resulting Nash equilibria are illustrative of how users really behave.
– PL, Katia Papakonstantinopoulou: On the Impact of Social Cost in Opinion Dynamics. AAAI ICWSM 2016 & AGATHA 2016
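The repeated averaging step can be tried out on a toy network with a few lines of Python. This single-machine sketch only illustrates the update rule and is not the Spark (GraphX) implementation; the function name, the dictionary-based inputs, and the stopping tolerance are assumptions.

```python
def best_response_dynamics(beliefs, weights, tol=1e-9, max_steps=10000):
    """Iterate z_i = (s_i + sum_j w_ij z_j) / (1 + sum_j w_ij) until it stabilizes.

    beliefs: dict user -> intrinsic belief s_i
    weights: dict user -> dict of followed user j -> influence w_ij
    """
    opinions = dict(beliefs)  # start from the intrinsic beliefs
    for _ in range(max_steps):
        updated = {}
        for i, s_i in beliefs.items():
            followed = weights.get(i, {})
            numerator = s_i + sum(w * opinions[j] for j, w in followed.items())
            denominator = 1 + sum(followed.values())
            updated[i] = numerator / denominator
        if max(abs(updated[i] - opinions[i]) for i in beliefs) < tol:
            return updated
        opinions = updated
    return opinions


beliefs = {"a": 0.0, "b": 1.0}
weights = {"a": {"b": 1.0}, "b": {"a": 1.0}}
print(best_response_dynamics(beliefs, weights))
```

For two users who follow each other with unit weights and hold beliefs 0 and 1, the opinions settle at 1/3 and 2/3: each user moves toward her friend without abandoning her own belief.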
Open Directions
1 Many opportunities for memory optimization in distributed graph processing systems
2 Ground-truth communities should better portray the functional role of a network’s nodes
3 Distributed streaming community detection
4 Authorities vs. fake news
5 Empirical analysis of the opinion formation process in other social networks
References I
[ABL10] Yong-Yeol Ahn, James P Bagrow, and Sune Lehmann.
Link communities reveal multiscale complexity in networks.
Nature, 466(7307):761–764, 2010.
[ACD+08] Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, and Gilad Mishne.
Finding high-quality content in social media.
In Proc. of the Int. Conf. on Web Search and Web Data Mining, WSDM 2008, Palo Alto, California,
USA, February 11-12, 2008, pages 183–194, 2008.
[BBC+13] Alessandro Bozzon, Marco Brambilla, Stefano Ceri, Matteo Silvestri, and Giuliano Vesci.
Choosing the right crowd: expert finding in social networks.
In Joint 2013 EDBT/ICDT Conferences, EDBT ’13 Proceedings, Genoa, Italy, March 18-22, 2013, pages
637–648, 2013.
[BBPSV04] A. Barrat, M. Barthélemy, R. Pastor-Satorras, and A. Vespignani.
The architecture of complex weighted networks.
Proc. of the National Academy of Sciences of the United States of America, 101(11):3747–3752, 2004.
[BGLL08] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre.
Fast unfolding of communities in large networks.
Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[BKO11] David Bindel, Jon M. Kleinberg, and Sigal Oren.
How bad is forming your own opinion?
In FOCS, pages 57–66, 2011.
[BV04] Paolo Boldi and Sebastiano Vigna.
The webgraph framework I: compression techniques.
In Proc. of the 13th Int. Conf. on World Wide Web, New York, NY, USA, May 17-20, pages 595–602,
2004.
References II
[CEK+15] Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi Muthukrishnan.
One trillion edges: Graph processing at facebook-scale.
PVLDB, 8(12):1804–1815, 2015.
[CNM04] Aaron Clauset, Mark EJ Newman, and Cristopher Moore.
Finding community structure in very large networks.
Physical review E, 70(6):066111, 2004.
[CRGP12] Michele Coscia, Giulio Rossetti, Fosca Giannotti, and Dino Pedreschi.
DEMON: a local-first discovery method for overlapping communities.
In Proc. of the 18th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages
615–623, 2012.
[EL09] TS Evans and R Lambiotte.
Line graphs, link partitions, and overlapping communities.
Physical Review E, 80:016105, 2009.
[FJ90] N.E. Friedkin and E.C. Johnsen.
Social influence and opinions.
Journal of Mathematical Sociology, 15(3-4):193–206, 1990.
[GN02] Michelle Girvan and Mark EJ Newman.
Community structure in social and biological networks.
Proc. of the National Academy of Sciences, 99(12):7821–7826, 2002.
[GS12] David F Gleich and C Seshadhri.
Vertex neighborhoods, low conductance cuts, and good seeds for local community methods.
In Proc. of the 18th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages
597–605, 2012.
[GSB+12] Saptarshi Ghosh, Naveen Kumar Sharma, Fabrício Benevenuto, Niloy Ganguly, and P. Krishna
Gummadi.
Cognos: crowdsourcing search for topic experts in microblogs.
In The 35th Int. ACM SIGIR Conf. on research and development in Information Retrieval, SIGIR ’12,
Portland, OR, USA, August 12-16, 2012, pages 575–590, 2012.
References III
[GZB+13] Saptarshi Ghosh, Muhammad Bilal Zafar, Parantapa Bhattacharya, Naveen Kumar Sharma, Niloy
Ganguly, and P. Krishna Gummadi.
On sampling the wisdom of crowds: random vs. expert sampling of the twitter stream.
In 22nd ACM Int. Conf. on Information and Knowledge Management, CIKM’13, San Francisco, CA,
USA, October 27 - November 1, 2013, pages 1739–1744, 2013.
[HMBL17] A. Hollocou, J. Maudet, T. Bonald, and M. Lelarge.
A linear streaming algorithm for community detection in very large networks.
ArXiv e-prints, March 2017.
[HSB+15] Kun He, Yiwei Sun, David Bindel, John E. Hopcroft, and Yixuan Li.
Detecting overlapping communities from local spectral subspaces.
In IEEE International Conference on Data Mining, Atlantic City, NJ, USA, pages 769–774, 2015.
[JA07] Pawel Jurczyk and Eugene Agichtein.
Discovering authorities in question answer communities by using link analysis.
In Proc. of the 16th ACM Conf. on Information and Knowledge Management, CIKM 2007, Lisbon,
Portugal, November 6-10, 2007, pages 919–922, 2007.
[KBG] Aapo Kyrola, Guy E. Blelloch, and Carlos Guestrin.
GraphChi: Large-scale graph computation on just a PC.
In 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012,
Hollywood, CA, USA, October 8-10, pages 31–46.
[KG14] Kyle Kloster and David F. Gleich.
Heat kernel based community detection.
In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
New York, NY, USA, pages 1386–1395, 2014.
[KTS+12] U. Kang, Hanghang Tong, Jimeng Sun, Ching-Yung Lin, and Christos Faloutsos.
GBASE: an efficient analysis platform for large graphs.
VLDB J., 21(5):637–650, 2012.
References IV
[LH17] Hang Liu and H. Howie Huang.
Graphene: Fine-grained IO management for graph computing.
In 15th USENIX Conference on File and Storage Technologies (FAST 17), pages 285–300, Santa Clara,
CA, 2017. USENIX Association.
[LHBH15] Yixuan Li, Kun He, David Bindel, and John E Hopcroft.
Uncovering the small community structure in large networks: A local spectral approach.
In Proc. of the 24th Int. Conf. on World Wide Web, pages 658–668, 2015.
[MAF] Mary McGlohon, Leman Akoglu, and Christos Faloutsos.
Weighted graphs and disconnected components: patterns and a generator.
In Proc. of the 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Las Vegas,
Nevada, USA, August 24-27, 2008, pages 524–532.
[NG04] M. E. J. Newman and M. Girvan.
Finding and evaluating community structure in networks.
Phys. Rev. E, 69(2):026113, February 2004.
[PC11] Aditya Pal and Scott Counts.
Identifying topical authorities in microblogs.
In Proc. of the 4th International Conference on Web Search and Web Data Mining, WSDM 2011,
Hong Kong, China, February 9-12, 2011, pages 45–54, 2011.
[PDFV05] Gergely Palla, Imre Derényi, Illés Farkas, and Tamás Vicsek.
Uncovering the overlapping community structure of complex networks in nature and society.
Nature, 435(7043):814–818, 2005.
[PL05] Pascal Pons and Matthieu Latapy.
Computing communities in large networks using random walks.
In Computer and Information Sciences-ISCIS 2005, pages 284–293. 2005.
[RB11] Martin Rosvall and Carl T Bergstrom.
Multilevel compression of random walks on networks reveals hierarchical organization in large
integrated systems.
PloS one, 6(4):e18209, 2011.
References V
[SDB15] Julian Shun, Laxman Dhulipala, and Guy E. Blelloch.
Smaller and faster: Parallel processing of compressed graphs with Ligra+.
In 2015 Data Compression Conference, DCC 2015, Snowbird, UT, USA, April 7-9, pages 403–412, 2015.
[WGD13] Joyce Jiyoung Whang, David F Gleich, and Inderjit S Dhillon.
Overlapping community detection using seed set expansion.
In Proc. of the 22nd ACM Int. Conf. on Information & Knowledge Management, pages 2099–2108,
2013.
[WLP+12] Claudia Wagner, Vera Liao, Peter Pirolli, Les Nelson, and Markus Strohmaier.
It’s not in their tweets: Modeling topical expertise of Twitter users.
In 2012 Int. Conf. on Privacy, Security, Risk and Trust, PASSAT 2012, and 2012 Int. Conf. on Social
Computing, SocialCom 2012, Amsterdam, Netherlands, September 3-5, 2012, pages 91–100, 2012.
[YL13] Jaewon Yang and Jure Leskovec.
Overlapping community detection at scale: a nonnegative matrix factorization approach.
In Proc. of the 6th ACM int. Conf. on Web Search and Data Mining, pages 587–596, 2013.
[YLP14] Se-Young Yun, Marc Lelarge, and Alexandre Proutière.
Streaming, memory limited algorithms for community detection.
In Advances in Neural Information Processing Systems 27: Annual Conference on Neural
Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages
3167–3175, 2014.
[ZAA07] Jun Zhang, Mark S. Ackerman, and Lada A. Adamic.
Expertise networks in online communities: structure and algorithms.
In Proc. of the 16th Int. Conf. on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12,
2007, pages 221–230, 2007.
[ZB15] Anita Zakrzewska and David A. Bader.
A dynamic algorithm for local community detection in graphs.
In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks
Analysis and Mining, ASONAM 2015, Paris, France, August 25 - 28, 2015, pages 559–564, 2015.
References VI
[ZBG+16] Muhammad Bilal Zafar, Parantapa Bhattacharya, Niloy Ganguly, Saptarshi Ghosh, and Krishna P.
Gummadi.
On the wisdom of experts vs. crowds: Discovering trustworthy topical news in microblogs.
In Proc. of the 19th ACM Conf. on Computer-Supported Cooperative Work & Social Computing,
CSCW 2016, San Francisco, CA, USA, February 27 - March 2, 2016, pages 437–450, 2016.
[ZMB+15] Da Zheng, Disa Mhembere, Randal Burns, Joshua Vogelstein, Carey E. Priebe, and Alexander S.
Szalay.
FlashGraph: Processing billion-node graphs on an array of commodity SSDs.
In 13th USENIX Conference on File and Storage Technologies (FAST 15), pages 45–58, Santa Clara,
CA, 2015. USENIX Association.
thank you!
Special thanks to:
Alex Delis, Katia Papakonstantinopoulou, Alexandros Ntoulas,
Michael Sioutis, Nikos Leonardos, Katerina El Raheb & Alexis Antoniadis

Distributed and Streaming Graph Processing Techniques

  • 1. DISTRIBUTED AND STREAMING GRAPH PROCESSING TECHNIQUES P. N. LIAKOS University of Athens, June 15th, 2018 For the Fat Lady
  • 3. Say you wish to study ... and now imagine that the data is available ...
  • 4. ... but it’s more than you can handle! 2 billion users! 500 million tweets per day! 1 trillion URLs!
  • 5. Large Scale Graph Processing. Active research areas: Distributed Graph Processing; Graph Streams. [Figure: a Pregel-like graph processing system with three workers.]
  • 6. Our Contribution. Distributed Graph Processing (Pregel paradigm): Memory Optimization; Opinion Formation; Local Community Detection. Graph Streams: Streaming Community Detection; Sampling High Quality Content.
  • 7. Outline: (1) Realizing Memory-Optimized Distributed Graph Processing; (2) COEUS: Community detection via seed-set expansion on graph streams; (3) Scalable Link Community Detection: A Local Dispersion-aware Approach; (4) Rhea: Adaptively Sampling High Quality Content from Social Activity Streams; (5) On the Impact of Social Cost in Opinion Dynamics.
  • 9. Apache Giraph Memory Usage [CEK+15]: memory is used ineffectively; partitioning makes compression harder.
  • 10. Related memory optimization approaches. Apache Giraph [CEK+15]: does not exploit the redundancy in real-world graphs. Ligra+ [SDB15]: compression techniques on a shared-memory system; halved space usage at the cost of slower execution. GBASE [KTS+12]: does not follow the vertex-centric model; requires decompression. GraphChi [KBG], FlashGraph [ZMB+15] & Graphene [LH17]: maintain graph data on disks; reasonable performance with very modest requirements.
  • 11. Contribution. We present a number of novel techniques that (1) offer space-efficient representations of out-edges, (2) allow fast in-situ mining of the graph elements without the need for decompression, (3) enable the execution of graph algorithms in memory-constrained settings, and (4) ease the task of memory management, thus allowing faster execution. Publications: PL, Katia Papakonstantinopoulou, Alex Delis: Memory-Optimized Distributed Graph Processing through Novel Compression Techniques. ACM CIKM 2016 & ACAC 2016; PL, Katia Papakonstantinopoulou, Alex Delis: Realizing Memory-Optimized Distributed Graph Processing. IEEE Trans. Knowl. Data Eng. 30(4), 2018.
  • 12. Properties of real-world graphs. Locality of reference: the majority of the edges of a graph link vertices that are close to each other in the order. Similarity (or copy property): vertices that are close to each other in the order tend to have many common out-neighbors. Example URLs shown: http://www.di.uoa.gr/, http://www.di.uoa.gr/css/foostyle.css, http://www.di.uoa.gr/images/logo.gif, http://www.di.uoa.gr/about/, http://www.di.uoa.gr/staff/, http://www.di.uoa.gr/about/, http://www.di.uoa.gr/css/foostyle.css, http://www.di.uoa.gr/images/logo.gif, http://www.di.uoa.gr/directions.html, http://www.di.uoa.gr/about/, http://www.di.uoa.gr/staff/
  • 14. BVEdges (based on [BV04]): the out-neighbors 2, 9, 10, 11, 12, 14, 17, 18, 20, 127 are split into the interval 9-12 and the residuals 2, 14, 17, 18, 20, 127. [Figure: bit-level layout of the number of intervals, the interval, and the residuals; fields shown: (1)_2, γ(0), (9)_2, ζ(13), ζ(11), ζ(2), ζ(0), ζ(1), ζ(106); widths shown: 4 bytes, 4 bytes, 1 bit, 7 bits, 7 bits, 4 bits, 3 bits, 4 bits, 11 bits.] Bit-encoding involves significant overhead.
  • 16. IntervalResidualEdges: the out-neighbors 2, 9, 10, 11, 12, 14, 17, 18, 20, 127 are split into the intervals 9-12 and 17-18 and the residuals 2, 14, 20, 127. Layout: number of intervals, (2)_2, 4 bytes; 1st interval, (9)_2 (4)_2, 4 + 1 bytes; 2nd interval, (17)_2 (2)_2, 4 + 1 bytes; residuals (2)_2, (14)_2, (20)_2, (127)_2, 4 bytes each. Avoids expensive encodings & bit streams; significant compression due to locality of reference.
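The interval/residual split on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the Giraph implementation: the function names, the minimum interval length of 2 (chosen because the slide treats 17-18 as an interval), and the size formula read off the slide's layout (4 bytes for the interval count, 4 + 1 bytes per interval, 4 bytes per residual) are all assumptions.

```python
from typing import List, Tuple

def split_intervals(neighbors: List[int], min_len: int = 2) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Split sorted neighbor IDs into (start, length) intervals and residuals.

    min_len is an assumed threshold: runs shorter than it stay residuals.
    """
    intervals, residuals = [], []
    i = 0
    while i < len(neighbors):
        j = i
        # Extend the run while the next ID is consecutive.
        while j + 1 < len(neighbors) and neighbors[j + 1] == neighbors[j] + 1:
            j += 1
        run = j - i + 1
        if run >= min_len:
            intervals.append((neighbors[i], run))
        else:
            residuals.extend(neighbors[i:j + 1])
        i = j + 1
    return intervals, residuals

def encoded_size(intervals, residuals) -> int:
    """Bytes used under the layout assumed from the slide."""
    return 4 + 5 * len(intervals) + 4 * len(residuals)

def decode(intervals, residuals) -> List[int]:
    """Rebuild the sorted neighbor list."""
    out = list(residuals)
    for start, length in intervals:
        out.extend(range(start, start + length))
    return sorted(out)

neighbors = [2, 9, 10, 11, 12, 14, 17, 18, 20, 127]
intervals, residuals = split_intervals(neighbors)
print(intervals, residuals, encoded_size(intervals, residuals))
```

For the slide's example this yields the intervals (9, 4) and (17, 2) with residuals 2, 14, 20, 127, i.e. 30 bytes under the assumed layout against 40 bytes for ten plain 4-byte IDs.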
  • 18. IndexedBitArrayEdges: the out-neighbors 2, 9, 10, 11, 12, 14, 17, 18, 20, 127 are grouped as {2}, {9, 10, 11, 12, 14}, {17, 18, 20}, {127}. [Figure: each group is stored as a 4-byte index, here (0)_2, (1)_2, (2)_2, ..., (15)_2, followed by a 1-byte bit array.] Avoids expensive encodings & bit streams; significant compression due to locality of reference; memory-efficient retrieval of out-edges.
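A minimal Python sketch of the grouping shown above, assuming that a neighbor ID v falls into group v // 8 and sets bit v % 8 of that group's byte. That rule is inferred from the slide's indices 0, 1, 2 and 15; the function names and the bit order inside the byte are hypothetical.

```python
from typing import Dict, List

def to_indexed_bits(neighbors: List[int]) -> Dict[int, int]:
    """Map each group index (v // 8) to a one-byte bitmap of its members."""
    groups: Dict[int, int] = {}
    for v in neighbors:
        groups[v // 8] = groups.get(v // 8, 0) | (1 << (v % 8))
    return groups

def from_indexed_bits(groups: Dict[int, int]) -> List[int]:
    """Iterate the out-neighbors directly from the bitmaps, in sorted order."""
    out = []
    for index in sorted(groups):
        bits = groups[index]
        for offset in range(8):
            if bits & (1 << offset):
                out.append(index * 8 + offset)
    return out

neighbors = [2, 9, 10, 11, 12, 14, 17, 18, 20, 127]
groups = to_indexed_bits(neighbors)
size_in_bytes = 5 * len(groups)  # assumed: 4-byte index + 1-byte bitmap per group
print(sorted(groups), size_in_bytes)
```

With four groups the example takes 20 bytes under this assumed layout, and the neighbors can be enumerated without materializing a decompressed list.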
  • 20. VariableByteWeights: the weights 32 and 378 are written as {32} and {256 + 122}. [Figure: bit layout in which the 1st weight takes 1 byte and the 2nd weight takes 2 bytes.] Weights of edges exhibit right-skewed distributions [BBPSV04, MAF]; ⌊log_128(n)⌋ + 1 bytes suffice to represent an integer n ≥ 1.
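Variable-byte coding can be sketched as follows. The convention used here (7 payload bits per byte, low-order group first, high bit set on every byte except the last) is a common one and is an assumption; the slide's exact bit layout did not survive extraction, and the function names are hypothetical.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer using 7 payload bits per byte."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)         # last byte: continuation bit clear
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Decode a single integer produced by encode_varint."""
    value, shift = 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return value

small, large = encode_varint(32), encode_varint(378)
print(len(small), len(large))
```

The weight 32 fits in one byte, while 378 = 2 · 128 + 122 needs two, matching the ⌊log_128(n)⌋ + 1 byte count.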
  • 22. RedBlackTreeEdges: the out-neighbors 2, 9, 10, 11, 12, 14, 17, 18, 20, 127 are kept in a red-black tree. Per node: key: 8 bytes (long); left: 4 bytes (compressed oop); right: 4 bytes (compressed oop); color: 1 byte (boolean); depth: 2 bytes (short). Does not waste space for empty buckets; significant savings using primitive data types; cost-free iterations with regard to space.
  • 23. Experimental Evaluation. Space efficiency. Performance: when the available memory is not constrained; when the available memory is constrained; when adding/removing edges.
  • 24. Space Efficiency Comparison
       graph | ByteArrayEdges | BVEdges | IntervalResidualEdges | IndexedBitArrayEdges
       uk-2007-05@100000 | 22.61 MB | 6.41 MB (0.96 MB) | 7.92 MB | 8.91 MB
       uk-2007-05@1000000 | 279.16 MB | 67.36 MB (10.54 MB) | 82.7 MB | 97.79 MB
       ljournal-2008 | 866.36 MB | 386.73 MB (117.68 MB) | 497.52 MB | 648.52 MB
       indochina-2004 | 1,511.67 MB | 442.34 MB (48.03 MB) | 646.03 MB | 554.23 MB
       hollywood-2011 | 1,381.91 MB | 287.53 MB (145.85 MB) | 613.52 MB | 676.88 MB
       uk-2002 | 2,733.6 MB | 1,092.82 MB (116.39 MB) | 1,224.07 MB | 1,255.67 MB
       arabic-2005 | 4,820.09 MB | 1,428.97 MB (187.58 MB) | 1,674.75 MB | 1,849.83 MB
       uk-2005 | 7,401.88 MB | 2,383.54 MB (279.45 MB) | 2,728.74 MB | 2,928.81 MB
       twitter-2010 | 11,189.88 MB | 4,628.48 MB (2,600.07 MB) | 7,127.76 MB | 8,888.50 MB
       sk-2005 | 14,829.64 MB | 4,889.85 MB (607.92 MB) | 5,657.79 MB | 6,354.17 MB
       BVEdges < 40% ByteArrayEdges; impressive savings with all techniques.
  • 25. Performance / small-scale graphs. [Figure: execution time (in minutes) with 8, 4, and 2 workers for ByteArrayEdges, BVEdges, IntervalResidualEdges, and IndexedBitArrayEdges.] BVEdges is inferior speedwise; IntervalResidualEdges and IndexedBitArrayEdges already show signs of improvement.
  • 26. Performance / large-scale graphs. [Figure: execution time (in minutes) per superstep of PageRank execution for ByteArrayEdges, BVEdges, IntervalResidualEdges, and IndexedBitArrayEdges.] ByteArrayEdges’s performance fluctuates due to excessive garbage collection.
  • 27. Performance / large-scale graphs. [Figure: execution time (in minutes) on uk-2005 with 5 workers and with 4 workers for ByteArrayEdges, BVEdges, IntervalResidualEdges, and IndexedBitArrayEdges; one run is marked FAILED.] IntervalResidualEdges is faster than ByteArrayEdges; IndexedBitArrayEdges outperforms all.
  • 28. Performance / algorithms involving mutations. [Figure: execution time (min) against the maximum mutations allowed (10 to 100) for HashMapEdges and RedBlackTreeEdges.] RedBlackTreeEdges requires less than half of the space that HashMapEdges needs; the performance of HashMapEdges deteriorates significantly as the number of allowed mutations grows.
  • 29. Outline: (1) Realizing Memory-Optimized Distributed Graph Processing; (2) COEUS: Community detection via seed-set expansion on graph streams; (3) Scalable Link Community Detection: A Local Dispersion-aware Approach; (4) Rhea: Adaptively Sampling High Quality Content from Social Activity Streams; (5) On the Impact of Social Cost in Opinion Dynamics. Next section: COEUS.
  • 33. Climate change conversation on Twitter (carbonbrief.org): real-world networks are massive, change rapidly, and exhibit community structure!
  • 34. Motivation. We want to extract the community structure of nodes in a network that changes rapidly. Many useful applications: we can provide more informative & engaging social network feeds; we can enhance the efficiency of recommender systems. The size of graph data appears to be ever-increasing: Facebook has more than 2 billion registered users; Google indexes more than 1 trillion unique URLs.
  • 35. Our context. [Figure: communities initialized with seed-sets and populated from a graph stream of node pairs.]
  • 36. Related Work - Static Graph. Non-overlapping algorithms [GN02, NG04, BGLL08, CNM04, PL05, RB11]: edge betweenness; modularity maximization; random walks. Overlapping algorithms [PDFV05, ABL10, EL09]: clique percolation; hierarchical link clustering. More scalable overlapping algorithms [CRGP12, YL13, GS12, WGD13]: egonets; matrix factorization; seed-set expansion. Local algorithms [KG14, LHBH15, HSB+15]: focus is shifted to local structure; seed-set expansion.
  • 37. Related Work - Graph Stream / Dynamic Graph. Yun et al. [YLP14]: rows of the adjacency matrix of the graph are revealed sequentially. Zakrzewska and Bader [ZB15]: dynamic graphs; seed set expansion; incrementally adjust to dynamic changes. Hollocou et al. [HMBL17]: if we pick an edge of the graph uniformly at random, this edge is more likely to link nodes of the same community than nodes from distinct communities; if we process edges in a random order, we expect many intra-community edges to arrive before the inter-community edges.
  • 38. Contribution. We propose COEUS: a novel community detection algorithm that operates on a graph stream, using space sublinear in the number of edges. We also suggest: a PageRank-like Edge Quality Variation; a Novel Clustering Technique for Community Size Determination. We are extremely competitive with non-streaming approaches and our execution time and space requirements are astonishingly low. Publication: PL, Alexandros Ntoulas, Alex Delis: COEUS: Community detection via seed-set expansion on graph streams. IEEE BigData 2017 & HDMS 2018.
  • 40. COEUS* (* the axis of heaven around which the constellations revolved): Community detection O via seed-set Expansion U on graph Streams
  • 42. Community Participation Value. No universal definition of what a community is! We define the community participation of node u in community C as $cp(u) = \frac{|\{(u, v) \in E : v \in C\}|}{|\{(u, v) \in E\}|}$, the fraction of its adjacent nodes in the graph that are part of the community. Our evaluation does not consider a particular quality function. Effectiveness is measured using ground-truth communities.
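As a small worked example of the definition above, the following Python sketch computes cp(u) on a static, undirected edge list. The function name, the edge-list representation, and the choice to return 0 for an isolated node are assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

def community_participation(u: int, edges: Iterable[Tuple[int, int]], community: Set[int]) -> float:
    """Fraction of u's incident edges whose other endpoint lies in the community."""
    total = inside = 0
    for a, b in edges:
        if a == u or b == u:
            other = b if a == u else a
            total += 1
            inside += other in community
    return inside / total if total else 0.0

edges = [(1, 2), (1, 3), (1, 4), (2, 3)]
cp_of_1 = community_participation(1, edges, {2, 3})
print(cp_of_1)
```

Node 1 has three neighbors, two of which belong to the community {2, 3}, so its participation value is 2/3.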
  • 43. Space Complexity. We maintain for every community c: the set of nodes that constitute community c, the degree of each node u ∈ V, and the community degree of each node u ∈ c. The number of communities might be large! COUNT-MIN SKETCH: sublinear-space data structure providing strong accuracy guarantees.
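To make the role of the Count-Min sketch concrete, here is a minimal Python version that could stand in for an exact degree table. The class name, the default width and depth, and the use of salted BLAKE2b hashes are illustrative choices, not details taken from the slides.

```python
import hashlib

class CountMinSketch:
    """Approximate counters in width * depth cells, independent of the number of keys."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key) -> int:
        # One salted hash per row plays the role of an independent hash function.
        digest = hashlib.blake2b(repr(key).encode(), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, amount=1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += amount

    def estimate(self, key):
        # With non-negative increments the minimum never underestimates the true count.
        return min(self.table[row][self._index(row, key)] for row in range(self.depth))

sketch = CountMinSketch()
for _ in range(3):
    sketch.add(42)  # e.g. node 42 appears in three streamed edges
print(sketch.estimate(42))
```

Memory depends only on width and depth, which is why the space used can stay fixed while the stream grows; collisions can only inflate an estimate, never shrink it.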
  • 44. Our method at a glance. Initialize the communities using the seed-sets; process the edge stream and populate the communities; prune the communities. Termination: COEUS handles both finite & infinite streams and can be stopped at will.
       Algorithm 1: COEUS
       input : a set of community seed-sets 𝒦 and a graph stream S
       output: a set of communities 𝒞
       begin
         foreach K ∈ 𝒦 do
           C ← {}
           foreach k ∈ K do C[k] = 1
           𝒞.put(C)
         while ∃ (u, v) ∈ S do
           degreeV[u] += 1; degreeV[v] += 1
           foreach C ∈ 𝒞 do
             if u ∈ C then degreeC[v] += 1
             if v ∈ C then degreeC[u] += 1
             if u ∈ C then C.put(v)
             if v ∈ C then C.put(u)
           processedElements += 1
           if processedElements mod W == 0 then
             𝒞 ← prune(𝒞, s, degreeV, degreeC)
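Algorithm 1 can be approximated in Python as below. This is a sketch under several assumptions: exact dictionaries replace the Count-Min sketches, seeds start with a community degree of 1 (one reading of `C[k] = 1`), and `prune` keeps the `max_size` nodes with the highest community-degree to degree ratio, since the slides do not spell out the pruning rule. All names are hypothetical.

```python
from collections import defaultdict

def prune(members, cdeg, degree, max_size):
    """Assumed pruning rule: keep the max_size nodes with the highest cdeg/degree."""
    if len(members) <= max_size:
        return
    ranked = sorted(members, key=lambda n: cdeg[n] / max(degree.get(n, 0), 1), reverse=True)
    for node in ranked[max_size:]:
        members.discard(node)
        cdeg.pop(node, None)

def coeus(seed_sets, stream, window=1000, max_size=100):
    degree = defaultdict(int)
    communities = [set(seeds) for seeds in seed_sets]
    comm_degree = [defaultdict(int, {k: 1 for k in seeds}) for seeds in seed_sets]
    processed = 0
    for u, v in stream:
        degree[u] += 1
        degree[v] += 1
        for members, cdeg in zip(communities, comm_degree):
            u_in, v_in = u in members, v in members
            if u_in:
                cdeg[v] += 1
            if v_in:
                cdeg[u] += 1
            if u_in or v_in:
                members.update((u, v))  # the neighbor of a member joins the community
        processed += 1
        if processed % window == 0:
            for members, cdeg in zip(communities, comm_degree):
                prune(members, cdeg, degree, max_size)
    return communities, comm_degree, degree

communities, comm_degree, degree = coeus([{1}], [(1, 2), (2, 3), (4, 5)])
print(communities[0])
```

Each edge is seen once and then discarded, so the function works on any iterable, including a generator over an unbounded stream that the caller stops at will.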
  • 45. Reckoning in edge quality w.r.t. each community
  • 46. PageRank-like Edge Quality variation. Updating the community degrees: we do not consider the level of involvement of the adjacent nodes in the community; all nodes included in a community provide increments of 1 to all of their adjacent nodes. Reckoning in edge quality: we improve over a simple community degree measure by considering the edge quality of nodes w.r.t. each community.
  • 47. COEUScp. The increment for each node grows with its involvement in the community. If this value is high, then the probability that an adjacent node is a member of the community is also high. COEUS maintains its focus in each community.
       Algorithm 2: COEUScp
       input : a set of community seed-sets 𝒦 and a graph stream S
       output: a set of communities 𝒞
       begin
         foreach K ∈ 𝒦 do
           C ← {}
           foreach k ∈ K do C[k] = 1
           𝒞.put(C)
         while ∃ (u, v) ∈ S do
           degreeV[u] += 1; degreeV[v] += 1
           foreach C ∈ 𝒞 do
             if u ∈ C then degreeC[v] += degreeC[u] / degreeV[u]
             if v ∈ C then degreeC[u] += degreeC[v] / degreeV[v]
             if u ∈ C then C.put(v)
             if v ∈ C then C.put(u)
           processedElements += 1
           if processedElements mod W == 0 then
             𝒞 ← prune(𝒞, s, degreeV, degreeC)
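The only change relative to the unit-increment variant is the size of the increment, which the following self-contained Python sketch isolates. Pruning is left out for brevity, exact dictionaries stand in for the sketches, and seeds are assumed to start with a community degree of 1 so that their first increments are non-zero; the function name is hypothetical.

```python
from collections import defaultdict

def coeus_cp(seed_sets, stream):
    degree = defaultdict(int)
    communities = [set(seeds) for seeds in seed_sets]
    comm_degree = [defaultdict(float, {k: 1.0 for k in seeds}) for seeds in seed_sets]
    for u, v in stream:
        degree[u] += 1
        degree[v] += 1
        for members, cdeg in zip(communities, comm_degree):
            u_in, v_in = u in members, v in members
            if u_in:
                # u passes on its own participation instead of a flat +1.
                cdeg[v] += cdeg[u] / degree[u]
            if v_in:
                cdeg[u] += cdeg[v] / degree[v]
            if u_in or v_in:
                members.update((u, v))
    return communities, comm_degree, degree

communities, comm_degree, degree = coeus_cp([{1}], [(1, 2), (2, 3)])
print(dict(comm_degree[0]))
```

In the toy stream, node 2 receives a full unit from seed 1, while node 3 receives only 0.5 because node 2 has spent half of its edges inside the community by then.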
  • 48. Determining the size of each community
  • 49. Community size. Nodes are associated with community participation values. The size of the community may be smaller than the one COEUS examines. We need to derive the size of a community automatically!
  • 52. cp values for a random COEUS community. [Figure: cp (0 to 0.04) against rank (25 to 100), distinguishing community nodes from tail nodes.] Clearly visible tail! A constant threshold value won’t work!
  • 53. COEUScp. Sort the nodes with regard to their community participation value; calculate the average distance between two consecutive nodes; remove nodes until the distance becomes larger than the average.
       Algorithm 3: DROPTAIL
       input : a community C and the cp values ∀u ∈ C
       output: the community C after irrelevant nodes are removed
       begin
         Ĉ ← reverseSort(C)
         totalDifference ← 0; previous ← 0
         foreach c ∈ Ĉ do
           if previous > 0 then totalDifference ← totalDifference + (cp(c) − previous)
           previous ← cp(c)
         averageDifference ← totalDifference / (Ĉ.size() − 1)
         previous ← 0
         foreach c ∈ Ĉ do
           if previous > 0 then difference ← cp(c) − previous
           previous ← cp(c)
           if difference < averageDifference then Ĉ.remove(c)
           else break
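One way to read the three steps above is sketched in Python below. It assumes the nodes are walked from the lowest cp value upwards, so that the flat tail is dropped up to the first gap that is at least as large as the average gap; the flattened pseudocode leaves the sort direction ambiguous, so this is an interpretation rather than the authors' exact procedure, and `drop_tail` is a hypothetical name.

```python
from typing import Dict, Hashable, Set

def drop_tail(cp: Dict[Hashable, float]) -> Set[Hashable]:
    """Return the nodes that remain after the low-cp tail is removed."""
    ranked = sorted(cp, key=cp.get)  # ascending: tail nodes come first
    if len(ranked) < 3:
        return set(ranked)
    gaps = [cp[b] - cp[a] for a, b in zip(ranked, ranked[1:])]
    average = sum(gaps) / len(gaps)
    if average == 0:
        return set(ranked)  # all values equal: no tail to speak of
    cut = 0
    for i, gap in enumerate(gaps):
        if gap >= average:
            cut = i + 1  # everything before the first large jump is tail
            break
    return set(ranked[cut:])

cp = {"a": 0.001, "b": 0.0012, "c": 0.0011, "d": 0.02, "e": 0.03, "f": 0.04}
kept = drop_tail(cp)
print(sorted(kept))
```

Because the cut-off is derived from the gaps of each community's own cp values, no global constant threshold is needed.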
  • 54. Experimental Evaluation
  • 55. Dataset
       Graphs | Type | Nodes | Edges | Av. Degree | Av. Community Size
       DBLP | Co-authorship | 317,080 | 1,049,866 | 3.31 | 22.45
       Amazon | Co-purchasing | 334,863 | 925,872 | 2.76 | 13.49
       Youtube | Social | 1,134,890 | 2,987,624 | 2.63 | 14.59
       LiveJournal | Social | 3,997,962 | 34,681,189 | 8.67 | 27.80
       Orkut | Social | 3,072,441 | 117,185,083 | 38.14 | 215.72
       Friendster | Social | 65,608,366 | 1,806,067,135 | 27.53 | 46.81
       Networks exceeding 1.8 billion links. Accompanying ground-truth communities allow for the evaluation of accuracy.
  • 56. Impact of factoring in edge quality
      [Figure: F1-score (0 to 1) of CoEuS1 and CoEuScp on Amazon, DBLP, Youtube, LiveJournal, Orkut, and Friendster]
      Our variation heavily impacts the effectiveness of COEUS.
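The F1-score on the vertical axis of these plots compares a detected community with a ground-truth one. A minimal sketch of the set-based definition follows. How the scores were aggregated over seeds and communities in the talk is not stated here, so the function below covers a single pair only, and its name is illustrative.

```python
from typing import Hashable, Iterable


def f1_score(detected: Iterable[Hashable], ground_truth: Iterable[Hashable]) -> float:
    """Harmonic mean of precision and recall between two node sets."""
    detected_set, truth_set = set(detected), set(ground_truth)
    overlap = len(detected_set & truth_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(detected_set)  # share of detected nodes that are correct
    recall = overlap / len(truth_set)        # share of true members that were found
    return 2 * precision * recall / (precision + recall)
```

For example, a detected community {1, 2, 3, 4} against a ground truth {3, 4, 5, 6, 7, 8} has precision 1/2 and recall 1/3, which gives an F1-score of 0.4.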
  • 58. Effectiveness of dropTail algorithm
      [Figure: F1-score (0 to 1) of CoEuScp, CoEuScp-auto, and LEMON on Amazon, DBLP, Youtube, LiveJournal, Orkut, and Friendster]
      CoEuS handles a graph stream.
      CoEuS is extremely competitive w.r.t. accuracy!
  • 61. Execution Time Comparison
      Graph        COEUS         LEMON
      Amazon       0.0458 sec    3.1197 sec
      DBLP         0.0575 sec    7.2756 sec
      Youtube      0.176 sec     11.3834 sec
      LiveJournal  1.573 sec     28.14 sec
      Orkut        7.5171 sec    −
      Friendster   158.6547 sec  −
      COEUS is considerably faster than previous approaches (roughly 18x to 127x faster than LEMON on the four graphs with a reported LEMON time).
      These timings are not indicative of COEUS speed in a streaming setting.
      COEUS is able to derive the communities on demand!
  • 62. Space Requirements Comparison
      Graph        COEUS     LEMON
      Amazon       21.36 MB  155.74 MB
      DBLP         21.36 MB  156.49 MB
      Youtube      21.36 MB  457.62 MB
      LiveJournal  21.36 MB  2,652.99 MB
      Orkut        21.36 MB  −
      Friendster   21.36 MB  −
      COEUS uses two COUNT-MIN sketches to hold a graph; its requirements depend only on the desired approximation quality.
      LEMON maintains the adjacency lists of a graph; thus, it requires significantly more space.
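The constant 21.36 MB column follows from how a Count-Min sketch is sized. The generic sketch below uses the standard sizing (width ⌈e/ε⌉, depth ⌈ln(1/δ)⌉) to show that the number of counters is fixed by the error parameters alone. What COEUS stores in each of its two sketches is not described on this slide, so nothing here is specific to COEUS. The salted built-in `hash` is a simplification standing in for a proper pairwise-independent hash family.

```python
import math
import random
from typing import Hashable


class CountMinSketch:
    """Generic Count-Min sketch; its size depends only on epsilon and delta."""

    def __init__(self, epsilon: float, delta: float, seed: int = 0) -> None:
        self.width = math.ceil(math.e / epsilon)       # counters per row
        self.depth = math.ceil(math.log(1.0 / delta))  # number of rows
        rng = random.Random(seed)
        # Simplification: one random salt per row instead of a hash family.
        self._salts = [rng.getrandbits(64) for _ in range(self.depth)]
        self._rows = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row: int, item: Hashable) -> int:
        return hash((self._salts[row], item)) % self.width

    def add(self, item: Hashable, count: int = 1) -> None:
        for row in range(self.depth):
            self._rows[row][self._index(row, item)] += count

    def estimate(self, item: Hashable) -> int:
        # Never underestimates when all updates are non-negative.
        return min(self._rows[row][self._index(row, item)]
                   for row in range(self.depth))

    def cells(self) -> int:
        return self.width * self.depth
```

Feeding such a sketch a million edges or a billion edges leaves `cells()` unchanged, whereas adjacency lists grow with every new edge.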
  • 63. Outline
      1. Realizing Memory-Optimized Distributed Graph Processing
      2. COEUS: Community detection via seed-set expansion on graph streams
      3. Scalable Link Community Detection: A Local Dispersion-aware Approach
      4. Rhea: Adaptively Sampling High Quality Content from Social Activity Streams
      5. On the Impact of Social Cost in Opinion Dynamics
      Next: 3. Scalable Link Community Detection: A Local Dispersion-aware Approach
  • 64. Community Detection
      Can we extract the community structure of a node in a network?
  • 65. Contribution
      We focus on the neighbors of a single node in the network to achieve efficiency and scalability.
      We build on:
        Hierarchical Link Clustering
        Dispersion-based measures
      We produce a more accurate and intuitive community structure around a node for numerous real-world networks.
      – PL, Alexandros Ntoulas, Alex Delis: Scalable link community detection: A local dispersion-aware approach. IEEE BigData 2016 & ACAC 2016 & HDMS 2017
      – PL, Alexandros Ntoulas, Alex Delis: Uncovering Local Hierarchical Overlapping Communities at Scale. Extended version undergoing review
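As background on the first building block: Hierarchical Link Clustering, as described in [ABL10], clusters edges rather than nodes. It scores two edges that share a node by the Jaccard overlap of the inclusive neighborhoods of their other endpoints. The sketch below computes only that score on a toy adjacency dict. It does not reproduce the dispersion-aware method of the talk, and all names in it are illustrative.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected adjacency sets


def inclusive_neighbors(graph: Graph, node: Hashable) -> Set[Hashable]:
    """The node itself together with its neighbors."""
    return graph[node] | {node}


def link_similarity(graph: Graph, i: Hashable, j: Hashable, k: Hashable) -> float:
    """Similarity of edges (i, k) and (j, k), which share the node k."""
    if k not in graph[i] or k not in graph[j]:
        raise ValueError("both edges must exist and share node k")
    a = inclusive_neighbors(graph, i)
    b = inclusive_neighbors(graph, j)
    return len(a & b) / len(a | b)


# Toy graph: a triangle 1-2-3 with a pendant node 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
```

On the toy graph, the two triangle edges meeting at node 3 score 1.0, while a triangle edge and the pendant edge at node 3 score 0.25, so the pendant edge would join the triangle's link community late, if at all.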
  • 66. Outline (repeated)
      Next: 4. Rhea: Adaptively Sampling High Quality Content from Social Activity Streams
  • 68. 500 million tweets sent each day!
  • 69. Contribution
      We study the problem of extracting high-quality samples of a social activity stream.
      Related work:
        White-lists of users [GSB+12, WLP+12, GZB+13, ZBG+16].
        Authoritative users through network attributes [ZAA07, JA07, ACD+08, PC11, BBC+13] (not streams).
      We propose RHEA: a high-quality content sampling algorithm that forms a network of authorities as it processes a social activity stream, and samples only the activity of the top-K authoritative users.
      – PL, Alexandros Ntoulas, Alex Delis: Rhea: Adaptively Sampling Authoritative Content from Social Activity Streams. IEEE BigData 2017
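The general shape of top-K authority sampling over a stream can be illustrated with a toy sketch. The slide does not say how RHEA scores authority, so the code below substitutes a plain count of how often a user is referenced by others; that proxy, the `Activity` tuple layout, and the function name are all assumptions made for illustration.

```python
import heapq
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Activity = Tuple[str, List[str], str]  # (author, users referenced, text)


def sample_top_k(stream: Iterable[Activity], k: int) -> Iterator[Activity]:
    """Yield only the activity of the k most-referenced users seen so far."""
    references = Counter()  # assumed authority proxy: times referenced by others
    for author, referenced, text in stream:
        for user in referenced:
            if user != author:  # ignore self-references
                references[user] += 1
        # Recomputed per item for clarity; a real system would maintain it incrementally.
        top_k = {user for user, _ in heapq.nlargest(
            k, references.items(), key=lambda item: (item[1], item[0]))}
        if author in top_k:
            yield (author, referenced, text)


stream = [
    ("a", ["x"], "t1"), ("b", ["x"], "t2"), ("x", [], "t3"), ("c", ["y"], "t4"),
    ("y", [], "t5"), ("a", ["x"], "t6"), ("y", [], "t7"),
]
```

With k = 1 only the post by "x" is sampled; with k = 2 the posts by "y" are sampled as well, since "y" has been referenced by the time they arrive.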
  • 70. Outline (repeated)
      Next: 5. On the Impact of Social Cost in Opinion Dynamics
  • 71. Formation of opinions in a social context
      intrinsic belief + friends’ expressed opinions → expressed opinion
  • 72. Basic notions of the model
      We use a variation of the DeGroot model due to Friedkin and Johnsen [FJ90] and the corresponding game of [BKO11].
      Each user i maintains:
        An intrinsic belief $s_i$, which remains constant
        An expressed opinion $z_i$, which is updated iteratively through averaging
      The cost a user suffers emanates from:
        Suppressing her intrinsic belief
        Disagreeing with her friends
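For reference, the game of [BKO11] is commonly written with a quadratic cost that combines the two sources listed above. The derivation below is a sketch under that assumption. The neighbor set $N(i)$ and the influence weights $w_{ij}$ are the usual notation of that model, not something defined on this slide.

```latex
% Cost of user i: distance from her own belief plus weighted disagreement with friends
c_i(z) = (z_i - s_i)^2 + \sum_{j \in N(i)} w_{ij}\,(z_i - z_j)^2

% Setting the derivative with respect to z_i to zero
\frac{\partial c_i}{\partial z_i}
  = 2\,(z_i - s_i) + 2 \sum_{j \in N(i)} w_{ij}\,(z_i - z_j) = 0

% gives the cost-minimizing opinion: a weighted average of s_i and the friends' opinions
z_i = \frac{s_i + \sum_{j \in N(i)} w_{ij}\, z_j}{1 + \sum_{j \in N(i)} w_{ij}}
```

The cost is a convex quadratic in $z_i$, so this stationary point is the unique minimizer for user i given her friends' current opinions, which is why the update is described as averaging.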
  • 73. Contribution
      1. We analyze user activity in a social network and verify that social interaction results in influence on opinions among the participants.
      2. We implement over Spark (GraphX) a distributed algorithm. At each time step, user i updates $z_i$ to minimize her cost:
           $$z_i = \frac{s_i + \sum_{j \in N(i)} w_{ij} z_j}{1 + \sum_{j \in N(i)} w_{ij}}$$
           $N(i)$: the set of nodes that i follows
           $w_{ij}$: the strength of the influence of j on i
      3. The algorithm terminates when z converges to the unique Nash equilibrium, where the social cost is minimized.
      4. The resulting Nash equilibria are illustrative of how users really behave.
      – PL, Katia Papakonstantinopoulou: On the Impact of Social Cost in Opinion Dynamics. AAAI ICWSM 2016 & AGATHA 2016
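The update rule in item 2 can be tried out on a single machine with the minimal Python sketch below. It is not the Spark (GraphX) implementation mentioned on the slide. The synchronous update schedule, the stopping tolerance `tol`, and the dict-of-dicts weight layout are assumptions made for illustration.

```python
from typing import Dict, Hashable

Weights = Dict[Hashable, Dict[Hashable, float]]  # w[i][j]: influence of j on i


def best_response_dynamics(
    s: Dict[Hashable, float],
    w: Weights,
    tol: float = 1e-9,
    max_steps: int = 10_000,
) -> Dict[Hashable, float]:
    """Iterate the averaging update until expressed opinions stop changing."""
    z = dict(s)  # start from the intrinsic beliefs
    for _ in range(max_steps):
        new_z = {}
        for i in s:
            followed = w.get(i, {})  # N(i) with weights; empty if i follows nobody
            numerator = s[i] + sum(w_ij * z[j] for j, w_ij in followed.items())
            denominator = 1.0 + sum(followed.values())
            new_z[i] = numerator / denominator
        change = max(abs(new_z[i] - z[i]) for i in s)
        z = new_z
        if change < tol:  # assumed convergence test
            break
    return z


# Two users who follow each other with unit weight and hold opposite beliefs.
beliefs = {"a": 0.0, "b": 1.0}
weights = {"a": {"b": 1.0}, "b": {"a": 1.0}}
equilibrium = best_response_dynamics(beliefs, weights)
```

In this two-user example the opinions settle near 1/3 and 2/3: each user moves toward her friend but stays anchored to her own intrinsic belief.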
  • 74. Open Directions
      1. Many opportunities for memory optimization in distributed graph processing systems
      2. Ground-truth communities should better portray the functional role of a network’s nodes
      3. Distributed streaming community detection
      4. Authorities vs. fake news
      5. Empirical analysis of the opinion formation process in other social networks
  • 75. References I
      [ABL10] Yong-Yeol Ahn, James P Bagrow, and Sune Lehmann. Link communities reveal multiscale complexity in networks. Nature, 466(7307):761–764, 2010.
      [ACD+08] Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, and Gilad Mishne. Finding high-quality content in social media. In Proc. of the Int. Conf. on Web Search and Web Data Mining, WSDM 2008, Palo Alto, California, USA, February 11-12, 2008, pages 183–194, 2008.
      [BBC+13] Alessandro Bozzon, Marco Brambilla, Stefano Ceri, Matteo Silvestri, and Giuliano Vesci. Choosing the right crowd: expert finding in social networks. In Joint 2013 EDBT/ICDT Conferences, EDBT ’13 Proceedings, Genoa, Italy, March 18-22, 2013, pages 637–648, 2013.
      [BBPSV04] A. Barrat, M. Barthélemy, R. Pastor-Satorras, and A. Vespignani. The architecture of complex weighted networks. Proc. of the National Academy of Sciences of the United States of America, 101(11):3747–3752, 2004.
      [BGLL08] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
      [BKO11] David Bindel, Jon M. Kleinberg, and Sigal Oren. How bad is forming your own opinion? In FOCS, pages 57–66, 2011.
      [BV04] Paolo Boldi and Sebastiano Vigna. The WebGraph framework I: compression techniques. In Proc. of the 13th Int. Conf. on World Wide Web, New York, NY, USA, May 17-20, pages 595–602, 2004.
  • 76. References II
      [CEK+15] Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi Muthukrishnan. One trillion edges: Graph processing at Facebook-scale. PVLDB, 8(12):1804–1815, 2015.
      [CNM04] Aaron Clauset, Mark EJ Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
      [CRGP12] Michele Coscia, Giulio Rossetti, Fosca Giannotti, and Dino Pedreschi. DEMON: a local-first discovery method for overlapping communities. In Proc. of the 18th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 615–623, 2012.
      [EL09] TS Evans and R Lambiotte. Line graphs, link partitions, and overlapping communities. Physical Review E, 80:016105, 2009.
      [FJ90] N.E. Friedkin and E.C. Johnsen. Social influence and opinions. Journal of Mathematical Sociology, 15(3-4):193–206, 1990.
      [GN02] Michelle Girvan and Mark EJ Newman. Community structure in social and biological networks. Proc. of the National Academy of Sciences, 99(12):7821–7826, 2002.
      [GS12] David F Gleich and C Seshadhri. Vertex neighborhoods, low conductance cuts, and good seeds for local community methods. In Proc. of the 18th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 597–605, 2012.
      [GSB+12] Saptarshi Ghosh, Naveen Kumar Sharma, Fabrício Benevenuto, Niloy Ganguly, and P. Krishna Gummadi. Cognos: crowdsourcing search for topic experts in microblogs. In The 35th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR ’12, Portland, OR, USA, August 12-16, 2012, pages 575–590, 2012.
  • 77. References III
      [GZB+13] Saptarshi Ghosh, Muhammad Bilal Zafar, Parantapa Bhattacharya, Naveen Kumar Sharma, Niloy Ganguly, and P. Krishna Gummadi. On sampling the wisdom of crowds: random vs. expert sampling of the Twitter stream. In 22nd ACM Int. Conf. on Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27 - November 1, 2013, pages 1739–1744, 2013.
      [HMBL17] A. Hollocou, J. Maudet, T. Bonald, and M. Lelarge. A linear streaming algorithm for community detection in very large networks. ArXiv e-prints, March 2017.
      [HSB+15] Kun He, Yiwei Sun, David Bindel, John E. Hopcroft, and Yixuan Li. Detecting overlapping communities from local spectral subspaces. In IEEE International Conference on Data Mining, Atlantic City, NJ, USA, pages 769–774, 2015.
      [JA07] Pawel Jurczyk and Eugene Agichtein. Discovering authorities in question answer communities by using link analysis. In Proc. of the 16th ACM Conf. on Information and Knowledge Management, CIKM 2007, Lisbon, Portugal, November 6-10, 2007, pages 919–922, 2007.
      [KBG] Aapo Kyrola, Guy E. Blelloch, and Carlos Guestrin. GraphChi: Large-scale graph computation on just a PC. In 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012, Hollywood, CA, USA, October 8-10, pages 31–46, 2012.
      [KG14] Kyle Kloster and David F. Gleich. Heat kernel based community detection. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, pages 1386–1395, 2014.
      [KTS+12] U. Kang, Hanghang Tong, Jimeng Sun, Ching-Yung Lin, and Christos Faloutsos. GBASE: an efficient analysis platform for large graphs. VLDB J., 21(5):637–650, 2012.
  • 78. References IV
      [LH17] Hang Liu and H. Howie Huang. Graphene: Fine-grained IO management for graph computing. In 15th USENIX Conference on File and Storage Technologies (FAST 17), pages 285–300, Santa Clara, CA, 2017. USENIX Association.
      [LHBH15] Yixuan Li, Kun He, David Bindel, and John E Hopcroft. Uncovering the small community structure in large networks: A local spectral approach. In Proc. of the 24th Int. Conf. on World Wide Web, pages 658–668, 2015.
      [MAF] Mary McGlohon, Leman Akoglu, and Christos Faloutsos. Weighted graphs and disconnected components: patterns and a generator. In Proc. of the 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, pages 524–532, 2008.
      [NG04] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69(2):026113, February 2004.
      [PC11] Aditya Pal and Scott Counts. Identifying topical authorities in microblogs. In Proc. of the 4th International Conference on Web Search and Web Data Mining, WSDM 2011, Hong Kong, China, February 9-12, 2011, pages 45–54, 2011.
      [PDFV05] Gergely Palla, Imre Derényi, Illés Farkas, and Tamás Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043):814–818, 2005.
      [PL05] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences-ISCIS 2005, pages 284–293. 2005.
      [RB11] Martin Rosvall and Carl T Bergstrom. Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems. PLoS ONE, 6(4):e18209, 2011.
  • 79. References V
      [SDB15] Julian Shun, Laxman Dhulipala, and Guy E. Blelloch. Smaller and faster: Parallel processing of compressed graphs with Ligra+. In 2015 Data Compression Conference, DCC 2015, Snowbird, UT, USA, April 7-9, pages 403–412, 2015.
      [WGD13] Joyce Jiyoung Whang, David F Gleich, and Inderjit S Dhillon. Overlapping community detection using seed set expansion. In Proc. of the 22nd ACM Int. Conf. on Information & Knowledge Management, pages 2099–2108, 2013.
      [WLP+12] Claudia Wagner, Vera Liao, Peter Pirolli, Les Nelson, and Markus Strohmaier. It’s not in their tweets: Modeling topical expertise of Twitter users. In 2012 Int. Conf. on Privacy, Security, Risk and Trust, PASSAT 2012, and 2012 Int. Conf. on Social Computing, SocialCom 2012, Amsterdam, Netherlands, September 3-5, 2012, pages 91–100, 2012.
      [YL13] Jaewon Yang and Jure Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach. In Proc. of the 6th ACM Int. Conf. on Web Search and Data Mining, pages 587–596, 2013.
      [YLP14] Se-Young Yun, Marc Lelarge, and Alexandre Proutière. Streaming, memory limited algorithms for community detection. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3167–3175, 2014.
      [ZAA07] Jun Zhang, Mark S. Ackerman, and Lada A. Adamic. Expertise networks in online communities: structure and algorithms. In Proc. of the 16th Int. Conf. on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, pages 221–230, 2007.
      [ZB15] Anita Zakrzewska and David A. Bader. A dynamic algorithm for local community detection in graphs. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2015, Paris, France, August 25 - 28, 2015, pages 559–564, 2015.
  • 80. References VI
      [ZBG+16] Muhammad Bilal Zafar, Parantapa Bhattacharya, Niloy Ganguly, Saptarshi Ghosh, and Krishna P. Gummadi. On the wisdom of experts vs. crowds: Discovering trustworthy topical news in microblogs. In Proc. of the 19th ACM Conf. on Computer-Supported Cooperative Work & Social Computing, CSCW 2016, San Francisco, CA, USA, February 27 - March 2, 2016, pages 437–450, 2016.
      [ZMB+15] Da Zheng, Disa Mhembere, Randal Burns, Joshua Vogelstein, Carey E. Priebe, and Alexander S. Szalay. FlashGraph: Processing billion-node graphs on an array of commodity SSDs. In 13th USENIX Conference on File and Storage Technologies (FAST 15), pages 45–58, Santa Clara, CA, 2015. USENIX Association.
  • 81. Thank you!
      Special thanks to: Alex Delis, Katia Papakonstantinopoulou, Alexandros Ntoulas, Michael Sioutis, Nikos Leonardos, Katerina El Raheb & Alexis Antoniadis