Uprising microblogs: A Bayesian network retrieval model for tweet search


Lamjed Ben Jabeur, Lynda Tamine and Mohand Boughanem
IRIT, Université Paul Sabatier


Outline

1. Microblogging service
2. Tweet search
3. Bayesian network topology
4. Computing conditional probabilities
5. Experimental evaluation
6. Conclusion and future work


Microblogging service

Microblog?

“Microblogging is a new form of communication [...] that enables users to broadcast and share information about their activities, opinions and status.” (Java et al., 2007)

• Microblog post
– Short (140 characters)
– Real-time
– Social motivation
– Mobile device

• Usage figures
– 1 billion posts per week
– 50 million posts per day
– 177 million posts in March 2011
– More than 106 million user accounts


Tweet, retweet and hashtag?

Jack Dorsey, 21 March 2006 (first tweet)
“inviting coworkers #oilspill”

Stephen Colbert, 21 June 2010 (Golden Tweet Award 2010)
“In honor of oil-soaked birds, 'tweets' are now 'gurgles.' http://bit.ly/cIhZNf”

Wendy's, 8 June 2011 (Golden Tweet Award 2011)
“RT for a good cause. Each Retweet sends 50¢ to help kids in foster care. #TreatItFwd”

CORIA11, 16 March 2010
“CORIA 2011 : Université d'Avignon #CORIA11 http://yfrog.com/h3y”

MohBoughanem, 17 March 2010
“@coria2011 well visualized, quickly found”

MohBoughanem CORIA11, 17 March 2010
“@coria2011 well visualized, quickly found”


Social information network


Tweet search

Microblog IR

• Users are overwhelmed by the huge quantity of tweets
– High publication rate
– Diverse sources of information
– Result: difficulty accessing interesting posts

• Microblog IR tasks
– Person search and follower suggestion
– Trend extraction
– Opinion search
– Tweet search

Tweet search task

“real-time search task, where the user wishes to see the most recent but relevant information to the query.” (Ounis et al., 2011)

“adhoc search on Twitter, where a user’s information need is represented by a query at a specific time.” (Ounis et al., 2011)

• Search motivations
– access to concise and credible information
– access to fresh and real-time news
– follow an event
– collect opinions and public sentiments

Related work

1. Spatio-temporal context
– TwitterStand (Sankaranarayanan et al., 2009), TweetSieve (Grinev et al., 2009)

2. Microblog features
– Followership, tweets, retweets, replies, hashtags, URLs
– Linear combination (Nagmoti et al., 2010)
– Learning to rank (Duan et al., 2010)


Related work

3. Social network structure
– Indegree, retweet and mention influence (Cha et al., 2010); TweetRank, FollowerRank (Nagmoti et al., 2010)
– Authority (Kwak et al., 2010)
– Influence (Kwak et al., 2010), TwitterRank (Weng et al., 2010), Popularity (Duan et al., 2010)


Contributions
• Relevance features:
– Term occurrence
– Social influence
– Time magnitude

• Relevance dimensions: topical, temporal, social

• Bayesian network model

Bayesian network topology

Definitions and notations

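• Query: $q \in \{0, 1\}$, with states $q$ and $\bar{q}$
• Term: $k_i \in \{0, 1\}$, with states $k_i$ and $\bar{k}_i$
• Term configuration: $\vec{k}$
  Example for terms $k_1, k_2$: $\vec{k} \in \{(k_1, k_2), (k_1, \bar{k}_2), (\bar{k}_1, k_2), (\bar{k}_1, \bar{k}_2)\}$
• Tweet: $t_j \in \{0, 1\}$, with states $t_j$ and $\bar{t}_j$
• Microblogger: $u_k \in \{0, 1\}$, with states $u_k$ and $\bar{u}_k$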
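A term configuration assigns an on/off state to every query term, so n terms yield 2^n configurations. The short sketch below enumerates them for the two-term example. The function names `term_configurations` and `on` are hypothetical helpers chosen for illustration (the latter mirrors the on(i, k) indicator that appears on later slides); they are not taken from the authors' implementation.

```python
from itertools import product


def term_configurations(terms):
    """Yield every configuration as a dict mapping term -> bool (True = term is on)."""
    for states in product((True, False), repeat=len(terms)):
        yield dict(zip(terms, states))


def on(term, config):
    """Indicator: 1 if the term is on in this configuration, 0 otherwise (hypothetical helper)."""
    return 1 if config[term] else 0


configs = list(term_configurations(["k1", "k2"]))
for config in configs:
    print(config, [on(term, config) for term in ("k1", "k2")])
```

Because the number of configurations doubles with each query term, exhaustive enumeration like this is only practical for short queries.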


Network nodes and edges

Query q

Terms k1 k2 k3

Tweets t1 t2 t3

Microbloggers u1 u2



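Term frequency

[Model equation shown at the top of the slide: products over the terms with $on(i,\vec{k}) = 1$ and over the terms with $on(i,\vec{k}) = 0$]

$$F(k_i, t_j) = \begin{cases} 1 - \dfrac{a}{tf_{k_i,t_j}} & \text{if } k_i \in t_j \\ 0 & \text{otherwise} \end{cases}$$

[Figure: $F(k_i, t_j)$ (0 to 1) as a function of $tf_{k_i,t_j}$ (0 to 10), for a = 0.1, 0.25, 0.5, 0.75 and 1]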
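The term-frequency feature can be explored numerically with a small sketch. It assumes the slide's formula reads F = 1 - a/tf when the term occurs in the tweet and 0 otherwise, which is a reading of a damaged equation rather than a confirmed definition. The function name `term_frequency_weight` and the default a = 0.25 (the value listed in the configuration slide) are illustrative choices.

```python
def term_frequency_weight(tf: int, a: float = 0.25) -> float:
    """Assumed reading of F(k_i, t_j): 1 - a/tf if the term occurs (tf >= 1), else 0."""
    if tf <= 0:
        return 0.0
    return 1.0 - a / tf


# Tabulate the weight for the values of a shown in the slide's plot legend.
for a in (0.1, 0.25, 0.5, 0.75, 1.0):
    row = [round(term_frequency_weight(tf, a), 3) for tf in (1, 2, 5, 10)]
    print(f"a={a}: {row}")
```

Under this reading, a larger a penalizes single occurrences more strongly, and the weight approaches 1 as the term frequency grows.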


Hashtag

$$H(k_i, t_j) = \begin{cases} 1 - \dfrac{b}{tf_{\#k_i,t_j}} & \text{if } \#k_i \in t_j \\ b & \text{otherwise} \end{cases}$$


Time magnitude

$$T(k_i, t_j) = \frac{df_{k_i,\Theta_j}}{|\Theta_j|}$$

where $\Theta_j$ is the set of tweets $t_k$ published within the time window $\Delta t$ of $t_j$.

[Figure: number of tweets (0 to 30) over time (1 to 5), with tweets $t_1$ and $t_2$ marked]
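The time magnitude feature can be sketched as the share of tweets in a temporal neighbourhood that contain the term. The code below is a minimal illustration under two assumptions that the slide does not confirm: the neighbourhood is every tweet whose timestamp lies within `window` of the target tweet on either side, and the document frequency counts each tweet at most once. The function name and the tuple-based tweet representation are hypothetical.

```python
from datetime import datetime, timedelta


def time_magnitude(term, target, tweets, window=timedelta(hours=1)):
    """Fraction of tweets within `window` of `target` that contain `term`.

    tweets: list of (timestamp, set_of_terms) tuples; target: one such tuple.
    """
    target_time, _ = target
    # Assumed window: symmetric around the target tweet's timestamp.
    neighbours = [terms for ts, terms in tweets if abs(ts - target_time) <= window]
    if not neighbours:
        return 0.0
    return sum(term in terms for terms in neighbours) / len(neighbours)


base = datetime(2011, 1, 25, 12, 0)
tweets = [
    (base, {"tahrir", "protest"}),
    (base + timedelta(minutes=30), {"tahrir"}),
    (base + timedelta(minutes=45), {"egypt"}),
    (base + timedelta(hours=3), {"tahrir"}),
]
burst_score = time_magnitude("tahrir", tweets[0], tweets)
isolated_score = time_magnitude("tahrir", tweets[3], tweets)
print(burst_score, isolated_score)
```

The default one-hour window follows the Δt = 1 h setting listed with the BNTS configuration.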


Tweet length

$$L(t_j) = \frac{1}{1 + avgtl - tl_{t_j}}$$


Microblogger

$$P(t_j \mid u_k) = \frac{1}{|u_k|}$$


Social influence

$$P(u_k) = Inf(u_k)$$

PageRank on the retweet social network:

$$Inf_G^{k}(u_i) = d \cdot \frac{1}{|U|} + (1 - d) \sum_{u_j :\, e(u_j, u_i) \in E} w_{j,i} \, \frac{Inf_G^{k-1}(u_j)}{O(u_j)}$$

Edge weight $w_{j,i}$: a ratio of quantities defined over $u_j$ and $u_i$ (function symbols lost in extraction).
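The influence score is a weighted PageRank iterated over the retweet graph. The sketch below implements the recursion Inf^k(u_i) = d/|U| + (1 - d) Σ w_{j,i} Inf^{k-1}(u_j)/O(u_j) under several assumptions: an edge (u_j, u_i) points from the retweeter to the retweeted author, O(u_j) is the out-degree of u_j, and edge weights default to 1 because the slide's weight formula is not recoverable. The function name `retweet_influence` and its parameters are hypothetical, and dangling nodes are not given special treatment.

```python
def retweet_influence(nodes, edges, weights=None, d=0.15, iterations=50):
    """Iterate the weighted PageRank recursion on a retweet graph.

    edges: iterable of (u_j, u_i) pairs; weights: dict (u_j, u_i) -> w, default 1.0.
    """
    edges = list(edges)
    weights = weights or {}
    out_degree = {u: 0 for u in nodes}
    for src, _ in edges:
        out_degree[src] += 1
    scores = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iterations):
        # Every node receives the uniform share d / |U| ...
        updated = {u: d / len(nodes) for u in nodes}
        # ... plus a weighted share of each in-neighbour's previous score.
        for src, dst in edges:
            w = weights.get((src, dst), 1.0)
            updated[dst] += (1 - d) * w * scores[src] / out_degree[src]
        scores = updated
    return scores


scores = retweet_influence(["a", "b", "c"], [("a", "c"), ("b", "c")])
print(scores)
```

In this toy graph both `a` and `b` retweet `c`, so `c` ends with the highest score. The value d = 0.15 matches the setting listed with the BNTS configuration.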

Experimental evaluation

TREC 2011 Microblog
NESTOR
Microblog Search Engine

Collection statistics:

Tweets                                  16 141 812
Retweets                                 1 128 179
Tweet                                    1 860 112
Terms                                    7 781 775
Hashtags                                   455 179
Microbloggers                            5 356 432
Retweet relationships                    1 060 551
Social network of retweets: nodes        5 495 081
Social network of retweets: edges        1 024 914
Giant component                             11.12%

[Figure: term frequency, hashtag and tweet length distributions]


Queries and ground truth

• “Arab Spring” query dataset (25 queries)
– Topical
“Number of protesters in Tahrir”, “Tunisian revolution”

– Temporal
“ElBaradei arrives in Egypt”, “Clashes in Tahrir”, “SMS Down Egypt”

– Social
“Wael Ghonim”, “Mubarak dissolves government”

• User rating (relevant, not relevant)
• Tweets ranked by score; measures: p@10 and p@20
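The p@10 and p@20 measures are the fraction of relevant tweets among the top 10 or top 20 results. A minimal sketch follows, assuming binary relevance judgments as in the user rating above; the function name and the tweet identifiers are made up for illustration.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are judged relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for item in top_k if item in relevant_ids) / k


ranking = ["t1", "t2", "t3", "t4"]   # tweets ordered by decreasing score
relevant = {"t1", "t4"}              # tweets rated relevant by users
p_at_2 = precision_at_k(ranking, relevant, 2)
p_at_4 = precision_at_k(ranking, relevant, 4)
print(p_at_2, p_at_4)
```

Dividing by k rather than by the number of returned tweets means a ranking shorter than k is penalized for the missing positions.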


Configurations and baselines

Configuration   Description
BNTS            Bayesian network model for tweet search*
BNTS-L          BNTS, tweet length feature disabled
BNTS-T          BNTS, time magnitude feature disabled
BNTS-H          BNTS, hashtag feature disabled
BNTS-S          BNTS, social influence feature disabled
BM25            Okapi BM25
VSM             Vector Space Model
BM              Boolean Model

* Parameters: [symbol lost in extraction] = 0.25, a = 0.25, b = 0.4, Δt = 1 h, d = 0.15


Features impact

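[Figure: bar chart of p@10 and p@20 for BNTS, BNTS-L, BNTS-T, BNTS-H and BNTS-S over all queries. Bar labels as extracted, in no recoverable order: 0.584, 0.58, 0.552, 0.532, 0.548, 0.542, 0.528, 0.502, 0.294, 0.256]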


Features impact
Topical
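[Figure: bar chart of p@10 and p@20 for BNTS, BNTS-L, BNTS-T, BNTS-H and BNTS-S on topical queries. Bar labels as extracted, in no recoverable order: 0.7533, 0.7333, 0.7233, 0.66, 0.6867, 0.6867, 0.6833, 0.6833, 0.3767, 0.2867]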


Features impact
Temporal

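[Figure: bar chart of p@10 and p@20 for BNTS, BNTS-L, BNTS-T, BNTS-H and BNTS-S on temporal queries. Bar labels as extracted, in no recoverable order: 0.4333, 0.4, 0.3333, 0.35, 0.3, 0.3167, 0.2333, 0.2, 0.1, 0.0667]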


Features impact
Social
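[Figure: bar chart of p@10 and p@20 for BNTS, BNTS-L, BNTS-T, BNTS-H and BNTS-S on social queries. Bar labels as extracted, in no recoverable order: 0.3714, 0.3286, 0.3286, 0.3286, 0.3357, 0.2714, 0.2857, 0.2429, 0.2571, 0.2]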


Retrieval effectiveness

Model   p@10       BNTS vs. model   p@20       BNTS vs. model
BNTS    0.552                       0.548
BM25    0.576      -4%              0.494      +11%
BM      0.416 **   +33%             0.382 **   +34%
VSM     0.376 **   +47%             0.36 **    +52%
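The percentage columns appear to report the relative change of BNTS over each baseline, that is (BNTS - baseline) / baseline. The sketch below recomputes that quantity from the precision values in the table; the interpretation of the percentages and the function name `relative_improvement` are assumptions, not something the slide states.

```python
def relative_improvement(model_score, baseline_score):
    """Relative change of the model over the baseline, in percent."""
    return (model_score - baseline_score) / baseline_score * 100


bnts = {"p@10": 0.552, "p@20": 0.548}
baselines = {
    "BM25": {"p@10": 0.576, "p@20": 0.494},
    "BM": {"p@10": 0.416, "p@20": 0.382},
    "VSM": {"p@10": 0.376, "p@20": 0.36},
}

for name, scores in baselines.items():
    changes = {m: round(relative_improvement(bnts[m], s), 1) for m, s in scores.items()}
    print(name, changes)
```

A negative value, as for BM25 at p@10, means the baseline scores higher than BNTS on that measure.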


Conclusion and future work

• Tweet search model
– Normalized Term frequency
– Time magnitude
– Social influence
• Integrating relevance factors within a Bayesian network
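• The query profile (topical, temporal or social) affects how much each feature contributes.
• Our model outperforms the traditional IR baselines at p@20; at p@10 it outperforms BM and VSM but not BM25.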
• Future work
– Automatically detect optimal time window
– Select appropriate feature depending on the query profile

Thank you for your attention!

Follow me on Twitter!
http://twitter.com/amjedbj


Term frequency normalization

• BNTS.K
[Equation: $P(t_j \mid \vec{k})$ expressed through $tf_{k_i,t_j}$ over the terms $k_i \in \vec{k}$ that occur in $t_j$; not fully recoverable]
[Figure: p@30 (0 to 0.35) as a function of a parameter ranging from 0 to 1]


Time window

• BNTS.KO
$$o_e : \left[\, o_e - \frac{\Delta t}{2},\; o_e + \frac{\Delta t}{2} \,\right]$$
[Figure: p@30 (0.29 to 0.32) as a function of the time window $\Delta t$ in days (0 to 17)]


Retrieval effectiveness
[Figure: bar chart of p@30 and MAP (0 to 0.5); legend: isiFDL DFReeKLIM30 BNTS Median Nestor BM25 Disjunctive]


TREC Microblogs 2011
Ranked by time Ranked by score
All rel High rel All rel
p@30 MAP p@30 MAP p@30 MAP
Nestor* 0.2027 0.1305 0.0838 0.1287 0.2218 0.1384
Nestor-S* 0.2027 0.1305 0.0838 0.1286 0.2184 0.1360
Nestor-T 0.2082 0.1343 0.0585 0.0912 0.1912 0.1196
Nestor-L 0.2048 0.1306 0.0565 0.0867 0.2293 0.1426
Median 0.2592 0.1433 0.2646 0.1381



TREC Microblogs 2011
#   System                              Threshold  p@10    p@20    p@30    MAP
1   Sum of IDF of terms present         30         0.3633  0.3316  0.3333  0.1759
2   BM25                                30         0.3571  0.3245  0.2973  0.1546
3   Proportion of terms present         30         0.2653  0.2561  0.2782  0.14
4   Sum of Boolean frequencies          30         0.2571  0.2663  0.2755  0.1387
5   EBM (AND)                           30         0.3041  0.2918  0.2714  0.1282
6   Bayesian inference network          30         0.302   0.2888  0.2687  0.1274
7   Sum of TF*IDF                       30         0.302   0.2888  0.2687  0.1274
8   VSM                                 30         0.302   0.2888  0.2687  0.1274
9   Sum of TF                           30         0.2327  0.2276  0.2238  0.1066
10  Nestor                                         0.2857  0.2347  0.2027  0.1305
11  EBM (OR)                            30         0.1837  0.1786  0.166   0.0541
12  Sum of hashtag frequencies          30         0.1612  0.1541  0.1469  0.0512
13  Lucene-Baseline                     1000       0.1612  0.1143  0.0986  0.1411
14  Sum of TF (length-normalized)       30         0.0816  0.0673  0.0612  0.0223
15  Reverse chronological order         30         0.0184  0.0255  0.0218  0.0082
