School of something
          Computing
FACULTY OF ENGINEERING
           OTHER




        Blog Clustering and Community
         Discovery in the Blogosphere
                           An Overview


                            Ahmad Ammari
              Research Fellow (User / Community Modelling)
OUTLINE

• Significance
• Research Challenges
• Network – Based Blog Clustering Approach
• Content – Based Blog Clustering Approach
• Hybrid – Based Blog Clustering Approach
• Evaluation
• Conclusion
The Blogosphere is Huge
• Doubling in size every 5 months,
  consistently for the last 4 years
• Over 120,000 new blogs
  created every day
• About 1.4 new blogs every second
(Technorati, 2009)
Why Cluster Blogs?

• For Bloggers / Readers:
 o Can focus on the clusters
   they “belong to”
• Improve Recommender
  Engines:
 o Suggest related content to
   other cluster members
 o Suggest similar bloggers
   to network / follow
Why Cluster Blogs?
• For Search Engines:
 o Improve indexing
   mechanisms
 o Improve the delivery
   of the search results
   by organizing similar
   results together
 o Enhance the
   navigability of search
   results
• Meta Search Engine: Yippy / Clusty
• Retrieve results from many engines
• Cluster them into 'clouds' based on
  their contextual contents
Why Cluster Blogs?
• For Sociocultural / Political
  Studies:
    o Uncovering trending
      social, cultural, & political
      correlations within
      blogging communities
•    e.g. Harvard Arab
     Blogosphere Study, 2009
    o Baseline assessment of
      networked public sphere in
      Middle East Blogs
    o Relationships to politics,
      media, religion, culture,
      international affairs
Research Challenges
• Existing approaches in webpage clustering & web community
  discovery are explored in the blogosphere
• Applicability Challenges due to Key Differences between the
  Blogosphere & the Web
    Blog Posts                               Web Pages
    Short-lived References                   Long-lived References
    Monitoring Community Temporal Dynamics   Relative Temporal Stability
    Multi-Theme Contents                     Focused Contents
    Emergent Text Analysis                   Traditional Text Analysis
    Missing Citations                        Available Citations
Blog Clusters Vs. Community Discovery
• Research Trend: Content information is more commonly used
  to identify clusters of blog topics, and network information
  to discover blog communities
• Proposal: Both content and network information can be used
  / combined to identify blog Topic clusters and/or blog
  communities
Graph – Based Clustering Approach
Spectral Clustering - Example
k-Means Clustering




• Initialize k centroids
  randomly
• Assign points to
  closest centroids
• Recalculate and
  move centroids
• Repeat until
  centroids are stable
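
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the talk: the function name `k_means`, the `max_iter` and `seed` parameters, and the choice to draw the initial centroids from the data points are all assumptions made for the example.

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means following the four steps on the slide."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids (an assumption).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points;
        # a centroid with no points stays where it is.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids are stable.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of 2-D points (made-up data).
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = k_means(points, k=2)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful.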
Content – Based Estimation of W
• Blog graph could be extremely
  sparse due to the casual nature
  of bloggers
• Sparsity Solution:
  o Edges between blogs are
    derived using content similarity
• Given:
• Graph construction options:
  1) ε-neighbourhood
  2) k Nearest Neighbor (kNN)
  3) Fully Connected Graph
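
As a rough sketch of how the three graph constructions could produce W from blog content, assuming each blog is a row of term counts and that cosine similarity is the content similarity. The function names are hypothetical, and thresholding on similarity (rather than on distance) for the ε-neighbourhood is an assumption of this example.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (one row per blog)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

def epsilon_graph(S, eps):
    """1) Keep an unweighted edge only where similarity is at least eps."""
    W = (S >= eps).astype(float)
    np.fill_diagonal(W, 0.0)
    return W

def knn_graph(S, k):
    """2) Connect each blog to its k most similar blogs, then symmetrise."""
    S = S.copy()
    np.fill_diagonal(S, -np.inf)              # a blog is not its own neighbour
    W = np.zeros_like(S)
    nearest = np.argsort(-S, axis=1)[:, :k]   # indices of the k largest similarities
    rows = np.repeat(np.arange(len(S)), k)
    W[rows, nearest.ravel()] = S[rows, nearest.ravel()]
    return np.maximum(W, W.T)

def full_graph(S):
    """3) Keep every pairwise similarity as an edge weight."""
    W = S.copy()
    np.fill_diagonal(W, 0.0)
    return W

# Four blogs over a four-term vocabulary (made-up counts).
X = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 1., 2.],
              [0., 0., 2., 1.]])
S = cosine_similarity_matrix(X)
W_eps = epsilon_graph(S, eps=0.5)
W_knn = knn_graph(S, k=1)
W_full = full_graph(S)
```

The kNN and ε variants keep W sparse, while the fully connected graph retains every weight and leaves it to the clustering step to discount weak edges.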
Content – Based Clustering Approaches
• Blog Contents are used to compute Similarity
• Text - Similarity Measure
 o Cosine Measure




• Spherical k-Means
 o Version of k-means clustering that uses cosine similarity
   instead of Euclidean distance
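
A minimal sketch of spherical k-means, assuming every document vector is non-zero and non-negative (for example, term counts). The function name, `max_iter`, `seed`, and the initialization from data points are illustrative choices, not details taken from the slides.

```python
import numpy as np

def spherical_k_means(X, k, max_iter=100, seed=0):
    """k-means on the unit sphere: assignment by cosine similarity."""
    rng = np.random.default_rng(seed)
    # Project every document vector onto the unit sphere (rows must be non-zero).
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # For unit vectors the dot product equals the cosine similarity.
        new_labels = (X @ centroids.T).argmax(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):
                # The new centroid is the re-normalised sum of its members.
                total = members.sum(axis=0)
                centroids[j] = total / np.linalg.norm(total)
    return labels

# Documents 0 and 1 point in the same direction but differ in length,
# as do documents 2 and 3 (made-up term counts).
docs = np.array([[3., 1., 0.],
                 [6., 2., 0.],
                 [0., 1., 4.],
                 [0., 2., 9.]])
doc_labels = spherical_k_means(docs, k=2)
```

Because only direction matters, a long post and a short post on the same topic fall into the same cluster, which is the usual motivation for preferring cosine similarity on text.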
Content Pre-Processing
• Acronyms
 o Urban Dictionary: http://www.urbandictionary.com/
 o Edited by People
 o 5,677,798 definitions since 1999
• Stop Words Removal
 o Articles (a, an, the ...)
 o Demonstratives (this, that, these ...)
 o Conjunctions (for, and, both ...)
 o Quantifiers (all, few, many ...)
 o Prepositions (on, beneath, over ...)
• Stemming
 o Affix Stemmers, e.g. indefinitely → definite
 o Porter’s stemmer (Suffix Stripping)
• Weighting
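
The first three pre-processing steps might be chained as in the sketch below. Everything in it is illustrative: the `ACRONYMS` table is hypothetical (a real system could look acronyms up in a dictionary), the stop list only contains the examples given above, and the suffix list is a toy stand-in that does not implement Porter's rules.

```python
import re

# Tiny stop list built from the example words on the slide.
STOP_WORDS = {"a", "an", "the", "this", "that", "these", "for", "and",
              "both", "all", "few", "many", "on", "beneath", "over"}

# Hypothetical acronym table, for illustration only.
ACRONYMS = {"imo": "in my opinion", "btw": "by the way"}

# Toy suffix list; this is NOT Porter's stemmer.
SUFFIXES = ("ing", "ly", "ed", "s")

def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case, expand acronyms, drop stop words, strip suffixes."""
    tokens = re.findall(r"[a-z]+", text.lower())
    expanded = []
    for token in tokens:
        expanded.extend(ACRONYMS.get(token, token).split())
    return [strip_suffix(t) for t in expanded if t not in STOP_WORDS]

print(preprocess("BTW the bloggers posted many links"))
# → ['by', 'way', 'blogger', 'post', 'link']
```

Expanding acronyms before stop-word removal means that words introduced by the expansion (such as "the" in "by the way") are filtered as well.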
Vector Space Model
Singular Values as Blog Post Features
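
One way to read this step is as latent semantic analysis: factor the document-term matrix with an SVD and keep only the leading singular directions as features for each blog post. The sketch below assumes that reading; the function name `lsa_features`, the `rank` parameter, and the example matrix are made up for illustration.

```python
import numpy as np

def lsa_features(doc_term, rank):
    """Describe each document by its coordinates along the top singular directions."""
    U, s, Vt = np.linalg.svd(doc_term, full_matrices=False)
    # Each row is one document; columns are latent features scaled by
    # the corresponding singular values.
    return U[:, :rank] * s[:rank]

# Four posts over a four-term vocabulary (made-up counts):
# posts 0 and 1 share vocabulary, as do posts 2 and 3.
doc_term = np.array([[2., 1., 0., 0.],
                     [1., 2., 0., 0.],
                     [0., 0., 1., 2.],
                     [0., 0., 2., 1.]])
features = lsa_features(doc_term, rank=2)
```

With `rank=2` the four sparse term columns collapse to two dense features, and posts that share vocabulary end up with nearly identical feature vectors.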
Hybrid - based Clustering approach
• A blog community can be defined as a set of nodes
  in a graph that link to each other more frequently
  than to nodes outside the set and that share similar
  tags (Java et al., 2008)
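
To make the idea of using links and tags together concrete, the sketch below blends a link matrix with tag similarity into a single affinity matrix that a graph clustering method could then partition. This weighted sum is an assumption chosen for illustration and is not the formulation of Java et al. (2008); `hybrid_affinity` and `alpha` are hypothetical names.

```python
import numpy as np

def hybrid_affinity(links, tags, alpha=0.5):
    """Blend link structure with tag similarity into one affinity matrix."""
    # Symmetrise the link matrix so a link in either direction counts.
    A = np.maximum(links, links.T).astype(float)
    # Cosine similarity between the blogs' tag vectors.
    norms = np.linalg.norm(tags, axis=1, keepdims=True)
    unit = tags / np.where(norms == 0, 1.0, norms)
    T = unit @ unit.T
    # alpha weights links against tags (an illustrative choice).
    W = alpha * A + (1.0 - alpha) * T
    np.fill_diagonal(W, 0.0)
    return W

# Made-up example: links 0->1, 1->2, 2->3; blogs 0 and 1 share one tag,
# blogs 2 and 3 share another.
links = np.array([[0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [0, 0, 0, 0]])
tags = np.array([[1., 0.],
                 [1., 0.],
                 [0., 1.],
                 [0., 1.]])
W = hybrid_affinity(links, tags)
```

Here the cross-community link between blogs 1 and 2 receives a lower weight than the links inside each tag group, so a cut between the two groups becomes cheaper than it would be on links alone.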
Evaluation
• Data Set Description




• First Data Set: citation network of academic publications
   o Six categories: Agents, Artificial Intelligence (AI), Databases
      (DB), Human Computer Interaction (HCI), Information
      Retrieval (IR) and Machine Learning (ML)
   o Binary document-term matrix (Presence / Absence of Terms)
• Second Data Set: Subgraph of Weblogging Ecosystems (WWE)
  workshop
   o Tags fetched from del.icio.us, a well-known social
      bookmarking site
   o Corresponding Homepages downloaded
• Clustering performance was compared between the
  Hybrid & NCut (Network – based) Approaches
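
For a labelled data set such as the six-category citation network, a comparison of this kind can start from a confusion matrix between the known categories and the discovered clusters. The sketch below uses made-up labels; the purity score is one common summary and is an assumption here, not necessarily the measure reported in the talk.

```python
import numpy as np

def confusion_matrix(true_labels, cluster_labels):
    """Count how many items of each true class fall into each cluster."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    M = np.zeros((len(classes), len(clusters)), dtype=int)
    for t, c in zip(true_labels, cluster_labels):
        M[classes.index(t), clusters.index(c)] += 1
    return classes, clusters, M

def purity(M):
    """Fraction of items that belong to the majority class of their cluster."""
    return M.max(axis=0).sum() / M.sum()

# Made-up example: six documents, three categories, three clusters.
true_labels = ["AI", "AI", "DB", "DB", "IR", "IR"]
cluster_labels = [0, 0, 1, 1, 1, 2]
classes, clusters, M = confusion_matrix(true_labels, cluster_labels)
score = purity(M)
```

Rows of `M` follow `classes` and columns follow `clusters`; a clustering that recovers the categories exactly has one non-zero entry per column and a purity of 1.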
Tag Distribution in Discovered Communities
• Top five tags associated with 10 communities found using the NCut Approach
• Top five tags associated with 10 communities found using Hybrid Clustering
Confusion Matrix Comparison: NCut vs. Hybrid
Average Cluster Similarity: NCut vs. Hybrid
Cluster Similarity Vs AVG Doc Similarity: NCut vs. Hybrid
Conclusion
• Both content and network information can be used to
  identify blog clusters or blog communities
• Combining content information (user-defined tags,
  unstructured contents, agglomerative terms / features) with
  network information leads to more coherent blog clusters
  and more distinct blog communities than network-based
  information alone
• Matrix Factorization Techniques (LSA, SVD) reduce the
  sparsity and high dimensionality of content-based
  clustering information; threshold-based filtering
  techniques are also used
• More work is needed to account for temporal dynamics in
  blog clustering, so that blogging interaction patterns and
  community evolution can be monitored
                         Thank You
                            Ahmad Ammari
              Research Fellow (User / Community Modelling)
