Pavan Kapanipathi 
Kno.e.sis Center, Wright State University 
Dayton, OH 
Frontiers of Cloud Computing and Big Data Workshop 2014 
IBM TJ Watson Research Center 
Yorktown Heights, NY
Overview 
o Social Web in 60 seconds 
o Twitter 
o Big Data Challenges on Social Web 
o Addressing Volume 
o Hierarchical Interest Graphs 
o Addressing Velocity 
o Tracking Dynamic Topics on Twitter 
o Conclusion 
Social Web in 60 secs
Twitter 
500M tweets per day
Leveraging Twitter 
o Brands are monitoring Twitter 
o 62% active in 2011 to 97% active in 2013 
o Twitter is used for disaster management 
o 35% of 20M tweets during Hurricane Sandy shared information and news 
o Personalization using Twitter 
o Search engines use influence scores derived from the Twitter network.
Challenges 
The Four Vs of Big Data 
o Volume: Scale of Data 
o Velocity: Streaming Data 
o Variety: Different forms of Data 
o Veracity: Uncertainty of Data 
• Data Perspective: 12 TB of data/day. 
• Information Perspective: information that interests me. Reducing information overload for users. 
• Tracking dynamic topics on Twitter. 
• Improving recall of relevant, dynamic streaming Twitter data for real-time Twitter analysis.
Addressing Volume: Information 
Perspective
State of the Art
Addressing Volume: User Perspective 
o Approach 
o Generating interest profiles of users by 
understanding their activities on Twitter. 
o Filtering/Recommending content that matches 
their interests. 
o Determining user interests from tweets 
o Exploiting a knowledge base to gain further insights about the interests and infer a hierarchical interest graph.
[Figure: example hierarchical interest graph. Nodes: Internet, Semantic Search, Linked Data, Metadata, Technology, World Wide Web, Semantic Web, Structured Information. Legend: User Interests; Scores for Interests (values shown range from 0.2 to 0.8).]
Hierarchical Interest Graphs 
o Spreading Activation Theory 
o Wikipedia graph-based distributional semantics 
o Hierarchical Interest Graph with scores for each category in the hierarchy.
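The idea of scoring categories by spreading activation can be sketched as follows. This is a minimal illustration, not the method used in the talk: the toy hierarchy, the `decay` factor, the `max_hops` cap, and the function name `spread_activation` are all assumptions made for the example, and the hierarchy is assumed to be acyclic.

```python
from collections import defaultdict

# Hypothetical toy hierarchy (child -> parent categories), loosely based on
# the node names shown in the example figure. The edges are assumptions.
PARENTS = {
    "Semantic Search": ["Semantic Web"],
    "Linked Data": ["Semantic Web"],
    "Metadata": ["Structured Information"],
    "Semantic Web": ["World Wide Web"],
    "World Wide Web": ["Internet"],
    "Structured Information": ["Technology"],
    "Internet": ["Technology"],
}


def spread_activation(seed_scores, parents, decay=0.5, max_hops=10):
    """Propagate interest scores from seed nodes up to their ancestors.

    Each hop passes `decay` times a node's activation to every parent.
    `max_hops` bounds the walk in case the graph contains a cycle.
    """
    scores = defaultdict(float)
    frontier = dict(seed_scores)
    for _ in range(max_hops + 1):
        if not frontier:
            break
        next_frontier = defaultdict(float)
        for node, activation in frontier.items():
            scores[node] += activation
            for parent in parents.get(node, []):
                next_frontier[parent] += activation * decay
        frontier = next_frontier
    return dict(scores)


# Seeds: entities assumed to have been found in a user's tweets.
interest_scores = spread_activation(
    {"Semantic Search": 1.0, "Linked Data": 1.0}, PARENTS
)
for category, score in sorted(interest_scores.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {score:.3f}")
```

Categories that several mentioned entities share (here "Semantic Web") accumulate more activation than distant, generic ancestors (here "Technology"), which is the intuition behind ranking categories in the hierarchy.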
Evaluation 
Evaluated the top-10 interest categories derived from the hierarchy 
• 76% Mean Average Precision 
• 98% Mean Reciprocal Rank
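For readers unfamiliar with these ranking metrics, the sketch below computes mean average precision and mean reciprocal rank in their standard textbook forms. The slide does not spell out the exact variant used, so the normalization (average precision divided by the number of relevant items retrieved) and the sample judgments are assumptions for illustration only.

```python
def average_precision(relevance):
    """Average precision for one ranked list of 0/1 relevance judgments.

    Normalized by the number of relevant items in the list (an assumption;
    other variants divide by the total number of relevant items known).
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(relevance):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical judgments: one ranked list of interest categories per user.
judgments = [[1, 0, 1, 0], [0, 1, 1, 1]]
map_score = mean(average_precision(j) for j in judgments)
mrr_score = mean(reciprocal_rank(j) for j in judgments)
print(f"MAP={map_score:.3f} MRR={mrr_score:.3f}")
```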
Addressing Velocity: Tracking Dynamic 
Topics on Twitter
Tracking Dynamic Events on Twitter 
o Twitris – A Semantic Web application for 
analyzing tweets. 
o Political, Disasters & Healthcare tweets 
o Event relevant tweets 
o Twitter Streaming API, Keywords/geo-location 
based. 
o Dynamic events are not easy to crawl using 
these techniques. 
o Hashtags as queries.
Hashtag Analysis for Dynamic Topics 
Events: Colorado Shooting, OWS 
o Hashtags co-occur with each other 
o Power-law distribution 
o Very few hashtags are popular. The top 1% can account for 85% of the tweets. 
o Clustering coefficient 
o The top hashtags co-occur with each other the most
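The three measurements named on this slide (co-occurrence, share of tweets covered by the most popular hashtags, and clustering coefficient) can be sketched in plain Python. The sample tweets and all function names are hypothetical; this only illustrates the quantities, not the analysis pipeline used in the talk.

```python
from collections import Counter
from itertools import combinations
import math

# Hypothetical sample: each tweet is reduced to its set of hashtags.
tweets = [
    {"#ows", "#occupy"},
    {"#ows", "#occupy", "#nyc"},
    {"#ows", "#nyc"},
    {"#ows"},
    {"#occupy", "#p2"},
]


def hashtag_counts(tweets):
    """Number of tweets in which each hashtag appears."""
    return Counter(tag for tags in tweets for tag in tags)


def cooccurrence_graph(tweets):
    """Undirected graph: two hashtags are linked if they share a tweet."""
    graph = {}
    for tags in tweets:
        for tag in tags:
            graph.setdefault(tag, set())
        for a, b in combinations(sorted(tags), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def top_share(tweets, fraction=0.01):
    """Share of tweets containing at least one of the top `fraction` hashtags."""
    if not tweets:
        return 0.0
    counts = hashtag_counts(tweets)
    k = max(1, math.ceil(len(counts) * fraction))
    top = {tag for tag, _ in counts.most_common(k)}
    return sum(1 for tags in tweets if tags & top) / len(tweets)


def clustering_coefficient(graph, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2 * links / (k * (k - 1))


graph = cooccurrence_graph(tweets)
print("share covered by top hashtags:", top_share(tweets))
print("clustering of #ows:", clustering_coefficient(graph, "#ows"))
```

On real data, a heavy-tailed count distribution shows up as a very small `fraction` already yielding a large `top_share`.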
Approach using Wikipedia 
o Input an event-relevant hashtag and the corresponding Wikipedia event page. 
o Utilize the dynamically evolving hyperlink structure of the Wikipedia event page. 
o Determine event-relevant hashtags based on their similarity to the event page and their co-occurrence with the existing relevant hashtags.
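One way to combine the two signals in the last step is a weighted score, sketched below. Everything here is an assumption made for illustration: the bag-of-words representation of the event page, cosine similarity as the similarity measure, the weight `alpha`, the sample tweets and hashtags, and all function names. The slides do not specify how the talk's system computes or combines these signals.

```python
from collections import Counter
import math


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hashtag_profile(candidate, tweets):
    """Term frequencies over the text of tweets that carry the hashtag."""
    words = Counter()
    for text, tags in tweets:
        if candidate in tags:
            words.update(text.lower().split())
    return words


def cooccurrence_ratio(candidate, relevant, tweets):
    """Share of the candidate's tweets that also carry a known relevant hashtag."""
    with_candidate = [tags for _, tags in tweets if candidate in tags]
    if not with_candidate:
        return 0.0
    return sum(1 for tags in with_candidate if tags & relevant) / len(with_candidate)


def score_hashtag(candidate, relevant, tweets, page_terms, alpha=0.5):
    """Weighted mix of page similarity and co-occurrence (alpha is assumed)."""
    similarity = cosine(hashtag_profile(candidate, tweets), page_terms)
    co = cooccurrence_ratio(candidate, relevant, tweets)
    return alpha * similarity + (1 - alpha) * co


# Hypothetical terms taken from the event page's text and link anchors.
page_terms = Counter({"aurora": 3, "shooting": 2, "theater": 2, "colorado": 2})
# Hypothetical tweets as (text, hashtags) pairs.
tweets = [
    ("aurora theater shooting victims", {"#aurora", "#theatershooting"}),
    ("colorado shooting news", {"#theatershooting"}),
    ("new phone release", {"#tech"}),
]
relevant = {"#aurora"}  # the seed hashtag supplied as input

for tag in ("#theatershooting", "#tech"):
    print(tag, round(score_hashtag(tag, relevant, tweets, page_terms), 3))
```

Hashtags scoring above a chosen threshold would join `relevant`, and the page terms would be refreshed as the Wikipedia page evolves, giving the evolving query set described in the bullets.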
Evaluation 
Evaluated tweets containing the top relevant hashtags detected for dynamic topics 
• NDCG - 92% at top-5 Mean Average 
Precision
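NDCG at a cutoff can be computed as in the sketch below, which uses the common log2-discount definition. The graded relevance values are hypothetical, and the ideal ranking is taken from the judged list itself, which is an assumption; the slide does not state which NDCG variant was used.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded judgments."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the same judgments in ideal order."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal else 0.0


# Hypothetical graded relevance of the top-5 tweets returned for one topic.
ranked_gains = [3, 2, 3, 0, 1]
print(f"NDCG@5 = {ndcg_at_k(ranked_gains, 5):.3f}")
```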
Conclusion 
o What this presentation covered 
o Big Data challenges in leveraging Twitter. 
o Focus on “Information” overload instead of “Data” overload. 
o Wikipedia categories in the hierarchy are considered as interests by users. 
o Evolving set of hashtags as queries for dynamic events. 
o What I missed (catch me at the poster session) 
o How are knowledge bases exploited for our work? 
o Impact of knowledge bases, specifically Wikipedia.
Thanks 
Contact: 
Email: pavan@knoesis.org 
Twitter: @pavankaps

Harnessing Volume and Velocity Challenge on the Social Web using Crowd-Sourced Knowledge Bases
