Big Data Tech Conclave, 26–27 April 2013
Bangalore, India
Big Data and the Semantic Web:
Challenges and Opportunities
Srinath Srinivasa
Open Systems Laboratory
IIIT Bangalore
http://osl.iiitb.ac.in/
sri@iiitb.ac.in
http://www.bda2013.net/
OSL Releases
Topical Anchors: Given a list of noun phrases, identify a semantic topic for these terms.
Powered by a Wikipedia co-occurrence graph hosted by Agama.
Web APIs enable use of Topical Anchors in third-party applications.
OSL Releases
Topic Expansion: Given a term, expands it into semantically relevant topical clusters with different senses.
Uses co-occurrence datasets from Wikipedia 2006 or 2011.
Web APIs enable use by third-party applications.
OSL Releases
Agama: A graph database for storing large undirected graphs for efficient traversal (not structure-based retrieval).
Agama currently powers a co-occurrence graph of all noun phrases from Wikipedia articles, hosted at OSL, managing tens of millions of nodes and hundreds of millions of edges.
More data beats better algorithms..
meets
No data is an island..
Outline
● Big Data Characteristics
● Big Data Analytics
  ● Pattern-driven and Model-driven Analytics
● Big Data and the Semantic Web
● Semantic Challenges
  ● The myth of a global ontology
  ● Convergent and divergent semantics
  ● Semantic interoperability
● Technology Challenges
  ● Storage, traversal and retrieval of large-scale semantic networks
  ● Inference on Big Data
● On the road ahead
Big Data
Data that is
● Too large to be processed by conventional databases and data management techniques (Volume)
● So diverse in structure that no single data model captures all elements of the data (Variety)
● Transient and/or impermanent, especially when pertaining to dynamic phenomena (Velocity)
Big Data
● Transaction records
● Network streams
● Experimental output
● Social media data 
● Demographic records
● Citation data 
● Clickstreams
● Log data
● Weather data 
● …
Some Big Data Stats
● YouTube users upload 48 hours of video every minute http://gigaom.com/2011/05/25/youtube-48-hours-of-video-per-minute/
● Facebook data grows by 500 TB daily http://www.slashgear.com/facebook-data-grows-by-over-500-tb-daily-23243691/
● WalMart handles more than 1 million customer transactions every hour http://www.economist.com/node/15557443
● Akamai analyzes 75 million events per day for targeted advertising http://wikibon.org/blog/taming-big-data/
● 90% of data in the world today was created in the last 2 years http://wikibon.org/blog/big-data-infographics/
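To put the quoted figures on a common footing, the arithmetic below converts them into per-day and per-second rates. It is only a back-of-the-envelope sketch: the helper name `per_second` is illustrative, and decimal units (1 TB = 1000 GB) are an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60


def per_second(count, period_seconds):
    """Convert a count observed over a period into a per-second rate."""
    return count / period_seconds


# 48 hours of video per minute, expressed as hours of video per day.
youtube_hours_per_day = 48 * MINUTES_PER_DAY

# 500 TB per day as GB per second (assumes 1 TB = 1000 GB).
facebook_gb_per_second = per_second(500 * 1000, SECONDS_PER_DAY)

# 1 million transactions per hour as transactions per second.
walmart_tx_per_second = per_second(1_000_000, 60 * 60)

# 75 million events per day as events per second.
akamai_events_per_second = per_second(75_000_000, SECONDS_PER_DAY)

print(youtube_hours_per_day)
print(round(facebook_gb_per_second, 2))
print(round(walmart_tx_per_second, 2))
print(round(akamai_events_per_second, 2))
```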
Big Data Analytics
Examine Big Data for useful (often actionable) knowledge
The long spectrum of Big Data Analytics (spectrum ends: pattern driven, model driven)
● Pattern identification
● Association rule mining
● Classification/Clustering
● Record Linkage
● Security analytics
● Complex Event Processing
● Opinion mining
● Predictive modeling
Pattern Driven Analytics
● Discovery and visualization of recurring patterns in datasets
● Mostly quantitative
● Paradigms in pattern discovery:
  ● Sampling and aggregation
  ● Thresholding and filtering
Image source: Wikipedia
Pattern Driven Analytics
Sampling and Aggregation
● Query based pattern aggregation
● Based on an initial idea of what we are looking for
[Flow diagram stages: Hypothesis, Data, Query, Patterns, Aggregation, Presentation]
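A minimal sketch of the sampling-and-aggregation paradigm, assuming the hypothesis can be written as a predicate over records. The function name, the record fields and the choice of a mean as the aggregate are all illustrative assumptions, not something specified in the talk.

```python
import random
from collections import defaultdict


def sample_and_aggregate(records, predicate, key, value, sample_size, seed=0):
    """Sample records, keep those matching the query, average values per group."""
    rng = random.Random(seed)
    if sample_size >= len(records):
        sample = list(records)
    else:
        sample = rng.sample(records, sample_size)

    groups = defaultdict(list)
    for record in sample:
        if predicate(record):  # the query derived from the hypothesis
            groups[key(record)].append(value(record))

    # Aggregation step: one mean per group.
    return {k: sum(v) / len(v) for k, v in groups.items()}


# Hypothetical records: hypothesis is "large sales differ by region".
sales = [
    {"region": "south", "amount": 120},
    {"region": "south", "amount": 80},
    {"region": "north", "amount": 50},
    {"region": "north", "amount": 10},
]
summary = sample_and_aggregate(
    sales,
    predicate=lambda r: r["amount"] >= 50,
    key=lambda r: r["region"],
    value=lambda r: r["amount"],
    sample_size=len(sales),
)
print(summary)
```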
Pattern Driven Analytics
Thresholding and Filtering
● Based on sifting through the entire dataset (or a view) to look for “interesting” patterns without the context of a query
[Flow diagram stages: Data, Interestingness criteria, Patterns, Filtering and Segregation, Presentation]
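One way to picture thresholding and filtering is a full scan that keeps only patterns passing an interestingness criterion. The sketch below assumes that criterion is minimum support for co-occurring item pairs; the function name and the sample transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Scan every transaction and keep item pairs whose support passes the threshold."""
    counts = Counter()
    for transaction in transactions:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(transaction)), 2):
            counts[pair] += 1

    total = len(transactions)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
]
interesting = frequent_pairs(baskets, min_support=0.5)
print(interesting)
```

No query drives the scan here; the threshold alone decides which patterns are presented.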
Model Driven Analytics
Analytics as a model-discovery problem
[Figure: images (source: Wikipedia) shown as observable data, with “Wedding” as the latent concept]
Model Driven Analytics
● Pattern discovery coupled with semantic modeling
● Non-trivial qualitative modeling challenges
● Model discovery:
  ● Descriptive model discovery
    Fit a model to explain the observed data
  ● Predictive model discovery
    Discover a model that can predict values of data elements into the future
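The difference between descriptive and predictive model discovery can be sketched with the simplest possible model, a straight line fitted by ordinary least squares. The linear model and the function names are assumptions chosen for brevity; the talk does not prescribe a model family.

```python
def fit_line(xs, ys):
    """Descriptive step: fit y = slope * x + intercept to observed data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predictive step: use the fitted model on an unseen point."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical observations that happen to lie on y = 2x + 1.
observed_x = [0, 1, 2, 3]
observed_y = [1, 3, 5, 7]
model = fit_line(observed_x, observed_y)
forecast = predict(model, 5)
print(model, forecast)
```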
Linked Data
[Figure: The Linked Data Cloud as of September 2011. Image source: Wikipedia]
Linked Data
● Using Semantic Web technologies to connect data 
elements from disparate data sources
● From Web of Documents to Web of Data
● Elements of Linked Data
● URIs 
● HTTP
● Resource Description Framework (RDF)
● Serialization formats (RDFa, RDF/XML, N3, Turtle, 
and others)
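The elements listed above can be sketched as URI-identified triples written out in an N-Triples-like form. This is a simplified illustration, not a full RDF serializer: the `example.org` URIs are hypothetical, and treating any string that starts with `http://` or `https://` as a URI is an assumption made for brevity.

```python
def term(value):
    """Render a term: URIs in angle brackets, everything else as a plain literal."""
    if value.startswith(("http://", "https://")):
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> {term(o)} ." for s, p, o in triples)


RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical resources linking a city to a country and to a label.
triples = [
    ("http://example.org/city/Bangalore", RDFS_LABEL, "Bangalore"),
    (
        "http://example.org/city/Bangalore",
        "http://example.org/vocab/country",
        "http://example.org/country/India",
    ),
]
print(to_ntriples(triples))
```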
Big Data and the Semantic Web
[Diagram: Big Data → Semantic Web via model discovery; Semantic Web → Big Data via catalyzation and predictive modeling]
Big Data → Semantic Web
● One of the main elements of the Linked Data Cloud, DBpedia, is built from a Big Data resource: Wikipedia
● Open Biomedical Ontology (OBO) (http://www.oboedit.org/) created from mining PubMed publications
● Enterprise-scale Big Data Analytics helping build organizational models, operational intelligence solutions, etc. Examples: Anzo software suite by Cambridge Semantics (www.cambridgesemantics.com), Loom data management suite by Revelytix (www.revelytix.com)
Semantic Web → Big Data
Schema.org
● Collection of schemata on various topics that are recognized by major search providers and used to semantically interpret web content
SourceMap
● Linked data augmented with web content and crowdsourced data, used to provide details about companies such as their carbon footprint, energy use, water use, etc. www.sourcemap.com
OpenStreetMap
● Linked data augmenting crowdsourced data on www.openstreetmap.org helped in detailed mapping of the disaster scenario during the Jan 2010 Haiti earthquake (http://www.scientificamerican.com/article.cfm?id=berners-lee-linked-data)
Big Data and the Semantic Web: 
Challenges
Semantic challenges
● The myth of a global ontology
● Convergent and divergent semantics
Technology and system challenges
● Characteristics of a semantic graph
● Managing graph structured data
The Myth of a Global Ontology
Several “core” semantic ontologies exist:
● WordNet
● YAGO
● OpenCyc
● SUMO
However, none of them (even automated ones) can 
capture all possible semantic associations and all 
possible perspectives on a given topic
The Myth of a Global Ontology
The open world problem
● We don't know what we don't know..
● Representation bias in big data sources
The neutral-but-useless perspective
● Localized, utilitarian descriptions are often more useful than neutral, global descriptions. Ex: Use of “zones” as a geographical element in Indian Railways
● Difficult for disparate perspectives to co-exist in a single ontology without violating design principles like Occam's razor
Convergent and Divergent 
Semantics
[Diagram: Encyclopedic semantics. The Palestine POV, Israeli POV, historians' POV and UN's POV converge on the Wikipedia article on the West Bank conflict]
Convergent and Divergent 
Semantics
[Diagram: The IPL event schedule diverges into traffic planning, advertisement planning around IPL, legal structuring around IPL, TV programme scheduling and security planning]
Semantic Interoperability
● Binary predicates, as in RDF, may not capture the complete semantics of an association
  But it is too difficult to work with higher-order predicates
● Semantic queries are characterized by contextual relevance and default assumptions
● Linked Data can be useful primarily within the context of a model
  Building a model from predicates is as complex a problem as identifying predicates from data
Semantic Challenges: Summary
● Hard to distinguish data from noise without a model
  Especially hard when we are using data to help build a model!
● There may not be a single global model explaining the data
● Model construction is as challenging as predicate mining, if not more so
● No clarity on the underlying processes that aid in knowledge aggregation
  Knowledge aggregation happens differently depending on the kind of knowledge being aggregated (encyclopedic versus operational knowledge)
Tech Challenges
Storing Big Semantic Data
● Semantic data lacks the physical access coherence needed for efficient storage in relational tables
● Logical proximity of triples is more important than physical proximity
● Read/write storage models change logical proximity
● RDF graphs tend to be extremely dense and/or clustered
● Need efficient methods of graph storage and retrieval
Semantic store for Big Data
● Databases optimized to store and retrieve interrelated sets of triples of the form (subject, predicate, object)
● Query models based on answering graph queries (usually in SPARQL) rather than SQL queries
● Main design criterion: storage and read-ahead policies for triples based on their logical proximity rather than physical proximity, in order to enable Bulk Synchronous Parallel (BSP) processing
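As a toy illustration of storing triples so that logically related ones are found together, the sketch below keeps one in-memory index per triple position and answers pattern queries by intersecting them. The class and method names are hypothetical; real stores such as those discussed in this talk use on-disk layouts and SPARQL rather than this interface.

```python
from collections import defaultdict


class TripleStore:
    """In-memory (subject, predicate, object) store with one index per position."""

    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)
        self.by_predicate = defaultdict(set)
        self.by_object = defaultdict(set)

    def add(self, s, p, o):
        triple = (s, p, o)
        self.triples.add(triple)
        self.by_subject[s].add(triple)
        self.by_predicate[p].add(triple)
        self.by_object[o].add(triple)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        bound = (
            (self.by_subject, s),
            (self.by_predicate, p),
            (self.by_object, o),
        )
        candidates = [
            index.get(value, set()) for index, value in bound if value is not None
        ]
        if not candidates:
            return set(self.triples)
        return set.intersection(*candidates)


store = TripleStore()
store.add("alice", "knows", "bob")
store.add("bob", "knows", "carol")
store.add("alice", "worksAt", "OSL")
print(store.match(s="alice"))
print(store.match(p="knows", o="carol"))
```

Grouping triples by shared subject or object is one simple notion of logical proximity: everything about "alice" is reachable from a single index entry, wherever the triples were inserted.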
Semantic store for Big Data
AllegroGraph (http://www.franz.com/agraph/allegrograph/)
● NoSQL Graph based native storage for RDF triples
● ACID compliant
● Interfaces with Solr for free text indexing 
● Triple and text level indexing
● MongoDB integration
● RDFS++ Reasoning with dynamic materialization 
● SPARQL queries on named graphs and Prolog based 
inferencing engine
Semantic store for Big Data
Sesame http://www.openrdf.org/
●  Open source Java framework for parsing, storing, 
querying and inferencing over RDF data 
● Collections of RDF triples can be manipulated in memory 
using a graph data model
● Compliant with SPARQL 1.1 protocol recommendation 
● Provides two levels of APIs: SAIL (Storage and Inference 
Layer) for low level RDF processing and Repository layer 
for programmatic interfacing with Sesame
Semantic store for Big Data
Mulgara http://www.mulgara.org/
● Native storage model for RDF
● Supports multiple models (databases) per server
● ACID transactions and concurrency support
● Copy-on-write cache semantics
● Full-text search and support for data types
● Primarily useful as a repository; no evidence of support for logical inferences over RDF
Semantic store for Big Data
Other examples:
● InfiniteGraph from Objectivity http://www.objectivity.com/
● Big-Data http://www.bigdata.com/bigdata/blog/
  – A high scale-out storage and computing engine
● Agama https://github.com/arrac/agama/wiki/Agama
  – Storage, search and traversal support (Ruby library) for very large graphs
● Neo4j http://www.neo4j.org/
  – Embedded, disk-based transactional graph database written in Java
Logical inference over Big Data
● Problem: Find factual answers to specific questions by reasoning over large-scale data.
● Performing extremely large-scale deductions over large semantic datasets in interactive response time
● Need to contend with potentially inconsistent predicates, incomplete or missing values and default assumptions
● Varieties of inference over datasets
  ● Deduction
  ● Induction
  ● Abduction
  ● Statistical inference
Logical inference over Big Data
Common approaches for scalable inferencing:
● Horn clause inferencing
● Variants of random walks on knowledge graphs
● Distributed MCMC (Markov Chain Monte Carlo) 
methods
Horn Clauses
Horn clauses are predicates of the form:
p_1 ∧ p_2 ∧ ... ∧ p_n → u
where each p_i and u is an atomic sentence with no negation, and u is the single consequent
Horn clause knowledge bases can be resolved using “backward chaining”: start from the consequent and build a tree of antecedents until they are grounded in facts
Horn clause resolution can be scaled over large datasets by parallelizing resolutions using MapReduce
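Backward chaining can be sketched for the propositional case as a recursive search from the goal towards known facts. This is a single-machine illustration only, with no MapReduce parallelization; the rule encoding as (antecedents, consequent) pairs and the cycle guard are assumptions of this sketch.

```python
def backward_chain(goal, facts, rules, seen=frozenset()):
    """Return True if goal follows from facts using Horn rules.

    rules is a list of (antecedents, consequent) pairs, each standing for
    p_1 AND ... AND p_n -> u.
    """
    if goal in facts:
        return True  # grounded in a fact
    if goal in seen:
        return False  # already being proved higher up: avoid looping
    for antecedents, consequent in rules:
        if consequent != goal:
            continue
        # Prove every antecedent of a rule that concludes the goal.
        if all(
            backward_chain(a, facts, rules, seen | {goal}) for a in antecedents
        ):
            return True
    return False


facts = {"p1", "p2"}
rules = [
    (("p1", "p2"), "q"),  # p1 AND p2 -> q
    (("q",), "u"),        # q -> u
    (("r",), "v"),        # r -> v, but r is not known
]
print(backward_chain("u", facts, rules))
print(backward_chain("v", facts, rules))
```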
Random Walks on Big Data
Random walks on RDF graphs as a means of:
● Belief materialization
● Soft inference
[Figure: graph over nodes a, b, c, d, e and f with edges labelled R, illustrating inference assuming transitivity of R]
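A rough sketch of soft inference by random walk: if R is transitive, the fraction of walks from one node that reach another gives a crude confidence that R holds between them. The edge set below is hypothetical (the figure's actual edges are not recoverable here), as are the function name and parameters.

```python
import random


def walk_hit_rate(edges, start, target, walks=2000, max_steps=10, seed=0):
    """Fraction of random walks from start that reach target within max_steps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        node = start
        for _ in range(max_steps):
            successors = edges.get(node, [])
            if not successors:
                break  # dead end: this walk fails
            node = rng.choice(successors)
            if node == target:
                hits += 1
                break
    return hits / walks


# Hypothetical R-edges over the node names used in the figure.
r_edges = {"a": ["c", "d"], "c": ["e"], "d": ["f"], "b": ["d"]}

print(walk_hit_rate(r_edges, "a", "e"))  # soft belief in R(a, e)
print(walk_hit_rate(r_edges, "b", "f"))  # b has a single path to f
print(walk_hit_rate(r_edges, "b", "e"))  # no path from b to e
```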
Random Walks on Big Data
Large scale graph processing solutions for 
scaling random walks over Big Data: 
● Apache Giraph http://giraph.apache.org/
● Pregel [Malewicz et al., 2010]
● Grappa http://www.cs.washington.edu/node/4217/
MCMC
A “generic” problem solving method based on local 
sampling, useful for soft inferences on semantic data
Time homogeneous Markov Chain:
MCMC
A homogeneous Markov chain can be represented as a set of “states” and “transition probabilities” across states
Given an initial “prior” probability distribution across states, the “stationary distribution” or “equilibrium condition” is defined as:
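In the usual textbook formulation, assumed here, the chain is given by a row-stochastic transition matrix P and a distribution is stationary when applying P leaves it unchanged. The sketch below approximates it by repeatedly applying P to a prior; the names `step` and `stationary` and the two-state matrix are illustrative.

```python
def step(dist, P):
    """One transition: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(prior, P, tol=1e-12, max_iter=10_000):
    """Apply P repeatedly to the prior until the distribution stops changing."""
    dist = list(prior)
    for _ in range(max_iter):
        nxt = step(dist, P)
        if max(abs(a - b) for a, b in zip(dist, nxt)) < tol:
            return nxt
        dist = nxt
    return dist


# Hypothetical two-state chain; each row sums to 1.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]
from_first = stationary([1.0, 0.0], P)
from_second = stationary([0.0, 1.0], P)
print(from_first, from_second)
```

For this particular chain both priors converge to the same distribution, which is what makes the equilibrium a property of the chain rather than of the starting point.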
MCMC
Markov Chain Monte Carlo
Given a state space S and an “equilibrium” distribution, choose a sample s of the state space S so that a Markov chain on s results in that distribution as the stationary distribution
MCMC for logical inference
For a logical inference problem, the equilibrium condition would be of the form [0,1]^m defined over a set of m predicates
Example sampling algorithms for MCMC
Gibbs Sampling http://en.wikipedia.org/wiki/Gibbs_sampling
Metropolis-Hastings algorithm http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm
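A small sketch of the Metropolis algorithm, the symmetric-proposal special case of Metropolis-Hastings, on a discrete state space. The ring-shaped neighbourhood, the unnormalised weights and the function name are assumptions made for illustration; it is not tied to any logical inference setup from the talk.

```python
import random
from collections import Counter


def metropolis(weights, steps, seed=0):
    """Sample states 0..n-1 arranged on a ring, targeting weights / sum(weights).

    weights must be positive; the proposal moves one step left or right
    with equal probability, so it is symmetric.
    """
    rng = random.Random(seed)
    n = len(weights)
    state = 0
    visits = Counter()
    for _ in range(steps):
        proposal = (state + rng.choice((-1, 1))) % n
        accept_prob = min(1.0, weights[proposal] / weights[state])
        if rng.random() < accept_prob:
            state = proposal
        visits[state] += 1  # rejected moves count the current state again
    return [visits[i] / steps for i in range(n)]


target_weights = [1.0, 2.0, 3.0]
estimate = metropolis(target_weights, steps=200_000)
print(estimate)
```

The visit frequencies approximate the normalised weights, here 1/6, 1/3 and 1/2, without the sampler ever computing the normalising constant.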
Scaling MCMC for Big Data
Distributed MCMC
Several models have been explored for distributing MCMC computations over large datasets, making them amenable to diffusing computations. Examples include [Murray 2010; Singh et al. 2011].
Models for distributing MCMC are beyond the scope of this talk.
On the road ahead..
Some promising directions for Big Data and 
Semantics
● Diffusion models for large scale inference
● Cognitive models for semantics over large scale data
● Model-based reasoning and reasoning across models
● Soft (probabilistic) inferences, confidence measures, 
relevance feedback
● Continuous learning over Big Data 
Thank You!
References
● Neal Madras. Introduction to Markov Chain Monte Carlo. http://www.cs.cornell.edu/selman/cs475/lectures/intro-mcmc-lukas.pdf
● Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10). ACM, New York, NY, USA, 135-146. DOI=10.1145/1807167.1807184 http://doi.acm.org/10.1145/1807167.1807184
● Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 529-539.
● Lawrence Murray. 2010. Distributed Markov Chain Monte Carlo. In Proceedings of the NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds. http://lccc.eecs.berkeley.edu/
● Stefan Schoenmackers, Oren Etzioni, and Daniel S. Weld. 2008. Scaling textual inference to the web. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 79-88.
● Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning first-order Horn clauses from web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 1088-1098.
● Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2011. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 793-803.

Big Data and the Semantic Web: Challenges and Opportunities

  • 1. Big Data Tech Conclave, 26—27 April 2013 Bangalore, India Big Data and the Semantic Web: Challenges and Opportunities Srinath Srinivasa Open Systems Laboratory IIIT Bangalore http://osl.iiitb.ac.in/ sri@iiitb.ac.in
  • 2. Big Data Tech Conclave, 26—27 April 2013 Bangalore, India http://www.bda2013.net/
  • 3. Big Data Tech Conclave, 26—27 April 2013 Bangalore, India OSL Releases Topical Anchors: Given  a list of noun phrases,  identify a semantic  topic for these terms. Powered by Wikipedia  co­occurrence graph  hosted by Agama Web APIs enable use of  Topical Anchors in  third party applications 
  • 4. Big Data Tech Conclave, 26—27 April 2013 Bangalore, India OSL Releases Topic Expansion: Given a term, expands it into semantically relevant topical clusters with different senses. Uses co-occurrence datasets from Wikipedia 2006 or 2011. Web APIs enable use by third party applications
  • 5. Big Data Tech Conclave, 26—27 April 2013 Bangalore, India OSL Releases Agama: A graph database for  storing large undirected graphs  for efficient traversal (not  structure­based retrieval) Currently Agama powers a co­ occurrence graph of all noun­ phrases from Wikipedia articles  hosted in OSL, managing 10s of  millions of nodes and 100s of  millions of edges 
  • 6. Big Data Tech Conclave, 26—27 April 2013 Bangalore, India More data beats better algorithms.. meets No data is an island..
  • 7. Big Data Tech Conclave, 26—27 April 2013 Bangalore, India Outline ● Big Data Characteristics ● Big Data Analytics ● Pattern­driven and Model­driven Analytics ● Big Data and the Semantic Web ● Semantic Challenges ● The myth of a global ontology ● Convergent and divergent semantics ● Semantic interoperability  ● Technology Challenges ● Storage, traversal and retrieval of large­scale semantic networks ● Inference on Big Data ● On the road ahead
  • 8. Big Data Tech Conclave, 26—27 April 2013 Bangalore, India Big Data Data that is  ● Too large to be processed by conventional  databases and data management techniques  (Volume) ● Too diverse in structure that no single data model  captures all elements of the data (Variety) ● Transient and/or impermanent, especially when  pertaining to dynamic phenomena (Velocity)
  • 9. Big Data Tech Conclave, 26—27 April 2013 Bangalore, India Big Data ● Transaction records ● Network streams ● Experimental output ● Social media data  ● Demographic records ● Citation data  ● Clickstreams ● Log data ● Weather data  ● …
  • 10. Big Data Tech Conclave, 26—27 April 2013 Bangalore, India Some Big Data Stats ● YouTube users upload 48 hours of video every minute  http://gigaom.com/2011/05/25/youtube­48­hours­of­video­per­minute/ ● Facebook data grows by 500TB daily  http://www.slashgear.com/facebook­data­grows­by­over­500­tb­daily­23243691/ ● WalMart handles more than 1 million customer  transactions every hour http://www.economist.com/node/15557443 ● Akamai analyzes 75 million events per day for  targeted advertising http://wikibon.org/blog/taming­big­data/ ● 90% of data in the world today was created in the last  2 years http://wikibon.org/blog/big­data­infographics/ 
  • 11. Big Data Tech Conclave, 26—27 April 2013 Bangalore, India Big Data Analytics Examine Big Data for useful (often actionable)  knowledge The long spectrum of Big Data Analytics Pattern identification Association rule mining Classification/Clustering Record Linkage Security analytics Complex Event Processing Opinion mining Predictive modeling Pattern driven Model driven
  • 12. Big Data Tech Conclave, 26—27 April 2013 Bangalore, India Pattern Driven Analytics ● Discovery and visualization  of recurring patterns in  datasets ● Mostly quantitative ●  Paradigms in pattern  discovery: ● Sampling and  aggregation ● Thresholding and  filtering Image Source: Wikipedia
  • 13. Big Data Tech Conclave, 26—27 April 2013 Bangalore, India Pattern Driven Analytics Sampling and Aggregation ● Query based pattern aggregation ● Based on an initial idea of what we are looking  for Hypothesis Data Query Patterns Aggregation Presentation
  • 14. Pattern Driven Analytics Thresholding and Filtering ● Based on sifting through the entire dataset (or a view) to look for “interesting” patterns without the context of a query Data Interestingness criteria Patterns Filtering and Segregation Presentation
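A minimal sketch of thresholding and filtering, assuming that "interestingness" is measured as the support of an item pair (the fraction of transactions containing both items). The baskets and the 0.5 threshold are hypothetical; the talk does not prescribe a particular criterion.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets; no query or hypothesis is supplied up front.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "jam"},
]


def frequent_pairs(transactions, min_support):
    """Scan every transaction and keep pairs whose support meets the threshold."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    total = len(transactions)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


patterns = frequent_pairs(baskets, min_support=0.5)
print(patterns)
```

Unlike the query-driven case, the whole dataset is scanned and the threshold alone decides what is reported.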
  • 15. Model Driven Analytics Analytics as a model-discovery problem Wedding Images source: Wikipedia Observable Data Latent Concept
  • 16. Model Driven Analytics ● Pattern discovery coupled with semantic modeling ● Non-trivial qualitative modeling challenges ● Model discovery: ● Descriptive model discovery: Fit a model to explain the observed data ● Predictive model discovery: Discover a model that can predict values of data elements into the future
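As a small illustration of the two kinds of model discovery, the sketch below fits a straight line to hypothetical observations (descriptive) and then extrapolates it one step ahead (predictive). Ordinary least squares is an assumption chosen for brevity; the talk does not name a specific model family.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical observations: day number versus terabytes ingested.
days = [1, 2, 3, 4]
ingested = [10.0, 12.0, 14.0, 16.0]

# Descriptive: a model that explains the observed data.
slope, intercept = fit_line(days, ingested)

# Predictive: the same model used to estimate a future value.
forecast_day_5 = slope * 5 + intercept
print(slope, intercept, forecast_day_5)
```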
  • 17. Linked Data Image source: Wikipedia The Linked Data Cloud as of September 2011
  • 18. Linked Data ● Using Semantic Web technologies to connect data elements from disparate data sources ● From Web of Documents to Web of Data ● Elements of Linked Data ● URIs ● HTTP ● Resource Description Framework (RDF) ● Serialization formats (RDFa, RDF/XML, N3, Turtle, and others)
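To make the elements above concrete, here is a minimal sketch that represents RDF statements as (subject, predicate, object) tuples of URIs and writes them in N-Triples, a line-based serialization that is also valid Turtle. The `http://example.org/` namespace and the `locatedIn` predicate are hypothetical, and the sketch handles URI-only triples (no literals or blank nodes).

```python
# Hypothetical namespace; real Linked Data would use dereferenceable HTTP URIs.
EX = "http://example.org/"

triples = [
    (EX + "Bangalore", EX + "locatedIn", EX + "India"),
    (EX + "IIITB", EX + "locatedIn", EX + "Bangalore"),
]


def to_ntriples(rows):
    """Serialize URI-only triples, one '<s> <p> <o> .' statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in rows)


document = to_ntriples(triples)
print(document)
```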
  • 19. Big Data and the Semantic Web ● Big Data → Semantic Web: Model Discovery ● Semantic Web → Big Data: Catalyzation and Predictive Modeling
  • 20. Big Data → Semantic Web ● One of the main elements of the Linked Data Cloud, DBpedia, is built from a Big Data resource: Wikipedia ● Open Biomedical Ontology (OBO) (http://www.oboedit.org/) created from mining PubMed publications ● Enterprise scale Big Data Analytics helping build organizational models, operational intelligence solutions, etc. Examples: Anzo software suite by Cambridge Semantics (www.cambridgesemantics.com), Loom data management suite by Revelytix (www.revelytix.com)
  • 21. Semantic Web → Big Data Schema.org ● Collection of schemata on various topics that are recognized by major search providers and used to semantically interpret web content SourceMap ● Linked data augmented with web content and crowdsourced data, used to provide details about companies such as their carbon footprint, energy use, water use, etc. www.sourcemap.com OpenStreetMap ● Linked data augmenting crowdsourced data on www.openstreetmap.org helped in detailed mapping of the disaster scenario during the Jan 2010 Haiti earthquake (http://www.scientificamerican.com/article.cfm?id=berners-lee-linked-data)
  • 22. Big Data and the Semantic Web: Challenges Semantic challenges ● The myth of a global ontology ● Convergent and divergent semantics Technology and system challenges ● Characteristics of a semantic graph ● Managing graph structured data
  • 23. The Myth of a Global Ontology Several “core” semantic ontologies exist: ● WordNet ● YAGO ● OpenCyc ● SUMO However, none of them (even automated ones) can capture all possible semantic associations and all possible perspectives on a given topic
  • 24. The Myth of a Global Ontology The open world problem ● We don't know what we don't know.. ● Representation bias in big data sources The neutral-but-useless perspective ● Localized, utilitarian descriptions are often more useful than neutral, global descriptions. Ex: Use of “zones” as a geographical element in Indian Railways ● Difficult for disparate perspectives to co-exist in a single ontology without violating design principles like Occam's razor
  • 25. Convergent and Divergent Semantics Wikipedia article on West Bank conflict Palestine POV Israeli POV Historians' POV UN's POV Encyclopedic Semantics
  • 26. Convergent and Divergent Semantics IPL event schedule Traffic planning Advertisement planning around IPL Legal structuring around IPL TV programme scheduling Security planning
  • 27. Semantic Interoperability ● Binary predicates like RDF may not capture the complete semantics of an association, but it is too difficult to work with higher-order predicates ● Semantic queries are characterized by contextual relevance and default assumptions ● Linked Data can be useful primarily within the context of a model. Model-building from predicates is as complex a problem as identifying predicates from data
  • 28. Semantic Challenges: Summary ● Hard to distinguish data from noise without a model. Especially hard when we are using data to help build a model! ● There may not be a single global model explaining the data ● Model construction is as challenging as predicate mining, if not more so ● No clarity on the underlying processes that aid in knowledge aggregation. Knowledge aggregation happens differently depending on the kind of knowledge being aggregated (encyclopedic versus operational knowledge)
  • 29. Tech Challenges Storing Big Semantic Data ● Semantic data lacks the physical access coherence needed to be stored efficiently in relational tables ● Logical proximity of triples is more important than physical proximity ● Read/Write storage models change logical proximity ● RDF graphs tend to be extremely dense and/or clustered ● Need efficient methods of graph storage and retrieval
  • 30. Semantic store for Big Data ● Databases optimized to store and retrieve interrelated sets of triples of the form (subject, predicate, object) ● Query models based on answering graph queries (usually in SPARQL) rather than SQL queries ● Main design criteria: storage and read-ahead policies of triples based on their logical proximity rather than physical proximity in order to enable Bulk Synchronous Parallel (BSP) processing
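A toy in-memory sketch of the idea, assuming that "logical proximity" is approximated by grouping triples under their subject. The `TripleStore` class and its `match` method are hypothetical names for illustration; pattern matching with wildcards stands in for a graph query and is not SPARQL.

```python
from collections import defaultdict


class TripleStore:
    """Toy store that keeps triples sharing a subject together."""

    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)

    def add(self, s, p, o):
        self.triples.add((s, p, o))
        self.by_subject[s].add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        if s is not None:
            candidates = self.by_subject.get(s, set())
        else:
            candidates = self.triples
        return sorted(
            t for t in candidates
            if (p is None or t[1] == p) and (o is None or t[2] == o)
        )


store = TripleStore()
store.add("Agama", "type", "GraphDatabase")
store.add("Agama", "stores", "CooccurrenceGraph")
store.add("Neo4j", "type", "GraphDatabase")

# Which subjects are graph databases?
graph_dbs = [s for s, _, _ in store.match(p="type", o="GraphDatabase")]
print(graph_dbs)
```

A query that fixes the subject touches only the triples stored next to each other, which is the access pattern a read-ahead policy based on logical proximity would try to exploit.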
  • 31. Semantic store for Big Data AllegroGraph (http://www.franz.com/agraph/allegrograph/) ● NoSQL Graph based native storage for RDF triples ● ACID compliant ● Interfaces with Solr for free text indexing ● Triple and text level indexing ● MongoDB integration ● RDFS++ Reasoning with dynamic materialization ● SPARQL queries on named graphs and Prolog based inferencing engine
  • 32. Semantic store for Big Data Sesame http://www.openrdf.org/ ● Open source Java framework for parsing, storing, querying and inferencing over RDF data ● Collections of RDF triples can be manipulated in memory using a graph data model ● Compliant with SPARQL 1.1 protocol recommendation ● Provides two levels of APIs: SAIL (Storage and Inference Layer) for low level RDF processing and Repository layer for programmatic interfacing with Sesame
  • 33. Semantic store for Big Data Mulgara http://www.mulgara.org/ ● Native storage model for RDF ● Supports multiple models (databases) per server ● ACID transactions and concurrency support ● Copy-on-write cache semantics ● Full-text search and support for data types ● Primarily useful as a repository – no evidence of support for logical inferences over RDF
  • 34. Semantic store for Big Data Other examples: ● InfiniteGraph from Objectivity http://www.objectivity.com/ ● Big-Data http://www.bigdata.com/bigdata/blog/ – A high scale-out storage and computing engine ● Agama https://github.com/arrac/agama/wiki/Agama – Storage, search and traversal support (Ruby library) for very large graphs ● Neo4j http://www.neo4j.org/ – Embedded, disk-based transactional graph database written in Java
  • 35. Logical inference over Big Data ● Problem: Find factual answers to specific questions by reasoning over large-scale data. ● Performing extremely large-scale deductions over large semantic datasets in interactive response time ● Need to contend with potentially inconsistent predicates, incomplete or missing values and default assumptions ● Varieties of inference over datasets ● Deduction ● Induction ● Abduction ● Statistical inference
  • 36. Logical inference over Big Data Common approaches for scalable inferencing: ● Horn clause inferencing ● Variants of random walks on knowledge graphs ● Distributed MCMC (Markov Chain Monte Carlo) methods
  • 37. Horn Clauses Horn clauses are predicates of the form $p_1 \wedge p_2 \wedge \dots \wedge p_n \rightarrow u$, where each $p_i$ is an atomic sentence with no negation and $u$ is a single consequent. Horn clause knowledge bases can be resolved using “backward chaining”, starting from the consequent and building a tree of antecedents until they are grounded in facts. Horn clause resolution can be scaled over large datasets by parallelizing resolutions using MapReduce.
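A minimal single-machine sketch of backward chaining over propositional Horn clauses, assuming a hypothetical rule base about weather. Each rule is a pair (set of antecedents, consequent). The MapReduce parallelization mentioned on the slide is not shown here.

```python
# Hypothetical knowledge base: ground facts and Horn rules.
facts = {"rains", "has_umbrella"}
rules = [
    ({"rains"}, "wet_ground"),
    ({"wet_ground"}, "slippery"),
    ({"rains", "has_umbrella"}, "stays_dry"),
    ({"snows"}, "cold"),
]


def prove(goal, facts, rules, visiting=frozenset()):
    """Backward chaining: start at the goal and expand antecedents until facts."""
    if goal in facts:
        return True
    if goal in visiting:
        # The goal is already being expanded on this branch; avoid looping.
        return False
    visiting = visiting | {goal}
    for antecedents, consequent in rules:
        if consequent == goal and all(
            prove(a, facts, rules, visiting) for a in antecedents
        ):
            return True
    return False


print(prove("slippery", facts, rules))  # slippery <- wet_ground <- rains
print(prove("cold", facts, rules))      # needs "snows", which is not a fact
```

The recursion builds exactly the tree of antecedents described above: `slippery` is reduced to `wet_ground`, which is reduced to the fact `rains`.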
  • 38. Random Walks on Big Data Random walks on RDF graphs as a means of: ● Belief materialization ● Soft inference [Figure: graph over nodes a, b, c, d, e, f with edges labeled R] Assuming transitivity of R
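A minimal sketch of soft inference by random walks, assuming R is transitive so that reaching a node by following R edges counts as evidence for R(source, target). The edge list below is hypothetical (the slide's figure layout is not recoverable), and the fraction of walks that reach the target is used as a rough confidence score.

```python
import random

# Hypothetical directed R edges over the nodes a..f.
edges = {
    "a": ["c", "d"],
    "b": ["a"],
    "c": ["e"],
    "d": ["f"],
    "e": [],
    "f": [],
}


def reach_score(graph, source, target, walks=2000, max_steps=5, seed=0):
    """Fraction of bounded random walks from source that hit target."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        node = source
        for _ in range(max_steps):
            successors = graph[node]
            if not successors:
                break  # dead end: this walk cannot continue
            node = rng.choice(successors)
            if node == target:
                hits += 1
                break
    return hits / walks


print(reach_score(edges, "a", "e"))  # roughly half the walks go via c
print(reach_score(edges, "a", "b"))  # b is not reachable from a
```

The score is a sampled estimate, so it trades exactness for the ability to bound the work per query, which is what makes the approach attractive on very large graphs.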
  • 39. Random Walks on Big Data Large scale graph processing solutions for scaling random walks over Big Data: ● Apache Giraph http://giraph.apache.org/ ● Pregel [Malewicz et al., 2010] ● Grappa http://www.cs.washington.edu/node/4217/
  • 40. MCMC A “generic” problem solving method based on local sampling, useful for soft inferences on semantic data Time homogeneous Markov Chain:
  • 41. MCMC A homogeneous Markov chain can be represented as a set of “states” and “transition probabilities” across states. Given an initial “prior” probability distribution across states, the “stationary distribution” or “equilibrium condition” is defined as:
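One standard way to write these definitions is given below. The symbols are assumptions chosen here rather than the slide's own notation: $X_n$ is the state at step $n$, $P$ is the matrix of transition probabilities, $\pi^{(0)}$ is the prior distribution (as a row vector) and $\pi$ is the stationary distribution.

```latex
% Time-homogeneous chain: transition probabilities do not depend on the step n
\Pr(X_{n+1} = j \mid X_n = i) = P_{ij} \quad \text{for all } n \ge 0

% Distribution over states after n steps, starting from the prior
\pi^{(n)} = \pi^{(0)} P^{n}

% Stationary (equilibrium) distribution: unchanged by one more transition
\pi = \pi P, \qquad \sum_{i} \pi_i = 1, \qquad \pi_i \ge 0
```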
  • 42. MCMC Markov Chain Monte Carlo Given a state space S and an “equilibrium” distribution $\pi$, choose a sample s of the state space S so that a Markov chain on s results in $\pi$ as the stationary distribution MCMC for logical inference For a logical inference problem, the equilibrium condition would be of the form $[0,1]^m$ defined over a set of m predicates Example sampling algorithms for MCMC: ● Gibbs Sampling http://en.wikipedia.org/wiki/Gibbs_sampling ● Metropolis-Hastings algorithm http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm
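A minimal sketch of the Metropolis-Hastings algorithm on a three-state space, assuming a hypothetical target distribution and a uniform proposal over all states. Because that proposal is symmetric, the acceptance probability reduces to min(1, target[proposal] / target[current]). The long-run visit frequencies should approximate the target.

```python
import random
from collections import Counter

# Hypothetical equilibrium distribution over three states.
target = {"s0": 0.1, "s1": 0.3, "s2": 0.6}


def metropolis_hastings(target, steps=20000, seed=0):
    """Estimate the target by the visit frequencies of an MH chain."""
    rng = random.Random(seed)
    states = list(target)
    current = states[0]
    visits = Counter()
    for _ in range(steps):
        proposal = rng.choice(states)  # symmetric (uniform) proposal
        accept_prob = min(1.0, target[proposal] / target[current])
        if rng.random() < accept_prob:
            current = proposal
        visits[current] += 1
    return {s: visits[s] / steps for s in states}


estimate = metropolis_hastings(target)
print(estimate)
```

Only ratios of target probabilities are used, so the same loop works when the target is known only up to a normalizing constant.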
  • 43. Scaling MCMC for Big Data Distributed MCMC Several models have been explored for distributing MCMC computations over large datasets, making them amenable to diffusing computations. Some examples include: [Murray 2010; Singh et al 2011] Distributional models for MCMC are beyond the scope of this talk.
  • 44. On the road ahead.. Some promising directions for Big Data and Semantics ● Diffusion models for large scale inference ● Cognitive models for semantics over large scale data ● Model-based reasoning and reasoning across models ● Soft (probabilistic) inferences, confidence measures, relevance feedback ● Continuous learning over Big Data
  • 45. Thank You!
  • 46. References ● Neal Madras. Introduction to Markov Chain Monte Carlo. http://www.cs.cornell.edu/selman/cs475/lectures/intro-mcmc-lukas.pdf ● Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (SIGMOD '10). ACM, New York, NY, USA, 135-146. DOI=10.1145/1807167.1807184 http://doi.acm.org/10.1145/1807167.1807184 ● Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 529-539. ● Lawrence Murray. Distributed Markov Chain Monte Carlo. Proceedings of NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds. http://lccc.eecs.berkeley.edu/ ● Stefan Schoenmackers, Oren Etzioni, and Daniel S. Weld. 2008. Scaling textual inference to the web. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 79-88. ● Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning first-order Horn clauses from web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 1088-1098. ● Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2011. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT '11), Vol. 1. Association for Computational Linguistics, Stroudsburg, PA, USA, 793-803.