TOPIC MODELING USING
BIG DATA ANALYTICS
-BY
SARAH MASUD(12-CSS-57)
FARHEEN NILOFER(12-CSS-23)
INTRODUCTION
WORK FLOW:
Installation of Hadoop on multiple nodes - for distributed processing.
Preprocessing of the data set - cleaning and conversion of the data into the desired format.
Passing the converted data to the modeling tool.
Parallelizing the computation and selecting the algorithm.
Comparison of results on the basis of:
Efficiency of different modelling algorithms.
Computation on single vs multiple nodes.
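The preprocessing step above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the project: the stop-word list, the function names and the bag-of-words output format are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real pipelines use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}

def preprocess(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def to_bag_of_words(docs):
    """Convert raw documents into a sorted vocabulary and per-document word-id counts."""
    tokenised = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    bags = [Counter(word_id[t] for t in doc) for doc in tokenised]
    return vocab, bags

docs = ["Hadoop stores the data on HDFS.", "Topic models uncover topics in the data!"]
vocab, bags = to_bag_of_words(docs)
```

Each entry of bags maps a word id to its count in one document, which is the kind of converted input a topic modeling tool typically expects.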
WHAT Is Big Data and Topic Modelling?
Big Data:
Data that cannot be stored or processed by traditional computing techniques.
EXAMPLES: Black Box Data, Social Media, Space Exploration, Power and Grid Stations, Search Engine Data…
Topic Modelling:
Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections.
IN LAYMAN'S TERMS
A method of text mining to identify patterns in a corpus. Topic modeling helps us develop new ways to search, browse and summarize large archives of texts.
[Figure: the three Vs of Big Data: Volume, Variety, Velocity]
TOPIC MODELING IN IMPLEMENTATION
WHY Topic Modelling using Big Data?
Topic models create several different variables for every single word in the corpus. The models we would be running, with roughly 2,000 documents, reach the edge of what can be done on an average desktop machine, and commonly take a day.
Hadoop is a framework that provides the facilities needed to model such a large data set.
[Figure: time taken to generate 5 million gigabytes of data. Labels: till 2003; 2 days (2011); 10 mins (2013); ?? (2015)]
HADOOP and its COMPONENTS
Hadoop:
An open source framework written in Java.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.(confi)
It has two major components:
HDFS (Hadoop Distributed File System) - for storage.
MapReduce - for processing of data (programming model).
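To make the MapReduce programming model concrete, here is a small single-process sketch of a word count. It only imitates the map, shuffle and reduce phases in plain Python; it does not use Hadoop's API, and the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    for word in text.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reducer: combine all values for one key into a single total."""
    return word, sum(counts)

def run_job(records):
    """Run map, shuffle and reduce in sequence on a list of (id, text) records."""
    pairs = [kv for doc_id, text in records for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

records = [(0, "big data needs big storage"), (1, "data processing")]
word_counts = run_job(records)
```

On a real cluster the mappers and reducers would run on different DataNodes over blocks stored in HDFS; the sketch keeps only the data flow.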
Hadoop Installation:
Cluster of 5 (in our case) commodity machines.
NameNode - the manager.
DataNodes - the actual storage and processing units.
COMPARISON OF EXECUTION TIME
HOW Topic Modelling is Achieved Using
Big Data Analytics?
Proposed Algorithms:
Probabilistic Latent Semantic Indexing (PLSI):
A statistical technique for the analysis of two-mode and co-occurrence data.
Latent Dirichlet allocation (LDA):
A way of automatically discovering the topics that sentences contain.
Pachinko allocation:
Models correlations between topics in addition to the word correlations which constitute topics.
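As an illustration of how LDA can discover topics, the following is a minimal collapsed Gibbs sampler in plain Python. It is a sketch under stated assumptions, not the implementation used by Mallet or by this project: the hyperparameters alpha and beta, the iteration count, the seed and the toy corpus are all chosen for the example.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on docs given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens of word w in topic k
    topic_total = [0] * num_topics                              # all tokens in topic k
    assign = []
    # Give every token a random initial topic and record it in the counts.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Toy corpus (assumed): word ids 0-2 and 3-5 never appear in the same document.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 3, 4, 5]]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=6)
```

Each row of doc_topic counts how many tokens of a document were assigned to each topic, and each row of topic_word counts how often each word was assigned to a topic. The sampler keeps one topic variable per token, which is why the cost grows with corpus size.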
[Figure: graphical model over the variables d, w and z]
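For reference, the PLSI aspect model is commonly written as the factorization below, where d is a document, w a word and z a latent topic. The notation is the standard one from the literature and is assumed here; it is not taken from the slides.

```latex
P(d, w) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d)
```

The sum over z expresses that each word in a document is generated through one of a small number of hidden topics.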
HOW Topic Modelling is Achieved Using
Big Data Analytics?
TOPIC MODELLING TOOLS
Tool: Mallet
Model/Algorithm: LDA (including Naïve Bayes, Maximum Entropy, and Decision Trees)
Language: Java
Introduction: efficient routines for converting text to "features", a wide variety of algorith…
WHERE is Topic Modeling Using Big
Data Applied?
SOME APPLICATIONS OF TOPIC MODELING INCLUDE:
Topic Modeling for analyzing news articles.
Topic Modeling for PageRank in Search Engines.
Finding patterns in genetic data, images, social graphs.
Topic modeling on historical journals.
THANK YOU
