VIDEO-TO-TEXT SUMMARIZATION USING NLP:
TRANSFORMING VISUAL CONTENT INTO CONCISE TEXT
SUMMARIES
Reg.no Name
23B81D5906 Mekala Hari Ranjitha Nalini
Guide:
Dr. N. Deepak
Professor
Department of CSE,
Sir C.R.Reddy College of Engineering
SIR C R REDDY COLLEGE OF ENGINEERING, ELURU
Approved by AICTE & Permanently Affiliated to JNTUK, Kakinada
Accredited by NBA, Accredited by NAAC with ‘A’ Grade
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
OUTLINE OF THE PRESENTATION
• Abstract
• Introduction
• Literature Survey
• Problem Statement
• Existing System
• Proposed System
• Code and Implementation
• Output Screens
• Conclusion
ABSTRACT
▰ Video summarization aims to produce a high-quality text-based summary of videos.
▰ The process involves converting video files to audio files, followed by converting the
audio into text.
▰ Transformer architecture of Natural Language Processing (NLP) enhances the
workflow.
▰ An extractive-video-summarizer is introduced using state-of-the-art pre-trained ML
models and open-source libraries.
▰ The summarizer follows a systematic regime consisting of five stages:
▻ Preparation of a multidisciplinary dataset of videos.
▻ Audio extraction from video files.
▻ Text generation from audio files using Automatic Speech Recognition (ASR).
▻ Text summarization using extractive summarizers.
▻ Entity extraction using Named Entity Recognition (NER).
ABSTRACT (cont…)
▰ The project was conducted primarily on English-language videos.
▰ The model performs significantly well and generates accurate, contextually relevant tags for
videos.
▰ Video datasets are collected from various domains to ensure diversity.
▰ Specialized tools are used for extracting audio from video files.
▰ Advanced ASR systems are employed to ensure accurate speech-to-text conversion.
▰ Extractive summarizers generate concise and informative summaries.
▰ Named Entity Recognition (NER) identifies key entities like names, locations, and events.
▰ State-of-the-art pre-trained models enhance performance and accuracy.
▰ Evaluation metrics demonstrate the model’s effectiveness in generating relevant summaries.
▰ Effective content management and information retrieval are achieved through the generated
summaries.
▰ The extractive-video-summarizer offers a robust solution for video content analysis and
summarization.
INTRODUCTION
• Video summarization is an essential technique for generating concise, high-quality
text-based summaries of videos.
• It helps users quickly understand the core information and key insights from video
content.
• The summarization process involves converting video files to audio and
subsequently transcribing the audio to text.
• Transformer-based Natural Language Processing (NLP) models significantly
enhance the accuracy and quality of the summaries.
• Existing text summarization models have paved the way for advancements in video
summarization.
• Our proposed extractive-video-summarizer leverages state-of-the-art pre-trained
Machine Learning (ML) models and open-source libraries.
• The model follows a structured approach, encompassing video data collection,
audio extraction, transcription, extractive summarization, and entity extraction.
INTRODUCTION
• The summarizer ensures effective content management and rapid information retrieval.
• Robust evaluation metrics confirm its effectiveness in generating accurate and relevant
summaries.
• The entity extraction feature further enhances summary quality by identifying key
information like names, locations, and events.
• Open-source libraries provide flexibility and seamless integration into various applications.
• The model’s systematic regime ensures adaptability across diverse video datasets from
multiple domains.
• Its advanced ASR systems offer precise speech-to-text conversion, facilitating accurate
transcription.
• Evaluation results indicate superior performance compared to traditional methods.
• This research demonstrates the practical application of AI in automating video content
analysis and management.
• The extractive-video-summarizer provides a scalable, efficient, and reliable solution for
video analysis.
• Ultimately, it enhances the accessibility of information by generating insightful video
summaries in a time-efficient manner.
INTRODUCTION
• Speech Recognition is a prominent field within machine learning, widely applied
across various domains.
• It powers applications like automatic subtitles on platforms such as Netflix and
YouTube.
• Popular voice assistants like Google Home Mini, Amazon Alexa, and Apple Siri rely
heavily on Speech Recognition.
• Named Entity Recognition (NER) is a crucial Natural Language Processing (NLP)
technique that identifies and extracts specific entities from text.
• NER can detect product names, events, and locations, enhancing search engines,
chatbots, and automated data entry systems.
• Text analysis using NER enables the classification of entities into predefined
categories like dates, phone numbers, or monetary values.
• The primary objective of our model is to generate audio files from videos, convert
them into text, and extract relevant entities.
• Using NLP, applications can process video content to produce text transcripts and
extract entities.
INTRODUCTION
• Extracted entities are used to generate meaningful tags that enrich video metadata.
• This enriched metadata significantly enhances content recommendations for users.
• Entity extraction streamlines content management and makes video data more
accessible.
• Video platforms can deliver personalized content by leveraging extracted entities for
better recommendations.
• Automated entity extraction reduces manual effort, improving operational efficiency.
• Our model ensures accurate entity extraction by utilizing pre-trained NLP models.
• It supports multiple languages, broadening its usability and reach.
• Evaluations indicate its effectiveness in improving content discoverability and user
experience.
• By integrating Speech Recognition and NER, the model provides a comprehensive
solution for video content analysis.
• Ultimately, it offers a robust, scalable, and intelligent framework for video
summarization and entity extraction.
Video Summarization Techniques and Their Contributions
Video summarization techniques are classified by their characteristics and properties, as shown in Fig. 1.
Feature-Based Video Summarization (VS) Techniques
• Feature-based techniques focus on video characteristics such as motion, color, gesture,
audio-visual aspects, speech, and objects.
• Low-level features like color and texture are commonly used for video content extraction.
Clustering-Based VS Techniques
• Clustering techniques like k-means, partitioning, and spectral clustering are widely used for video summarization.
• The summary length is determined by content selection criteria and various evaluation
techniques.
Shot Selection-Based VS Techniques
• Generic video summaries are created using keyframe extraction, shot boundary detection, scene change methods, and redundancy reduction.
• Video skimming involves reducing redundancy and detecting objects or events.
• Function-based methods use attention mechanisms to identify important video segments.
• Structure-based methods exploit hierarchical story structures using frames and shots.
Event-Based VS Techniques
• Video summaries are generated based on objects, events, perceptions, and features.
• High-level features such as specific faces, motions, and gestures provide reliable content information.
• Events are extracted from keyframes using minimum and maximum frame boundaries.
• Graph theory and scale-free networks are used for video event extraction in mono-view videos.
• Multi-view videos use techniques like Basic Local Alignment Search.
• State-of-the-art techniques generate event summaries for sports videos like soccer, cricket, tennis, and basketball.
Trajectory-Based VS Techniques
• Initial projects focused on static video summaries.
• Dynamic video summaries are created using trajectory-based methods with stationary
backgrounds.
• These methods are computationally expensive and require significant resources.
• Deep learning approaches provide effective solutions for detecting important video content.
LITERATURE SURVEY
Problem Statement
Video summarization using NLP remains a challenging task due to the
diversity and complexity of video content. Existing methods often
struggle with accurately extracting relevant information from videos,
resulting in low-quality summaries. Additionally, techniques relying on
low-level features like color and texture lack contextual understanding.
There is a need for more robust methodologies that combine advanced
NLP techniques, entity extraction, and deep learning models to generate
meaningful video summaries. This project aims to address these
challenges by developing efficient video-to-text summarization systems.
Existing System
The problems with existing systems are:
• Complexity and Diversity of Video Content
Videos contain various elements like scenes, objects, and interactions,
making it challenging to extract relevant information.
• Low-Quality Summaries
Existing methods often generate inaccurate or incomplete summaries due to
poor feature selection.
• Lack of Contextual Understanding
Approaches using low-level features like color, texture, or motion fail to
comprehend the context of the video.
• Inefficient Use of NLP Techniques
Insufficient utilization of advanced NLP models for understanding the
semantics and generating meaningful summaries.
• Need for Robust Solutions
There is a requirement for improved methodologies combining deep learning models, entity extraction, and language understanding for better video-to-text summarization.
PROPOSED SYSTEM
Proposed System Architecture
The figure shows a block diagram of the system architecture, outlining the key stages of our model:
Video file → Extractive Summarization → Abstractive Summarization → Encoder & Decoder → Named Entity Recognition → Text Summarization
Video File: Serves as the input to the system.
Extractive Summarization: Selects key sentences directly
from the transcribed text.
Abstractive Summarization: Generates a concise and
coherent summary using natural language generation
techniques.
Encoder & Decoder: Processes the text using a transformer-
based mechanism to understand its context and meaning.
Named Entity Recognition (NER): Identifies and categorizes
entities like names, locations, and dates to enhance the
summary's informativeness.
Text Summarization: Produces the final summarized text as
the output.
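To make the flow concrete, the sketch below chains these stages in the order shown in the diagram. It is a minimal illustration: the helper names (extract_audio, transcribe_audio, textrank_summary, extract_entities) match sketches given later in this deck and are assumptions, not the project's exact API.

def summarize_video(video_path):
    audio_path = extract_audio(video_path)     # video -> audio
    transcript = transcribe_audio(audio_path)  # audio -> text (ASR)
    summary = textrank_summary(transcript)     # extractive summary
    entities = extract_entities(summary)       # NER tags for metadata
    return summary, entities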
Proposed System Model
Extractive Summarization
• Extractive summarization involves selecting and extracting the most relevant
sentences or phrases directly from the original text.
• It uses ranking algorithms or machine learning models to identify the most
informative sentences.
• Common methods include TextRank, LexRank, and clustering-based approaches.
• It is useful for news articles, research papers, and legal documents where factual
accuracy is crucial.
• It maintains the original meaning of the text with high accuracy.
• It can result in summaries lacking coherence and fluidity since sentences are
directly extracted without rephrasing.
Figure 2 : Extractive summarization
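As a concrete illustration of the ranking step, here is a minimal TextRank-style sketch built on nltk and networkx. The word-overlap similarity measure and the function name are illustrative assumptions, not the project's exact implementation.

import networkx as nx
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize

def textrank_summary(text, num_sentences=3):
    sentences = sent_tokenize(text)
    token_sets = [set(word_tokenize(s.lower())) for s in sentences]
    n = len(sentences)
    # Pairwise word-overlap similarity between sentences
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and token_sets[i] and token_sets[j]:
                sim[i, j] = len(token_sets[i] & token_sets[j]) / (
                    1.0 + np.log(len(token_sets[i])) + np.log(len(token_sets[j])))
    # PageRank over the similarity graph scores sentence importance
    scores = nx.pagerank(nx.from_numpy_array(sim))
    top = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:num_sentences])
    return " ".join(sentences[i] for i in top)  # keep original sentence order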
Abstractive Summarization
• Abstractive summarization generates a concise and coherent summary by
understanding the context and meaning of the text.
• It uses advanced natural language generation (NLG) techniques to create new
sentences that convey the main ideas.
• Models like BART, T5, and GPT are commonly used for abstractive summarization.
• It is beneficial for summarizing conversational text, articles, or reports where
coherence and readability are essential.
• It can produce human-like summaries by paraphrasing and rephrasing content.
• It may introduce factual inconsistencies or lose key information if not trained
properly.
Figure 3: Abstractive summarization
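For comparison, abstractive summarization with a pre-trained BART checkpoint can be sketched with the Hugging Face transformers pipeline. The model name and length limits below are illustrative assumptions, not settings taken from the project.

from transformers import pipeline

# Downloads the checkpoint on first use
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def abstractive_summary(text, max_length=130, min_length=30):
    result = summarizer(text, max_length=max_length,
                        min_length=min_length, do_sample=False)
    return result[0]["summary_text"]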
Encoder-Decoder Architecture
• Sequence-to-sequence (Seq2Seq) is a neural network architecture used
for transforming one sequence of data into another.
• It is widely used in tasks like machine translation, text summarization,
chatbots, and speech recognition.
• Seq2Seq models typically consist of an Encoder and a Decoder.
• The Encoder processes the input sequence and converts it into a fixed-
length context vector (a numerical representation).
• The Decoder uses this context vector to generate the output sequence
step-by-step.
• Attention mechanisms are often added to Seq2Seq models to focus on
relevant parts of the input during decoding.
• Transformer-based models like BART, T5, and GPT use Seq2Seq for
improved text generation and understanding.
PROPOSED SYSTEM
Figure 4 : Encoder-Decoder Architecture
Encoder
• The Encoder-Decoder architecture is a common framework in sequence-to-
sequence (Seq2Seq) tasks, primarily using LSTM (Long Short-Term Memory)
or GRU (Gated Recurrent Unit) models.
• An Encoder is the first component of the sequence-to-sequence (Seq2Seq)
architecture.
• It processes the input sequence (such as a sentence) and converts it into a
fixed-length context vector, also called a latent representation.
• The Encoder typically consists of multiple layers of recurrent neural networks
(RNNs), long short-term memory networks (LSTMs), gated recurrent units
(GRUs), or transformer blocks.
• Each layer captures the sequential and contextual information from the input
data.
• The final hidden state of the Encoder contains a comprehensive representation
of the input, which is passed to the Decoder for generating the output.
• Encoders are essential in tasks like machine translation, text summarization,
and speech recognition.
Named Entity Recognition
• Definition: NER is an information extraction technique that identifies and classifies named
entities in text into predefined categories like names, organizations, locations, times, and
monetary values.
• Applications: NER is widely used in Natural Language Processing (NLP) to extract useful
information from large datasets, such as analyzing news articles, customer reviews, and social
media posts.
• Entity Classification: Detected entities are categorized into types like Person, Organization,
Location, Date, Quantity, and Monetary Value.
• NER Process: It involves two steps —
• Entity Detection: Identifies named entities in the text.
• Entity Categorization: Classifies the identified entities into specific categories.
• Tools Used: Libraries like SpaCy are commonly used for entity extraction and tagging,
providing efficient and accurate results.
• Practical Use Cases: NER helps in answering questions such as:
• Which companies are mentioned in a news article?
• Were specific products mentioned in reviews?
• Does a tweet contain the name of a person or location?
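A minimal sketch of this two-step process with SpaCy, the library named above. The small English model and the sample sentence are illustrative; run python -m spacy download en_core_web_sm once before use.

import spacy

nlp = spacy.load("en_core_web_sm")  # pre-trained English pipeline

def extract_entities(text):
    doc = nlp(text)  # detection and categorization happen in one pass
    return [(ent.text, ent.label_) for ent in doc.ents]

# Illustrative output (actual labels depend on the model):
# extract_entities("Google opened an office in Hyderabad in 2023.")
# -> [('Google', 'ORG'), ('Hyderabad', 'GPE'), ('2023', 'DATE')]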
Decoder
• A Decoder is a key component of the sequence-to-sequence (Seq2Seq)
architecture, responsible for generating the output sequence.
• It takes the context vector from the Encoder, which represents the input
sequence, and generates one output token at a time.
• The Decoder uses techniques like recurrent neural networks (RNNs), long
short-term memory networks (LSTMs), gated recurrent units (GRUs), or
transformer blocks for sequential processing.
• It predicts the next token by considering both the context vector and the tokens
generated so far.
• Attention mechanisms are often applied to help the Decoder focus on the most
relevant parts of the input sequence during generation.
• It is widely used in applications like machine translation, text summarization,
chatbot development, and image captioning.
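A minimal Keras definition of such an LSTM encoder-decoder is sketched below. Vocabulary and dimension sizes are illustrative assumptions, and this is not the project's exact model.

from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
from tensorflow.keras.models import Model

vocab_size, embed_dim, latent_dim = 10000, 128, 256  # illustrative sizes

# Encoder: reads the input sequence and keeps its final hidden/cell states
enc_inputs = Input(shape=(None,))
enc_emb = Embedding(vocab_size, embed_dim)(enc_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: generates output tokens conditioned on the encoder states
dec_inputs = Input(shape=(None,))
dec_emb = Embedding(vocab_size, embed_dim)(dec_inputs)
dec_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                         return_state=True)(dec_emb, initial_state=[state_h, state_c])
outputs = Dense(vocab_size, activation="softmax")(dec_outputs)

model = Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")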
Figure 5: Encoder-Decoder architecture of the long short-term memory (LSTM) network
Software Specifications
Operating System: Windows 10
Tool: Jupyter Notebook
Language: Python

Hardware Specifications
Processor: Intel Core i3
RAM: 4 GB
System Type: 64-bit
CODE
import os
import nltk
import numpy as np
import whisper  # assumed: the openai-whisper package, used below for transcription
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding

# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
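The slides jump from the imports to a transcription helper, so the audio-extraction step (Step 1 of the pipeline) is not shown. A common way to implement it, sketched here as an assumption using the moviepy library rather than taken from the project code:

from moviepy.editor import VideoFileClip

# Step 1 (assumed): extract the audio track from a video file
def extract_audio(video_path, audio_path="audio.wav"):
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path)
    clip.close()
    return audio_path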
CODE
already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
True

# Step 2: Transcribe audio to text using Whisper
# (enclosing function header reconstructed; the slide shows only its body)
def transcribe_audio(audio_path):
    model = whisper.load_model("base")  # Load Whisper model
    result = model.transcribe(audio_path)
    return result["text"]

# Step 3: Preprocess text (Tokenization, Lemmatization)
def preprocess_text(text):
    sentences = sent_tokenize(text)  # Split into sentences
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    processed_sentences = []
    for sentence in sentences:
        words = word_tokenize(sentence.lower())
        # Lemmatize and remove stopwords
        lemmatized = [lemmatizer.lemmatize(word) for word in words
                      if word not in stop_words and word.isalnum()]
        processed_sentences.append(" ".join(lemmatized))
    return sentences, processed_sentences

# Step 4: Create word embeddings (simple example using pre-trained GloVe)
def load_glove_embeddings(glove_file='glove.6B.100d.txt'):
    embeddings_index = {}
CODE
    with open(glove_file, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

def get_sentence_vectors(sentences, embeddings_index, embedding_dim=100):
    sentence_vectors = []
    for sentence in sentences:
        words = word_tokenize(sentence)
        # Average word vectors to get one fixed-size vector per sentence
        word_vectors = [embeddings_index.get(word, np.zeros(embedding_dim)) for word in words]
        if word_vectors:
            sentence_vectors.append(np.mean(word_vectors, axis=0))
        else:
            sentence_vectors.append(np.zeros(embedding_dim))
    return np.array(sentence_vectors)

# Step 5: Build and train LSTM model for sentence scoring
def build_lstm_model(input_dim, sequence_length):
    model = Sequential()
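The slide ends mid-function after model = Sequential(). One plausible continuation for a per-sentence importance scorer, offered purely as a hedged sketch with illustrative layer sizes (not the project's actual layers):

    # Hypothetical continuation of build_lstm_model; sizes are illustrative
    model.add(Embedding(input_dim=input_dim, output_dim=100,
                        input_length=sequence_length))
    model.add(LSTM(64))
    model.add(Dense(1, activation='sigmoid'))  # importance score in [0, 1]
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model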
CODE
Input video link:
https://www.youtube.com/watch?si=OVw0QJdkwobv7_Ba&v=SrlFJf6v4fY&feature=youtu.be
CODE
Output Text:
Music So I'm happy to see all of you here because each one of you has a potential which
only English can help you to realize. You have great capabilities but all your talents, all
your capabilities are getting blocked because you do not know English. That is the job
which we have to do. But remember that one person called Smitharoi cannot teach you
English. No other person can teach you English. You have to learn it yourself. Just now you
heard that lots of people are seeing the videos which are there on YouTube which I did for
impact at various points. I did in IIT Kanpur and other places so those also might be there
on YouTube. But no YouTube video can change you. No lecture, no class can change you.
You have to learn English by self- effort. How to improve your communication? That is
listening, speaking, reading and writing. I have told in many videos of impact. So if you have
seen it, practice it. If you haven't seen it, please see it now after finishing. For many years,
Gampagaru has been asking me and you won't believe I get lots of mails, messages, phone
calls. When I am sitting in an important meeting, I get five or six calls. Madam, we have
seen your impact video. What is the use of calling me? Please do not call me at any point of
time. I can't teach you English on the cell phone. Not possible. Say Madam, if by speaking
to you my English becomes better. I don't have that much capability to speak. You have to
practice. You won't believe at least I get a few thousand mails per month. I can't answer
because I am an individual human being. I answer slowly.
Don't send WhatsApp or Viber message immediately. Practice for one year, practice
for six months, practice for five years. Only one person of these twenty lakhs, only one
person. Send a message saying, Madam, having learnt English from your video, I have
got a very good job. That means they practiced what I said. So you need to practice.
Whenever you get a free time, please practice English. Even if you practice with
yourself, it is good enough. So this course is not about communication skills. I am
not telling you anything about how you should improve your speaking, listening,
reading or writing. That you will go back to Gampasar's excellence, which he puts the
videos in YouTube. I have said in every video how to do it. One video I think is there
for about interview skills. So some people told me just yesterday one mail came,
saying, Madam, after listening to that video, that was in Thurupati, S.V. University, I
got a job yesterday. I wanted to tell you first. I felt so happy. I feel so wonderful. I
don't know that person. That person doesn't know me. Impact is doing such a great
job and getting. But then as we have been seeing, not many people are getting a job.
The reason nobody is following those videos. You are just listening. Very nice.
Appreciate.
But nobody wants to take so much trouble. When you read the newspaper, all grammar
is there in the newspaper. All vocabulary is there in the newspaper. Nobody reads. We
are only interested in what is happening to Kajriwal or what is happening to Prime
Minister Modi or what is KCR doing. Very good. That is called content. Look at the
language. Content we all know. We are very intelligent. Language we do not know. So we
have to improve. So from today, not only for the next four days, but for the next four
months or four years, 24 hours devoted only to English. Once you learn English, you
forget it. Don't bother to practice. By learning English, don't forget your mother tongue.
But please practice as much as you can. Go on practicing. So I have divided the course
into various lessons of grammar. I am told that this will also go on YouTube. So all those
lacks of people who are asking me questions. I hope we will see this and their problems
will be solved. But again and again, I am telling you that today, books do not teach us
vocabulary. Books do not teach us grammar. You need not buy an instant vocabulary
book. That is of no use. If you want to use English, learn from real life. Remember that
when you were born, your parents didn't give you a telegodixnery. They didn't give you a
grammar book. How to learn Telugu in 30 days? No, you learned automatically. You
learned by listening to others speaking good Telugu.
Conclusion
▰ Video summarization provides high-quality, text-based summaries for quick
information retrieval.
▰ It reduces the need to watch entire videos by offering concise insights.
▰ The process involves video-to-audio conversion, followed by audio-to-text
transcription.
▰ Extractive summarization selects key sentences using pre-trained machine learning
models.
▰ Named Entity Recognition (NER) is applied to extract relevant entities for tagging.
▰ Open-source libraries and state-of-the-art models enhance summarization
accuracy.
▰ The five stages include video input preparation, audio extraction, speech-to-text
conversion, extractive summarization, and entity extraction.
▰ This approach is beneficial in media analysis, education, research, and corporate
environments.
REFERENCES
• Yuan, J, Wang, H, Xiao, L, Zheng, W, Li, J, Lin, F & Zhang, B 2007, ‘A formal study of shot
boundary detection’, IEEE transactions on circuits and systems for video technology, vol.
17, no. 2, pp. 168-186.
• Guan, G, Wang, Z, Yu, K, Mei, S, He, M & Feng, D 2012, ‘Video summarization with global
and local features’, In 2012 IEEE International Conference on Multimedia and Expo
Workshops, pp. 570-575.
• Wei, H, Bingbing N, Yichao Y, Yu, H, Yang, X & Yao, C 2018, ‘Video Summarization via
Semantic Attended Networks’, Thirty-Second AAAI Conference on Artificial Intelligence.
• Sujatha, C & Mudenagudi, U 2011, ‘A Study on Keyframe Extraction Methods for Video
Summary’, International Conference on Computational Intelligence and Communication
Networks (CICN), vol. 73, no. 77, pp.7-9.
• Liu, T, Zhang, HJ & Qi, F 2003, ‘A novel video key-frame-extraction algorithm based on perceived motion energy model’, IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 10, pp. 1006-1013.
• Ciocca, G & Schettini, R 2006, ‘An innovative algorithm for keyframe extraction in video
summarization’, Journal of Real-Time Image Processing (Springer), vol. 1, no. 1, pp. 69-88.
• Chang, IC & Cheng, KY 2007, ‘Content-selection based video summarization’, IEEE
International Conference On Consumer Electronics, Las Vegas Convention Center, USA, pp.
11-14
REFERENCES
• Dhawale, AC & Jain, S 2008, ‘A novel approach towards key frame selection for video
summarization’, Asian Journal of Information Technology, vol. 7, no. 4, pp. 133-137.
• Congcong, L, Wu, YT, Shiaw-Shian, Y & Chen, T 2009, ‘Motion- focusing key frame
extraction and video summarization for lane surveillance system’, ICIP 2009, pp. 4329-
4332.
• Luo, C, Papin & Costello, K 2009, ‘Towards extracting semantically meaningful key frames
from personal video clips:from humans to computers’, IEEE Transactions On Circuits And
Systems For Video Technology, vol. 19, no. 2.
• Elkhattabi, Z, Tabii, Y & Benkaddour, A 2015, ‘Video summarization: Techniques and applications’, World Academy of Science, Engineering and Technology, International Journal of Computer, Electrical, Automation, Control and Information Engineering, vol. 9, no. 4, pp. 928-933.
THANK YOU