Revolutionizing News Consumption: How NLP-Powered Summarizers have Changed the Game

INTRODUCTION

In today's fast-paced digital age, staying informed while managing time effectively is more crucial than ever. With the exponential growth of digitized information, particularly news content, quickly extracting relevant insights has become a significant challenge. To address this, I have implemented a news summarizer using NLP techniques, based on the research paper "An Effective Approach for News Article Summarization" by Shilpi Malhotra and Ashutosh Dixit. The application uses extractive summarization to pull the essential information out of lengthy news articles, giving users clear, concise, and relevant summaries that save time and enhance readability.

In this article, I am going to show you the steps I followed, and the challenges I faced, while implementing this application.


METHODOLOGY

When we are asked to summarize a passage of text, what we naturally do is read through it, highlight the information we consider important, and finally distill those highlighted points into a new, shorter text: the summarized version of the original. This summary contains the most important information from the original passage.

In this tutorial, we are going to follow the same technique to summarize a news article.

These are the steps we follow in the news summarization:

  • Segregating and pre-processing the sentences from the article

  • Keyword extraction and creation of the keyword table (keywords and their term frequencies)

  • Sentence scoring

  • Sentence similarity

  • Sentence filtering

  • Picking the top 'k' sentences as the summary


IMPLEMENTATION

Sentence Segregation and Pre-Processing

To make the process convenient for the machine, the first step is to separate the given text passage into individual sentences. This uses the spaCy library, with which we can split the sentences as shown in the following code:
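The original uses spaCy's sentence segmenter; as a dependency-free sketch of the same step (the splitting rule here is an assumption, not the article's exact code):

```python
import re

def split_sentences(text: str) -> list[str]:
    # Stand-in for spaCy's doc.sents: drop newlines, then split
    # after sentence-ending punctuation. (spaCy's segmenter is
    # statistical and more robust than this simple rule.)
    text = text.replace("\n", "")
    parts = re.split(r"(?<=[.!?])\s*", text)
    return [p.strip() for p in parts if p.strip()]

sentences = split_sentences(
    "OpenAI has unveiled GPT-5. Early testers are impressed!\nMore details soon."
)
print(sentences)
```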

This code first replaces any newline characters (\n) with an empty string to simplify the text. It then adds each sentence to the sentences list, removing any extra whitespace with the strip() method. That is how the sentences are segregated and stored in the list.

Now that we have the full list of sentences, it is time to pre-process each of them using the preprocess() function defined below:
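A minimal sketch of such a preprocess() function, assuming a toy stopword list and a crude suffix-stripping stemmer in place of the NLTK/spaCy components (the POS-based noun filter is noted but omitted here):

```python
import string

# Toy stopword list; the real pipeline would use NLTK's full list.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "of", "in",
             "on", "and", "to", "its", "has"}

def crude_stem(token: str) -> str:
    # Toy stand-in for a proper stemmer such as NLTK's PorterStemmer.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(sentence: str) -> list[str]:
    # Tokenize, drop punctuation and stopwords, stem what remains.
    # (The article's version also filters by POS tags for singular
    # nouns; that needs a tagger and is omitted in this sketch.)
    filtered = []
    for raw in sentence.lower().split():
        token = raw.strip(string.punctuation)
        if token and token not in STOPWORDS:
            filtered.append(crude_stem(token))
    return filtered

tokens = preprocess("OpenAI has unveiled its latest language model.")
print(tokens)
```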

The function first tokenizes each sentence, filters for the singular nouns, stems each remaining token, and appends it to the filtered list. Stopwords and punctuation marks are also removed from the sentences in the same pass.

We call the function for each sentence in the sentences list using a list comprehension, as shown:
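With preprocess() defined (a trivial placeholder here so the snippet runs on its own), the list comprehension looks like this:

```python
def preprocess(sentence: str) -> list[str]:
    # Placeholder for the real pipeline from the previous step.
    return sentence.lower().split()

sentences = ["OpenAI unveiled GPT-5.", "Early testers are impressed."]

# One filtered token list per original sentence.
filtered_sentences = [preprocess(sentence) for sentence in sentences]
print(filtered_sentences)
```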

That is all about the sentence segregation and pre-processing!

Keyword Extraction

Now, we extract the keywords from the filtered sentence list created above and store them in a dictionary, with the keywords as the keys and their respective term frequencies as the values.

This process is shown below:
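A hedged sketch of such an extraction function: the paper's version relies on POS tags and named-entity labels (e.g. from spaCy) to pick nouns, entities, and cardinal numbers, whereas here thematic terms are approximated by frequency and cardinal numbers by isdigit(), so treat this as an illustration rather than the paper's exact logic:

```python
from collections import Counter

def extract_keywords(filtered_sentences):
    # Count every stemmed token across all sentences.
    counts = Counter(tok for sent in filtered_sentences for tok in sent)
    # Keep "thematic" terms (appearing at least twice) and cardinal
    # numbers. The paper additionally keeps nouns and named entities
    # identified by a tagger.
    return [tok for tok, c in counts.items() if c >= 2 or tok.isdigit()]

filtered_sentences = [
    ["openai", "unveil", "gpt-5", "model"],
    ["gpt-5", "model", "impress", "tester"],
    ["2025", "release", "gpt-5"],
]
keywords = extract_keywords(filtered_sentences)
print(keywords)
```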

This function simply extracts the keywords based on nouns, named entities, thematic terms, and cardinal numbers. The details are given in the research paper.

The following code shows the creation of the keyword table:
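Assuming a keywords list from the previous step, the keyword table is a plain dictionary of term frequencies:

```python
from collections import Counter

filtered_sentences = [
    ["gpt-5", "model", "unveil"],
    ["gpt-5", "tester", "impress"],
]
keywords = ["gpt-5", "model"]  # as produced by the extraction step

# Term frequency of every token across the whole article.
term_freq = Counter(tok for sent in filtered_sentences for tok in sent)

# Keyword table: keyword -> term frequency.
keyword_table = {kw: term_freq[kw] for kw in keywords}
print(keyword_table)  # {'gpt-5': 2, 'model': 1}
```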

The dictionary maps each keyword to its term frequency. It should be noted that the keywords from the article title can also be used when building the keyword table; in our case, I have taken the title as "Null". As an exercise, you can try including the title to see whether it improves the results.

In that way, all the keywords are extracted from the given passage. And we are done with that step as well :)

Sentence Ranking

Now that we have the keyword table as well as the filtered sentences, we rank each sentence from the filtered list (using the score_sentence() function) and store the respective scores in score_list.

Following equation shows the formula for calculating the score of a sentence:

The definition of this relation as a function in code is as follows:
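A hedged sketch of score_sentence(): the paper's exact formula and weights differ in detail (consult the paper), but a common form is the sum of keyword term frequencies in a sentence, normalized by sentence length:

```python
def score_sentence(tokens, keyword_table):
    # Sum the term frequencies of any keywords present, then normalize
    # by sentence length so long sentences are not favoured by default.
    # NOTE: a simplification of the paper's formula, not a reproduction.
    if not tokens:
        return 0.0
    return sum(keyword_table.get(tok, 0) for tok in tokens) / len(tokens)

keyword_table = {"gpt-5": 3, "model": 2}
print(score_sentence(["gpt-5", "model", "unveil", "today"], keyword_table))
```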

The following code scores each sentence of the text and stores the scores in a separate score_list:
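The loop over all sentences can be sketched as follows (score_sentence() is inlined as a placeholder so the snippet runs on its own; the last line boosts the first sentence, as the paper directs):

```python
def score_sentence(tokens, keyword_table):
    # Placeholder scorer: keyword term-frequency sum over sentence length.
    return sum(keyword_table.get(t, 0) for t in tokens) / max(len(tokens), 1)

keyword_table = {"gpt-5": 3, "model": 2}
filtered_sentences = [
    ["openai", "unveil", "gpt-5"],
    ["gpt-5", "model", "impress"],
    ["release", "expect", "soon"],
]

score_list = [score_sentence(s, keyword_table) for s in filtered_sentences]

# Paper's rule: the lead sentence of a news article is the most
# important, so force it to the top of the ranking.
score_list[0] = max(score_list) + 1
print(score_list)
```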

The last line sets the score of the first sentence of the article to the highest value, as directed in the paper. The output shows the list of scores.

Sentence Similarity

Since we have the sentences as well as their scores, it is now easy to find the similarity of any two sentences. If that similarity is greater than a certain threshold value, the sentence with the lower score is discarded, as mentioned in the research paper.

The similarity between two sentences Si and Sj is calculated as follows:

Where Length of each sentence is found by using the following equation:

Following code shows the definition of similarityscore() function:
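A sketch of similarityscore(): the paper defines similarity through the words the two sentences share, normalized by the sentence lengths; the exact normalization below (a Dice-style overlap) is an assumption, so check the paper for the precise equations:

```python
def similarityscore(si, sj):
    # Dice-style overlap: shared distinct tokens, normalized by the two
    # sentence lengths. An assumption standing in for the paper's exact
    # similarity and length equations.
    if not si or not sj:
        return 0.0
    common = set(si) & set(sj)
    return 2 * len(common) / (len(si) + len(sj))

print(similarityscore(["gpt-5", "model", "unveil"],
                      ["gpt-5", "model", "impress"]))
```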

The significance of this function is that it lets us remove the redundant sentences from our text document. After that removal, we can choose the top 'k' sentences by score from the original set, which essentially gives us the summary. This step is explained later.

Redundant Sentence Filtration

The following piece of code iterates through the list of filtered_sentences. In each iteration, it computes the similarity score of a filtered sentence against all the other sentences in the list (using the inner loop). Whenever the similarity is above the threshold, the lower-scored sentence is marked for removal from the original sentence list. Importantly, the sentences to be removed are collected in the sentence_dustbin list. This is shown in the code below:
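A self-contained sketch of this filtering pass (similarityscore() and the data are placeholders, and the threshold value is an assumption to tune against your own articles):

```python
def similarityscore(si, sj):
    # Placeholder: Dice-style token overlap, as sketched earlier.
    if not si or not sj:
        return 0.0
    return 2 * len(set(si) & set(sj)) / (len(si) + len(sj))

sentences = [
    "OpenAI has unveiled GPT-5.",
    "GPT-5 was unveiled by OpenAI.",   # near-duplicate of the first
    "Early testers are impressed.",
]
filtered_sentences = [
    ["openai", "unveil", "gpt-5"],
    ["gpt-5", "unveil", "openai"],
    ["tester", "impress"],
]
score_list = [2.5, 1.0, 0.8]
THRESHOLD = 0.8  # assumed value

sentence_dustbin = []
for i in range(len(filtered_sentences)):
    for j in range(i + 1, len(filtered_sentences)):
        if similarityscore(filtered_sentences[i], filtered_sentences[j]) > THRESHOLD:
            # Of the two near-duplicates, discard the lower-scored one.
            loser = i if score_list[i] < score_list[j] else j
            sentence_dustbin.append(sentences[loser])

print(sentence_dustbin)
```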

At the end, we have all the redundant sentences in the above-mentioned sentence_dustbin list.

Now we have to remove the duplicates from the sentence_dustbin list, since the same sentence may have been added to it more than once. For that, we will use the set() function in Python.

Now that the duplicates are removed, we can drop the redundant sentences, along with their respective scores, from sentences and score_list.
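Deduplicating the dustbin with set() and then dropping those sentences (with their scores) can be sketched as:

```python
sentences = ["A first.", "A duplicate.", "A keeper."]
score_list = [2.0, 0.5, 1.5]
sentence_dustbin = ["A duplicate.", "A duplicate."]  # may hold repeats

# set() removes duplicate entries from the dustbin.
sentence_dustbin = list(set(sentence_dustbin))

# Keep only the sentences (and scores) that are not in the dustbin.
keep = [i for i, s in enumerate(sentences) if s not in sentence_dustbin]
sentences = [sentences[i] for i in keep]
score_list = [score_list[i] for i in keep]
print(sentences, score_list)
```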

In that way, we have filtered out the redundant sentences (if any) along with their scores from our original set of sentences.

Picking of Top 'k' Sentences As Summary

As we have finalized the sentences and score_list, we can now pick out the top 'k' scored sentences.

The following code first computes 'k', the number of top-scored sentences to keep. In our case, I have decided to keep the top 75% of the original set of sentences (with 50%, the summary would contain fewer sentences; you can choose any percentage and compare the results). It then finds the indices of those top-scored sentences and builds the summary from the sentences at those indices in the sentences list.
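A sketch of this final selection step, keeping the top 75% of sentences by score and restoring them to reading order:

```python
import math

sentences = ["First sentence.", "Second sentence.",
             "Third sentence.", "Fourth sentence."]
score_list = [3.0, 0.5, 2.0, 1.0]

# Keep the top 75% of sentences by score.
k = math.ceil(0.75 * len(sentences))

# Indices of the k highest-scoring sentences...
top_indices = sorted(range(len(score_list)),
                     key=lambda i: score_list[i], reverse=True)[:k]
# ...restored to their original order so the summary reads naturally.
top_indices.sort()

summary = " ".join(sentences[i] for i in top_indices)
print(summary)
```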

As a result, we get the following summary:

In a significant leap forward for artificial intelligence, OpenAI has unveiled its latest language model, GPT-5, which promises unprecedented capabilities in natural language understanding and generation. Early testers have noted its impressive ability to grasp complex topics, generate high-quality code, and even engage in meaningful philosophical discussions. Moreover, GPT-5 includes enhanced safeguards to minimize biases and ensure ethical use, addressing many of the concerns raised with previous models.

In this way, we have successfully generated the summary of a news article using the "extractive summary" technique mentioned in the research paper (link attached).

I have also integrated this app with Streamlit and deployed it on Streamlit Cloud. Feel free to check it out.


CONCLUSION

In conclusion, we have developed an NLP-powered news summarizer that represents a significant advancement in information retrieval and processing. By implementing an extractive summarization approach, we have created a tool that efficiently condenses large amounts of news content into concise, relevant summaries. This not only addresses the challenge of information overload but also aligns with the increasing demand for quick, accurate, and easily digestible news insights. Of course, in some cases this model might not perform well (which is why more advanced text-summarization algorithms exist).

The integration of advanced techniques such as named entity recognition, thematic term identification, and keyword extraction ensures that our summarizer captures the most critical aspects of news articles. This enables users to stay informed without the need to read through lengthy texts, thereby saving valuable time and enhancing their ability to keep up with the latest developments.


Live App Link

Rameesha Asim

CUI ‘26 | CCI | NTDC | xIntern PEL


Being enthusiastic about machine learning, I think it's very good practice to implement research papers like this one, as it really helps in clarifying the fundamental concepts that have since evolved into the more advanced techniques used today.
