AUTOMATIC TEXT SUMMARIZATION USING SEQUENCE-TO-SEQUENCE MODELS

Published in: Scientific journal «Интернаука» No. 19(242)
Journal section: 3. Information technologies
Article DOI: 10.32743/26870142.2022.19.242.339376
Bibliographic description:
Утешов Е.М. AUTOMATIC TEXT SUMMARIZATION USING SEQUENCE-TO-SEQUENCE MODELS // Интернаука: electronic scientific journal. 2022. No. 19(242). URL: https://internauka.org/journal/science/internauka/242 (accessed: 27.04.2024). DOI: 10.32743/26870142.2022.19.242.339376


Yerzhan Uteshov

Master, Faculty of Information Technology, Kazakh-British Technical University,

Kazakhstan, Almaty

 

ABSTRACT

As information storage and processing in our daily lives become increasingly digital, the demand for digitization has grown in a variety of areas, including investigative procedures. In practice, for crimes involving computer systems, adopting best practices for extracting evidence from devices acquired at crime scenes is a necessity.

Summarization has become an active research area in recent years, and natural language processing (NLP) techniques enable researchers to produce efficient results for a wide range of texts. The proposed work employs a seq2seq architecture with RNNs to perform document summarization. The summaries are abstractive, allowing the model to form an internal representation of meaning on its own. With refinement and ongoing development, this methodology provides a robust foundation for summarizing larger and more complex materials. The end result is efficient summary generation with ROUGE scores in the 0.6-0.7 range.

 

Keywords: Natural Language Processing, Text summarization, Machine Learning, TensorFlow, Seq2Seq.

 

Introduction

Our world is now inundated with massive amounts of data. With so much data circulating in the digital world, machine learning algorithms are needed that can automatically compress lengthy texts and produce accurate summaries that convey the original content. Due to the enormous growth in the availability of blogs, news stories, and reports in the present era of big data, extracting useful information from a vast number of textual documents is a difficult challenge. Summarizing these materials with automatic text summarization is a viable option. The goal of text summarization is to reduce long documents to brief summaries while retaining the most relevant information and meaning.

Such short summaries allow the text material to be obtained, processed, and digested effectively and efficiently. In general, text summarization can be done in two ways: extractive and abstractive. When the words, phrases, and sentences in the summaries are chosen directly from the source articles [1], the method is termed extractive. Extractive methods are straightforward and can generate grammatically correct sentences. The produced summaries typically retain key information from the original articles and compare well with human-written summaries.

Abstractive text summarization, on the other hand, has attracted a lot of interest because it can generate new words using language generation models built on a representation of the source material [2]. As a result, it has a good chance of producing high-quality summaries that are verbally original and can readily incorporate outside knowledge. Many deep-neural-network-based models in this category have outperformed traditional extractive approaches on commonly used evaluation measures [3]. The latest improvements of recurrent neural network (RNN) based sequence-to-sequence (seq2seq) models for abstractive text summarization are the subject of this research.

Literature Survey

Gupta, Bansal, and Sharma [4] identified the need for summarization to ease the effort of extracting essential information from the multitude of online resources. Given the large breadth of applying extraction or abstraction in the field, automatic summarization is defined broadly, and legal summarization is given minimal preference. Generic summarization, where no extra information is added, is distinguished as one type of automatic text summarization.

Kanapala, Pal, and Pamula [5] conducted a survey to determine how far the summarization of single and multiple documents in the legal sphere has progressed. It was found that summarization is significantly more difficult for legal literature because the size, structure (including statute codes), terminology, ambiguity, and citations vary widely. Correctly handling the hierarchical structure of such content is critical.

As a result, for single-document summarization the surveyed approaches were classified by their underlying technique: linguistic feature-based approaches such as LSA (Latent Semantic Analysis) blending term and sentence descriptions or using the Lesk algorithm; statistical feature-based approaches built on features such as term frequency (TF), inverse sentence frequency (ISF), and textual entailment (TE); language-independent approaches based on word frequency; evolutionary-computing-based approaches; and graph-based approaches such as Tree Knapsack. A comparable study for multi-document summarization demonstrated numerous ways of extracting linguistically significant terms and building a co-occurrence base for them. Another interesting method is Latent Dirichlet Allocation, which is used to evaluate sentences based on important topic terms. Graph-based techniques have also been investigated as a feasible solution.

Jain, Bhatia, and Thakur [6] proposed extraction-based summarization using word vector embeddings. To construct the labelled training data, a similarity score is calculated for each sentence using 100-dimensional GloVe vectors. For feature extraction, the mean TF-ISF (term frequency - inverse sentence frequency) was used, with longer sentences receiving less weight. A three-layer fully connected feed-forward neural network (a multi-layer perceptron, MLP) was employed for summarization. The ROUGE scores of the summaries over the first 284 documents were ROUGE-1: 0.36625, ROUGE-2: 0.15735, and ROUGE-L: 0.34410.

Yousefi-Azar and Hamey [7] demonstrated that a deep auto-encoder (AE) can be used for query-based summary synthesis from term-frequency (tf) input. Adding random noise to the local tf led to the Ensemble Noisy Auto-Encoder (ENAE). Training was split into two stages, pre-training and fine-tuning, and a generative model (a restricted Boltzmann machine, RBM) is used to discover the parameters.

The RBM is made up of two neural-network layers, the first Gaussian-Bernoulli and the second Bernoulli-Bernoulli (except for the hidden units). The experiments were conducted on the SKE and BC3 email corpora, and the results showed that using the AE improved ROUGE-2 recall by 11.2 percent.

Proposed Work

The data comes from the Harvard NLP project and consists of two datasets, Gigaword and CNN/DailyMail (CNN/DM), whose different styles have different impacts on model adaptation, convergence speed, readability, and the abstractiveness of the generated summaries. The CNN/DM dataset, which consists of news articles and their hand-written summaries, was employed. The news stories cover numerous categories, such as business, politics, sports, and finance. Figure 1 depicts the process of importing, cleaning, preprocessing, training, and testing the data using deep-learning neural networks such as the seq2seq architecture.

 

Figure 1. Flowchart for the training and testing of the dataset

 

To simplify part of the work, the first step is to import the libraries required by the Python application. After that, the dataset must be imported. The dataset is unprocessed and needs treatment to keep only known characters and significant words while removing irrelevant data. Because the obtained text data is largely unstructured, the dataset needs to be cleaned: the language may be abbreviated, there may be spelling errors, it is not standardized, and it contains dates, other numerical data, and legal jargon. Stemming or lemmatizing terms can also aid this process, which is why a dedicated preprocessing stage is used. After the training and testing phases, the metrics used to measure the model's performance, such as the ROUGE score, are recorded. Finally, the network's limitations are examined: because the trained models are not ideal, it is important to figure out why and to see whether there are ways to improve them. This is represented in Figure 2.
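The stemming and lemmatization mentioned above can be illustrated with NLTK. The snippet below is only a minimal sketch; the paper does not state which normalizer, if any, was actually applied.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lookup data needed by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "summarization", "running"]:
    # Stemming chops suffixes heuristically; lemmatization maps to a dictionary form.
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))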

Figure 2. Analyzing the limitations of the classifier

 

The output on the trained dataset is used to point out grammatical faults and nonsensical words. This data is examined further to determine any patterns or classifier limitations.

Implementation

Data Cleaning

This module is responsible for cleaning the dataset, which includes removing noise from the unstructured text. It is used to load the training data and test data and to display sample cleaning output. The algorithm is outlined below, and a minimal code sketch follows the steps.

Input: a text dataset.

Output: a cleaned text dataset.

1. Install the necessary libraries.

2. Upload the data.

3. For each sample in the dataset, do the following:

a. Replace the undesired characters with #.

Any further substitutions may be made if they appear to be appropriate.

4. Create a clean dataset.
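A minimal cleaning sketch along these lines is shown below. The retained character set and the whitespace normalization are assumptions; the paper only specifies that undesired characters are replaced with '#'.

import re

def clean_text(text: str) -> str:
    """Keep known characters and significant words; mark everything else with '#'."""
    text = text.lower().strip()
    # Replace any character that is not a letter, digit, space, or basic punctuation with '#'.
    text = re.sub(r"[^a-z0-9 .,!?']", "#", text)
    # Collapse repeated whitespace left over after the substitutions.
    return re.sub(r"\s+", " ", text)

# Example: cleaning one (made-up) sample before it enters the dataset.
print(clean_text("Stocks rose 3% on Monday & analysts cited strong earnings!"))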

Following this, we'll start developing our model's vocabulary, which will consist of a simple word dictionary with (word, position) entries and a reverse dictionary with (position, word) entries.

Building Dictionary and Data Preprocessing

As previously stated, two dictionaries are created using the steps below (a minimal sketch follows the list). The seq2seq algorithm requires these stages for abstraction; dictionary creation is not required for extractive text summarization (ETS).

Before continuing, four special tokens are added; they are used to pad sequences to the same length, to mark terms not present in the dictionary, to mark the beginning of a sentence, and to mark its end.

Input: the cleaned dataset.

Output: the dictionary (a document-term mapping).

1. Import the cleaned dataset with labels.

2. Use the built-in nltk function to tokenize the phrases into the words that make them up.

3. Only choose the most popular terms from the tokens listed above.

4. Insert the four special tokens.

5. Iterate through the terms to create a dictionary and pickle it.

6. Using the information above, create a reverse dictionary.

7. Determine the maximum lengths of the summary and article.
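A minimal sketch of steps 2 through 6 is given below, assuming NLTK for tokenization. The vocabulary size and the special-token names are illustrative assumptions, not values taken from the paper; step 7 (the maximum article and summary lengths) can be computed from the same tokenized texts.

import pickle
from collections import Counter
import nltk

nltk.download("punkt", quiet=True)  # tokenizer model

SPECIALS = ["<pad>", "<unk>", "<s>", "</s>"]  # padding, unknown word, start and end of sentence
VOCAB_SIZE = 50_000                           # assumed cut-off for the "most popular" terms

def build_dictionaries(cleaned_texts):
    counts = Counter()
    for text in cleaned_texts:
        counts.update(nltk.word_tokenize(text))                    # step 2: tokenize
    most_common = [w for w, _ in counts.most_common(VOCAB_SIZE)]   # step 3: keep frequent terms
    vocab = SPECIALS + most_common                                 # step 4: add special tokens
    word2idx = {word: i for i, word in enumerate(vocab)}           # step 5: (word, position)
    idx2word = {i: word for word, i in word2idx.items()}           # step 6: (position, word)
    with open("dictionary.pkl", "wb") as f:
        pickle.dump((word2idx, idx2word), f)
    return word2idx, idx2word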

Building the seq2seq Abstractive Text Summarizer (ATS) model

Training the attentive seq2seq model. An equal number of deep learning models has been built for each version of the training data, with their parameters tuned on the Gigaword validation set. The encoder's bidirectional LSTM consists of two layers of equal size (200), which is also the size of the decoder's LSTM layer. The batch size for the Gigaword and CNN/DailyMail data sets is set to 64, and the training data is shuffled at each epoch. The learning rate is initially set to 0.002 and decays by 25% after each training epoch. The Adam optimization method is used, with gradient-norm clipping and a negative conditional log-likelihood loss function.

Finally, dropout with p = 0.2 is employed. The CNN/DailyMail data set uses a vocabulary restricted to 150,000 words (i.e., the most frequent tokens from the training set), whereas the Gigaword data set has no such restriction. The models were trained on NVIDIA K40 GPUs and had adequately converged after 15 epochs.
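As an illustration of this schedule (an initial rate of 0.002 decaying by 25% per epoch, with Adam and gradient-norm clipping), a tf.keras configuration might look like the sketch below. The number of steps per epoch and the clipping threshold are assumptions, since the paper does not state them.

import tensorflow as tf

steps_per_epoch = 1000  # assumed; depends on the dataset size and the batch size of 64

# The learning rate starts at 0.002 and is multiplied by 0.75 once per epoch.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.002,
    decay_steps=steps_per_epoch,
    decay_rate=0.75,
    staircase=True,
)

# Adam with gradient-norm clipping, as described above (the clipnorm value is assumed).
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule, clipnorm=5.0)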

Training the PG (pointer-generator) model. The PG model differs from the attentive seq2seq model in that its encoder uses two layers of bidirectional LSTMs with 256 dimensions and its decoder uses a single LSTM with 512 dimensions.

Training the RL (reinforcement-learning) model. For the Gigaword and CNN/DailyMail data sets, the learning rate is set to 10⁻⁴ and the batch size to 32 and 16, respectively. The LSTM layers have a dimensionality comparable to the PG model described above. The remaining training parameters follow the same settings as the designs mentioned previously.


Training the TR (Transformer) model. The encoder and decoder each consist of a six-layer stack. The inner-layer dimensionality is set to 2,048 and the model dimensionality to 512. Eight attention heads (i.e., eight parallel attention layers) are used, which reduces each head's dimensionality to 512/8 = 64.

The Adam optimizer is used with parameters β₁ = 0.9 and β₂ = 0.99. Equation (6) is used to adjust the learning rate during training (increasing it for the first warmupSteps training steps and decreasing it thereafter), where warmupSteps = 5,000 and a = 0.05. For the Gigaword and CNN/DailyMail data sets, dropout is applied with probability p = 0.1 and the batch size is set to 64 and 16, respectively.
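Equation (6) is not reproduced in this article. As an assumption only, a warmup-then-decay schedule of the kind popularized by the original Transformer, scaled by the factor a, could be sketched as follows; the exact form of Equation (6) may differ.

D_MODEL = 512
WARMUP_STEPS = 5_000
A = 0.05  # the scale factor 'a' from the text; its exact role is assumed here

def transformer_lr(step: int) -> float:
    """Ramp the rate up for the first WARMUP_STEPS steps, then decay it with 1/sqrt(step)."""
    step = max(step, 1)
    return A * (D_MODEL ** -0.5) * min(step ** -0.5, step * WARMUP_STEPS ** -1.5)

# Example: the rate increases until step 5,000 and falls off afterwards.
for s in (100, 5_000, 20_000):
    print(s, transformer_lr(s))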

The seq2seq model is built as a recurrent neural network (RNN) organized in an encoder/decoder architecture. It has a bidirectional structure, with the RNN cell replaced by an LSTM cell, an attention mechanism for a better encoder/decoder interface, and beam search. The construction is broken down into the following blocks:

1. Initialization Block: define the TensorFlow (tf) placeholders, variables, and RNN cell.

2. Embedding Block: define the embedding matrix used by the encoder and decoder.

3. Encoder Block: define the multilayer bidirectional RNN.

4. Decoder Block: add the attention mechanism and beam search.

5. Loss Block: used only for ATS training; applies gradient clipping and the Adam optimizer.

Initialization Block

The steps to create the Initialization Block are as follows (a minimal sketch follows the list):

1. Install the necessary libraries.

2. Create a Model Class that accepts an object called args that contains many parameters.

3. Set all of the parameters, such as embedding size, num hidden, num layers, Learning Rate, and BeamWidth, to their default values.

4. In addition, the reversed dictionary and maximum article and summary length parameters must be set.

5. Define the testing phase, in which the LSTM will be used as the cell.

6. Set the data batch size.

7. Use summary length to define the decoder.

8. Create a global step that starts at zero.
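A minimal sketch of such an initialization is shown below. All default values and attribute names (args.embedding_size and so on) are illustrative assumptions rather than the paper's actual settings; the LSTM cell itself is defined in the encoder and decoder blocks.

from dataclasses import dataclass

@dataclass
class Args:
    # Default hyperparameters (step 3); the values here are assumptions.
    embedding_size: int = 300
    num_hidden: int = 200
    num_layers: int = 2
    learning_rate: float = 0.002
    beam_width: int = 4
    batch_size: int = 64          # step 6: batch size of the data
    max_article_len: int = 400    # step 4: maximum article length
    max_summary_len: int = 50     # step 4: maximum summary length, used by the decoder (step 7)

class Model:
    def __init__(self, args: Args, reversed_dict: dict, forward_only: bool = False):
        self.args = args
        self.reversed_dict = reversed_dict  # the (position, word) mapping from step 4
        self.forward_only = forward_only    # True in the testing phase (step 5)
        self.global_step = 0                # step 8: global step starting at zero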

Embedding Block

The algorithm is as follows (a sketch follows the list):

Input: the dictionary.

Output: the embedding matrix.

1. Create an embedding variable in variable scope.

2. If in the training phase and args.glove is enabled, initialize the embedding from args.glove using tf.constant.

3. For each word, get init is called to return its vector.

4. Otherwise, for the testing phase, initialize word2vector at random.

5. Define the encoder and decoder embeddings.
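A sketch of this block in tf.keras terms is given below. The GloVe file name and dimensionality are assumptions, and the tf.constant-based variant described in step 2 is replaced here by a constant initializer on a Keras Embedding layer.

import numpy as np
import tensorflow as tf

def build_embedding_layer(word2idx, glove_path="glove.6B.100d.txt", dim=100):
    """Build an embedding matrix from GloVe, falling back to random vectors for unknown words."""
    matrix = np.random.uniform(-0.1, 0.1, (len(word2idx), dim)).astype("float32")
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vector = parts[0], parts[1:]
            if word in word2idx:
                matrix[word2idx[word]] = np.asarray(vector, dtype="float32")
    # The same layer (or a second instance) provides the encoder and decoder embeddings (step 5).
    return tf.keras.layers.Embedding(
        input_dim=len(word2idx),
        output_dim=dim,
        embeddings_initializer=tf.keras.initializers.Constant(matrix),
        mask_zero=True,  # assumes index 0 is the padding token
    )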

Encoder Block:

The steps to take are as follows (a sketch follows the list):

1. Establish forward and reverse cells.

2. Connect them using stack_bidirectional_dynamic_rnn with the following parameters: the forward cells, the backward cells, the embedded encoder input (word2vec format), X_len (the lengths of the articles), and time_major (if True, transposes are avoided).

3. Produce the encoder output (used for attention calculations) and the encoder state (used as the decoder's initial state).

4. Use LSTMStateTuple to combine the forward and backward encoder states.
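The sketch below expresses these steps with tf.keras layers (the lower-level stack_bidirectional_dynamic_rnn API named above would be used in a TF1-style implementation). The layer size of 200 follows the training setup described earlier; everything else is an assumption.

import tensorflow as tf

NUM_HIDDEN = 200  # size of each direction, matching the attentive seq2seq setup above

def build_encoder(embedded_inputs):
    """Return per-token encoder outputs (for attention) and a combined state (for the decoder)."""
    bi_lstm = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(NUM_HIDDEN, return_sequences=True, return_state=True)
    )
    outputs, fwd_h, fwd_c, bwd_h, bwd_c = bi_lstm(embedded_inputs)
    # Concatenate the forward and backward states, mirroring the LSTMStateTuple combination.
    state_h = tf.concat([fwd_h, bwd_h], axis=-1)
    state_c = tf.concat([fwd_c, bwd_c], axis=-1)
    return outputs, (state_h, state_c)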

Decoder Block

The Decoder Block has two parts (a sketch follows the steps):

Part A: Attention Model. Inputs: the encoder output and the decoder input.

1. BahdanauAttention is the attention structure employed here.

2. The encoder output is used to calculate attention.

3. The decoder cell is a multilayer LSTM to which AttentionWrapper is applied to incorporate attention.

4. To combine the two inputs, create a helper function.

5. Transpose the decoder output and use all RNN outputs to generate logits.

6. Adjust the logits.
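A training-time sketch of Part A is shown below, using the Keras AdditiveAttention layer as a stand-in for BahdanauAttention; the vocabulary size is an assumption, and the AttentionWrapper, helper, and beam-search machinery from the steps above are omitted for brevity. TensorFlow Addons (tfa.seq2seq) provides direct equivalents of BahdanauAttention, AttentionWrapper, and BeamSearchDecoder.

import tensorflow as tf

VOCAB_SIZE = 50_000  # assumed; must match the dictionary built earlier
NUM_HIDDEN = 400     # 2 x 200, matching the concatenated bidirectional encoder state

def decode_for_training(decoder_embedded, encoder_outputs, initial_state):
    """One teacher-forced pass of the decoder with Bahdanau-style (additive) attention."""
    decoder_lstm = tf.keras.layers.LSTM(NUM_HIDDEN, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_embedded, initial_state=list(initial_state))
    # Additive attention: decoder outputs act as queries, encoder outputs as keys/values.
    context = tf.keras.layers.AdditiveAttention()([decoder_outputs, encoder_outputs])
    combined = tf.concat([decoder_outputs, context], axis=-1)
    # Project every decoder step onto the vocabulary to obtain the logits (steps 5 and 6).
    return tf.keras.layers.Dense(VOCAB_SIZE)(combined)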

Loss Block

This block handles training: loss computation, gradient calculation, gradient clipping, and optimizer application. The procedure is stated below, and a minimal training-step sketch follows the steps.

1. Create an embedding variable in variable scope.

2. If in the training phase and args.glove is enabled, initialize the embedding from args.glove using tf.constant.

3. For each word, get init is called to return its vector.

4. Create a name scope and a block that will only be utilized during training.

5. Calculate the loss (softmax cross-entropy).

6. Compute the gradients.

7. Use gradient clipping to address the exploding-gradient problem.

8. Apply the optimizer.
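A minimal training-step sketch covering steps 5 through 8 (softmax cross-entropy loss, gradient computation, clipping, and the optimizer update) is shown below; the clipping norm and padding index are assumptions.

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.002)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE
)

@tf.function
def train_step(model, articles, summary_inputs, summary_targets):
    with tf.GradientTape() as tape:
        logits = model([articles, summary_inputs], training=True)  # (batch, time, vocab)
        per_token_loss = loss_fn(summary_targets, logits)          # step 5
        # Mask out padding positions (assumes the pad token has index 0).
        mask = tf.cast(tf.not_equal(summary_targets, 0), per_token_loss.dtype)
        loss = tf.reduce_sum(per_token_loss * mask) / tf.reduce_sum(mask)
    grads = tape.gradient(loss, model.trainable_variables)            # step 6
    grads, _ = tf.clip_by_global_norm(grads, 5.0)                     # step 7: clip gradients
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # step 8
    return loss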

Results and Discussion

Metrics for both the seq2seq model and TextRank have been observed with the proposed technique, and the rate at which they change is examined. To begin the evaluation from the start, the dataset is loaded and prepared for further processing, as illustrated in Figure 3.

Using this clean dataset, dictionaries are built to map words to their associated integer values and vice versa.

Next, a numerical analysis of the articles and their summaries is performed. At the 99th percentile, the articles are 28 pages long, while the summaries are 11 pages long.

Following training, five inputs from the testing dataset are chosen at random and their summaries are generated.
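The ROUGE scores used throughout can be computed with, for example, Google's rouge-score package; the snippet below is only a sketch with made-up reference and generated strings.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "the central bank raised interest rates on tuesday"
generated = "central bank raises interest rates"

scores = scorer.score(reference, generated)
for name, score in scores.items():
    # Each entry holds precision, recall, and F-measure for that ROUGE variant.
    print(name, round(score.fmeasure, 3))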

Figure 3 shows how the metrics fluctuate when using seq2seq to check the same five random articles from the validation dataset.

 

Figure 3. The metrics

 

Similarly, findings are produced for the extractive summarization. During prediction, the compression percentage had to be provided, i.e., the proportion of the original text to which the summary should be reduced. The predicted summaries are saved to a file, and the findings for compression rates of 25%, 50%, and 75% are shown in Figures 6, 7, 8, and 9.

Conclusion

This work attempts to create an abstractive summarization platform, with scope for use on legal or judicial data. Due to the lack of available data on that subject, data of a similar kind, i.e. news reports, was used instead. In addition to the abstractive model, an interactive module is offered that allows the two approaches to be compared. Following the seq2seq architecture, alternative models and architectures might be tried as a next step. By using the identical process of word embeddings followed by an encoder-decoder with an attention mechanism, the summarization results are also merged for legal documents and news articles.

 

References:

  1. Rakesh M. Verma and Daniel Lee. 2017. Extractive Summarization: Limits, Compression, Generalized Model and Heuristics. Computación y Sistemas 21 (2017).
  2. Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond. CoNLL 2016 (2016), 280.
  3. Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep Communicating Agents for Abstractive Summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Vol. 1. 1662–1675.
  4. Gupta, Vanyaa, Neha Bansal, and Arun Sharma. "Text summarization for big data: A comprehensive survey." In International Conference on Innovative Computing and Communications, pp. 503-516. Springer, Singapore, 2019.
  5. Kanapala, Ambedkar, Sukomal Pal, and Rajendra Pamula. "Text summarization from legal documents: a survey." Artificial Intelligence Review 51, no. 3 (2019): 371-402.
  6. Jain, Aditya, Divij Bhatia, and Manish K. Thakur. "Extractive text summarization using word vector embedding." In 2017 International Conference on Machine Learning and Data Science (MLDS), pp. 51-55. IEEE, 2017.
  7. Yousefi-Azar, Mahmood, and Len Hamey. "Text summarization using unsupervised deep learning." Expert Systems with Applications 68 (2017): 93-105.