How to Run Text Summarization with TensorFlow

Text summarization has many useful applications. If you run a website, you can create titles and short summaries for user-generated content. If you want to read a lot of articles but don’t have the time, a virtual assistant can summarize the main points of these articles for you.

It is not an easy problem to solve. There are multiple approaches, including various supervised and unsupervised algorithms. Some algorithms rank the importance of sentences within the text and then construct a summary out of the most important sentences; others are end-to-end generative models.
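As a rough illustration of the sentence-ranking family, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top ones. The function name, the regex-based sentence splitter, and the scoring rule are assumptions chosen for illustration; this is not a specific published algorithm.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return ' '.join(sentences[i] for i in keep)


text = 'Cats sleep. Cats eat fish. Dogs bark loudly at night.'
print(extractive_summary(text, num_sentences=1))
```

An extractive method like this can only reuse sentences that already exist in the text, which is the main difference from the generative models discussed in the rest of the article.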

End-to-end machine learning algorithms are interesting to try. After all, end-to-end algorithms demonstrate good results in other areas, like image recognition, speech recognition, language translation, and even question-answering.

Image credit: https://research.googleblog.com/2015/11/computer-respond-to-this-email.html

Text summarization with TensorFlow

In August 2016, Peter Liu and Xin Pan, software engineers on the Google Brain team, published a blog post, “Text summarization with TensorFlow”. Their algorithm extracts interesting parts of the text and creates a summary from these parts, allowing for rephrasing to make the summary more grammatically correct. This approach is called abstractive summarization.

Peter and Xin trained a text summarization model to produce headlines for news articles, using Annotated English Gigaword, a dataset often used in summarization research. The dataset contains about 10 million documents. The model was trained end-to-end with a deep learning technique called sequence-to-sequence learning.

Code for training and testing the model is included in the TensorFlow Models GitHub repository. The core model is a sequence-to-sequence model with attention. When training, the model takes the first two sentences of the article as input and generates a headline.
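The input truncation can be pictured with a minimal sketch like the one below. It is not the repository's code: the helper name first_sentences and the regex-based sentence splitter are assumptions made for illustration.

```python
import re


def first_sentences(article, max_sentences=2):
    """Keep only the first few sentences of an article (naive splitting)."""
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', article.strip())
    return ' '.join(sentences[:max_sentences])


article = 'A storm hit the coast. Thousands lost power. Repairs may take days.'
print(first_sentences(article))
```

A real pipeline would likely use a proper sentence tokenizer instead of a regular expression, since abbreviations and quotes break this simple rule.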

When decoding, the algorithm uses beam search to find the best headline among the candidate headlines generated by the model.
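To show what beam search does, here is a small self-contained sketch. It is not the TextSum decoder: the function names, the toy probability table, and the end-of-sequence token are assumptions for illustration, and all probabilities are assumed to be greater than zero.

```python
import math


def beam_search(next_token_probs, beam_size=2, max_len=5, end_token='</s>'):
    """Return the best token sequence found; next_token_probs(prefix) -> {token: prob}."""
    beams = [((), 0.0)]  # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in next_token_probs(tokens).items():
                candidates.append((tokens + (token,), score + math.log(prob)))
        # Keep only the best `beam_size` hypotheses at each step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[1])
    return list(best[0])


def toy_model(prefix):
    """Hypothetical next-token distribution used only for this example."""
    table = {
        (): {'a': 0.6, 'b': 0.4},
        ('a',): {'x': 0.5, 'y': 0.5},
        ('b',): {'z': 0.9, 'w': 0.1},
    }
    return table.get(tuple(prefix), {'</s>': 1.0})


print(beam_search(toy_model, beam_size=1))  # greedy choice
print(beam_search(toy_model, beam_size=2))  # wider beam
```

With a beam of one, the search commits to the locally best first token and ends with a sequence of probability 0.30. With a beam of two, it keeps the second-best first token alive and finds a sequence of probability 0.36.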

The GitHub repository doesn’t include a trained model. The dataset is not publicly available: a license costs $6000 for organizations that are not members of the Linguistic Data Consortium. However, the repository includes a toy dataset, which is enough to run the code.

How to run

You will need TensorFlow and Bazel as prerequisites for training the model.

The toy dataset included in the repository contains two files in the “data” directory: “data” and “vocab”. The first one contains a sequence of serialized tensorflow.core.example.example_pb2.Example objects. Here is an example of code that creates a file in this format:

import struct
from tensorflow.core.example import example_pb2

output_filename = 'data/my-data.bin'  # hypothetical output path

# bytes_list values must be bytes, so encode the text first.
body = 'body'.encode('utf-8')
title = 'title'.encode('utf-8')

tf_example = example_pb2.Example()
tf_example.features.feature['article'].bytes_list.value.extend([body])
tf_example.features.feature['abstract'].bytes_list.value.extend([title])
tf_example_str = tf_example.SerializeToString()
str_len = len(tf_example_str)

with open(output_filename, 'wb') as writer:
  # Each record is an 8-byte length followed by the serialized Example.
  writer.write(struct.pack('q', str_len))
  writer.write(struct.pack('%ds' % str_len, tf_example_str))
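Reading such a file back is the mirror image of writing it. The sketch below uses only the standard library and raw byte payloads in place of serialized Example objects, so it runs without TensorFlow. The helper names write_records and read_records are hypothetical.

```python
import io
import struct


def write_records(stream, payloads):
    """Write each payload as an 8-byte length followed by the raw bytes."""
    for payload in payloads:
        stream.write(struct.pack('q', len(payload)))
        stream.write(payload)


def read_records(stream):
    """Yield payloads from a stream produced by write_records."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of file
        if len(header) < 8:
            raise ValueError('truncated length header')
        (length,) = struct.unpack('q', header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError('truncated record')
        yield payload


buf = io.BytesIO()
write_records(buf, [b'first record', b'second'])
buf.seek(0)
print(list(read_records(buf)))
```

With a real data file, each payload would be a serialized Example, which could then be parsed with example_pb2.Example.FromString(payload).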

The “vocab” file is a text file with the frequencies of words in the vocabulary. Each line contains a word, a space character, and the number of occurrences of that word in the dataset. The list is used to vectorize texts.
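A vocabulary in this “word count” layout can be produced with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization; the function names are hypothetical, and it does not add any special tokens that the model code may additionally require.

```python
from collections import Counter


def build_vocab_lines(texts, max_words=None):
    """Return 'word count' lines, most frequent words first."""
    counts = Counter(word for text in texts for word in text.split())
    return ['%s %d' % (word, n) for word, n in counts.most_common(max_words)]


def load_word_ids(lines):
    """Assign consecutive integer ids to words in file order."""
    word_to_id = {}
    for line in lines:
        word, _count = line.rsplit(' ', 1)
        word_to_id[word] = len(word_to_id)
    return word_to_id


lines = build_vocab_lines(['the cat sat', 'the dog sat', 'the end'])
print(lines)
print(load_word_ids(lines))
```

Mapping each word to an integer id like this is what “vectorizing” a text amounts to: a sentence becomes a list of ids that the model can consume.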

Running the code on the toy dataset is really simple. The readme in the GitHub repo lists a sequence of commands for running the training and testing code.

You can run TensorBoard to monitor the training process:

TextSum-toy-train

When running the “decode” code, note that it will loop over the entire dataset indefinitely, so you will have to stop execution manually at some point. You can find the results of decoding in the log_root/decode folder. It will contain a few files. Files with the prefix “ref” contain the original headlines from the test set; files with the prefix “decode” contain the headlines generated by the model.

Troubleshooting

You may encounter an error when running the “eval” or “decode” code with TensorFlow 0.10 or later:

“ValueError: Could not flatten dictionary. Key had 2 elements, but value had 1 elements.”

There is an open issue on GitHub for this error. One workaround is to downgrade TensorFlow to 0.9; it worked for me. Another workaround requires changing the code of the model: add “state_is_tuple=False” to the instantiations of LSTMCell in seq2seq_attention_model.py.


If you run training and decoding on the toy dataset, you will notice that decoding generates nonsense. Here are a few examples of generated headlines:

<UNK> to <UNK> <UNK> <UNK> <UNK> <UNK> .

<UNK> <UNK> <UNK> <UNK> of <UNK> <UNK> from <UNK> <UNK> .

in in <UNK> <UNK> <UNK> .

One of the reasons for poor performance on the toy set could be the incompleteness of the vocabulary file. The vocabulary file is truncated and doesn’t contain many of the words used in the “data” file. This leads to too many “<UNK>” tokens, which represent unknown words.
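The effect of a truncated vocabulary is easy to measure. The following sketch replaces out-of-vocabulary words with “<UNK>” and reports the share of replaced tokens. The function name, the sample sentence, and the tiny vocabulary are made up for illustration.

```python
def unknown_rate(tokens, vocabulary, unk='<UNK>'):
    """Replace out-of-vocabulary tokens and return (new tokens, unknown share)."""
    replaced = [t if t in vocabulary else unk for t in tokens]
    rate = replaced.count(unk) / len(replaced) if replaced else 0.0
    return replaced, rate


vocabulary = {'the', 'to', 'of', 'in'}
tokens = 'police to investigate death of teenager in custody'.split()
replaced, rate = unknown_rate(tokens, vocabulary)
print(' '.join(replaced))
print(rate)
```

When only function words survive in the vocabulary, the model sees inputs that look much like the nonsense headlines above, so it has little to learn from.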

How to run on another dataset

A toy dataset is, well, a toy. To create a useful model, you should train it on a large dataset. Ideally, the dataset should be specific to your task. Summarizing news articles may be different from summarizing legal documents or job descriptions.

As I don’t have access to the Gigaword dataset, I tried to train the model on smaller news article datasets that are free: CNN and DailyMail. I found the code to download these datasets in the DeepMind/rcdata GitHub repo, and slightly modified it to add the title of the article as the first line of each output file. See modified code here.

There are 92,570 articles in the CNN dataset and 219,503 articles in the Daily Mail dataset. There could be a few more, but the code from the DeepMind repo could not download all URLs. About 312k articles is far fewer than the 10 million in Gigaword, so I would expect lower performance from a model trained on these datasets.

After you run the code to download the dataset, you will have a folder with lots of files, one HTML file for every article. To use it with the TextSum model, you will need to convert it to the binary format described above. You can find my code for converting CNN/DailyMail articles into the binary format in the textsum_data_convert.py file in my “TextSum” repo on GitHub. Here is an example of running the code for the CNN dataset:

python textsum_data_convert.py \
  --command text_to_vocabulary \
  --in_directories cnn/stories \
  --out_files cnn-vocab

python textsum_data_convert.py \
  --command text_to_binary \
  --in_directories cnn/stories \
  --out_files cnn-train.bin,cnn-validation.bin,cnn-test.bin \
  --split 0.8,0.15,0.05
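The --split values are fractions for the training, validation, and test sets. The sketch below shows one plausible way to turn such fractions into partitions. It is an assumption about the general idea, not the actual logic of textsum_data_convert.py, which may shuffle or assign files differently.

```python
def split_items(items, fractions):
    """Partition items into consecutive chunks sized by the given fractions."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError('fractions must sum to 1')
    chunks, start = [], 0
    for i, fraction in enumerate(fractions):
        if i == len(fractions) - 1:
            end = len(items)  # the last chunk takes the remainder
        else:
            end = start + int(round(fraction * len(items)))
        chunks.append(items[start:end])
        start = end
    return chunks


files = ['story-%03d' % i for i in range(100)]  # hypothetical file names
train, validation, test = split_items(files, (0.8, 0.15, 0.05))
print(len(train), len(validation), len(test))
```

Giving the remainder to the last chunk means that rounding never drops an article.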

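Then you can copy the train/validation/test sets and the vocabulary file into the “data” directory and start training the model. Note that the commands below expect the vocabulary file at data/cnn-vocab.bin, so either rename cnn-vocab when copying or adjust --vocab_path.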

Train on CNN data (cnn-train, cnn-vocab):

bazel-bin/textsum/seq2seq_attention \
  --mode=train \
  --article_key=article \
  --abstract_key=abstract \
  --data_path=data/cnn-train.bin \
  --vocab_path=data/cnn-vocab.bin \
  --log_root=log_root \
  --train_dir=log_root/train \
  --truncate_input=True

Evaluate on CNN data (cnn-validation, cnn-vocab):

bazel-bin/textsum/seq2seq_attention \
  --mode=eval \
  --article_key=article \
  --abstract_key=abstract \
  --data_path=data/cnn-validation.bin \
  --vocab_path=data/cnn-vocab.bin \
  --log_root=log_root \
  --eval_dir=log_root/eval \
  --truncate_input=True

Decode on CNN data (cnn-test, cnn-vocab):

bazel-bin/textsum/seq2seq_attention \
  --mode=decode \
  --article_key=article \
  --abstract_key=abstract \
  --data_path=data/cnn-test.bin \
  --vocab_path=data/cnn-vocab.bin \
  --log_root=log_root \
  --decode_dir=log_root/decode \
  --beam_size=8 \
  --truncate_input=True

Training with default parameters doesn’t go very well. Here is a graph of running_avg_loss:

TextSum-cnn-train
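The running_avg_loss curve is a smoothed version of the per-step loss. A common way to compute such a curve is an exponentially decayed average, sketched below. The function name and the decay value are assumptions for illustration, not a quotation of the TextSum code.

```python
def update_running_avg(running_avg, loss, decay=0.999):
    """Exponentially decayed running average; the first loss seeds the average."""
    if running_avg is None:
        return loss
    return running_avg * decay + (1.0 - decay) * loss


avg = None
for step_loss in [4.0, 2.0]:
    avg = update_running_avg(avg, step_loss, decay=0.5)
print(avg)
```

With a decay close to 1, a single unusually low or high step barely moves the curve, which is why the smoothed value is a better indicator of progress than individual loss readings.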

Decoding results are also disappointing:

“your your <UNK>”

“We’ll the <UNK>”

“snow hit hit hit <UNK>”

Either the dataset is too small, or the hyperparameters need to be changed for this dataset.


When running the code, I found that the training code doesn’t use the GPU, though I have the correct configuration: a GeForce GTX 980 Ti, CUDA, cuDNN, and TensorFlow compiled with GPU support. While training, the python process consumes 100–300+% CPU and appears in the list of processes when running nvidia-smi, but GPU utilization stays at 0%.

Thu Oct 13 20:14:11 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  Off  | 0000:01:00.0     Off |                  N/A |
| 29%   46C    P8    26W / 250W |    124MiB /  6142MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     17605    C   /usr/bin/python                                102MiB |
+-----------------------------------------------------------------------------+

I guess this may be related to the fact that the authors of the model ran the code on multiple GPUs, and one GPU had a special purpose. Here is a fragment of the seq2seq_attention_model.py file:

  def _next_device(self):
    """Round robin the gpu device. (Reserve last gpu for expensive op)."""
    if self._num_gpus == 0:
      return ''
    dev = '/gpu:%d' % self._cur_gpu
    self._cur_gpu = (self._cur_gpu + 1) % (self._num_gpus-1)
    return dev
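To see which device strings this logic produces, here is a standalone sketch that mirrors the fragment. The class name is hypothetical, and the extra check for a single GPU is my addition: in the fragment as quoted, self._num_gpus - 1 is zero when there is exactly one GPU, so the modulo would raise ZeroDivisionError.

```python
class DeviceRoundRobin(object):
    """Cycle over all GPUs except the last one, which stays reserved."""

    def __init__(self, num_gpus):
        self._num_gpus = num_gpus
        self._cur_gpu = 0

    def next_device(self):
        if self._num_gpus == 0:
            return ''  # no explicit device is requested
        dev = '/gpu:%d' % self._cur_gpu
        if self._num_gpus > 1:  # guard added in this sketch to avoid modulo by zero
            self._cur_gpu = (self._cur_gpu + 1) % (self._num_gpus - 1)
        return dev


for num_gpus in (0, 1, 3):
    chooser = DeviceRoundRobin(num_gpus)
    print(num_gpus, [chooser.next_device() for _ in range(4)])
```

With three GPUs, the sketch alternates between /gpu:0 and /gpu:1 and never hands out /gpu:2. With zero configured GPUs, it returns an empty string every time, so no operation is pinned to a GPU by this function.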

The decoding code uses the GPU quite well. It consumes almost all 6 GB of GPU memory and keeps utilization over 50%.

Sat Oct 15 01:51:04 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  Off  | 0000:01:00.0     Off |                  N/A |
| 32%   53C    P2    96W / 250W |   5881MiB /  6142MiB |     56%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3020    C   /usr/bin/python                               5858MiB |
+-----------------------------------------------------------------------------+

Conclusion

Using the code from this article, you can easily run the text summarization model on your own dataset. Let me know if you find something interesting!

If you happen to have a license for the Gigaword dataset, I would be happy if you shared a trained TensorFlow model with me. I would like to try it on some proprietary data that doesn’t come from news articles.

Do you use any other text summarization algorithms? What works best?

 

  • Daniel Krotov

    Best article I have seen so far on textsum. I just wanted to add something that you might find helpful. Back when I first started playing with this myself, I ended up generating a vocab file based on the toy dataset. I have submitted a pull request but should you wish to provide a reference to it, I have listed the link below.

    https://github.com/tensorflow/models/files/499497/vocab.txt

    I too scraped my own data and have yet to actually get good results. I am running this on my laptop which has a 980M. I’m currently training against about 40k articles, and getting an avg loss of 3.5 after about a week. I got this down to about 1.5ish last week but when I attempted upgrading to 0.11 my data got corrupted. So I’m retraining again using 0.9. When you trained your CNN dataset, what did you get your average loss down to before you tried decoding? Xin Pan replied to someone that with a large dataset, you shouldn’t expect to get below 1.0, however now I’m wondering if what I’m using is really considered “large” if he was speaking in reference to the Gigaword dataset.

    Curious to hear what you have found so far. I have been trying to gain a deeper understanding lately into why the results are so off yet look so good with the sample provided on git. I will let you know if I find anything else out. Again awesome job with the article!

    • surmenok

      Thanks for the comment!
      The best loss I got was 0.6551, but it was just one data point, at 18k training steps (I don’t know how long it was in terms of training time, maybe one day or a bit less). Most of the time loss was fluctuating between 1.5 and 7.
      I haven’t researched text summarization more after writing this article. I may revisit this problem if I get GigaWord or other dataset of significant size.
      Please let me know if you’ll find any clues on how to make the model work on smaller datasets, or if you know where to get larger datasets.

      • Daniel Krotov

        I will keep you up to date with my progress on Textsum and results I get as I get deeper. So now I am using my gaming laptop as a primary TF training machine so no more games for me for a while :) I originally trained against the toy dataset until I got to about 0.005 on avg. The results weren’t too bad at times but still not fully usable. So I then scraped 70k articles and set 30% aside for testing and used the remaining 40k to train. I trained for about 2 weeks and to be honest, wasn’t all too impressed.

        As I dug deeper into the code, and read a plethora of white papers, I started to understand why so much data is crucial. Textsum is an abstractive model rather than extractive. With an extractive model, you are literally taking the article and deleting pieces of it until you get something close to the headline. Textsum on the other hand tries to create the headline from what it tries to understand from the article and references it has trained against in the past. This is quite a fascinating approach; however, to achieve good results, it needs a lot of data and that data needs to be clean.

        This in my opinion, as you have probably found also is the toughest part of the process. Clean data that can be used for training. I unfortunately have not had the luck of being able to find how to get a hold of the Gigaword dataset, but after reaching out to LDC, they did state that they have CNN articles and some others that run around $300. I’m more prone to purchasing a dataset around this price rather than the $6k price-tag of the gigaword.

        Right now I am in the process of writing my own scraper and I’m going to see how it works out. I will keep you abreast of the results. I’m going to shoot for about a million articles and see how the results are. Hopefully with more data I can get much better results, but until I can prove that I understand this model, take this all with a grain of salt :)

        One important thing to note that Xin had stated in one of the tickets on git was that with large datasets you shouldn’t expect to get an avg loss lower than 1. My current dataset I trained above got decent results with the 0.005 avg loss but when I went to eval against the test set I was only seeing 0.8. This clearly shows that I had over-fitted and was providing me false results. I wasn’t stopping the training as I was expecting that if I was over-fitting, the avg loss would start rising again in tensorboard but I think it must be another graph where that shows up when over-fitting.

        That’s it for now. I will update you should I gain more insight on this.

        • Richard Liao

          Daniel, it is very interesting to see that you have gone that far. Just a question for you and Surmenok: the original implementation takes only the first two sentences as encoder input, but why not the whole article? If the information is contained only in the first few sentences, the attention network should be able to pick them up, so why discard the rest of the article? Of course, computation might be costly, and it also has to deal with variable lengths of the articles. Daniel, in your effort, did you feed in whole articles (CNN and others)? How is the performance with this approach? Thx!

          • Daniel Krotov

            Hi Richard! So you will see in the code that you can totally take as many sentences as you wish. 2 is the default but you can increase it higher. I am hoping to do more work in this area at a later time once I figure out tensorflow serving so I can try serving the textsum model. This has been giving me quite some difficulty so it has been where my focus has been lately.

            That said, you hit it on the head. One of the biggest reasons for only using 2 sentences is primarily around the resources necessary to train against full articles. It also seems from the white papers I have read, that the abstractive model doesn’t work as well when sampling much longer articles. I have not been able to prove this myself though.

            In the end, training against 1.3 million articles to an avg loss of 1.8-2.2 took about a week and a half on an Nvidia 980GTX and finally provided decent results. When training against 40k articles, the results were still pretty bad.

            Should I figure out the TF Serving issues, I will be jumping back into testing longer articles and will update you as to my findings.

            • surmenok

              Hi Daniel,
              Did you get any further with text summarization?

          • surmenok

            I have not found a definitive answer to this. The only thing which could point to the reason is this sentence in the blog post about TextSum (https://research.googleblog.com/2016/08/text-summarization-with-tensorflow.html): “It turns out for shorter texts, summarization can be learned end-to-end with a deep learning technique called sequence-to-sequence learning, …”
            I guess that the model just doesn’t have good accuracy if trained on longer messages. Probably with longer inputs a recurrent neural network rolls out to longer computational graph and it’s harder to train because of vanishing gradients.

  • Hafnu hafnu

    I am new to TensorFlow. Can anyone help me get started with text summarization with TensorFlow?

    • surmenok

      Hi Hafnu,
      This article is basically a tutorial for running text summarization with TensorFlow. Have you tried to follow the steps described in the article? Do you have any questions in particular?

  • Luckman

    Thanks for sharing! It’s a good article with details. But I am stuck in the step of converting cnn/dailymail articles into binary format, vocab.bin and train.bin. After processing the data by generate_questions.py, I got a lot of files, such as questions/training, questions/test and questions/validation. However, the folder of stories is still empty, and I just cannot turn this questions data into vocab and binary training data. I can’t get which step I missed or run it incorrectly.

    • surmenok

      Try to run generate_questions.py in “store” mode. This mode generates files in “stories” folder.
      python generate_questions.py --corpus=cnn --mode=store

      • Luckman

        Ok! I get it! I missed that step. Thanks a lot!

  • Tjs

    Where are the output headlines stored?? After decoding

    • surmenok

      decode* files in log_root/decode folder

      • Tjs

        Yes. Found it. but , all decode lines are showing . It is also not reading my test data. giving answers on data set. Thats very confusing.

  • surmenok

    Have you installed NLTK? http://www.nltk.org/
    It should be possible to install it using pip: pip install nltk

    • Tjs

      Yeah thanks.this was the issue.

  • Tjs

    Can please you share your trained model Pavel? I have the data and I tried it to train but its not giving the satisfactory results.

    • surmenok

      I can share the model I trained on CNN/DailyMail, but I’m not sure if it will be of any value, because it is not giving satisfactory results either. I don’t have a model trained on Gigaword dataset.

      • Shujian

        Hi Pavel, thanks for this great tutorial. I also have interest to take a look at the trained model on CNN/DailyMail. Would you mind sharing that?

        • surmenok

          I just started the training. I’ll let you know when it completes.

      • Tjs

        Thanks Pavel. Yes I am talking about the model trained on CNN/DailyMail. Even i don’t have access to Gigaword, just wanted to have an idea if I am going on the right track. If you can share it then it would be great. Please do let me know when you share it and it would be very kind of you if you share it at your earliest possible convenience.

        • surmenok

          My development machine was busy on other tasks, so I didn’t have a chance to fully train the model on CNN/DailyMail. I just started the training. I’ll let you know when it completes.

  • Tjs

    Is it necessary to evaluate the model? Can we test it without evaluation?

    • surmenok

      You can run evaluation to check how well it works on previously unseen samples. What do you mean by testing? You can run “decode” if you want just to summarize some documents.

  • surmenok

    On page 5 they say that they used this code as a starting point: https://github.com/nyu-dl/dl4mt-tutorial
    Have you tried that?

  • surmenok

    There could be a stopping condition, but I have never waited long enough to catch it. I was stopping training manually. The trained model is in log_root folder. There should be a bunch of model.* files: checkpoints of the model on different stages of training.

  • Sean Lee

    Thank you so much for sharing your insight. It helped me a lot in understanding the flow. However, I am experiencing some issues when I try it on my end.

    After running commands for training, it runs for few minutes and I see ‘Killed’. From there I don’t get any files created under /log_root/train

    Could you give me any advice on this? Thank you in advance :)

    • Ayushya Chitransh

      I also am facing the same problem. During training, my running avg loss starts showing up and gets killed randomly. In few instances my average loss reached to 3 and sometimes it got killed at 5. My training always stopped due to either freezing my laptop or displaying ‘Killed’. I looked up for this issue and landed at ‘Killed during checkpoint save’ (https://github.com/tensorflow/tensorflow/issues/1962 ) which suggested that it might be a memory issue.

      • Osama Jamil

        what is the RAM of your system you are using ?

        • Ayushya Chitransh

          4 GB, and am not using GPU. I am using toy dataset provided at github.

          • Osama Jamil

            well i dont think you will be able to run it on a system with 4 GB ram, i tried on 4GB using toy dataset. You can run it on min 12 GB ram for toy dataset. For bigger datasets you will need even more than that.

            • Sawyer

              You are right, I used 8G RAM to run the CNN stories, but it has taken too long time and still running

              • Osama Jamil

                yes if you look at ram usage and processing it will show no processing with 95-99% ram usage.. and if u run the program directly using python rather than bazel it will crash

                • Osama Jamil

                  Running on systems with atleast 12 GB ram will yield some results but if you are using CNN dataset it should be more than that or it will be killed after few runs with high avg loss

            • Ayushya Chitransh

              After reading your reply I checked my resource usage and yes, I found that it got killed due to a shortage of memory. Increasing swap memory in my Ubuntu has given better results, though. The full 3.7 GB of RAM plus 6.4 GB out of 18.6 GB of swap is currently in use, while the system is running right now at an average loss of 4.1 and still going. Increasing swap memory seems to work.

  • Osama Jamil

    Hi!
    Great work helped me a lot. thanks.
    I’m having a slight issue running on CNN data.
    The vocabulary file generated using the code above gives an assertion error. I’m not able to understand why it is causing issues.
    I get the following error:
    Traceback (most recent call last):
      File "/home/umair/sumModel/bazel-bin/textsum/seq2seq_attention.runfiles/__main__/textsum/seq2seq_attention.py", line 213, in <module>
        tf.app.run()
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
        sys.exit(main(sys.argv))
      File "/home/umair/sumModel/bazel-bin/textsum/seq2seq_attention.runfiles/__main__/textsum/seq2seq_attention.py", line 165, in main
        assert vocab.CheckVocab(data.SENTENCE_START) > 0
    AssertionError

    I’m using Ubuntu 14.04 with 12 GB RAM, TF version 0.9, and Python 2.7.
    Any help in this matter would be greatly appreciated. Thanks.

  • Yerik Wang

    Hi Pavel, How did you get the graph of running_avg_loss in the TensorBoard. I run the TensorBoard but could not find the graph of running_avg_loss in the “Event”

    • surmenok

      I just restarted training on CNN/DailyMail, and see these 5 graphs on TensorBoard: global_norm, global_step, learning_rate, loss, running_avg_loss.

  • surmenok

    Have you tried IBM Watson’s approach? Does it work better than TextSum?

    • Richard Liao

      I didn’t. I am not sure if there’s a trial that we can use IBM Watson. Instead, lately we are kind of satisfied with textRank algorithm. There’re some variations that we are exploring.

      • surmenok

        What kind of problem are you solving with TextRank?

        • Richard Liao

          Same. It’s unsupervised approach for text summary.

          • surmenok

            Can I talk with you about it offline? I’m curious how exactly you are applying TextRank to text summarization. Usually it is used for another kind of problems, like keyword and sentence extraction.

            • Richard Liao

              Sure. Email me at ricliao@gmail.com. I will send you a couple of links.