Methods for Text Summarization

Automatic summarization is the algorithmic shortening of a communication. The result should convey the important points of the original in a shorter form while remaining complete, readable and faithful to the original's vocabulary. This allows a user to consume critical information as efficiently as possible. Summarization appears in virtually every form of consumable media, for example:

  • Abstracts of research papers

  • Synopses of books and films

  • Headlines of articles

  • Minutes of a meeting

  • Sound bites of politicians and correspondents for news media

Text summarization is of particular interest because it is relatively simple to approach computationally, yet offers a large potential gain in productivity. For this reason, vendors such as Google have invested heavily in research in this area and now feature summaries that many of us use every day, for example by answering a user’s query directly on the search page without the need to click through to a source.

Text summarization can be broadly categorized into two approaches:

  1. Extractive, wherein the key phrases are identified and presented verbatim, with no additional content generated by the algorithm.

  2. Abstractive, where the whole text is interpreted, and a summarized output is generated using language that may not have been present in the original.

Extractive techniques have the advantage of being far easier to implement, because they score and concatenate existing phrases rather than creating new ones. For this reason, more research has been done in this area and more toolsets are available for extractive summarization. The main drawback of this approach is the inconsistent fluency of the resulting summary.
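The idea of scoring and concatenating existing phrases can be illustrated with a minimal sketch. It assumes a simple word-frequency score per sentence, a naive punctuation-based sentence splitter and a tiny stopword list; these choices, and the names `split_sentences`, `tokenize` and `extractive_summary`, are illustrative assumptions rather than a method prescribed by this whitepaper.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {
    "the", "a", "an", "of", "to", "and", "in", "is", "it", "that",
    "on", "for", "with", "as", "are", "was", "be", "by", "this",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences verbatim, in their original order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole document act as importance weights.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favoured by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score (ties keep document order), keep the top n,
    # then restore document order before concatenating.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    sample = ("Cats sleep a lot. Cats chase mice. "
              "Dogs bark loudly. The weather is nice.")
    print(extractive_summary(sample, num_sentences=2))
```

Because the selected sentences are copied verbatim and simply joined, nothing new is generated. This also shows where the fluency problem comes from: adjacent sentences in the summary may not have been adjacent in the source.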

Abstractive techniques are more complex and generally require more sophisticated algorithms and more computing power. They are usually highly domain-specific, with little transferability to other types of text. However, within their intended domain, they can provide highly accurate and readable summaries.

This whitepaper discusses the most popular approaches to automatic text summarization: their implementations, practicality, and relative strengths and weaknesses. Text summarization is considered in its wider business and historical context, with an overview of the leading institutions in the field. The available commercial and open-source options are also surveyed, along with a summary of related fields in Natural Language Processing.