The automatic creation of literature abstracts

Research Notes

This is one of the first works on automatic summarization of (technical) documents. Luhn proposed that "the frequency of word occurrence in an article furnishes a useful measurement of word significance". He also proposed that "the relative position within a sentence of words having given values of significance furnishes a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements."

He argues that a writer repeats certain words as he advances or varies his arguments and as he elaborates on a subject, and this emphasis is taken as a measure of significance. Words with the same root are considered to express the same notion and are counted as the same word. In his method, an inventory of the words is taken and a word list is compiled in descending order of frequency. Luhn further states that in technical writing there is only a small probability that a single word will be used to convey different notions, and only a small probability that different words will be used to convey the same notion.

Luhn also introduces the idea of stop words: the highest-frequency words are too common and constitute "noise" in the data. They can be eliminated by checking each word against a stored list of stop words. He draws a boundary line on the plot of words against their frequency to mark off the section of the data that contains the most informative words. A bell-shaped curve E is drawn over this section, and words take on significance in proportion to the curve.
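As a rough illustration of the word inventory described above, the following Python sketch drops stop words, merges words that share a root, counts them, and keeps those inside a frequency band. The tiny stop list, the fixed-length prefix used as a stand-in for root matching, the cutoff values, and the names `STOP_WORDS`, `crude_root`, `significant_words`, `min_count`, and `max_count` are all assumptions made for illustration, not details taken from the paper.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a practical one would be much longer.
STOP_WORDS = {
    "the", "a", "an", "of", "and", "in", "to", "is", "it",
    "that", "as", "for", "on", "with", "by", "be", "are",
}


def crude_root(word, prefix_len=6):
    """Assumed stand-in for root matching: truncate to a fixed-length prefix."""
    return word[:prefix_len]


def significant_words(text, min_count=2, max_count=None):
    """Return {root: frequency} in descending order of frequency.

    Roots occurring fewer than min_count times, or more than max_count
    times when max_count is given, are discarded.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    roots = [crude_root(t) for t in tokens if t not in STOP_WORDS]
    counts = Counter(roots)
    return {
        root: n
        for root, n in counts.most_common()
        if n >= min_count and (max_count is None or n <= max_count)
    }
```

Calling `significant_words(article_text)` on the full text of an article gives a frequency-ordered word list; `max_count` plays the role of an upper cutoff for words that are too frequent to be informative but are missing from the stop list.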

After deriving the significance of words, Luhn formulates his criterion for selecting significant sentences from a document. He proposes that "wherever the greatest number of frequently occurring different words are found in greatest physical proximity to each other, the probability is very high that the information being conveyed is most representative of the article. From these considerations a “significance factor” can be derived which reflects the number of occurrences of significant words within a sentence and the linear distance between them due to the intervention of non-significant words. All sentences may be ranked in order of their significance according to this factor, and one or several of the highest ranking sentences may then be selected to serve as the auto-abstract."
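The sentence-ranking idea above can be sketched in Python as follows. The sketch assumes one common reading of the significance factor: significant words separated by at most `max_gap` non-significant words form a cluster, and a cluster scores the square of its significant-word count divided by its total length in words. That scoring rule, the default gap of four, the naive sentence splitter, and the names `significance_factor` and `rank_sentences` are assumptions for illustration. Tokens are matched against the significant-word set exactly, with no root conflation.

```python
import re


def significance_factor(tokens, significant, max_gap=4):
    """Score a tokenised sentence by its best cluster of significant words."""
    positions = [i for i, tok in enumerate(tokens) if tok in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            # Few enough non-significant words in between: same cluster.
            count += 1
        else:
            # Close the current cluster and start a new one at this word.
            best = max(best, count * count / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count * count / (prev - start + 1))


def rank_sentences(text, significant, top_n=1, max_gap=4):
    """Return the top_n (score, sentence) pairs, highest score first."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = [
        (significance_factor(re.findall(r"[a-z]+", s.lower()), significant, max_gap), s)
        for s in sentences
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

Squaring the count rewards sentences where many significant words sit close together, while dividing by the cluster length penalises clusters diluted by non-significant words, which matches the "greatest physical proximity" criterion quoted above.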

He then presents results obtained by applying the methods described above to some technical articles.

BibTeX

@article{Luhn58,
 author =   {H. P. Luhn},
 title =    {The Automatic Creation of Literature Abstracts},
 journal =  {IBM Journal of Research and Development},
 year =     {1958},
 volume =   {2},
 pages =    {159--165},
 number =   {2},
 x-location =   {yes},
 url = {http://courses.ischool.berkeley.edu/i256/f06/papers/luhn58.pdf},
}