New methods in automatic extracting

From LS2

Revision as of 15:19, 26 September 2007; Dipanjan (Talk | contribs)

Research Notes

This paper describes a study whose objective was to build an extracting system to produce indicative extracts, and to investigate a research methodology to handle new text and new extracting criteria efficiently. Three different evaluation schemes were applied to the resulting automatic extracts. Comparison of the automatic extracts and corresponding "target" extracts of 40 documents, which had not been used in the developmental research, showed that a mean of 44 percent of the sentences were co-selected. Also, the mean similarity rating, in terms of a subjective evaluation of content by information type, was 66 percent. These are to be compared with a mean of 25 percent coselected sentences and a mean of 34 percent similarity rating between target extracts and random extracts, respectively. Statistical comparison of the automatic and the corresponding target extracts for the documents used in the developmental phase showed a mean of 57 percent coselected sentences with a standard deviation of 15 percent. A sentence-by-sentence analysis of the corresponding automatic and target extracts of 20 of these documents resulted in a judgment that 84 percent of the computer-selected sentences could be classified as extract-worthy; i.e. they would be worthy of selection in an extract of unrestricted length.
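The co-selection figures above can be illustrated with a short sketch. It assumes that each extract is represented as a collection of sentence indices and that the percentage is taken relative to the size of the target extract; these notes do not state the exact definition, so both are assumptions. The function name `coselection_rate` and the toy data are invented for illustration and are not results from the paper.

```python
from statistics import mean, stdev


def coselection_rate(automatic, target):
    """Share of target-extract sentences also chosen by the automatic extract.

    Assumption: the rate is normalized by the size of the target extract.
    """
    target_set = set(target)
    if not target_set:
        raise ValueError("target extract must contain at least one sentence")
    return len(target_set & set(automatic)) / len(target_set)


# Toy data, invented for illustration: (automatic, target) sentence indices.
pairs = [
    ([0, 3, 5, 9], [0, 3, 7, 9]),  # 3 of 4 target sentences co-selected
    ([1, 2, 8], [1, 4, 6, 8]),     # 2 of 4
    ([0, 4], [0, 4]),              # 2 of 2
]

rates = [coselection_rate(auto, target) for auto, target in pairs]
mean_rate = mean(rates)   # mean co-selection over the documents
spread = stdev(rates)     # sample standard deviation over the documents
```

The same function could be applied to random extracts of equal length to obtain a baseline of the kind reported above.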


The research methodology rests on the claim that certain clues, independent of subject content, indicate that a sentence is "significant". The research steps were: select a collection of documents and study their characteristics; specify the desired form and content of the "target" extracts; manually produce target extracts that meet both the content and the machine specifications; develop a system that assigns numerical weights to machine-recognizable characteristics; write a program to generate extracts; cyclically improve the quality of the extracts by comparing them with the training corpus; and test the quality of the automatic extracts on previously unseen data.

For preliminary statistical analysis, a set of 200 documents was selected from the fields of physical sciences and humanities. For extracting experiments on a different corpus, 200 documents from chemistry were selected. Their lengths ranged from 100 to 3900 words. Target extracts were created following a protocol described in detail in the paper. Lack of redundancy, coherence, and the number of sentences were the salient features kept in mind while creating the extracts. Characteristics such as anaphora caused some problems in extraction, because sufficient compression levels were not reached.

Automation was achieved using four basic criteria: Cue, Key, Title, and Location. A dictionary was regarded as a list of words that formed a fixed input to the automatic extracting system and was independent of the document being summarized. A glossary, on the other hand, was regarded as a list of words with numerical weights that formed a variable input to the automatic extracting system and was composed of words selected from the document being extracted.

The Cue method scored sentences by the presence of words like "significant", "impossible" and "hardly". The Key method followed (Luhn, 1958), which claims that high-frequency content words are positively relevant. The Title method analyzed certain specific characteristics of the skeleton of the document, such as the title, headings and format. The Location method hypothesized that sentences occurring under certain headings are positively relevant, and that topic sentences occur very early or very late in the document. Finally, a weight was attached to each method and the final score of a sentence was computed as a weighted sum. The length of the sentences was also used as a factor in computing the score.
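The weighted combination of the four methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described in the paper: the cue weights and their signs, the stopword list, the glossary size, the rule that only the first and last sentences receive a location score, and all function names are hypothetical. The sentence-length factor is omitted, and heading information is not modeled.

```python
import re
from collections import Counter

# Hypothetical cue dictionary: a fixed input, independent of the document.
# The three words come from the notes; the weights and signs are assumed.
CUE_WEIGHTS = {"significant": 1.0, "impossible": -1.0, "hardly": -1.0}
# Small illustrative stopword list (assumed).
STOPWORDS = {"a", "an", "the", "is", "are", "was", "for", "from", "of", "in"}


def tokenize(text):
    """Lowercase a string and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cue_score(words):
    """Cue method: sum the dictionary weights of the cue words present."""
    return sum(CUE_WEIGHTS.get(w, 0.0) for w in words)


def key_glossary(sentences, top_n=5):
    """Key method glossary: the document's most frequent content words."""
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOPWORDS
    )
    return dict(counts.most_common(top_n))


def key_score(words, glossary):
    """Key method: sum the glossary weights (frequencies) of the words."""
    return sum(glossary.get(w, 0) for w in words)


def title_score(words, title_words):
    """Title method: count the words shared with the title."""
    return sum(1 for w in words if w in title_words)


def location_score(index, total):
    """Location method (simplified): favor the first and last sentence."""
    return 1.0 if index in (0, total - 1) else 0.0


def score_sentences(title, sentences, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the Cue, Key, Title and Location scores."""
    a_cue, a_key, a_title, a_loc = weights
    glossary = key_glossary(sentences)
    title_words = set(tokenize(title)) - STOPWORDS
    scores = []
    for i, sentence in enumerate(sentences):
        words = tokenize(sentence)
        scores.append(
            a_cue * cue_score(words)
            + a_key * key_score(words, glossary)
            + a_title * title_score(words, title_words)
            + a_loc * location_score(i, len(sentences))
        )
    return scores


def extract(title, sentences, n=2, weights=(1.0, 1.0, 1.0, 1.0)):
    """Return the n best-scoring sentences in document order."""
    scores = score_sentences(title, sentences, weights)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(best[:n])]
```

Setting one of the four weights to zero switches the corresponding method off, which gives a simple way to compare combinations of methods on a training corpus.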

BibTeX

@article{321519,
author = {H. P. Edmundson},
title = {New Methods in Automatic Extracting},
journal = {J. ACM},
volume = {16},
number = {2},
year = {1969},
issn = {0004-5411},
pages = {264--285},
doi = {http://doi.acm.org/10.1145/321510.321519},
publisher = {ACM Press},
address = {New York, NY, USA},
}