- Some Principles of Weighting Methods Based on Word Frequencies for Automatic Indexing
- No.26, p.67-88
Characteristics of the occurrence frequency of words in natural language texts have been used as an indicator for the selection of significant words in automatic indexing. This paper describes some general principles common to term weighting methods which use occurrence frequency measures.
For this purpose, nearly sixty weighting formulas were collected from documents published over the past thirty years. Their theoretical characteristics were then analyzed and compared. As a result, the formulas were classified into the following five categories:
1) absolute frequency measures
2) two kinds of relative frequency measures
3) word dispersion measures
4) the 2-Poisson model proposed by Harter
5) information-theoretic measures similar to those proposed by Shannon
Various mathematical relations peculiar to the formulas of each category were found. These relations were well explained by a model consisting of two word sets, one of which is subsumed by the other: the significance of a word depends on the degree to which its distribution is skewed toward the subsumed word set.
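The two-set model can be illustrated with a minimal sketch. It is not one of the formulas surveyed in the paper; it assumes, purely for illustration, that the subsumed set is a single document, the larger set is the whole collection, and skew is measured as the ratio of a word's relative frequency in the document to its relative frequency in the collection. The function name `concentration_ratio` and the sample data are hypothetical.

```python
from collections import Counter


def concentration_ratio(subset_tokens, full_tokens):
    """Ratio of each word's relative frequency in the subset to that in the full set.

    Assumes every token of the subset also occurs in the full set (the subset
    is subsumed by it), so the denominator is never zero.
    """
    sub = Counter(subset_tokens)
    full = Counter(full_tokens)
    n_sub = len(subset_tokens)
    n_full = len(full_tokens)
    return {
        word: (count / n_sub) / (full[word] / n_full)
        for word, count in sub.items()
    }


# Hypothetical data: one document inside a small collection.
document = "index term weight index term".split()
others = "the term of the the of weight the".split()
collection = document + others  # the collection subsumes the document

scores = concentration_ratio(document, collection)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.2f}")
```

In this toy data, "index" occurs only in the document and therefore receives the highest ratio, while "term" and "weight" also occur elsewhere in the collection and score lower. A ratio near 1 would indicate a word distributed evenly across both sets.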