Studies in Informatics and Control
Vol. 9, No. 3, 2000

Classification Knowledge Discovery From Newspaper Articles

Hisao Mase, Yukiko Morimoto, Hiroshi Tsuji, Hiroshi Kinukawa
Abstract

This paper describes experimental trials to test the feasibility of a keyword-based newspaper-article classification system under development. The system seeks to identify keywords which characterize a variety of categories from a series of classified training newspaper articles. Our classification system generates a knowledge base consisting of weighted keywords from a large quantity of training articles, and classifies a newly input article into pre-defined categories based on a keyword-matching algorithm. Thus, the method of selection for the training articles is as important as how the keywords are derived and weighted. First, we compare the results of three different methods for deriving keywords: (1) stopword treatment, (2) weighting method for keywords, and (3) scoping for keywords in a defined area. Then, we discuss the results of two methods for selecting training articles: (1) the relationship between the quantity of training articles and classification correctness, and (2) the influence of time lapse between the publication dates of the training and evaluation articles. Our system generated a knowledge base from 103,000 training articles and used it to classify 37,000 articles into 22 categories and 153 subcategories. The trial results have shown that stopword deletion and the normalization of the keyword weights are effective in improving classification correctness. We also found that articles covering one month were the minimum requisite to generate a knowledge base which classifies articles into 22 categories. Furthermore, we confirmed that the knowledge base should be generated from training articles whose publication date was as close as possible to that of the evaluation articles.
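
The general idea of a weighted-keyword knowledge base with stopword deletion and weight normalization can be illustrated with a minimal Python sketch. This is not the system evaluated in the paper: the stopword list, the whitespace tokenizer, the frequency-based weights, and the function names `tokenize`, `build_knowledge_base`, and `classify` are all assumptions chosen for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical stopword list; the abstract does not give the one actually used.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and delete stopwords (assumed tokenizer)."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def build_knowledge_base(training):
    """Build {category: {keyword: weight}} from (category, text) pairs.

    Weights are keyword counts normalized to sum to 1 within each category,
    one simple stand-in for the weight normalization the abstract mentions.
    """
    counts = defaultdict(Counter)
    for category, text in training:
        counts[category].update(tokenize(text))

    kb = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        # Skip normalization for a category whose articles held only stopwords.
        kb[category] = {w: c / total for w, c in counter.items()} if total else {}
    return kb


def classify(kb, text):
    """Return the category whose keyword weights best match the article."""
    tokens = tokenize(text)
    scores = {
        category: sum(weights.get(t, 0.0) for t in tokens)
        for category, weights in kb.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    training = [
        ("sports", "the team won the match"),
        ("economy", "the bank raised interest rates"),
    ]
    kb = build_knowledge_base(training)
    print(classify(kb, "team match result"))
```

In this toy setup, normalizing within each category keeps a category with many long training articles from dominating the score purely through its volume of text.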

Keywords

Text classification, Classification knowledge discovery, Keyword extraction.
