Choe, Jae-Woong & Song, Ji-Young. 2013. The Topical Classification of Essays by College Student English Learners Using Hierarchical Clustering. Language Information. Volume 17. 93-115.

In this study, we report on a series of experiments leading to the successful automatic topic classification of 3286 English essays (YELC) written by college-level English learners in Korea. We adopted hierarchical agglomerative clustering for this purpose. To find the best combination of distance measure and clustering method, we first selected 100 essays and, on this subset, calculated the precision rate for each of the 15 combinations of 5 distance measures and 3 methods provided by the R implementations of ‘Dist’ and ‘hclust’. The combination of the ‘correlation’ distance and the ‘ward’ method proved optimal for our corpus, and it was then applied to ten sets of 100 randomly selected essays for further validation. As the final step, the ‘correlation’-‘ward’ combination was applied to classify the whole corpus into six topics. The precision rate was estimated at 98.7%, which is sufficient for our purpose. We then conducted a key word analysis on the six topic groups, showing some distributional characteristics of the words used in each group.

Key words: Learner Corpus, English, College students, YELC, Argumentative writing, Topical Classification, Hierarchical Clustering, Automatic unsupervised document classification, Statistical computing environment R, Key word analysis
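
The ‘correlation’-‘ward’ combination described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it is a pure-Python stand-in for the R functions ‘Dist’ and ‘hclust’, and it assumes that each essay is represented as a term-frequency vector, that the distance is 1 minus the Pearson correlation, and that the Ward-style Lance-Williams update is applied directly to those distances. The function names and the six toy vectors are hypothetical and chosen only for illustration.

```python
from math import sqrt


def correlation_distance(x, y):
    """Return 1 - Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def ward_clusters(vectors, k):
    """Merge clusters bottom-up until k remain; return sorted member-index lists."""
    clusters = {i: [i] for i in range(len(vectors))}
    # Pairwise distances keyed by (smaller id, larger id).
    dist = {(i, j): correlation_distance(vectors[i], vectors[j])
            for i in clusters for j in clusters if i < j}
    next_id = len(vectors)
    while len(clusters) > k:
        # Pick the closest pair of clusters.
        (a, b), d_ab = min(dist.items(), key=lambda kv: kv[1])
        na, nb = len(clusters[a]), len(clusters[b])
        new_dist = {}
        for c in clusters:
            if c in (a, b):
                continue
            nc = len(clusters[c])
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            # Ward-style Lance-Williams update (assumption: applied to the
            # correlation distances as given, without squaring them first).
            new_dist[(c, next_id)] = (
                (na + nc) * d_ac + (nb + nc) * d_bc - nc * d_ab
            ) / (na + nb + nc)
        # Drop distances involving the merged clusters, then add the new ones.
        dist = {p: d for p, d in dist.items() if a not in p and b not in p}
        dist.update(new_dist)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical term-frequency vectors: essays 0-2 favor the first two terms,
# essays 3-5 favor the last two.
vectors = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [6, 5, 0, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 6],
    [0, 0, 5, 5],
]
groups = ward_clusters(vectors, 2)
print(groups)
```

On this toy input, cutting the hierarchy at two clusters separates essays 0-2 from essays 3-5. For a real corpus, a library implementation would be preferable to this quadratic-memory sketch; the point here is only to show how a correlation-based distance and a Ward-style merge rule fit together.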