Hong, Jungha. 2014. A Corpus-linguistic Approach to Random Samples. Language Information. Volume 18. 137-162. In quantitative studies, a random sample is supposed to be drawn by probability sampling in such a way that it represents a population. The statistical analysis of corpus frequency data is based on a random sample model, which assumes that the corpus was randomly selected from the language. However, Kilgarriff (2005), Evert (2006), and Goh (2011) show that typical corpus data severely violate the randomness assumption. This paper aims to evaluate random sampling methods for corpus linguistics and to explore their characteristics and applicability. The methods are evaluated over 1,000 resampling trials on the relative frequencies of 30 morphemes and on the frequencies of all morpheme types occurring in each sample, according to how close each random sample is to the normal distribution and to the Zipf-Mandelbrot law (Mandelbrot 1977). The present study yields three findings. First, systematic sampling at the unit of measurement, i.e. individual words, from an entire corpus is the best way to construct random samples for corpus linguistics. Second, the closer the relative frequencies of the 30 morphemes in a sample lie to the normal distribution, the closer the frequency distribution of all morpheme types lies to the Zipf-Mandelbrot distribution. Third, random samples are an effective means of solving problems that stem from differences in sample size and from data sparseness. Moreover, using them facilitates detecting rather large differences in word frequencies obtained from different corpora.

 

Key words: corpus, random sample, probability sampling, simple random sampling, systematic sampling, unit of measurement, unit of sampling, Zipf-Mandelbrot's law, normal distribution, variation
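
As an illustration of the sampling idea summarized above, the following is a minimal Python sketch of systematic sampling at the level of individual tokens, repeated over a number of resampling trials. It is not the procedure used in the paper, whose details the abstract does not give. The function names (systematic_sample, relative_frequency, resample_relative_frequencies), the random starting offset, and the fixed integer step are all assumptions made for illustration.

```python
import random
from collections import Counter


def systematic_sample(tokens, sample_size, rng):
    """Return `sample_size` tokens taken at a fixed step after a random start.

    Assumption: the step is floor(len(tokens) / sample_size) and the start
    is drawn uniformly from [0, step). This is one common textbook variant.
    """
    if not 0 < sample_size <= len(tokens):
        raise ValueError("sample_size must be between 1 and len(tokens)")
    step = len(tokens) // sample_size
    start = rng.randrange(step)
    # The slice yields at least `sample_size` items; trim any surplus.
    return tokens[start::step][:sample_size]


def relative_frequency(sample, target):
    """Relative frequency of `target` among the tokens in `sample`."""
    return Counter(sample)[target] / len(sample)


def resample_relative_frequencies(tokens, target, sample_size, trials, seed=0):
    """Relative frequency of `target` in each of `trials` systematic samples."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    return [
        relative_frequency(systematic_sample(tokens, sample_size, rng), target)
        for _ in range(trials)
    ]
```

Calling resample_relative_frequencies(tokens, "the", 1000, 1000) on a tokenized corpus would give one relative frequency per trial, and the spread of those values could then be compared against a normal distribution. Because the step is fixed, a corpus with periodic structure can bias a systematic sample, so the sketch is best read as a starting point only.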