In this article, we introduce a software package that applies a corpus-based algorithm to derive semantic representations of words. The algorithm relies on analyses of contextual information extracted from a text corpus, specifically, analyses of word co-occurrences in a large-scale electronic database of text. Here, a target word is represented as the combination of the average of all words preceding it and the average of all words following it in the corpus. The resulting semantic representations of the target words can be further processed by a self-organizing map (SOM; Kohonen, 2001), an unsupervised neural network model that provides efficient data extraction and representation. Because of its topography-preserving features, the SOM projects the statistical structure of the contexts onto a 2-D space, such that words with similar meanings cluster together, forming groups that correspond to lexically meaningful categories. Such a representation system has applications in a variety of contexts, including computational modeling of language acquisition and processing. In this report, we present specific examples from two languages (English and Chinese) to demonstrate how the method is applied to extract the semantic representations of words.
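The core representation step described above, averaging the words that precede and follow each target word, can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name `build_context_vectors`, the toy corpus, the vector dimensionality, and the use of fixed random code vectors for individual words are all assumptions made for the example.

```python
import numpy as np

def build_context_vectors(corpus, dim=16, seed=0):
    """Illustrative sketch: represent each word as the concatenation of
    (average code of all preceding words, average code of all following words),
    computed over every occurrence of the word in the corpus."""
    rng = np.random.default_rng(seed)
    vocab = sorted(set(corpus))
    # Assign each word a fixed random code vector (an assumption for this sketch).
    codes = {w: rng.standard_normal(dim) for w in vocab}
    left_sum = {w: np.zeros(dim) for w in vocab}
    right_sum = {w: np.zeros(dim) for w in vocab}
    n_left = {w: 0 for w in vocab}
    n_right = {w: 0 for w in vocab}
    for i, w in enumerate(corpus):
        if i > 0:                      # accumulate the preceding word's code
            left_sum[w] += codes[corpus[i - 1]]
            n_left[w] += 1
        if i < len(corpus) - 1:        # accumulate the following word's code
            right_sum[w] += codes[corpus[i + 1]]
            n_right[w] += 1
    # Concatenate the two averages into one 2*dim contextual vector per word.
    return {
        w: np.concatenate([
            left_sum[w] / max(n_left[w], 1),
            right_sum[w] / max(n_right[w], 1),
        ])
        for w in vocab
    }

corpus = "the cat sat on the mat the dog sat on the rug".split()
vecs = build_context_vectors(corpus)
# "cat" and "dog" appear in identical contexts here, so their vectors match.
```

In a full pipeline, the resulting contextual vectors would be fed to a SOM, whose topography-preserving mapping clusters words with similar contexts, such as "cat" and "dog" in this toy corpus, into nearby regions of the 2-D map.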
- Contextual self-organizing map
- Corpus analysis
- Distributed semantic representation
- Semantic vectors