types2: Exploring word-frequency differences in corpora

Tanja Säily, Jukka Suomela

Research output: Chapter in Book/Report/Conference proceeding › Chapter › Scientific › peer-review

Abstract

We demonstrate the use of the types2 tool to explore, visualize, and assess the significance of variation in word frequencies. Based on accumulation curves and the statistical technique of permutation testing, this freely available tool is especially well suited to the study of types and hapax legomena, which are common measures of morphological productivity and lexical diversity. We have developed a new version of the tool that provides improved linking between the visualizations, metadata, and corpus texts, which facilitates the analysis of rich data.
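
To make the idea of permutation testing on type accumulation curves concrete, here is a minimal Python sketch. It is not the types2 implementation, and it makes no claim about how the tool works internally. The function names (`type_accumulation`, `types_at`, `permutation_p_value`), the representation of a corpus as a list of token lists, and the toy data are all assumptions made for illustration. The sketch asks how often a random ordering of texts yields at most as many types as the subcorpus of interest at the same number of running words.

```python
import random


def type_accumulation(texts, order):
    """Cumulative (tokens, types) points when texts are read in the given order."""
    seen = set()
    tokens = 0
    curve = []
    for i in order:
        tokens += len(texts[i])
        seen.update(texts[i])
        curve.append((tokens, len(seen)))
    return curve


def types_at(curve, target_tokens):
    """Type count at the last curve point that does not exceed target_tokens.

    Simplification (assumption): no interpolation between whole texts, so this
    can undercount the types at exactly target_tokens.
    """
    types = 0
    for tokens, n_types in curve:
        if tokens > target_tokens:
            break
        types = n_types
    return types


def permutation_p_value(texts, subset, rounds=1000, seed=0):
    """Share of random text orderings with at most as many types as the subset,
    compared at the subset's token count."""
    rng = random.Random(seed)
    target_tokens, observed_types = type_accumulation(texts, subset)[-1]
    indices = list(range(len(texts)))
    at_most = 0
    for _ in range(rounds):
        rng.shuffle(indices)
        curve = type_accumulation(texts, indices)
        if types_at(curve, target_tokens) <= observed_types:
            at_most += 1
    return at_most / rounds


# Toy corpus (hypothetical data): texts 0 and 1 repeat a single word.
toy_texts = [
    ["a", "a", "a"], ["a", "a", "a"], ["b", "c", "d"],
    ["e", "f", "g"], ["h", "i", "j"], ["k", "l", "m"],
]
p_low = permutation_p_value(toy_texts, subset=[0, 1])
print(f"share of permutations with as few types: {p_low:.3f}")
```

A small share suggests that the subcorpus has fewer types than would be expected by chance for its size. Whole texts are shuffled rather than individual words, so that words clustering within a single text do not inflate the apparent significance.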

The new version of our tool is demonstrated using two data sets extracted from the Corpora of Early English Correspondence (CEEC) and the British National Corpus (BNC), both of which are rich in sociolinguistic metadata. We show how to use our software to analyse such data sets, and how the new version of our tool can turn the results into interactive web pages with visualizations that are linked to the underlying data and metadata. Our paper illustrates how the linked data facilitates exploring and interpreting the results.
Original language: English
Title of host publication: Big and Rich Data in English Corpus Linguistics, Methods and Explorations
Publication status: Published - 2017
MoE publication type: A3 Part of a book or another research book

Publication series

Name: Studies in Variation, Contacts and Change in English
Publisher: Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki
Volume: 19
ISSN (Electronic): 1797-4453
