Name: The project of Tomsk dialect corpus in keeping with trends of corpus linguistics development

Authors: Svetlana S. Zemicheva, Ekaterina V. Ivantsova

Tomsk State University, Tomsk, Russian Federation

In the section Linguistics

Issue 3, 2018
Pages 192-205
UDK: 811.161.1; 81-25; 81’322
DOI: 10.17223/18137083/64/18

Abstract: The paper proposes the concept of a dialect corpus representing Russian dialect speech of the Middle Ob region. The authors demonstrate that the Tomsk dialect corpus project corresponds to the key trends of modern corpus linguistics: the involvement of oral speech materials, attention to regional variation of the language, the study of dialect as part of traditional culture, and multimodality. The novelty of the resource is determined by two things. The first is its material: it is one of the few corpora that include the speech of residents of the vast Siberian region, and the archive contains the results of 70 years of expedition surveys of about 400 villages. The second is its lexicocentric and textocentric orientation, for which access to full texts is fundamentally important. The paper also considers the problem of the representativeness and balance of a dialect corpus, which has not been studied in the scientific literature. Today, the Tomsk dialect corpus includes approximately 700,000 words, which allows it to be considered a fairly representative collection of dialect texts. At the same time, the special characteristics of the material mean that the corpus is not strictly balanced. The texts are presented in orthographic form with some phonetic features of the dialect retained. The structure of the new electronic resource involves three types of markup: passport, thematic, and text type. Passport meta-markup includes extralinguistic data about the texts: the place of recording, the date, and information about the informant (sex, age, place of birth, level of education, occupation). Thematic meta-markup is derived from an inductive analysis of the discursive practices of old-timers. The list of topics is hierarchical, with each topic at most three levels deep. The principle of "soft" markup is used, which allows several themes to be assigned simultaneously to a single text fragment. At the first level of the hierarchy, 16 macro-themes are marked (Work, Food, Nature, etc.); at the second level, 64 topics. At this stage, the markup by text type covers, first, the degree of spontaneity of speech events and, second, the most frequent speech genres. The prospects for using the resource include the study of Middle Ob dialects in linguocultural, genre, communicative, cognitive, linguopersonological and other aspects; the creation of new dialect dictionaries; and the investigation of traditional culture and folklore, customs and rituals, and the history of the region.
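The three markup layers described in the abstract can be pictured as a small data model. The following is only a minimal sketch under stated assumptions, not the corpus's actual schema: the class names (Passport, Fragment), field names, the helper find_by_macro_theme, and the second-level topic label "Bread" are all hypothetical. Only the macro-themes Work, Food and Nature, the passport fields, the three-level limit, and the "soft" multi-theme principle come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# A topic is a path through the hierarchy, one to three labels long,
# e.g. ("Food",) or ("Food", "Bread"). Labels below level 1 are hypothetical.
Topic = Tuple[str, ...]


@dataclass
class Passport:
    """Passport meta-markup: extralinguistic data about a recorded text."""
    place: str
    date: str
    sex: Optional[str] = None
    age: Optional[int] = None
    birthplace: Optional[str] = None
    education: Optional[str] = None
    occupation: Optional[str] = None


@dataclass
class Fragment:
    """A text fragment carrying all three markup layers."""
    text: str
    passport: Passport
    topics: List[Topic] = field(default_factory=list)  # thematic markup
    spontaneity: Optional[str] = None                   # text-type markup
    genre: Optional[str] = None                         # text-type markup

    def add_topic(self, *path: str) -> None:
        """Attach a topic path; "soft" markup allows several per fragment."""
        if not 1 <= len(path) <= 3:
            raise ValueError("a topic path has one to three levels")
        if path not in self.topics:
            self.topics.append(path)


def find_by_macro_theme(fragments: List[Fragment], macro: str) -> List[Fragment]:
    """Return fragments tagged with any topic under the given macro-theme."""
    return [f for f in fragments if any(t[0] == macro for t in f.topics)]


# Illustrative usage with placeholder values (not real corpus data).
sample = Fragment(
    text="placeholder transcript",
    passport=Passport(place="placeholder village", date="placeholder date"),
)
sample.add_topic("Work")
sample.add_topic("Food", "Bread")  # one fragment, two themes
food_fragments = find_by_macro_theme([sample], "Food")
```

Storing each topic as a path rather than a single label keeps the hierarchy queryable at any level, and keeping topics in a list rather than a single field is what lets one fragment carry several themes at once.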

Keywords: corpus linguistics, Tomsk dialect corpus, Russian dialects of Siberia


