A few decades ago, scientists could only dream of automating linguistic research. The work was carried out manually, often with the help of large numbers of students; careless errors were a real risk; and, most importantly, everything took a great deal of time.
With the development of computer technology, research can now be conducted an order of magnitude faster, and today one of the most promising areas in the study of language is corpus linguistics. Its defining feature is the use of large amounts of textual material, combined into a single database and specially annotated, called a corpus.
Today there are many corpora, created for different purposes, built on different language material, and ranging in size from millions to tens of billions of lexical units. The field is recognized as promising and has shown significant success toward both applied and research goals. Professionals who deal with natural language in any way are encouraged to become familiar with text corpora at least at a basic level.
History of Corpus Linguistics
The emergence of the field is associated with the creation of the Brown Corpus in the USA in the early 1960s. That collection of texts totaled only 1 million word forms, and today a corpus of such a volume would be completely uncompetitive. This is largely due to the pace of development of computer technology, as well as the growing demands placed on new research resources.
In the 1990s corpus linguistics took shape as a full-fledged, independent discipline, and collections of texts were compiled and annotated for several dozen languages. It was during this period, for example, that the British National Corpus, containing 100 million word usages, was created.
As the field develops, the volumes of text keep growing (reaching billions of lexical units), and the markup becomes more and more diverse. Today on the Internet you can find corpora of written and spoken language, multilingual and learner corpora, corpora focused on fiction or academic literature, and many other varieties.
Types of corpora
Corpora can be classified along several dimensions. Intuitively, the classification can be based on the language of the texts (Russian, German), the access mode (open, closed, commercial), or the genre of the source material (fiction, documentary, academic, journalism).
The collection of materials representing spoken language is handled in an interesting way. Since deliberately recording such speech would create artificial conditions for the respondents, and the resulting material could hardly be called "spontaneous", modern corpus linguistics takes a different route. A volunteer is equipped with a microphone, and over the course of a day all the conversations in which he participates are recorded. The people around him, of course, cannot know that their everyday conversation is contributing to the development of science.
The audio recordings obtained are then stored in the data bank and accompanied by a printed transcript. This makes possible the markup needed to create a corpus of everyday spoken language.
Applications
Wherever language is used, text corpora can be used as well. Corpus methods in linguistics serve purposes such as:
- Building sentiment analysis programs, actively used in politics and business to track positive and negative feedback from voters and customers, respectively.
- Linking corpus data to dictionaries and machine translators to improve their performance.
- A variety of research tasks that contribute to understanding the structure of a language, the history of its development, and predictions of how it will change in the near future.
- Developing information extraction systems based on morphological, syntactic, semantic, and other attributes.
- Optimizing various linguistic systems, etc.
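The first item above, sentiment analysis, can be illustrated with a minimal sketch. A simple approach counts matches against positive and negative word lists; the lexicons below are invented for illustration and real systems are far more sophisticated.

```python
# Hypothetical mini-lexicons for illustration only; real sentiment
# systems use large, weighted dictionaries or trained models.
POSITIVE = {"good", "great", "excellent", "support"}
NEGATIVE = {"bad", "terrible", "against", "failure"}

def tonality(text: str) -> str:
    """Classify text as positive/negative/neutral by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(tonality("great support from voters"))  # positive
print(tonality("a terrible failure"))         # negative
```

A corpus helps here precisely by supplying the large volumes of labeled real-world text from which such lexicons (or statistical models) can be built.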
Using corpora
A corpus interface resembles a typical search engine and prompts the user to enter a word or combination of words to search the database. In addition to exact-form queries, an advanced mode lets you find textual information by almost any linguistic criterion.
The basis for the search may be:
- membership in a certain part of speech;
- grammatical features;
- semantics;
- stylistic and emotional coloring.
In addition, you can combine search criteria over a sequence of words: for example, find all occurrences of a present-tense, first-person, singular verb followed by the preposition "в" ("in") and a noun in the accusative case. Solving such a task takes a few seconds and requires only a few clicks in the relevant fields.
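Under the hood, such a query is a pattern match over a sequence of tagged tokens. The sketch below shows the idea on a toy corpus; the tag format (dicts with "pos", "tense", "person", "number", "case" keys) is invented for illustration, and real corpora use much richer tag sets.

```python
# A toy tagged corpus: one token per dict, with invented tag keys.
corpus = [
    {"word": "я",     "pos": "PRON"},
    {"word": "верю",  "pos": "VERB", "tense": "pres", "person": 1, "number": "sg"},
    {"word": "в",     "pos": "PREP"},
    {"word": "науку", "pos": "NOUN", "case": "acc"},
]

def match_pattern(tokens):
    """Find: present-tense 1sg verb + preposition 'в' + accusative noun."""
    hits = []
    for i in range(len(tokens) - 2):
        v, p, n = tokens[i], tokens[i + 1], tokens[i + 2]
        if (v.get("pos") == "VERB" and v.get("tense") == "pres"
                and v.get("person") == 1 and v.get("number") == "sg"
                and p.get("word") == "в"
                and n.get("pos") == "NOUN" and n.get("case") == "acc"):
            hits.append((v["word"], p["word"], n["word"]))
    return hits

print(match_pattern(corpus))  # [('верю', 'в', 'науку')]
```

A real corpus engine runs the same kind of match over millions of tagged tokens with indexes for speed, but the logic is the same.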
Creating a corpus
A search can be run either across all subcorpora or within a specifically selected one, depending on the task at hand. Creating the corpus itself involves several steps:
- The first step is to decide which texts will form the basis of the corpus. For practical purposes, journalistic and newspaper materials and online comments are often used. Research projects use a wide variety of text types, but the texts should be selected on some common basis.
- The resulting set of texts is pre-processed: errors, if any, are corrected, and a bibliographic and extralinguistic description of each text is prepared.
- All non-textual information is removed: graphs, pictures, and tables are deleted.
- The text is split into tokens, usually words, for further processing.
- Finally, morphological, syntactic and other markup of the resulting set of elements is carried out.
The result of all these operations is a syntactic structure with many elements distributed over it, each assigned a part of speech, grammatical features, and, in some cases, semantic features.
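The steps above can be sketched as a tiny pipeline: strip non-textual markup, tokenize, then attach morphological tags. Everything here (the regexes, the toy lexicon) is a simplified illustration, not a production pipeline.

```python
import re

# Toy lexicon standing in for a real morphological dictionary.
TOY_LEXICON = {"cats": ("NOUN", "pl"), "sleep": ("VERB", "pres")}

def build_corpus(raw_html: str):
    """Strip markup, tokenize, and tag each token from the toy lexicon."""
    text = re.sub(r"<[^>]+>", " ", raw_html)         # remove non-textual markup
    tokens = re.findall(r"[a-zA-Z]+", text.lower())  # crude tokenization
    # Unknown words get the placeholder tag ("UNK", None).
    return [(t, *TOY_LEXICON.get(t, ("UNK", None))) for t in tokens]

print(build_corpus("<p>Cats sleep</p>"))
# [('cats', 'NOUN', 'pl'), ('sleep', 'VERB', 'pres')]
```

Real pipelines differ mainly in scale: dictionary-backed tokenizers, disambiguation models, and manual verification replace the toy pieces here.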
Difficulties in creating corpora
It is important to understand that to obtain a corpus it is not enough to simply gather a lot of words or sentences. On the one hand, the collection of texts must be balanced, that is, represent different types of texts in certain proportions. On the other hand, the contents of the corpus must be specially marked up.
The first question is solved by convention: for example, 60% literary texts and 20% documentary texts are included, with certain proportions given over to written renditions of oral speech, legislative acts, scientific works, and so on. There is no ideal recipe for a balanced corpus today.
The second issue, content markup, is more complex. There are special programs and algorithms for automatic text markup, but they do not give 100% accuracy, can fail, and require manual revision. The opportunities and problems in this task are described in detail in V.P. Zakharov's work on corpus linguistics.
Text markup is carried out at several levels, which we list below.
Morphological markup
From school we remember that Russian has different parts of speech, each with its own characteristics. For example, a verb has the categories of mood and tense, which a noun does not. A native speaker declines nouns and conjugates verbs without hesitation, but manual labor is unsuitable for marking up a corpus of 100 million word usages. All the necessary operations can be performed by a computer, but first it must be taught.
Morphological markup is needed so that the computer "understands" each word as a part of speech with certain grammatical features. Since Russian (like any other language) follows a number of regular rules, an automatic morphological analysis procedure can be built by implementing those rules as algorithms. However, there are exceptions to the rules, as well as various complicating factors. As a result, purely automatic analysis is still far from ideal: even a 4% error rate amounts to 4 million words in a corpus of 100 million units, all requiring manual correction.
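The interplay of regular rules and exception lists can be shown with a deliberately tiny analyzer. The example below lemmatizes English plural nouns (far simpler than Russian morphology, but the structure is the same): check the exception list first, then fall back to suffix rules.

```python
# Irregular forms must be listed explicitly; the rules below cannot derive them.
EXCEPTIONS = {"children": "child", "mice": "mouse"}

def lemmatize_noun(word: str) -> str:
    """Toy lemmatizer: exceptions first, then regular suffix rules."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word.endswith("ies"):
        return word[:-3] + "y"              # cities -> city
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                    # cats -> cat
    return word                             # glass stays glass

for w in ["cats", "cities", "children", "glass"]:
    print(w, "->", lemmatize_noun(w))
```

Even this toy version shows why errors creep in: any irregular form missing from the exception list will be mangled by the regular rules, which is exactly the kind of residual error that forces manual revision at corpus scale.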
This problem is described in detail in V.P. Zakharov's book "Corpus Linguistics".
Syntactic markup
Parsing, or syntactic analysis, is a procedure that determines the relationships among the words in a sentence. A set of algorithms makes it possible to identify the subject, the predicate, objects, and various phrase types in the text. By figuring out which words in a sequence are heads and which are dependents, we can extract information from text efficiently and train the machine to return only the information we are interested in for a search query.
Incidentally, modern search engines use this to produce specific numbers instead of lengthy texts in response to queries such as "how many calories are in an apple" or "the distance from Moscow to St. Petersburg". However, to understand even the very basics of this process, you will need an "Introduction to Corpus Linguistics" or another introductory textbook.
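To make the factoid-answering idea concrete, here is a hypothetical illustration of extracting a direct answer from a dependency parse. The tree and the numbers are hand-built for the example, not produced by a real parser.

```python
# Each token: (index, word, head_index, relation); head 0 marks the root.
# Hand-built parse of an illustrative sentence "Moscow is N km from Petersburg".
parse = [
    (1, "Moscow",     3, "nsubj"),
    (2, "is",         3, "cop"),
    (3, "634",        0, "root"),
    (4, "km",         3, "unit"),
    (5, "from",       6, "case"),
    (6, "Petersburg", 3, "nmod"),
]

def extract_answer(tokens):
    """Return the root of the parse plus its unit dependent: the direct answer."""
    root = next(t for t in tokens if t[2] == 0)
    unit = next((t for t in tokens if t[2] == root[0] and t[3] == "unit"), None)
    return f"{root[1]} {unit[1]}" if unit else root[1]

print(extract_answer(parse))  # 634 km
```

This is the essence of what the search engine does: once the head/dependent structure is known, the answer span can be read straight off the tree instead of returning the whole text.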
Semantic markup
The semantics of a word is, in simple terms, its meaning. A widely used approach in semantic analysis is to assign a word tags that reflect its membership in a set of semantic categories and subcategories. Such information is valuable for improving sentiment analysis algorithms, automatic summarization, and other tasks that use corpus linguistics.
Such a tag set can be organized as a tree. Its "roots" are abstract words with very broad semantics; as the tree branches, nodes appear containing increasingly specific lexical elements. For example, the word "creature" may be linked to concepts such as "man" and "animal". The first branches further into professions, kinship terms, and nationalities, the second into classes and species of animals.
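The tree just described can be represented as a simple child-to-parent map, where looking up a word's semantic category means walking up to the root. The entries below mirror the "creature" example and are purely illustrative.

```python
# Each word maps to its parent (hypernym); "creature" is a root.
HYPERNYMS = {
    "man": "creature", "animal": "creature",
    "teacher": "man",  "cat": "animal",
}

def semantic_path(word: str):
    """Walk up the tree from a word to its most abstract 'root' category."""
    path = [word]
    while path[-1] in HYPERNYMS:
        path.append(HYPERNYMS[path[-1]])
    return path

print(semantic_path("teacher"))  # ['teacher', 'man', 'creature']
print(semantic_path("cat"))      # ['cat', 'animal', 'creature']
```

Semantic tagging then amounts to attaching each node on such a path to the word, so that a query like "all animate nouns" can match both "teacher" and "cat" through their shared ancestor.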
Applications of corpora
The applications of corpus linguistics span a wide variety of fields. Corpora are used to compile and correct dictionaries, build machine translation systems, perform automatic summarization, extract facts, determine sentiment, and carry out other text-processing tasks.
In addition, such resources are actively used in the study of world languages and of the mechanisms of language as a whole. Access to large volumes of pre-prepared data enables prompt and comprehensive study of trends in language development, the formation of neologisms and set expressions, changes in the meaning of lexical units, and so on.
Since working with such large volumes of data requires automation, today there is a close interaction between computer and corpus linguistics.
The Russian National Corpus
This corpus (abbreviated RNC, or NKRYa in Russian) includes a number of subcorpora that make the resource useful for a wide variety of tasks.
The materials in the RNC database include:
- media publications of the 1990s and 2000s, both domestic and foreign;
- recordings of spoken language;
- accentologically marked texts (i.e. with stress marks);
- dialectal speech;
- poetic works;
- materials with syntactic markup, etc.
The information system also includes subcorpora with parallel translations of works from Russian into English, German, French, and many other languages (and vice versa).
The database also contains a section of historical texts representing written Russian at various periods of its development. There is also a learner corpus, which may be useful to foreigners mastering the Russian language.
The Russian National Corpus includes 400 million lexical units, putting it ahead of a significant share of the corpora of European languages in many respects.
Prospects
One fact in favor of recognizing the field as promising is the presence of corpus linguistics laboratories at Russian and foreign universities. Applied and research work with the retrieval resources discussed here is tied to the development of particular areas of high technology, such as question-answering systems, as discussed above.
Further development of corpus linguistics is predicted at every level: from the technical, with new algorithms that optimize search and processing, expanded computing capabilities, and larger memory, to the everyday, as users find more and more ways to apply such resources in daily life and work.
In closing
In the middle of the last century, 2017 was a distant future in which spacecraft plow the expanses of the universe and robots do all the work for people. In reality, science is still full of blank spots and is making determined attempts to answer questions that have troubled humanity for centuries. Questions about how language functions hold an honorable place among them, and corpus and computational linguistics can help us answer them.
Processing large amounts of data makes it possible to detect patterns that were previously inaccessible, predict the development of particular language features, and track word formation almost in real time.
At a practical level, corpora can be viewed, for example, as a potential tool for gauging public mood: the Internet is a continuously updated database of texts created by real users, including comments, reviews, articles, and many other forms of speech.
In addition, work with corpora contributes to the development of the same technical tools involved in the information search familiar to us from Google or Yandex, machine translation, and electronic dictionaries.
It is safe to say that corpus linguistics is only taking its first steps and will develop rapidly in the near future.