First results of the “TurkLang-7” project: Creating Russian-Turkic parallel corpora and MT systems


Khusainov A., Suleymanov D., Gilmullin R., Minsafina A., Kubedinova L., Abdurakhmonova N.

2020 Computational Models in Language and Speech Workshop, CMLS 2020, Kazan, Russia, 12-13 November 2020, vol. 2780, pp. 90-101

  • Publication Type: Conference Paper / Full-Text Paper
  • Volume: 2780
  • City of Publication: Kazan
  • Country of Publication: Russia
  • Pages: pp. 90-101
  • Istanbul University Affiliated: Yes

Abstract

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The “TurkLang-7” project aims to create datasets and neural machine translation systems for a set of low-resource Russian-Turkic language pairs. The plan is to achieve this goal through a hybrid approach to building a multilingual parallel corpus between Russian and the Turkic languages, a study of the applicability and effectiveness of neural network learning methods (transfer learning, multi-task learning, back-translation, dual learning) for the selected language pairs, and the development of specialized methods for unifying parallel data across languages, based on the agglutinative nature of the selected Turkic languages (a structural and functional model of the Turkic morpheme). In this paper, we describe the main stages of work on the project and the results of the first year: we developed a semi-automatic process for creating parallel corpora, collected data from several sources for seven Turkic languages, and conducted the first experiments on building machine translation systems.