Tezin Türü: Doktora
Tezin Yürütüldüğü Kurum: Işık Üniversitesi, Graduate School of Science, Computer Engineering, Türkiye
Tezin Onay Tarihi: 2022
Tezin Dili: İngilizce
Öğrenci: ONUR GÖRGÜN
Asıl Danışman (Eş Danışmanlı Tezler İçin): Olcay Taner Yıldız
Eş Danışman: Ayşegül Tüysüz Erman
Özet:
Machine translation (MT) has been one of the hot topics in NLP research over recent years. However, most of the related studies have been done for specific languages, and there are a limited number of comprehensive studies for languages with free word order, such as Turkish. English-Turkish is also one of the least frequently studied language pairs in translation due to the morphological and syntactic gaps between the two languages. This also makes it hard to build parallel corpora, which is crucial for the machine translation task. This thesis aims to be the first statistical syntax tree-based machine translation approach to the English-Turkish language pair, as well as a parallel corpus for translation tasks. We construct an English-Turkish parallel treebank of approximately 17K sentences by following a three-phased approach: manual transformation of English trees from Penn Treebank (PTB) by constraining the translated trees to the reordering of the children and gloss replacement; morphological analysis of the translated gloss; and morphological enrichment of the target tree. For translation consistency, we also developed a set of tools. We also apply the transformation schema to the closed-domain and build 8.3K sentences corpus. We employ both corpora on machine translation task. In our experiments, we obtained a 12.8 BLEU score in the open-domain and a 26.8 BLEU score in the closed-domain. We also evaluate both corpora intrinsically through perplexity analysis. The results show that our studies on making a corpus can be repeated, and studies on machine translation using the small corpus look promising.