Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MoVoC: Morphology-Aware Subword Construction for Geez Script Languages

Created by
  • Haebom

Authors

Hailay Kidu Teklehaymanot, Dren Fazlija, Wolfgang Nejdl

Outline

MoVoC (Morpheme-aware Subword Vocabulary Construction) is a method for building subword vocabularies, and MoVoC-Tok is the tokenizer trained with it. Both target a limitation of standard subword tokenization: it fails to preserve morpheme boundaries in low-resource, morphologically complex languages written in the Geez script. MoVoC-Tok is a hybrid segmentation method that integrates supervised morphological analysis into the subword vocabulary. It combines morpheme-based tokens with Byte Pair Encoding (BPE) tokens to preserve morphological integrity while maintaining lexical meaning. The authors release manually annotated morpheme data for four Geez script languages and morpheme-aware vocabularies for two of them. Although the method does not significantly improve machine translation quality, it consistently improves intrinsic metrics such as MorphoScore and Boundary Precision, which highlights the value of morpheme-aware segmentation. The released dataset and tokenizer can support further research on low-resource, morphologically rich languages.
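The hybrid idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `morph_lexicon` lookup (standing in for a supervised morphological analyzer), the toy vocabulary, and the merge list are all assumptions made for illustration, and Latin-script toy strings are used in place of Geez script data. A word is first split into morphemes; morphemes already in the vocabulary are kept whole, and everything else falls back to BPE.

```python
from typing import Dict, List, Set, Tuple


def bpe_segment(unit: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Apply an ordered list of BPE merges to one unit (a word or a morpheme).

    Simplified inference: each merge rule is applied left to right, in order.
    """
    symbols = list(unit)
    for left, right in merges:
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def hybrid_segment(
    word: str,
    morph_lexicon: Dict[str, List[str]],  # hypothetical: word -> morphemes
    vocab: Set[str],                      # hypothetical morpheme-aware vocabulary
    merges: List[Tuple[str, str]],        # hypothetical learned BPE merges
) -> List[str]:
    """Keep known morphemes whole; apply BPE inside unknown morphemes or words."""
    morphemes = morph_lexicon.get(word)
    if morphemes is None:
        # No morphological analysis available: plain BPE over the whole word.
        return bpe_segment(word, merges)
    tokens: List[str] = []
    for morpheme in morphemes:
        if morpheme in vocab:
            tokens.append(morpheme)
        else:
            # BPE never crosses the morpheme boundary because it runs per morpheme.
            tokens.extend(bpe_segment(morpheme, merges))
    return tokens


# Toy data (assumed, Latin script for readability).
toy_lexicon = {"unhappiness": ["un", "happi", "ness"]}
toy_vocab = {"un", "ness"}
toy_merges = [("h", "a"), ("p", "p"), ("ha", "pp")]

print(hybrid_segment("unhappiness", toy_lexicon, toy_vocab, toy_merges))
```

In this sketch, BPE merges are applied only inside a single morpheme, so no token can straddle a morpheme boundary. That is one simple way to read "maintaining morpheme integrity"; the paper's actual vocabulary construction may work differently.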

Takeaways, Limitations

Takeaways:
MoVoC-Tok, a morphology-aware tokenizer for low-resource, morphologically complex languages, is introduced.
Its hybrid segmentation method preserves morphological integrity and lexical meaning at the same time.
Manually annotated morphological datasets are released for four Geez script languages.
Intrinsic metrics such as MorphoScore and Boundary Precision improve, demonstrating the importance of morphology-aware segmentation.
Open datasets and code support research on low-resource languages.
Limitations:
The method did not significantly improve automatic translation quality.
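Boundary Precision, one of the intrinsic metrics named in the takeaways, can be sketched as follows. This uses the common definition of precision over segment boundaries (the share of predicted boundaries that match gold morpheme boundaries); the exact formulation in the paper may differ. The function names and the toy segmentations are assumptions for illustration, as is the convention of returning 0.0 when no boundary is predicted.

```python
from typing import List, Set


def boundaries(segments: List[str]) -> Set[int]:
    """Return the internal character offsets where one segment ends and the next begins."""
    cuts: Set[int] = set()
    position = 0
    for segment in segments[:-1]:  # the end of the word is not an internal boundary
        position += len(segment)
        cuts.add(position)
    return cuts


def boundary_precision(predicted: List[str], gold: List[str]) -> float:
    """Fraction of predicted boundaries that coincide with gold morpheme boundaries."""
    predicted_cuts = boundaries(predicted)
    if not predicted_cuts:
        return 0.0  # assumed convention for a tokenizer that predicts no boundary
    return len(predicted_cuts & boundaries(gold)) / len(predicted_cuts)


# Toy example (assumed, Latin script for readability).
gold_morphemes = ["un", "happi", "ness"]
tokenizer_output = ["un", "happ", "i", "ness"]
print(boundary_precision(tokenizer_output, gold_morphemes))
```

Here the tokenizer predicts boundaries at offsets 2, 6, and 7, and the gold boundaries sit at 2 and 7, so two of three predicted boundaries are correct.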