Müge Kural and Deniz Yuret. 2024. Unsupervised Learning of Turkish Morphology with Multiple Codebook VQ-VAE. In Proceedings of the First Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2024), pp. 1--17, Bangkok, Thailand and Online, Aug. Association for Computational Linguistics. [ai.ku]
url url abstract google scholar
This paper presents an interpretable unsupervised morphological learning model that achieves performance comparable to supervised models in learning the complex morphological rules of Turkish, as evidenced by its application to morphological inflection within the SIGMORPHON Shared Tasks. The significance of our unsupervised approach lies in its alignment with how humans naturally acquire rules from raw data without supervision. To achieve this, we construct a VQ-VAE model with multiple codebooks that employs both continuous and discrete latent variables during word generation. We evaluate the model's performance under high- and low-resource scenarios, and use probing techniques to examine the information encoded in its latent representations. We also evaluate its generalization capabilities by testing unseen suffixation scenarios within the SIGMORPHON-UniMorph 2022 Shared Task 0. Our results demonstrate the model's ability to decompose words into lemmas and suffixes, with each codebook specializing in different morphological features; this contributes to the model's interpretability and supports effective morphological inflection on both seen and unseen morphological features.
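The multiple-codebook quantization at the heart of the model is easy to sketch: the encoder emits one continuous vector per codebook, and each vector is snapped to its nearest codebook entry (with straight-through gradients) before decoding. Below is a minimal PyTorch sketch of that step only; the codebook counts and sizes are illustrative, and the paper's encoder, decoder, and training losses are not reproduced here.

    import torch
    import torch.nn as nn

    class MultiCodebookVQ(nn.Module):
        # Nearest-neighbor quantization with several independent codebooks
        # (sizes are illustrative, not the paper's configuration).
        def __init__(self, num_codebooks=8, codebook_size=64, dim=32):
            super().__init__()
            self.codebooks = nn.ModuleList(
                [nn.Embedding(codebook_size, dim) for _ in range(num_codebooks)]
            )

        def forward(self, z):
            # z: (batch, num_codebooks, dim) continuous encoder outputs
            quantized, codes = [], []
            for i, cb in enumerate(self.codebooks):
                zi = z[:, i]                               # (batch, dim)
                dist = torch.cdist(zi, cb.weight)          # distance to every entry
                idx = dist.argmin(dim=-1)                  # nearest code per example
                qi = cb(idx)                               # quantized vector
                quantized.append(zi + (qi - zi).detach())  # straight-through estimator
                codes.append(idx)
            return torch.stack(quantized, 1), torch.stack(codes, 1)

    vq = MultiCodebookVQ()
    q, codes = vq(torch.randn(4, 8, 32))  # q feeds the decoder; codes are the discrete latents

In the paper's setting, each codebook can then specialize, for example one capturing the lemma and others capturing individual suffix features, which is what makes the discrete codes interpretable.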
Ali Hürriyetoğlu, Hristo Tanev, Vanni Zavarella, Jakub Piskorski, Reyyan Yeniterzi, Deniz Yuret and Aline Villavicencio. 2021. Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021): Workshop and Shared Task Report. In Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021), pp. 1--9, Online, Aug. Association for Computational Linguistics. [ai.ku]
url abstract google scholar
This workshop is the fourth edition of a series of workshops on the automatic extraction of socio-political events from news, organized by the Emerging Market Welfare Project, with the support of the Joint Research Centre of the European Commission and with contributions from many other prominent scholars in this field. The purpose of the series is to foster research and development of reliable, valid, robust, and practical solutions for automatically detecting descriptions of socio-political events, such as protests, riots, wars, and armed conflicts, in text streams. This year's contributors make use of state-of-the-art NLP technologies, such as deep learning, word embeddings, and Transformers, and cover a wide range of topics from text classification to news bias detection. Around 40 teams registered and 15 teams contributed to three tasks: i) multilingual protest news detection, ii) fine-grained classification of socio-political events, and iii) discovering Black Lives Matter protest events. The workshop also features two keynotes and four invited talks on various aspects of creating event datasets and on multi- and cross-lingual machine learning in few- and zero-shot settings.
Cemil Cengiz, Ulaş Sert and Deniz Yuret. 2019. KU_ai at MEDIQA 2019: Domain-specific Pre-training and Transfer Learning for Medical NLI. In Proceedings of the 18th BioNLP Workshop and Shared Task, pp. 427--436, Florence, Italy, Aug. Association for Computational Linguistics. [ai.ku]
url abstract google scholar
In this paper, we describe our system and results submitted for the Natural Language Inference (NLI) track of the MEDIQA 2019 Shared Task. As the KU_ai team, we used BERT as our baseline model and pre-processed the MedNLI dataset to mitigate the negative impact of de-identification artifacts. Moreover, we investigated different pre-training and transfer learning approaches to improve performance. We show that pre-training the language model on rich biomedical corpora significantly helps the model learn domain-specific language. In addition, training the model on large NLI datasets such as MultiNLI and SNLI helps it learn task-specific reasoning. Finally, we ensembled our highest-performing models and achieved 84.7% accuracy on the unseen test dataset, ranking 10th out of 17 teams in the official results.
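The two-stage transfer recipe (fine-tune on large general NLI data first, then on in-domain MedNLI) can be sketched with Hugging Face Transformers as below; the checkpoint name, hyperparameters, and single-batch training loop are illustrative stand-ins, not the team's actual configuration.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "bert-base-uncased"  # stand-in for a biomedically pre-trained BERT
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

    def fine_tune(model, premises, hypotheses, labels, epochs=1, lr=2e-5):
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        enc = tok(premises, hypotheses, padding=True, truncation=True,
                  return_tensors="pt")
        y = torch.tensor(labels)
        model.train()
        for _ in range(epochs):
            loss = model(**enc, labels=y).loss  # cross-entropy over the 3 NLI labels
            loss.backward()
            opt.step()
            opt.zero_grad()
        return model

    # Stage 1: large general NLI data (MultiNLI/SNLI) teaches task-specific reasoning.
    # model = fine_tune(model, mnli_premises, mnli_hypotheses, mnli_labels)
    # Stage 2: the in-domain MedNLI training set adapts the model to clinical text.
    # model = fine_tune(model, mednli_premises, mednli_hypotheses, mednli_labels)

An ensemble as in the paper would then combine the class probabilities of several such models trained from different checkpoints or seeds.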