Open data and tools for Greek NLP and beyond
The first multi-label NLI dataset to account for semantic ambiguity. Unlike traditional NLI datasets, which assume a single correct label per pair, OYXOY marks all possible inference labels, reflecting the fact that semantic ambiguity can make several inferences valid at once.
Modern Greek NLI benchmark with multi-label annotations. 1,763 pairs covering entailment, contradiction, and neutrality with full word sense disambiguation.
Format: JSON
License: CC BY 4.0
Citation: Kogkalidis et al. (EACL 2024)
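To make the multi-label idea concrete, here is a minimal Python sketch of how a pair with several valid labels might be represented and scored. It is an illustration only: the class and field names (`NLIItem`, `premise`, `hypothesis`, `labels`) are hypothetical and do not reflect the actual OYXOY JSON schema, the example pair is an invented English toy sentence rather than dataset content, and the Jaccard score is just one plausible way to compare label sets, not necessarily the metric used in the paper.

```python
from dataclasses import dataclass

# The three inference labels named in the dataset description.
LABELS = frozenset({"entailment", "contradiction", "neutral"})


@dataclass(frozen=True)
class NLIItem:
    """Hypothetical container for one premise/hypothesis pair."""
    premise: str
    hypothesis: str
    labels: frozenset  # every label judged valid for the pair


def exact_match(gold: frozenset, pred: frozenset) -> bool:
    """True only if the predicted label set equals the gold set."""
    return gold == pred


def jaccard(gold: frozenset, pred: frozenset) -> float:
    """Overlap between gold and predicted label sets (1.0 if both are empty)."""
    union = gold | pred
    return len(gold & pred) / len(union) if union else 1.0


# Invented toy pair: "bat" may be an animal (entailment) or a club (neutral).
item = NLIItem(
    premise="She saw the bat.",
    hypothesis="She saw an animal.",
    labels=frozenset({"entailment", "neutral"}),
)

# A single-label prediction captures only one of the two valid readings.
prediction = frozenset({"entailment"})
print(exact_match(item.labels, prediction))  # False
print(jaccard(item.labels, prediction))      # 0.5
```

A single-label system can score at most partial credit on such a pair, which is the gap a multi-label benchmark is designed to expose.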
Comprehensive corpus of Greek dialectal varieties: Cypriot, Pontic, Cretan, and Northern Greek. First large-scale resource for Greek dialectal NLP.
Varieties: 4 major dialects
License: CC BY 4.0
Citation: Chatzikyriakidis et al. (2023)
Greek translation and extension of the FraCaS test suite for natural language inference. 774 inference examples covering quantifiers, plurals, adjectives, and more.
Examples: 774
Phenomena: 9 categories
Citation: Amanaki et al. (LREC 2022)
Modified version of Greek XNLI with dropped subjects restored, addressing a key morphosyntactic property of Greek that affects NLI performance.
Based on: XNLI
Modification: Pro-drop restoration
Citation: Amanaki et al. (LREC 2022)
Platform for computational analysis of ancient Greek texts with knowledge graph extraction and neuro-symbolic reasoning capabilities.
Focus: Classical texts
Methods: KG extraction, NLI
Status: Active development
Dataset of inferences from natural language dialogues including disfluencies, hesitations, and interactive phenomena often absent from written text.
Features: Disfluencies, repairs
Citation: Ek et al. (SemDial 2024)
Collaboration: CLASP
All CLLT datasets are released under permissive open licenses (typically CC BY 4.0) to encourage research and development. Please check individual dataset repositories for specific license details.
If you use our datasets in your research, please cite the corresponding papers. BibTeX entries are available in each dataset's GitHub repository and on our publications page.
We welcome contributions, error corrections, and extensions to our datasets. Please submit issues or pull requests on the respective GitHub repositories.
For questions, collaboration inquiries, or access to unreleased resources, please contact the lab.
We are actively developing additional datasets and resources, including GRDD+ (an extended dialectal corpus), Greek poetry corpora for RAG systems, and enhanced knowledge graph extraction tools for classical texts. Stay tuned for announcements on our news page.