A linked index of datasets, tools, reference works, and models for Tigre — a Semitic language of Eritrea and eastern Sudan, written in the Ge'ez script.
English–Tigre set of about 8,000 translated sentences (800 easy, 3,134 intermediate, 4,000 long), for MT development (service to start soon).
Crowdsourced example sentences with translations. Tigre ranks 30th of 429 languages by sentence count.
~330k rows pairing Tigre with several languages. Foundation for translation models and the dictionaries below.
Community speech-donation platform for collecting spoken Tigre. About 11 hours donated so far.
The first English–Tigrinya–Tigre dictionary, with around 6,200 vocabulary entries.
Searchable phrasebooks: English (58k), Arabic (31k), German (27k), Swedish (15k). From the open parallel corpus.
A Tigre–Arabic dictionary compiled in Kassala, Sudan. A community reference resource.
The Tigre-language edition of Wikipedia in Ge'ez script. A growing source of encyclopedic text.
Rule-based morphological analyzer for Tigre in Java. From a Moscow State University thesis (2019, refactored 2020).
Python rule-based morphological analyzer/generator. Supports Tigre alongside Amharic, Oromo, and Tigrinya.
Paper by Beshir Ibrahim. Reports BLEU 33.09 / chrF++ 40.91 for MT and low ASR word error rates.
Meta AI platform for crowdsourced paragraph translation into Tigre (Ge'ez). Collects human reference translations.
Initiative supporting language tech for under-resourced languages. The project participates as an official partner.
Home of the BeitTigreAI effort on NLP for Tigre and Ge'ez, covering data and ML across MT, ASR, TTS, and LLMs.
The organization's hub of models and datasets — central access point for the artifacts listed below.
Gemma 4 E2B (5B) SFT model for text generation. Tuned for translation tasks into Tigre.
An 8B Tigre LLM built on Gemma 4. General generation and instruction following in Tigre.
SONAR-based multilingual sentence encoder for Tigre. Produces embeddings for retrieval and cross-lingual tasks.
A 3B OmniASR CTC speech-recognition model. Transcribes spoken Tigre audio into text.
A 0.6B Wav2Vec2-BERT ASR model for accurate Tigre audio transcription.
A compact 1B Tigre LLM based on Llama 3.2. Lightweight text generation option.
XLM-RoBERTa base adapted for Tigre. For classification, tagging, and language understanding.
A 3.3B NLLB-200 MT model covering Tigre. High-quality translation to and from other languages.
Distilled 600M NLLB-200 MT model. Smaller, faster variant for constrained deployment.
Training data for Tigre text-to-speech. Large paired audio/text resource (~6.77k rows, viewer enabled).
Assets and data supporting a Tigre TTS model. Companion to the TTS training set.
Data for FastText word embeddings in Tigre. For lightweight vectorization and language ID.
A collection of Tigre speech audio recordings. Raw audio for ASR and TTS development.
Speech recordings aligned with text transcripts. Suitable for training and evaluating ASR models.
A parallel multilingual corpus including Tigre (~330k rows). For training and evaluating MT models.
Tigre text extracted from Wikipedia. A monolingual corpus for pretraining and language modeling.
A Tigre lexicon dataset (~420k entries, viewer enabled). Vocabulary resource for NLP and linguistic work.
Data for building KenLM n-gram language models. Often used to rescore ASR and translation output.
A monolingual Tigre text corpus (preview). General-purpose data for language modeling and pretraining.