Resource Index · ISO 639-3 · tig

The Tigre Language Digitization Effort · ትግረ

A linked index of datasets, tools, reference works, and models for Tigre — a Semitic language of Eritrea and eastern Sudan, written in the Ge'ez script.

Datasets & Corpora

English–Tigre set of roughly 8,000 translated sentences (800 easy, 3,134 intermediate, 4,000 long) for MT development. The service is due to start soon.

Crowdsourced example sentences with translations. Tigre ranks 30th of 429 languages by sentence count.

~330k rows pairing Tigre with several languages. Foundation for translation models and the dictionaries below.

Speech

Community speech-donation platform for collecting spoken Tigre. About 11 hours donated so far.

Reference Works

Tigre Dictionary · M. Mussie Bekhit · Now Being Published

First English–Tigrinya–Tigre dictionary with around 6,200 vocabulary entries.

Searchable phrasebooks: English (58k), Arabic (31k), German (27k), Swedish (15k). From the open parallel corpus.

Tigre–Arabic Dictionary · Mohammed Ahmed · In Progress

A Tigre–Arabic dictionary compiled in Kassala, Sudan. A community reference resource.

Encyclopedic & Web

The Tigre-language edition of Wikipedia in Ge'ez script. A growing source of encyclopedic text.

Tools & Software

Rule-based morphological analyzer for Tigre, written in Java. From a Moscow State University thesis (2019, refactored 2020).

Python rule-based morphological analyzer/generator. Supports Tigre alongside Amharic, Oromo, and Tigrinya.

Research & Partnerships

TigMM Corpus & Model Suite

Paper by Beshir Ibrahim. Reports BLEU 33.09 / chrF++ 40.91 for MT and low ASR word error rates.
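
For readers unfamiliar with the ASR metric cited above: word error rate (WER) is the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. The following is a minimal Python sketch of that standard definition, not the paper's evaluation code. The function name `word_error_rate` and the placeholder tokens are illustrative assumptions, and published results may apply text normalization that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real Tigre text: one substitution plus one deletion
# against a four-word reference.
print(word_error_rate("w1 w2 w3 w4", "w1 wX w3"))  # 0.5
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.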

Meta AI platform for crowdsourced paragraph translation into Tigre (Ge'ez script). Collects human reference translations.

Initiative supporting language tech for under-resourced languages. The project participates as an official partner.

BeitTigreAI Project

Home of the BeitTigreAI effort on NLP for Tigre and Ge'ez, covering data and ML across MT, ASR, TTS, and LLMs.

The organization's hub of models and datasets — central access point for the artifacts listed below.

BeitTigreAI Models · Hugging Face

A supervised fine-tune (SFT) of Gemma-4 E2B (5B) for text generation, tuned for translation into Tigre.

An 8B Tigre LLM built on Gemma 4. General generation and instruction following in Tigre.

SONAR-based multilingual sentence encoder for Tigre. Produces embeddings for retrieval and cross-lingual tasks.

A 3B OmniASR CTC speech-recognition model. Transcribes spoken Tigre audio into text.

A 0.6B Wav2Vec2-BERT ASR model for accurate Tigre audio transcription.

A compact 1B Tigre LLM based on Llama 3.2. Lightweight text generation option.

XLM-RoBERTa base adapted for Tigre. For classification, tagging, and language understanding.

A 3.3B NLLB-200 MT model covering Tigre. High-quality translation to and from other languages.

Distilled 600M NLLB-200 MT model. Smaller, faster variant for constrained deployment.

BeitTigreAI Datasets · Hugging Face

Training data for Tigre text-to-speech. Large paired audio/text resource (~6.77k rows, viewer enabled).

Assets and data supporting a Tigre TTS model. Companion to the TTS training set.

Data for FastText word embeddings in Tigre. For lightweight vectorization and language ID.

A collection of Tigre speech audio recordings. Raw audio for ASR and TTS development.

Speech recordings aligned with text transcripts. Suitable for training and evaluating ASR models.

A parallel multilingual corpus including Tigre (~330k rows). For training and evaluating MT models.

Tigre text extracted from Wikipedia. A monolingual corpus for pretraining and language modeling.

A Tigre lexicon dataset (~420k entries, viewer enabled). Vocabulary resource for NLP and linguistic work.

Data for building KenLM n-gram language models. Often used to rescore ASR and translation output.

A monolingual Tigre text corpus (preview). General-purpose data for language modeling and pretraining.
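
The n-gram rescoring mentioned for the KenLM data above can be illustrated with a toy model: an n-gram language model assigns each candidate output a log-probability, and the best-scoring candidate is preferred. This sketch uses a bigram model with add-one smoothing in plain Python as a simplified stand-in. KenLM itself estimates models with modified Kneser-Ney smoothing, and in practice the language model score is usually combined with the ASR or MT model's own score instead of being used alone. All function names and the placeholder tokens (`w1`, `w2`, ...) are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized text."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed bigram log-probability of a sentence."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def rescore(candidates, model):
    """Return the candidate the language model considers most likely."""
    return max(candidates, key=lambda c: log_prob(c, model))


# Placeholder tokens stand in for real corpus text.
model = train_bigram(["w1 w2 w3", "w1 w2 w4"])
best = rescore(["w3 w2 w1", "w1 w2 w3"], model)
print(best)  # w1 w2 w3
```

The two candidates here have the same length, which keeps the comparison fair. Real rescoring setups typically add a length normalization or a word-insertion bonus, since raw log-probabilities favor shorter outputs.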