Resource Index · ISO 639-3 · tig

The Tigre Language Digitization Effort · ትግረ

A linked index of datasets, tools, reference works, and models for Tigre — a Semitic language of Eritrea and eastern Sudan, written in the Ge'ez script.
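Because every text resource below is in the Ge'ez (Ethiopic) script, a quick script check can help when filtering mixed-language data. The following is a minimal sketch using only Python's standard library; the function name `ethiopic_ratio` is hypothetical and not part of any resource listed here. It identifies the script only, not the language, since Tigrinya and Amharic use the same script.

```python
import unicodedata


def ethiopic_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Ethiopic.

    A character counts as Ethiopic when its Unicode name starts with
    "ETHIOPIC" (for example "ETHIOPIC SYLLABLE TE").
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    ethiopic = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("ETHIOPIC")
    )
    return ethiopic / len(letters)


print(ethiopic_ratio("ትግረ"))    # all three letters are Ethiopic syllables
print(ethiopic_ratio("Tigre"))  # Latin letters only
```

A threshold on this ratio (for instance, keeping lines above 0.9) is one possible cleaning rule; the right cutoff depends on the data.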

Datasets & Corpora

1. Google Translate Dataset

English–Tigre set of about 8,000 translated sentences (800 easy, 3,134 intermediate, 4,000 long) for machine translation (MT) development. The service is expected to start soon.

2. tatoeba.org — Tigre

Crowdsourced example sentences with translations. Tigre ranks 30th of 429 languages by sentence count.

3. Parallel multilingual corpus

~330k rows pairing Tigre with several languages. Foundation for translation models and the dictionaries below.

Speech

4. Mozilla Common Voice — Tigre

Community speech-donation platform for collecting spoken Tigre. About 11 hours donated so far.

Reference Works

5. Tigre Dictionary — M. Mussie Bekhit (Now Being Published)

First English–Tigrinya–Tigre dictionary with around 6,200 vocabulary entries.

6. Tigre Multilingual Dictionaries

Searchable phrasebooks: English (58k), Arabic (31k), German (27k), Swedish (15k). Built from the open parallel corpus.

7. Tigre–Arabic Dictionary — Mohammed Ahmed (In Progress)

A Tigre–Arabic dictionary compiled in Kassala, Sudan. A community reference resource.

Encyclopedic & Web

8. Tigre Wikipedia (tig.wikipedia.org)

The Tigre-language edition of Wikipedia in Ge'ez script. A growing source of encyclopedic text.

Tools & Software

9. TigreParser

Rule-based morphological analyzer for Tigre, written in Java. From a Moscow State University thesis (2019, refactored 2020).

10. HornMorpho

Python rule-based morphological analyzer/generator. Supports Tigre alongside Amharic, Oromo, and Tigrinya.

Research & Partnerships

11. TigMM Corpus & Model Suite

Paper by Beshir Ibrahim. Reports BLEU 33.09 / chrF++ 40.91 for MT and low ASR word error rates.
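For readers unfamiliar with the ASR metric mentioned above, word error rate (WER) is the word-level edit distance between a reference transcript and a system hypothesis, divided by the number of reference words. The sketch below is a generic standard-library illustration, not the evaluation code used in the paper; the function name `word_error_rate` and the placeholder tokens are assumptions for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substitution (b -> x) and one deletion (d)
# against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Whitespace tokenization is a simplification here; published WER figures usually depend on the text normalization applied before scoring.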

12. Bouquet (Meta) — Tigre

Meta AI platform for crowdsourced paragraph translation into Tigre (Ge'ez). Collects human reference translations.

13. UNESCO × Meta Partner Program

Initiative supporting language tech for under-resourced languages. The project participates as an official partner.

BeitTigreAI Project

14. BeitTigreAI — project site

Home of the BeitTigreAI effort on NLP for Tigre and Ge'ez, covering data and ML across MT, ASR, TTS, and LLMs.

15. BeitTigreAI on Hugging Face

The organization's hub of models and datasets — central access point for the artifacts listed below.

BeitTigreAI Models · Hugging Face

16. gemma-4-E2B-sft-tran-tigre

Gemma-4 E2B (5B) SFT for text generation. Tuned for translation tasks into Tigre.

17. tigre-llm-gemma4

An 8B Tigre LLM built on Gemma 4. General generation and instruction following in Tigre.

18. tigre-sonar-encoder

SONAR-based multilingual sentence encoder for Tigre. Produces embeddings for retrieval and cross-lingual tasks.
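To illustrate how sentence embeddings support retrieval, the sketch below ranks stored vectors by cosine similarity to a query vector. It does not load the encoder itself: the three-dimensional vectors, the sentence IDs, and the helper names `cosine` and `top_k` are made up for illustration, and real embeddings from the model would have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, index, k=1):
    """Return the k keys of `index` whose vectors are most similar to `query`."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy embeddings standing in for encoder output (hypothetical values).
index = {
    "sent_a": [1.0, 0.0, 0.0],
    "sent_b": [0.0, 1.0, 0.0],
    "sent_c": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]

print(top_k(query, index, k=2))
```

For cross-lingual search, the query and the indexed sentences can be in different languages, provided both are embedded by the same multilingual encoder.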

19. tigre-asr-omniASR_CTC_3B

A 3B OmniASR CTC speech-recognition model. Transcribes spoken Tigre audio into text.

20. tigre-asr-Wav2Vec2Bert

A 0.6B Wav2Vec2-BERT ASR model for Tigre audio transcription.

21. tigre-llm-Llama3.2-1B

A compact 1B Tigre LLM based on Llama 3.2. Lightweight text generation option.

22. tigre-xlm-roberta-base

XLM-RoBERTa base adapted for Tigre. For classification, tagging, and language understanding.

23. tigre-nllb-200-3.3B

A 3.3B NLLB-200 MT model covering Tigre. Translates between Tigre and other languages.

24. tigre-nllb-200-distilled-600M

Distilled 600M NLLB-200 MT model. Smaller, faster variant for constrained deployment.

BeitTigreAI Datasets · Hugging Face

25. tigre-tts-training

Training data for Tigre text-to-speech. Paired audio/text resource (~6.77k rows, viewer enabled).

26. tigre-tts-model

Assets and data supporting a Tigre TTS model. Companion to the TTS training set.

27. tigre-data-fasttext

Data for FastText word embeddings in Tigre. For lightweight vectorization and language ID.

28. tigre-data-speech-audio

A collection of Tigre speech audio recordings. Raw audio for ASR and TTS development.

29. tigre-speech-text-aligned

Speech recordings aligned with text transcripts. Suitable for training and evaluating ASR models.

30. tigre-data-parallel-multilingual

A parallel multilingual corpus including Tigre (~330k rows). For training and evaluating MT models.

31. tigre-data-wikipedia

Tigre text extracted from Wikipedia. A monolingual corpus for pretraining and language modeling.

32. tigre-data-lexicon

A Tigre lexicon dataset (~420k entries, viewer enabled). Vocabulary resource for NLP and linguistic work.

33. tigre-data-kenLM

Data for building KenLM n-gram language models. Often used to rescore ASR and translation output.
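The rescoring idea can be shown with a toy model: train an n-gram language model on text, score each candidate output, and keep the most probable one. The sketch below is a plain bigram model with add-one smoothing written with the standard library. It is not KenLM (which uses modified Kneser-Ney smoothing and its own tooling), and the helper names and placeholder tokens are assumptions for illustration.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left-context unigrams over whitespace tokens."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])              # every token that starts a bigram
        bigrams.update(zip(tokens, tokens[1:]))
    # One extra vocabulary slot is reserved for unseen words.
    return contexts, bigrams, len(vocab) + 1


def log_prob(sentence, model):
    """Add-one smoothed log-probability of a sentence under the bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def rescore(hypotheses, model):
    """Pick the hypothesis the language model finds most probable."""
    return max(hypotheses, key=lambda h: log_prob(h, model))


# Placeholder tokens stand in for real Tigre training text.
model = train_bigram(["a b c", "a b d", "a b c"])
print(rescore(["c b a", "a b c"], model))
```

In practice the language-model score is usually combined with the ASR or MT system's own score using a tunable weight, instead of replacing it outright.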

34. tigre-data-monolingual-text

A monolingual Tigre text corpus (preview). General-purpose data for language modeling and pretraining.