Artificial Intelligence and Language Preservation: Case Studies from Iceland, India, and Taiwan

Published on 18 May 2026 at 09:18

Public Policy Research Group, London, UK

Abstract

The accelerating decline of linguistic diversity presents one of the most pressing cultural challenges of the digital age. UNESCO estimates that approximately 40 percent of the world's 7,000 languages are endangered, with one language disappearing every two weeks on average. This paper examines three international case studies where governments and cultural institutions have deployed artificial intelligence specifically for language preservation: Iceland's partnership with OpenAI to integrate the Icelandic language into GPT-4, India's Adi Vaani platform serving as the country's first AI powered translator for tribal languages, and Taiwan's hybrid dictionary framework combining retrieval augmented generation with large language models for the endangered Paiwan language. Drawing on verified primary sources including government press releases, peer reviewed academic publications, and official institutional documentation, each case is analysed according to a common framework examining institutional actors, the nature of the AI intervention, the linguistic resource base, and measurable preservation outcomes. From these cases, four cross cutting success factors emerge: curated lexical corpora as the foundation for effective AI training, explicit morphological and grammatical standards guiding model evaluation, structured institutional partnerships between public bodies and technology providers, and public facing preservation tools embedded in platforms users already inhabit. The paper concludes that AI driven language preservation is most effective when it combines community participation, academic expertise, government support, and integration into daily digital infrastructure, offering transferable lessons for other language communities facing similar pressures.

Keywords: language preservation, artificial intelligence, Icelandic language, Adi Vaani, Paiwan language, endangered languages, digital language policy.

1. Introduction

Linguistic diversity is declining at an unprecedented pace. The United Nations Educational, Scientific and Cultural Organization (UNESCO) estimated in 2024 that approximately 40 percent of the world's 7,000 remaining languages are endangered, with a language disappearing every two weeks on average (Wang et al., 2026). This decline is driven not by speaker attrition alone but also by the progressive digital marginalisation of smaller languages. As daily communication, commerce, education, and government services migrate to digital platforms dominated by a small number of major languages, languages that lack a robust digital presence face a compounding disadvantage: they become less visible, less useful, and ultimately less transmitted across generations.

This challenge has prompted a range of responses from governments and cultural institutions worldwide. Among the most promising are initiatives that deploy artificial intelligence not as a replacement for human language use but as a preservation instrument: a set of tools capable of documenting lexical resources, modelling morphological structures, generating accurate translations, and embedding endangered languages within the digital platforms that speakers use daily.

This paper examines three such initiatives through a structured comparative case study methodology. Section 2 analyses Iceland's partnership with OpenAI to preserve Icelandic, a morphologically complex language spoken by approximately 370,000 people. Section 3 examines India's Adi Vaani platform, a government led AI translation initiative intended to serve the country's 461 tribal languages. Section 4 analyses Taiwan's hybrid dictionary and retrieval augmented generation framework developed for the endangered Paiwan language. Section 5 extracts cross cutting success factors, and Section 6 offers concluding observations on the transferability of these models.

The cases were selected according to three criteria: each involves a deliberate partnership between a government or cultural institution and an AI research body or technology firm; each pursues an explicit objective of lexical or grammatical preservation rather than generic language processing; and each deploys generative or neural AI as a core tool. All sources have been verified against primary documentation including government press releases, peer reviewed academic publications, and official institutional statements.

2. Iceland: Government Partnership with OpenAI for Morphological Preservation

2.1 Institutional Context

Icelandic, a North Germanic language with a rich literary tradition dating to the medieval sagas, possesses a complex inflectional morphology and is under acute pressure from English language digital dominance. With a population of approximately 370,000, Iceland faces a distinctive challenge: the cost of developing and maintaining digital language infrastructure for such a small speaker base is substantial, while the consequences of failing to do so are potentially terminal for the language's digital viability (OpenAI, 2024).

The Icelandic government has long recognised this challenge. The country maintains a Language Planning Department that deliberately coins Icelandic terms for new technologies rather than adopting loanwords from English. For example, the Icelandic word for computer is "tölva," meaning "number prophetess," a linguistic invention that reflects a deliberate policy of maintaining lexical sovereignty (Marr, 2024). However, language planning alone cannot ensure digital presence. The critical problem, as articulated by Jóhanna Vigdís Guðmundsdóttir, chief executive officer of the non-profit language technology centre Almannarómur, was not a lack of software built locally for Icelandic but rather the absence of Icelandic from the global software and applications Icelanders use every day (OpenAI, 2024).

2.2 The AI Intervention

In 2023, the Government of Iceland entered into a partnership with OpenAI, supported by the Icelandic language technology firm Miðeind, to improve GPT-4's ability to process and generate grammatically correct Icelandic. The partnership involved the government facilitating the provision of curated Icelandic text corpora, with Miðeind working directly with OpenAI on training data preparation and evaluation (OpenAI, 2024).

The technical approach employed Reinforcement Learning from Human Feedback, a methodology in which human evaluators assess model outputs and the model learns to adjust its behaviour accordingly. Vilhjálmur Þorsteinsson, chief executive at Miðeind, assembled a team of forty Icelandic volunteers to train GPT-4 on proper Icelandic grammar and cultural knowledge. This team worked to correct grammatical errors, refine contextual appropriateness, and embed cultural knowledge within the model's outputs (MisionesOnline, 2024). Critically, Þorsteinsson noted that RLHF produced meaningful improvements with just one hundred examples, making the methodology feasible for other languages with limited digital resources.
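The feedback loop described above can be illustrated with a minimal sketch. It is not the partnership's actual training code, which the sources do not describe; it assumes the common RLHF setup in which each human judgement is stored as a preferred and a rejected output, and a reward model is fitted with a pairwise loss. The class name, the function name, and the numeric scores are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One human judgement: the evaluator preferred `chosen` over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise reward-model loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))


# Hypothetical example: the verb "sjá" (to see) governs the accusative,
# so "hestinn" is correct and the nominative "hesturinn" is not.
pair = PreferencePair(
    prompt="Translate into Icelandic: I saw the horse.",
    chosen="Ég sá hestinn.",
    rejected="Ég sá hesturinn.",
)

# Made-up reward scores; a real reward model would compute these from the text.
loss_when_ranked_correctly = pairwise_preference_loss(2.0, -1.0)
loss_when_ranked_wrongly = pairwise_preference_loss(-1.0, 2.0)
```

The loss is small when the reward model already ranks the grammatical sentence higher and large when it does not, which is the signal that pushes the model towards the evaluators' judgements.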

The explicit goal of the partnership was morphological fidelity: ensuring that GPT-4 could correctly generate Icelandic's complex inflectional paradigms, including its four noun cases, three grammatical genders, and intricate verbal conjugation system. The evaluation criterion was not whether the output sounded plausible but whether it conformed to the documented rules of Icelandic inflectional morphology (OpenAI, 2024).
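The difference between plausibility and conformity to documented paradigms can be shown with a small sketch. It assumes a hypothetical lookup table holding the singular paradigm of one noun, "hestur" (horse), and scores model outputs by exact match against it. The function names and the sample outputs are illustrative assumptions, not the evaluation procedure used by Miðeind or OpenAI.

```python
# Documented singular paradigm for the noun "hestur" (horse).
PARADIGM = {
    "hestur": {
        "nominative": "hestur",
        "accusative": "hest",
        "dative": "hesti",
        "genitive": "hests",
    }
}


def check_inflection(lemma: str, case: str, produced_form: str) -> bool:
    """Return True only if the produced form matches the documented paradigm."""
    return produced_form == PARADIGM[lemma][case]


def morphological_accuracy(items) -> float:
    """Share of (lemma, case, produced_form) items that match the paradigm."""
    if not items:
        return 0.0
    correct = sum(check_inflection(*item) for item in items)
    return correct / len(items)


# Hypothetical model outputs: the third uses the nominative where the
# genitive is required, which a fluency-only check could easily miss.
model_outputs = [
    ("hestur", "accusative", "hest"),
    ("hestur", "dative", "hesti"),
    ("hestur", "genitive", "hestur"),
]
score = morphological_accuracy(model_outputs)
```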

2.3 Linguistic Resource Base

The partnership leveraged Iceland's existing linguistic infrastructure. The Icelandic government served as gatekeeper of high quality text corpora, including literary works, official documents, and educational materials. Rather than attempting to build a sovereign large language model from scratch, a computationally and financially prohibitive undertaking for a small population, Iceland adopted a strategy of intelligent data provision, shaping a global model's treatment of Icelandic by supplying curated, high quality training data (Marr, 2024).

2.4 Measurable Outcomes

The partnership produced several concrete outcomes. GPT-4's accuracy in Icelandic translation improved markedly, with one published case study reporting a 25 percent increase in translation accuracy (AI Native Foundation, 2024). Beyond this quantitative metric, the partnership achieved three qualitative outcomes of policy significance.

First, it enabled the integration of GPT-4 into Embla, Miðeind's voice assistant application, allowing Icelandic speakers to interact with the assistant in fluent Icelandic and receive translations to other languages. Second, it enabled Icelandic companies to deploy Icelandic language chatbots on their websites rather than relying on English language alternatives, integrating the language into daily commercial digital infrastructure. Third, and perhaps most significantly for the broader language preservation community, the partnership established a replicable model. OpenAI explicitly framed the collaboration as a template for other low resource languages, stating that it was envisioned not only as a way to improve GPT-4's capabilities but also as a step towards creating resources that could promote the preservation of other languages facing similar digital pressures (OpenAI, 2024).

 

3. India: Adi Vaani and AI for Endangered Tribal Language Preservation

3.1 Institutional Context

India is home to 461 tribal languages spoken by Scheduled Tribes and 71 distinct tribal mother tongues, according to the 2011 Census of India. Among these, 81 languages are classified as vulnerable and 42 as critically endangered. Many face the risk of extinction due to limited documentation and intergenerational transmission gaps (Ministry of Tribal Affairs, 2025b).

The Ministry of Tribal Affairs, Government of India, launched the beta version of Adi Vaani in August 2025 under the banner of Janjatiya Gaurav Varsh, a year long celebration of tribal heritage coinciding with the 150th birth anniversary of tribal leader Birsa Munda. The initiative represents the country's first AI powered translator specifically designed for tribal languages, developed by a national consortium of academic institutions and research centres (Ministry of Tribal Affairs, 2025b).

3.2 The AI Intervention

Adi Vaani is an AI based translation platform that serves as the foundation for a future large language model dedicated to India's tribal languages. The project was developed by a consortium led by IIT Delhi, with contributions from IIIT Hyderabad, BITS Pilani, IIIT Naya Raipur, and multiple Tribal Research Institutes located in Jharkhand, Odisha, Madhya Pradesh, Chhattisgarh, and Meghalaya (Ministry of Tribal Affairs, 2025a).

The technical architecture employs Transformer based sequence to sequence models, the current state of the art approach in neural machine translation. The research team at IIIT Hyderabad, led by Professor Radhika Mamidi, built machine translation systems for English to Santali, Hindi to Santali, and reverse direction pairs. The parallel corpus was constructed with assistance from the Tribal Research Institute in Odisha, and additional data was generated and post edited by Santali native speakers who spent considerable time at IIIT Hyderabad recording speech data and validating translations (IIIT Hyderabad, 2025).

The platform incorporates a functional toolkit that extends well beyond simple translation. Features include text to text, text to speech, speech to text, and speech to speech translation capabilities. Optical character recognition technology enables the digitisation of scanned manuscripts and primers. Bilingual dictionaries and curated repositories support vocabulary preservation. The platform also enables the creation of subtitles for government speeches and health advisories in tribal languages (Ministry of Tribal Affairs, 2025b).

In its beta launch phase, Adi Vaani supports Santali from Odisha, Bhili from Madhya Pradesh, Mundari from Jharkhand, and Gondi from Chhattisgarh. Additional languages, including Kui and Garo, are under development for subsequent phases (IIIT Hyderabad, 2025).

3.3 Linguistic Resource Base

The linguistic resource base for Adi Vaani combines community sourced data with academic expertise. Tribal Research Institutes in participating states provided primary linguistic materials including dictionaries, primers, storybooks, and research documents. Native speakers were integrally involved in data collection, validation, and iterative model refinement. As Professor Mamidi noted, the research team recognised early that for low resource languages, accuracy cannot come from machines alone; native usage must anchor the models (IIIT Hyderabad, 2025).

The text to speech tools were built through extensive collaboration with native speakers who recorded thousands of speech samples at the IIIT Hyderabad laboratories. This community anchored approach to data collection represents a deliberate methodological choice: rather than relying solely on written corpora, which may be sparse or non-existent for many tribal languages, the project invested in primary oral data collection.

3.4 Measurable Outcomes

While the Adi Vaani platform remains in its beta phase and comprehensive quantitative evaluation metrics are still being gathered, several outcomes are already evident. The platform is available on the Google Play Store, with an iOS version forthcoming and a dedicated web platform providing additional access (Ministry of Tribal Affairs, 2025a).

At the Bharatiya Bhasha Utsav held in December 2025, the Ministry of Tribal Affairs presented an exhibition of tribal language publications alongside a live demonstration of the Adi Vaani application. The demonstration illustrated the platform's real time translation capabilities and interactive learning tools (Ministry of Tribal Affairs, 2025a).

The project's broader significance lies in its institutional architecture. Adi Vaani treats language preservation as a public goods infrastructure project led by a ministerial authority, integrating text, speech, and cultural heritage digitisation within a single platform. It is explicitly designed to serve concrete user needs, including access to education, healthcare information, and government services, while simultaneously achieving preservation objectives. The research team's stated aspiration is to make NCERT textbooks, educational videos, health awareness materials, and government scheme information available in low resource tribal languages (IIIT Hyderabad, 2025).

 

4. Taiwan: Hybrid Dictionary Framework for Paiwan Language Preservation

4.1 Institutional Context

The indigenous languages of Taiwan, including the Paiwan language, face acute pressures from data scarcity and shrinking speaker populations. Taiwan's indigenous languages belong to the Austronesian family, and many are classified as endangered. Most existing Paiwan to Mandarin translation tools have been limited to lexical lookups incapable of handling sentence level or paragraph level translation. Word for word output often produces semantic dissonance and fails to convey contextual nuance or cultural specificity (Wang et al., 2026).

The research team at National Taitung University's Department of Computer Science and Information Engineering developed a hybrid translation framework specifically designed for extremely low resource language settings, where conventional neural machine translation models perform poorly due to insufficient parallel corpora.

4.2 The AI Intervention

The proposed framework combines three distinct components in a hybrid pipeline. The first component is dictionary based pre-translation, using a handcrafted bilingual dictionary to establish deterministic lexical alignments and generate a symbolically precise intermediate representation. When gaps occur due to missing vocabulary or sparse training data, the second component, a retrieval augmented generation module, activates to dynamically source semantically relevant examples from a vector database. The third component is an instruction tuned large language model that reorders syntactic structures, inflects verbs appropriately, and resolves lexical ambiguities to produce fluent and culturally coherent translations (Wang et al., 2026).

This architecture is significant because it does not require large parallel corpora, the conventional prerequisite for neural machine translation. Instead, it leverages the strengths of symbolic linguistic resources, specifically the handcrafted dictionary, while using modern AI techniques to overcome the limitations of purely rule based approaches. The hybrid design addresses a fundamental challenge in low resource language technology: how to achieve adequate translation quality when neither purely symbolic nor purely statistical methods are sufficient on their own.
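The three stage pipeline can be sketched in a few lines. This is a toy illustration under stated assumptions rather than the authors' implementation: the source tokens are placeholders and not real Paiwan words, the glosses are in English rather than Mandarin for readability, a bag-of-tokens cosine ranking stands in for the vector database, and the final step only assembles the prompt an instruction tuned model would receive without calling any model.

```python
import math
from collections import Counter

# Hypothetical toy dictionary; the source tokens are placeholders.
DICTIONARY = {"src_dog": "dog", "src_eat": "eat", "src_fish": "fish"}

# Hypothetical stored examples that the retrieval step can draw on.
EXAMPLES = [
    ("src_dog src_sleep", "the dog sleeps"),
    ("src_child src_eat src_fish", "the child eats fish"),
]


def pre_translate(sentence: str):
    """Stage 1: deterministic lookup; unknown tokens are kept and reported."""
    glosses, missing = [], []
    for token in sentence.split():
        if token in DICTIONARY:
            glosses.append(DICTIONARY[token])
        else:
            glosses.append(f"<{token}>")
            missing.append(token)
    return glosses, missing


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(sentence: str, k: int = 1):
    """Stage 2: rank stored examples by similarity to the input sentence."""
    query = Counter(sentence.split())
    ranked = sorted(
        EXAMPLES,
        key=lambda ex: cosine(query, Counter(ex[0].split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(sentence: str) -> str:
    """Stage 3: assemble the instruction for the LLM (no model is called)."""
    glosses, missing = pre_translate(sentence)
    lines = [
        "Reorder and naturalise the glossed sentence, preserving its meaning.",
        "Glosses: " + " ".join(glosses),
    ]
    if missing:  # retrieval only activates when the dictionary has gaps
        for src, tgt in retrieve(sentence):
            lines.append(f"Example: {src} => {tgt}")
    return "\n".join(lines)


prompt = build_prompt("src_child src_eat")
```

Keeping the dictionary lookup deterministic means that core vocabulary is never left to the model's discretion; the model is only asked to repair word order and fill the gaps that the lookup flags.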

4.3 Linguistic Resource Base

The Paiwan language presents an extreme case of data scarcity. Its orthography is primarily Romanised, and its written corpus is minimal, meaning that vector retrieval may falter because of sparse exemplars. To address this limitation, the research team adopted a two phase hybrid workflow: a dictionary alignment phase that produces a symbolically precise intermediate representation by mapping input tokens to dictionary entries and analysing local structure, followed by an LLM recomposition phase that feeds the intermediate form, along with any retrievable context, into a large language model tasked with reordering and naturalising the sentence while preserving semantic fidelity (Wang et al., 2026).

This approach treats the handcrafted bilingual dictionary as the indispensable foundation, the primary linguistic authority, and uses AI to amplify rather than replace the value of curated lexical data. The dictionary provides deterministic accuracy for core vocabulary; the retrieval module compensates for gaps; and the LLM resolves the syntactic and contextual issues that purely lexical approaches cannot address.

4.4 Measurable Outcomes

The system was evaluated on a 250 sentence Paiwan to Mandarin dataset using three standard machine translation metrics: BLEU score, which measures n-gram overlap between model output and reference translations; cosine similarity, which captures semantic proximity between sentence embeddings; and ROUGE-L F1 score, which assesses the longest common subsequence between candidate and reference texts.
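Two of these metrics are simple enough to sketch directly. The code below is an illustrative implementation, not the evaluation script used in the study: it assumes whitespace tokenisation (Mandarin output would normally be segmented into characters or words first) and takes sentence embeddings as plain lists of numbers supplied by some external model. BLEU is omitted for brevity.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Balanced F-score of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical candidate and reference: the LCS has 4 tokens, so
# precision is 4/4, recall is 4/5, and the F1 score is 8/9.
example_score = rouge_l_f1("the child eats fish", "the child eats the fish")
```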

The results demonstrated substantial performance gains. Cosine similarity rose from 0.210–0.236 to 0.810–0.846, BLEU scores from 1.7–4.4 to 40.8–51.9, and ROUGE-L F1 scores from 0.135–0.177 to 0.548–0.632 (Wang et al., 2026). These improvements are not marginal; they represent a transformation from effectively non-functional translation to practically useful output.
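The scale of these gains can be made concrete with a short calculation. The sketch below takes the reported ranges and divides the lowest score after the intervention by the highest score before it, giving a conservative fold improvement for each metric. It assumes the two ranges bound the scores of the configurations being compared; the variable and function names are illustrative.

```python
# Reported ranges from Wang et al. (2026): (before, after) for each metric.
RESULTS = {
    "cosine": ((0.210, 0.236), (0.810, 0.846)),
    "bleu": ((1.7, 4.4), (40.8, 51.9)),
    "rouge_l_f1": ((0.135, 0.177), (0.548, 0.632)),
}


def conservative_fold_gain(before, after) -> float:
    """Lowest 'after' score divided by the highest 'before' score."""
    return min(after) / max(before)


gains = {name: conservative_fold_gain(b, a) for name, (b, a) in RESULTS.items()}
for name, gain in gains.items():
    print(f"{name}: at least {gain:.1f}x")
```

On this conservative reading, cosine similarity and ROUGE-L F1 each improve by more than a factor of three and BLEU by roughly a factor of nine.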

The research team concluded that the results corroborate the effectiveness of the proposed hybrid pipeline in mitigating semantic drift, preserving core meaning, and enhancing linguistic alignment in low resource settings. Beyond technical performance, they note that the framework contributes to broader efforts in language revitalisation and cultural preservation by supporting the transmission of indigenous knowledge through accurate, contextually grounded, and accessible translations (Wang et al., 2026).

5. Cross Cutting Success Factors

When these three cases are analysed comparatively, four recurrent success factors emerge. Each factor recurs across the cases, though it is expressed in forms specific to each linguistic and institutional context and is documented more fully in some cases than in others.

5.1 Curated Lexical Corpora as the Foundation

In every case, preservation outcomes depended not on raw data quantity but on deliberately curated datasets capturing the specific morphological, orthographic, and cultural features of the target language. The Icelandic government leveraged its sovereign position as gatekeeper of high quality text corpora, providing curated datasets that reflected formal literary and official usage. The Adi Vaani project combined community sourced oral data with academic linguistic expertise, building parallel corpora through direct collaboration with native speakers and Tribal Research Institutes. The Paiwan project placed a handcrafted bilingual dictionary at the foundation of its architecture, treating it as the authoritative linguistic resource upon which AI components depended (Wang et al., 2026).

The common principle is that AI performs best for language preservation when it is grounded in authoritative human curated linguistic data. Quantity of data matters less than quality and cultural authenticity. The Paiwan case demonstrates this principle most starkly: even a small handcrafted dictionary, when systematically integrated into an AI pipeline, can enable substantial translation accuracy gains.

5.2 Explicit Morphological and Grammatical Standards

Successful initiatives were distinguished by the adoption of explicit linguistic standards against which AI model performance could be measured. The Iceland OpenAI partnership focused specifically on morphological fidelity, the ability of the model to correctly generate Icelandic's complex inflectional paradigms, evaluating outputs against documented grammatical rules rather than relying on generic fluency assessments (OpenAI, 2024). The Paiwan project employed three standardised quantitative metrics, BLEU, cosine similarity, and ROUGE-L, enabling precise measurement of improvement and identification of specific failure modes.

The principle is that preservation oriented AI requires evaluation criteria that distinguish between surface plausibility and structural accuracy. A model might generate fluent sounding output that is morphologically or grammatically incorrect; without explicit linguistic standards, such errors remain invisible to generic performance metrics.

5.3 Institutional Partnerships

Every case involved a structured partnership between a public body and a technology provider. The preservation mandate was public; the AI capability was provided through partnership. In Iceland, the government, with presidential level sponsorship, partnered with OpenAI and the domestic firm Miðeind (OpenAI, 2024). India's Ministry of Tribal Affairs partnered with a consortium of leading technical institutions including IIT Delhi, IIIT Hyderabad, BITS Pilani, and IIIT Naya Raipur, alongside regional Tribal Research Institutes (Ministry of Tribal Affairs, 2025b). The Paiwan project, while primarily a university research initiative, was presented at an international engineering conference and developed within Taiwan's broader framework of indigenous language revitalisation policy (Wang et al., 2026).

The principle is that language preservation through AI is not a solo effort. It requires the convening power and legitimacy of public institutions, the technical expertise of AI research bodies, and the linguistic authority of language communities and cultural institutions.

5.4 Public Facing Preservation Tools

The most impactful interventions embedded language preservation within platforms that users already inhabit rather than confining it to specialised academic archives. The Iceland project's ultimate objective was to integrate GPT-4 into Embla, a voice assistant, and to enable Icelandic businesses to deploy Icelandic language chatbots on their websites (OpenAI, 2024). The Adi Vaani platform was designed from the outset as a mobile application available on the Play Store, delivering real time translation, educational resources, and government information within a single user facing interface (Ministry of Tribal Affairs, 2025a).

The principle is that preservation succeeds when it is integrated into daily digital life rather than being treated as a separate archival activity. Tools that serve concrete user needs, whether accessing services, learning, or communicating, achieve preservation outcomes as a byproduct of their practical utility.

 

6. Conclusion

The three cases examined in this paper demonstrate that artificial intelligence, when deployed through deliberate institutional partnerships and grounded in curated linguistic resources, can serve as an effective instrument for language preservation. The Icelandic case shows that a small state can leverage its sovereign position as a data gatekeeper to shape the behaviour of a global AI platform without needing to build its own foundational model. The Adi Vaani case shows that a government led consortium can develop translation infrastructure for multiple low resource languages by combining academic expertise with community based data collection. The Paiwan case shows that even extremely limited linguistic resources, when systematically integrated with modern AI techniques, can produce functionally useful translation tools.

The four cross cutting success factors identified here (curated lexical corpora, explicit linguistic standards, institutional partnerships, and public facing tools) do not constitute a formula that can be mechanically replicated. Each language community faces distinctive historical, institutional, and linguistic circumstances. However, these factors do constitute a framework that can guide policy design in other contexts where governments and cultural institutions seek to deploy AI for language preservation.

The broader significance of these cases extends beyond the specific languages involved. In an era in which the algorithms that mediate daily communication are overwhelmingly developed for a small number of major languages, the capacity to shape those algorithms to serve linguistic diversity is an exercise of cultural sovereignty. The question these cases pose is not whether AI can serve language preservation (it manifestly can) but whether governments and institutions possess the imagination to deploy it in partnership with the communities whose languages are at stake.

 

References

AI Native Foundation (2024) 'AI Native Case Study 28: Government of Iceland', LinkedIn, 24 November. Available at: https://www.linkedin.com/posts/ainativefoundation_ai-languagepreservation-openai-activity-7266762118861856769-ywYx (Accessed: 18 May 2026).

IIIT Hyderabad (2025) 'IIITH plays key role in Adi Vaani – first AI-translator for tribal languages', IIIT Hyderabad Blog, 3 September. Available at: https://blogs.iiit.ac.in/adi-vaani/ (Accessed: 18 May 2026).

Marr, B. (2024) '3 ways generative AI is making our world a better place', Bernard Marr & Co., 29 January. Available at: https://bernardmarr.com/3-ways-generative-ai-is-making-our-world-a-better-place/ (Accessed: 18 May 2026).

Ministry of Tribal Affairs, Government of India (2025a) 'Ministry of Tribal Affairs celebrates Bharatiya Bhasha Utsav 2025, honouring India's diverse tribal linguistic heritage', Press Information Bureau, 11 December. Available at: https://www.pib.gov.in/PressReleasePage.aspx?PRID=2202533 (Accessed: 18 May 2026).

Ministry of Tribal Affairs, Government of India (2025b) 'Ministry of Tribal Affairs to launch the beta version of "Adi Vaani"', Press Information Bureau, 30 August. Available at: https://www.pib.gov.in/PressReleseDetailm.aspx?PRID=2162278 (Accessed: 18 May 2026).

MisionesOnline (2024) 'Cómo la inteligencia artificial está ayudando a Islandia a conservar y transmitir su lengua nativa', MisionesOnline, 19 July. Available at: https://misionesonline.net/2024/07/20/como-la-inteligencia-artificial-esta-ayudando-a-islandia-a-conservar-y-transmitir-su-lengua-nativa/ (Accessed: 18 May 2026).

OpenAI (2024) *Preserving languages for the future*. Customer story. Available at: https://openai.com/customer-stories/iceland (Accessed: 18 May 2026).

Wang, R.-C., Yang, C.-K., Yang, T.-C. and Tseng, Y.-X. (2026) 'Hybrid dictionary–retrieval-augmented generation–large language model for low-resource translation', Engineering Proceedings, 120(1), p. 52. doi: 10.3390/engproc2025120052.
