RANLP 2025

ILID: Native Script Language Identification for Indian Languages

A comprehensive study of automatic language identification techniques for multilingual text processing and cross-lingual understanding, with a focus on Indian languages.

Authors

Yash Ingle

SVNIT

u23ai062@coed.svnit.ac.in

Dr. Pruthwik Mishra

SVNIT

pruthwikmishra@aid.svnit.ac.in

Project Resources

Main Repository

GitHub: yashingle-ai/TextLangDetect

Preprint

arXiv: 2507.11832

Python Package

PyPI: langid-indian 0.1.2

Hugging Face Dataset

HF: ILID-language-identification-dataset

Hugging Face Model

HF: Fine-tuned ILID Model

Abstract

This paper presents ILID (Indian Language IDentification), a comprehensive framework for automatic language identification in English and Indian language contexts. With the increasing prevalence of multilingual content in digital communications, accurate language detection has become crucial for natural language processing applications.

Our approach combines traditional statistical methods with modern deep learning techniques to achieve state-of-the-art performance on language identification tasks. We evaluate our methodology on benchmark datasets spanning multiple language families and demonstrate significant improvements over existing baselines.

The contributions of this work are: (1) a language identification benchmark of 250K sentences covering English and 22 Indian languages, and (2) baseline models built with both traditional and deep learning approaches.

Citation

Use the following BibTeX entry to cite this work in your research.

@misc{ingle2025ilidnativescriptlanguage,
      title={ILID: Native Script Language Identification for Indian Languages}, 
      author={Yash Ingle and Pruthwik Mishra},
      year={2025},
      eprint={2507.11832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.11832}, 
}

Key Contributions

Benchmark Dataset Creation

A benchmark dataset of 250K sentences covering English and 22 Indian languages, intended to support robust language identification.

Development of Baseline Models

Baseline models for multilingual text, built with both traditional statistical methods and modern deep learning approaches, that outperform existing state-of-the-art systems.
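Illustrative Sketch of a Statistical Baseline

To make the idea of a traditional statistical baseline concrete, here is a minimal sketch of a character n-gram Naive Bayes classifier. It is not the authors' implementation and does not use the langid-indian package. The class name, the label codes (eng, hin, mar, tam) and the toy sentences are all assumptions chosen for illustration, and none of the data comes from the ILID benchmark. Hindi and Marathi are included because they share the Devanagari script, so the script alone cannot separate them.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return every character n-gram of length n_min..n_max, with space padding."""
    padded = f" {text.strip()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing.

    Hypothetical helper class for illustration only.
    """

    def fit(self, texts, labels):
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram frequencies
        self.label_counts = Counter(labels)       # label -> number of sentences
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text))
        self.vocab_size = len({g for c in self.ngram_counts.values() for g in c})
        self.totals = {lab: sum(c.values()) for lab, c in self.ngram_counts.items()}
        return self

    def log_score(self, text, label):
        """Unnormalised log-probability of `label` given `text`."""
        n_sentences = sum(self.label_counts.values())
        score = math.log(self.label_counts[label] / n_sentences)
        denom = self.totals[label] + self.vocab_size
        for gram in char_ngrams(text):
            score += math.log((self.ngram_counts[label][gram] + 1) / denom)
        return score

    def predict(self, text):
        """Return the label with the highest log-score."""
        return max(self.label_counts, key=lambda lab: self.log_score(text, lab))


# Toy training data (assumed, NOT taken from the ILID dataset).
TOY_TRAIN = [
    ("this is a book", "eng"),
    ("i am going home", "eng"),
    ("यह एक किताब है", "hin"),
    ("मैं घर जा रहा हूँ", "hin"),
    ("हे एक पुस्तक आहे", "mar"),
    ("मी घरी जात आहे", "mar"),
    ("இது ஒரு புத்தகம்", "tam"),
    ("நான் வீட்டுக்கு போகிறேன்", "tam"),
]

texts, labels = zip(*TOY_TRAIN)
model = NgramNaiveBayes().fit(texts, labels)

for sentence in ["this is home", "यह घर है", "हे घर आहे", "இது வீடு"]:
    print(sentence, "->", model.predict(sentence))
```

Character n-grams are a common choice for this kind of baseline because they pick up short function words and suffixes (for example "है" in Hindi versus "आहे" in Marathi) without needing a tokenizer. A real baseline would be trained on the full benchmark rather than a handful of sentences.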