ILID: Native Script Language Identification for Indian Languages
A comprehensive study of automatic language identification techniques for multilingual text processing and cross-lingual understanding, with a focus on Indian languages.
Project Resources
Main Repository
GitHub: yashingle-ai/TextLangDetectPreprint
arXiv: 2507.11832
Python Package
PyPI: langid-indian 0.1.2
Hugging Face Dataset
HF: ILID-language-identification-dataset
Hugging Face Model
HF: Fine-tuned ILID Model
Abstract
This paper presents ILID (Indian Language IDentification), a comprehensive framework for automatic language identification in English and Indian language contexts. With the increasing prevalence of multilingual content in digital communications, accurate language detection has become crucial for natural language processing applications.
Our approach combines traditional statistical methods with modern deep learning techniques to achieve state-of-the-art performance on language identification tasks. We evaluate our methodology on benchmark datasets spanning multiple language families and demonstrate significant improvements over existing baselines.
The contributions of this work are: (1) a language identification benchmark of 250K sentences covering English and 22 Indian languages; and (2) baseline models built with both traditional and deep learning approaches.
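Because the task is identification from native-script text, the Unicode script of a sentence is already a strong signal. The sketch below is an illustration only, not the method used in the paper: it assumes a hand-picked subset of script names (SCRIPTS) and a hypothetical helper, dominant_script, and relies on the fact that Unicode character names begin with their script name. Script alone cannot separate languages that share a script (Hindi and Marathi are both written in Devanagari), so a step like this can only narrow the candidate set before a trained classifier makes the final decision.

```python
import unicodedata
from collections import Counter

# Script prefixes as they appear at the start of Unicode character names,
# e.g. "DEVANAGARI LETTER KA". This subset is an illustrative assumption,
# not a list taken from the ILID paper.
SCRIPTS = (
    "DEVANAGARI", "BENGALI", "GURMUKHI", "GUJARATI", "ORIYA", "TAMIL",
    "TELUGU", "KANNADA", "MALAYALAM", "ARABIC", "OL CHIKI",
    "MEETEI MAYEK", "LATIN",
)


def dominant_script(text):
    """Return the most frequent script in `text`, or None if none matches."""
    counts = Counter()
    for ch in text:
        # Characters without a name (or outside SCRIPTS) are simply skipped.
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                counts[script] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ("नमस्ते दुनिया", "வணக்கம்", "Hello world"):
        print(sample, "->", dominant_script(sample))
```

Digits, punctuation, and whitespace carry no script prefix from the list, so a string made only of those returns None.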
Citation
Use the following BibTeX entry to cite this work in your research.
@misc{ingle2025ilidnativescriptlanguage,
title={ILID: Native Script Language Identification for Indian Languages},
author={Yash Ingle and Pruthwik Mishra},
year={2025},
eprint={2507.11832},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.11832},
}
Key Contributions
Benchmark Dataset Creation
Benchmark dataset of 250K sentences covering English and 22 Indian languages, enabling robust language identification.
Development of Baseline Models
Baseline models capable of handling multilingual text, built with both traditional statistical methods and modern deep learning approaches, that reach state-of-the-art performance.
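To make "traditional statistical methods" concrete, here is a minimal sketch of one classic baseline for language identification: a multinomial Naive Bayes classifier over character trigrams with add-one smoothing. It is not the authors' implementation. The class name NgramNaiveBayes, the labels "eng", "hin", and "mar", and the six training sentences are all hypothetical and written for this example; none of them come from the ILID dataset.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams, padded with spaces."""
    padded = f" {text.strip().lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha                      # additive smoothing constant
        self.ngram_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.label_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of every n-gram.
            score = math.log(self.label_counts[label] / total_docs)
            for gram in grams:
                score += math.log(
                    (counts[gram] + self.alpha)
                    / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (made up for this sketch, not drawn from ILID).
texts = [
    "this is a book", "i am going home",
    "यह एक किताब है", "मैं घर जा रहा हूँ",
    "हे एक पुस्तक आहे", "मी घरी जात आहे",
]
labels = ["eng", "eng", "hin", "hin", "mar", "mar"]
model = NgramNaiveBayes().fit(texts, labels)

if __name__ == "__main__":
    for sample in ("this is my home", "वह घर जा रहा है", "ती घरी जात आहे"):
        print(sample, "->", model.predict(sample))
```

Character n-grams are a common choice for this task because they capture orthographic patterns that differ between languages sharing a script, as with the Hindi and Marathi samples here. A model trained on six sentences is only a demonstration of the mechanics; meaningful accuracy requires a corpus on the scale of the benchmark described above.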