NLP for India
Indic-ASR
We are building a unified, self-supervised automatic speech recognition model for Indic languages.
Unity AI (Ganga)
Project Unity is an initiative to address India's linguistic diversity and richness by creating a comprehensive resource covering the country's major languages. We strive to achieve state-of-the-art performance in understanding and generating text in Indian languages.
COMMENTATOR: A Code-mixed Multilingual Text Annotation Framework
As the NLP community increasingly addresses challenges associated with multilingualism, robust annotation tools are essential for handling multilingual datasets efficiently. We introduce COMMENTATOR, a framework designed specifically for annotating code-mixed text. It streamlines token-level and sentence-level language annotation, with a focus on Hinglish datasets.
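Token-level language annotation of this kind can be pictured as pairing each token with a language tag. The sketch below is purely illustrative: the HI/EN/OTHER label set and the toy lexicons are assumptions for demonstration, not COMMENTATOR's actual schema or method.

```python
# Illustrative token-level language annotation for a Hinglish sentence.
# The label set (HI / EN / OTHER) and the tiny lexicons are assumptions
# for demonstration only.

HINDI_WORDS = {"mujhe", "bahut", "pasand", "hai", "yeh"}
ENGLISH_WORDS = {"movie", "really", "awesome"}

def tag_tokens(sentence):
    """Assign a language tag to each token via simple lexicon lookup."""
    tags = []
    for token in sentence.split():
        word = token.lower().strip(".,!?")
        if word in HINDI_WORDS:
            tags.append((token, "HI"))
        elif word in ENGLISH_WORDS:
            tags.append((token, "EN"))
        else:
            tags.append((token, "OTHER"))
    return tags

print(tag_tokens("yeh movie mujhe bahut pasand hai"))
```

A real annotation framework would, of course, use trained classifiers and human correction rather than a lexicon lookup, but the output format — one language tag per token — is the same.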
Curating benchmarks and Constructing ML models for Code-Mixed NLP
We are curating and annotating large-scale Hindi-English code-mixed data to develop NLP tools and ML models for foundational tasks such as Language Identification, NER, POS tagging, Sentiment Analysis, and Translation. This project aims to advance low-resource Indic code-mixed NLP, enabling state-of-the-art models and tools. It will also establish a public portal for collaboration, leaderboards, and multilingual NLP research.
HinVec
This project aims to develop state-of-the-art word embedding models tailored to the Hindi language, focusing on capturing its unique grammatical and contextual nuances. The project includes a comprehensive evaluation benchmark suite for measuring model performance across NLP tasks such as text classification, semantic textual similarity (STS), and retrieval.
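A minimal sketch of how an STS-style check over word embeddings works: related words should have a higher cosine similarity than unrelated ones. The 3-dimensional vectors below are made up for illustration; a real benchmark run would load trained Hindi embeddings and compare against gold similarity scores.

```python
import math

# Toy embeddings (made-up 3-d vectors, for illustration only).
embeddings = {
    "राजा": [0.9, 0.1, 0.3],    # "king"
    "रानी": [0.8, 0.2, 0.35],   # "queen"
    "केला": [0.1, 0.9, 0.05],   # "banana"
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Related pair should outscore the unrelated pair.
print(cosine(embeddings["राजा"], embeddings["रानी"]))
print(cosine(embeddings["राजा"], embeddings["केला"]))
```

Benchmark suites typically correlate these model similarities with human judgments (e.g. via Spearman correlation) to produce a single score per model.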
Development of Text-to-Speech systems (TTS) in Indic languages
In this project, we aim to develop a TTS (Text-to-Speech) system tailored to Indic languages, focusing on gathering large-scale training data and providing a seamless Tamil text-to-speech service.
Ansh Tokenizer and multilingual Indic PL-BERT
This project aims to create a robust tokenizer for Indian languages and, building on it, a Phoneme-Level BERT model that learns multilingual shared embeddings from phoneme sequences.
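One reason Indic scripts need a dedicated tokenizer is that a written unit often spans several code points (a base consonant plus vowel signs and nasalization marks). The sketch below groups Devanagari characters into such units using Unicode combining-mark categories; it is a simplification of what a full tokenizer like Ansh would need (for instance, it ignores viramas that join conjunct consonants).

```python
import unicodedata

def grapheme_units(text):
    """Group each base character with its combining marks (matras etc.)."""
    units = []
    for ch in text:
        # Mn = nonspacing mark, Mc = spacing combining mark
        if units and unicodedata.category(ch) in ("Mn", "Mc"):
            units[-1] += ch  # attach the mark to its base character
        else:
            units.append(ch)
    return units

print(grapheme_units("हिंदी"))  # consonants grouped with their vowel signs
```

A phoneme-level model would go one step further, mapping each such unit to its phoneme sequence before feeding it to the encoder.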
