NLP for India

Indic-ASR

We are building a unified, self-supervised automatic speech recognition (ASR) model for Indic languages.

Unity AI (Ganga)

Project Unity is an initiative to address India's linguistic diversity and richness by creating a comprehensive resource covering the country's major languages. We strive to achieve state-of-the-art performance in understanding and generating text in Indian languages.

LLM Hindi AI model

COMMENTATOR: A Code-mixed Multilingual Text Annotation Framework

As the NLP community increasingly addresses the challenges of multilingualism, robust annotation tools are essential for handling multilingual datasets efficiently. We introduce COMMENTATOR, a framework designed specifically for annotating code-mixed text. It streamlines token-level and sentence-level language annotation, with a focus on Hinglish datasets.

Code-Mixing Indic Languages NLP tools Annotation Frameworks

Curating benchmarks and Constructing ML models for Code-Mixed NLP

We are curating and annotating large-scale Hindi-English code-mixed data to develop NLP tools and ML models for foundational tasks such as language identification, named entity recognition (NER), part-of-speech (POS) tagging, sentiment analysis, and translation. The project aims to advance low-resource Indic code-mixed NLP and enable state-of-the-art models and tools. It will also establish a public portal that hosts leaderboards, supports collaboration, and fosters multilingual NLP research.

Code-Mixing Code-switching Indic Languages Data Annotation Multilingualism

HinVec

This project aims to develop state-of-the-art word embedding models tailored to Hindi, with a focus on capturing its grammatical and contextual nuances. It includes a comprehensive evaluation benchmark suite for measuring model performance across NLP tasks such as text classification, semantic textual similarity (STS), and retrieval.

Embedding Indic Language

Development of Text-to-Speech (TTS) systems for Indic languages

In this project, we aim to develop a Text-to-Speech (TTS) system tailored for Indic languages, focusing on gathering large-scale training data and providing a seamless Tamil text-to-speech facility.

Speech synthesis systems Text-to-speech TTS Indic Languages

Ansh Tokenizer and multilingual Indic PL-BERT

This project aims to create a robust tokenizer for Indian languages and, in turn, develop a phoneme-level BERT (PL-BERT) model that produces multilingual shared embeddings from phoneme sequences.