NLP for India

Multimodal Audio Processing System
An advanced system that integrates noise reduction, transcription, translation, and speech synthesis into a streamlined, efficient audio processing workflow.
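
As a rough illustration of how the four stages could be chained, here is a minimal sketch. It is not the project's implementation: the names `AudioJob`, `run_pipeline`, and every stage function are hypothetical, and each stage is a placeholder standing in for a real model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AudioJob:
    """Carries the audio and every intermediate result through the pipeline."""
    samples: List[float]
    transcript: str = ""
    translation: str = ""
    speech: List[float] = field(default_factory=list)
    completed: List[str] = field(default_factory=list)


# Hypothetical stage signature: each stage takes a job and returns it updated.
Stage = Callable[[AudioJob], AudioJob]


def reduce_noise(job: AudioJob) -> AudioJob:
    # Placeholder noise gate: silence samples below a fixed amplitude.
    job.samples = [s if abs(s) >= 0.05 else 0.0 for s in job.samples]
    return job


def transcribe(job: AudioJob) -> AudioJob:
    # Placeholder: a real system would call an ASR model here.
    job.transcript = f"<transcript of {len(job.samples)} samples>"
    return job


def translate(job: AudioJob) -> AudioJob:
    # Placeholder: a real system would call a translation model here.
    job.translation = f"<translation of {job.transcript}>"
    return job


def synthesize(job: AudioJob) -> AudioJob:
    # Placeholder: a real system would call a speech synthesis model here.
    job.speech = [0.0] * len(job.translation)
    return job


def run_pipeline(job: AudioJob, stages: List[Stage]) -> AudioJob:
    """Apply each stage in order and record which stages ran."""
    for stage in stages:
        job = stage(job)
        job.completed.append(stage.__name__)
    return job


PIPELINE = [reduce_noise, transcribe, translate, synthesize]
result = run_pipeline(AudioJob(samples=[0.01, 0.4, -0.02, -0.7]), PIPELINE)
print(result.completed)
```

Keeping each stage behind the same signature means a stage can be swapped or skipped (for example, dropping translation) without touching the others.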

Unity AI (Ganga)
Project Unity is an initiative to address India's linguistic diversity and richness by creating a comprehensive resource covering the country's major languages. We strive to achieve state-of-the-art performance in understanding and generating text in Indian languages.

COMMENTATOR: A Code-mixed Multilingual Text Annotation Framework
As the NLP community increasingly addresses the challenges of multilingualism, robust annotation tools are essential for handling multilingual datasets efficiently. We introduce COMMENTATOR, a framework designed specifically for annotating code-mixed text. It streamlines token-level and sentence-level language annotation, with a focus on Hinglish datasets.
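
To make the difference between token-level and sentence-level annotation concrete, here is a small sketch. It does not reflect COMMENTATOR's actual schema: the `TokenAnnotation` class, the tag set (`hi`, `en`, `other`), and the rule for deriving a sentence label are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class TokenAnnotation:
    token: str
    lang: str  # hypothetical tag set: "hi", "en", or "other"


def sentence_label(tokens: List[TokenAnnotation]) -> str:
    """Derive a sentence-level label from token-level language tags.

    Assumed rule: tokens tagged "other" are ignored; a sentence with more
    than one remaining language is "code-mixed"; otherwise it takes the
    single language present.
    """
    counts = Counter(t.lang for t in tokens if t.lang != "other")
    if not counts:
        return "other"
    if len(counts) > 1:
        return "code-mixed"
    return next(iter(counts))


# Romanized Hinglish example with hand-assigned tags.
mixed = [
    TokenAnnotation("yeh", "hi"),
    TokenAnnotation("movie", "en"),
    TokenAnnotation("bahut", "hi"),
    TokenAnnotation("acchi", "hi"),
    TokenAnnotation("thi", "hi"),
    TokenAnnotation(".", "other"),
]
hindi_only = [t for t in mixed if t.lang != "en"]

print(sentence_label(mixed))
print(sentence_label(hindi_only))
```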

Curating benchmarks and Constructing ML models for Code-Mixed NLP
This project curates and annotates large-scale Hindi-English code-mixed data to develop NLP tools and ML models for foundational tasks such as language identification, named entity recognition (NER), part-of-speech (POS) tagging, sentiment analysis, and translation. It aims to advance low-resource Indic code-mixed NLP by enabling state-of-the-art models and tools. It will also establish a public portal that hosts leaderboards and supports collaboration and multilingual NLP research.

HinVec
This project aims to develop state-of-the-art word embedding models tailored for the Hindi language, focusing on capturing its unique grammatical and contextual nuances. The project includes a comprehensive evaluation benchmark suite for measuring model performance across NLP tasks such as text classification, semantic textual similarity (STS), and retrieval.
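
As an illustration of how an STS-style check can use word embeddings, the sketch below mean-pools word vectors into a sentence vector and compares sentences by cosine similarity. This is a common baseline, not necessarily HinVec's method. The two-dimensional vectors in `TOY_EMBEDDINGS` are invented for the example and come from no trained model.

```python
import math
from typing import Dict, List

# Invented toy vectors, for illustration only.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "किताब": [0.9, 0.1],   # "book"
    "पुस्तक": [0.8, 0.2],  # "book" (synonym)
    "पानी": [0.1, 0.9],    # "water"
}


def sentence_vector(tokens: List[str], emb: Dict[str, List[float]]) -> List[float]:
    """Mean-pool the vectors of known tokens; unknown tokens are skipped."""
    dim = len(next(iter(emb.values())))
    known = [emb[t] for t in tokens if t in emb]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


book = sentence_vector(["किताब"], TOY_EMBEDDINGS)
sim_close = cosine(book, sentence_vector(["पुस्तक"], TOY_EMBEDDINGS))
sim_far = cosine(book, sentence_vector(["पानी"], TOY_EMBEDDINGS))
print(sim_close > sim_far)
```

An STS benchmark would then compare such similarity scores against human similarity judgments, typically with a rank correlation.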

Design and implementation of an ASR system for Tamil language
In this project, we aim to develop an ASR system tailored for the Tamil language, focusing on gathering large-scale training data and providing a seamless Tamil voice-to-text facility.

Enhancing ASR in Marathi language
The project helps create an audio dataset for the Marathi language that spans its various dialects and provides a smooth Marathi voice-to-text experience.