Advancing Code-Mixed NLP with Curated Benchmarks & ML Models

📌

Dataset Curation

Building comprehensive datasets that capture real-world code-mixing patterns across diverse linguistic contexts.

Multi-domain samples
Fine-grained annotations
🚧

Challenges & Limitations

Addressing complex linguistic phenomena that make code-mixed NLP particularly challenging.

Script variation
Syntactic complexity
📊

Resources & Metrics

Developing specialized evaluation frameworks to accurately assess model performance on code-mixed content.

Language-agnostic benchmarks
Cross-lingual evaluation
🌍

Open Collaboration

Fostering a global community of researchers and practitioners to advance code-mixed NLP technologies.

Shared task leaderboards
Open-source repositories

PUBLICATIONS

Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models across Modalities [Paper] [Github]
Rajvee Sheth, Samridhi Raj Sinha, Mahavir Patil, Himanshu Beniwal, Mayank Singh
In Proceedings of the 2026 Conference on Association for Computational Linguistics, 2026.
COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing [Paper] [Dataset]
Rajvee Sheth, Himanshu Beniwal, Mayank Singh
In Findings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
Commentator: A Code-mixed Multilingual Text Annotation Framework [Paper] [Codebase]
Rajvee Sheth, Shubh Nisar, Heenaben Prajapati, Himanshu Beniwal, Mayank Singh
In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
MUTANT: A Multi-sentential Code-mixed Hinglish Dataset [Paper]
Rahul Gupta, Vivek Srivastava, Mayank Singh
Findings of the Association for Computational Linguistics: EACL 2023.
MMT: A Multilingual and Multi-Topic Indian Social Media Dataset [Paper]
Dwip Dalal, Vivek Srivastava, Mayank Singh
First Workshop on Cross-Cultural Considerations in NLP (C3NLP) at EACL 2023.
Overview and results of MixMT shared-task at WMT 2022 [Paper]
Vivek Srivastava, Mayank Singh
Seventh Conference on Machine Translation (WMT).
HinglishEval Generation Challenge on Quality Estimation of Synthetic Code-Mixed Text: Overview and Results [Paper]
Vivek Srivastava, Mayank Singh
15th International Conference on Natural Language Generation: Generation Challenges, INLG'22.
Code-Mixed NLG: Resources, Metrics, and Challenges [Paper]
Vivek Srivastava, Mayank Singh
5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD).
PoliWAM: An Exploration of a Large Scale Corpus of Political Discussions on WhatsApp Messenger [Paper] [Poster]
Vivek Srivastava, Mayank Singh
The Seventh Workshop on Noisy User-generated Text (W-NUT 2021) at EMNLP 2021.
MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation [Paper]
Ayush Garg, Sammed S Kagi, Vivek Srivastava, Mayank Singh
The 2nd Workshop on Evaluation & Comparison of NLP Systems (Eval4NLP) at EMNLP 2021.
HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text [Paper] [Slides]
Vivek Srivastava, Mayank Singh
The 2nd Workshop on Evaluation & Comparison of NLP Systems (Eval4NLP) at EMNLP 2021.
Quality Evaluation of the Low-Resource Synthetically Generated Code-Mixed Hinglish Text [Paper] [Website]
Vivek Srivastava, Mayank Singh
Generation Challenge at 14th International Conference on Natural Language Generation (INLG) 2021.
Challenges and Limitations with the Metrics Measuring the Complexity of Code-Mixed Text [Paper]
Vivek Srivastava, Mayank Singh
Fifth Workshop on Computational Approaches to Linguistic Code-Switching at NAACL 2021.
IIT Gandhinagar at SemEval-2020 Task 9: Code-Mixed Sentiment Classification Using Candidate Sentence Generation and Selection [Paper]
Vivek Srivastava, Mayank Singh
Proceedings of the Fourteenth Workshop on Semantic Evaluation at COLING 2020.
PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation [Paper]
Vivek Srivastava, Mayank Singh
Sixth Workshop on Noisy User-generated Text (W-NUT 2020) at EMNLP 2020.

DATASETS

COMI-LINGUA: COde-MIxing and LINGuistic Insights on Natural Hinglish Usage and Annotation
By: Rajvee Sheth, Himanshu Beniwal and Mayank Singh. (EMNLP Findings 2025)
Huggingface Paper Link

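A high-quality Hinglish code-mixed dataset with 181,463 instances, manually annotated for language identification (LID), matrix language identification (MLI), part-of-speech (POS) tagging, named entity recognition (NER), text normalization, and translation.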

MUTANT: A Multi-sentential Code-mixed Hinglish Dataset
By: Rahul Gupta, Vivek Srivastava and Mayank Singh. (EACL 2023)
Huggingface Paper Link

A multi-sentential Hindi-English code-mixed dataset with 67,007 documents and 84,937 multi-sentential code-mixed texts (MCTs), sourced from political speeches, press releases, and Hindi news articles.

MMT: A Multilingual and Multi-Topic Indian Social Media Dataset
By: Dwip Dalal, Vivek Srivastava and Mayank Singh. (EACL 2023)
Huggingface Paper Link

A large-scale language identification dataset derived from 1.7 million tweets collected from Indian Twitter/X, annotated with coarse- and fine-grained language labels.

PoliWAM: An Exploration of a Large Scale Corpus of Political Discussions on WhatsApp Messenger
By: Vivek Srivastava and Mayank Singh. (EMNLP 2021)
Huggingface Paper Link

A large-scale corpus of WhatsApp political discussions collected during the 2019 Indian General Elections, consisting of both raw and annotated data and enabling research on political discourse and misinformation.

HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text
By: Vivek Srivastava and Mayank Singh. (EMNLP 2021)
Huggingface Paper Link

A high-quality Hindi-English code-mixed dataset for NLG, containing human- and algorithm-generated Hinglish sentences with quality ratings, sourced from the IITB English-Hindi parallel corpus.

PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation
By: Vivek Srivastava and Mayank Singh. (EMNLP 2020)
Huggingface Paper Link

A parallel corpus of 13,738 code-mixed English-Hindi sentences and their corresponding English translations.

EVENTS & PARTICIPATIONS

Dec 2022
MixMT Shared Task at WMT

Seventh Conference on Machine Translation (WMT22): Co-located with EMNLP 2022, this shared task focused on machine translation for code-mixed languages.

Nov 2022
HinglishEval Generation Challenge

Hosted at IIT Gandhinagar as part of INLG 2022, this challenge explored quality estimation of synthetically generated code-mixed Hinglish text.

Jan 2022
Tutorial on "Code-Mixed NLG: Resources, Metrics, and Challenges"

Held at CODS-COMAD 2022, covering various challenges in code-mixed natural language generation.

Dec 2021
Tutorial on "Lessons, Insights, and Opportunities with the Metrics for Code-Mixing"

Presented at ICON 2021, discussing evaluation metrics for code-mixed text.

Dec 2020
IIT Gandhinagar at SemEval-2020 Task 9

A study on code-mixed sentiment classification using candidate sentence generation and selection.

ANNOTATION FRAMEWORK: COMMENTATOR



Poster 📽️



Demonstration video ▶️


TEAM

Prof. Mayank Singh

Jibaben Patel Chair Professor in AI and Associate Professor

IIT Gandhinagar

Rajvee Sheth

Senior Research Fellow

IIT Gandhinagar

Pooja Goswami

Technical Assistant

IIT Gandhinagar

Mahesh Kumar

Technical Assistant

IIT Gandhinagar

Rahul Gadhvi

Technical Assistant

IIT Gandhinagar


PROJECT INTERNS


Samridhi Raj Sinha

SRIP Intern

IIT Gandhinagar

Mahavir Patil

Project Intern

IIT Gandhinagar

Drashti Patel

Project Intern

IIT Gandhinagar

Yash Chopra

Project Intern

IIT Gandhinagar


CONTRIBUTORS


Shubh Nisar

Software Engineer Intern

North Carolina State University

Heenaben Prajapati

Senior Research Fellow

IIT Gandhinagar

Himanshu Beniwal

PhD student

IIT Gandhinagar


ALUMNI


Ronakpuri Goswami

Junior Research Fellow

DAU Gandhinagar

Diksha Bishlay

Former TA

IIT Gandhinagar

Vaidahi Patel

Masters in CS

ASU (USA)

Ravindra Purohit

Research Scholar

DAU Gandhinagar

Dwip Dalal

PhD Student

UIUC

Rahul Gupta

Software Engineer

Goldman Sachs

Vivek Srivastava

Researcher

TCS Research

PROJECT FUNDING

Graciously sponsored by

Anusandhan National Research Foundation (ANRF)