AI4Bharat-IndicNLP Dataset


The AI4Bharat-IndicNLP dataset is an ongoing effort to create a collection of large-scale, general-domain corpora for Indian languages. It currently contains 2.7 billion words across 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora, and we create news article category classification datasets for 9 languages to evaluate them. We also evaluate the IndicNLP embeddings on multiple other tasks.
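As a rough illustration of how the shared word embeddings might be used, the sketch below reads vectors in the common word2vec-style text layout (an optional "count dimension" header line, then one "word v1 v2 ..." line per word) and compares two words by cosine similarity. The file layout is an assumption and is not stated in this card; the function names `load_text_vectors` and `cosine`, and the toy three-dimensional vectors, are hypothetical and only for demonstration.

```python
import io
import math


def load_text_vectors(handle, limit=None):
    """Parse word2vec-style text vectors from an open text handle.

    Assumed layout (not confirmed by this card): an optional header line
    holding two integers, then lines of the form 'word v1 v2 ... vN'.
    """
    vectors = {}
    for index, line in enumerate(handle):
        parts = line.split()
        # Skip a leading 'count dimension' header if one is present.
        if index == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        # Skip blank or malformed lines.
        if len(parts) < 2:
            continue
        # Stop once the requested number of words has been read.
        if limit is not None and len(vectors) >= limit:
            break
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy in-memory sample with made-up values, standing in for a real file.
sample = io.StringIO(
    "3 3\n"
    "அம்மா 1.0 0.0 0.0\n"
    "அப்பா 0.8 0.6 0.0\n"
    "நாய் 0.0 0.0 1.0\n"
)
toy = load_text_vectors(sample)
similarity = cosine(toy["அம்மா"], toy["அப்பா"])
```

For a real embedding file, the same function could be given a handle opened with `open(path, encoding="utf-8")`, where `path` points at wherever the vectors were downloaded; the `limit` argument keeps memory use small when only the most frequent words are needed.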

License - CC BY 4.0

Authors - Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul NC, Avik Bhattacharyya, Mitesh Khapra, Pratyush Kumar

Language - Tamil

Citation - 

@article{kunchukuttan2020indicnlpcorpus,
title={AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages},
author={Anoop Kunchukuttan and Divyanshu Kakwani and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
year={2020},
journal={arXiv preprint arXiv:2005.00085},
}

 

Reference - https://paperswithcode.com/dataset/indicnlp-corpus

https://github.com/AI4Bharat/indicnlp_corpus