IndicCorp Dataset


IndicCorp is a large monolingual corpus of around 9 billion tokens covering 12 major Indian languages. It was built by discovering and scraping thousands of web sources, primarily news, magazines and books, over several months.

Languages covered: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu

Corpus Format: The corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized and deduplicated.
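Because the file is large and holds one sentence per line, it is usually safer to stream it than to load it into memory. The following is a minimal sketch of that approach, not an official loader: the function names and the file path are hypothetical, UTF-8 encoding is an assumption, and the whitespace token count is only a rough proxy, since the corpus is untokenized.

```python
from typing import Iterator, Tuple


def iter_sentences(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield non-empty sentences from a one-sentence-per-line text file.

    The file is read lazily, one line at a time, so memory use stays
    small even for a very large corpus. UTF-8 is an assumed encoding.
    """
    with open(path, encoding=encoding) as handle:
        for line in handle:
            sentence = line.strip()
            if sentence:  # skip blank lines
                yield sentence


def corpus_stats(path: str) -> Tuple[int, int]:
    """Return (sentence_count, whitespace_token_count) for the file."""
    sentences = 0
    tokens = 0
    for sentence in iter_sentences(path):
        sentences += 1
        # Splitting on whitespace is a crude stand-in for real tokenization.
        tokens += len(sentence.split())
    return sentences, tokens
```

A call such as `corpus_stats("ta.txt")`, where `ta.txt` is a placeholder for the downloaded Tamil file, would return the two counts without holding the whole corpus in memory.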

License - Unknown

Authors - Divyanshu Kakwani; Anoop Kunchukuttan; Satish Golla; Gokul N.C.; Avik Bhattacharyya; Mitesh M. Khapra; Pratyush Kumar.

Language - Tamil

 

References - https://datasetsearch.research.google.com/search?src=0&query=tamil%20nlp&docid=L2cvMTFuMmg3eW54eA%3D%3D

https://paperswithcode.com/dataset/indiccorp