Polysemy Embedding


Some of the larger corpora are:
1. 1 Billion Word Language Model Benchmark
2. Wikipedia corpus (2018)
3. Gigaword corpus

The text8 corpus is freely available for embedding testing and evaluation. It contains only about 100K tokens (~100 MB corpus) and covers most of the frequently used words in the English language. An experimental analysis was performed to select the corpus for training: the candidate corpora were selected, a preliminary pre-processing step was applied, and each corpus was embedded. The models were then evaluated to compare how model accuracy varies with the corpus.
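The exact pipeline is not spelled out above; as a minimal sketch of the preprocess-train-evaluate loop using gensim (the file names and hyperparameters here are illustrative assumptions, not the settings used in the study):

```python
import re

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from gensim.test.utils import datapath


def preprocess(in_path, out_path):
    """Lowercase the text and strip non-alphabetic characters."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            cleaned = re.sub(r"[^a-z\s]", " ", line.lower())
            fout.write(" ".join(cleaned.split()) + "\n")


preprocess("text8", "text8_clean.txt")  # illustrative file names

# Train a skip-gram model (gensim 4.x API; 3.x uses size= instead of vector_size=).
model = Word2Vec(LineSentence("text8_clean.txt"),
                 vector_size=300, window=5, min_count=5, sg=1, workers=4)

# Evaluate on a word similarity dataset (WordSim-353 ships with gensim's test data);
# this returns Pearson and Spearman correlations and the out-of-vocabulary ratio.
pearson, spearman, oov = model.wv.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(f"Spearman: {spearman[0]:.3f}, OOV: {oov:.1f}%")
```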

Experiments were also conducted using combinations of multiple corpora, and their performance was evaluated via the accuracy achieved on each word similarity dataset. For this, the text8 corpus was duplicated 25 times and the following combinations were produced (a sketch of assembling them appears after the list):
1. Combination 1: Text8 + Wiki Corpus + 1 Billion Corpus
2. Combination 2: Text8 + Wiki Corpus
3. Combination 3: Wiki Corpus + 1 Billion Corpus
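Assembling these combinations amounts to concatenating (and, for text8, repeating) the cleaned corpus files. A minimal sketch, with illustrative file names:

```python
def combine(output_path, parts):
    """Concatenate corpus files into a single training file.

    parts: list of (path, repeat_count) tuples. Repeating text8 25
    times keeps its tokens from being drowned out by the larger corpora.
    """
    with open(output_path, "w", encoding="utf-8") as out:
        for path, repeats in parts:
            for _ in range(repeats):
                with open(path, encoding="utf-8") as f:
                    for line in f:
                        out.write(line)


# Combination 1: Text8 (x25) + Wiki Corpus + 1 Billion Corpus
combine("combination1.txt", [("text8_clean.txt", 25),
                             ("wiki_clean.txt", 1),
                             ("billion_clean.txt", 1)])
```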

Considerably higher accuracy was observed for the Wikipedia corpus than for the 1 Billion Word corpus or text8. Text8 is a very small corpus compared to the others, with a vocabulary size of only 44K. When it was duplicated and merged with the other corpora, a significant increase in accuracy was observed. However, since the focus was on polysemy embedding, the text8-based combined corpus could not be chosen due to its very low vocabulary size.

Therefore, the Wikipedia corpus was chosen, as it achieved considerably high accuracy in the word embedding model and has a high number of unique tokens.

Morphological Regularities in Word Embeddings

The One Billion Word Benchmark corpus, with its 1 billion words and 2,428,879 unique tokens, was used to produce the English word embeddings. A Tamil corpus with 3 million sentences, 38 million words and 1,976,304 unique tokens was composed from the web. The Tamil corpus was generated in an iterative manner: the scraped text was evaluated, and based on the results, different techniques were used to scrape more text. English has far more NLP resources than Tamil; however, since the main objective was to experiment on a morphologically rich language (MRL), greater focus was placed on creating a rich Tamil corpus.
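Corpus statistics such as the sentence, word and unique-token counts above can be gathered in a single pass over the file. A minimal sketch, assuming one sentence per line and whitespace tokenization:

```python
def corpus_stats(path):
    """Count sentences, words and unique tokens, assuming one sentence per line."""
    sentences = words = 0
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue
            sentences += 1
            words += len(tokens)
            vocab.update(tokens)
    return sentences, words, len(vocab)


# For the Tamil corpus described above this would report roughly
# 3M sentences, 38M words and ~1.98M unique tokens.
print(corpus_stats("tamil_corpus.txt"))  # hypothetical file name
```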

Since the morphology of the language was being analysed, the corpus had to carry rich morphological information, with words in all tenses, genders, grammatical persons, etc. Initially, Tamil news websites were scraped; this accounts for almost 80% of the entire corpus. Since news text lacks morphological information for the present and future tenses as well as the first and second person, other sources were also scraped to mitigate this issue to an extent. Tamil websites with future-tense content are difficult to find, so horoscope pages were scraped specifically to increase the future-tense information in the corpus.
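The scraping details are site-specific and not given above; a minimal sketch using requests and BeautifulSoup, with a hypothetical URL and a generic paragraph selector, might look like:

```python
import requests
from bs4 import BeautifulSoup


def scrape_article_text(url):
    """Fetch a page and extract its paragraph text (selector is site-specific)."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return "\n".join(p.get_text(strip=True) for p in soup.select("p"))


# Hypothetical URL; the real sources were Tamil news and horoscope pages.
text = scrape_article_text("https://example.com/tamil/article-1")
with open("tamil_corpus.txt", "a", encoding="utf-8") as f:
    f.write(text + "\n")
```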

From an NLP point of view, Tamil is a low-resource language. The corpus was created with the intention of increasing its morphological richness, which is a contribution of this research. The corpus was made publicly available so that anyone planning to improve morphological modelling can use it.