Sinhala news corpus - Text classification

The SiClaEn dataset contains a Reuters English News DataSet and a Sinhala News DataSet. The Sinhala News DataSet was collected from bilingual Sinhala and English news sources such as AdaDerana and NewsFirst. The Reuters English News DataSet has 7103 sentences in 383 posts, and the Sinhala News DataSet has 5221 sentences in 471 posts. Both datasets are categorized under the following topics: business, entertainment, politics, science & technology, and sports.

Language - Sinhala

Authors - Nisansa de Silva

Reference -

Citation - @article{de2015Sinhala,

author={de Silva, Nisansa},

title={{Sinhala Text Classification: Observations from the Perspective of a Resource Poor Language}},


abstract={Sinhala, despite its several millennia long history, remains a resource poor language.

The objective of this study was to explore the possibility of enhancing the text classification process

of a resource poor language by means of data and tools from a resource rich language.

However, it was discovered that if the feature space is based on an n-gram model,

Sinhala, being a highly inflected language,

naturally performs better than English, which is a weakly inflected language.

This result held true even when Sinhala was only utilizing the basic lexical level

models and English was utilizing advanced semantic level models.},
