Large Sinhala ASR training data set


This data set contains transcribed audio data for Sinhala. The data set consists of wave files and a TSV file. The file utt_spk_text.tsv lists, for each recording, a FileID, an anonymized UserID, and the transcription of the audio in that file.
The data set has been manually quality checked, but there might still be errors.
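
As a rough illustration of how the TSV file described above might be read, the following Python sketch parses three tab-separated columns (FileID, UserID, transcription) and counts utterances per speaker. It assumes the file is UTF-8 encoded and has no header row; neither is stated here, so check the actual file first. The function names and the sample rows are made up for the example and are not taken from the data set.

```python
import csv
import io
from collections import Counter


def read_utterances(stream):
    """Yield (file_id, user_id, text) tuples from a tab-separated text stream."""
    # QUOTE_NONE keeps any quote characters in the transcriptions as literal text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        file_id, user_id, text = row
        yield file_id, user_id, text.strip()


def utterances_per_speaker(rows):
    """Count how many utterances each anonymized UserID contributed."""
    return Counter(user_id for _, user_id, _ in rows)


# Made-up sample rows standing in for utt_spk_text.tsv (not real data).
sample = io.StringIO(
    "f0001\tu01\tfirst transcription\n"
    "f0002\tu01\tsecond transcription\n"
    "f0003\tu02\tthird transcription\n"
)
rows = list(read_utterances(sample))
counts = utterances_per_speaker(rows)
```

To try this on the real file, the in-memory sample could be replaced with open("utt_spk_text.tsv", encoding="utf-8", newline=""), again assuming UTF-8 encoding.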

Language - Sinhala

Authors - Oddur Kjartansson and Supheakmungkol Sarin and Knot Pipatsrisawat and Martin Jansche and Linne Ha

Reference - http://openslr.org/52

License type - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

@inproceedings{kjartansson-etal-sltu2018,

title = {{Crowd-Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali}},
author = {Oddur Kjartansson and Supheakmungkol Sarin and Knot Pipatsrisawat and Martin Jansche and Linne Ha},
booktitle = {Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU)},
year = {2018},
address = {Gurugram, India},
month = aug,
pages = {52--55},
URL = {http://dx.doi.org/10.21437/SLTU.2018-11}
}