Crowdsourced Sinhala [si-lk] ASR dataset


This dataset was collected for speech technology research.

The recordings were contributed by native Sinhala speakers who volunteered to supply the data. The audio was recorded on standard consumer smartphones in a variety of environments and is delivered in a downsampled lossless format (16 kHz, 16-bit, mono FLAC).
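The stated delivery format can be spot-checked without any audio library, because every FLAC stream begins with a STREAMINFO metadata block that records the sample rate, channel count, and bit depth. The following is a minimal sketch using only the Python standard library; the function names and the file path in the usage note are hypothetical and are not part of the dataset distribution.

```python
# Values taken from the format described above: 16 kHz, 16-bit, mono.
EXPECTED = {"sample_rate": 16000, "channels": 1, "bits_per_sample": 16}


def parse_flac_streaminfo(data: bytes) -> dict:
    """Read basic audio properties from the first 42 bytes of a FLAC stream."""
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream (missing 'fLaC' marker)")
    # Metadata block header: 1 bit last-block flag, 7 bits type, 24 bits length.
    block_type = data[4] & 0x7F
    block_len = int.from_bytes(data[5:8], "big")
    if block_type != 0 or block_len < 34 or len(data) < 42:
        raise ValueError("STREAMINFO block missing or truncated")
    info = data[8:42]
    # After 10 bytes of block/frame size fields, 8 bytes pack:
    # 20 bits sample rate, 3 bits (channels - 1),
    # 5 bits (bits per sample - 1), 36 bits total samples.
    packed = int.from_bytes(info[10:18], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def matches_delivery_format(info: dict) -> bool:
    """True if the parsed properties equal the format stated for this dataset."""
    return all(info[key] == value for key, value in EXPECTED.items())


def check_file(path: str) -> bool:
    """Convenience wrapper: parse the header of one file on disk."""
    with open(path, "rb") as handle:
        return matches_delivery_format(parse_flac_streaminfo(handle.read(42)))
```

For example, `check_file("some_utterance.flac")` (a placeholder path) would return `True` for a file that matches the format above. This only inspects the header; it does not decode or validate the audio itself.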

Some quality checks have been performed on the data, but mistranscriptions and audio artifacts may remain.

License - CC BY-SA 4.0

Authors - Oddur Kjartansson, Supheakmungkol Sarin, Knot Pipatsrisawat, Martin Jansche, Linne Ha

Language - Sinhala

Reference - https://research.google/tools/datasets/sinhala-asr/

http://openslr.org/52/


@inproceedings{kjartansson-etal-sltu2018,
title = {{Crowd-Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali}},
author = {Oddur Kjartansson and Supheakmungkol Sarin and Knot Pipatsrisawat and Martin Jansche and Linne Ha},
booktitle = {Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU)},
year = {2018},
address = {Gurugram, India},
month = aug,
pages = {52--55},
URL = {http://dx.doi.org/10.21437/SLTU.2018-11}
}