Tamil and Sinhala Speech Intent Dataset


This dataset contains crowdsourced Sinhala and Tamil speech recordings in the banking domain. For more details, see the following works, which are based on this dataset.

  • Transfer Learning Based Free-Form Speech Command Classification for Low-Resource Languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop 2019 Jul 28 (pp. 288-294).
  • Domain-Specific Intent Classification of Sinhala Speech Data. In 2018 International Conference on Asian Language Processing (IALP) 2018 Nov 15 (pp. 197-202). IEEE.

License Terms
In summary:

  • Use only for academic and/or research purposes. No commercial use.
  • Publication permitted only if the Data Sets are unmodified and subject to the same license terms.
  • Any publication must include a full citation to the papers in which the Data Sets were initially published.

Please read the full License Terms before accessing the Data Sets (Link).

Citations
Cite the following papers in your publication:

@inproceedings{karunanayake2019transfer,
title={Transfer Learning Based Free-Form Speech Command Classification for Low-Resource Languages},
author={Karunanayake, Yohan and Thayasivam, Uthayasanker and Ranathunga, Surangika},
booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop},
pages={288--294},
year={2019}
}

@inproceedings{buddhika2018domain,
title={Domain Specific Intent Classification of Sinhala Speech Data},
author={Buddhika, Darshana and Liyadipita, Ranula and Nadeeshan, Sudeepa and Witharana, Hasini and Jayasena, Sanath and Thayasivam, Uthayasanker},
booktitle={2018 International Conference on Asian Language Processing (IALP)},
pages={197--202},
year={2018},
organization={IEEE}
}

 

Dataset Details
For each language, the dataset contains the following files:

  • [Language]_Data.csv – Details of the audio files
  • [Language]_Sentences – Details of the intents and their inflections
  • license.txt
  • audio_files (Folder) – Contains audio files
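
As a starting point for working with the metadata file, the sketch below reads a [Language]_Data.csv file and reports its columns. It deliberately assumes nothing about the column names, since they are not listed here; the function name load_metadata and the utf-8-sig encoding are assumptions for illustration, not part of the dataset.

```python
import csv
from pathlib import Path


def load_metadata(csv_path):
    """Read a metadata CSV and return (column_names, rows).

    Column names are taken from the file's header row, not assumed.
    Each row is a dict keyed by those column names.
    """
    # Assumption: the file is UTF-8 encoded; "utf-8-sig" also tolerates a BOM.
    with Path(csv_path).open(newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = list(reader.fieldnames or [])
    return columns, rows
```

For example, calling load_metadata("Tamil_Data.csv") (a file name assumed from the [Language]_Data.csv pattern) and printing the returned column list is a quick way to see which fields describe each audio file before writing any further processing code.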

Speaker Details
The Sinhala dataset does not contain any speaker information, because it was not preserved during data collection. In the Tamil dataset, the subfolders of the “audio_files” folder belong to different speakers, so the folder structure can be used to infer information about the speakers.
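
The folder structure of the Tamil dataset can be turned into a speaker-to-files mapping with a few lines of code. The sketch below assumes that every immediate subfolder of audio_files corresponds to one speaker and uses the subfolder name as a stand-in speaker label; the function name files_by_speaker is hypothetical, and no audio file extension is assumed.

```python
from pathlib import Path


def files_by_speaker(audio_root):
    """Group files by the immediate subfolder of audio_root that contains them.

    Assumption: each immediate subfolder holds the recordings of one speaker.
    Returns a dict mapping subfolder name -> sorted list of file paths.
    """
    root = Path(audio_root)
    groups = {}
    for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Collect every file below the speaker folder, whatever its extension.
        groups[speaker_dir.name] = sorted(
            f for f in speaker_dir.rglob("*") if f.is_file()
        )
    return groups
```

A mapping like this is useful, for instance, when building train and test splits that keep each speaker's recordings on one side of the split.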

Dataset Download
Request the download here (Link). Clearly specify which datasets you require. You will receive the download link by email once your request is approved. By downloading the data, you signify your acceptance of the above license terms.

  • Sinhala
  • Tamil