The Grid Corpus is a large multitalker audiovisual sentence corpus designed to support joint computational-behavioral studies in speech perception. In brief, the corpus consists of high-quality audio and video (facial) recordings of 1000 sentences spoken by each of 34 talkers (18 male, 16 female), for a total of 34000 sentences. Sentences are of the form "put red at G9 now".

The dataset is organized as follows:

- audio_25k.zip contains the wav-format utterances at a 25 kHz sampling rate, in a separate directory per talker.
- alignments.zip provides word-level time alignments, again separated by talker.
- s1.zip, s2.zip, etc. contain .mpg videos for each talker. (Note that due to an oversight, no video for talker t21 is available.)

The Grid Corpus is described in detail in the paper jasagrid.pdf included in the dataset.
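As a rough illustration of how the audio and alignment files above might be consumed, here is a minimal Python sketch using only the standard library. The alignment parser assumes each line of an alignment file has the form "<start> <end> <word>" (three whitespace-separated fields); the actual layout inside alignments.zip may differ, so treat this as a hypothetical starting point, and check the files themselves.

```python
import wave


def parse_alignment(text):
    """Parse a word-level alignment file into (start, end, word) tuples.

    Assumed format (hypothetical -- verify against alignments.zip):
    one word per line as "<start> <end> <word>".
    """
    entries = []
    for line in text.strip().splitlines():
        start, end, word = line.split()
        entries.append((int(start), int(end), word))
    return entries


def load_utterance(wav_path):
    """Read one 25 kHz utterance wav into raw sample bytes."""
    with wave.open(str(wav_path), "rb") as w:
        assert w.getframerate() == 25000  # corpus audio is sampled at 25 kHz
        return w.readframes(w.getnframes())


# Hypothetical alignment text for the sentence "put red at G9 now";
# the time values here are made up for illustration.
sample = "0 4500 put\n4500 9000 red\n9000 11000 at\n11000 15000 G9\n15000 18500 now"
print(parse_alignment(sample)[0])  # (0, 4500, 'put')
```

Since the corpus is split into one zip per talker, a typical pipeline would extract (or stream via the zipfile module) each talker's directory and pair every wav file with its alignment file by utterance name.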