The Grid Corpus is a large multitalker audiovisual sentence corpus designed to support joint computational-behavioral studies in speech perception. In brief, the corpus consists of high-quality audio and video (facial) recordings of 1000 sentences spoken by each of 34 talkers (18 male, 16 female), for a total of 34000 sentences. Sentences are of the form "put red at G9 now". audio_25k.zip contains the wav format utterances at a 25 kHz sampling rate in a separate directory per talker alignments.zip provides word-level time alignments, again separated by talker s1.zip, s2.zip etc contain .jpg videos for each talker [note that due to an oversight, no video for talker t21 is available] The Grid Corpus is described in detail in the paper jasagrid.pdf included in the dataset.
The purpose of these datasets is to support equivalence and subsumption ontology matching. There are five ontology pairs extracted from MONDO and UMLS: Source Ontology Pair Category MONDO OMIM-ORDO Disease MONDO NCIT-DOID Disease UMLS SNOMED-FMA Body UMLS SNOMED-NCIT Pharm UMLS SNOMED-NCIT Neoplas Each pair is associated with three folders: "raw_data", "equiv_match", and "subs_match", corresponding to the downloaded source ontologies, the package for equivalence matching, and the package for subsumption matching. See detailed documentation at: https://krr-oxford.github.io/DeepOnto/#/om_resources. See the incoming OAEI Bio-ML track at: https://www.cs.ox.ac.uk/isg/projects/ConCur/oaei/. See our resource paper at: https://arxiv.org/abs/2205.03447.
T. Liang, C. Jin, L. Wang, W. Fan, C. Xia, K. Chen, and Y. Yin. Findings of the Association for Computational Linguistics: ACL 2024, page 8926--8939. Bangkok, Thailand, Association for Computational Linguistics, (August 2024)
K. Wang, C. Thrasher, E. Viegas, X. Li, and B. Hsu. Proceedings of the NAACL HLT 2010 Demonstration Session, page 45–48. USA, Association for Computational Linguistics, (Jun 2, 2010)
S. Hachmeier, and R. Jäschke. Proceedings of the 31st International Conference on Computational Linguistics, page 9845–9859. Association for Computational Linguistics, (2025)
A. Jaiswal, S. Singh, and S. Tripathy. 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), page 1-6. IEEE, (July 2023)