An adequate natural language description of developments in a real-world scene can be taken as proof of understanding what is going on. An algorithmic system that generates natural language descriptions from video recordings of road traffic scenes can thus be said to 'understand' its input to the extent that the algorithmically generated text is acceptable to the humans judging it. Fuzzy metric-temporal Horn logic (FMTHL) provides a formalism for representing both schematic and instantiated conceptual knowledge about the depicted scene and its temporal development. The resulting conceptual representation mediates in a systematic manner between the spatiotemporal geometric descriptions extracted from the video input and a module that generates natural language text. This article outlines a 30-year effort to create such a cognitive vision system, indicates its current status, summarizes lessons learned along the way, and discusses open problems against this background.