Abstract
Despite widespread use of LLMs as conversational agents, evaluations of
performance fail to capture a crucial aspect of communication: interpreting
language in context. Humans interpret language using beliefs and prior
knowledge about the world. For example, we intuitively understand the response
"I wore gloves" to the question "Did you leave fingerprints?" as meaning "No".
To investigate whether LLMs have the ability to make this type of inference,
known as an implicature, we design a simple task and evaluate widely used
state-of-the-art models. We find that, despite only evaluating on utterances
that require a binary inference (yes or no), most perform close to random.
Models adapted to be "aligned with human intent" perform much better, but still
show a significant gap with human performance. We present our findings as the
starting point for further research into evaluating how LLMs interpret language
in context and to drive the development of more pragmatic and useful models of
human discourse.