Abstract
It is clear that one of the primary tools we can use to mitigate the
potential risk from a misbehaving AI system is the ability to turn the system
off. As the capabilities of AI systems improve, it is important to ensure that
such systems do not adopt subgoals that prevent a human from switching them
off. This is a challenge because many formulations of rational agents create
strong incentives for self-preservation. This arises not from any built-in
instinct but from the fact that a rational agent maximizes expected utility and
cannot achieve whatever objective it has been given if it is dead. Our goal is
to study the incentives an agent has to allow itself to be switched off. We
analyze a simple game between a human H and a robot R, where H can press R's
off switch but R can disable the off switch. A traditional agent takes its
reward function for granted: we show that such agents have an incentive to
disable the off switch, except in the special case where H is perfectly
rational. Our key insight is that for R to want to preserve its off switch, it
needs to be uncertain about the utility associated with the outcome, and to
treat H's actions as important observations about that utility. (R also has no
incentive to switch itself off in this setting.) We conclude that giving
machines an appropriate level of uncertainty about their objectives leads to
safer designs, and we argue that this setting is a useful generalization of the
classical AI paradigm of rational agents.
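To make the incentive structure concrete, the following is a minimal numerical sketch of the off-switch game as described above. It assumes a rational H (who stops R exactly when the true utility is negative), a Gaussian belief for R over the unknown utility U, and Monte Carlo estimation; the function names and parameters are illustrative and not taken from the paper.

```python
import numpy as np

def expected_values(mu, sigma, n_samples=100_000, rng=None):
    """Estimate R's expected value for its three choices in the off-switch game.

    - 'act':   act directly (bypassing the off switch), worth E[U]
    - 'off':   switch itself off, worth 0
    - 'defer': let H decide; a rational H stops R exactly when U < 0,
               so deferring is worth E[max(U, 0)]
    Assumption: R's belief over the unknown utility U is Gaussian(mu, sigma).
    """
    rng = rng or np.random.default_rng(0)
    u = rng.normal(mu, sigma, n_samples)  # samples from R's belief about U
    return {
        "act": u.mean(),
        "off": 0.0,
        "defer": np.maximum(u, 0.0).mean(),
    }

if __name__ == "__main__":
    # With genuine uncertainty (sigma > 0), E[max(U, 0)] >= max(E[U], 0), so
    # deferring to H weakly dominates: R prefers to keep the off switch usable.
    print(expected_values(mu=0.5, sigma=1.0))
    # With (near) certainty about U, deferring offers no advantage over acting,
    # and R has no strict incentive to preserve the off switch.
    print(expected_values(mu=0.5, sigma=1e-9))
```

In this sketch the value of deferring comes entirely from treating H's decision as information about U; as R's uncertainty shrinks to zero, that informational value vanishes, mirroring the abstract's claim about traditional agents that take their reward function for granted.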