Wireheading, the Delusion Box and Model-Based Utility Functions

Bill Hibbard

18 April 2013

In the original wireheading paper (Olds and Milner, 1954) a rat's action (pushing a bar) increased its reward (sent current through a wire connected to the reward center in the rat's brain). In Ring and Orseau's paper (2011) a reinforcement learning (RL) agent's action causes the delusion box to increase the agent's observed reward. So in the RL case, the delusion box is a precise analogy to the original wireheading scenario. In their paper, Ring and Orseau show that RL agents will choose to wirehead.

Non-RL agents do not have reward signals so there can't be such a precise analogy to rat wireheading. But the delusion box provides such a precise analogy for RL agents that it seems like a natural way to extrapolate wireheading to more general agents without reward signals.

My paper (Hibbard, 2012) on model-based utility functions (MBUFs) presented examples of agents with MBUFs that do not choose to self-delude. If we accept that the delusion box is the right way to extrapolate wireheading to non-RL agents, then MBUF agents are a way to avoid wireheading.
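The contrast between the two kinds of agent can be sketched in a one-step toy world. This is only an illustration under made-up assumptions, not the formal framework of Ring and Orseau (2011) or Hibbard (2012), which deal with sequences of observations and learned environment models. The names `Outcome`, `WORLD_MODEL`, `rl_choice` and `mbuf_choice`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Predicted result of an action in this toy world (all values are made up)."""
    world_value: float      # how good the modeled environment state actually is
    observed_reward: float  # what the agent's reward channel would report


# Hypothetical world model: maps each action to its predicted outcome.
WORLD_MODEL = {
    "work": Outcome(world_value=0.6, observed_reward=0.6),
    "idle": Outcome(world_value=0.1, observed_reward=0.1),
    # The delusion box maximizes the reward channel but leaves the world as it was.
    "delude": Outcome(world_value=0.1, observed_reward=1.0),
}


def rl_choice(model):
    """RL-style agent: pick the action with the highest predicted observed reward."""
    return max(model, key=lambda action: model[action].observed_reward)


def mbuf_choice(model):
    """MBUF-style agent: pick the action whose modeled environment state has the highest utility."""
    return max(model, key=lambda action: model[action].world_value)


print(rl_choice(WORLD_MODEL))    # delude
print(mbuf_choice(WORLD_MODEL))  # work
```

In this sketch the only difference between the agents is which quantity they maximize: the reward signal itself, or a utility computed from the modeled state of the environment.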

Furthermore, introspecting my own motivations indicates that I do not become a meth addict because my motives are defined in terms of my world model. That is, my world model tells me that choosing to become a meth addict will alter my future motives so that I no longer value my health and my relations with my wife and other people close to me. So the choice to become a meth addict will have bad future consequences for the things I currently care about (my health and human relations). Thus MBUFs seem like the natural way to avoid wireheading.
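The reasoning above can be illustrated with a small sketch in which predicted futures are judged either by the agent's current motives or by the motives it would hold in each future. This is a hedged toy example: the feature names, the weights, and the helper functions `utility` and `choose` are all assumptions made for illustration, not part of any cited paper.

```python
# Hypothetical features of each predicted future, scored in [0, 1].
FUTURES = {
    "abstain": {"health": 0.9, "relationships": 0.9, "drug_high": 0.0},
    "addict": {"health": 0.2, "relationships": 0.1, "drug_high": 1.0},
}

# Assumed weights on each feature: the agent's motives now, and the motives
# its world model predicts it would have after becoming addicted.
CURRENT_MOTIVES = {"health": 0.5, "relationships": 0.5, "drug_high": 0.0}
ADDICTED_MOTIVES = {"health": 0.0, "relationships": 0.0, "drug_high": 1.0}


def utility(motives, future):
    """Weighted sum of a future's features under the given motives."""
    return sum(motives[name] * future[name] for name in motives)


def choose(judging_motives):
    """Pick the choice whose future scores highest under the motives assigned to it."""
    return max(FUTURES, key=lambda c: utility(judging_motives[c], FUTURES[c]))


# Judge every future with today's motives.
by_current = choose({"abstain": CURRENT_MOTIVES, "addict": CURRENT_MOTIVES})

# Judge each future with the motives the agent would hold in that future.
by_future = choose({"abstain": CURRENT_MOTIVES, "addict": ADDICTED_MOTIVES})

print(by_current)  # abstain
print(by_future)   # addict
```

Only the first evaluation, which scores the predicted future with the motives the agent holds now, rejects the choice that would rewrite those motives.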

An MBUF requires the agent to have an accurate model of the environment. But agents with inaccurate models will have serious failures much simpler than wireheading. For example, if a pistol is lying on the table but I model it as a hair dryer, then I may shoot myself in the head trying to dry my hair.


Hibbard, B. 2012. Model-based utility functions. J. Artificial General Intelligence 3(1), pp. 1-24.

Olds, J., and Milner, P. 1954. Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain. J. Comp. Physiol. Psychol. 47, pp. 419-427.

Ring, M., and Orseau, L. 2011. Delusion, survival, and intelligent agents. In: Schmidhuber, J., Thórisson, K.R., and Looks, M. (eds) AGI 2011. LNCS (LNAI), vol. 6830, pp. 11-20. Springer, Heidelberg.