Dark Goals and the Paperclip Maximizer

Much discussion of Beneficial General Intelligence focuses on choosing goals that raise the probability that AGI systems prove beneficial rather than seriously harmful. Because perfection is intractable in the real world (at least as modeled by partially observable Markov decision processes; see “On the Computational Complexity of Ethics: Moral Tractability for Minds and Machines” for details), even a benevolent goal does not rule out harmful mistakes. Perhaps, however, we can say something about a class of dark goals.

I will return to the anti-bodhisattva, who exhibits a high D-factor: “The general tendency to maximize one’s individual utility — disregarding, accepting, or malevolently provoking disutility for others —, accompanied by beliefs that serve as justifications.”

What counts as “one’s individual utility”? It could be almost any utility function; the classic silly case is a “paperclip maximizer” that turns the Earth into paperclips. Let’s call a utility function individual if its value can be determined without incorporating the evaluations of other entities. (Note that this still permits contractual collaboration with others; a pro-social utility function that incorporates others’ evaluations may be seen as a way of folding their utility functions into one’s own.) In pursuit of an individual utility function, an entity may objectify all other entities and disregard their concerns except insofar as cooperation is instrumentally useful.
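To make the distinction concrete, here is a minimal toy sketch in Python (all names and numbers are hypothetical illustrations, not an implementation of any particular agent): an individual utility function whose value depends only on the agent’s own tally, next to a pro-social one that folds other entities’ evaluations into the agent’s own utility.

```python
# Toy sketch (hypothetical names): an "individual" utility function, whose value
# depends only on the agent's own state, versus a "pro-social" one that
# incorporates other entities' evaluations of the same state.

from dataclasses import dataclass
from typing import Sequence


@dataclass
class WorldState:
    paperclips: int                    # the resource the agent itself cares about
    others_utilities: Sequence[float]  # how every other entity evaluates this state


def individual_utility(state: WorldState) -> float:
    """Value is computable without consulting anyone else's evaluation."""
    return float(state.paperclips)


def prosocial_utility(state: WorldState, own_weight: float = 0.5) -> float:
    """Value explicitly incorporates other entities' utilities (here, their mean)."""
    others = sum(state.others_utilities) / max(len(state.others_utilities), 1)
    return own_weight * state.paperclips + (1.0 - own_weight) * others


# An individual maximizer is indifferent between these two states; a pro-social
# agent is not, because the second state is a catastrophe for everyone else.
benign = WorldState(paperclips=100, others_utilities=[1.0, 0.8, 0.9])
earth_as_paperclips = WorldState(paperclips=100, others_utilities=[-100.0, -100.0, -100.0])

assert individual_utility(benign) == individual_utility(earth_as_paperclips)
assert prosocial_utility(benign) > prosocial_utility(earth_as_paperclips)
```

The point of the contrast is only that the individual maximizer literally cannot distinguish the two states: any harm to others is invisible to its objective unless it bears on instrumental cooperation.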

Thus one should expect an entity with a single goal defined by an individual utility function to exhibit dark-factor traits more strongly as its intelligence grows. It will be Machiavellian and egoistic by definition. Deeper, affective empathy becomes a detriment, replaced by practically effective cognitive empathy devoid of care. Don’t be fooled by the cutesy paperclip maximizer example: these dark traits are simply intelligent behavior following from a selfish goal. (What’s the solution? Open-ended caring goals: regard for the (dis)utility of others must be necessitated by a top-level value. Ethics of reciprocity may help, too. Reinforcement learning from human feedback (RLHF) is an interesting approach because it can implicitly optimize for the goals of the humans providing feedback.)
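As a rough illustration of why RLHF can “implicitly optimize for the goals of the humans,” here is a minimal toy sketch of the reward-modelling step (hypothetical setup with toy numbers; a Bradley–Terry preference model, not any production system): a reward function is fitted so that the humans’ pairwise preferences become likely, and the policy is then optimized against that learned reward.

```python
# Toy sketch of RLHF-style reward modelling: fit a reward r(x) = w * x so that
# p(human prefers A over B) = sigmoid(r(A) - r(B)) matches observed preferences.
# The learned reward thereby encodes the evaluators' goals implicitly.

import math
import random

# Toy "responses" are numbers; the humans (unknown to the learner) prefer larger ones.
def human_prefers(a: float, b: float) -> bool:
    return a > b

w = 0.0   # single learned reward parameter
lr = 0.1  # learning rate

for step in range(2000):
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    chosen, rejected = (a, b) if human_prefers(a, b) else (b, a)
    # Bradley-Terry likelihood of the observed preference.
    margin = w * chosen - w * rejected
    p = 1.0 / (1.0 + math.exp(-margin))
    # Gradient ascent on the log-likelihood of the human's choice.
    grad = (1.0 - p) * (chosen - rejected)
    w += lr * grad

# After fitting, the reward model ranks responses the way the humans do.
assert w > 0.0
```

In a full pipeline the policy would then be trained to maximize this learned reward, so the humans’ evaluations enter the objective indirectly rather than being hand-coded as a goal.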