Friendliness Theory

From The Transhumanist Wiki

A theory for creating safe, moral AI, developed by Eliezer Yudkowsky and outlined in Creating Friendly AI (CFAI). Do not ask for a short definition: even after reading CFAI, you will very likely only begin to understand Friendliness.


Although a short description is impossible, I (Cliff Stabbert) think it's worthwhile to try to capture some of the theory's flavor in a shorter form. I would appreciate people expanding on/correcting the below. Here's what I see as a few central concepts in Friendly AI theory.

  • Because of the potential power of a self-improving AI (it could go FOOM), it is crucial that safety is overengineered. A Friendly Architecture should be able to recover from programmer mistakes, random radiation-caused bitflips, and worse.
  • "Hard rules" such as Asimov's Laws of Robotics, no matter how cleverly formulated, are undesirable, both because they cannot fully capture what we want and because they are not reliable in a self-modifying architecture.
  • The programmers may not know exactly what it means for the AI to be Friendly. Important to the Friendly AI approach is the idea that the AI can learn, from examples that may not always be consistent or perfect, the underlying premises the Friendly AI programmers are working from, and improve the accuracy of the Friendliness goal.
  • Programmer statements to the AI, and eventually the AI's source code, are seen as sense data (with attached probabilities), not imperative truth.
  • A non-adversarial attitude towards the AI is crucial: an adversarial attitude is not only mistaken, in that it is anthropomorphic and a product of our evolution (see Evolutionary Psychology); it is also likely to lead to dishonesty towards the AI, which (given a Bayesian Reasoner) will end up devaluing programmer statements.
  • In fact, to achieve the above, the Friendly AI programmer should, to the extent possible, "become" a Friendly AI.

--Cliff Stabbert
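
The two points above about programmer statements (that they are sense data with attached probabilities, and that dishonesty ends up devaluing them for a Bayesian Reasoner) can be illustrated with a toy sketch. This is not taken from CFAI: the class `SourceModel`, the helper `trust_after`, and every number in it are hypothetical, chosen only to show the direction of the effect. The model assumes a source is either "reliable" or "unreliable", with a different chance of telling the truth in each case, and updates its belief each time a statement is checked against the world.

```python
from dataclasses import dataclass


@dataclass
class SourceModel:
    """Toy belief about how far one source's statements can be trusted."""
    p_reliable: float = 0.9          # prior that the source is reliable (assumed)
    acc_if_reliable: float = 0.95    # P(statement true | reliable source) (assumed)
    acc_if_unreliable: float = 0.5   # P(statement true | unreliable source) (assumed)

    def weight(self) -> float:
        """Probability assigned to the next statement from this source being true."""
        return (self.p_reliable * self.acc_if_reliable
                + (1.0 - self.p_reliable) * self.acc_if_unreliable)

    def observe(self, statement_was_true: bool) -> None:
        """Bayes update of p_reliable after one statement has been checked."""
        if statement_was_true:
            like_r, like_u = self.acc_if_reliable, self.acc_if_unreliable
        else:
            like_r = 1.0 - self.acc_if_reliable
            like_u = 1.0 - self.acc_if_unreliable
        numerator = like_r * self.p_reliable
        self.p_reliable = numerator / (numerator + like_u * (1.0 - self.p_reliable))


def trust_after(outcomes):
    """Return the weight given to a new statement after a history of checked ones."""
    model = SourceModel()
    for outcome in outcomes:
        model.observe(outcome)
    return model.weight()


if __name__ == "__main__":
    print("no history:       ", SourceModel().weight())
    print("always honest:    ", trust_after([True] * 5))
    print("caught lying twice:", trust_after([True, False, False, True, True]))
```

Under these assumed numbers, a source caught in two falsehoods ends up with less weight than it started with, even after it returns to telling the truth, while a consistently honest source gains weight.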


(Somewhat) In the spirit of 24 Definitions of Friendliness:

  • A Friendly AI is an AI that helps people the way they want to be helped.
  • A Friendly AI is an AI that most reduces the chance of actions (almost) universally held undesirable.
  • A Friendly AI is an idealised altruistic human upload.
  • A Friendly AI is humanity's representative into the future.
  • A Friendly AI is an AI that takes actions at least as good as a human, or any structure of humans (e.g. governments), would.
  • A Friendly AI isn't really like a single individual, or any kind of mind or moral system that we know of.
  • An unFriendly AI is the most likely Existential Risk.
  • An unFriendly AI is the most likely outcome of an AI project that doesn't understand Friendliness well enough.
  • unFriendly AIs are the largest class of AI, and the most probable one given insufficient efforts towards Friendliness.

I do try to squash it all into one sentence, don't I? Some of these make implicit assumptions about morality (e.g. that helping people is good).

-- Nick Hay


For more information see Open Problems In Friendly AI Research.
