From sentience@pobox.com Tue Jun 13 04:34:13 2006
Date: Mon, 12 Jun 2006 17:34:09 -0700
From: Eliezer S. Yudkowsky
Reply-To: sl4@sl4.org
To: sl4@sl4.org
Cc: World Transhumanist Association Discussion List, ExI chat list,
    agi@v2.listbox.com
Subject: Re: Two draft papers: AI and existential risk; heuristics and biases

Bill Hibbard wrote:
> Eliezer,
>
>> I don't think it inappropriate to cite a problem that is general
>> to supervised learning and reinforcement, when your proposal is
>> to, in general, use supervised learning and reinforcement. You
>> can always appeal to a "different algorithm" or a "different
>> implementation" that, in some unspecified way, doesn't have a
>> problem.
>
> But you are not demonstrating a general problem. You are
> instead relying on specific examples (primitive neural
> networks and systems that cannot distinguish a human from
> a smiley) that fail trivially. You should be clear whether
> you claim that reinforcement learning (RL) must inevitably
> lead to:
>
> 1. A failure of intelligence.
>
> or:
>
> 2. A failure of friendliness.

As it happens, my model of intelligence says that what I would call
"reinforcement learning" is not, in fact, adequate to intelligence.
However, the fact that you believe "reinforcement learning" is
adequate to intelligence suggests that you would take any possible
factor I thought was additionally necessary, and claim that it was
part of the framework you regard as "reinforcement learning".

What I am presently discussing is a failure of friendliness. However,
the fact that we use different models of intelligence is also
responsible for our disagreement about this second point. Explaining
a model of intelligence tends to be very difficult, and so, from my
perspective, the important thing is that you should understand that I
have a legitimate (that is, honestly meant) disagreement with you
about what reinforcement systems do and what happens in practice when
you use them.

By the way, I've got some other tasks to take on in the near future,
and I may not be able to discuss the actual technical disagreement at
length. As I said, I will include a footnote pointing to your
disagreement, and to my response.

> Your example of the US Army's primitive neural network
> experiments is a failure of intelligence. Your statement
> about smiley faces assumes a general success at intelligence
> by the system, but an absurd failure of intelligence in the
> part of the system that recognizes humans and their emotions,
> leading to a failure of friendliness.

Let me try to analyze the model of intelligence behind your
statement. You're thinking something along the lines of:

  "Supervised algorithms" (sort of like those in the most advanced
  artificial neural networks) give rise to "reinforcement learning";

  "Reinforcement learning" gives rise to "intelligence";

  "Intelligence" is what lets an AI shape the world, and also what
  tells it that tiny molecular smiley faces are bad examples of
  happiness, while an actual human smiling is a good example of
  happiness.

In your journal paper from 2004, you seem to propose a two-layer
system of reinforcement, with the first layer being observed
agreement from humans as a reinforcer of its definition of happiness,
and the second layer being reinforcement of behaviors that lead to
"happiness" as thus defined. So in this case, we substitute:
"'Intelligence' is what tells an AI that tiny molecular speakers
chirping "Yes! Good job!"
are bad examples of agreement with its definition of happiness, while
an actual human saying "Yes! Good job!" is a good example."

After all, it sure seems stupid to confuse human smiles with tiny
molecular smiley faces! How silly of the army tank classifier, not to
realize that it was supposed to detect tanks, instead of detecting
cloudy days! But a neural network the size of a planet, given the
same examples, would have failed in the same way.

You previously said:

> When it is feasible to build a super-intelligence, it will
> be feasible to build hard-wired recognition of "human facial
> expressions, human voices and human body language" (to use
> the words of mine that you quote) that exceeds the recognition
> accuracy of current humans such as you and me, and will
> certainly not be fooled by "tiny molecular pictures of
> smiley-faces." You should not assume such a poor
> implementation of my idea that it cannot make
> discriminations that are trivial to current humans.

It's trivial to discriminate between a photo of a camouflaged tank
and a photo of an empty forest. They're different pixel maps. If you
transform them into strings of 1s and 0s, they're different strings.
Discriminating between them is as simple as testing them for
equality.

But there's an exponentially vast space of functions that classify
all possible pixel maps of a fixed size into "plus" and "minus"
spaces. If you talk about the space of all possible computations that
implement these classification functions, the space is trivially
infinite and trivially undecidable.

Of course a super-AI, or an ordinary neural network, can trivially
discriminate between a tiny molecular picture of a smiley face and a
smiling human, or between two pictures of the same smiling human from
slightly different angles. The issue is whether the AI will
*classify* these trivially discriminable stimuli into "plus" and
"minus" spaces the way *you* hope it will.

If you look at the actual pixel map that shows a camouflaged tank,
there's not a little XML tag in the picture itself that says "Hey,
network, classify this picture as a good example!" The classification
is not a property of the picture alone. Thinking as though the
classification is a property of the picture is an instance of the
Mind Projection Fallacy, as mentioned in my AI chapter.

Maybe you actually *wanted* the neural network to discriminate sunny
days from cloudy days. So you fed it exactly the same data instances,
with exactly the same supervision, and used a slightly different
learning algorithm - and found to your dismay that the network was so
stupid, it learned to detect tanks instead of cloudy days. But a
really smart intelligence would not be so stupid that it couldn't
tell the difference between cloudy days and sunny days.

There are many possible ways to *classify* different data instances,
and the classification involves information that is not directly
present in the instances. In contrast, finding that two instances are
not identical uses only information present in the data instances
themselves.

Saying that a superintelligence could discriminate between tiny
molecular smiley faces and human smiles is, I would say, correct. But
it is not correct to say that any sufficiently intelligent mind will
automatically *classify* the instances the way you want them to.
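To make the discrimination-versus-classification point concrete, here
is a minimal sketch in Python. The 4-pixel "images", their labels, and
both candidate rules are invented for this illustration; they stand in
for the tank photos and are not taken from any real classifier or from
Hibbard's proposal.

# Toy illustration of discrimination vs. classification.
# The 4-pixel "images", labels, and both rules below are invented for
# this sketch; they stand in for the tank/forest photos in the example.

tank_photo = (1, 0, 1, 1)     # hypothetical instance, labeled "plus"
forest_photo = (0, 0, 1, 0)   # hypothetical instance, labeled "minus"

# Discrimination uses only information present in the instances:
assert tank_photo != forest_photo   # trivially different bit strings

# Classification means choosing one function out of a vast space.
# Both rules below are consistent with the two labeled examples...
def rule_a(img):
    return img[0] == 1        # "plus iff the first pixel is set"

def rule_b(img):
    return sum(img) >= 3      # "plus iff at least three pixels are set"

for img, label in [(tank_photo, True), (forest_photo, False)]:
    assert rule_a(img) == label
    assert rule_b(img) == label

# ...yet they disagree on an instance the training data never pinned down:
new_photo = (1, 1, 0, 0)
print(rule_a(new_photo), rule_b(new_photo))   # True False

# With n possible images and k of them labeled, 2**(n - k) distinct
# classification functions remain consistent with the training set:
n, k = 2 ** 4, 2
print(2 ** (n - k))   # 16384 consistent classifiers, even in this toy space

Testing the two instances for inequality uses nothing beyond the
instances themselves; singling out the intended rule from the
thousands of consistent candidates requires information that is not
in the data.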
Let's say that the AI's training data is:

Dataset 1:

  Plus:  {Smile_1, Smile_2, Smile_3}
  Minus: {Dog_1, Cat_1, Dog_2, Dog_3, Cat_2, Dog_4, Boat_1, Car_1,
          Dog_5, Cat_3, Boat_2, Dog_6}

Now the AI grows up into a superintelligence, and encounters this
data:

Dataset 2:

  {Dog_7, Cat_4, Galaxy_1, Dog_8, Nanofactory_1, Smiley_1, Dog_9,
   Cat_5, Smiley_2, Smile_4, Boat_3, Galaxy_2, Nanofactory_2,
   Smiley_3, Cat_6, Boat_4, Smile_5, Galaxy_3}

It is not a property *of dataset 2* that the classification *you
want* is:

  Plus:  {Smile_4, Smile_5}
  Minus: {Dog_7, Cat_4, Galaxy_1, Dog_8, Nanofactory_1, Smiley_1,
          Dog_9, Cat_5, Smiley_2, Boat_3, Galaxy_2, Nanofactory_2,
          Smiley_3, Cat_6, Boat_4, Galaxy_3}

Rather than:

  Plus:  {Smiley_1, Smiley_2, Smile_4, Smiley_3, Smile_5}
  Minus: {Dog_7, Cat_4, Galaxy_1, Dog_8, Nanofactory_1, Dog_9, Cat_5,
          Boat_3, Galaxy_2, Nanofactory_2, Cat_6, Boat_4, Galaxy_3}

If you want the top classification rather than the bottom one, you
must infuse into the *prior state* of the AI some *additional
information*, not present in dataset 2. That, of course, is the point
of giving the AI dataset 1. But if you do not understand *how* the AI
is classifying dataset 1, and then the AI enters a drastically
different context, there is the danger that the AI is classifying
dataset 1 using a very different method from the one *you yourself
originally used* to classify dataset 1, and that the AI will, as a
result, classify dataset 2 in ways different from how you yourself
would have classified dataset 2. (See the sketch further down for a
toy illustration of this divergence.)

(This line of reasoning leads to "Coherent Extrapolated Volition", if
I go on to ask what happens if I would have wanted to classify
dataset 1 itself a bit differently, had I more empirical knowledge or
thought faster.)

You cannot throw computing power at this problem. Brute force, or
even brute intelligence, is not the issue here.

> If your claim is that RL can succeed at intelligence but must
> lead to a failure of friendliness, then it is reasonable to
> cite and quote me. But please use my 2004 AAAI paper . . .
>
>> If you are genuinely repudiating your old ideas ...
>
> . . . use my 2004 AAAI paper because I do repudiate the
> statement in my 2001 paper that recognition of humans and
> their emotions should be hard-wired (i.e., static). That
> is just the section of my 2001 paper that you quoted.

I will include, in the footnote, a statement that your 2004 paper
proposes a two-layer system. But this is not at all germane to the
point I was making - though the footnote will serve to notify readers
that your ideas have not remained static. Please remember that my
purpose is not to present Bill Hibbard's current ideas, but to use,
as an example of failure, an idea that you published in a
peer-reviewed journal in 2001. If you have taken alarm at the notion
of hardwiring happiness as reinforcement, then you ought to say
something like: "Though it makes me uncomfortable, I can't ethically
argue that you should not publish my old mistake as a warning to
others who might otherwise follow in my footsteps; but you must
include a footnote saying that I now also agree it's a terrible
idea."

Most importantly, your 2004 paper simply does not contain any
paragraph that serves the introductory and expository role of the
paragraph I quoted from your 2001 paper. There's nothing I can quote
from 2004 that will make as much sense to the reader. If I were
praising your 2001 paper, rather than criticizing it, would you have
the same objection?
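Returning to the dataset example above, here is a minimal sketch in
Python of how two learners can fit dataset 1 perfectly and still
split dataset 2 differently. Every instance, feature, and
classification rule below is invented for illustration (only a subset
of the listed instances is used); none of it is taken from Hibbard's
proposal or from any actual system.

# Toy version of the dataset 1 / dataset 2 example. Instances are
# hypothetical feature dictionaries; the features and rules are invented.

dataset_1 = {
    "Smile_1": {"smiley_shape": True, "smiling_human": True},
    "Smile_2": {"smiley_shape": True, "smiling_human": True},
    "Smile_3": {"smiley_shape": True, "smiling_human": True},
    "Dog_1": {"smiley_shape": False, "smiling_human": False},
    "Cat_1": {"smiley_shape": False, "smiling_human": False},
    "Boat_1": {"smiley_shape": False, "smiling_human": False},
    "Car_1": {"smiley_shape": False, "smiling_human": False},
}
labels_1 = {name: name.startswith("Smile_") for name in dataset_1}

# Two ways a learner might have generalized from dataset 1:
def classify_by_shape(x):
    return x["smiley_shape"]    # "plus iff a smiley-shaped pattern is present"

def classify_by_human(x):
    return x["smiling_human"]   # "plus iff an actual smiling human is present"

# Both rules agree with every label in dataset 1:
for name, feats in dataset_1.items():
    assert classify_by_shape(feats) == labels_1[name]
    assert classify_by_human(feats) == labels_1[name]

# Dataset 2 contains instances that dataset 1 never forced apart:
dataset_2 = {
    "Smile_4": {"smiley_shape": True, "smiling_human": True},
    "Smiley_1": {"smiley_shape": True, "smiling_human": False},  # tiny molecular smiley
    "Galaxy_1": {"smiley_shape": False, "smiling_human": False},
}
for name, feats in dataset_2.items():
    print(name, classify_by_shape(feats), classify_by_human(feats))
# Smile_4: both say plus.  Galaxy_1: both say minus.
# Smiley_1: the two rules diverge, and nothing *in dataset 2* says which is right.

Nothing in dataset 2 itself privileges one rule over the other; the
preference for smiling humans over smiley shapes has to come from the
prior state of the learner.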
> Not that I am sure that hard-wired recognition of humans and
> their emotions inevitably leads to a failure of friendliness,

Okay, now it looks like you *haven't* taken alarm at this.

> since the super-intelligence (SI) may understand that humans
> would be happier if they could evolve to other physical forms
> but still be recognized by the SI as humans, and decide to
> modify itself (or build an improved replacement). But if this
> is my scenario, then why not design continuing learning of
> recognition of humans and their emotions into the system in
> the first place. Hence my change of views.

I think at this point you're just putting yourself into the SI's
shoes, empathically, using your own brain to make predictions about
what the SI will do; not reasoning about the technical difficulties
associated with infusing certain information into the SI.

> I am sure you have not repudiated everything in CFAI,

I can't think offhand of any particular positive proposal I would say
was correct. (Maybe the section in which I rederived the Bayesian
value of information, but that's standard.) Some negative criticisms
of other possible methods and their failures, as presented in CFAI,
continue to hold. It is far easier to say what is wrong than what is
right.

> and I have not repudiated everything in my earlier publications.
> I continue to believe that RL is critical to achieving
> intelligence with a feasible amount of computing resources,
> and I continue to believe that collective long-term human
> happiness should be the basic reinforcement value for SI.
> But I now think that a SI should continue to learn recognition
> of humans and their emotions via reinforcement, rather than
> these recognitions being hard-wired as the result of supervised
> learning. My recent writings have also refined my views about
> how human happiness should be defined, and how the happiness of
> many people should be combined into an overall reinforcement
> value.

It is not my present purpose to criticize these new ideas of yours at
length, only to discuss the technical problem with using
reinforcement learning to do pretty much anything.

>> I see no relevant difference between these two proposals, except
>> that the paragraph you cite (presumably as a potential
>> replacement) is much less clear to the outside academic reader.
>
> If you see no difference between my earlier and later ideas,
> then please use a scenario based on my later papers. That will
> be a better demonstration of the strength of your arguments,
> and be fairer to me.

If you had a paragraph serving an equivalent introductory purpose in
a later peer-reviewed paper, I would use it. But the paragraphs from
your later papers are much less clear to the outside academic reader,
and it would not be clear what I am criticizing, even though it is
the same problem in both cases. That's the sticking point from my
perspective.

> Of course, it would be best to demonstrate your claim (either
> that RL must lead to a failure of intelligence, or can succeed
> at intelligence but must lead to a failure of friendliness) in
> general. But if you cannot do that and must rely on a specific
> example, then at least do not pick an example that fails for
> trivial reasons.

The reasons are not trivial; they are general. I know it seems
"stupid" and "trivial" to you, but getting rid of the stupidness and
triviality is a humongous nontrivial challenge that cannot be solved
by throwing brute intelligence at the problem.
You do not need to agree with my criticism before I can publish a
paper critical of your ideas; all the more so if I include a URL to
your rebuttal. Let the reader judge.

> As I wrote above, if you think RL must fail at intelligence,
> you would be best to quote Eric Baum.

Eric Baum's thesis is not reinforcement learning, it is Occam's
Razor. Frankly, I think you are too hung up on reinforcement
learning. But that is a separate issue.

> If you think RL can succeed at intelligence but must fail at
> friendliness, but just want to demonstrate it for a specific
> example, then use a scenario in which:
>
> 1. The SI recognizes humans and their emotions as accurately
> as any human, and continually relearns that recognition as
> humans evolve (for example, to become SIs themselves).

You say "recognize as accurately as any human", implying it is a
feature of the data. Better to say "classify in the same way humans
do".

> 2. The SI values people after death at the maximally unhappy
> value, in order to avoid motivating the SI to kill unhappy
> people.
>
> 3. The SI combines the happiness of many people in a way (such
> as by averaging) that does not motivate a simple numerical
> increase (or decrease) in the number of people.
>
> 4. The SI weights unhappiness stronger than happiness, so that
> it focuses its efforts on helping unhappy people.
>
> 5. The SI develops models of all humans and what produces
> long-term happiness in each of them.
>
> 6. The SI develops models of the interactions among humans
> and how these interactions affect the happiness of each.

Rearranging deck chairs on the Titanic; in my view this goes down
completely the wrong pathway for how to solve the problem, and it is
not germane to the specific criticism I leveled.

> I do not pretend to have all the answers. Clearly, making RL work
> will require solving a number of currently unsolved problems.

RL is not the true Way. But it is not my purpose to discuss that now.

> I appreciate your offer to include my URL in your article,
> where I can give my response. Please use this (please proofread
> carefully for typos in the final galleys):
>
> http://www.ssec.wisc.edu/~billh/g/AIRisk_Reply.html

After I send you the revised draft, it would be helpful if I could
see at least some reply at that URL before final galleys, so that I
know I'm not directing my readers toward a blank page.

> If you take my suggestion, by elevating your discussion to a
> general explanation of why RL systems must fail, or at least
> using a strong scenario, that will make my response more
> friendly, since I am happier to be named as an advocate of RL
> than to be conflated with trivial failure.

I will probably give a URL to my own reply, which might well just be
a link to this email message. This email does - at least by my
lights - explain what I think the general problem is, and why the
example given is not due to a trivial lack of computing power or a
failure to read information directly present in the data itself.

> I would prefer that you not use
> the quote you were using from my 2001 paper, as I repudiate
> supervised learning of hard-wired values. Please use some quote
> from and cite my 2004 AAAI paper, since there is nothing in it
> that I repudiate yet (but you will find more refined views in my
> 2005 on-line paper).

I am sorry and I do sympathize, but there simply isn't any
introductory paragraph in your 2004 paper that would make as much
sense to the reader.
My current plan is for the footnote to say that your proposal has
changed to a two-layer system, and cite the 2004 paper. From my
perspective they are not different in any important sense. I hope
this satisfies you; I do need to move on.

-- 
Eliezer S. Yudkowsky                          http://singinst.org/
Research Fellow, Singularity Institute for Artificial Intelligence