One proposal to train AIs that can be useful is to have ML models debate each other about the answer to a human-provided question, where the human judges which side has won. In this episode, I talk with Beth Barnes about her thoughts on the pros and cons of this strategy, what she learned from seeing how humans behaved in debate protocols, and how a technique called imitative generalization can augment debate. Those who are already quite familiar with the basic proposal might want to skip past the explanation of debate to 13:00, "what problems does it solve and does it not solve".

Interviewer: Hello everybody, today I'm going to be talking to Beth Barnes. She's currently a researcher at OpenAI, and before that she was the research assistant to the chief scientist at DeepMind, so that's Shane Legg. Today we'll be talking about her work related to the topic of AI alignment via debate. So I guess my first question is: what is AI safety, or AI alignment, via debate? What's the idea?

Beth Barnes: So debate is pretty closely related to IDA, or Iterated Distillation and Amplification, Paul's idea. I guess there are several different ways to explain it. One is: what we want from an alignment technique is something where, for everything the model knows, we can extract it, or create an overseer that knows all the things the model knows and can therefore oversee it adequately. And debate is one way to do that, and to do it fairly efficiently.

The structure that's going on there: in IDA you have an implicit tree that's analogous to the tree in humans consulting HCH - questions, answers to those questions, sub-questions, answers to those sub-questions, questions to help with answering those, and so on. Debate you can think of as a different way to interact with the same kind of structure, where rather than imitating all of the parts of this tree, you have two ML models that have this tree in their heads, and you take some path down this tree until you get to something that is human-checkable.

One way to think about this interaction is that one debater has a tree in their head, and the other debater is traversing it, looking for a leaf that has a flaw in it. If the tree has a flaw anywhere, it has at least one flawed leaf; so if you see them do that traversal and the leaf they end at doesn't have a flaw, you can be confident that's a property of the whole tree. So you're able to verify properties of the whole tree just by looking at one path down it.
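As a concrete reading aid, here is a minimal sketch of that one-path verification in Python. None of it is from the episode: the `Claim` class, the `adversary_pick` policy, and the perfect-flaw-finder assumption are hypothetical stand-ins for the traversal Beth describes.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """A node in one debater's implicit argument tree."""
    text: str
    sound: bool  # whether a human judge would accept this claim on direct inspection
    subclaims: list["Claim"] = field(default_factory=list)

def adversary_pick(subclaims: list["Claim"]) -> "Claim":
    """The opposing debater's move: descend into the sub-claim it thinks is weakest.
    Modeled here as a perfect flaw-finder; that assumption is what makes
    checking a single path informative about the whole tree."""
    flawed = [c for c in subclaims if not c.sound]
    return flawed[0] if flawed else subclaims[0]

def debate(root: Claim) -> bool:
    """Walk one root-to-leaf path; the human judge checks only the final leaf."""
    node = root
    while node.subclaims:
        node = adversary_pick(node.subclaims)
    return node.sound  # the judge's verdict, based on the one leaf they can check

# If the adversary reliably finds flaws, a sound leaf on its chosen path
# is evidence that every leaf, and hence the whole tree, is sound.
tree = Claim("main answer", True, [
    Claim("supporting claim A", True),
    Claim("supporting claim B", True, [Claim("sub-claim B1", True)]),
])
assert debate(tree)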
Interviewer: So it sounds like this is kind of related to this thing you called iterated distillation and amplification. Concretely, what is that? And what's this tree you're talking about?

Beth Barnes: So maybe the place to start is this other thing I've mentioned, humans consulting HCH. It's a recursive acronym for the idea that one model of what a good answer to a question is, is what a human would answer if they thought more and had more resources. And one way to set that up specifically is: how a human would answer if they could ask sub-questions to copies of themselves, which could ask sub-questions to copies of themselves, and so on. So you split some difficult task, one we don't know the correct answer to, into lots of smaller tasks, and there's some kind of claim that, because this HCH tree is made entirely of humans who are trying to give a good answer, HCH gives aligned answers. And then if you can build something that's analogous to that, it is also trustworthy.

Interviewer: So in this case, is the idea that - I'm imagining a tree where you have a main question and then there are sub-questions coming off of that. So the main question might be - do you have an example in your head of a starting question you could ask HCH?

Beth Barnes: I guess you could be like, "where should I go?" The example in the debate paper, I think, is "where should I go for my holidays?" So some sub-questions might be: which sorts of climates do I like? How expensive are flights to different places? Is my passport ready, and do I need it to go to different places? So if you're doing this question in the HCH tree, you can imagine you just have infinite copies of yourself, and you pass these sub-questions to those copies.
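To make that recursion concrete, here is a minimal sketch of humans consulting HCH on the holiday question. It is not from the episode: `decompose`, `base_answer`, `hch`, the hard-coded sub-questions, and the depth cap are all hypothetical stand-ins for what the humans in the tree would do.

```python
def decompose(question: str) -> list[str]:
    """The human's first move: split a hard question into sub-questions.
    Hard-coded for the holiday example; in HCH a human does this on the fly."""
    if question == "Where should I go for my holidays?":
        return [
            "Which sorts of climates do I like?",
            "How expensive are flights to different places?",
            "Is my passport ready, and do I need it for each destination?",
        ]
    return []  # simple enough to answer directly

def base_answer(question: str) -> str:
    """Leaf case: a human answers a small question without further help."""
    return f"<human's direct answer to: {question}>"

def hch(question: str, depth: int = 3) -> str:
    """Humans consulting HCH: a human may pass sub-questions to fresh copies
    of themselves, each of which may recurse in the same way."""
    subs = decompose(question) if depth > 0 else []
    if not subs:
        return base_answer(question)
    sub_answers = [hch(q, depth - 1) for q in subs]
    # The top-level human combines the copies' answers into a final answer.
    return f"<answer to {question!r}, combining {sub_answers}>"

print(hch("Where should I go for my holidays?"))
```

In the actual proposal each function call is a real human (the "infinite copies of yourself"), and IDA, as discussed above, involves imitating the parts of this tree rather than running it with literal humans.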