AGI through Large-Scale, Multimodal Bayesian Learning

 Posted by Jeriaska on June 12th, 2008

At the AGI-08 conference on artificial general intelligence, Brian Milch presented on his paper “Artificial General Intelligence through Large-Scale, Multimodal Bayesian Learning.” An artificial system that achieves human-level performance on open-domain tasks must have a huge amount of knowledge about the world. He argues that the most feasible way to construct such a system is to let it learn from the large collections of text, images, and video that are available online. More specifically, the system should use a Bayesian probability model to construct hypotheses about both specific objects and events, and general patterns that explain the observed data.


The following transcript of Brian Milch’s AGI-08 presentation on his paper “AGI through Large-Scale, Multimodal Bayesian Learning” has been corrected and approved by the speaker. Video is also available.

AGI through Large-Scale, Multimodal Bayesian Learning

bayesian_learning_01.png

This is going to be a less technical talk than the last two. The claim is that one path to achieving artificial general intelligence could be Bayesian learning from large amounts of data on the web. Let me start with one of the prime examples of a kind of thing we would like AGI to do for us, which is to answer questions. The range of questions people could ask the system is incredibly large.

bayesian_learning_02.png

Here are a couple: “How can I get from Boston to New Haven without a car?” “How many U.S. congress members have PhDs?” Or, “Here is a picture, about how cold is it in there based on what people are wearing?”

If you gave these to a human being and gave them internet access, they could answer all these questions without too much trouble, but if you gave Google these questions, it would not give you the answers. To answer them, a system would have to combine information from multiple webpages, and it would also need broad and deep knowledge about the world.

It would need to know, for example, that when you need to get from one city to another and they are this far apart, you are not going to walk; you are going to need some form of transportation. That kind of general knowledge matters even more for picture interpretation, where you are saying those are people in the picture, they are wearing jackets, and that suggests something about the temperature.

bayesian_learning_03.png

This is knowledge about topics that range from transportation, to clothing, and so forth. The question is, how can we possibly acquire all of that knowledge? The proposal I am making, which is not original, is that we should be able to learn knowledge from large amounts of data online, particularly now that the web contains not only text but images and video. The general proposal is to learn by Bayesian belief updating: maintain a probability distribution over models of how the world works in general (people tend to wear coats when it is cold) and also the past, present and future states of the world (there is a scheduled bus between Boston and New Haven).
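As a minimal sketch of what Bayesian belief updating over "tendencies" can look like (an editorial illustration, not code from the talk), the example below keeps a probability distribution over three candidate values of P(person wears a coat | it is cold) and updates it after each observation. The candidate values, the uniform prior, and the observation list are all made-up assumptions.

```python
# Hypothetical candidate tendencies: each number is one possible value of
# P(person wears a coat | it is cold). These values are illustrative only.
hypotheses = [0.1, 0.5, 0.9]

# Start from a uniform prior over the candidate tendencies.
prior = {h: 1 / len(hypotheses) for h in hypotheses}


def update(belief, wore_coat):
    """One step of Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {
        h: p * (h if wore_coat else 1 - h) for h, p in belief.items()
    }
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


# Made-up observations from cold scenes: True means the person wore a coat.
observations = [True, True, False, True, True]

belief = prior
for obs in observations:
    belief = update(belief, obs)

# The hypothesis with the highest posterior probability after all observations.
best = max(belief, key=belief.get)
print(best, belief)
```

A realistic system would place the distribution over far richer model structures than three numbers, but the update rule is the same.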

This is not a fully fleshed out proposal, but I am going to go through some of the issues that arise in thinking about how to make this a reality. The first one is looking at one of the variables that we are going to be thinking about in our probability models.

bayesian_learning_04.png

This is sort of a complicated slide. At the bottom level, we get video data and we get linguistic data. What this diagram is supposed to be is a schematic of a Bayesian network or a probabilistic graphical model. These are the observed variables. Based on those observations, we want to make inferences about all these other things in here.

In particular, what we really care about is what I have called “world history,” which is the past, present and future states of the world; and the tendencies which govern what happens in the world. To reason about those things, we are also going to have to make inferences. If I see some video, what is going on in this scene? What are the objects, what is occluding what? Things like that. That is going to involve reasoning about what objects look like. What do people look like? What do jackets look like? That kind of question.

Those are also variables that I am not going to know initially, but through looking at a lot of videos I am going to learn what those things look like. Then there are also variables for language use. What words do people use in different languages, for different concepts? How do people put sentences together? All those things have to be learned.
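To make the idea of inferring hidden variables from observed video and linguistic data concrete, here is a toy sketch under stated assumptions: one hidden variable ("it is cold") with two observed children, one standing in for video evidence and one for text evidence. The structure and every probability in it are invented for illustration; the network on the slide is far larger.

```python
# Assumed prior probability that the hidden state "it is cold" is true.
P_COLD = 0.3

# Assumed conditional probabilities of each observation given the hidden state.
P_JACKETS_GIVEN_COLD = {True: 0.9, False: 0.2}    # video shows jackets
P_TEXT_GIVEN_COLD = {True: 0.7, False: 0.1}       # text mentions the cold


def likelihood(table, cold, observed):
    """Probability of the observed value (True/False) given the hidden state."""
    p = table[cold]
    return p if observed else 1 - p


def posterior_cold(jackets_seen, text_says_cold):
    """Infer P(cold | observations) by enumerating both values of the hidden state."""
    joint = {}
    for cold in (True, False):
        prior = P_COLD if cold else 1 - P_COLD
        joint[cold] = (
            prior
            * likelihood(P_JACKETS_GIVEN_COLD, cold, jackets_seen)
            * likelihood(P_TEXT_GIVEN_COLD, cold, text_says_cold)
        )
    return joint[True] / (joint[True] + joint[False])


print(posterior_cold(jackets_seen=True, text_says_cold=True))
print(posterior_cold(jackets_seen=True, text_says_cold=False))
```

With these assumed numbers, agreement between the two modalities yields a much higher posterior than the video evidence contradicted by the text, which is the sense in which multimodal data constrains the hidden state.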

bayesian_learning_05.png

I have talked quite a bit about all these things you have to learn. What are you going to learn from? One hypothesis dating back to the early days of AI is that we will learn from text: we will read from encyclopedias, or something like that. Now, on the web we have lots of text available, and it covers pretty much every topic we want to reason about. The problem is, you are going to have no connection to sensory input. There are probably few people here today who would think that you would learn from text alone.

The other major thing that has been proposed is learning from embodied agents, whether they are physical robots or virtual robots. One great thing about this is multimodal data, so you can connect linguistic and sensory input. The other thing is that you can actively manipulate the world you are in. If you are uncertain about things, you can say, “Well, what happens if I look at the other side of the object?” “What happens if I try lifting this up? Does it move?” That kind of thing.

The difficulties with this are, for one thing, if you are using physical robots, as people have pointed out, this can be expensive. Also, if they are just wandering around the lab, it is hard to give them any kind of broad experience. How is a robot going to know about buses, unless you let it ride buses? It is not something you are going to do at the early stages.

The other thing is virtual agents. This is more promising, in that you can give it broad experience. If there are buses in Second Life, it can take them. The question is, are you going to accidentally have some special properties in your simulation that do not generalize to the physical world? If you are doing great in simulation, how much do you trust that it has really solved the problem?

What I am proposing is learning from multimodal data on the web. It provides both broad coverage and multimodal integration. It does have drawbacks: it can be disjointed. Trying to learn from a whole bunch of YouTube videos? How much will you really be able to interpret, especially in a video where there is cutting between scenes? Also, you are forced to do passive observation. Still, at this point I would guess that this is the best path we have.

bayesian_learning_06.png

That is the data we are learning from. The next question is, are we actually learning everything from scratch, or are we going to build some things in? I would argue that we actually want to build in some components.

My argument for this is that children do not learn completely from scratch. We seem to have evolved abilities in the brain to do things like spatial reasoning or reasoning about language. Also, there has been a whole lot of work in areas of AI on things like language processing, the idea of a parse tree. Do we want our system to invent the notion of parsing sentences from scratch?

I would say we probably need built-in modules for things like spatial and temporal reasoning (along the lines of what the speakers talked about during the last session), linguistic reasoning, and also reasoning about other agents: the ability to put yourself in another agent’s shoes and use your own reasoning abilities to say, “If I were them, what would I do?”

bayesian_learning_07.png

We are also going to have a whole lot of learned components. Our targets are to learn the history of the world, the present state of the world, and the tendencies. The possible values for these world histories are going to be structures that have initially unknown sets of objects, initially unknown relations between them, changing over time. There is a lot of work in the probabilistic reasoning literature these days, including my own work on what I have called “Bayesian logic,” which is a probabilistic modeling language that lets you reason about unknown objects. What you need for that are prior distributions over what is going on in the world. That is really what these tendencies are–it is an encoding of what your priors are about what is going to happen.
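The idea of reasoning about an initially unknown set of objects can be illustrated with a small sketch. This is not Bayesian logic itself, only a hand-written toy model under assumptions: there are N objects (say, buses on a route), each sighting picks one uniformly at random, and the sightings can be told apart. The uniform prior over N and the cap `max_objects` are illustrative choices.

```python
from math import perm


def posterior_num_objects(num_sightings, num_distinct, max_objects=20):
    """Posterior over the unknown number of objects N.

    Assumes a uniform prior on N in 1..max_objects and sightings drawn
    uniformly at random with replacement. The likelihood of one particular
    pattern of repeats with `num_distinct` distinct objects is
    N * (N - 1) * ... * (N - num_distinct + 1) / N ** num_sightings.
    """
    weights = {}
    for n in range(1, max_objects + 1):
        # perm(n, k) is 0 when k > n, so impossible values of N get zero weight.
        weights[n] = perm(n, num_distinct) / n ** num_sightings
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


many = posterior_num_objects(num_sightings=10, num_distinct=3)
few = posterior_num_objects(num_sightings=3, num_distinct=3)
print(max(many, key=many.get), max(few, key=few.get))
```

Under these assumptions, ten sightings of only three distinct objects concentrate the posterior on N = 3, while three sightings that are all distinct leave larger values of N more probable, so the prior over what exists in the world matters a great deal when data are sparse.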

I say that the most promising way of representing this is with fusions of probability and logic. The field that is working on this in the AI community goes by the names “statistical relational learning” or “relational probabilistic models,” and there was an edited volume published by the MIT Press last year that I would recommend to anybody who is interested in learning more about this. The difficulty here is that the types of objects and the predicates and the dependencies of these models are also initially unknown.

We do not know what the model is initially, so we are going to need another layer on top of this: in our probabilistic model, there are priors over how the world tends to behave. We do not have a great idea of how to do this yet. One path toward it that I referred to is Dirichlet processes, which are a hot topic in probabilistic AI these days. A Dirichlet process gives you a clustering model in which you do not have to say in advance how many clusters, object types or predicates you need; instead, you can do Bayesian reasoning to figure out what number of clusters best explains your data. Since we are short on time, I won’t say much more about that.
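One standard way to see how a Dirichlet process avoids fixing the number of clusters is the Chinese restaurant process, its sequential sampling view. The sketch below only samples cluster assignments from that prior; it does not do posterior inference over data, and the function name, `alpha` value, and seed are illustrative assumptions rather than anything specified in the talk.

```python
import random


def chinese_restaurant_process(num_items, alpha, seed=0):
    """Sample cluster assignments from a Chinese restaurant process prior.

    Item i joins an existing cluster with probability proportional to that
    cluster's current size, or opens a new cluster with probability
    proportional to alpha (which must be positive).
    """
    rng = random.Random(seed)
    assignments = []
    cluster_sizes = []
    for _ in range(num_items):
        # The last weight corresponds to opening a brand-new cluster.
        weights = cluster_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(cluster_sizes):
            cluster_sizes.append(1)
        else:
            cluster_sizes[choice] += 1
        assignments.append(choice)
    return assignments, cluster_sizes


assignments, sizes = chinese_restaurant_process(num_items=100, alpha=1.0)
print(len(sizes), sizes)
```

The number of clusters is an outcome of the process rather than an input: it grows as more items arrive, and larger values of `alpha` make new clusters more likely.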

bayesian_learning_08.png

We are going to do all this learning–what types of algorithms are we going to use? There are lots of options out there in the literature. You are probably going to want to parallelize your interpretation over documents, but I see probabilistic inference as the major challenge, at least for the perception side of AGI.

bayesian_learning_09.png

How are we going to make progress on this? I would say that we should be able to demonstrate progress to the broader AI community even before we have full-blown AGI.

One way to do this is by showing year-by-year improvement on some of the real evaluation data sets that we use in some communities. For example, there is a Caltech data set on object recognition. There are various challenges sponsored by ACE (Automatic Content Extraction), workshops on things like resolving coreference between phrases in text, and something called the PASCAL textual entailment challenge, which is about telling whether one sentence in a natural language entails another. This involves a lot of reasoning of the form: if it is cold, does that entail that the temperature is low? That kind of question.

If we are building an AGI system, even if we have not solved general AI, we should be able to gradually improve on these data sets. Another thing we should be able to do is serve as a resource for more finely tuned but shallower systems. Right now a lot of these systems use resources like Cyc or WordNet to get some kind of world knowledge. Can we learn things that are better than Cyc and WordNet? Another way that we can interact with the broader AI community is to spin off challenge problems. If we have a model with probabilistic semantics, it should be amenable to probabilistic inference algorithms, but inference in it is usually intractable. How are we going to do that? We can throw that out to the probabilistic inference community.

bayesian_learning_10.png

I have argued that this kind of reasoning is a potential path toward artificial general intelligence. It exploits well understood principles, it learns broad real-world knowledge and is connected to the mainstream of AI research, so I think it is a promising path. Thank you.
