My Research

I've wanted to write this post for a while, but have never found the motivation to do so. There's a workshop this coming week, however, and I am scheduled to present this stuff on Thursday (today... I did okay, I think). So, I decided this is the perfect time to given a non-technical (that is, non-mathematical) introduction to what I do... there truly is no greater motivator than procrastination.

For those of you who don't know, I'm a grad student in computer science at the University of Michigan, specializing in artificial intelligence. Within this field, which is wider and deeper than most people realize, I am looking at the problem of reinforcement learning (RL). The formulation of the problem is simple. An agent interacts with some environment by performing actions. The environment reacts, and may occasionally reward or punish the agent. The goal of the agent is to maximize the amount of reward (or equivalently, minimize the amount of punishment) it gets. Hence the name reinforcement learning: the reward and punishment terminology is borrowed from the theory of operant conditioning in psychology.

Does this problem sound easy? It's not. Let's use a computer game as an example... say, Pong. A Pong agent has two actions available to it: move up and move down. The environment includes the ball and the opposite agent. A simple reward function gives the agent 1 point for winning a game and -1 point for losing the game. Now, put yourself in the role of the agent. You might think, "Oh, just follow the ball by moving up and down, and don't let the ball get past me." Well, you know that, but how does the agent know that? Remember, the agent doesn't know it is in a Pong game - all it observes are a bunch of numbers. The agent doesn't know it will be rewarded for getting the ball past its opponent, nor that it will be punished for letting the ball past it. It doesn't know that the ball will "bounce" when it touches anything - or even what constitutes "touching", "opponent", and "past".

Without any knowledge, the agent will have to act randomly, at least at first. It might get lucky and defend its goal once or twice, but more likely it will let its opponent score and be punished. There comes the first hurdle of reinforcement learning: how does the agent know what it did wrong? The converse is also true if the agent scored - what did it do right? This is called the credit assignment problem, because the agent is trying to figure out what gets the credit (or blame) for the reward (or punishment) it received. Again, remember what while you intuitively know that the agent missed the ball, the agent doesn't have any model of cause and effect to realize this.

To understand the basic solution to reinforcement learning, I must say more about those numbers that the agent observes. For Pong, there might be four numbers: the agent's vertical position, its opponent's vertical position, and the vertical and horizontal position of the ball. Each of these numbers are a state variable, and the different values these numbers can take in conjunction is called the state space - as in "the state of the union", not "the state of Michigan" - because they can describe every situation in the environment. To make reinforcement learning somewhat easier, researchers tend to view tasks as a Markov decision process (MDP). The distinguishing property of an MDP is that the next state of the environment depends only on the current state. For the Pong example, the state space described by the four numbers listed above does not make the game an MDP, since the ball can be moving arbitrarily fast or slow. If instead six numbers are used - the represent the horizontal and vertical velocity of the ball - then the task would become an MDP.

To be concrete, every action the agent takes changes the state of the environment. At each new state, the agent may receive a reward or a punishment, and then it has to take another action, and so on.

Now that the agent has some idea of how to relate one set of numbers to another, it can start learning. In place of human level reasoning, the agent simply plans backwards. Intuitively, if you get punished for being in this state, you know you shouldn't have taken that last action from the previous state. For Pong, this partially corresponds to not moving up if the ball is below you and just about to pass you. This picture is not complete though, because if you are at the top of the screen and the ball is at the bottom, you would not have gotten to the ball in time to deflect it anyway. This means that the faulty action lies further back in time.

Here's what the agent does. The agent remembers the state in which it got a reward or a punishment. Knowing whether this state is good or bad, it knows that the previous action from the previous state is also good or bad. Now knowing about this previous state, and can know about the state two steps back, and three steps back, and so on. That is, the agent learns by propagating the rewards back through the states, so the next time it finds itself in the same situation, it will either avoid that action (because it was eventually punished for it) or do the same thing (because it was eventually rewarded for it).

This was the state of the art 25 years ago.

Before I talk about more recent developments in the field, I want to raise a few problems in the solution above. The shallowest, but also most thought provocative, is this: how does the agent know it's doing its best? Imagine the task is to go from the bedroom to the bathroom to pick up an object. Through pure chance, the agent does this by going through the living room, the kitchen, and the broom closet, despite there being a door directly between the two rooms. Further imagine that the agent is rewarded based on how quickly it gets to its destination (I'm sure you can think of a reasonable, real life scenario for this). How would the agent know that the path it found is the shortest one? This problem is known as the exploration-exploitation problem. It is thus named because the agent needs to explore to know more about the environment, but this often means not exploiting the best action for the agent. In practice, researchers simply make the agent act randomly some small percentage of the time, so it will do its best for the most part but be constantly exploring. For the jargon-philiac, this is called an epsilon-greedy exploration strategy, where epsilon is the small percentage mentioned above.

There are other variations to this general framework which researchers are working on, such as partial observability (what if the agent doesn't know the value of some state variables?) and stochastic actions (what if actions only succeed some of the time?). I will skip the details on these topics and instead focus on what I'm looking into.

Pong, while an illustrative example, is not the most complicated environment for an agent to learn in. For one, there are only a limited number of situations for the agent to be in: in the original pixelated arcade game, there might only be a thousand or so different states. A modern computer running the algorithm I described above would find the optimal strategy (or policy) in less than a minute. Consider instead the rooms example I just gave, where the agent must move from one room to another. There can be an arbitrary number of rooms, and each room itself can be arbitrarily large. Even if the starting and ending positions are unchanged every time the agent must complete the task, it might still take the agent a long time simply due to how much exploring it has to do.

Yet, unlike Pong, this more complicated problem also presents more information for the agent to use. For example, completely exploring one room is useless when the goal is in another room. If you were to give directions to a human, you might tell them to go out of the room, walk down the hallway, and take the last right. Alternately, you might tell them that the thing they're looking for is not in the living room, but in the kitchen. The other person, on hearing these instructions, could then ignore everything in their current room.

What these two instructions have in common is abstraction. The first instruction abstracts over actions (it doesn't say "take 10 steps forward, then 2 steps left,..."), while the second instruction abstracts over states (it's saying that everything outside the kitchen is one room which doesn't matter). Despite this distinction, the two types of abstractions are related: the first instruction is implicitly saying that the current room and the hallway are not worth exploring, and the second instruction can be thought of as the action "go to the kitchen".

My research is related to this idea, although it's a little more specific. Expanding the example beyond rooms, if I'm giving instructions for someone to get to the Eiffel Tower in Paris, I would tell them to get to the airport, take a plane, etc. Each of these "actions" can be further broken down: order a cab, get out of the house, get in the cab,... And even further: walk to the phone, call the cab company,... This is called an action hierarchy, as the first actions "contain" the second ones, which "contain" the third, and so on, until at the lowest level the actions are simply "move this muscle". How do humans break down such a complex task, and can a computer do the same thing?

So far, the state of the art (which is about 5 years old) is "sort of". Given just the MDP assumption, there are proposals for what subgoals should be. The most general ones are different ways of identifying bottlenecks - that is, states which an agent must go through to reach a goal. Think of the door to a room, and in order to get anywhere else you must first go through that door. Other ideas include things like looking at what states you have commonly visited in your experience, or perhaps looking at what you're rewarded handsomely for.

Even knowing what appropriate subgoals are, the agent is not done. Imagine that both the door to your room and the door to the apartment are given as subgoals. How will you know that the door to your room is the first thing to go for, before trying to read the door to your apartment? Depending on the viewpoint, this could either be a problem of restricting the proposal of actions, or it could be one of inducing a preference on different possible actions. This appears to be a slightly easier problem to solve, and I was surprised to find that there is almost no prior work in this area. I intend to look into this question further, and hopefully by the end of the summer I will have some intuition as to what might work and what won't.

Post script: to people not in computer science or perhaps in but not in AI, the problem of action hierarchies might sound insignificant. Despite my previous misgivings about being limited to too specific a field, I find this problem genuinely interesting. Although it may not change the world (yet), I do think its solution will contribute to an understanding of humans and intelligence.
No comments

Can You Hear Me?

I sang this song with the choir back in middle school. We got either second or third place in the Hong Kong wide competition with this.

Although my middle school years were mostly terrible, this song has always stuck with me. It floats into my thoughts every once in a while.

Can You Hear Me?
(By Bob Chilcott)

I look around me as I grow
I'd like to tell you all I know
I see life with all its energy
The city streets, the rush time
This is my world, it's where I like to be
So much to see, so much to find
I sometimes sit and wait a while
See the sun, it makes me smile
Can you see it
Can you see it too

I feel life with all its energy
The joy of waking every day
This is my world, it's where I like to be
So much to do, so much to say
I sometimes sit and feel the sun
Its warmth is there for everyone
Can you feel it
Can you feel it too

My world is a silent one
But it's enough for me
I hear you through your hands
The movement sets me free
But it could be a special thing
To hear your voice
To hear your sing
Can you hear me
Can you hear me too
1 comment


I have another post in the writing, but I want so share my discovery that my experiment on Craigslist is no unique. At least two other people have done it before: Simon Owens, whose results were "tame" like mine, and Jason Fortuny, who posted all the replies he got on Encyclopedia Dramatica (both of these are definitely NSFW). I believe several people lost their job over the Fortuny case, and more had to deal with problems in their family.

On the ethics over Fortuny's case, I happen to think he should not be punished. Legal considerations aside, I think the fault lies mostly on the people who replied to his ad for giving personal information to a stranger on the Internet. Without evidence otherwise, the worst case assumption should always be made.

That said, I don't think Fortuny is a shining example of how humans should act, either.
No comments

Prologue to Jurassic Park

I've only written once about the environment and environmental causes, although that was in reference to a larger issue. My apathy is mentioned in passing elsewhere as well. I might have eventually written more on the subject, but I just discovered that the prologue to Michael Crichton's Jurassic Park (now a major motion picture) very articulately explains my views. It is copied below for your enjoyment.
You think man can destroy the planet? What intoxicating vanity. Let me tell you about our planet. Earth is four-and-a-half-billion-years-old. There's been life on it for nearly that long, 3.8 billion years. Bacteria first; later the first multicellular life, then the first complex creatures in the sea, on the land.

Then finally the great sweeping ages of animals, the amphibians, the dinosaurs, at last the mammals, each one enduring millions on millions of years, great dynasties of creatures rising, flourishing, dying away -- all this against a background of continuous and violent upheaval. Mountain ranges thrust up, eroded away, cometary impacts, volcano eruptions, oceans rising and falling, whole continents moving, an endless, constant, violent change, colliding, buckling to make mountains over millions of years.

Earth has survived everything in its time. It will certainly survive us. If all the nuclear weapons in the world went off at once and all the plants, all the animals died and the earth was sizzling hot for a hundred thousand years, life would survive, somewhere: under the soil, frozen in Arctic ice. Sooner or later, when the planet was no longer inhospitable, life would spread again. The evolutionary process would begin again. It might take a few billion years for life to regain its present variety.

Of course, it would be very different from what it is now, but the earth would survive our folly, only we would not. If the ozone layer gets thinner, ultraviolet radiation sears the earth, so what? Ultraviolet radiation is good for life. It's powerful energy. It promotes mutation, change. Many forms of life will thrive with more UV radiation. Many others will die out. Do you think this is the first time that's happened? Think about oxygen. Necessary for life now, but oxygen is actually a metabolic poison, a corrosive glass, like fluorine.

When oxygen was first produced as a waste product by certain plant cells some three billion years ago, it created a crisis for all other life on earth. Those plants were polluting the environment, exhaling a lethal gas. Earth eventually had an atmosphere incompatible with life. Nevertheless, life on earth took care of itself. In the thinking of the human being a hundred years is a long time.

A hundred years ago we didn't have cars, airplanes, computers or vaccines. It was a whole different world, but to the earth, a hundred years is nothing. A million years is nothing. This planet lives and breathes on a much vaster scale. We can't imagine its slow and powerful rhythms, and we haven't got the humility to try. We've been residents here for the blink of an eye. If we're gone tomorrow, the earth will not miss us.

Cross Acting

I don't get in trouble often, but I consider myself something of a rebel. I don't like organizations in the general much - I prefer groups that are more self organized and sustained. Sometimes I do weird things for amusement, like pretending I don't know a secret to see whether the secret holder would divulge it.

This time, I posted a Criagslist ad... in Women Seeking Men.

The primary question I wanted answered was whether men on the internet are really crazy. Some of my female friends have posted ads for fun, and later compared their replies. What they found was that some emails were exact copies of each other - some people just write the same response to everything, looking for a shag. That story had a twist ending for my friend, but that's not my story to tell.

A word on the post: I didn't write the ad myself - I would never have passed as a female writer. I won't copy the post here, but it was itself copied from the blog of a friend. We haven't talked since I graduated from Northwester, and even before then I don't think she knew I read her writing. She is an objectivist before I read The Fountainhead, and the post expresses a related sentiment: her wish for someone who doesn't want only to please her, but someone who can contribute something of themselves to her. In the words of Ayn Rand, a prime mover, not a second hander. Regardless, I never told her I used her writing (if you're reading this, you should email me), so I don't feel right posting her work even with attribution.

First, some statistics. I got 50 replies in total, the first within the hour, the last almost a month later. About 60% replied within 5 days. 12 of the 50 replies had photos attached, and a few more had links to either MySpace pages or some other picture hosting site. A small minority didn't bother to write proper English for the task, and a few more asked questions which made me think they didn't read the post at all. Although it wasn't unexpected, there were several older men who replied. For an ad declaring the poster to be 22, I got a reply from someone 17 years older. It reminds me of what OKTrends found. Of the 50 replies, two were duplicates; the senders sent the same email twice.

My most common reaction (subject to availability bias) is: who do you think you are? Granted, the post is a little self indulgent, but they are still the suitors. What I had posted talks about an intellectual pursuits... and a fair number talk about sex, and how they "know what a women [sic] wants in the sex department" and that "[I] want to be seduced". At the bottom of the pile is a reply that promised to "open that box of hidden desire and fantasy". I want to puke.

On the other hand, I suspect some responders are delusional. One ranted on about hidden treasure. One had a "real" picture of himself tensing up. A few wrote poems.

As for the good... there aren't the many. The ones I like best (again subject to bias, this time of a different kind) were simpler. They mention how old they are, what they do and like doing (somewhat passionately), and invite me to reply. A picture, if one is included, is a simple face shot. Three people mentioned books/authors, one of them being Ayn Rand. Only a single person mentioned something interesting - by quoting an article on the large hadron collider.

Hypothesis: most men who reply to Craigslist personals are douchebags.

Result: Confirmed.

Maybe next time I'll reply to a men seeking women ad (the doucheist one... is that a word?), and see what I get.
No comments

Smashing Cars

I'm kinda unhappy with the latest MythBusters episode, the one about two cars smashing into each other at 50 mph is equivalent to one car smashing into a wall at 50 mph.

The way they tested the myth on a small scale was by swinging two pendulums into each other, with each pendulum containing a lump of clay and an additional weight in the back. The second weight would squash the clay, and from the deformation the impact of the smash can be visualized. It's a pretty clever rig.

My problem, however, is not with the rig; it's with the interpretation of the myth. There are two ways in which the two crashes can be "equivalent": from the perspective of the car, which is what the MythBusters tested, and from the perspective of something in the middle. Since the original myth is about a compact car being squished between two semis, it would seem that this myth should be about the latter interpretation. And in that case, I believe the crashes are, in fact, equivalent.

As an educational show, I think MythBusters should have at least covered this other scenario, and shown how the "myth" is more of an interpretation issue.
No comments

Pascal's Wager, Part 2

I have written about Pascal's Wager before, but I never finished that post. A few days ago I came across a video, which finally closed the case on that argument for me.

The inspiration is this, which I found on Reddit. Most of the video is junk and doesn't hold up to my standards of reason. Something at the end caught my attention though, and I came up with the following argument.

Recall that the problem with Pascal's wager is that, unlike most arguments about God, it's a completely rational. It frames the problem in turns of expected utilities, which is the standard practice in decision theory. The following payoff matrix summarizes the wager:

God ExistsGod Doesn't Exist
Believe in God+Infinity0
Don't Believe in God-Infinity0

Richard Dawkin's argument, that believing in God has a cost while living, doesn't hold up, as any cost incurred while alive is only finite, and does not offset the infinite payoff of being in heaven.

Pascal's wager is that simple. The video over-complicates and sets up several straw men (eg. Pascal not knowing what God is but then contradicting himself) and ad hominem attacks (eg. Pascal's bias towards Christianity). While the claim that Pascal ignores other gods is true, his wager still works - as long as gods reward belief, the expected utility is still to believe in god(s). The existence of non-god systems (frog's dream, video game, etc), doesn't matter, as they don't offer any utility. The gem of the video is in a 3 second frame: for every unknowable idea that rewards a particular behavior, another unknowable idea will punish that very same behavior.

Here's my new argument. Pascal's wager depends on supposing a Christian god, with the payoff matrix above. Since, however, the existence of the Christian god is unknown, it is equally valid to posit a different god (let's call Him AntiPascal), with different payoffs. In particular, let's imagine a god that will send people to heaven only if they don't believe in any god, and will send people to hell if they do. What does the payoff matrix look like?

God ExistsGod Doesn't Exist
Believe in God(s)-Infinity0
Don't Believe in God(s)+Infinity0

That's right - everything looks almost exactly the same, except the signs on those infinities switched. By this payoff matrix, people should not believe in gods at all.

So which payoff matrix is right? We don't know. To properly calculate the expected utility of believing in god(s), we need to know the probability of each payoff matrix itself - that is, the probability that the Christian god is real (and has the payoff matrix specified), and the probability that the AntiPascal god exists. Both of these probabilities, it turns out, is unknowable - and therefore the expected utility of believing or not believe in god cannot be compared.

Given the odds, Pascal's wager is not one you want to bet on - there's simply no telling whether you win or lose. While this is no argument against the belief in god, it is no longer the purely rational argument for it either.

The funny thing is, although this argument is new to me, the reasoning behind it is not. As Richard Dawkins himself has said several times, most theists are in fact atheists to all other religions. Real atheists just go one further.
1 comment