How do humans learn about values?
TLDR: Tan Zhi Xuan builds cooperative, theory-of-mind agents for safer AI, favoring human-like value learning and interpretable, model-based systems over pure utility maximization.
Mykhaylo Filipenko: Hello Xuan, thanks a lot for taking the time to do this interview. I'm very excited to learn more about your work. The first question is always the same: Could you introduce yourself briefly?
Tan Zhi Xuan: I'm an incoming assistant professor at the National University of Singapore, and up until recently I was a PhD student at the Massachusetts Institute of Technology, working at the intersection of AI alignment, computational cognitive science, probabilistic programming, and model-based approaches to AI. I work with Josh Tenenbaum and Vikash Mansinghka, so I'm getting perspectives from two labs: one is more on the cognitive science side, the other more on the probabilistic programming side.
And my work specifically focuses on how we can build cooperative agents that have something like a theory of mind. This means: how agents can infer, from other people's actions and words, what they really want and what their goals, preferences and beliefs are.
More broadly, when you're watching multiple agents: what do they collectively want? What social norms are they following? What values might they share? So that's broadly the research program I've pursued over the PhD.
Mykhaylo Filipenko: Wow, that sounds very interesting. I think we're going to come to this in one of the next questions. But before that, I'd like to know from you how did you get started with AI alignment? Why did this topic excite you and what was your kind of journey to go into AI safety?
Xuan: Yeah. The way I often describe it is: I got “nerd-sniped”. In some ways this entry path is fairly common: I was in undergrad and got involved with the effective altruism student group at my institution. I wasn't so much sold on the earning-to-give part as on convincing rich people to give away more of their money for a good purpose. That appealed to me. And then I got interested in AI alignment when friends in that group recommended that I read “Superintelligence” by Nick Bostrom.
After reading that book – even though I have different opinions of it now – I wasn't sure how soon, or if ever, this was going to happen, but it sure seemed like an interesting question to tackle: How could you attempt to get AI systems to learn human values, whatever those are? And further, how could you computationally model how humans learn values, how kids, assuming it's not entirely innate, learn value systems from the society around them? The fact that people were even attempting to ask that question at a computational level really excited me and effectively nerd-sniped me as someone who was already really interested in moral philosophy.
The fact that it potentially could have some positive impact on the world was also helpful, even though I wasn't entirely convinced back then that it was necessarily the most important thing. There were a bunch of additional considerations: I had been really interested in computational neuroscience because I was interested in cyberization as a path to transhumanist-style things. I don't even really identify as a transhumanist anymore, and that's another long story. But I became convinced that it's going to be a long time before we do anything like reverse engineering the brain, and probably much sooner we'll get something like advanced AI systems. So if there's anything more pressing, it's figuring out how AI systems, broadly speaking, can have human-like value systems and not try to eliminate us instead.
Mykhaylo Filipenko: Maybe one more detail here: Did this decision come during your PhD, or did you discover this earlier and say: I want to do my PhD exactly in this field?
Xuan: I discovered it in undergrad and started figuring out how I could do what was called “machine ethics” back then. But I was trying to take a more learning-based approach to it: not just embody what I think the right ethical theory is into the machine, but instead take a step back to the level of metaethics, maybe even to moral epistemology: How do people learn morality, if that's what people do at all?
To make it more descriptive: “Okay, I'm not going to commit myself to saying that this is the right way of doing it and that anyone ought to act like this, but instead ask how people come to believe that certain ways are the right way to act.”
This was a question that really interested me. At the same time, I was reading a lot of interesting moral philosophy, both from Buddhist philosophy but also, interestingly, a bit from Confucian philosophy. And Confucians make a big deal about moral cultivation in a way that I think doesn't come up as much in Western moral philosophy.
This constellation of interests got me started working on a simple project in my late undergrad on getting robots to learn very simple social norms. In this case, these were just rules about what kinds of things you can or cannot pick up because people might own them. So, ownership norms, learned from both examples and explicit instruction. Eventually, in the period between undergrad and grad school, I discovered the whole field of computational cognitive science, which was asking similar questions but from a more explicitly Bayesian perspective.
Using Bayesian approaches, researchers in computational cognitive science had been trying to model human moral learning, or human moral cognition, to form a moral theory of mind. That really gelled with the worldview I was forming around how people might be going about solving these problems. That's why I ended up applying to the lab I'm currently in.
So when I started my PhD, I wanted to build Bayesian models of how people learn social norms. I've recently started doing that again. But the reason I didn't end up doing that for most of my PhD is that I realized how technically hard the problem was, and how it required first solving the simpler problem of modeling a single agent before modeling many agents in a society. So instead, I tried the “simpler thing” of inferring a single person's goals or preferences in this Bayesian way, instead of throwing tons of data at the machine, because when I started my PhD, Bayesian goal inference wasn't really scalable.
And I think I've spent most of my PhD trying to develop algorithms and methods for really fast goal inference that is as efficient as, or even more efficient than, how humans do the same task with other humans.
Mykhaylo Filipenko: I think you already started diving into the details of your work, in the details of your PhD. So let’s go ahead with that.
Xuan: When I started with AI alignment, part of why I ended up pursuing my research agenda is that around that time the main idea was: we're going to understand AI systems as powerful utility maximizers, or something like that. Nowadays, I don't really agree with this as the best model of how powerful AI is going to go. For example, I don't think large language models really fit that paradigm, for a bunch of reasons.
But the thought back then was: okay, if we're going to think of AI systems as utility maximizers, then we really need to make sure that they maximize the right utility function. So there's a whole bunch of work from 2016 onwards by people in Stuart Russell's lab and other people affiliated with the Machine Intelligence Research Institute trying to do things like model value learning in terms of learning the utility functions humans have.
On the other hand, there is the probabilistic approach of learning someone's preferences: you observe their actions. Economists have been doing this for a long time. You observe someone's choices and impute a utility function that best explains those choices. As it turns out, that basic model is also the same kind of model that has been applied in the field of inverse reinforcement learning in AI and in all sorts of theory-of-mind models.
Whether or not you think it's relevant to AI safety this is one basic approach to sort of modeling and inferring human values. You operationalize them as utility functions, and you try to infer them from human behavior, essentially.
So my work has been focused on a couple of things. First of all, inferring utility functions is a hard technical problem, because there are so many utility functions people could have, and there are so many link functions or likelihood functions that could explain how people turn their preferences into actions. Traditionally, the assumption is that people are roughly optimal, right?
Usually people do something a bit noisier, and that's captured in the Boltzmann rationality assumption. But there are more specific kinds of biases that might arise. In many cases it's just very hard to compute what the optimal thing to do is according to your values. You somehow need to model the fact that humans don't always do that. In fact, we make mistakes because, for example, we forget things, or we don't think hard enough to realize that one action was in fact the better choice over another. And because planning is hard, doing inference over lots of goals and steps is hard too. After I realized how hard these problems were, a good chunk of my PhD became focused on solving this core technical challenge.
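(For concreteness, the Boltzmann rationality model referred to here is commonly written as follows; the notation below is generic textbook notation, not taken from Xuan's papers:

$$P(a \mid s, \theta) \;=\; \frac{\exp\big(\beta \, Q_\theta(s, a)\big)}{\sum_{a'} \exp\big(\beta \, Q_\theta(s, a')\big)}$$

Here $\theta$ is the hypothesized goal or utility function, $Q_\theta(s, a)$ is how good action $a$ looks in state $s$ under $\theta$, and $\beta$ is an inverse temperature: as $\beta \to \infty$ the modeled human is perfectly optimal, while smaller $\beta$ captures noisier, more mistake-prone choices.)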
This is important even if you don't think utility functions are the right way to represent human values, because even basic tasks like "you're watching your friend in the kitchen and you're trying to figure out what meal they're trying to cook" are hard. You're just trying to figure out what goal they have over a specific short time horizon. You still run into the same technical problems, because there are tons of possible meals they could be trying to cook, and they might make all sorts of weird mistakes while cooking. So one of my first projects, which ended up being the basis for a lot of everything else, was: "Okay, how can we realistically model humans in this setting where they're pursuing their goals – not values at large, but just goals in a short context – and also capture how humans are able to do this inference of other people's goals pretty rapidly?"
The innovation was to notice that when I'm modeling other people, I don't think of them as optimal planners that are always thinking a hundred steps ahead in order to achieve their goals. Instead, it seems like we plan a few steps ahead, take actions, then replan and act again. So we can model people as these short-horizon rational planners instead. This has two benefits: it's more realistic, but it's also more efficient for me as an observer to only simulate you as a bounded planner.
So when I'm doing inference, I don't have to come up with all the possible plans you might be trying to execute in order to achieve your goal. I just have to simulate you a few steps ahead and then check against the actions you actually took. If you did take those actions, then probably the goal that produced that plan is the one you were trying to achieve, right?
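To make the idea concrete, here is a minimal sketch of Bayesian goal inference with a bounded, few-steps-ahead planner as the observer's model of the agent. The toy one-dimensional world, the candidate goals, and all parameter values are illustrative assumptions for this article, not code from Xuan's actual systems.

```python
# A minimal sketch of Bayesian goal inference with a boundedly rational
# ("plan a few steps ahead, then replan") model of the agent.
# The toy 1D world and all parameters are illustrative assumptions.
import math

ACTIONS = [-1, 0, +1]     # move left, stay, move right
GOALS = [0, 3, 7, 9]      # candidate goal locations the observer considers
BETA = 2.0                # inverse temperature: how noisily-rational the agent is
HORIZON = 3               # bounded planning depth (a few steps, not optimal-to-the-end)

def rollout_value(state, action, goal, depth):
    """Value of taking `action` and then acting greedily for `depth - 1` more steps."""
    state = state + action
    for _ in range(depth - 1):
        # Greedy replanning: take the step that most reduces distance to the goal.
        state += min(ACTIONS, key=lambda a: abs(state + a - goal))
    return -abs(state - goal)  # negative distance to goal: higher is better

def action_probs(state, goal):
    """Boltzmann-rational action distribution under a short-horizon planner."""
    values = [rollout_value(state, a, goal, HORIZON) for a in ACTIONS]
    weights = [math.exp(BETA * v) for v in values]
    total = sum(weights)
    return {a: w / total for a, w in zip(ACTIONS, weights)}

def goal_posterior(trajectory):
    """P(goal | observed (state, action) pairs), with a uniform prior over goals."""
    log_post = {g: 0.0 for g in GOALS}
    for state, action in trajectory:
        for g in GOALS:
            log_post[g] += math.log(action_probs(state, g)[action])
    total = sum(math.exp(lp) for lp in log_post.values())
    return {g: math.exp(lp) / total for g, lp in log_post.items()}

# An agent starting at 5 and moving right twice looks like it is heading to 7 or 9.
print(goal_posterior([(5, +1), (6, +1)]))
```

The point of the sketch is that the observer never enumerates full optimal plans; it only simulates a few greedy steps per candidate goal and scores the observed actions under each hypothesis.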
Mykhaylo Filipenko: May I throw something in? It's a really interesting point. In this kind of thinking, it seems that humans could be very different from AIs. Because even if you think about chess-playing models, these really rather "simple AIs" with a very narrow scope, those models would already think many hundreds or thousands of steps ahead. Humans would usually not do that.
Xuan: Yeah, this is interesting. I think AI can indeed think very differently from humans. For example, large language models – at least before these reasoning models – didn't do anything like explicit planning when trying to solve a task. Instead, it's more like "who knows what's going on in neural networks?". Some of it is memorization and some of it is maybe implicit reasoning. So it's quite hard to model what they do, and as a result they sometimes fail in weird ways that humans don't, whereas when humans make mistakes we can usually understand them: probably the person didn't think hard enough, or maybe they forgot something, or they had a false belief, but they're still being sort of rational with respect to all of that.
In contrast, humans have been shown to solve many kinds of problems with something like a forward search algorithm. So you might think that humans are not doing tons of forward search, and that their depth is only 10, but you can just tweak this knob to account for people thinking more or less. Even AlphaZero and all these systems don't actually go up to a thousand steps of depth by default; they go to something like 100, because otherwise it gets too expensive. So you can just tweak that knob and hopefully capture behavior where people are thinking a lot more before taking an action.
Mykhaylo Filipenko: I think I interrupted you a little bit on your work. You were saying how you've been thinking about predicting behavior and inferring the norms or the values of humans from their behavior, right?
Xuan: The specific thing I ended up doing is: "Okay, how do we infer goals reliably despite the fact that goal inference is a pretty hard problem?" And now I'm much more confident that you can actually solve this problem in a relatively Bayesian way. One basic idea is to model people as boundedly rational, which means not assuming they're doing optimal planning. The other thing is that, in cases where there is a large number of goals people could be pursuing, I've had some recent work on how people do this open-ended goal inference using what I think of as bottom-up cues:
It goes like this: First propose a bunch of reasonable guesses, and then filter them down according to the principles of Bayesian inference, whereby a bunch of the guesses are wrong and some of them are right.
Then simulate a possible partial plan for each of them and check whether it explains the actions the agent has taken so far. If they match, we keep that hypothesis; if they don't match, we throw it away. That's the rough idea. I think it's a pretty good model, and we did some nice human experiments showing that it captures how humans infer other people's goals even in cases where there are very many possible meals someone could be cooking. Essentially, it's a combination of a data-driven process and a model-based process that allows people to do this.
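As a rough illustration of this propose-then-filter loop, here is a sketch using a hypothetical recipe example with a stand-in heuristic proposer (in real systems the bottom-up proposals might come from a learned model); none of the names or data are from Xuan's papers.

```python
# A minimal sketch of "propose bottom-up, then filter top-down" goal inference.
# RECIPES, the proposer heuristic, and the consistency check are illustrative
# assumptions, not an actual implementation from the research described above.

def propose_candidate_goals(observed_actions):
    """Bottom-up guesses: any recipe that shares at least one observed ingredient."""
    return [r for r in RECIPES if set(observed_actions) & set(RECIPES[r])]

def plan_is_consistent(recipe, observed_actions):
    """Top-down check: could the actions so far be a prefix of a plan for this recipe?
    Here a 'plan' is just the recipe's ingredient list."""
    partial_plan = RECIPES[recipe]
    return all(step in partial_plan for step in observed_actions)

RECIPES = {
    "pasta": ["pot", "pasta", "tomato"],
    "salad": ["bowl", "lettuce", "tomato"],
    "soup":  ["pot", "onion", "carrot"],
}

observed = ["pot", "tomato"]
candidates = propose_candidate_goals(observed)                      # bottom-up proposals
kept = [g for g in candidates if plan_is_consistent(g, observed)]   # filtering step
print(kept)  # only "pasta" survives: "soup" has no tomato, "salad" has no pot
```

In a full Bayesian treatment the filter would weight hypotheses by likelihood rather than keep or discard them outright, but the overall shape of the loop is the same: cheap data-driven proposals, then model-based checking against the observed behavior.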
So, how does this help AI alignment? I think it's super useful for AI assistants that need to help people on relatively time-bounded tasks. This is the case where you're just trying to figure out what goal a person has within a single episode and help them with that task, so you need to infer quickly what they're trying to do. This is a bit different from inferring people's longer-term preferences. Preferences are more useful if, for example, you're building a system that helps people fill out their shopping carts. You have many previous episodes of people filling out their shopping carts, and learning in that case is actually an easier problem, because the planning problem is not something you have to solve online. You just think: okay, this person usually buys ingredients for this kind of pasta dish, and you can probably extrapolate forward from there.
Goals and preferences are different. I've come to think of them as different, and it seems that utility functions are not quite right, because they collapse everything into a single representation of what humans care about. I think we need to go beyond that. Where do the preferences come from? That's a start, and then you can go on and ask deeper questions about values and norms. I think those are richer and deeper, and I've only started to explore them a bit more recently, in work like the "Beyond Preferences" paper.
Mykhaylo Filipenko: Yes! Maybe you could explain in a couple words what this paper was about because it was one of your recent ones.
Xuan: “Beyond Preferences in AI Alignment” is a kind of philosophy paper. I wrote it with a couple of other folks – Micah Carroll, Matija Franklin, and Hal Ashton – and we came together because each of us, in our own way, was a bit frustrated with how preferences were treated in both the theory and practice of AI alignment; human preferences in particular. I've told you about utility functions. They have typically been assumed to be, basically, representations of human preferences. In traditional economic theory you get this idea that if your preferences follow certain axioms of rationality, namely the von Neumann-Morgenstern axioms, then your choices or preferences can effectively be represented as you maximizing the expected value of some utility function.
And utility functions have certain properties. They're scalar, right? They map every possible outcome onto a single value. And there are a couple of issues you could have with this as a model of both human preferences and AI alignment, right?
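(For background, the von Neumann-Morgenstern representation theorem referenced here says, in generic textbook notation: if a preference relation $\succeq$ over lotteries $p, q$ satisfies completeness, transitivity, continuity and independence, then there exists a utility function $u$ over outcomes $x$ such that

$$p \succeq q \iff \sum_x p(x)\,u(x) \;\ge\; \sum_x q(x)\,u(x),$$

i.e. preferring one lottery over another is equivalent to it having higher expected utility.)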
The “traditional” idea was always: if humans have utility functions, how do we align AI? We just get AI systems to learn a human utility function and maximize that, right? And what we argue is that, descriptively, this is not a very good model of humans, for a bunch of reasons. For instance, to be rational or reasonable, you're not necessarily required to maximize expected utility.
If you take those arguments seriously, then that's really going to change your view on what it means to do AI alignment, and you move away from the idea that we just need to recover this human utility function and maximize it. Instead, it suggests that you want something like a more pluralist approach that accounts for the fact that people's values may not always be collapsible into a single scalar value. One thing is that people may have incomplete preferences, so you need to think about what it means to build an AI assistant that can help people with incomplete preferences. And I think you also need to consider that not only can a single human not be represented with a utility function, but the preferences of multiple humans can even less be aggregated into a single utility function.
If you reject that view, then you need a different conception of human or multi-principal AI alignment. What we argue for instead is an approach grounded in social contract theory: an approach where AI systems are aligned to normative standards that people with different values can reasonably agree upon.
That is broadly what the paper is about.
Mykhaylo Filipenko: That's very interesting. Especially what you said about trying to collapse such a complex system as a human, or an even more complex one like a society, into just one single number. Very likely we are under-representing all its complexity. In a previous interview I talked, for instance, to Jesse from Timaeus, and he said we are trying to collapse everything that's going on in a neural network into just one number, which is the loss. But if you look in depth, the loss landscape is way more complex, and we have to analyze it to understand it better. It seems like you're coming to similar arguments: that trying to represent everything just with this utility function is ..
.. in Germany there is a very popular word right now in politics .. “undercomplex”
It seems like you guys come to the same idea of an undercomplex representation of reality from different directions.
Let’s dive into a different question: What do you see as overrepresented topics and underrepresented topics in AI safety and AI alignment?
Xuan: Yeah, that's a great question. I think there's an underrepresented approach that deserves more attention, and that has recently been getting a bit more: doing AI safety by design, from first principles, or whatever you want to call it. It starts more from theory and says: "Okay, now we actually need to turn the theory into practical systems, and for this to actually work it needs to be competitive with the current major machine learning or large language model paradigm."
I hope that others can consider this as a way forward for making AI safer: in the same way that we build traditional software, we can engineer AI systems which deliver the economic value we want, perhaps more, and we want this next generation to be safer than the current generation of machine learning systems.
Maybe we can design them more carefully, so that they're more bounded and we don't have to worry about what has secretly gotten into them during the training process. The most representative proponents of this direction include Davidad, Yoshua Bengio, Max Tegmark and Stuart Russell, who recently came together to write a paper on guaranteed safe AI that I was also a co-author on.
Now, there is a broad spectrum of perspectives on what exactly guaranteed safe AI means and how ambitious we should be there, but my version of it is that it is practically possible to design systems which are both safer and more economically competitive.
Mykhaylo Filipenko: So let's jump to the last question: What's your theory of change? You have this idea, which I think is a really great one, to approach AI safety by design. How could this be brought into the big AI labs? These are probably the guys building the most powerful models right now. How could your idea find its way into practical applications?
Xuan: My theory of change doesn't actually involve bringing it to the big AI labs. I just think it's more likely that these alternative AI paradigms will succeed by disrupting the big labs. We will see.
I mean, from an outside view, you maybe should be skeptical that some random person or some group of smaller companies is going to do that, but I think it's important for people to try and take different technical bets. I think the big AI labs are too specialized and too committed to the current AI scaling paradigm for it to really make sense for them to pivot to something else.
In the meantime, what should be done? There are a bunch of things, but what I'm personally willing to bet on is that there are some key AI applications that can be built using what I think of as a better approach. Think of an AI that has a specific, ideally interpretable, model of how the world works. You get the AI system to reason and plan over that model of the world, ideally under uncertainty. You combine that with the ability to interpret human instructions or human preferences to form uncertain representations of what humans want, and then you get it to execute actions and achieve tasks for humans using this model of the world. Why should we expect this to be better on a bunch of dimensions? Firstly, because we have an explicit model of the world, so we know what the agent knows and doesn't know.
So we can exclude aspects of the world that we don't think the agent should know about. Secondly, because this model is explicitly represented, in the same way that traditional code is really efficient, these models of the world can be really compact, and reasoning over them using classical search algorithms can be really, really fast.
I think this is going to be more efficient than attempts to do reasoning in natural language with the most recent generation of large language models. You trade off some generality for efficiency here. But for many of the things people actually want to deploy, e.g. web-browsing AI agents that do your shopping for you, or video game agents that are essentially smart NPCs, it is possible to build world models specific to those tasks and do really efficient planning over them.
And I think it's also safer, because we have the guarantees that come from actually knowing how the algorithm works. We are also representing adequate uncertainty about what humans want in this context. So we can avoid failure modes and have specifications: the agent is going to achieve the user's goals, subject to satisfying the safety constraints with high enough probability.
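(One generic way to write down this kind of specification, using illustrative notation rather than anything from the guaranteed-safe-AI paper, is as a chance-constrained planning problem:

$$\max_{\pi}\; \mathbb{E}_{\pi}\big[R_{\text{goal}}\big] \quad \text{s.t.} \quad P_{\pi}\big(\text{safety constraint violated}\big) \le \varepsilon,$$

i.e. the agent chooses a policy $\pi$ that best achieves the user's goal while keeping the probability of violating the safety constraints below a threshold $\varepsilon$.)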
The applications I mentioned above are, I think, quite viable. I'm not suggesting that we can automate human writing assistance in this way; I think existing large language models are really good for that. But there are other tasks that everyone's excited about right now that are ripe for disruption by a safer class of systems. There are all sorts of obvious reasons why they can be done much more efficiently and reliably using more traditional AI search techniques, combining them with large language models not for everything, but only for handling natural language.
Mykhaylo Filipenko: All right. Thanks. Thanks a lot! That was very interesting. I liked your insights on many aspects, and I also like the idea of betting that there is more than only the direction the big labs are pointing at right now.
Xuan: For sure. Let me just add one bit: As I mentioned at the beginning, I will be starting as a faculty member at the National University of Singapore later this year, and if there's anyone interested in tackling AI safety using the approaches described above, they should reach out to me. [1]
[1] https://ztangent.github.io/recruiting/