TL;DR:
Dr. Jobst Heitzig is a senior researcher at the Potsdam Institute for Climate Impact Research. After working in the field for many years, he decided to transition to AI safety. As an outsider to the field, he shares many interesting insights for everybody who is about to enter it. Currently, he is working on modular AI systems with a focus on non-optimizing approaches to decision making. As part of this work, he is starting an AI safety lab in Berlin. If you are interested in working with him, feel free to reach out to him [1].
Hey Jobst, first of all, many thanks for taking the time to talk with me. It’s really great to be able to talk to you today, and I am very excited to kick off this series of interviews! Could you briefly introduce yourself?
I’m a mathematician by training. I have a PhD in pure mathematics from Hannover on something that, at the time, didn’t seem to have any application. After doing that PhD, I got frustrated. I thought this is totally irrelevant stuff. I want to do something meaningful for society.
My first job was at the German National Statistical Office, where I spent four and a half years developing algorithms for statistical confidentiality protection. That’s a problem where you collect a lot of sensitive data from households, firms, and so on. You want to do some statistical analysis, like a regression analysis or creating a table. You want to publish some results, but you need to ensure that no one can infer anything about an individual subject, firm, or household, even if they have additional knowledge. For example, if you observe that your neighbor was interviewed for the micro census and know a lot of the answers your neighbor would have given to some of the questions on the form, you could identify the row corresponding to your neighbor in the microdata, even though it’s formally anonymized.
That’s a problem we solved with some algorithms, and that was interesting, but at some point, I noticed that the solutions I proposed would not be applied. They involved adding some noise, some randomness, and for political reasons, they didn’t consider that a viable option. So they kind of shelved it. I got frustrated and quit the job at the point where they offered me—what’s the English word for this?—to be a civil servant, a very secure, permanent position. That signaled to me, okay, if I do this now, then that’s it. I will stay here for the next 40 years doing this type of thing. And so, I quit the job. That’s not what they expected.
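To give a rough idea of what such noise-based disclosure protection can look like, here is a minimal sketch with made-up numbers; the function name and parameters are purely illustrative and not the algorithm developed at the Statistical Office.

```python
import numpy as np

def perturb_counts(cell_counts, scale=2.0, seed=None):
    """Add Laplace noise to aggregate counts before publication.

    The noise makes it harder to infer whether any single respondent
    contributed to a given cell; `scale` trades accuracy for protection.
    """
    rng = np.random.default_rng(seed)
    noisy = cell_counts + rng.laplace(loc=0.0, scale=scale, size=len(cell_counts))
    # Round and clip so the published table still looks like counts.
    return np.clip(np.round(noisy), 0, None).astype(int)

# Example: households per income bracket in a small region (made-up numbers).
print(perturb_counts(np.array([3, 17, 42, 8]), scale=2.0, seed=0))
```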
My next job was with the German equivalent of the World Bank—the KfW Bankengruppe, a state-owned bank that is managed like any other bank. I was in the data warehouse department and also did some in-house statistical software training. It was very well paid but a little boring. And even though that bank could have been considered the good guys, not all of my leftist friends saw it that way after work. They saw my suit and put me into the bad guys’ box because I worked at a bank.
Then came the banking crisis in 2009. Everything got rough, and I quit that job as well. At 37, I thought, okay, I need to do something like a gap year. If I don’t do it now, I won’t do it ever. I had no idea what that gap year would look like, but I ended up doing some volunteering work in Venezuela, teaching street children English. I also did an internship at the Potsdam Institute for Climate Impact Research, where I have now held a senior scientist position for 15 years. That internship resulted in them offering me a job, which by now has become a permanent position. I have a very privileged position: I can work on more or less whatever I want. As long as I publish a few papers per year in which the word “climate” occurs somewhere, they leave me alone. And that’s fine. It allows me to work on all kinds of stuff, which I’ve done over the years, including game theory.
I’ve also worked on dynamical systems theory, chaos theory, time series analysis, environmental economics, and some modeling of social dynamics, like opinion formation processes—trying to model why movements such as the Fridays for Future movement grow while others don’t. I’ve modeled all kinds of things and analyzed all kinds of data. One recurrent theme has been trying to analyze international climate negotiations from a rational point of view using game theory and so on.
That is a very interesting journey indeed that you have embarked on so far. After doing so many different things, why did you decide to turn your attention to AI safety?
At some point, I got frustrated because that knowledge didn’t seem to have any influence on reality. There’s a huge body of literature that tells governments how they can sign self-enforcing treaties that everyone would comply with out of self-interest and would solve global problems such as climate change, but no one’s doing that. At the beginning of last year, when everyone was speaking about GPT-3—or maybe it was 3.5—and the dangers coming from that, everyone was astonished by AI, and I was very frustrated with my current work. I thought, okay, let’s look into AI safety as a potential field, and as I had no funded projects to manage and also no people to supervise in that year, I went for it.
So, I decided the year 2023 would be my year to explore AI as a field. That led me to try to find a niche that fits my background and would be impactful. I identified two things I could work on, and I’m still working on those two things.
AI seemed like a natural fit given my background in formal sciences and dynamical systems. I also have a lot of friends who are effective altruists, and I had already read many things on the EA forum and noticed that AI was prominent there. I started reading material on LessWrong, and that convinced me this is really a pressing issue. While not entirely neglected, it seemed less treated than other issues like climate change. That was the motivation to start working on it.
When I began reaching out to people in the community through calls and emails, they connected me with others. Soon, I was talking to quite prominent people, despite not having anything concrete to offer besides my background in other fields. For example, I ended up talking to Ethan Perez from Anthropic, who at the time was essentially leading the development of Claude. He spoke with me for half an hour, and it felt like I had his attention. If I had had an idea at that moment, it felt like I could have pitched it, and if it had convinced him, he might have made it happen the next day.
That felt like a very short potential road to impact compared to climate impact. It felt like talking to Robert Habeck or someone of equivalent influence on climate. I also spoke with a grant maker from Open Philanthropy, Ajeya Cotra, and she asked me, “What would you need money for?” I said, “I have no idea yet; I’m just trying to get an overview and see where my place might be.” She replied, “Maybe we can fund smaller things like conferences or whatever.” I said, “Okay, I might need some money to attend a conference.” She then asked, “What else?” I said, “Maybe I want to bring some people together.” I just made it up, really.
I mentioned that I had an idea to connect the field of social choice theory—the theory of voting, collective decision-making, and deliberation—with AI safety. For example, social choice theory might help with fine-tuning large language models (LLMs) based on human input. She thought it was a good idea and told me to send her a budget. After the call, I thought, did this really happen? This was so different from how funding is organized in science. I realized I had to follow through.
I reached out to some people at Berkeley in the social choice community and a group specializing in logic and the philosophy of science, a very renowned group. I asked if they could co-organize a workshop with me. I felt it needed to happen in the Bay Area so that relevant people from industry and academia could easily attend. One of my co-organizers happened to know Stuart Russell from CHAI, and he suggested involving him, saying, “That might open some doors.” I agreed enthusiastically.
We ended up organizing what I think was a very successful, small, invitation-only workshop in Berkeley in December last year. It brought together many important people and explored the idea of applying voting methods and deliberation to reinforcement learning from human feedback—the main method used for fine-tuning LLMs—and to deciding on the “constitution” of an AI. This concept is used by Anthropic to guide the development of their LLMs. The workshop introduced collective decision-making processes into the field. While this is currently more of a community-building exercise for me, it’s one of the projects I’m working on and may develop further in the future.
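To make the connection a bit more concrete, here is a minimal sketch of how a voting rule could aggregate annotator feedback for RLHF instead of simply averaging labels. The choice of Borda count and the toy data are illustrative assumptions, not the specific methods discussed at the workshop.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Aggregate annotators' rankings of candidate completions via Borda count.

    `rankings` is a list of lists; each inner list orders completion IDs
    from best to worst according to one annotator. Returns completion IDs
    sorted by total Borda score (higher = more preferred collectively).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, completion in enumerate(ranking):
            scores[completion] += n - 1 - position  # best-ranked gets n-1 points
    return sorted(scores, key=scores.get, reverse=True)

# Three annotators rank four candidate completions A-D of the same prompt.
annotator_rankings = [
    ["A", "B", "C", "D"],
    ["B", "A", "D", "C"],
    ["A", "C", "B", "D"],
]
print(borda_aggregate(annotator_rankings))
# The collectively preferred completion could then serve as the "chosen"
# example in a preference dataset for reward-model training.
```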
I think it is very encouraging how you got in touch with so many important people in the field just by reaching out and seeing what came back from the echo chamber. Building on these conversations, you figured out a direction for yourself that you think can be impactful for the future of safe AI systems. Could you elaborate more on that?
Almost the whole field of theoretical economics assumes people are rational. That means they’re maximizing something—they have preferences, a utility function they want to maximize—and this forms a strong paradigm for predicting people’s behavior based on the assumption that they maximize something. Behavioral economics, on the other hand, shows through numerous experiments that this is a flawed model for humans. Humans don’t actually behave this way in reality. Yet, the model seems to have strong normative power.
In philosophy, utilitarianism is essentially the idea of maximizing utility, whatever that may be. This paradigm is also strong in ethical theory and, of course, in machine learning. In machine learning, you have a metric that measures how good the model is, and the goal is to optimize it by minimizing the loss or maximizing the reward. This paradigm is deeply embedded in machine learning and, consequently, in much of alignment theory, where many come from a rationalist background. They assume a rational agent should maximize its utility and extend this idea to AI systems, suggesting that an AI system should also be a rational agent.
It seems intuitive: an AI system should maximize something. The whole problem is then framed as ensuring it maximizes the "right" thing. That’s why the field is called “alignment”—we want to align the AI’s objective function with our goals. However, this approach has significant problems. First of all, we don’t have a universally agreed-upon objective. Behavioral economics strongly underscores this point. Even if we could agree on an objective, your objective function might differ from mine. This raises the question: whose objective function should we use? Aggregating utility is a deeply problematic philosophical issue, as interpersonal comparisons of utility are notoriously difficult, and many argue they’re impossible.
This idea of an AI system maximizing utility is fundamentally flawed. Stuart Russell, for example, makes this point clearly in his book Human Compatible, arguing that we need to move away from the idea that an AI system should maximize something. Perhaps AI systems shouldn’t be rational agents at all. Max Tegmark, for instance, suggests they should be very powerful tools—tools without their own goals. While I wouldn’t go that far, I do believe AI systems should not aim to maximize a specific objective because there are theoretical reasons why this approach is dangerous. If you’re maximizing the wrong objective function, the consequences could be catastrophic.
For example, imagine an AI system tasked with managing the German economy, with the sole objective of maximizing GDP. The system might take extreme actions to achieve this goal, such as waging war on neighboring countries if it calculated that doing so would increase GDP. It might completely ignore environmental considerations or climate impacts because those weren’t explicitly included in its objective function. When maximizing a complex objective like GDP, the system might identify a single extreme policy that optimizes the target, leading to outcomes no one wants.
One might argue that we could avoid such outcomes by adding constraints—e.g., prohibiting war or requiring climate considerations. While this is a good idea for the issues we can foresee, there will always be unforeseen consequences. Some options are so ingrained in human norms that we wouldn’t even think to specify them as constraints. However, the AI would consider all options, including those we haven’t thought of. This makes it impossible to define all the constraints necessary to keep maximization safe. Thus, the entire idea of maximization is flawed.
So this is the second niche I’ve identified, in addition to the social choice theory stuff. Much of alignment theory still operates within the optimization paradigm, but I want to explore ways of making AI systems safe that don’t rely on optimization. Instead, these systems could make decisions based on more finite goals, for example, by adhering to constraints where any outcome within those constraints is acceptable.
In the real world, people often talk as if they’re optimizing something, but they rarely actually do so. Optimization requires significant cognitive capacity, and thankfully, humans are not very good at it. If we were, the world would be much stranger. Sometimes you meet people who really seem to optimize something and then you notice that something seems to be wrong with these people.
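As a toy illustration of this contrast, here is a minimal sketch with made-up options and numbers: a pure maximizer of a single objective picks the extreme option, while a non-optimizing rule treats every option that satisfies its constraints and aspiration level as acceptable and simply picks one of them.

```python
import random

# Hypothetical policy options with a measured objective (e.g., GDP growth)
# and a side effect that was never written into the objective function.
options = {
    "business as usual":    {"gdp_growth": 1.5, "violates_norms": False},
    "green transition":     {"gdp_growth": 2.0, "violates_norms": False},
    "wage war on neighbor": {"gdp_growth": 9.0, "violates_norms": True},
}

def maximizer(options):
    # Pure optimization: picks the extreme option, norms be damned.
    return max(options, key=lambda o: options[o]["gdp_growth"])

def satisficer(options, aspiration=1.0):
    # Non-optimizing rule: any option that respects the constraints and meets
    # the aspiration level is acceptable; choose among them (here: at random).
    acceptable = [o for o in options
                  if not options[o]["violates_norms"]
                  and options[o]["gdp_growth"] >= aspiration]
    return random.choice(acceptable) if acceptable else None

print("maximizer picks: ", maximizer(options))   # -> "wage war on neighbor"
print("satisficer picks:", satisficer(options))  # -> one of the acceptable options
```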
Thanks for explaining in such detail. It is a very interesting direction indeed. Do you also see other areas in AI safety that are underrepresented and should receive more attention?
In a sense, yes, there’s a growing movement that goes by different names. Some of these include "safe by design," "guaranteed safe," or "provably safe." There are position papers on these concepts with contributions from prominent figures like Max Tegmark, Davidad, Stuart Russell, Yoshua Bengio, and others. They argue that we need to approach this in a fundamentally different way. If you look at the overall composition of AI systems, they need to be modular. It cannot just be a monolithic system like one big GPT, where you feed it input, get output, and hope to interpret its behavior afterward using methods like interpretability. That approach is fundamentally flawed. Stuart Russell, for instance, advocates for a more modular design.
The system should consist of components with clearly defined roles. For example, there could be a perception component tasked with making sense of raw data and converting it into a meaningful, abstract embedding space with concepts relevant to human life. This perception model would be trained specifically for that purpose. Another component might be a world model, responsible for taking an abstract description of a state and a potential action, and then predicting the outcome of taking that action in that state. It would not evaluate the outcome but simply make predictions about consequences. This world model could be trained using supervised or self-supervised learning to make accurate predictions about outcomes, which is a clearly defined task.
There could also be other components designed to predict specific forms of evaluation, such as quantifying the power of an individual in a given situation, the amount of entropy, or the degree of change introduced. These evaluations could use criteria that matter in terms of achieving goals and ensuring safety. This evaluation model could be a neural network trained on annotated human-provided data. For example, similar to the reward models used in RLHF (Reinforcement Learning from Human Feedback), there could be models for assessing harmlessness, helpfulness, or other aspects that are critical. Evaluation components like these could be neural networks, Bayesian networks, or whatever architecture best fits the task.
The decision-making component would rely on the outputs of these other components. For instance, the perception model might indicate, "You are standing in front of a lion," and the decision component would then ask the world model, "Given this situation, what could I do?" The world model might respond with options such as running away, playing dead, or shouting. The decision component would further query the world model to predict the consequences of each option. For example, if running away has a 20% chance of survival but an 80% chance of being caught by the lion, it would return this information.
The evaluation model would then assess these outcomes. For example, it might evaluate the likelihood of being eaten and conclude that being eaten is not good, based on human-provided data. The decision component would then synthesize this information and make an informed choice, such as deciding to play dead. Importantly, this decision algorithm should be hardcoded, not learned through reinforcement learning. Hardcoding the decision algorithm ensures transparency and interpretability. For example, the code could involve querying the world model for all possible actions, using the evaluation model to assess the consequences, applying a weighted sum of the different criteria, and selecting an action using a softmax policy.
This approach allows investigators to understand why the system chose one action over another, as the process is explicitly coded and modular. While some components already exist in current systems, others may need to be redesigned from scratch. The decision algorithm, for instance, is one such element that needs to be carefully worked out before moving forward.
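To illustrate the kind of hardcoded decision algorithm described above, here is a minimal sketch; the component interfaces, weights, and temperature parameter are hypothetical placeholders rather than an existing implementation.

```python
import math
import random

def decide(state, world_model, evaluators, weights, temperature=1.0):
    """Hardcoded, transparent decision step for a modular agent.

    world_model.actions(state)     -> list of possible actions
    world_model.predict(state, a)  -> predicted outcome of taking a in state
    evaluators: dict of criterion name -> function(outcome) -> score
    weights:    dict of criterion name -> weight of that criterion
    """
    actions = world_model.actions(state)
    scores = []
    for action in actions:
        outcome = world_model.predict(state, action)
        # Weighted sum of the separately trained evaluation criteria.
        scores.append(sum(weights[name] * evaluate(outcome)
                          for name, evaluate in evaluators.items()))

    # Softmax policy over the aggregated scores: stochastic, but every step
    # of the choice is explicit and auditable.
    exps = [math.exp(s / temperature) for s in scores]
    probs = [e / sum(exps) for e in exps]
    return random.choices(actions, weights=probs, k=1)[0]

# Toy usage, loosely following the lion example above (numbers are made up).
class ToyWorldModel:
    def actions(self, state):
        return ["run away", "play dead", "shout"]
    def predict(self, state, action):
        # Predicted probability of survival for each option.
        return {"run away": 0.2, "play dead": 0.7, "shout": 0.4}[action]

evaluators = {"survival": lambda p: p}   # evaluation model: higher is better
weights = {"survival": 1.0}
print(decide("facing a lion", ToyWorldModel(), evaluators, weights))
```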
A quick change of topic: there seems to be a lot of independent AI research out there. How do you think this research can have an impact on the work that is going on at the big AI labs, e.g., OpenAI, DeepMind, or Anthropic?
So my overall theory of change is that eventually someone needs to stop the unsafe approaches, and that can only be done through regulation. Eventually, someone with enough power in the real world needs to put a stop to the current practices. AI governance, for example, includes concepts like the "narrow path," which suggests that the US and China need to agree on a high-level plan to pause superintelligence development for at least 20 years. This kind of intervention needs to happen, but it will only be feasible if decision-makers can point to a clear alternative: a safer way of doing things.
If decision-makers can only shut down the whole industry without offering a safer path, it becomes nearly impossible to sell the idea. It would appear as though they are shutting down the entire field, which would be difficult to justify. That’s where I see the value of this research—to provide enough evidence that a safer approach is possible. Decision-makers need to trust this enough to enforce a pause, stop unsafe practices, and direct resources toward promising avenues for safer methods. This could involve providing funding to scale these alternatives and developing proofs of concept in economically relevant situations. Once that’s accomplished, the pause button could be lifted.
My work focuses on one component of an AI system. I’ll never produce a complete AI system because I lack the resources and skills to do so. Instead, I am working on developing a decision-making algorithm. My road to impact involves publishing papers to get the attention of academics and creating software components that others can experiment with in small, toy environments. These examples aim to demonstrate the value and apparent safety of this type of algorithm.
Next, I want to get the attention of an industry actor in a specific application area important for safety—for example, self-driving cars. I plan to collaborate with a self-driving car company to develop a concrete proof of concept, perhaps in a highly detailed simulation. The goal would be to show how a car using these algorithms would behave. If successful, the next step would involve deploying it on real streets. If that works, it would provide a proof of concept demonstrating that these approaches can be effective in well-defined areas.
The next step would be scaling up to more generic applications, such as creating a general AI strategy assistant. This could be used for various strategic decision-making scenarios, such as career planning, business strategy, or even planning holidays. This stepwise approach to scaling demonstrations could eventually gain broader attention. The ideal shortcut would be for major AI labs to recognize that this approach is not only safer but also potentially more capable. There’s an argument that safety research might not necessarily reduce capabilities and could, in fact, enhance them.
However, I can’t rely on big companies to voluntarily adopt this alternative approach. While it’s a hope, my focus remains on creating clear, stepwise demonstrations of safety and effectiveness to drive adoption and regulation.
Talking about the “narrow path”, you stated that the US and China should sign a treaty. What about Europe?
Yeah, eventually Europe and other countries should obviously join in as well. I do think that, given the current state of progress in AI capabilities research, China and the US are maybe the most relevant, and of course signing a bilateral treaty is always easier than signing a multi-party treaty. So this would be a bottom-up approach, which could have worked in climate as well. (I have some publications on that: forming small coalitions bottom-up that can then grow over time.) I think that is more promising than trying to bring all 200 countries of the world to one table to sign one big treaty, which would then be totally watered down and meaningless. Take COP29 and earlier climate summits as an example.
What advice would you give to someone starting in AI safety?
It certainly depends on the background. What I’ve noticed is that there are many highly motivated people who lack the relevant technical skills. This is often because they are young and have just started, perhaps pursuing a PhD or even just a bachelor’s degree. They want to contribute, but since they don’t have the necessary skills in machine learning or related fields, they gravitate towards community building or organizing—a typical EA (Effective Altruism) approach. That’s perfectly fine if you don’t have a technical background, but if you do, I think it’s important to work on something concrete that aligns well with your expertise.
When I was in Berkeley last year, I happened to sit next to Paul Christiano over lunch purely by coincidence. I thought, “This is my five minutes with Paul Christiano.” I felt I had to ask him an intelligent question. At that point, I was still unsure about what I should work on. I briefly described my background and asked him, “From your point of view, what should I work on?” His advice was simple yet insightful: find something that feels neglected and fits your background really well.
We still need to explore a lot of different paths, and it’s very uncertain which direction will ultimately be the most helpful. Don’t focus on something just because it’s trendy or because someone says, “This is the mechanism everyone should focus on.” That’s not the right approach. You should find your niche and explore something that truly fits your background and skills.
Maybe as a last question: is there anything on your mind that we haven’t touched on yet, something not covered by any of my questions, where you would say, “Hey, I’d really like to include this”?
I have noticed that many people coming from the EA or rationalist communities tend to think that it’s enough if those communities approach this and that it doesn’t need to be mainstreamed.
I think it needs to be mainstream. We need a lot of people from different backgrounds and much more diversity, including diversity in worldviews. I don't think it's a good idea to have only value-aligned people working on this who are EAs or rationalists. That would be far from ideal because it would miss out on relevant perspectives. This effort needs to diversify and become mainstream. That also means the EA and rationalist communities need to let go a little bit. Obviously, this involves relinquishing some control, but if they genuinely care about the cause, they should be willing to do so. Additionally, they need to find some kind of peace with the AI ethics community.
It's unfortunate that the AI ethics community and the alignment community are on such bad terms at the moment, at least at a high level. If you look at very vocal people, the ethics community seems hostile toward certain parts of alignment, especially the rationalist side, and for good reasons. They perceive those involved as arrogant, young, white, male, privileged individuals from the Bay Area who seem to think they can solve big problems that might only be speculative.
I can see why their buttons are pressed and why they're hostile. However, on a more rational and calm level, I think these communities should be natural allies because they address similar risks across a spectrum. Hopefully, measures to address one type of risk can also help mitigate the other. For example, the pause letter from last year, signed by thousands of researchers and originating from the Future of Life Institute—a clearly EA-aligned institution—was not widely signed by AI ethics people. They criticized the letter for failing to mention short-term risks. However, a pause would obviously also be helpful for addressing the short-term risks that concern the AI ethics community.
Many thanks Jobst for this very insightful interview – I learned a lot and I am sure the readers of this will, too!
[1] https://www.pik-potsdam.de/members/heitzig

