Safe AI with Singular Learning Theory ..
.. an interview with Jesse Hoogland from Timaeus
TLDR: Jesse Hoogland is a theoretical physicist from the Netherlands who is the founder and executive director of Timaeus. Timaeus is a non-profit formed in 2023 with the mission to empower humanity by making fundamental progress on AI safety. Its vision is to use singular learning theory (SLT) to develop connections between a model’s training data and its resulting behavior, with applications to AI interpretability and alignment. Timaeus has validated initial predictions from SLT on toy models. Now, they are building tools that make it possible to interpret the training processes of frontier-sized models. In doing so, Timaeus is establishing a new field of interpretability: developmental interpretability.
Mykhaylo Filipenko: Jesse, thanks a lot for joining and taking the time to do a quick interview. I think the first question is always the same: Would you like to introduce yourself quickly?
Jesse Hoogland: Thank you, I'm Jesse Hoogland. I'm the executive director of Timaeus. We're an AI safety non-profit doing research on applications of singular learning theory, SLT, to AI safety and alignment. We'll talk about the details in a second. I'm primarily in charge of outreach, operations, and management for the organization. I'm also involved in a lot of the research we do, mainly in a research engineering capacity.
My background is theoretical physics. I did a master's degree at the University of Amsterdam and then spent a year working on a health tech startup that went nowhere. At some point, I felt a growing tide of dread at the rate of AI progress, and I decided to make the pivot into AI safety. It was the right call: shortly thereafter I met my co-founders, I discovered singular learning theory, and we pretty quickly got started on Timaeus, the project we're working on right now.
Mykhaylo Filipenko: I think you already jumped into the second question: What is the history behind Timaeus? How did this whole thing get started?
Jesse Hoogland: I was just starting my transition into AI safety when I went to a Dutch AI safety retreat, and there I met Alexander Gietelink Oldenziel, who's one of my co-founders. On the train ride back from that workshop I asked him: “What do you think are interesting directions within AI safety?”, and he said, “I have two answers: one, computational mechanics, and two, singular learning theory.” Then he shared some links and I started reading. I read the singular learning theory content, and I think just by chance I spent more time on it, and I saw words in there that I recognized: things like “phase transitions” and “partition functions”. This is the language of statistical physics, which felt familiar given my background, so I decided to look into it further. I ended up writing a blog post on what SLT says about neural networks.
As I was finishing up that blog post, guess who walks into the office where I was working from? Alexander entered, completely spontaneously and unplanned, and he sees the post and his reaction is, “Wow, this is great… we should organize a conference on SLT.” And I thought, Alex, do you know how much work goes into [organizing conferences]? How much preparation is needed in the upcoming three months? And then he puts down the phone and says, “I already have $15k down. We just need to raise the rest.” And at that point, I'm like, “Okay, we have no choice. We're going to have to do this.”
So now we're scavenging for funds; we had to raise more money to make this happen because $15k wasn't enough. We ended up going to EA London, and there we talked to some friends, including Alexandra Bos of Catalyze Impact and Stan van Wingerden, who became the third co-founder.
Alexandra suggested that we go through the entire list of people who were at that event and had “earning to give” in their bios, and ask them for donations toward the conference we were trying to organize. So that's what we did: we crawled through the list and reached out to people individually. That's how we raised the remaining funds we needed to make the conference happen.
At that conference, we brought together two communities: Daniel Murfet, a researcher at the University of Melbourne, and his group on one side, and people interested in AI safety on the other. We started thinking really hard about what this theory of neural networks could do for AI safety. That led to the agenda we call developmental interpretability.
Developmental interpretability aims to understand what's going on inside of neural networks by tracking how they change over the course of learning, in analogy with developmental biology. That was the starting point where we thought: there's something here that we could actually pursue to advance AI safety.
Shortly after that, we raised some initial seed funding through Evan Hubinger and through Manifund. A bit later, we raised some additional funds through the Survival and Flourishing Fund. That was enough to start hiring and to do this research.
Initially, our research was focused very much on validating SLT as a theory of deep learning and seeing that the predictions it makes are real: we looked at small toy systems in which SLT can make precise predictions and validated that those predictions bear out empirically.
That was our initial focus: in the first year we put out a series of papers that did just this. About six months ago we reached a state where the theory was starting to look pretty good; we were making contact with reality in a bunch of places. So the next step was to start scaling things up to larger and larger models. And that has been the story over the last six months, even a little longer: scaling these techniques up to models with billions of parameters.
We're not at frontier scale quite yet, but we're working with models that are already very capable, so we can start applying these interpretability techniques to models with interesting capabilities.
That’s where we are today.
Mykhaylo Filipenko: Maybe just one last question on timelines: When was this train ride? And in which year was the conference? And how many people attended the first conference?
Jesse Hoogland: So, the Dutch AI safety retreat was in November 2022, and I wrote the blog post in early 2023, I think January. We had the conference in June. Shortly after, we got our initial funding over that summer. By October we were ready to go.
The conference was split in two parts. The first part was digital; it was basically a primer on the material. I think we probably had more than 100 unique visitors for that, and then the second week was in person, where we brought together about 40 to 50 people.
Mykhaylo Filipenko: You already started to explain SLT briefly, but could you explain it in maybe two paragraphs? What is the idea behind it? What are the main concepts of singular learning theory?
Jesse Hoogland: The one sentence version is: Singular learning theory suggests that the geometry of the loss landscape is key to understanding neural networks.
Currently all of our existing techniques for trying to align models look like this: Train the model on examples of the kind of behavior you would like to see. It's a very indirect process. We iteratively update models and tweak them a little bit to behave closer and closer to the behavior we would like to see in these examples.
Techniques that fall under this heading include constitutional AI, RLHF, DPO, deliberative alignment, and refusal training. These are all basically variants of the same idea: change the data and train on it. This is important because it means that, in practice, the process of trying to make models actually share our values and goals is essentially the same as the process we use to make these models capable in the first place, which is pre-training (or just machine learning). But the problem is that this process is implicit and indirect.
We don't understand how it works, and we don't know whether the way it's actually changing models is deep or significant or robust or lasting. So, as we develop more and more powerful systems, we'd like to be more and more sure that we're actually aligning them in a meaningful way with what humans want. And so we need to better understand the relationship between the training data we give them, the learning process, meaning how the models progressively learn from that information, and the final internal structures that models develop.
Structures like organs, and how those structures actually underlie their behavior and generalization properties. And singular learning theory provides a starting point for characterizing the relationship between these different levels.
Mykhaylo Filipenko: To my understanding, it sounds like you did a lot of theoretical groundwork on SLT to show that those concepts work. Do you run empirical experiments, and what do they look like?
Jesse Hoogland: I can give a few examples. But before doing so, I'll say a little bit more about how the theory works. When we're training a model, we specify what's called the loss landscape. You basically have to imagine the learning process for a neural network as a huge landscape that you're walking down, step by step, trying to find the lowest point. If you do this long enough, you'll find very low solutions. These solutions correspond to configurations of model internals that achieve high performance and do all the kinds of things that current-day language models can do. The key idea of SLT is that the topographical information in this landscape contains all the information about the model's behavior at the end.
Hence, the tools we're developing are grounded in this theory. These are tools that allow us to probe this geometry. You can imagine flying over the landscape in a plane and trying to sample a very coarse picture of its salient features and landmarks.
For the physicists among us: it's like an atomic force microscope; the math is the same. These are spectroscopes. We're trying to sample a coarse-grained picture of what this landscape looks like in the vicinity of our models, and there's information there that we're trying to find.
What we do has two components: on the theoretical side, we're trying to figure out how to extract more information from the samples of this geometry that we collect. And on the experimental side, we're trying to come up with more and more accurate probes that yield more and more information that you can do something with. We build these measuring devices and then use them on real systems to learn something new.
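To make this concrete, here is a minimal, purely illustrative sketch in PyTorch of what such a probe could look like: it wanders around a trained model's weights with stochastic-gradient Langevin dynamics (SGLD), pulling the samples back toward the trained solution, and records the average excess loss in that neighborhood as a rough proxy for how flat or degenerate the landscape is there. A full estimator of SLT's local learning coefficient would rescale this quantity by the sample size and an inverse temperature; all names and hyperparameters below are assumptions for illustration, not Timaeus's actual tooling.

```python
# Illustrative sketch only: probe the loss landscape around a trained model by
# sampling nearby weights with SGLD and measuring the average excess loss.
# Function names and hyperparameters are assumptions, not Timaeus's tools.
import copy
import torch


def estimate_local_geometry(model, loss_fn, data_loader, n_steps=200,
                            step_size=1e-4, beta=1.0, gamma=100.0):
    """Return the average excess loss in a neighborhood of the trained weights,
    a rough proxy for how flat or degenerate the landscape is locally."""
    device = next(model.parameters()).device
    w_star = [p.detach().clone() for p in model.parameters()]

    # Reference loss at the trained solution w*, on a single batch for simplicity.
    xb, yb = next(iter(data_loader))
    with torch.no_grad():
        base_loss = loss_fn(model(xb.to(device)), yb.to(device)).item()

    sampler = copy.deepcopy(model)  # walk a copy so the original stays at w*
    data_iter = iter(data_loader)
    excess = []
    for _ in range(n_steps):
        try:
            xb, yb = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            xb, yb = next(data_iter)
        xb, yb = xb.to(device), yb.to(device)

        loss = loss_fn(sampler(xb), yb)
        sampler.zero_grad()
        loss.backward()

        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), w_star):
                if p.grad is None:
                    continue
                # Langevin step: loss gradient + pull back toward w* + Gaussian noise.
                drift = beta * p.grad + gamma * (p - p0)
                p.add_(-0.5 * step_size * drift
                       + step_size ** 0.5 * torch.randn_like(p))

        excess.append(loss.item() - base_loss)

    return sum(excess) / len(excess)
```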
One prediction that SLT makes is that the learning process for transformers and other neural networks should take place in stages. Just as in biological systems: development from an embryo to an adult doesn't look like me just gradually growing bigger and bigger in size. Rather, all of my organs develop in a series of stages; my cells differentiate in really discrete steps. The theory predicts that the same should be true for neural networks.
One early project we did looked at very simple transformers trained on natural language to investigate whether this is true. If you look only at the loss, you notice that it goes down very smoothly; there is no real evidence that anything discrete or stage-like is happening. But if you look at the results of the geometric measurements you get from these SLT-informed tools, you find that there is hidden stage-wise development going on.
You can find plateaus, and these plateaus are really markers of developmental milestones. If you look further into them, it turns out these stages are actually meaningful. The model really is initially learning very simple relationships between neighboring words. Then it moves beyond bigrams to trigrams and so on: step by step, it starts to learn longer sequences of words and phrases. Then it learns what's called the induction circuit, in several parts. This is a more sophisticated kind of internal structure that develops before the learning process finally converges.
You can detect all these physically meaningful things just by looking at this raw information about how the geometry is changing locally, as predicted by the theory.
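As a hypothetical illustration of that workflow, one could run a probe like the sketch above over saved training checkpoints and compare the resulting curve with the loss curve; plateaus in the geometry estimate, rather than in the loss, would be the candidate stage boundaries. The checkpoint paths and the build_model constructor below are assumed placeholders, and estimate_local_geometry refers to the earlier sketch.

```python
# Illustrative sketch only: scan saved checkpoints with the geometry probe above
# and look for plateaus that the loss curve by itself does not reveal.
import torch


def scan_checkpoints(checkpoint_paths, build_model, loss_fn, data_loader):
    """Return a list of (checkpoint_path, geometry_estimate) pairs."""
    history = []
    for path in checkpoint_paths:
        model = build_model()  # assumed user-supplied constructor for the architecture
        model.load_state_dict(torch.load(path, map_location="cpu"))
        geom = estimate_local_geometry(model, loss_fn, data_loader)
        history.append((path, geom))
        print(f"{path}: local geometry estimate = {geom:.4f}")
    # Plateaus in this sequence are candidate developmental stage boundaries.
    return history
```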
Mykhaylo Filipenko: That was very interesting, this comparison with a living organism. I had never thought about it this way.
The goal of the whole thing is AI alignment, i.e., to make AI systems safe. You guys do independent research work, but to many people it seems like the end game is happening at the big labs, right? And the things that frontier labs are doing are more and more behind closed doors. So what is your idea, or your organization's idea, for having an impact on this whole process?
Jesse Hoogland: I'll try to distinguish the microscopic theory of impact, the research theory of impact, from the macroscopic or organizational theory of impact.
So let's start with the research theory of impact. I see this as composed of two parts. One part is that I want to come up with new tools for interpretability: I want to be able to read what's going on inside of a neural network. And I want new tools for alignment: I want to be able to write our values into models in a more reliable way. These interpretability tools look something like what I discussed previously: tools to extract information from the local geometry of the loss landscape.
And what we hope here is that SLT could give us tools for guiding the learning process towards the kinds of outcomes we want instead of what we do currently: We take all of the data on the internet and then we throw it into a cauldron. The cauldron is called a neural network architecture. And then we start swirling this mix of potions and reagents over a fire. The fire is called the optimizer. And we hope for the best. And we hope that we don't accidentally mix noxious ingredients together and produce chlorine gas or whatever. But of course, we don't really know. Unfortunately, it's the internet we’re training against. So, we probably are going to produce chlorine gas by accident.
What I hope could be the case is that we develop a better scientific understanding of how to choose data and how to design this learning process so that we get the outcomes we want. We want to get to a point where we're combining ingredients in a very fine-grained way; in a way that looks more like modern chemistry than historical alchemy.
I think something like this is possible. So, the research theory of change is to give humanity tools to understand what's going on inside of neural networks and to steer it to desirable outcomes.
Yeah, I'm imagining tools that you would use while you're training a model, tools that warn you when something unintentional is happening or when structure is forming that we don't understand. Then we could back up, try again, and change the trajectory a little bit.
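A toy version of such a warning, building on the hypothetical checkpoint scan sketched earlier, might simply flag unusually large jumps in the geometry estimate between consecutive checkpoints; the relative threshold below is an arbitrary illustrative choice, not a method Timaeus has published.

```python
# Illustrative sketch only: flag abrupt changes in the geometry estimate between
# consecutive checkpoints as a crude "something structural is happening" warning.
def monitor_development(history, rel_threshold=0.5):
    """history: list of (checkpoint_path, geometry_estimate) pairs."""
    warnings = []
    for (prev_path, prev), (curr_path, curr) in zip(history, history[1:]):
        if prev != 0 and abs(curr - prev) / abs(prev) > rel_threshold:
            warnings.append(f"Large shift between {prev_path} and {curr_path}: "
                            f"{prev:.4f} -> {curr:.4f}")
    return warnings
```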
Mykhaylo Filipenko: And the macroscopic theory? The organizational part?
Jesse Hoogland: I think we should expect that at some point in the next few years the big labs will probably close their doors and take all their research private. Right now we already don't hear much about what's going on internally, and soon we will hear even less. What does it look like to prepare for this? There are a few things. One thing you can do is publish research that makes alignment easier and cheaper to do, or in other words, improves the trade-off between making models more aligned and making them more capable. Then the labs will read it, and if it's compelling enough, their internal researchers and automated researchers will absorb this information and let it guide their internal development.
One step up from this is to do targeted outreach to the labs: To have personal contacts in the labs, to give talks at the labs, to make sure people at big labs are aware of your research, to come up with proposals for research projects. You have to see yourself as a salesperson for your research agenda and try to make sure that the labs are actively including your work in their agendas.
So we're doing both of these things. Longer term, there are crazier possible outcomes where governments get more involved; you can think of some sort of Manhattan Project, where things could get weird. I don't know fully how to prepare for all these worlds, but I think these two directions – just doing good research and doing targeted outreach to make sure the labs are aware of it – can make quite a big difference.
I think we see that very well with, for example, Redwood Research, whose work has now changed lab policy at, I think, all the major scaling labs. So we see that it is totally possible for a non-profit to have this kind of impact on big-lab research agendas.
Mykhaylo Filipenko: That is encouraging to hear – that as a non-profit with good research and proactive outreach you can actually have an influence on things. So maybe two last questions. The first one is about outreach: What was the reaction of the community to singular learning theory?
Jesse Hoogland: So initially there was obviously some skepticism which is warranted. We're making pretty bold claims here about why neural networks generalize and what might be going on inside of them. Understandably, people want to see evidence and that's indeed what we also wanted to see, which is why we focused on validating the basic science.
As we progressed, I think some of the skepticism has moved more towards “Okay so maybe you can say something about neural networks but how do you actually cash this out in terms of impact for safety?” This has also been a question for us and it's been a major focus for us to clarify our vision for what SLT could do for safety. We recently put out this position paper called “You are what you eat - AI alignment requires understanding how data shapes structure and generalization” [1].
In this paper we put forth our broader vision of what SLT's role in alignment could be. Now that we've put out a vision, the question is: can we deliver on it? There are still questions about how to reach frontier model scale and what that means. I think people are generally very excited, and we've had very positive reactions in the end. Skepticism is still warranted from a bunch of people, but I think we will soon show that SLT can actually make a difference, and that this will help us with near-term and long-term safety problems.
Mykhaylo Filipenko: And the second question: If people are excited to get started with SLT, what would you recommend as a starting point?
Jesse Hoogland: There are a few places. The first is a Discord server for people interested in singular learning theory and developmental interpretability [2]. That's one of the best places to stay up to date with what's happening and get informed about new papers.
Then there's also a page where we've curated a selection of learning resources [3]. If you want to learn more about SLT, you should go through those roughly in the order listed there.
If you've got a mathematical or physics background, then at some point you'll want to open up the “gray book”, which is our name for Sumio Watanabe's Algebraic Geometry and Statistical Learning Theory, the textbook that lays out singular learning theory [4].
And of course, you can just start reading the papers if you want more of the applied, empirical side and want to see what this looks like in practice. I think those are the resources I would recommend.
And yes, we have a list of project suggestions [5]. It's a little out of date but not too much. There are some ideas for things you might want to try out.
Mykhaylo Filipenko: Sounds very good. All right then, thanks a lot for your time. It was very insightful and a pleasure to talk to you. Next time at the whiteboard again!
Jesse Hoogland: Thank you, Mike. My pleasure.
[1] “You are what you eat – AI alignment requires understanding how data shapes structure and generalization”, https://www.arxiv.org/pdf/2502.05475
[4] Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009