The topics of AI safety, AI alignment and eventually superalignment have gained significant prominence in public discourse since the release of ChatGPT in late 2022. At least some people have realized that very strong and impactful AI systems are not a merely hypothetical topic to be left to scientists and intellectuals in university halls and annual summits, but potentially a very serious issue in the near future.
Consequently, research in the field has picked up momentum. Although alignment research is nowhere near capabilities research in terms of headcount, compute or, basically, “capital deployed”, we can see a positive trend here.
Thinking about the current state of the art in AI alignment research, I started to ask myself the following question: To what extent can AI alignment research (alignment between AI and humans) help with our own alignment? That is, how could the alignment of humans with each other be improved based on our findings from AI alignment research?
If you think about it, to this very day we are struggling to align the type of intelligence that we should, in principle, know best – human intelligence – at any scale. We struggle to align nations to combat global warming instead of each other. We struggle to align companies to put AI safety first instead of competing with each other in an arms race towards AGI. We struggle to align the teams within a company towards a common mission and vision. We struggle to align the desires and aspirations of two partners in a relationship, and perhaps at the most fundamental level we struggle to align the worldviews, ideas, goals and desires within ourselves. So how in the world are we supposed to align an unknown and alien type of intelligence with our own?
However, with artificial intelligence, we have an “unfair advantage”: in comparison to biological brains, we know exactly how these – let’s call them, anthropomorphically, “digital brains” – are structured and on which data they have been trained. At every point of the computation we can retrieve their complete computational state. We can run as many experiments as we want to find out the most nuanced details about their inner workings. This allows us to understand their behaviour and, hopefully, steer it in the right direction. We can do this because, unlike with humans, we have not granted moral status to AI (yet), and so we can continue until we do (or until we can no longer control these systems).
Suppose that we are successful with our alignment research. Even without AI systems that surpass human intelligence on nearly every currently conceivable task (i.e. AGI), I believe that we could benefit from AI alignment research in another way:
If we can assume that these “digital brains” are sufficiently similar to our own, or at least can mimic our own thinking patterns sufficiently well, then we could think about reverse-engineering what we have found out about the alignment of AIs and applying those findings to ourselves.
Nevertheless, I think that this type of reverse engineering is not limited to the issue of alignment. Let me lay out a couple of examples that come to mind where we could apply the knowledge gained from insights into digital brains to our own:
1) Maybe you have heard that some people think that “backpropagation” is a superior learning mechanism compared to what is realized in our own brains. This topic is still up for much debate, but I wouldn’t be surprised if it were indeed the case: we have seen that engineering (driven by market forces and/or military needs) can outperform evolution. Cars can race faster than the fastest animals and jets can fly much faster than any bird. Thus, it is not unimaginable that the same forces have already created learning mechanisms that are better than our biological ones.
However, optimizing the fuel efficiency and acceleration of a car doesn’t immediately tell you anything about how to improve your performance in a 100m race. Hence, I don’t know off the top of my head what we could adopt from backpropagation to improve our own learning. Nevertheless, I am very excited to imagine that we could apply some of the insights we have about backpropagation to hack our logarithmic learning curve. (A minimal code sketch of what backpropagation actually does follows after these two examples.)
2) There have been a couple of interesting posts and comments, e.g. by Jason Wei from OpenAI and Andrej Karpathy, on how powerful the learning objective of “next word prediction” is during the pre-training of large language models [1]: by optimizing it over a sufficiently large corpus, the model acquires knowledge of grammar, world knowledge, some maths, translation, sentiment analysis and several other things. Thus, the “inverse” question arises: given that a person wants to learn particular skills, can we find strong learning objectives that would facilitate her/his learning process? (A minimal sketch of this objective follows below as well.)
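To make the first example a bit more concrete, here is a minimal sketch of what backpropagation does, using a toy two-layer network on a made-up regression task. The network size, learning rate and task are illustrative assumptions on my part, not anything taken from the posts referenced here:

```python
# A minimal sketch of backpropagation on a toy 2-layer network (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = sin(x) from a small sample.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X)

# Tiny network: 1 -> 16 -> 1 with tanh hidden units.
W1 = rng.normal(0, 0.5, size=(1, 16))
b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, size=(16, 1))
b2 = np.zeros(1)

lr = 0.05
for step in range(2000):
    # Forward pass: compute predictions and the squared-error loss.
    h = np.tanh(X @ W1 + b1)          # hidden activations
    y_hat = h @ W2 + b2               # network output
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass: propagate the error gradient layer by layer.
    # This chain-rule bookkeeping is the "backpropagation" part.
    d_yhat = 2 * (y_hat - y) / len(X)      # dLoss / dy_hat
    dW2 = h.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    d_h = d_yhat @ W2.T * (1 - h ** 2)     # chain rule through tanh
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)

    # Gradient descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if step % 500 == 0:
        print(f"step {step:4d}  loss {loss:.4f}")
```

The key feature is that every weight receives an exact, globally consistent error signal in a single backward sweep; whatever our brains do, they almost certainly don’t have access to such a clean signal.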
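And for the second example, a minimal sketch of the “next word prediction” objective itself. To keep it self-contained I use a tiny bigram model over a made-up corpus instead of a transformer; the corpus and the add-one smoothing are purely illustrative, but the quantity being minimized (the average negative log-probability of each next token given its context) is the same objective that LLM pre-training optimizes, just at vastly larger scale:

```python
# A minimal sketch of the next-word prediction objective with a toy bigram model.
import math
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Estimate p(next | current) from bigram counts, with add-one smoothing.
vocab = sorted(set(corpus))
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p_next(current, nxt):
    return (bigrams[(current, nxt)] + 1) / (unigrams[current] + len(vocab))

# The pre-training objective: average negative log-likelihood (cross-entropy)
# of every token given the token before it. Lower is better.
nll = -sum(math.log(p_next(c, n)) for c, n in zip(corpus, corpus[1:]))
print(f"avg next-word loss: {nll / (len(corpus) - 1):.3f} nats")
```

In a real LLM this single scalar is what gradient descent pushes down over trillions of tokens, and capabilities like grammar or translation fall out as by-products of minimizing it.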
While there are still many challenges ahead before we can do this kind of “reverse engineering”, I am optimistic about it. Due to the fundamental structural differences between our brains (“analog computing machines”) and AI models (“digital computing machines”), several things probably cannot be translated back directly, but gaining fundamental insights into how neural networks process data internally will surely help us learn a lot about ourselves.
[1] https://www.jasonwei.net/blog/some-intuitions-about-large-language-models