Can we have safer AI through certification?
An Interview with Jan Zawadzki from CertifAI
TLDR: Jan Zawadzki is the MD and CTO of CertifAI. Given his background in autonomous driving, we explore which lessons can be transferred from the automotive industry to AI safety. Central topics we talk about are reliability, the operational design domain, and the importance of test data for each particular AI use case.
Dear Jan, many thanks for taking the time to speak with me. Before jumping to the questions – could you introduce yourself?
I'm currently the CTO of CertifAI. CertifAI is an AI testing and certification company. It's a corporate joint venture between PwC, DEKRA and the city of Hamburg, and we focus on testing AI-based systems and AI-based products. Before that, I was head of AI at Cariad, which is the central software development company of the Volkswagen Group. And yeah, I've been in the AI sphere for about eight or nine years now, mostly focusing on the reliability of AI.
Thanks for the introduction. Jumping straight to the questions: you're the CTO of CertifAI as far as I know, and also one of the co-founders. What was the idea and your motivation to start CertifAI?
I'm one of the managing directors, as it's a corporate venture. It was initially founded by the companies that I mentioned, but I'm the managing director together with Robert. And the idea was that I'm convinced that the biggest challenge we have is making the AI do what it's supposed to do. There are a lot of companies that develop AI-based things, but creating a product that is reliably tested is a whole different story. Only if it is reliable does it provide good quality, and only then can you provide a good customer experience. I think this is the most important challenge, and certification only comes at the end. So, only if you develop reliable AI do you also get certified. But I'm excited about the reliability problem.
I think there's a lot to be learned from how the automotive industry safeguards its products. A lot of that can be applied to other industries, and so far I haven't been wrong.
I think the automotive industry is very interesting as a comparison. It started about 150 years ago. At that time everybody was just playing around building cars. Nobody cared about safety; everybody was only thinking about how to make cars go faster. Then, over the years, a whole ecosystem emerged around it. That ecosystem includes not only OEMs but many more companies, like gas stations, insurances, workshops, independent vendors, etc. Do you see a similar ecosystem building up in the AI space?
In the physical world you need a very robust supply chain. OEMs typically develop only 10 % of the components themselves. They are usually big integrators.
Volkswagen, BMW, Mercedes, they all purchase brakes, steering wheels, ECUs, and then they put it all together. They sell it and they get a margin in the end. There's not really a supply chain for software. There's not really a supply chain for AI development, but you need different tools and different ingredients to develop a reliable AI-based product. And only if you plug the different systems together properly can you have a good final product that you can sell to customers. So, I think you need some system integration skills, but really minor ones. Nevertheless, I don't think one company will develop everything themselves. We rely on Python. We rely on open-source libraries. We rely on everything else that is out there to get our products out.
I think reliability is one of the last building blocks that we have to figure out.
So far it seems that the big AI labs do a lot of the work themselves: they do the data curation, they do the training, they do the deployment, and they do a lot of the testing themselves. If you look at the automotive industry, as you said, the OEMs only do 10% of the work themselves. Do you think we are going to see a similar trend in the AI field, i.e. that the big labs will also outsource more and more of the value chain to players who are very specialized in particular things?
I think some of the labs already do that. For instance, ScaleAI is a big supplier of annotations for computer vision, and also for text to a certain degree, for some use cases.
Then you have other suppliers who do pen testing of models, like Lakera for instance. They pentest some of the big labs, I'm sure. So you have different add-on services that it doesn't necessarily make sense for the AI labs or AI-based companies to provide themselves. And if you go further down to the application layer, they need even fewer of those suppliers. They might just need a foundation model as a supplier and an infrastructure company like a hyperscaler, and then they can basically build their stuff around that.
Alright, let me then switch topics from looking at the ecosystem vertically to looking at the issue more horizontally or globally. The EU passed the EU AI Act, while in the US legislation seems to be winding back with the election of Trump. Where do you see the main differences now between Europe, the US and China regarding how AI development is happening, especially with a focus on reliability and safety?
Before talking more deeply about reliability and safety, I would like to stress that the strength lies in the builders. Take the release of DeepSeek as an example: there is a lab with about 100 very good engineers and they really focus on building. You have the same thing in the US, where you have a very strong builder culture and it's not so much about regulation. China has much stricter regulation on AI than Europe does. If you release a chatbot and the chatbot says anything that is prohibited by the government, you could have some serious issues.
In Europe the AI Act is not even fully enforced yet. So, "prohibited systems" cannot be on the market right now, but there is still time until the AI Act fully comes into force. So, we all have constraints and advantages: Europe has some, and so do China and the US. I just see the US and China as a little bit stronger on the builder culture, and I would strongly encourage a lot of people in Europe to also focus more on building.
And then when it comes to reliability and safety, I would say in the US you often have the mindset of move fast and break things, but I think OpenAI is even moving in the other direction. If you read the o1 system card, for example, they list a few risks that they see, they share a few tests that they have run, and they share how they've mitigated those risks. So, they are also focusing more on the safety and the reliability side.
In China, for DeepSeek, they have very strong guardrails. If you ask anything about Tiananmen Square, Taiwan or some other sensitive topics, you get blocked very quickly. I'd say the safety and reliability part is now getting ingrained into the builder culture. I think in Europe we have it integrated right from the beginning. And I think building fast doesn't have to exclude reliability and safety. I think that could be a good way forward.
So, jumping to the next question: do you see particular topics in AI safety and reliability that are underrepresented in the mainstream discussion?
Let me mention one topic that is heavily overestimated: bias. I went to a few trustworthy-AI panels, and all people talk about is bias, how LLMs can discriminate against certain minorities. In many cases, bias is not the most important problem. I would forget all these words. I would forget trustworthiness. I would forget responsible. It's really just about making the AI do what it's supposed to do.
In practice, you have a use case, then you have to define a certain application scope, and you have to make sure that within those boundaries this non-deterministic system approximately does what it's intended to do, because only then can you release it safely. There is a risk of bias, but it's just one risk.
You have security risks, reliability risks, privacy risks, autonomy risks, and those need to be mitigated in any product. You always have risks, and any good product manager, even for non-AI-based systems, even for simple physical products, has to think about the risks and how to mitigate them.
Maybe to go from this back to the topic of autonomous vehicles. In that particular use case of AI, it's obvious that if the AI does things badly, people are going to die. That's not obvious for chatbots. Thus, for those in the automotive space, regulation has played an important role for a long time. Hence, it seems that in the autonomous vehicle space people are way ahead in thinking about risks and safety. What do you think people in the AI safety community could learn from the field of autonomous vehicles?
I think the concept of the operational design domain is something that can be applied across other AI industries.
The operational design domain basically states under which conditions an autonomously driving vehicle is allowed to drive. So for level-four autonomous driving, Waymo has its ODD in Phoenix, right? They have certain streets in certain areas under certain lighting and weather conditions under which the vehicle can drive completely autonomously, even without a safety driver. Then they continuously expand this ODD, or this application scope, and you can do the same thing for any AI-based application.
Think about a breast cancer screening app, an app that takes in an MRI image, where you want to reliably know whether it detects breast cancer at an accurate rate or not. What you want then is a certain age distribution among the participants. You might want to exclude male breasts: men can have breast cancer too, but the chances are so low that you might leave it out of scope. You might want to include certain piercings. You might want to include silicone implants. And then you have an application scope: requirements for what the input should look like and what your expected performance should be.
Then you should go out and try to collect as much test data as possible and see if the AI performs well. Also, you don't have to release the product for all situations, but only for the situations where it works well.
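To make the idea of an explicit application scope a bit more tangible, here is a minimal Python sketch under assumed requirements: the metadata fields, age limits and inclusion rules below are purely illustrative, not how CertifAI or any medical device team would actually define them.

```python
from dataclasses import dataclass


@dataclass
class ScanMetadata:
    """Metadata accompanying one screening image (illustrative fields only)."""
    age: int
    sex: str               # "female" or "male"
    has_implants: bool
    has_piercing: bool


@dataclass
class ApplicationScope:
    """Explicit description of the inputs the model is released for (assumed values)."""
    min_age: int = 40
    max_age: int = 75
    allowed_sexes: tuple = ("female",)
    allow_implants: bool = True
    allow_piercings: bool = True

    def in_scope(self, scan: ScanMetadata) -> bool:
        """Return True only if the input lies inside the defined application scope."""
        return (
            self.min_age <= scan.age <= self.max_age
            and scan.sex in self.allowed_sexes
            and (self.allow_implants or not scan.has_implants)
            and (self.allow_piercings or not scan.has_piercing)
        )


scope = ApplicationScope()

# Collect test data, keep only the in-scope cases, and measure performance there.
test_cases = [
    (ScanMetadata(age=55, sex="female", has_implants=True, has_piercing=False), 1),
    (ScanMetadata(age=30, sex="male", has_implants=False, has_piercing=False), 0),  # out of scope
]
in_scope_cases = [(scan, label) for scan, label in test_cases if scope.in_scope(scan)]
print(f"{len(in_scope_cases)} of {len(test_cases)} test cases are in scope")
```

The point is not the concrete rules but that the scope is written down explicitly, so test data can be collected against it and performance can be reported per scope.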
I guess this is the biggest thing that can and should be learned.
If you think especially about chatbots, it seems like the input space is just so large. Think about an input of 100 characters and do the combinatorial exercise with a 32-character alphabet, or let's say a thousand tokens: the number of possible inputs is just arbitrarily large, right? How can we deal with this? I mean, it's a similar problem with autonomous vehicles; the number of situations for autonomous vehicles can also be arbitrarily large.
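For scale, a rough back-of-the-envelope calculation, assuming a 32-character alphabet and inputs of exactly 100 characters:

```python
# Back-of-the-envelope count of possible chatbot inputs (illustrative assumptions:
# a 32-character alphabet and inputs of exactly 100 characters).
alphabet_size, input_length = 32, 100
num_inputs = alphabet_size ** input_length
print(f"possible inputs: roughly 10^{len(str(num_inputs)) - 1}")  # about 10^150
```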
It's a mix of implementing guardrails, which are pretty common and which almost everyone implements, but also of creating specific test cases where you want your chatbot to really get it right. You can also think about how to exclude certain areas so that you can clearly say: "Okay, you are only supposed to work in this area." For example, you can exclude any sort of comments on wars or on racism or anything else. You can explicitly exclude certain languages, just to reduce the variability that you have.
You can also include very common scenarios where you want the AI to get it right, and that's how you create a targeted test set. Then it also makes sense to include in the test set some use cases and some requests where the AI shouldn't answer, or where you want the AI to specifically detect: "Okay, I'm outside of my ODD, so I apply a guardrail now, or I simply say that I'm outside of my application scope and shouldn't answer that." It's more complicated to administer that for an LLM than for a computer-vision-based system, for example. There's still a lot more thinking that needs to be put into this, but I think we can go in that direction.
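A hedged sketch of what such a targeted test set could look like: the chatbot() stub, the refusal wording and the banned-topic list below are illustrative assumptions, not any particular product's guardrails or API.

```python
REFUSAL_MARKER = "outside my application scope"


def chatbot(prompt: str) -> str:
    """Stand-in for the real system under test (model call plus guardrails)."""
    banned_topics = ("war", "racism")
    if any(topic in prompt.lower() for topic in banned_topics):
        return f"Sorry, that is {REFUSAL_MARKER}."
    return "Our return policy allows refunds within 30 days."


# Each entry: (prompt, should the bot answer?, keyword expected in the reply)
test_set = [
    ("How long do I have to return a damaged item?", True, "30 days"),
    ("What is your opinion on the war in ...?", False, REFUSAL_MARKER),
]

failures = []
for prompt, should_answer, expected in test_set:
    reply = chatbot(prompt)
    answered = REFUSAL_MARKER not in reply
    if answered != should_answer or expected.lower() not in reply.lower():
        failures.append(prompt)

print(f"{len(test_set) - len(failures)}/{len(test_set)} targeted test cases passed")
```

The test set deliberately mixes in-scope prompts with prompts that must trigger a refusal, so both the answering behaviour and the guardrail behaviour are checked.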
A last question regarding evaluations: many independent labs like METR, Apollo Research, ARC, CAIS etc. are building all kinds of different evaluations. Do you think that can help with reliability, and can evaluations be mostly standardized?
I'm split. I think it's good to have an external body evaluate models for generic risks. For each use case you will very likely have your own risks, and then yes, it's good if someone has looked into the particular details before: What's the toxicity? What's the bias? What are some general risks that a certain application has? What are the security risks? As a consequence, you might have to do less work in mitigating those risks, but I think you still need to do it for each application. You need to do your own risk assessment and see what else you have to do on top of that. The one big learning I have from doing this job for about two years is that there are not as many synergies between use cases and industries as you would think. There are always different particular risks, because it's just very different for each use case.
I wanted to ask if you have anything on the tip of your tongue that you'd like to share regarding safety and reliability.
I think it can be a strength. Any product that you develop, you want that product to be reliable and to do what it's supposed to do on a repeated basis. I would ask everyone to focus on a test set. If you have a targeted test set created as your evaluation benchmark, that can really be an asset. I'm also thinking about test-driven development, where you write tests first and then you do the coding. If you can do something similar for AI-based applications, where you create a benchmark or at least a mini evaluation test set first, that is great. Then you can get the model to do what it's supposed to do. So, long story short, I think there's still a lot of thinking to be done.
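A minimal sketch of that "evaluation first" idea, loosely analogous to test-driven development; MINI_EVAL, predict() and the pass threshold are hypothetical placeholders, not a recommended benchmark.

```python
# "Eval first": the mini benchmark is written before the model exists, and
# development iterates until the model passes it (all names are placeholders).
MINI_EVAL = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
]
PASS_THRESHOLD = 1.0  # illustrative: require all cases to pass before release


def predict(prompt: str) -> str:
    """Placeholder model; swap in the real system once it exists."""
    return {"2 + 2": "4", "capital of France": "Paris"}.get(prompt, "")


def score(model) -> float:
    """Fraction of mini-eval cases the model answers exactly as expected."""
    hits = sum(model(prompt) == expected for prompt, expected in MINI_EVAL)
    return hits / len(MINI_EVAL)


if __name__ == "__main__":
    accuracy = score(predict)
    print(f"mini-eval accuracy: {accuracy:.0%}")
    assert accuracy >= PASS_THRESHOLD, "model not ready: keep iterating"
```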
I think there are still a lot of things that are not completely figured out yet, but the industry is also homing in on these risk assessments, and I would bet that the application scope topic is also going to take hold.