Deep Learning with PolyAI

Why do specialized AI models win in CX?

Team PolyAI


Many AI models today are designed as generalists. But in customer experience, that approach often falls short.

In this episode of Deep Learning with PolyAI, Nikola Mrkšić sits down with Matt Henderson, VP of Research at PolyAI, to unpack why specialized AI models outperform general-purpose ones in real-world CX environments.

They explore the tradeoffs between speed, accuracy, and reasoning, why voice AI requires fundamentally different design decisions, and what makes PolyAI’s Raven 3.5 a step forward. The conversation offers a behind-the-scenes look at what it actually takes to build AI that works in production, not just in demos.

  • Follow PolyAI on LinkedIn
  • Watch this and other episodes of the Deep Learning pod on YouTube
SPEAKER_01

As a race, we are adapting towards how to speak to AIs, but we're trying to meet people where they are right now. That's the great thing about working here, for me: we have so much data that we can train on. It's all anonymized, but it has how people talk over the phone. The model knows that this is speech, and that it might be interrupted or might need to recover from bad interruptions, and stuff like that. We just train that in. The general models are not so specialized towards speech yet.

SPEAKER_00

Hi everyone, and welcome to another episode of Deep Learning with PolyAI. Today with me is our VP of Research, Matt Henderson, and we're here to talk about our new model, our new LLM, Raven 3.5. Matt, welcome to the podcast. Before we start, everyone, please like, share, subscribe. And Matt, over to you.

SPEAKER_01

Yeah, hey, thanks for having me on again. I enjoyed my first appearance; this is my second time on. We're talking about Raven 3.5 today, right?

SPEAKER_00

Yeah, absolutely. I think we are very proud of our continued investment in R&D, and of just how much we are a research-led company. I think your PhD was the first of anyone still at the company, and we've had a thread from, well, 2012 to today of just working on this stuff. It'd be great to tell the world: why is this special? Why is it different? Why is this not easy? I think there's more confusion than ever, and a lot of people have just decided they're going to pick up a model from OpenAI or Anthropic or Mistral or Qwen. Why are we doing this?

SPEAKER_01

Yeah, so I guess the question is: why train your own model? Why specialize, when, as you mentioned, the competing models would be the generalists, the big models on the cloud? A big reason for us is speed. We target and achieve sub-300-millisecond response times, and that puts us right on the edge of what's possible. And I think, at least for now, specialized models are always going to beat a generalist, especially when you're optimizing against so many things. We're optimizing for latency and cost; those are easy. But you also have to optimize for instruction following, and then it becomes hard.
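
The sub-300-millisecond target mentioned here is the kind of constraint you can sanity-check in a test harness. A minimal sketch, where `generate_fn` is a stand-in for whatever model call is being timed (not a PolyAI API), and real measurement would of course be end-to-end over the telephony stack:

```python
import time

def within_budget(generate_fn, prompt, budget_ms=300):
    """Check a single response against a latency budget (wall-clock only)."""
    start = time.perf_counter()
    _ = generate_fn(prompt)  # the call being timed
    elapsed_ms = (time.perf_counter() - start) * 1000
    return elapsed_ms <= budget_ms
```

In practice you would sample many calls and look at tail percentiles (p95/p99) rather than a single timing, since callers notice the slow outliers most.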

SPEAKER_00

Maybe we can double-click on that. I think we've got two clusters of audience: one technical, and they'll know what you mean, and the others who might not know what that means. There's the whole "how do I make a system not hallucinate" piece. What are we doing to that end?

SPEAKER_01

Yeah, well, we train Raven to do things that are quite specific and specialized for the use case of customer support over the phone or over web chat. One of those things is being grounded in the prompt, or grounded in the documents it has access to. So it's specifically trained to respond to you, then to cite the source of information for what it said, and optionally to say, actually, this was an out-of-domain request. You get that built-in, LLM-powered flagging of what topics are being raised, what pieces of content and documents are being cited, and what people are asking that the system just can't answer yet. Super useful, not just for keeping the model from hallucinating, which it helps with as a sort of additional training signal, but also for your logging and tracking. You wouldn't get that in a general model without extra harnessing and framework around prompting it. Here it's built into the model, you don't need to prompt for it specially, and you sort of get it for free.
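
To illustrate the built-in grounding signal described here, a hypothetical sketch of the kind of structured output such a model might emit and how downstream logging could route on it. All field and function names are invented for illustration; this is not PolyAI's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GroundedResponse:
    text: str
    cited_doc_id: Optional[str]  # which document backed the answer, if any
    out_of_domain: bool          # True if no available document could answer

def route(resp: GroundedResponse) -> str:
    """Decide what to log based on the model's own grounding signal."""
    if resp.out_of_domain:
        return "log_unanswered_topic"        # feeds back into content gaps
    if resp.cited_doc_id is None:
        return "flag_possible_hallucination"  # answered without a source
    return "answer_with_citation"
```

The point of baking this into the model, per the conversation, is that the citation and out-of-domain fields arrive "for free" with every turn, rather than requiring a second prompted verification pass.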

SPEAKER_00

Yeah. And people increasingly are playing with these models and with agentic systems, and it does feel like, you know, in The Matrix, when you see the machine just building a more and more complicated thing. If you're using a general model and prompting your way around both stylistic things and the right tool calling, it ends up being, frankly, impossible to read what you've done or to know what you've done. Each fix is just another layer of paint: saying, hey, please don't do this, or in this case do this and in that case do that, and it starts getting contradictory. So a model that inherently doesn't force you into those layers of paint gets you to a system that's much better and less complex. And less complex is always good, because it means it's more interpretable, right?

SPEAKER_01

Yeah, so we basically put all of these constraints that we know ahead of time into the model weights rather than into a dynamic prompt. Another example would be consistency in what language it speaks. We want Raven to be instructable in whatever language you're comfortable with, typically English, and then with the flip of a switch it speaks Spanish or Portuguese or Japanese or whatever. That's super confusing for the general models. The user asks a question in Spanish, the model retrieves a document that's in English, all the prompting is in English, there's maybe a reminder to respond using this document, and it's forgotten it's supposed to speak Spanish and replies in English. That's a super bad error to make, and it's one of those things we can just put into our reward signal to tighten up the model.

SPEAKER_00

Yeah. And with specialization, you kind of are what you eat, right? If you're working on problems that require this level of precision and multilinguality, you have to get way better at it. Tell us a bit more about the 3.5 aspect. What's special relative to Raven 3? Auto-reasoning would be one of the things, right?

SPEAKER_01

Yeah, auto-reasoning is the coolest new feature, I think. Maybe we'll dive into that in a bit, but across the board it's just improvements on everything. We start from a better pre-trained model with better multilingual capabilities, so there's a big lift in non-English languages as well as English. We have new features like out-of-domain detection, we work very well in web chat now, and there are improvements in things like custom style following. One of the other things we build into Raven is a good output style for conversations happening via voice or, optionally, web chat. There's a lot of LLM-ese in the outputs of models like GPT that you'll start to recognize and get really frustrated with if you're a caller who's been transferred three times and you get through to: "I'd be really happy to drill down and get to the bottom of this for you."

SPEAKER_00

And then it asks a question at the end, right? One thing I've been trying to explain to a lot of people is that latency is an obvious property that makes a system better: if it spends less time thinking and responds quickly, without interrupting you when you've just made a pause, that's obviously a great thing. The barge-in feature is an interesting one. It's a great thing to show off, and in some places it's incredible; in others it can be quite disruptive, if people accidentally interrupt the thing, or another person speaking in the background triggers it. It's essential for a lot of the wrappers to have good barge-in, because their systems built on GPT models just won't shut up, and they're building in a UX assumption. Anyone who's used the voice mode in ChatGPT knows you have to be very authoritative, and frankly, in human terms, quite rude, when you speak to it: you just have to cut it off and direct it where you want it to go. If you do that, it's a different modality of conversation, but it is very powerful. What we see on the phone, though, is that people often don't know how to do that. There's a discrepancy between how the power users of ChatGPT's voice mode, who are probably not more than 0.1% of the planet, behave with it, versus everyone else, who is just following the rules of behavior from a human conversation. A lot of it is just that you don't want it to speak forever and then ask you a question at the end of the whole thing.

SPEAKER_01

As a race, we are adapting towards how to speak to AIs, but we're trying to meet people where they are right now. And I guess we get that largely because all our training is on the data that we collect. That's the great thing about working here, for me: we have so much data that we can train on. It's all anonymized, but it has how people talk over the phone. The model knows that this is speech, and that it might be interrupted or might need to recover from bad interruptions, and stuff like that. We just train that in. The general models are not so specialized towards speech yet. I mean, GPT Realtime is a very good model that we compete with and benchmark against.

SPEAKER_00

I've heard that from a lot of customers; in fact, it's generating a lot of demand, where people speak to it and say, I want that. And then it's like, well, that's a certain kind of car, right? It might be a hypercar, and equally difficult: you can't really drive it around London, because you'll hit a curb wherever you go. Or, in the case of that model, it will hallucinate things and won't do tool calling. But it's definitely showing us what's possible in terms of speed and the naturalness of dialogue. I find it very impressive. We don't want to talk about Raven 4 here, but I'm so excited about that one, right?

SPEAKER_01

Yeah, Raven 4, Raven Omni: we're starting to do audio. We don't need speech recognition anymore; the model can sort of hear the user, with a native understanding of the speech, of the audio. But I guess that's the next podcast I'll appear on.

SPEAKER_00

Maybe we stop on that thread. Tell me a bit, just for the audience: I remember inheriting your code base, and it was in the wonderful library of Theano, where you couldn't even do, you know... TensorFlow with its optimizers was revolutionary in terms of how easy it made running an experiment. Why is this still hard, when people live in a world where, you know, OpenClaw can set up an entire operating system and create the development pipelines you needed a DevOps team for before? Mild exaggeration, but why is this still hard?

SPEAKER_01

I think, yeah, we're training this generative model that you can speak to. We can train and get a loss and think, okay, this looks fine, even run it on a benchmark and get some average number. But it really takes us starting to talk to the model, getting it in front of our agent deployment teams, in front of people, to see what the issues are. Is it calling tools robustly? Does it have some sort of weird style? When you get into a multi-turn conversation, what shows up? Maybe it's repetitive; it says "great" after everything the user says. It's the kind of thing you only come across if you've been building conversational systems for the length of time that you and I have. We have a list of behaviors we want to put into the model and then test for. And every one of those behaviors might have its own individual, creative solution: more data, more preferences, a different reward signal, something like that.

SPEAKER_00

When you talk about those different metrics you're optimizing for: we talked about style, we talked about instruction following. Then there's the whole balancing of reasoning and latency and how you decide to do that, and then the naturalness of voice versus the precision of the answer, and the length of the answer. How do you optimize for all of that in training? And in the post-training setup, how does it work?

SPEAKER_01

And I'll add to your list: there's also tool calling, and then there's all of those things, but in all the different languages and modalities, like web chat. We also want to check that the things about hallucination and citing your topics work well. Languages are particularly interesting, because we have a whole different style guide for each language. You might want different rules about politeness in an English conversation versus a Japanese conversation, which might be more formal. And in some languages you can't avoid gendering the person you're talking to in the pronouns you're using, so we have certain rules there. But the question was how we balance all of those things. We track all of them in all of our evaluations, and I guess making them measurable is the first important thing we do on the research team. We have this internal benchmark, and you'll see in our blog post announcing Raven 3.5 that we beat the big public models, GPT-5, the latest Claude Sonnets, on those benchmarks. And then each behavior has an individual solution for making it work. I want to paint the picture that this isn't just launching a run-DPO script, getting a model at the end, and saying, there, we trained 3.5, it has more data, blah, blah, blah. If you were to plot the lineage of the final model checkpoint, it's the average of a bunch of runs of the reasoning tuning stage, which comes from a base model that's an average of all these different one-off DPO and GRPO and combined runs. It's become this kind of messy, lore-like art of figuring out how to post-train for all of those things at the same time.
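
For readers unfamiliar with the acronym, DPO (Direct Preference Optimization) trains directly on chosen/rejected response pairs like the ones discussed throughout this conversation. A minimal single-pair sketch of the standard DPO loss; the log-probabilities here would come from the policy being trained and a frozen reference model, and the numbers in any usage are made up for illustration:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trained policy and the frozen reference model.
    The loss pushes the policy to widen the chosen-vs-rejected margin
    relative to the reference, scaled by beta.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid
```

As the conversation notes, a real post-training pipeline is many such runs (DPO, GRPO, SFT warm-ups) averaged and chained, not one invocation of a loss like this.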

SPEAKER_00

The more we talk about the data being a moat, the clearer it is that it absolutely is one. One way that Sean, our CTO, described it, which I thought was really interesting: it's like the model goes through different grades. How you learn chemistry in grade five could very well inform whether you go for physics or chemistry in grade seven, and then your whole learning path takes a different evolution. What you did at which part of training, and how you rewrote the rewards and so on, matters. It's not really a linear training phase in the way people used to think about it; it's a multi-stage building of a system, and it's more of a dark art than ever, in a way, right?

SPEAKER_01

It's pretty surgical and targeted. Back when you and I were doing our PhDs, you'd usually launch an experiment that trains on one data set, it finishes, and it was usually randomly initialized. I think you did work on bringing in pre-trained word embeddings and such, which was a first step towards where we are now. But now we have these massive models and lots of data. Sometimes we want to reuse an old model: it's really good, it's just that in one certain situation it doesn't use the right tool. So then we fix it surgically.

SPEAKER_00

I think about this a lot when people ask what we're going to be doing in the future. If we were explaining what we're doing even to a mathematician in the '30s or '40s, they'd probably think we'd lost our minds, or that we're doing something very trivial that makes no sense. I mean, they'd understand everything. But you think of people building handcrafted products, leather goods or furniture or whatever, and you see them polishing this one thing, shaping this other thing, leaving it for a few days. I can't help but draw the comparison: it's really not that different from this, where your intuition is the meta-level sequencer, and it informs how much money and resources and time you want to invest in this whole sausage-making process.

SPEAKER_01

And yeah, we're sort of artisanal model developers. We've become reward-function engineers and, you know, GRPO-loss-mixing engineers. But at least we're not prompt engineers, stuck with a generalist model where all we have at our disposal is playing around with a prompt; we can backpropagate and adjust the model to exactly what we want. The auto-reasoning stuff was surprisingly difficult to train, I guess. And when it comes to reward engineering, the model's always going to try to reward hack. So with auto-reasoning: you want to get the benefits of reasoning, which means the model takes some time before responding. It does some little generation and thinks to itself. But in our case that adds latency to the call unnecessarily, because it could in theory respond immediately. So we only want to do it in cases where it's going to help. How do we teach the model when it should reason and when it shouldn't? The naive thing to do is to generate with and without reasoning, and if the one where it reasons is better, then you train towards that.
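
The naive recipe Matt describes, generate with and without reasoning and keep whichever wins, can be sketched as a preference-pair builder. Here `generate` and `judge` are stand-ins for the model and an automatic quality scorer; both are assumptions for illustration, not real APIs:

```python
def build_autoreasoning_pair(prompt, generate, judge):
    """Build one chosen/rejected pair for auto-reasoning training.

    generate(prompt, reason=...) returns a response with or without a
    reasoning trace; judge(response) returns a quality score.
    """
    plain = generate(prompt, reason=False)
    reasoned = generate(prompt, reason=True)
    if judge(reasoned) > judge(plain):
        # Reasoning helped: prefer paying the latency here.
        return {"chosen": reasoned, "rejected": plain}
    # Reasoning didn't help: prefer the instant response.
    return {"chosen": plain, "rejected": reasoned}
```

As the conversation goes on to note, this naive version needs extra care (SFT warm-up, length penalties) to avoid the degenerate "never reason" or "always reason" solutions.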

SPEAKER_00

How do you factor how long it reasons for into the balancing?

SPEAKER_01

Yeah, that's the other tricky part. That's one of our GRPO losses: making the reasoning short. The typical base model you'd get doesn't have any constraint like that. If we look back to when reasoning first came in, with DeepSeek, you had some crazy long, repetitive traces, and sure, benchmark performance goes way up, but you have to wait like five minutes for it to actually reply to you. So we do this in the same artisanal way. We warm it up in an SFT stage with some traces that we think are good, and then in later stages we've got a specific loss function that says: if you can get to the same quality of response with shorter reasoning, prefer the shorter reasoning. But I think one of the core difficulties here is that we're trying to teach the model to know what it doesn't know. These models are famous for being confident bullshitters sometimes; they're not well calibrated inside. So how can we teach it to know when it needs to stop and think? That was particularly hard. And with the shortening-the-reasoning thing, you get the degenerate reward-hacking cases: "Well, I seem to do better every time I reason less, so I'm just never going to reason." There are all these things we're balancing. The trick with when to reason: we just warmed it up with some SFT, basically, with good examples of "you should reason here, we know this is a very difficult date-time request," rather than asking the model to figure out for itself that it can be unreliable without reasoning in those cases.
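
The "prefer shorter reasoning at equal quality" idea can be sketched as a reward with a small per-token penalty on the reasoning trace. The function and all the numbers are illustrative, not PolyAI's actual reward:

```python
def reasoning_reward(quality, n_reasoning_tokens, alpha=0.001):
    """Length-penalized reward for a reasoning trace (a sketch).

    quality: score of the final answer (higher is better).
    n_reasoning_tokens: length of the hidden reasoning trace.
    alpha: per-token cost; must stay small relative to the quality
    gain reasoning brings on hard inputs, or the model reward-hacks
    into "never reason", the failure mode mentioned above.
    """
    return quality - alpha * n_reasoning_tokens
```

With this shape, two equal-quality answers are ranked by trace length, while a hard question where reasoning lifts quality substantially still beats the zero-token answer despite the penalty.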

SPEAKER_00

So it's sort of adding a bit of structure that helps it make that decision, or learn to make that decision, rather than letting it just infer it from the data.

SPEAKER_01

That's right. At the beginning of training it has something to latch on to, and then it can figure out the other cases where reasoning helps. Why should it be obvious to the model that it needs to reason to figure out how many R's are in the word "raspberry"? Yeah.

SPEAKER_00

Yeah, because the overall signal of whether the answer was good is one thing, but trading off predicting the right next word against predicting it faster, and how much faster, really is some kind of racing, right?

SPEAKER_01

Yeah, but it's a nice feature: if you want, you just switch on auto-reasoning, and Raven will think when it thinks it will help, and it won't add latency otherwise.

SPEAKER_00

How much better would Raven do on these benchmarks if it reasoned, if time wasn't an issue and we weren't optimizing for it? How much better are the reasoning models? When we compare to, you know, Sonnet and such, or if we took Opus, for instance: I can't detach myself from my OpenClaw and what it does with the Opus model; I see a very distinct difference in coding ability with one versus the other. How well would it do relative to these models? Or would it not make much of a difference?

SPEAKER_01

Those are going to do very well on our benchmarks. Large reasoning models, where latency doesn't matter, can certainly top all of our benchmarks. So we want to guard a bit against the future where they do become instantaneous and fast. And when you talk about using your Claude or your OpenClaw, there's a whole lot of agent harnessing and stuff that makes it respond very well but adds latency. Our approach is to benefit from these types of models, the open models where we can run reasoning and do stuff that's inefficient: we do that offline, during training, and sort of distill it down into these smaller models.
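
The offline-then-distill idea described here can be sketched as: run an expensive reasoning teacher with no latency budget, keep only its final answers, and use those as supervised targets for the small, fast student. `teacher` is a stand-in for the big model, not a real API:

```python
def build_distillation_set(prompts, teacher):
    """Collect SFT targets from a slow reasoning teacher (a sketch).

    teacher(prompt) -> (reasoning_trace, final_answer); we discard the
    trace and keep only the answer, so the student learns to produce the
    teacher-quality output without paying the reasoning latency online.
    """
    data = []
    for p in prompts:
        trace, answer = teacher(p)  # slow: full reasoning, run offline
        data.append({"prompt": p, "target": answer})
    return data
```

Variants of this keep a shortened trace as a target too, which connects back to the length-penalized reasoning training discussed earlier.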

SPEAKER_00

Yeah. You could think of it as: we're getting it ready for a race; it has to run a race. And at the end of the day, are we able to get most of the benefit of how much better a bigger model would do, but in a faster model?

SPEAKER_01

Yeah, I think we basically benefit from all that teaching and achieve something you just wouldn't be able to do within the latency budget, and obviously much, much cheaper.

SPEAKER_00

Okay, cool. Well, with that, this was really just a teaser for what's going to come in a few weeks' time with the release of the next model, but I think we've probably already said a bit too much about that. Thank you so much to everyone listening. Check out the link, check out Raven 3.5. There are very exciting releases coming about the platform and all the models included with it over the coming weeks. So subscribe, like, share, and we'll see you in the next one. Matt, thank you for joining me today.

SPEAKER_01

Great, thanks for having me on. Great to chat to you, Nikola.