AI search keeps failing in the field. Here's what we found when we tested ours.

Pavithra Sudhakar
April 9, 2026

If you run secure messaging at scale, you already know the problem: the information was shared, the thread exists, and nobody can find it when it counts.

 In high-stakes environments, scattered knowledge is a mission risk. 

In Rocket.Chat v8.0, we built Intelligent Search to close that gap. We stress-tested it against 1.2 million real, messy conversational messages, the kind that actually move through operational environments, and published every result, including the uncomfortable ones. Here is what the data actually showed.

About Intelligent Search

Most search tools make one assumption: that you remember exactly what you are looking for. In practice, nobody does. You remember the topic. You remember it was somewhere in a thread from three weeks ago. You remember it involved the logistics team. You do not remember the exact words, and that is where most search tools leave you stranded.

Intelligent Search works the way people recall information. "Authentication service is down" and "users can't log in" return the same results. You describe what you are looking for in plain language and the system finds it, regardless of how it was originally worded.
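At its core, this style of search ranks messages by how close their meaning is to the query, not by shared keywords. A minimal sketch of the idea using cosine similarity over toy, hand-made embedding vectors (a real system would get these from a sentence-embedding model, and the messages shown are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy embeddings: messages about the same topic sit close together
# in vector space even when they share no words.
corpus = {
    "Authentication service is down": [0.90, 0.10, 0.00],
    "users can't log in":             [0.85, 0.20, 0.05],
    "lunch menu for Friday":          [0.00, 0.10, 0.95],
}

def search(query_vec, corpus):
    """Return messages ranked by similarity to the query vector."""
    return sorted(corpus, key=lambda msg: cosine(query_vec, corpus[msg]),
                  reverse=True)

# A query about login failures lands near both auth-related messages.
query = [0.88, 0.15, 0.02]
print(search(query, corpus))
```

The two differently worded outage messages rank together at the top while the unrelated message falls to the bottom, which is the behavior the paragraph above describes.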

Every result comes directly from your organization's own message history. What was said is what you get. In environments where a fabricated answer is worse than no answer at all, that matters more than any accuracy benchmark.

We sat down with Devanshu Sharma, Lead Research Engineer, to get into what the data actually showed and what it took to build something that performs on a bad day, not just a benchmark day.

What does it actually cost a team when information can't be found?

On the surface it looks like a few lost minutes. But the actual effect is much broader.

The first cost is time, and not just one person's time. It is the entire team grinding to a halt while everyone searches for something that was already shared.

The second cost is duplication. If people cannot find what was documented, they rebuild it from scratch, burning resources and pushing decisions further down the line.

But the third cost is the one that matters most: it erodes the confidence of decision makers.

The problem isn't that organizations fail to generate information. It's that critical intel is distributed across channels, and the people who need the full picture are always working with a partial one.

When someone is making a call without the full context, working off stale information or a partial picture, their confidence in their own decision drops. They second-guess. They delay. Or they proceed without knowing what they don't know. In the environments we are building for, that is not an efficiency problem. That is a mission risk.

Why do AI search tools that demo well so often fail in the field?

It's almost always the dataset.

Most demos are built around refined, curated data designed to showcase the system at its best.

In the LLM space right now, the company with the best benchmark score makes the news, so everyone optimizes for the benchmark. The assumption becomes: if a model performs well on a standard dataset, it performs well for everyone.

But real operational environments do not look like benchmark datasets. The data is noisy. It's fragmented across multiple channels. It changes constantly. It is not a static file sitting on a server. It is a live feed of human communication, and it is messy in ways that controlled datasets simply do not capture.

The gap between benchmark performance and field performance is real, and it is wider than most people expect until they are actually in deployment.

What should buyers ask vendors before committing?

There is no single silver-bullet question. But the most powerful thing a procurement team can ask is: Can I run your system against my own data?

Not a demo. Not a reference customer story. Just: let me test it in my environment, against the kind of messages my people actually send.

Every deployment context is different, and the only way to know how a system will perform in yours is to test it there. Alongside that, ask for transparency on methodology and documented results you can verify independently.

We did not just make that argument. We tested it.

Why evaluate against messy conversational data rather than a curated benchmark?

Because that's exactly how people search.

When I'm trying to find something I sent or received weeks ago, I don't remember the exact room, the exact channel, or the exact wording. I remember the topic. I remember roughly who it involved. Maybe I remember a timeframe.

That's how operators search too; they're not going to sit there reconstructing the precise keyword. And that's where keyword search completely fails you.

The dataset we used was deliberately messy: short messages, slang, domain-specific terminology, multi-domain conversations, the kind of noise you'd find in any real workspace. It mirrors what's actually happening in the field. We chose it because it reflects reality, not because it would produce favorable numbers.

Because if we optimize the numbers just to show a pretty benchmark, we are setting a trap for ourselves.

When customers deploy in their own environment and test against their own data, and they will, they are going to get the real results. We wanted to find the worst case before anyone else did.

At 380,000 conversational records, the system delivered a Mean Reciprocal Rank of 0.72, meaning the most relevant result was consistently surfacing near the top. At 1.2 million documents, that shifted to 0.56. The system was still retrieving the right content. The challenge at that scale was ranking the best match first, with more content competing for the top result. That is expected behavior, and it gives us a reproducible baseline to improve against.
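Mean Reciprocal Rank averages the inverse rank of the first relevant result across queries, so an MRR of 0.72 means the right answer typically landed at or very near position one. A minimal sketch of the metric itself (document IDs are made up for illustration):

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR over a set of queries.

    ranked_results: one ranked list of document IDs per query.
    relevant: the correct document ID for each query.
    """
    total = 0.0
    for results, target in zip(ranked_results, relevant):
        if target in results:
            rank = results.index(target) + 1  # 1-based rank of the hit
            total += 1.0 / rank
        # a miss contributes 0 to the sum
    return total / len(relevant)

# Three queries: relevant doc ranked 1st, 2nd, and 4th.
runs = [["d1", "d2"], ["d9", "d3"], ["d5", "d6", "d7", "d4"]]
gold = ["d1", "d3", "d4"]
print(mean_reciprocal_rank(runs, gold))  # (1 + 1/2 + 1/4) / 3 ≈ 0.583
```

This also shows why MRR drops as the corpus grows even when retrieval still succeeds: if the right document slips from rank 1 to rank 2, its contribution halves.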

The dataset is public. The methodology is documented. Anyone can run it themselves. In the research community, reproducibility is highly valued.

You publish results and make them verifiable because open reproduction is how the field moves forward. For organizations that go through rigorous accreditation processes, that openness is a stronger signal than any slide deck.

Retrieval accuracy only matters if the system is available when you need it.

Picture a search system with exceptional accuracy that slows to a crawl under load, or goes dark at the one moment you genuinely need it. That is not a retrieval problem. That is a trust problem. And once you break a user's confidence in the system, you probably do not get it back.

Under sustained load at million-document scale, our API response times stayed in low single-digit milliseconds at the 95th percentile. Error rates remained near baseline. Worker failures were essentially zero. The system absorbed traffic spikes through backpressure and batching without degradation.
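Percentile metrics matter here because averages hide tail latency. A quick nearest-rank sketch (the sample values are invented) showing how a single slow outlier dominates p95 while barely moving the median:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at ceil(p/100 * n) in sorted order."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[idx]

# Nine fast responses and one stall. The mean (~20 ms) looks fine;
# p95 surfaces the stall a user would actually feel.
latencies_ms = [2, 3, 2, 4, 180, 3, 2, 5, 3, 2]
print("p50:", percentile(latencies_ms, 50))  # 3
print("p95:", percentile(latencies_ms, 95))  # 180
```

Keeping p95 in low single-digit milliseconds, as reported above, means even the slowest 5% of requests stayed fast under load.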

For teams running communication infrastructure that supports operational decisions, that stability story matters more than any retrieval benchmark. An AI feature that introduces latency or instability into that stack is not a feature. It is a liability.

What breaks user trust in an AI search system?

Three things. The third is the one that keeps people up at night.

First, irreproducibility. When a user searches for the same thing twice and gets completely different results, trust in the system collapses fast and it does not come back.

Second, opacity about sources. If the system cannot show you where the answer came from, you cannot verify it. In high-stakes environments, an unverifiable answer is not just unhelpful. It is unusable.

But the third is access control. These organizations operate with strict information hierarchies, and the last thing anyone needs is a search system that surfaces sensitive data to someone who should not see it. That is not a retrieval quality problem. That is an institutional catastrophe.

In an early version of the pipeline, we were using a library for embedding inference that had a hidden dependency on endpoints behind Cloudflare. We didn't catch it because we weren't testing in a fully air-gapped environment. Then Cloudflare went down, and suddenly our embedding model was broken.

If that had happened in a real deployment for a defense customer, it would have been a serious incident. We stripped out the entire library and rebuilt it with one that had zero cloud infrastructure dependencies.

It slowed us down. It was not an easy path. But for air-gapped deployments, which is the operating reality for most of the organizations we are building for, there is no other path. The system has to work completely offline, with no external dependencies, under any conditions.
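One way to catch hidden network dependencies like that before deployment is to run the test suite with outbound connections disabled, so any library that quietly phones home fails loudly. A hedged sketch of the technique (similar in spirit to tools like pytest-socket; this is not Rocket.Chat's actual test harness):

```python
import socket

class NetworkBlocked(RuntimeError):
    """Raised when code attempts an outbound connection under air-gap tests."""

_real_connect = socket.socket.connect

def _blocked_connect(self, addr):
    raise NetworkBlocked(f"outbound connection attempted: {addr}")

def enable_airgap():
    """Patch socket connects so any network call fails immediately."""
    socket.socket.connect = _blocked_connect

def disable_airgap():
    socket.socket.connect = _real_connect

enable_airgap()
try:
    # Any hidden dependency on a remote endpoint would surface here.
    socket.create_connection(("127.0.0.1", 9), timeout=1)
except NetworkBlocked as e:
    print("caught hidden dependency:", e)
finally:
    disable_airgap()
```

Running CI with a guard like this simulates the air-gapped operating reality described above, instead of discovering the dependency when the upstream service goes down.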

That decision, and others like it, reflect a particular kind of engineering discipline. Not optimizing for the demo. Optimizing for the day when the system has to work and nobody has time to troubleshoot.

That discipline shows up everywhere in this build. And it is what shapes where the work goes from here.

The next iteration brings hybrid search, combining semantic and keyword retrieval, alongside LLM-as-judge evaluation to better measure what the numbers are actually missing. The baseline is set. The work continues.
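Hybrid search needs a way to merge the semantic and keyword result lists into one ranking. Reciprocal Rank Fusion is one common choice, sketched below as an illustration (not necessarily the fusion method that will ship; message IDs are made up):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists: each list contributes 1/(k + rank)
    per document, and documents are re-sorted by the summed score.
    k dampens the influence of any single list's top ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["msg42", "msg7", "msg13"]   # embedding-based ranking
keyword  = ["msg7", "msg99", "msg42"]   # exact-term ranking
print(reciprocal_rank_fusion([semantic, keyword]))
```

A document that both retrievers rank highly (here `msg7`) rises to the top, which is the practical appeal of hybrid retrieval: semantic recall plus keyword precision.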

Devanshu Sharma is a Lead Research Engineer at Rocket.Chat, where he builds and evaluates production AI systems for large-scale communication platforms. His work focuses on intelligent search, retrieval-augmented systems, and LLM safety, with particular attention to the tradeoffs between retrieval quality, speed, and operational reliability. 

This conversation is part of the Rocket.Chat Labs series on Intelligent Search. The first piece covers the full technical breakdown. This one goes deeper, straight from the engineer who built it.


Pavithra is a Product Marketing Manager at Rocket.Chat. She represents the voice of the customers and helps shape the voice of the product. She is highly passionate about bringing new offerings to the market. When she isn’t donning the hat of a Product Marketer, she tries her hand at multiple cuisines, lives a hundred different fictional characters through books, and enjoys playing badminton.
