RAG vs. Lucene: Architecting AI Knowledge Bases for On-Premises Customer Support Systems

Published on 4/20/2026

In this post, I will walk you through the thinking behind the architecture of the "Knowledge Base", the core feature of ShenDesk's AI customer service, and how I brought it to life.

Introduction

In the current wave of AI, no customer support product can afford to ignore it. Adding AI capabilities is no longer optional; it is a necessity. Generally, there are three ways to integrate AI into a customer support system:

  1. Fully Managed AI Cloud Services: Upload the user's knowledge base to an AI platform and call their APIs to handle visitor inquiries in the chat window.
  2. Self-Hosted AI Orchestration Platforms (e.g., Dify): Set up a dedicated platform like Dify and connect it to the support system via APIs.

    Dify is an open-source LLM application development platform designed to simplify the creation and deployment of generative AI apps, providing a user-friendly interface for building production-grade AI applications.

  3. Fully Built-in AI Capabilities: Implement native vectorization and Retrieval-Augmented Generation (RAG) within the system, alongside direct support for calling open-source models.

Trade-offs and Challenges

Option 1 (Cloud APIs) is the most lightweight and easiest to implement. However, it directly conflicts with the goal of 100% private deployment and 100% data sovereignty (keeping data local). For data-sensitive sectors like government, finance, and insurance, this is a deal-breaker.

Option 2 (Self-hosting Dify) is far too "heavy" for many small teams. Most small businesses lack dedicated IT specialists. Deploying and maintaining such a complex stack is a high barrier to entry, not to mention the requirement for GPU-equipped servers for model inference.

Option 3 (Fully Native AI), while powerful, carries a high development cost and significantly complicates deployment for smaller teams. It typically requires additional database components to support vector searches and, again, necessitates GPU servers for local inference. The barrier to entry remains prohibitively high.


My Situation and Goals

Let’s go back to my core mission: developing a secure, stable, reliable, lightweight, and self-hostable customer service system.

I specifically highlighted "lightweight" because, over years of development, I’ve worked with countless small teams: some with only a handful of people, others just solo founders. Often, they simply want to add a chat feature to their existing website or app so they can talk to potential leads and close deals. The moment they see complex system requirements and technical jargon, they’re discouraged. What they need is simplicity, pure and simple.

Furthermore, these small teams often operate under tight server and bandwidth budgets. I’ve seen many users deploy their support system directly on their existing web server or on a budget-tier cloud instance (like a 2vCPU / 4GB RAM machine bought during a sale). This represents the vast majority of my user base. To be clear, their limited budget doesn't mean they have a high tolerance for instability or security flaws.

This essentially rules out Options 2 and 3. Unless I’m willing to abandon the majority of my users (and I’m not), those paths are non-starters.

That leaves Option 1: using managed AI cloud services.

However, an AI chatbot is more than just a chat window connected to a model. The real goal is to have the AI communicate based on the user's own knowledge base (e.g., answering specific questions like "How do I place an order?" or "What is your delivery window?").

The "knowledge base" is the bridge. If I relied entirely on a managed platform, users would be forced to upload all their documents to a public cloud. In an era where data security is a top priority, especially in government, finance, and insurance, moving sensitive data off-premises is an immediate "no-go."

This brings us to a critical requirement: The knowledge base must reside 100% within the user's local database.

The most viable compromise is this: When a visitor asks a question, the system first searches the local knowledge base, constructs a prompt with the retrieved information, and then calls an AI model via API. The difference here is that I’m calling a raw model (like Gemini/GPT) rather than hosting the entire knowledge base on a third-party platform.
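This retrieve-then-prompt flow can be sketched in a few lines. Everything here is illustrative rather than ShenDesk's actual code: `search_knowledge_base` stands in for the local full-text search, the word-overlap scoring is a deliberate oversimplification, and the model call is left as a commented stub because it depends on the provider's API.

```python
# Sketch of the local-retrieval + remote-model workflow.
# Only the final prompt ever leaves the server; the knowledge base stays local.

def search_knowledge_base(question: str, top_n: int = 3) -> list[str]:
    """Stand-in retriever: return the top-N most relevant local snippets."""
    # In the real system this would query the local full-text index.
    corpus = {
        "ordering": "To place an order, click 'Buy Now' and complete checkout.",
        "delivery": "Orders ship within 2 business days; delivery takes 3-5 days.",
    }
    words = set(question.lower().split())
    scored = sorted(corpus.values(),
                    key=lambda doc: len(words & set(doc.lower().split())),
                    reverse=True)
    return scored[:top_n]

def build_prompt(question: str, snippets: list[str]) -> str:
    """Construct the prompt sent to the hosted model (Gemini/GPT/etc.)."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Answer the visitor using ONLY the knowledge base excerpts below.\n"
        f"Excerpts:\n{context}\n\n"
        f"Visitor question: {question}"
    )

question = "How do I place an order?"
prompt = build_prompt(question, search_knowledge_base(question))
# response = call_model_api(prompt)  # provider-specific, omitted here
```

The key property this illustrates: the raw documents never leave the local database; only the handful of retrieved snippets relevant to one question are embedded in the outgoing prompt.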


Managing the Local Knowledge Base

The core of this strategy is straightforward: How do we build and manage a local knowledge base?

While a vector database is technically the superior choice, I have a non-negotiable constraint—the on-premises deployment must remain lightweight and cannot increase the user's operational burden.

Consequently, my initial architectural concept became: Local Database + Full-Text Search + Top-N Retrieval + Prompt Engineering.

There are several mature solutions for full-text indexing:

  1. Elasticsearch: The de facto industry standard. Its distributed architecture natively supports massive clusters, sharding, and replication.
  2. OpenSearch: The open-source fork of Elasticsearch created by AWS following Elastic's licensing changes.
  3. Solr: A veteran choice with strengths in precise word segmentation and traditional text retrieval, though its distributed scalability lags behind ES. It’s rarely a first choice for new projects today.
  4. Others: Meilisearch (written in Rust), Typesense (written in C++), etc.

When I worked on large-scale corporate projects, we would default to Elasticsearch. Back then, we had big clients, multi-million dollar contracts, expansive server environments, and mature DevOps teams.

Now, I have none of those luxuries. The small teams I serve simply cannot afford to deploy a "heavy weapon" like Elasticsearch for their on-premises setup.

How do I resolve this dilemma? Looking at my current user base, their AI knowledge bases have a distinct characteristic: they are relatively small. If I told them my solution could query tens of thousands of documents across dozens of gigabytes in milliseconds, they would feel like I’m solving a problem they don't have.

Most small teams have a knowledge base consisting of only dozens of documents; reaching the hundreds is already rare. Let’s summarize the current constraints and requirements:

  1. 100% Data Sovereignty: Use a Local DB + Full-Text Search + Top-N Retrieval + Prompt Construction workflow, sending the final prompt to AI models like Gemini/GPT.
  2. Extreme Portability: It must be lightweight with zero additional deployment overhead. Being able to index and retrieve a few hundred documents is more than enough.

Ultimately, only one mature choice remained: Lucene.

Lucene requires no standalone service installation. It has no external dependencies and embeds directly into the main server application, so from the user's perspective at deployment time it is completely invisible.

In reality, Lucene is incredibly powerful. For collections ranging from 1 million to 10 million documents, query latency typically stays between 10–50ms. Using it to manage dozens or hundreds of documents for small-scale users is, quite frankly, effortless.
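To make the embedded-index idea concrete without pulling in Lucene itself (which is a Java library), here is a toy in-process inverted index with TF-IDF ranking in pure Python. It is a sketch of the concept only: Lucene does the same job with BM25 scoring, proper analyzers, and on-disk segments, far more robustly.

```python
import math
from collections import Counter, defaultdict

class TinyIndex:
    """A toy in-process inverted index with TF-IDF ranking.

    Illustrative only: Lucene performs this role (with BM25,
    analyzers, and persistent segments) inside the same process,
    with no separate search service to deploy.
    """

    def __init__(self):
        self.docs: dict[int, str] = {}
        # term -> {doc_id: term frequency}
        self.postings: defaultdict = defaultdict(dict)

    def add(self, doc_id: int, text: str) -> None:
        self.docs[doc_id] = text
        for term, tf in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = tf

    def search(self, query: str, top_n: int = 3) -> list[int]:
        scores: Counter = Counter()
        n = len(self.docs)
        for term in query.lower().split():
            hits = self.postings.get(term, {})
            if not hits:
                continue
            idf = math.log((n + 1) / (len(hits) + 1)) + 1  # smoothed IDF
            for doc_id, tf in hits.items():
                scores[doc_id] += tf * idf
        return [doc_id for doc_id, _ in scores.most_common(top_n)]

index = TinyIndex()
index.add(1, "How to place an order and pay")
index.add(2, "Delivery windows and shipping times")
index.add(3, "Refund policy and returns")
results = index.search("when is my delivery window")
```

Note that the query term "window" does not match the indexed "windows" here; that gap is exactly why real engines apply analyzers and stemming, which Lucene provides out of the box.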
