In my journey to understand genAI better, I spent a good part of the last 5-6 weeks building a RAG AI chatbot platform.
And what an amazing 1.5 months it's been!
Quite a few false starts, especially when I built it end-to-end using the OpenAI vector store + Assistants + File Search + Threads. The lag was so crazy (almost 18-20 seconds) that it felt dead-on-arrival!
So I then set about unpacking each step of a RAG setup and its querying, and building it one piece at a time.
And yes, I needed to embrace Python. PHP just couldn't do the heavy lifting.
So my current setup is as under:
- LangChain + PostgreSQL + Python + PHP + JS
- LAMP server for user onboarding and chatbot front-end management
- Self-hosted PostgreSQL DB with pgvector. Experimented with Chroma etc., and found this the easiest to work with 🙂
- Python server running 4 different Flask apps, each with its own endpoint (1. pulling URLs from a site, 2. scraping a URL into .md, 3. loading and embedding a file into the DB, and 4. responding to user queries) – a stripped-down sketch of the third one follows this list
- Using multiple models as under
- Embeddings – sentence-transformers/all-MiniLM-L6-v2. Locally hosted
- Re-ranker – cross-encoder/ms-marco-MiniLM-L-6-v2
- Summarization – OpenAI GPT-4o mini via the API
- Easy to embed on any website by just copy-pasting 2-3 lines of JavaScript code
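To give a flavour of the pieces, here is a minimal sketch of the load-and-embed step (Flask app #3). The table/column names, DSN and port are illustrative assumptions, not my exact production code.

```python
# Minimal sketch of the load-and-embed endpoint (Flask app #3).
# Table/column names, DSN and port are illustrative assumptions.
import psycopg2
from flask import Flask, request, jsonify
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

app = Flask(__name__)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim, runs locally

@app.route("/embed-file", methods=["POST"])
def embed_file():
    payload = request.get_json()
    chunks = payload["chunks"]        # pre-chunked markdown from the scraping step
    client_id = payload["client_id"]  # one knowledge base per client
    vectors = model.encode(chunks)    # batch-encode all chunks locally

    conn = psycopg2.connect("dbname=ragdb user=rag host=localhost")  # illustrative DSN
    register_vector(conn)             # lets psycopg2 send numpy arrays as pgvector values
    with conn, conn.cursor() as cur:
        for text, vec in zip(chunks, vectors):
            cur.execute(
                "INSERT INTO documents (client_id, content, embedding) VALUES (%s, %s, %s)",
                (client_id, text, vec),
            )
    conn.close()
    return jsonify({"inserted": len(chunks)})

if __name__ == "__main__":
    app.run(port=5003)
```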
Other observations
Each step of the RAG setup is an opportunity to experiment, learn and improve. I have sincere and deep respect for the data scientists/AI engineers who understand these nuances.
At my end, I tapped into ChatGPT, Perplexity and Claude (free) to help me write code and also to understand each step without the jargon. And I can't help but wonder what's possible with Cursor + Claude – the setup that everyone seems to be raving about!
- Current query steps/pipeline:
- Query Understanding
- Query embedding
- Hybrid Retrieval
- Reranking
- Summarization
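In code, stripped of logging and error handling, the flow looks roughly like the skeleton below. The helper functions, prompt and top-k value are simplified placeholders rather than my exact implementation.

```python
# Rough skeleton of the query pipeline; helpers and prompt are simplified placeholders.
from sentence_transformers import SentenceTransformer, CrossEncoder
from openai import OpenAI

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def understand_query(query: str) -> dict:
    """Placeholder for the query-understanding step."""
    return {"raw": query}

def hybrid_retrieve(q_vec, query: str, client_id: int) -> list[dict]:
    """Placeholder; the weighted pgvector query is sketched further below."""
    raise NotImplementedError

def answer(query: str, client_id: int) -> str:
    intent = understand_query(query)                       # 1. query understanding (used downstream, omitted here)
    q_vec = embedder.encode(query)                         # 2. query embedding (local MiniLM)
    candidates = hybrid_retrieve(q_vec, query, client_id)  # 3. hybrid retrieval from pgvector
    pairs = [(query, c["content"]) for c in candidates]
    scores = reranker.predict(pairs)                       # 4. reranking with the cross-encoder
    top = [c for _, c in sorted(zip(scores, candidates), key=lambda p: -p[0])][:5]
    context = "\n\n".join(c["content"] for c in top)
    completion = llm.chat.completions.create(              # 5. summarization via GPT-4o mini
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content
```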
- Pipeline Details
- Skipped the query expansion step for now. My guess is that building my own bi-directional dictionary may be more powerful, e.g. the en_core_web_sm model failed to consider “risk covers” as a proxy for “insurance”
- Doing hybrid retrieval with “adjustable” weights for each client/knowledge base, currently at 50% each. Will have to really test this out (a rough sketch follows this list)
- Adjusted the retrieval to return only results above a threshold score (thanks Naresh). Currently at 0.4
- The overall lag is still around 10 seconds, so lots of optimization is needed. And this is with just 450 records in the DB and indexes in place:
- Request received: 0.0003s
- Query expansion: 0.0000s
- Query understanding: 0.0104s
- Query embedding: 0.0217s
- Hybrid retrieval: 4.3745s
- Reranking: 1.7954s
- Prompt generation: 0.0000s
- Summarization: 3.9765s
- Total processing time: 10.1945s
- Data ingestion is the key challenge, especially with complicated tables, and perhaps the most critical step too. Experimented with multiple options for tables with merged cells in a PDF, with little luck. I now understand the need to invest in building very accurate data-ingestion libraries/algos.
- Accuracy varies with the complexity of the query. E.g. the demo, which is trained on the HDFC Diners Black credit card, answers most queries pertaining to fees, benefits etc., but it does miss a few edge cases.
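For the curious, the hybrid retrieval is roughly the query below: a pgvector cosine-similarity score and a Postgres full-text score are blended with the per-client weights, and anything under the threshold is dropped. Table/column names and the scoring formula are simplified for illustration, and it assumes the pgvector adapter has been registered on the connection (as in the ingestion sketch above).

```python
# Simplified sketch of the weighted hybrid retrieval with a score threshold.
# Table/column names are illustrative; weights and threshold come from the client's config.
def hybrid_retrieve(cur, q_vec, query, client_id,
                    vec_weight=0.5, kw_weight=0.5, threshold=0.4, k=20):
    cur.execute(
        """
        SELECT content,
               %(vw)s * (1 - (embedding <=> %(qvec)s))                       -- vector similarity
             + %(kw)s * ts_rank(to_tsvector('english', content),
                                plainto_tsquery('english', %(q)s)) AS score  -- keyword score
        FROM documents
        WHERE client_id = %(cid)s
        ORDER BY score DESC
        LIMIT %(k)s
        """,
        {"vw": vec_weight, "kw": kw_weight, "qvec": q_vec,
         "q": query, "cid": client_id, "k": k},
    )
    return [{"content": content, "score": score}
            for content, score in cur.fetchall() if score >= threshold]
```

One thing I suspect contributes to the 4+ second retrieval time is that ordering by the blended score means Postgres cannot simply use the vector index, so running the vector and keyword searches separately and merging the results in Python is on my list of things to test.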
Next steps:
I have a list of experiments/optimizations to run, including:
- How does tinkering with PostgreSQL configuration impact retrieval time?
- What's the best way to cache similar queries?
- How to improve the search step – test out other methods/algos
- Which model is a good candidate for a locally hosted summarization model? (currently spending almost 2,500 tokens on each RAG query)
- How to build domain-specific keywords for query understanding/expansion, or are there models which already do this? (refer to the insurance example; a rough sketch of what I have in mind follows this list)
- Is there a small, locally hosted model which can classify user queries as RAG, non-RAG simple chat, or non-RAG do-specific-action?
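On the domain-keywords point, the shape I have in mind is nothing fancier than a hand-curated mapping applied in both directions at query time. The entries and function below are purely illustrative.

```python
# Illustrative sketch of a bi-directional domain-synonym dictionary for query expansion.
# Entries are examples only; the real list would be curated per knowledge base.
SYNONYMS = {
    "insurance": {"risk cover", "risk covers", "cover"},
    "fees": {"charges", "annual fee", "joining fee"},
}

# Build the reverse mapping so expansion works in both directions.
EXPANSIONS: dict[str, set[str]] = {}
for head, alts in SYNONYMS.items():
    EXPANSIONS.setdefault(head, set()).update(alts)
    for alt in alts:
        EXPANSIONS.setdefault(alt, set()).add(head)

def expand_query(query: str) -> str:
    """Append the known synonyms of any term found in the query."""
    lowered = query.lower()
    extra = {syn for term, syns in EXPANSIONS.items() if term in lowered for syn in syns}
    return query + (" " + " ".join(sorted(extra)) if extra else "")

# e.g. expand_query("What risk cover do I get?") also pulls in "insurance"
```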