RAG Knowledge Base: From Document Ingestion to Traceable Answers
Tech choices, RAG pipeline, and test results from the Cloud Knowledge Base graduation project.
My undergraduate capstone, Cloud Knowledge Base, is an end-to-end RAG Q&A system: users upload documents, the system parses and chunks them, embeds each chunk and writes the vectors to a vector store, and answers questions with cited sources.
Architecture overview
- Document parsing — PDF / Word / Markdown unified parsing and semantic chunking
- Embedding — Tongyi embeddings written to Milvus
- Retrieval-augmented generation — Top-K similar chunks + prompt assembly
- Traceable answers — Responses include referenced source passages
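The retrieval and prompt-assembly steps listed above can be sketched roughly as follows. This is a minimal in-memory illustration, not the project's actual code: the `Chunk` fields, the cosine ranking, and the prompt wording are assumptions, and in the real system the similarity search would be delegated to Milvus rather than computed in Python.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_name: str        # hypothetical field: title of the source document
    text: str
    vector: list[float]  # embedding produced elsewhere by the embedding model


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[Chunk], k: int = 3) -> list[Chunk]:
    # Rank every chunk by similarity to the query and keep the best k.
    return sorted(chunks, key=lambda c: cosine(query_vec, c.vector), reverse=True)[:k]


def build_prompt(question: str, hits: list[Chunk]) -> str:
    # Number each passage so the model can cite it as [1], [2], ...
    sources = "\n".join(
        f"[{i}] ({c.doc_name}) {c.text}" for i, c in enumerate(hits, start=1)
    )
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


# Toy 2-dimensional vectors stand in for real embeddings.
chunks = [
    Chunk("guide.pdf", "Milvus stores the chunk embeddings.", [0.9, 0.1]),
    Chunk("notes.md", "Vue 3 renders the Q&A history.", [0.1, 0.9]),
    Chunk("guide.pdf", "Deleting a document removes its vectors.", [0.8, 0.3]),
]
hits = top_k([1.0, 0.0], chunks, k=2)
prompt = build_prompt("Where are embeddings stored?", hits)
```

Numbering the passages inside the prompt is what makes the later citation step possible: the model's `[n]` markers can be mapped back to the exact chunks that were retrieved.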
When deleting a document, also remove its vectors from Milvus. Otherwise chunks of the deleted document can still be retrieved and cited ("ghost retrieval"), an easy detail to miss in multi-user setups.
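A small sketch of the deletion rule above, under stated assumptions: `InMemoryIndex` is a hypothetical stand-in for the vector store and does not reflect the Milvus client API, and the `user_id` / `doc_id` fields are illustrative names. The point is only that removing a document and removing its vectors happen in the same operation, scoped to the owning user.

```python
from dataclasses import dataclass


@dataclass
class VectorRecord:
    user_id: str  # owner, used for multi-user isolation
    doc_id: str   # document the chunk came from
    text: str


class InMemoryIndex:
    """Hypothetical stand-in for the vector store (not the Milvus client API)."""

    def __init__(self) -> None:
        self.records: list[VectorRecord] = []

    def insert(self, record: VectorRecord) -> None:
        self.records.append(record)

    def delete_by_doc(self, user_id: str, doc_id: str) -> int:
        # Drop every chunk of this user's document; return how many were removed.
        before = len(self.records)
        self.records = [
            r for r in self.records
            if not (r.user_id == user_id and r.doc_id == doc_id)
        ]
        return before - len(self.records)

    def texts_for(self, user_id: str) -> list[str]:
        # Everything retrieval could still return for this user.
        return [r.text for r in self.records if r.user_id == user_id]


def delete_document(documents: dict, index: InMemoryIndex, user_id: str, doc_id: str) -> int:
    # Remove the vectors together with the metadata row so no orphaned chunks remain.
    removed = index.delete_by_doc(user_id, doc_id)
    documents.pop((user_id, doc_id), None)
    return removed


index = InMemoryIndex()
documents = {
    ("alice", "d1"): "guide.pdf",
    ("alice", "d2"): "notes.md",
    ("bob", "d1"): "bob.pdf",
}
index.insert(VectorRecord("alice", "d1", "guide chunk 1"))
index.insert(VectorRecord("alice", "d1", "guide chunk 2"))
index.insert(VectorRecord("alice", "d2", "notes chunk"))
index.insert(VectorRecord("bob", "d1", "bob chunk"))

removed = delete_document(documents, index, "alice", "d1")
```

Keying the delete on both the user and the document ID matters here: two users can hold documents with the same ID, and one user's delete must not touch the other's chunks.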
Backend highlights
- Spring Boot 3 + JWT for multi-user isolation
- Delete document → delete vectors, keeping the index consistent
- All 28 functional and security tests passed
Frontend highlights
- Vue 3 for document management, Q&A history, and resource library modules
- Bookshelf / notes / media extensions decoupled from the core RAG pipeline
Lessons learned
Chunk granularity
Chunks that are too large dilute retrieval precision; chunks that are too small fragment the context. We used a hybrid strategy: split on paragraph boundaries, then cap each chunk at a maximum token count.
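One way to read that hybrid strategy is sketched below. Several details are assumptions rather than the project's actual behavior: tokens are approximated by whitespace-separated words, adjacent short paragraphs are merged until the limit is reached, and an oversized paragraph is cut into fixed windows. The function name `chunk_text` and the default limit are hypothetical.

```python
def chunk_text(text: str, max_tokens: int = 200) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []  # words of the chunk being built

    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        # Close the current chunk if this paragraph would push it over the limit.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        # A single paragraph longer than the limit is cut into fixed-size windows.
        while len(words) > max_tokens:
            chunks.append(" ".join(words[:max_tokens]))
            words = words[max_tokens:]
        current.extend(words)

    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "a b c\n\nd e\n\nf g h i j k l"
pieces = chunk_text(sample, max_tokens=5)
```

With a limit of five words, the first two paragraphs merge into one chunk and the long third paragraph is split, so no chunk exceeds the cap while paragraph boundaries are respected wherever they fit.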
Citation UI
Users need to know where answers come from. Showing cited snippets alongside the answer noticeably improved trust.
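To show cited snippets next to an answer, the UI needs to know which retrieved passages the answer actually references. A minimal sketch, assuming the model cites sources with bracketed numbers such as `[1]` and that the backend still holds the ordered list of retrieved passages; the `doc` and `snippet` keys and the function name are hypothetical.

```python
import re


def cited_snippets(answer: str, sources: list[dict]) -> list[dict]:
    # Collect distinct citation numbers in order of first appearance,
    # ignoring markers that do not map to a retrieved source.
    seen: list[int] = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        if n not in seen and 1 <= n <= len(sources):
            seen.append(n)
    return [{"marker": n, **sources[n - 1]} for n in seen]


sources = [
    {"doc": "guide.pdf", "snippet": "Deleting a document removes its vectors."},
    {"doc": "guide.pdf", "snippet": "Milvus stores the chunk embeddings."},
]
answer = "Embeddings live in Milvus [2]. Deletion removes them [1][2]. See also [7]."
cards = cited_snippets(answer, sources)
```

Dropping out-of-range markers such as `[7]` keeps the UI from rendering a citation card for a source that was never retrieved.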
Sending email from an unverified domain is fine for testing only; configure SPF and DKIM before going to production.
Further reading
- Project detail: Cloud Knowledge Base
- More notes: Blog list