Department of CSE (Data Science), ACE Engineering College, Telangana, India.
World Journal of Advanced Engineering Technology and Sciences, 2025, 15(02), 089-097
Article DOI: 10.30574/wjaets.2025.15.2.0513
Received on 18 March 2025; revised on 29 April 2025; accepted on 01 May 2025
Digital information now arrives in many formats, including documents, images, and videos, yet making sense of all of it in a meaningful way remains a challenge. This project proposes a unified chatbot system that can understand and interact with content from multiple sources using a multi-modal Retrieval-Augmented Generation (RAG) approach powered by Google’s Gemini-1.5 model. Users can upload PDFs, Word documents, CSV files, images containing text, and YouTube links. The system extracts key information using techniques such as optical character recognition (OCR) and video transcription, and then lets users ask questions directly about the content. Its main strength is the ability to merge different types of input and generate accurate, context-aware answers. The interface is built with Streamlit and offers an interactive user experience with features such as real-time previews, downloadable notes, chat history, and multilingual support. The project reflects the growing need for AI systems that are intelligent, flexible, and capable of understanding information the way humans do, from all angles and in all forms.
Multi-modal Retrieval-Augmented Generation; Gemini-1.5 Language Model; Document and Image Processing; YouTube Transcript Summarization
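The retrieval step that the abstract describes can be illustrated with a minimal sketch. It assumes that text has already been extracted from each source (by OCR, transcription, or document parsing) and stored as labelled chunks. The `Chunk` type, the source labels, and the bag-of-words cosine scoring are illustrative assumptions and are not taken from the paper; a real system would more likely use embedding-based search, and the resulting prompt would then be passed to the language model.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # hypothetical label, e.g. "pdf", "image-ocr", "youtube-transcript"
    text: str    # text already extracted from that source


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question, across all sources."""
    q = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(tokenize(c.text))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    """Assemble a prompt that puts the retrieved context before the question."""
    context = "\n".join(f"[{c.source}] {c.text}" for c in retrieve(question, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    chunks = [
        Chunk("pdf", "The invoice total is 420 dollars."),
        Chunk("image-ocr", "Meeting room B is on the third floor."),
        Chunk("youtube-transcript", "The speaker explains gradient descent and learning rate."),
    ]
    print(build_prompt("What is the learning rate in gradient descent?", chunks, k=1))
```

Keeping a source label on every chunk lets one index serve all input types, so a single question can draw on a document, an image, and a video transcript at once.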
P Chiranjeevi, Nagalaxmi Kalluri, Sai Saket Gurubhagavatula, Abhishek Kuncham and Mohammed Sami. Unified AI Multi-modal Chatbot. World Journal of Advanced Engineering Technology and Sciences, 2025, 15(02), 089-097. Article DOI: https://doi.org/10.30574/wjaets.2025.15.2.0513.