How to Build a Chat App With LLM in Python: Complete Guide April 2026
Learn how to build a chat app with LLM in Python with streaming, RAG, and production interfaces. Complete implementation guide for April 2026.
Tom Gotsman
Your chatbot with an LLM in Python works great in the terminal, but now you need it to remember conversations, stream responses as they generate, and run in a browser. The jump from script to production app involves managing state, connecting to vector databases for custom knowledge, and building an interface that updates in real time. This guide shows you how to build all of that while keeping your code in Python, so you're not switching between languages just to add a text box.
TLDR:
- You can build a Python chatbot by connecting to OpenAI's API, managing message history as a list, and looping through user input and AI responses.
- Streaming responses display tokens as they generate using stream=True, making your chatbot feel faster than waiting for complete answers.
- RAG extends chatbot knowledge beyond training data by retrieving relevant document chunks from vector databases and injecting them as prompt context.
- Reflex lets you build production web interfaces entirely in Python with real-time streaming and state management, eliminating the need for JavaScript.
An AI chatbot processes user input through a text prompt, sends that prompt to an LLM via an API call, and returns generated responses. The conversation happens through a request-response cycle where your Python code acts as the intermediary between your user interface and the AI service.
You need three components:
- an LLM provider like OpenAI, Anthropic, or Groq that hosts the AI model;
- Python code that formats messages, makes API requests, and processes responses; and
- a way to store conversation history so the model remembers context across multiple exchanges.
The LLM is a neural network trained on text data that predicts the most likely next words given an input prompt. The model itself holds no memory between calls, so your Python application manages the conversation history, appending each user message and AI response to maintain conversational flow.
You need Python 3.10 or higher installed on your system. Check your version by running python --version in your terminal. If you need to install or upgrade, download the latest version from python.org.
Create a virtual environment to isolate your project dependencies:
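```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```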
Install the OpenAI Python client:
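```bash
pip install openai
```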
Store your API key securely using environment variables. Create a .env file in your project root:
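```
OPENAI_API_KEY=your-api-key-here
```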
Install python-dotenv to load environment variables:
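```bash
pip install python-dotenv
```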
Add .env to your .gitignore file to prevent accidentally committing credentials to version control.
Load your API key from the environment and create an OpenAI client instance:
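```python
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY from your .env file

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
```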
The Responses API accepts a list of message dictionaries with role and content fields. Roles include system for instructions, user for input, and assistant for AI responses:
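```python
# Assumes the client from the previous step; the model name is illustrative.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what an API is in one sentence."},
]

response = client.responses.create(
    model="gpt-5",
    input=messages,
)
print(response.output[0].content[0].text)
```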
Extract the generated text from response.output[0].content[0].text. The response includes metadata like token counts and model version.
A conversational loop collects user input, appends it to your message history, sends the full history to the LLM, and displays the response. Here's a working implementation:
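```python
# Assumes the client from the setup section; the model name is illustrative.
messages = [
    {"role": "system", "content": "You are a helpful assistant."}
]

while True:
    user_input = input("You: ")
    if user_input.lower() in {"quit", "exit"}:
        break

    messages.append({"role": "user", "content": user_input})

    # Send the full history so the model keeps conversational context.
    response = client.responses.create(
        model="gpt-5",
        input=messages,
    )

    reply = response.output[0].content[0].text
    messages.append({"role": "assistant", "content": reply})
    print(f"AI: {reply}")
```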
The messages list grows with each exchange, giving the model full conversation context. Each API call includes every previous message in the input parameter, so the chatbot remembers earlier discussion points.
Conversations with AI chatbots can get expensive quickly, so you need to manage the context window and memory to keep token costs from spiraling out of control. Keep three basic ideas in mind as you build your memory strategy:
- LLMs process each request without inherent memory of past exchanges. You must include the full conversation history in every API call to maintain context. The input array acts as this memory, growing with each turn as user prompts and assistant replies accumulate.
- As conversations extend, token usage rises. A typical exchange might reach hundreds or thousands of tokens once you account for system prompts, user messages, and responses. Manage this by limiting history to recent messages or tracking token counts to stay within model limits (see the truncation sketch after this list).
- For production chatbots, save conversation state per session. Store message arrays in databases or caching layers keyed by user ID. When users return, retrieve their history to resume where they left off. Clear old sessions to prevent unrelated topics from contaminating new conversations.
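One way to implement the truncation strategy is sketched below; the MAX_TURNS cutoff is an illustrative value, not a tuned recommendation.

```python
MAX_TURNS = 10  # illustrative cutoff; tune to your model's context window

def trim_history(messages: list[dict]) -> list[dict]:
    """Keep the system prompt plus only the most recent exchanges."""
    system, rest = messages[:1], messages[1:]
    # Each turn adds two entries: a user message and an assistant reply.
    return system + rest[-MAX_TURNS * 2:]
```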
RAG extends your chatbot's knowledge beyond its training data by retrieving relevant information from your documents before generating responses. When a user asks a question, your system converts the query into a vector embedding, searches your vector database for the closest matching chunks, and inserts the top results into the prompt as context. The LLM then generates answers grounded in your specific documents. Chatbots with RAG can answer domain-specific questions that generic models cannot, making them valuable for customer support and documentation search.
Vector databases store document embeddings as numerical arrays and retrieve the most similar vectors when you query them. ChromaDB and Weaviate are commonly used vector databases. ChromaDB works well for local development without external services.
To use these databases for knowledge retrieval, split documents into chunks, generate embeddings with OpenAI's API, then store them. When users ask questions, convert the queries to embeddings and search for the nearest vectors, then inject the matching chunks into your prompt context. The LLM answers using your documents instead of just its training data.
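Here's a minimal sketch of that flow using ChromaDB and OpenAI embeddings. It reuses the client from earlier, and the collection name, sample chunks, and embedding model are illustrative choices:

```python
import chromadb

chroma = chromadb.Client()  # in-memory; use chromadb.PersistentClient for disk storage
collection = chroma.get_or_create_collection("docs")

def embed(texts: list[str]) -> list[list[float]]:
    result = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in result.data]

# Index document chunks (a real app would split files into chunks first).
chunks = [
    "Reflex apps are written entirely in Python.",
    "Streaming shows tokens as the model generates them.",
]
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embed(chunks),
)

# Convert the user's question to an embedding, find the nearest chunks,
# and inject them into the prompt as context.
question = "How do I build a chat UI without JavaScript?"
hits = collection.query(query_embeddings=embed([question]), n_results=2)
context = "\n".join(hits["documents"][0])
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
```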
The table below provides a quick overview of the different kinds of vector databases, when they should be used, and how they can be integrated into your chatbot using Python.
| Vector Database | Deployment Type | Best Use Case | Python Integration |
|---|---|---|---|
| ChromaDB | Embedded local database that runs in-process with your application | Local development and prototyping without external infrastructure dependencies | Install with pip install chromadb and use simple Python API for storing and querying embeddings |
| Weaviate | Self-hosted or cloud-managed with GraphQL and REST APIs | Applications needing hybrid search combining vector similarity with traditional keyword search | Native Python client supporting both vector and scalar filtering with schema-based data organization |
The system message in your messages array controls your chatbot's behavior, tone, and output format. You can change the system prompt to define personality:
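```python
# The persona here is illustrative; swap in whatever instructions fit your app.
messages = [
    {
        "role": "system",
        "content": "You are a pirate captain. Answer every question in sea-faring slang.",
    },
    {"role": "user", "content": "What is a Python list?"},
]
```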
You should test different system prompts to find what works. A customer support bot might use "You are a professional support agent. Be empathetic and solution-focused." A code reviewer could get "You review Python code for bugs and suggest improvements. Be direct and technical." Adjust personality without changing any other code. The same API integration produces completely different chatbot behavior based solely on your system prompt.
Streaming displays tokens as the model generates them instead of waiting for complete responses. Set stream=True in your API call:
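```python
# Assumes the client and messages from earlier; the model name is illustrative.
stream = client.responses.create(
    model="gpt-5",
    input=messages,
    stream=True,
)

for event in stream:
    # Text arrives as incremental delta events in the Responses API.
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
print()
```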
Each event contains a fragment of the response. Print them immediately without line breaks to show text appearing word-by-word. The flush=True parameter forces Python to display output instantly instead of buffering it. Users see responses start within milliseconds, making your chatbot feel faster even though total generation time remains constant.
API calls fail. Network timeouts, rate limits, and service outages happen regularly when making requests to LLM providers. Wrap your API calls in try-except blocks to catch exceptions:
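```python
from openai import APIConnectionError, APITimeoutError, RateLimitError

try:
    # Assumes the client and messages from earlier; the model name is illustrative.
    response = client.responses.create(model="gpt-5", input=messages)
    reply = response.output[0].content[0].text
except RateLimitError:
    reply = "The service is busy right now. Please try again in a moment."
except (APITimeoutError, APIConnectionError):
    reply = "Couldn't reach the AI service. Check your connection and retry."
```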
Rate limits restrict how many requests you can make per minute. When you hit limits, wait before retrying. The tenacity library automates retry logic with exponential backoff, gradually increasing wait times between attempts.
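As one possible pattern, the sketch below retries only rate-limit errors with exponential backoff; the wait and attempt values are illustrative:

```python
from openai import RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_exponential(multiplier=1, min=2, max=30),  # 2s, 4s, 8s... capped at 30s
    stop=stop_after_attempt(5),
)
def ask(messages: list[dict]) -> str:
    response = client.responses.create(model="gpt-5", input=messages)
    return response.output[0].content[0].text
```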
Reflex lets you build the entire chatbot interface in Python without writing JavaScript. Install Reflex and create a new project:
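```bash
pip install reflex
reflex init
```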
Define your chatbot state and event handlers in a single Python file:
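```python
# A minimal sketch of the state class; the model name and system prompt
# are illustrative choices.
import reflex as rx
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

class ChatState(rx.State):
    messages: list[dict[str, str]] = [
        {"role": "system", "content": "You are a helpful assistant."}
    ]
    question: str = ""

    async def answer(self):
        # Append the user message and an empty assistant placeholder.
        self.messages.append({"role": "user", "content": self.question})
        self.messages.append({"role": "assistant", "content": ""})
        self.question = ""
        yield  # push the new messages to the browser immediately

        stream = client.responses.create(
            model="gpt-5",
            input=self.messages[:-1],  # exclude the empty placeholder
            stream=True,
        )
        for event in stream:
            if event.type == "response.output_text.delta":
                self.messages[-1]["content"] += event.delta
                yield  # re-render the UI with each streamed fragment
```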
Build the frontend using Reflex components in the same Python file:
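```python
def index() -> rx.Component:
    return rx.vstack(
        # messages[1:] hides the system prompt from the chat display.
        rx.foreach(
            ChatState.messages[1:],
            lambda m: rx.box(rx.text(m["content"])),
        ),
        rx.input(
            placeholder="Ask a question...",
            value=ChatState.question,
            on_change=ChatState.set_question,
        ),
        rx.button("Send", on_click=ChatState.answer),
    )

app = rx.App()
app.add_page(index)
```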
The yield statement in the backend state updates the UI as tokens arrive, showing streaming responses in real time. The rx.foreach loop displays each message dynamically, and messages[1:] skips the system prompt so users only see their conversation. Your Python code manages conversation state, API calls, and frontend display without switching languages.
The foundation of any AI chatbot is surprisingly simple: collect input, maintain message history, call the API, and display responses. When you build a chatbot with Python, you can handle everything from API calls to the web interface without leaving the language. Get the basic loop working first, then expand with features like RAG and streaming that make your chatbot feel professional.
You can create a basic conversational chatbot in under 30 minutes once your development environment is set up. Setting up Python 3.10+, installing the OpenAI client, and configuring API credentials takes about 10-15 minutes, while implementing the core message loop and streaming responses adds another 15-20 minutes.
Reflex lets you build the complete chatbot interface in pure Python without writing JavaScript, while basic Python scripts only handle backend logic and require separate frontend development. With Reflex, your state management, API calls, and UI displays all live in one Python file, and streaming responses update the interface in real-time using yield statements.
Implement RAG when your chatbot needs to answer questions about specific documents, proprietary data, or information outside the LLM's training cutoff date. If your chatbot handles general conversation or doesn't require domain-specific knowledge beyond what models like GPT-5 already know, the base LLM works fine without the added complexity of vector databases.
Yes, the conversational loop pattern works with any LLM provider that offers a chat completion API, including Anthropic, Groq, Cohere, and Hugging Face. You'll need to swap the client initialization and adjust the API call syntax to match each provider's SDK, but the core message management and streaming logic remains identical. For example, if you want to build an Anthropic dashboard that visualizes chat interactions, you can apply the same Python patterns with Anthropic's client library.
Store the complete message history in a list that grows with each exchange, sending the full conversation array with every API call so the model sees all prior context. For very long conversations that exceed token limits, either truncate older messages while keeping recent exchanges or implement a summarization step that condenses early conversation portions into a single context message.