LangGraph Tutorial (Python): optimizing token usage for beginners
This tutorial shows you how to build a small LangGraph workflow in Python that keeps token usage under control by trimming state, summarizing history, and avoiding unnecessary model calls. You need this when your graph starts accumulating conversation state or tool output and your LLM bill climbs for no good reason.
What You'll Need
- Python 3.10+
- langgraph
- langchain-openai
- langchain-core
- An OpenAI API key set as OPENAI_API_KEY
- Basic familiarity with LangGraph nodes, edges, and state
Install the packages:

```shell
pip install langgraph langchain-openai langchain-core
```
Step-by-Step
- Start with a minimal graph state that stores only what the model actually needs. The main mistake beginners make is carrying full chat history through every node when a short summary would do.
```python
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_core.messages import HumanMessage, AIMessage


class State(TypedDict):
    messages: Annotated[list, add_messages]
    summary: str
    token_budget: int
```
- Add a compacting node that summarizes older messages once the conversation gets too long. This reduces prompt size before the next LLM call and is the simplest reliable token-saving pattern.
```python
from langchain_core.messages import RemoveMessage
from langchain_openai import ChatOpenAI

# Reads OPENAI_API_KEY from the environment automatically.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


def summarize_if_needed(state: State):
    messages = state["messages"]
    if len(messages) <= 4:
        return {}  # below the threshold: nothing to compact
    recent = messages[-2:]
    older = messages[:-2]
    older_text = "\n".join(m.content for m in older)
    prompt = [
        HumanMessage(content=f"Summarize these messages in 3 bullet points:\n{older_text}"),
    ]
    summary_msg = llm.invoke(prompt)
    # add_messages merges by message id, so returning `recent` alone would NOT
    # drop the older turns. Emit RemoveMessage markers to delete them instead.
    return {
        "summary": summary_msg.content,
        "messages": [RemoveMessage(id=m.id) for m in older],
    }
```
- Build the answer node so it uses the summary plus only the recent turns. This keeps the prompt small while still preserving enough context for good answers.
```python
def answer(state: State):
    summary = state.get("summary", "")
    recent_messages = state["messages"][-2:]
    system_text = (
        "You are a concise assistant.\n"
        f"Conversation summary: {summary}\n"
        "Use only the recent messages and summary."
    )
    response = llm.invoke(
        [{"role": "system", "content": system_text}, *recent_messages]
    )
    return {"messages": [response]}
```
- Wire the graph so it compacts first, then answers. In production, this pattern prevents every node from seeing raw history unless it truly needs it.
```python
builder = StateGraph(State)
builder.add_node("compact", summarize_if_needed)
builder.add_node("answer", answer)
builder.add_edge(START, "compact")
builder.add_edge("compact", "answer")
builder.add_edge("answer", END)
graph = builder.compile()
```
- Run the graph with a small initial state and inspect how much context survives after compaction. The important part is that only a subset of messages gets forwarded after the threshold is crossed.
```python
initial_state: State = {
    "messages": [
        HumanMessage(content="My policy renewal failed."),
        AIMessage(content="What error did you see?"),
        HumanMessage(content="It said invalid payment method."),
        AIMessage(content="Try updating the card."),
        HumanMessage(content="I updated it but still get rejected."),
    ],
    "summary": "",
    "token_budget": 1000,
}

result = graph.invoke(initial_state)
print(result["summary"])
print(result["messages"][-1].content)
```
- If you want stricter control, gate expensive work behind a simple budget check before calling the model again. Beginners often skip this and let every branch call an LLM even when no new information was added.
```python
def should_answer(state: State):
    if len(state["messages"]) < 2:
        return END
    # Rough budget gate: ~4 characters per token across summary + messages.
    chars = len(state.get("summary", "")) + sum(
        len(m.content) for m in state["messages"]
    )
    if chars // 4 > state["token_budget"]:
        return END  # over budget: skip the extra model call
    return "answer"


budget_builder = StateGraph(State)
budget_builder.add_node("compact", summarize_if_needed)
budget_builder.add_node("answer", answer)
budget_builder.add_edge(START, "compact")
budget_builder.add_conditional_edges("compact", should_answer)
budget_builder.add_edge("answer", END)
budget_graph = budget_builder.compile()
```
Testing It
Run the script with OPENAI_API_KEY exported in your shell and confirm that the graph returns an answer without error. Then increase the number of input turns and verify that summary starts filling in while messages gets trimmed down to just recent content.
To check token savings, compare prompt length before and after compaction by printing the serialized message count or using your provider’s usage metadata if available. If you see every turn being sent back to the model unchanged, your compaction step is not firing early enough.
A good smoke test is to feed in 10-15 alternating human/AI turns and confirm that only a small tail of messages reaches the final answer node. That tells you your graph is controlling context growth instead of letting it compound.
Next Steps
- Add token-aware routing using a real tokenizer or provider usage metadata.
- Replace naive summarization with structured memory fields like facts, open_questions, and decisions.
- Learn LangGraph persistence so summaries survive across sessions without reprocessing old turns.
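As a starting point for token-aware routing, here is a minimal sketch. A real tokenizer (e.g. tiktoken) would be more accurate; this uses the same ~4 chars/token heuristic to stay dependency-free. `route_by_tokens` and the budget value are illustrative assumptions, not LangGraph APIs:

```python
def route_by_tokens(state: dict, budget: int = 500) -> str:
    # Estimate the prompt size from summary + message text (~4 chars/token).
    chars = len(state.get("summary", "")) + sum(
        len(text) for text in state.get("messages", [])
    )
    estimated = chars // 4
    # Under budget: answer directly. Over budget: compact first.
    return "answer" if estimated <= budget else "compact"


print(route_by_tokens({"summary": "", "messages": ["short question"]}))
```

Plugged into `add_conditional_edges`, a function like this lets the graph skip compaction entirely for short conversations instead of running it on every turn.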
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist and starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.