LangGraph Tutorial (Python): optimizing token usage for advanced developers
This tutorial shows you how to build a LangGraph workflow that actively reduces token usage without breaking agent quality. You need this when your graph starts making too many full-context LLM calls, repeating retrieved text, or carrying irrelevant state across nodes.
What You'll Need
- Python 3.10+
- `langgraph`
- `langchain-openai`
- `langchain-core`
- An OpenAI API key set as `OPENAI_API_KEY`
- Basic familiarity with LangGraph state graphs and message passing
- A terminal and a virtual environment
Install the packages:
```shell
pip install langgraph langchain-openai langchain-core
```
Step-by-Step
- Start by using a compact state model instead of passing raw chat history everywhere. The main trick is to store only what each node needs, then summarize or trim aggressively before the next model call.
```python
from typing import Annotated, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class State(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    summary: str
    topic: str
```
- Build two models: one for expensive reasoning and one for cheap summarization. In production, this split is where most token savings come from, because you stop using your best model for every maintenance task.
```python
from langchain_openai import ChatOpenAI

# A stronger model for user-facing answers, a cheaper one for maintenance
# tasks like summarization; swap in whichever tiers your provider offers
reasoner = ChatOpenAI(model="gpt-4o", temperature=0)
summarizer = ChatOpenAI(model="gpt-4o-mini", temperature=0)
```
- Add a summarization node that compresses old context into a short running summary. This lets later nodes work from a few lines of durable context instead of replaying the entire conversation.
```python
from langchain_core.messages import HumanMessage, SystemMessage

def summarize_state(state: State):
    # Render messages as plain "role: content" lines so the summarizer
    # does not pay tokens for message-object metadata in the list repr
    transcript = "\n".join(f"{m.type}: {m.content}" for m in state["messages"])
    prompt = [
        SystemMessage(content=(
            "Summarize the conversation in under 80 words. "
            "Keep only facts needed for future reasoning."
        )),
        HumanMessage(content=(
            f"Current summary:\n{state.get('summary', '')}\n\nMessages:\n{transcript}"
        )),
    ]
    result = summarizer.invoke(prompt)
    return {"summary": result.content}
```
- Add a routing node that decides whether the graph needs the full message list or just the summary. This avoids sending large histories into every branch, especially when the user asks follow-up questions that only need the compressed context.
```python
def route(state: State):
    last = state["messages"][-1].content.lower()
    if any(word in last for word in ["recap", "summary", "what did we decide"]):
        return "answer_from_summary"
    return "answer_from_messages"
```
- Create separate answer nodes for summary-based and full-context responses. The summary path should be your default for lightweight follow-ups, while the full-context path is reserved for cases where recent details matter.
```python
from langchain_core.messages import AIMessage

def answer_from_summary(state: State):
    prompt = [
        SystemMessage(content="Answer using only the summary. Be concise."),
        HumanMessage(content=(
            f"Summary:\n{state.get('summary', '')}\n\n"
            f"Question:\n{state['messages'][-1].content}"
        )),
    ]
    result = reasoner.invoke(prompt)
    return {"messages": [AIMessage(content=result.content)]}

def answer_from_messages(state: State):
    prompt = [
        SystemMessage(content="Answer using the conversation messages. Be concise."),
        *state["messages"],
    ]
    result = reasoner.invoke(prompt)
    return {"messages": [AIMessage(content=result.content)]}
```
- Wire the graph so it summarizes first, then routes to the cheapest valid answer path. The important part is that you are not blindly feeding every node the same payload; you are controlling context size at each edge.
```python
from langgraph.graph import END, START, StateGraph

builder = StateGraph(State)
builder.add_node("summarize", summarize_state)
builder.add_node("answer_from_summary", answer_from_summary)
builder.add_node("answer_from_messages", answer_from_messages)

builder.add_edge(START, "summarize")
builder.add_conditional_edges("summarize", route, {
    "answer_from_summary": "answer_from_summary",
    "answer_from_messages": "answer_from_messages",
})
builder.add_edge("answer_from_summary", END)
builder.add_edge("answer_from_messages", END)

graph = builder.compile()
```
- Run it with a small input and inspect what gets returned. In a real app, you would also log prompt sizes per node so you can see exactly where tokens are being burned.
```python
from langchain_core.messages import HumanMessage

result = graph.invoke({
    "messages": [HumanMessage(content="We decided to prioritize fraud alerts over account enrichment.")],
    "summary": "",
    "topic": "fraud",
})
print(result["messages"][-1].content)
print("Summary:", result["summary"])
```
Testing It
Run one request that contains several long messages, then ask a short follow-up like “what did we decide?” The follow-up should hit the summary path instead of replaying the whole history.
To verify token savings, compare prompt sizes before and after adding summarization by logging `len(str(prompt))` per node, or by reading your provider's usage metadata when it is available. You should see lower input tokens on summary-based turns and fewer repeated instructions across nodes.
Also test a case where recent detail matters, such as “what was the last customer complaint?” That should force the full-message branch so you do not optimize away correctness.
Next Steps
- Add a token budget gate that trims messages before every model call based on estimated input size.
- Replace free-form summaries with structured state fields like `decisions`, `open_questions`, and `entities`.
- Add observability with LangSmith so you can track prompt growth per node and catch regressions early.
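The structured-fields idea can be sketched with a Pydantic schema. The field names below (`decisions`, `open_questions`, `entities`) are just the examples from the list above, not a fixed convention; in production you would bind the schema to the summarizer with `with_structured_output` and store its fields in graph state instead of a free-form string:

```python
from pydantic import BaseModel, Field

class StructuredSummary(BaseModel):
    """Durable context extracted from the conversation."""
    decisions: list[str] = Field(default_factory=list)
    open_questions: list[str] = Field(default_factory=list)
    entities: list[str] = Field(default_factory=list)

# In a real graph: structured = summarizer.with_structured_output(StructuredSummary)
# Here we construct an instance by hand to show the shape the state would carry.
snapshot = StructuredSummary(
    decisions=["Prioritize fraud alerts over account enrichment"],
    entities=["fraud alerts", "account enrichment"],
)
print(snapshot.decisions[0])
```

Structured fields are easier to trim selectively than prose: a follow-up about decisions only needs the `decisions` list, not the whole summary.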
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.