How to Build a Document Extraction Agent Using LlamaIndex in TypeScript for Investment Banking
A document extraction agent in investment banking takes messy deal documents, extracts structured fields, and returns validated JSON you can feed into downstream systems. That matters because bankers spend too much time manually pulling terms from CIMs, credit agreements, pitch books, and KYC packs, and every manual pass adds latency, inconsistency, and compliance risk.
Architecture
- **Document ingestion layer**
  - Loads PDFs, DOCX files, and text from secure storage.
  - In practice, this usually means pulling from S3, SharePoint, or an internal DMS.
- **Text extraction and chunking**
  - Uses LlamaIndex readers to convert files into `Document` objects.
  - Splits long documents into manageable chunks for extraction.
- **Schema-driven extractor**
  - Defines the exact fields you want back: issuer name, deal size, maturity date, covenant terms, governing law.
  - Keeps outputs consistent enough for downstream validation.
- **LLM-backed parsing layer**
  - Uses `OpenAI` or another supported LLM through LlamaIndex.
  - Extracts structured data from each chunk with prompt constraints.
- **Validation and normalization layer**
  - Checks dates, currencies, percentages, and required fields.
  - Normalizes output into bank-friendly formats before persistence.
- **Audit logging layer**
  - Stores source document IDs, chunk IDs, extracted values, confidence signals, and model version.
  - This is non-negotiable for compliance and review workflows.
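The audit logging layer can be sketched as a typed record. The field names here (`sourceDocId`, `modelVersion`, and so on) are illustrative assumptions, not a fixed standard:

```typescript
// Illustrative shape for one audit entry per extracted field.
// Names and types are assumptions; adapt to your bank's logging schema.
interface AuditRecord {
  sourceDocId: string;           // stable ID of the ingested file
  chunkId: string;               // which chunk the value came from
  field: string;                 // e.g. "issuerName"
  extractedValue: string | null; // null when the model found nothing
  confidence: number | null;     // optional model/heuristic signal, 0..1
  modelVersion: string;          // e.g. "gpt-4o-mini-2024-07-18"
  extractedAt: string;           // ISO timestamp of the extraction run
}

function makeAuditRecord(
  sourceDocId: string,
  chunkId: string,
  field: string,
  extractedValue: string | null,
  modelVersion: string,
): AuditRecord {
  return {
    sourceDocId,
    chunkId,
    field,
    extractedValue,
    confidence: null, // fill in if your extraction layer emits a signal
    modelVersion,
    extractedAt: new Date().toISOString(),
  };
}
```

Persisting one record per field, rather than one per document, makes the later field-level monitoring much easier.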
Implementation
1. Install the packages
Use the TypeScript packages from LlamaIndex plus a PDF reader if your documents are mostly PDFs.
```shell
npm install llamaindex dotenv zod
```
Set your environment variables:
```shell
export OPENAI_API_KEY="your-key"
```
2. Load documents with LlamaIndex
For a real workflow, start by loading one or more source files into Document objects. If you are processing PDFs from a local secure mount or an internal bucket sync, keep the file path handling outside the agent itself.
```typescript
import "dotenv/config";
import { SimpleDirectoryReader } from "llamaindex";

async function loadDocs() {
  const reader = new SimpleDirectoryReader();
  const docs = await reader.loadData({
    directoryPath: "./deal_docs",
  });
  return docs;
}
```
This gives you a `Document[]` that LlamaIndex can process further. In banking workflows, keep the raw file hash and source URI alongside these documents for auditability.
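As a sketch of that audit trail, you can compute a content hash for each file and carry it alongside the loaded documents. The helper names and metadata keys below are assumptions, not a LlamaIndex convention:

```typescript
import { createHash } from "node:crypto";

// Compute a stable SHA-256 content hash for a raw file buffer.
// Storing this next to the source URI lets reviewers prove exactly
// which bytes a given extraction run saw.
function contentHash(fileBytes: Buffer | Uint8Array): string {
  return createHash("sha256").update(fileBytes).digest("hex");
}

// Hypothetical metadata bundle to attach to each loaded document.
function auditMetadata(sourceUri: string, fileBytes: Buffer) {
  return {
    source_uri: sourceUri,
    content_sha256: contentHash(fileBytes),
    ingested_at: new Date().toISOString(),
  };
}
```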
3. Build a schema-driven extraction pipeline
For investment banking, don’t ask the model for “summary data.” Ask for specific fields tied to a schema. That makes validation possible and reduces garbage output.
```typescript
import "dotenv/config";
import { Document, OpenAI } from "llamaindex";
import { z } from "zod";

// Every field is nullable so the model's "use null when missing"
// instruction survives schema validation.
const DealSchema = z.object({
  issuerName: z.string().nullable(),
  transactionType: z.string().nullable(),
  dealSizeUsd: z.string().nullable(),
  maturityDate: z.string().nullable(),
  governingLaw: z.string().nullable(),
});

type DealData = z.infer<typeof DealSchema>;

const llm = new OpenAI({
  model: "gpt-4o-mini",
});

async function extractDealFields(doc: Document): Promise<DealData> {
  const prompt = `
Extract the following fields from this investment banking document:
- issuerName
- transactionType
- dealSizeUsd
- maturityDate
- governingLaw

Return only valid JSON with these keys.
If a field is missing, use null.

Source text:
${doc.text}
`;

  const response = await llm.complete({ prompt });
  // Models sometimes wrap JSON in markdown fences; strip them before parsing.
  const raw = response.text.replace(/^```(?:json)?\s*|\s*```$/g, "").trim();
  const parsed = JSON.parse(raw);
  return DealSchema.parse(parsed);
}
```
This pattern is simple but production-friendly. You can wrap it with retries and chunk-level processing later without changing the schema contract.
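One way to add that retry layer is a generic helper wrapped around the extraction call. This is a sketch, not anything LlamaIndex provides; the attempt count and backoff values are illustrative defaults:

```typescript
// Generic retry wrapper: re-runs an async operation on failure with
// exponential backoff (500ms, 1000ms, 2000ms, ...).
async function withRetries<T>(
  op: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError;
}
```

Usage would then be `const extracted = await withRetries(() => extractDealFields(doc));`, which catches both transient API failures and the occasional malformed JSON response.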
4. Process multiple documents and persist results
In real deals, one document rarely contains everything. You usually need to scan multiple files and merge results by source metadata.
```typescript
import { SimpleDirectoryReader } from "llamaindex";

async function main() {
  const reader = new SimpleDirectoryReader();
  const docs = await reader.loadData({ directoryPath: "./deal_docs" });

  const results = [];
  for (const doc of docs) {
    const extracted = await extractDealFields(doc);
    results.push({
      source: doc.metadata?.file_name ?? "unknown",
      extracted,
    });
  }

  console.log(JSON.stringify(results, null, 2));
}

main().catch(console.error);
```
If you need higher recall on long PDFs, split them first using LlamaIndex’s node parsing utilities such as SentenceSplitter, then run extraction per node and reconcile duplicates afterward.
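Reconciling per-node results can be as simple as a first-non-null merge. This sketch restates the `DealData` shape locally for clarity and assumes earlier chunks win; in a real pipeline, conflicting non-null values should be flagged for human review rather than silently dropped:

```typescript
type DealData = {
  issuerName: string | null;
  transactionType: string | null;
  dealSizeUsd: string | null;
  maturityDate: string | null;
  governingLaw: string | null;
};

// Merge per-chunk extractions: keep the first non-null value seen for
// each field, in chunk order. Later duplicates are ignored here.
function reconcile(chunks: DealData[]): DealData {
  const merged: DealData = {
    issuerName: null,
    transactionType: null,
    dealSizeUsd: null,
    maturityDate: null,
    governingLaw: null,
  };
  for (const chunk of chunks) {
    for (const key of Object.keys(merged) as (keyof DealData)[]) {
      if (merged[key] === null && chunk[key] !== null) {
        merged[key] = chunk[key];
      }
    }
  }
  return merged;
}
```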
Production Considerations
- **Compliance controls**
  - Restrict which document classes the agent can process.
  - Add human review for high-risk outputs like covenant language or legal jurisdiction.
  - Keep immutable logs of prompts, model version, source document IDs, and final extracted fields.
- **Data residency**
  - Route EU or APAC client data to region-specific infrastructure.
  - Do not send sensitive deal materials to unmanaged endpoints.
  - If your bank requires it, use an approved model deployment inside a private network boundary.
- **Monitoring**
  - Track extraction accuracy by field type: dates fail differently than names or dollar amounts, so build field-level metrics instead of one global score.
  - Monitor parse failures, invalid JSON rate, retry rate, and manual override rate.
- **Guardrails**
  - Reject outputs that fail schema validation.
  - Use strict types for dates and currency values.
  - Normalize “USD $1bn” into one canonical format before persistence.
  - Block unsupported document types or scanned images unless OCR is in place.
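The “USD $1bn” normalization might look like the sketch below. The suffix table and the canonical `{ currency, amount }` shape are assumptions, not a bank standard:

```typescript
// Normalize loose deal-size strings like "USD $1bn" or "$250m" into a
// canonical { currency, amount } pair. Covers the common bn/mm/m/k
// shorthands; anything unrecognized returns null for human review.
function normalizeDealSize(raw: string): { currency: string; amount: number } | null {
  const match = raw
    .trim()
    .match(/^([A-Z]{3})?\s*\$?\s*([\d,.]+)\s*(bn|b|mm|m|k)?$/i);
  if (!match) return null;
  const currency = (match[1] ?? "USD").toUpperCase(); // default USD if no ISO code
  const base = parseFloat(match[2].replace(/,/g, ""));
  if (Number.isNaN(base)) return null;
  const suffix = (match[3] ?? "").toLowerCase();
  const multiplier =
    suffix === "bn" || suffix === "b" ? 1e9 :
    suffix === "mm" || suffix === "m" ? 1e6 :
    suffix === "k" ? 1e3 : 1;
  return { currency, amount: base * multiplier };
}
```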
Common Pitfalls
- **Using free-form summaries instead of schemas.** This creates inconsistent outputs that are hard to validate. Define exact fields up front with Zod or a similar validator.
- **Ignoring chunk boundaries.** Long offering memorandums and credit agreements will exceed context limits. Split them with `SentenceSplitter` or another node parser before extraction.
- **Skipping audit metadata.** If you can’t trace an extracted value back to the source file and model run, you will have problems during review. Persist source filename, page reference if available, model name, timestamp, and validation status.
A solid document extraction agent in investment banking is not just an LLM wrapped around PDFs. It is a controlled pipeline with schema enforcement, traceability, and deployment constraints that match how banks actually operate.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit