
AI Consultant for a Multi-Agent Workflow Builder

Multi-Agent AI · Fintech · Workflow Automation · Pydantic AI · Python

I helped a fintech company turn its multi-agent AI system from an architectural design into a working product. The platform is a portfolio management system where users describe what they need in plain language, and AI agents build and execute the entire data workflow automatically. Think Zapier for financial data, but powered by LLMs.

Client: Fintech (PMS)

The Challenge

Portfolio managers and analysts working with digital assets need to monitor positions across multiple exchanges, track performance metrics like time-weighted returns, detect risk signals early, and get notified when something needs attention. Traditionally, each of these workflows requires a data engineer to build and maintain custom pipelines.

The client, a fintech company, had built a portfolio management system (PMS) and wanted to let users create these workflows through natural language. Not just ask questions about their data, but actually generate executable pipelines: scheduled alerts, interactive dashboards, automated reports, conditional notifications. Think Zapier or n8n, but for financial portfolio data, where the workflows are built by AI agents from a single conversational prompt.

The team had initially considered migrating to n8n as the workflow engine, but decided to build a custom platform instead because the domain requirements (financial entity hierarchies, typed data operations, real-time portfolio computations) were too specific for a generic workflow tool.

There's also a constraint that sets financial platforms apart from most AI applications: auditability. In finance, every action the platform takes must be reproducible and traceable for compliance and legal reasons. This is why the system generates deterministic workflows rather than answering questions on the fly. Once AI agents build a workflow, it runs on fixed rails through Dagster, with every step logged and every input and output recorded. This approach has a double advantage: it makes everything auditable, and it reduces hallucination risk by grounding each execution step in structured, validated data rather than free-form LLM output.
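To make the auditability idea concrete, here is a minimal sketch in plain Python. It is not the client's implementation and it does not use Dagster's API; `AuditLog`, `fingerprint`, and `total_exposure` are hypothetical names introduced for illustration. It shows a deterministic step whose inputs and outputs are recorded and hashed, so a replay on the same inputs can be checked against the original run.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable


def fingerprint(payload: Any) -> str:
    """Return a stable SHA-256 hash of a JSON-serializable payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class AuditLog:
    """Hypothetical in-memory audit trail: one record per executed step."""
    records: list = field(default_factory=list)

    def run_step(self, name: str, fn: Callable[[dict], dict], inputs: dict) -> dict:
        outputs = fn(inputs)
        # Record the full inputs and outputs plus their hashes for later replay checks.
        self.records.append({
            "step": name,
            "inputs": inputs,
            "outputs": outputs,
            "input_hash": fingerprint(inputs),
            "output_hash": fingerprint(outputs),
        })
        return outputs


def total_exposure(inputs: dict) -> dict:
    """Hypothetical deterministic step: sum the value of all positions."""
    return {"total": sum(p["value"] for p in inputs["positions"])}


log = AuditLog()
positions = {"positions": [{"asset": "BTC", "value": 1200},
                           {"asset": "ETH", "value": 800}]}
first = log.run_step("total_exposure", total_exposure, positions)
replay = log.run_step("total_exposure", total_exposure, positions)

# Same inputs produce the same output hash, which is what makes the run reproducible.
same_result = log.records[0]["output_hash"] == log.records[1]["output_hash"]
```

Because the step contains no LLM call at execution time, a replay with identical inputs yields an identical output hash. A free-form LLM answer would not give that guarantee.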

The platform had been designed and partially implemented when I joined, but it was not yet functional. The architecture was in place and the agent roles were defined, but the system couldn't reliably produce working workflows. Prompts broke on unexpected inputs. Agents picked the wrong tools or generated code that didn't match the data structures coming from upstream steps. My job was to make it actually work and get it to production.

Solution Approach

My Role

I came in as an AI consultant to take the system from designed-but-not-working to production. My work focused on the AI layer: making agents produce correct outputs, coordinate effectively, and handle real-world inputs reliably.

  • Prompt engineering across agent tiers: the system uses multiple specialized agents, each with a fundamentally different job (interpreting user intent, planning workflow structure, generating executable code). Each required a different prompting strategy, and getting the balance between specificity and flexibility right took systematic iteration across hundreds of test cases.
  • Agent coordination: in a multi-agent system, the quality of what one agent produces depends on the context it receives from the previous one. I refined how information flows between agents to minimize both token waste and information loss, which directly impacted the accuracy of the final output.
  • Output validation and correctness: some agents generate executable code, so I implemented validation steps that verify both the structure and the outputs of what agents produce. This includes syntax checks, type validation against expected data schemas, and execution-time verification to catch issues before they reach production data.
  • Model selection by agent category: not all agents need the same LLM. Some need to be fast and cheap because they run on every request. Others need to be highly accurate because they generate executable code that touches financial data. I ran extensive A/B tests across multiple providers and model sizes for each agent category. For the core agentic flow, where accuracy on code generation and tool orchestration matters most, Claude Opus by Anthropic consistently outperformed alternatives. The final model map balances cost, latency, and accuracy: lighter models handle routing and classification, while Opus drives the critical path where mistakes are expensive.
  • Error recovery and reliability: when a workflow step fails, the system feeds the error context back to the agent for a corrected attempt. I improved this feedback loop and hardened the overall state machine against edge cases that caused failures in production-like scenarios.
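The validation and error-recovery points above can be sketched together as a build-validate-retry loop. This is a simplified illustration under stated assumptions, not the production code: `EXPECTED_OUTPUT_SCHEMA`, `validate_generated_code`, `build_with_retry`, and `fake_agent` are hypothetical names, the agent is replaced by a stub, and generated code is assumed to define a `transform(rows)` function. A real system would also run generated code in a sandbox rather than with a bare `exec`.

```python
import ast
from typing import Callable, Optional

# Hypothetical schema the downstream step expects from each output row.
EXPECTED_OUTPUT_SCHEMA = {"asset": str, "weight": float}


def validate_generated_code(source: str, sample_rows: list) -> list:
    """Syntax-check the code, run it on sample data, and type-check the output."""
    ast.parse(source)  # raises SyntaxError on malformed code
    namespace: dict = {}
    exec(source, namespace)  # sketch only: production code needs a sandbox
    transform = namespace.get("transform")
    if not callable(transform):
        raise ValueError("generated code must define transform(rows)")
    result = transform(sample_rows)
    for row in result:
        for key, expected_type in EXPECTED_OUTPUT_SCHEMA.items():
            if not isinstance(row.get(key), expected_type):
                raise TypeError(f"field {key!r} should be {expected_type.__name__}")
    return result


def build_with_retry(generate: Callable[[Optional[str]], str],
                     sample_rows: list, max_attempts: int = 3):
    """Ask the agent for code, feeding the previous error back on each retry."""
    error = None
    for attempt in range(1, max_attempts + 1):
        source = generate(error)
        try:
            return validate_generated_code(source, sample_rows), attempt
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"no valid code after {max_attempts} attempts; last error: {error}")


def fake_agent(error: Optional[str]) -> str:
    """Stub standing in for an LLM call: wrong types first, corrected on retry."""
    if error is None:
        return ("def transform(rows):\n"
                "    return [{'asset': r['asset'], 'weight': r['value']} for r in rows]\n")
    return ("def transform(rows):\n"
            "    total = sum(r['value'] for r in rows)\n"
            "    return [{'asset': r['asset'], 'weight': r['value'] / total} for r in rows]\n")


sample = [{"asset": "BTC", "value": 1200}, {"asset": "ETH", "value": 800}]
rows, attempts = build_with_retry(fake_agent, sample)
```

In this run the first attempt fails type validation because `weight` comes back as an integer, the error text is passed to the stub, and the second attempt passes. Capping the number of attempts keeps a persistently failing agent from looping forever.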

Results & Impact

  • Helped bring the multi-agent workflow builder from prototype to production readiness
  • Optimized prompt strategies across three agent tiers (planners, code generators, orchestrator), improving output accuracy
  • Selected Claude Opus (Anthropic) as the primary model for the agentic flow after extensive A/B testing across providers, with lighter models for routing and classification
  • Improved the build-execute-retry cycle, reducing the number of failed workflow builds
  • Hardened tool definitions and database access patterns for production safety and performance
  • Users can create scheduled dashboards, conditional alerts, and multi-step financial reports through conversation

Technologies Used

  • Python multi-agent framework with structured LLM output
  • State machine for incremental, self-correcting workflow construction
  • REST API for conversational interaction and workflow management
  • MCP (Model Context Protocol) for tool interfaces between agents and services
  • Deterministic workflow engine for scheduling, execution, and audit logging
  • PostgreSQL for state persistence and portfolio data
  • React-based UI for the workflow editor and interactive dashboards