Building An Agentic Company Risk Assessment Workflow: Lessons Learned

12 min read Arkadij Kummer
#AI Agents #PocketFlow #LangGraph #Modal #Technical #Open Source #LLM

When we set out to build an automated tool that researches companies and generates reports, I was excited about the promise of "agentic AI"—systems that could autonomously plan, execute, and iterate. What I discovered was that the journey from a simple linear workflow to a truly agentic system is filled with subtle lessons about context management, tool usage, and the surprising value of constraints.

This post documents our journey building a company risk assessment agent for Bollwerk AI that queries internal databases, searches the web, analyzes social sentiment, and generates comprehensive reports. We built it using PocketFlow (a 100-line minimalist LLM framework), deployed it on Modal, and iterated heavily with AI-assisted coding.

As part of our commitment to open sourcing as much of our work as possible, we've released the underlying framework as a public template: modal-agents provides a minimal starting point for building agentic workflows using PocketFlow and Modal. It includes GPU-accelerated Ollama deployment, private networking via i6pn, and optional Langfuse tracing integration. The example agent is intentionally simple - a task breakdown tool - but the patterns apply to any multi-step LLM workflow.

Overview

  1. Phase 1: Starting Simple
  2. Phase 2: The Overly-Agentic Experiment
  3. Phase 3: The Sweet Spot
  4. PocketFlow
  5. Why We're Planning to Migrate to LangGraph
  6. Key Lessons
  7. Deploying on Modal
  8. Practical Tips
  9. The Evolution of "Agentic"

Phase 1: Starting Simple - The Linear Workflow

Before diving into agentic architectures, we needed to understand PocketFlow's fundamentals. We built a simple linear workflow with one branching point: gather data about a public company, analyze it, and decide whether to generate an all-clear, elevated risk, or high-risk report.

%%{init: {'theme': 'default', 'themeVariables': {'fontSize': '12px'}}}%%
flowchart TD
    Start([Input: Ticker]) --> Gather[Gather Data]
    Gather --> Analyze[Analyze]
    Analyze --> Decision{Risk Level?}
    Decision -->|Low| ReportLow[Generate All-Clear Report]
    Decision -->|Medium| ReportMed[Generate Elevated Risk Report]
    Decision -->|High| ReportHigh[Generate Critical Risk Report]
    ReportLow --> End([Done])
    ReportMed --> End
    ReportHigh --> End

This branching decision was somewhat artificial. Modern models can easily handle all three outcomes with a single prompt. The goal was simply to introduce a split decision and learn how PocketFlow implements conditional routing.

Even with a branching point and some LLM-driven decision-making, this was still too linear to qualify as an agentic pipeline. True agency requires feedback loops and interconnected decision-making.
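
In plain Python, the routing pattern we learned here looks roughly like this. This is a stand-in for PocketFlow's action-string routing, not the framework itself; the node bodies are stubs, and names like `gather_data` are illustrative:

```python
# Minimal sketch of a linear flow with one branching point: each node
# returns an action string, and a routing table picks the next node.

def gather_data(shared):
    # In the real workflow this hits the database, web, and social tools.
    shared["data"] = f"raw intel for {shared['ticker']}"
    return "default"

def analyze(shared):
    # An LLM call in practice; here a stub that picks a risk level.
    shared["risk"] = "medium"
    return shared["risk"]  # the action string drives routing

def report_low(shared):
    shared["report"] = "All-clear report"

def report_medium(shared):
    shared["report"] = "Elevated risk report"

def report_high(shared):
    shared["report"] = "Critical risk report"

# Routing table: (node, action) -> next node
edges = {
    (gather_data, "default"): analyze,
    (analyze, "low"): report_low,
    (analyze, "medium"): report_medium,
    (analyze, "high"): report_high,
}

def run_flow(start, shared):
    node = start
    while node is not None:
        action = node(shared)
        node = edges.get((node, action))

shared = {"ticker": "ACME"}
run_flow(gather_data, shared)
print(shared["report"])  # Elevated risk report
```

PocketFlow expresses the same idea declaratively, but the mechanics are this simple: an action string selects the next edge.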

Phase 2: The Overly-Agentic Experiment

When we ported the Marimo notebook prototype into a real codebase, I got excited and wanted to make it genuinely agentic. The refactoring involved:

  1. Tool abstraction: We wrapped our intelligence-gathering methods into specialized, self-describing tools:
    • Database: Query the Bollwerk Plainview curated database containing financial filings, red flags, and other structured company data
    • Web Research: LLM-powered deep web search
    • Social Sentiment: Twitter/X sentiment analysis via Grok AI
    • Company Analysis: LLM analysis of the structured database results to surface patterns and anomalies
  2. Open-ended planning: The agent (LLM) could decide when and how often to call any tool, with upper limits in place.
  3. Multiple feedback loops: Quality control loops followed by report generation.
%%{init: {'theme': 'default', 'themeVariables': {'fontSize': '12px'}}}%%
flowchart TD
    Start([Input: Ticker]) --> Plan{Pick Tools}
    Plan --> A[Tool A]
    Plan --> B[Tool B]
    Plan --> C[Tool C]
    A --> Decide{Need more?}
    B --> Decide
    C --> Decide
    Decide -->|Yes| Plan
    Decide -->|No| Validate{Validate}
    Validate -->|Issues| Plan
    Validate -->|OK| Out([Output])

    style Plan fill:#c9a0a0,color:#000
    style Decide fill:#c9a0a0,color:#000
    style Validate fill:#c9a0a0,color:#000

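The "upper limits" mentioned in step 2 can be enforced with a small budget-tracked tool registry. Here is a sketch with illustrative tool names and limits, not our exact implementation:

```python
# A registry that caps how often the agent may call each tool.

class BudgetExceeded(Exception):
    pass

class ToolRegistry:
    def __init__(self, budgets):
        self.budgets = dict(budgets)              # tool name -> max calls
        self.calls = {name: 0 for name in budgets}
        self.tools = {}

    def register(self, name, fn):
        self.tools[name] = fn

    def call(self, name, **kwargs):
        if self.calls[name] >= self.budgets[name]:
            raise BudgetExceeded(f"{name} exceeded its call budget")
        self.calls[name] += 1
        return self.tools[name](**kwargs)

registry = ToolRegistry({"database": 2, "web_research": 3, "social_sentiment": 1})
registry.register("social_sentiment", lambda ticker: f"sentiment for {ticker}")

registry.call("social_sentiment", ticker="ACME")      # first call: allowed
try:
    registry.call("social_sentiment", ticker="ACME")  # second call: blocked
except BudgetExceeded:
    print("budget enforced")
```
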
The problem: This very open-ended approach didn't produce the best results. The agent frequently:

  • Over-queried resources: Calling the database multiple times for the same information, or re-running social sentiment analysis for marginal gains
  • Burned through tokens: Stuffing prompts with unnecessary context
  • Produced lower quality output: Too much noise, not enough signal

The agent had a bias towards using all available tools even when it wasn't necessary. Giving the agent more tool options felt like making it more capable, but in practice it just made it more expensive and less focused.

Phase 3: The Sweet Spot - Constrained Agency with One Feedback Loop

What ultimately worked much better was returning to a more linear workflow, but with one crucial feedback loop. The key insight: constrain the agent's choices while preserving genuine agency where it matters most.

%%{init: {'theme': 'default', 'themeVariables': {'fontSize': '12px'}}}%%
flowchart TD
    Start([Input: Ticker]) --> Gather[Parallel Gathering<br/>DB + Web + Social]
    Gather --> Extract[Extract Entities<br/>Score for signals]
    Extract --> Decide{Need more?}
    Decide -->|Yes| Search[One Focused Search]
    Search --> Decide
    Decide -->|No| Validate[Validate]
    Validate --> Out([Output])

    style Gather fill:#a0c9c9,color:#000
    style Extract fill:#c9a0c9,color:#000
    style Decide fill:#a0c9a0,color:#000

The updated architecture works like this:

  1. Parallel Gathering: We run database lookup, comprehensive web research, and social sentiment analysis in parallel using an async batch node. This establishes a strong baseline while minimizing latency.
  2. Entity Extraction: An LLM step parses the research for key people and organizations, scoring each for concerning signals (past bankruptcies, fraud allegations, sudden departures, etc.).
  3. Planning with Constraints: Given all initial data plus any flagged entities, the agent plans at most one additional focused search to address the most important gap, verify the most critical claim, or investigate a flagged entity.
  4. Execute → Evaluate Loop: After incorporating new information, the agent evaluates whether more research is needed. Entity-focused queries can reveal concerning patterns that company-level searches miss.
  5. Constrained Iteration: The model can ask one more question and repeat—up to a configurable maximum—or stop early if it decides it has everything it needs.
  6. Self-Check: Before generating the final report, a dedicated validation step verifies claims, flags contradictions, and identifies unsupported assertions.
  7. Report Generation: The final step weaves all additional findings into the comprehensive base report, preserving citations and clearly delineating facts from interpretation.

This approach results in a much stronger final report. By avoiding unnecessary prompt stuffing and wasted computation, costs are significantly reduced. The output is better focused and higher quality, and the process reveals entity-level insights that company-level research alone would miss.
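
Steps 3 through 5 can be sketched as a simple loop. The planner and search functions here stand in for LLM and tool calls, and the names are illustrative:

```python
# Constrained iteration: at most one focused search per pass, up to a
# configurable maximum, with an early exit when the agent has enough.

def run_research_loop(shared, plan_next_search, run_search, max_iterations=3):
    for i in range(max_iterations):
        query = plan_next_search(shared)   # LLM picks one focused search, or None
        if query is None:
            break                          # agent decided it has everything it needs
        shared["additional_findings"].append(run_search(query))
        shared["iteration"] = i + 1
    return shared

# Toy stand-in: stop after two focused searches.
def fake_planner(shared):
    if len(shared["additional_findings"]) >= 2:
        return None
    return "who is the CFO?"

shared = {"additional_findings": [], "iteration": 0}
run_research_loop(shared, fake_planner, lambda q: f"answer to: {q}")
print(shared["iteration"])  # 2
```

The structure stays deterministic; the agency lives entirely inside `plan_next_search`, which decides what to ask and when to stop.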

PocketFlow: Zero Vendor Lock-In, Maximum Flexibility

We chose PocketFlow for its simplicity and complete freedom from vendor lock-in. Since the whole framework is only 100 lines of code, it also worked great with coding agents as they did not need to query extensive documentation to understand how to use it.

What We Gained

  1. Total Control: Every utility function is ours. We know exactly what's happening at every layer.
  2. Easy Model Swapping: We built a simple config abstraction that lets us configure different providers (GPT, Gemini, Ollama) for each workflow step—making cost/performance experimentation trivial.
  3. Deep Understanding: Building everything from scratch forced us to truly understand the fundamentals of agentic systems.

What We Had to Build Ourselves

The flip side of zero dependencies is that everything must be implemented from scratch:

  • LLM wrapper functions for each provider
  • Tool registry with budget tracking
  • Tracing/observability integration
  • Caching of initial research results (so we can skip expensive gathering when debugging later steps)
  • Error handling and retry logic
  • Prompt management

Why We're Planning to Migrate to LangGraph

Despite our appreciation for PocketFlow's simplicity, we're planning to migrate to LangGraph for several practical reasons:

  1. Automatic Caching: Built-in caching of intermediate results means we don't waste API calls during development and debugging.
  2. Smarter Cache Invalidation: Cached results are invalidated automatically when a node's inputs change.
  3. Native Langfuse Integration: We manually integrated Langfuse tracing, but LangGraph has it built in.
  4. Future-Proofing: Larger ecosystem, more maintained, and better suited for production at scale.

That said, using PocketFlow first was invaluable. It helped us grok the fundamental patterns without the abstraction layers hiding important details.

Key Lessons for Building Agents

1. Context Management is Everything

Make sure the agent gets only the context it needs, but gets all of that context - not truncated versions.

We discovered bugs where inputs were silently truncated. The model made confident decisions based on incomplete information. This is why tracing is critical (more on that below).

How did these truncations sneak in? Here's the meta twist: we used AI-assisted coding (mostly Claude Opus 4.5) to build this agent. The AI coding assistant would sometimes "helpfully" add truncations like [:500] or [:1000] to strings it assumed were too long, without asking. It was trying to be efficient, but it introduced subtle bugs that were invisible until we traced the full prompt inputs and noticed critical context was missing.

When you're using AI to build AI systems, you inherit the AI's biases about what's "reasonable." Always review generated code for arbitrary limits.

Summarization is fine if intentional. Let the LLM summarize when context is genuinely too long, but make that explicit rather than arbitrary string slicing.
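
One way to make that explicit. This is a sketch: `summarize_with_llm` is a placeholder for a real model call, and the threshold is arbitrary:

```python
# Summarize intentionally when context is too long, instead of
# silently slicing with text[:500].

MAX_CONTEXT_CHARS = 20_000  # illustrative threshold

def fit_context(text, summarize_with_llm, limit=MAX_CONTEXT_CHARS):
    """Pass text through untouched when it fits; otherwise summarize it
    explicitly rather than truncating it arbitrarily."""
    if len(text) <= limit:
        return text
    return summarize_with_llm(
        "Summarize the following, preserving all names, dates, "
        f"and figures:\n\n{text}"
    )
```

The point is that the decision to compress context becomes a visible, auditable step instead of a hidden string slice.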

2. Modularity is Key

You'll want to experiment with different models from different providers to find the sweet spot between cost and performance. Make model swapping easy.

We built a unified LLM interface that hides provider details - changing from Gemini to GPT to Ollama is just a config change, not a code change.
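
A minimal sketch of what such a config-driven dispatch can look like. The step names, model choices, and the `call_llm` signature here are illustrative, not our exact interface:

```python
# Per-step model configuration: swapping a provider is a config edit,
# not a code change.

STEP_MODELS = {
    "gather":   {"provider": "gemini", "model": "gemini-2.0-flash"},
    "analyze":  {"provider": "openai", "model": "gpt-4o"},
    "validate": {"provider": "ollama", "model": "llama3.1"},
}

def call_llm(step, prompt, providers):
    """Dispatch to the provider-specific wrapper configured for this step."""
    cfg = STEP_MODELS[step]
    return providers[cfg["provider"]](model=cfg["model"], prompt=prompt)
```

In practice `providers` maps each provider name to its wrapper function, so cost/performance experiments only touch `STEP_MODELS`.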

3. Don't Add Tools Just Because You Can

It feels cool to give your agent more tool options, thinking you're making it more capable. In reality, the model has a bias towards using every available tool, even when a call isn't necessary or doesn't deliver enough benefit to justify its cost.

We learned this the hard way with multiple social sentiment calls. The agent would re-check Twitter sentiment after each piece of new information, even when the initial analysis was comprehensive.

4. Tracing is Non-Negotiable

You need to be able to look at full inputs and outputs to models. Otherwise, you will never catch small hidden bugs.

We integrated Langfuse for observability at two levels:

1. Flow-level tracing — wrap the entire workflow in an observation context:

Python
import asyncio

from langfuse import get_client as get_langfuse_client

langfuse = get_langfuse_client()

with langfuse.start_as_current_observation(
    as_type="span",
    name=f"RiskTriage-{ticker.upper()}",
    metadata={"framework": "PocketFlow", "workflow": "risk_triage"},
) as trace:
    asyncio.run(flow.run_async(shared))

langfuse.flush()

2. LLM-level tracing — decorate each LLM call to capture inputs, outputs, and costs:

Python
from langfuse import observe, get_client as get_langfuse_client

@observe(as_type="generation")
def call_gemini(prompt: str, model: str | None = None, ...) -> str:
    """The @observe decorator auto-captures prompt, response, and timing."""
    response = client.models.generate_content(...)

    # Manually add LLM-specific metadata
    langfuse = get_langfuse_client()
    langfuse.update_current_generation(
        model=model,
        usage_details={"input": input_tokens, "output": output_tokens},
    )
    return response.text

Because every LLM function is decorated with @observe, all calls are automatically nested under the parent flow trace. This gives us a complete tree of every model invocation.

Tracing revealed:

  • Prompts that were accidentally truncating critical context
  • Model confusion from poorly structured inputs
  • Unnecessary repetition of similar queries
  • Points where the agent was "going in circles"

Without tracing, these bugs would have been invisible. The output looked plausible, but the reasoning was flawed.

Side note: Some Langfuse features like cost estimation didn't work properly even after manually supplying input/output token counts. This might be specific to our manual integration approach and could work better with LangGraph's native integration, or it could be a general Langfuse issue. Something to watch.

Deploying on Modal: Cost-Optimized GPU Access

We deployed the workflow on Modal, which gave us an elegant architecture for mixing cheap CPU orchestration with on-demand GPU inference.

The main workflow runs on CPU-only instances. When it needs to call Ollama for local model inference, it reaches out to A10G or H100 GPU instances that spin up on demand and scale down after 10 seconds of inactivity.
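
Under these assumptions, the GPU service definition is a short decorator away. This is a deployment sketch, not our full service: the function body is elided, and `scaledown_window` is the idle-timeout parameter name we believe applies here:

```python
import modal

app = modal.App("ollama-service")

@app.function(
    gpu="A10G",            # swap for "H100" when a larger model is needed
    scaledown_window=10,   # spin down after ~10s of inactivity (assumed parameter name)
    i6pn=True,             # reachable only via Modal's private networking
)
def ollama_generate(prompt: str) -> str:
    # Placeholder: forward the prompt to the Ollama server running in this container.
    ...
```
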

i6pn Private Networking

Modal's i6pn feature enables secure internal networking between containers—no public endpoints, no API keys to manage:

Python
@app.function(
    i6pn=True,  # Enable private networking
    region="us-east-1",  # Must match GPU service region
)
def run_triage(ticker: str, ...):
    # Calls to Ollama go through i6pn - no internet exposure
    ...

The main workflow calls the Ollama service via internal DNS, and Modal handles authentication automatically. This is workspace-scoped, so only our containers can talk to each other.

Shout-out to Irfan Sharif and Eric Ma for this repo, which we used as a template for our Ollama service deployment.

Persistent Storage with Modal Volumes

Modal Volumes give us persistent storage that survives container restarts:

  • Research cache: Skip expensive initial gathering when debugging later workflow steps
  • Final reports: Store generated reports for retrieval via API
Python
volume = modal.Volume.from_name("research-cache", create_if_missing=True)

@app.function(volumes={"/data": volume})
def run_triage(...):
    # Cache and reports stored to /data, persisted across runs
    ...
    volume.commit()  # Persist changes

The combination of cheap CPU orchestration, on-demand GPU inference, and private networking makes this architecture surprisingly cost-effective for production use.

Practical Development Tips

Implement Mock Methods Early

This saves a lot of money during development when your agentic flow can break at various stages.

Python
import os

# In your tool implementations
class WebResearchTool(Tool):
    def execute(self, **kwargs) -> ToolResult:
        if os.getenv("MOCK_MODE"):
            return ToolResult(
                success=True,
                data={"findings": "Mock findings for testing..."},
                sources=["https://example.com/mock"]
            )
        # Real implementation
        return self._real_web_search(**kwargs)

Without mocks, you can waste dozens of expensive API calls only for the flow to break towards the end. Mock mode lets you validate the entire flow structure for free.

Cache Initial Research Immediately

We save the results of the initial parallel gathering immediately after it completes:

Python
async def post_async(self, shared, prep_res, exec_res_list):
    # ... store results ...

    # Save to cache IMMEDIATELY after initial research
    # This ensures cache is persisted even if later workflow steps fail
    cacheable_data = extract_cacheable_data(shared)
    save_research_to_cache(shared["ticker"], cacheable_data)

When debugging later steps, we can skip expensive initial gathering and jump straight to the part we're fixing.
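
The matching load path is simple. A sketch, assuming a JSON-file-per-ticker layout on the mounted volume (the file layout is ours for illustration; `save_research_to_cache` is the helper used above):

```python
import json
from pathlib import Path

CACHE_DIR = Path("/data/research-cache")  # Modal Volume mount point

def save_research_to_cache(ticker, data, cache_dir=None):
    cache_dir = Path(cache_dir or CACHE_DIR)
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / f"{ticker.upper()}.json").write_text(json.dumps(data))

def load_research_from_cache(ticker, cache_dir=None):
    # Returns the cached research, or None when a fresh gather is needed.
    path = Path(cache_dir or CACHE_DIR) / f"{ticker.upper()}.json"
    if path.exists():
        return json.loads(path.read_text())
    return None
```

At workflow start, a `load_research_from_cache` hit lets us skip the entire parallel gathering step.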

Design Your Shared State Carefully

PocketFlow uses a shared store for inter-node communication. Design it upfront with clear categories:

Python
shared = {
    # Input
    "query": "user input here",

    # Gathered data
    "initial_results": None,
    "additional_findings": [],

    # Iteration control
    "iteration": 0,
    "max_iterations": 3,

    # Output
    "result": None,
}

A well-designed shared store makes debugging much easier - you can inspect the state at any point in the workflow.

The Evolution of "Agentic"

Looking back, our journey reveals an important insight about what "agentic" really means in practice. Our first version was a pure linear workflow with branching - cheap and predictable, but not really agentic. The second version swung to the opposite extreme: open-ended tool calling with maximum autonomy. It was the most "agentic" by any theoretical measure, but it produced the worst results at the highest cost.

The third version found the sweet spot: constrained iteration where the agent decides what to research next, evaluates when it has enough information, and validates its own conclusions - but within a largely deterministic structure. The addition of entity extraction with signal-based scoring exemplifies this approach: rather than blindly researching every person mentioned, we use LLM judgment to identify which entities have concerning signals that warrant deeper investigation.

True agency isn't about giving the AI maximum freedom. It's about giving it freedom at the right decision points while maintaining a coherent, efficient structure.

Conclusion

Building our first production agentic system taught us that:

  1. Start simple: A linear workflow with clear data flow is the foundation
  2. Add agency incrementally: One feedback loop can be more powerful than five
  3. Constrain tool usage: More tools ≠ more capability
  4. Invest in observability: Tracing is how you catch the subtle bugs
  5. Make experimentation cheap: Mocks and caching are essential for rapid iteration
  6. Stay modular: You will swap models, providers, and approaches - design for change

PocketFlow gave us the foundation to understand these principles deeply. Now we're ready to build on that understanding with more sophisticated tools like LangGraph.

The best agentic system isn't the one with the most agency - it's the one that applies agency precisely where it creates value.