8 things we learned building AI agents
January 2026

We've been building agentic features for over a year now. Here's what we learned.
1. agents fail in weird ways
Traditional software fails predictably. An API returns a 500, a database query times out, a null pointer crashes your app. You can write tests for these.
Agents fail... differently. They confidently do the wrong thing. They get stuck in loops. They interpret instructions in ways you never imagined. Your test suite won't catch "the agent decided to refactor your entire codebase instead of fixing a typo."
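One mitigation that has nothing to do with the model: put hard guardrails around the loop itself. Below is a minimal sketch, in Python, of a step budget plus a repeated-action detector. The `run_agent_step` callable and `Action` type are hypothetical stand-ins for whatever your agent loop actually uses.

```python
# A minimal sketch of guardrails around an agent loop: a hard step budget and a
# repeated-action detector. `run_agent_step` and `Action` are hypothetical
# stand-ins for whatever your agent loop actually calls.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    tool: str
    args: str  # serialized arguments, so identical actions compare equal

def run_with_guardrails(run_agent_step, max_steps=25, max_repeats=3):
    history, seen = [], {}
    for _ in range(max_steps):
        action = run_agent_step(history)  # returns the next Action, or None when done
        if action is None:
            return history                # agent finished on its own
        key = (action.tool, action.args)
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > max_repeats:
            raise RuntimeError(f"Agent is looping on {action.tool}; aborting")
        history.append(action)
    raise RuntimeError("Agent exceeded its step budget without finishing")
```

This won't catch an agent that confidently does the wrong thing, but it does turn "stuck in a loop" into a loud, cheap failure instead of a silent expensive one.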
2. context is everything
The difference between a useful agent and a useless one is context. An agent with access to your analytics, your codebase, your user feedback, and your team's past decisions will outperform a generic agent every time.
This is why we built Twig the way we did. The agent needs to understand your users, not just your code.
3. humans need to stay in the loop (for now)
Full autonomy sounds great until the agent decides to delete your production database to "clean up unused resources."
The best agent experiences we've built have clear checkpoints where humans review and approve. Not because the AI can't be trusted, but because the human often has context the AI doesn't.
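Here's a sketch of what a checkpoint can look like in code. The tool names, `execute_tool`, and `request_approval` are hypothetical placeholders; the pattern is an allowlist of risky actions that pause for a human before they run.

```python
# Hedged sketch of a human checkpoint: destructive tool calls pause for
# approval before executing. Tool names, `execute_tool`, and
# `request_approval` are illustrative placeholders.
DESTRUCTIVE_TOOLS = {"delete_resource", "drop_table", "force_push"}

def maybe_execute(tool_name, args, execute_tool, request_approval):
    if tool_name in DESTRUCTIVE_TOOLS:
        approved = request_approval(  # e.g. an in-app review queue or a Slack ping
            f"Agent wants to run {tool_name} with {args}. Approve?"
        )
        if not approved:
            return {"status": "skipped", "reason": "human rejected the action"}
    return execute_tool(tool_name, args)
```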
4. streaming matters more than you think
When an agent takes 30 seconds to respond, users assume it's broken. When an agent streams its thinking process, users wait patiently for 2 minutes.
Show your work. Let users see what the agent is doing. It builds trust and helps users understand when to intervene.
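As a rough illustration (not our actual implementation), the shape is an event stream the UI can render as it arrives, rather than a single blocking call. `agent_steps` below is a hypothetical iterator of events from the agent loop.

```python
# Minimal sketch: surface the agent's intermediate events as they happen.
# `agent_steps` is a hypothetical iterator of (kind, text) events from the loop.
def stream_to_user(agent_steps, emit=print):
    for kind, text in agent_steps:
        if kind == "thinking":
            emit(f"… {text}")            # short summary of what the agent is considering
        elif kind == "tool":
            emit(f"→ running {text}")    # which tool call is in flight
        elif kind == "answer":
            emit(text)                   # the final answer

# In a real app, `emit` would push to a websocket or server-sent-events stream
# instead of printing.
```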
5. tool design is underrated
Most agent failures we've seen aren't model failures - they're tool failures. Give an agent a poorly designed tool and it will use it poorly.
Good tools have clear names, focused purposes, and obvious parameters. Bad tools try to do everything.
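To make that concrete, here's a hedged example of the difference, written as a generic function-calling-style schema. The exact format depends on your provider; the names and fields below are illustrative.

```python
# One focused tool versus a catch-all. The schema shape loosely follows common
# function-calling formats; adapt it to whatever your provider expects.
good_tool = {
    "name": "get_open_tickets",
    "description": "List open support tickets for a single customer.",
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string", "description": "Internal customer ID"},
            "limit": {"type": "integer", "description": "Max tickets to return", "default": 10},
        },
        "required": ["customer_id"],
    },
}

bad_tool = {
    "name": "do_stuff",
    "description": "Query or modify tickets, users, billing, or anything else.",
    "parameters": {"type": "object", "properties": {"query": {"type": "string"}}},
}
```

The good tool tells the model exactly when to reach for it and what it needs. The bad tool forces the model to guess, and the guesses are where the failures come from.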
6. cost scales faster than you expect
Running agents at scale is expensive. Not just the API calls - the infrastructure, the observability, the human review time.
Budget for 10x what you think you'll need, then budget more.
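A back-of-the-envelope model helps make that budget conversation real. Every number below is a placeholder; plug in your own traffic, token counts, and current provider pricing.

```python
# Back-of-the-envelope cost sketch. All numbers are placeholder assumptions.
runs_per_day = 1_000
tokens_per_run = 50_000            # agents re-send context on every step, so this adds up
price_per_million_tokens = 5.00    # hypothetical blended input/output price, USD

model_cost_per_day = runs_per_day * tokens_per_run / 1_000_000 * price_per_million_tokens
print(f"Model spend alone: ~${model_cost_per_day:,.0f}/day")
# And that's before infrastructure, observability, and human review time.
```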
7. evaluation is hard
How do you know if your agent is getting better? Traditional metrics don't work. "Task completion rate" doesn't capture whether the agent completed the task well.
We've found that human evaluation, while slow, is still the gold standard. Automated evals help for regression testing but can't replace human judgment.
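One simple shape for those automated regression evals: a fixed case set with cheap programmatic checks and a pass-rate gate. Everything below (the tasks, the checks, the `run_agent` entry point) is illustrative, not our actual suite.

```python
# Minimal sketch of a regression eval: fixed tasks, cheap programmatic checks,
# and a pass-rate threshold that blocks regressions. `run_agent` is a
# hypothetical entry point that takes a task string and returns text.
REGRESSION_CASES = [
    # Each check is a cheap proxy for "did the agent do the task at all",
    # not a judgment of how well it did it.
    {"task": "What does our refund policy say about annual plans?",
     "check": lambda out: "refund" in out.lower()},
    {"task": "List the three most-reported bugs this week",
     "check": lambda out: out.count("\n") >= 2},
]

def regression_pass_rate(run_agent):
    passed = sum(1 for case in REGRESSION_CASES if case["check"](run_agent(case["task"])))
    return passed / len(REGRESSION_CASES)

def test_agent_has_not_regressed(run_agent, threshold=0.9):
    rate = regression_pass_rate(run_agent)
    assert rate >= threshold, f"Pass rate {rate:.0%} fell below {threshold:.0%}"
```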
8. the ceiling keeps rising
What seemed impossible a year ago is routine now. What seems impossible today will be routine next year.
Build for the capabilities that are coming, not just the capabilities that exist. The agents of 2027 will make today's look primitive.
These learnings shaped how we built Twig: an agent that understands context, keeps humans in the loop, and is built for the capabilities of tomorrow.