I Tried Entire's Checkpoints CLI for a Week — Here Is What the AI Reasoning Traces Actually Look Like

I spent the past week using Entire’s Checkpoints CLI and wanted to share what the experience is like from a working engineer’s perspective, not the 30,000-foot CTO/VP view.

Setup and first impressions

Checkpoints currently supports Claude Code and Gemini CLI. I tested it with Claude Code since that is my primary agent. Installation is straightforward – it is a CLI tool that you run alongside your existing agent workflow. You do not need to change how you use Claude Code at all. After a session, Checkpoints captures the reasoning trace and stores it on a hidden branch.

The “zero friction” claim is mostly accurate. Once installed, you forget it is there. It does not interfere with your normal git workflow, and your branch history stays clean. The reasoning traces live on entire/checkpoints/v1 and you only see them when you explicitly look.

What it actually captures

For each agent session, Checkpoints records:

  • The prompts you gave the agent
  • The agent’s reasoning chain (what it considered, what it rejected)
  • Tool calls (file reads, writes, shell commands)
  • The resulting code changes with before/after diffs
  • Metadata about which model version was used

The detail level is impressive. I ran a session where I asked Claude Code to refactor a React component from class-based to functional with hooks, and the checkpoint included the agent’s reasoning about which lifecycle methods to map to which hooks, edge cases it considered, and why it chose useEffect cleanup patterns over alternatives.

Where it actually helped me

Here is the surprising part: I found the reasoning traces most valuable not for auditing but for learning. I have been using Claude Code for about a year, and I thought I understood how it approaches problems. But looking at the reasoning traces, I noticed patterns I had missed:

  1. The agent was considering performance implications I never explicitly asked about
  2. It was making accessibility decisions in my React code that I would have missed
  3. Some of its “choices” were actually model biases – it consistently preferred certain patterns over equally valid alternatives

This visibility changed how I prompt. I started giving more specific instructions about trade-offs I wanted the agent to optimize for, and the output quality improved noticeably.

Where it falls short

A few real problems I hit:

  • Storage bloat: After a week of normal usage, the checkpoint data was about 3x the size of my actual codebase. For a solo developer this is fine, but at team scale with 50+ engineers it is going to be a real problem.

  • No search yet: You can browse checkpoints chronologically, but there is no semantic search. I wanted to ask “show me every time the agent made a security-related decision” and that is not possible today. Dohmke has said this is coming with the semantic reasoning layer, but right now you are scrolling through raw traces.

  • Context window limits: The reasoning traces are only as good as what the agent actually logged. Some sessions had gaps where the agent’s reasoning was truncated due to context window limits. You get a partial picture, which might be worse than no picture if you are relying on it for compliance.

  • Multi-agent workflows: I sometimes use Claude Code and Cursor in the same project. Checkpoints only captures the Claude Code sessions. If I switch to Cursor mid-task, there is a gap in the trace that makes the audit trail incomplete.

The developer experience verdict

Honestly? As an individual developer, Checkpoints is a nice-to-have, not a must-have. The learning benefits are real but subtle. Where I see the real value is at the team level – if I can look at a teammate’s checkpoint to understand why their AI agent made a particular architectural decision in a PR I am reviewing, that significantly improves code review quality.

But it needs search, it needs multi-agent support, and the storage issue needs a solution before it is viable for serious team adoption. I give it a 7/10 for potential and a 5/10 for current utility.

Has anyone else hands-on tested this? Curious if your experience matches mine.

Thanks for the actual hands-on review, Alex. Everyone else has been theorizing and you actually used the thing. Much appreciated.

The storage bloat issue you flagged is exactly what I was worried about from an infrastructure perspective. Let me do some rough math:

If a typical Claude Code session generates ~50KB of reasoning traces, and an engineer runs 10 sessions per day, that is 500KB per engineer per day. For a team of 100 engineers, that is 50MB per day, or about 1.5GB per month, just in reasoning traces. That is manageable for the checkpoint data itself, but the real cost is in the git operations. Every push, pull, and clone now includes a hidden branch with potentially thousands of checkpoint commits. If you are running CI/CD that clones your repo, those clone times will increase unless you are doing shallow or single-branch clones that explicitly exclude the checkpoint ref (sparse checkout will not help here, since it limits working-tree paths, not which refs get fetched).
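The back-of-envelope numbers above are easy to reproduce in a few lines. Every input here is an assumed figure from the rough math (50KB per session, 10 sessions per day), not measured Checkpoints data:

```python
# Back-of-envelope estimate of checkpoint trace growth at team scale.
# All inputs are assumptions, not measured Checkpoints figures.

def trace_volume_kb(kb_per_session: float = 50,
                    sessions_per_day: int = 10,
                    engineers: int = 100,
                    days: int = 1) -> float:
    """Estimated reasoning-trace volume in KB over `days` days."""
    return kb_per_session * sessions_per_day * engineers * days

per_day_mb = trace_volume_kb() / 1000          # KB -> MB
per_month_gb = trace_volume_kb(days=30) / 1e6  # KB -> GB
print(f"{per_day_mb:.0f} MB/day, {per_month_gb:.1f} GB/month")  # 50 MB/day, 1.5 GB/month
```

The per-session figure is the load-bearing assumption; a verbose multi-hour session could plausibly be an order of magnitude larger than 50KB.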

This is the kind of performance degradation that does not show up in a solo developer evaluation but becomes a real problem at scale. The architecture decision to use a hidden git branch rather than an external data store is a trade-off that favors simplicity over scalability.

Your point about multi-agent workflows is critical for infrastructure teams. We routinely switch between Claude Code for application logic, custom scripts for Terraform generation, and Copilot for quick edits. Any observability tool that only captures one agent’s interactions gives you a partial picture that could be misleading – you might see a checkpoint showing an agent “decided” to use a certain architecture when in reality the decision was made in a different tool and the agent was just implementing it.

The “no search” limitation is a dealbreaker for me. In a production incident, I do not have time to scroll through chronological traces. I need to query “what was the reasoning behind the Kubernetes resource limits set in commit abc123?” instantly. If the semantic search layer is still on the roadmap, this tool is not ready for production use cases.
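Until that semantic layer exists, about the best available stopgap is a crude keyword scan over exported trace text. A minimal sketch, assuming traces can be dumped to plain strings; the `traces` dict and the term list are illustrative, not a real Checkpoints export format:

```python
# Crude keyword triage over exported reasoning traces: flag any session
# whose trace text mentions a security-related term. This is a stopgap,
# not semantic search; it misses paraphrases and flags false positives.

SECURITY_TERMS = ("auth", "jwt", "secret", "token", "cve", "sanitiz")

def flag_security_sessions(traces: dict[str, str]) -> list[str]:
    """Return ids of sessions whose trace mentions any security term."""
    return [sid for sid, text in traces.items()
            if any(term in text.lower() for term in SECURITY_TERMS)]

# Illustrative traces, not real Checkpoints output:
traces = {
    "s1": "Agent chose to rotate the JWT signing key before refactoring.",
    "s2": "Mapped componentDidMount onto useEffect with an empty dep array.",
    "s3": "Added input sanitization to avoid reflected XSS.",
}
print(flag_security_sessions(traces))  # ['s1', 's3']
```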

That said, your observation about using it as a learning tool is genuinely interesting. I might set it up for onboarding purposes – new engineers reviewing checkpoints from senior engineers’ AI sessions could be an effective way to learn prompting patterns.

Alex, the learning angle you mentioned is the most underappreciated part of this entire discussion and I want to expand on it from a design perspective.

I have been thinking a lot about how AI tools change the way we learn craft. When I started as a designer, I learned by looking at other designers’ work and reverse-engineering their decisions. Why did they choose that type scale? Why that spacing system? The decisions were visible in the output because design is inherently visual.

Code is different. You can read someone’s code, but you often cannot see the reasoning. Why did they choose a context provider over prop drilling? Why a custom hook instead of a third-party library? With human-written code, those decisions are sometimes captured in code comments or ADRs (architecture decision records), but mostly they live in someone’s head and disappear when they leave the company.

What you are describing with Checkpoints changes that dynamic fundamentally. If I am a junior engineer reviewing a senior engineer’s checkpoint, I am not just seeing what code was written – I am seeing how a skilled engineer collaborates with an AI agent. I am seeing their prompting strategy, their review process, and their decision-making criteria. That is a form of tacit knowledge transfer that we have never had before.

From a UX perspective though, Checkpoints needs much better visualization. Raw reasoning traces are like reading server logs – useful for debugging but terrible for learning. What I would want is something like a “decision graph” that shows the branching points where the agent considered alternatives and the engineer made choices. Think of it like a visual diff, but for decisions rather than code.
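That decision-graph idea can be made concrete with a tiny data model. Everything below (class names, fields, sample decisions) is hypothetical, just to show the shape such a structure might take, not any real Checkpoints schema:

```python
# Hypothetical data model for a "decision graph": each node is a point
# where the agent weighed alternatives, recording what was chosen, what
# was rejected, and why. Field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Decision:
    question: str              # e.g. "state management approach?"
    chosen: str
    rejected: list[str] = field(default_factory=list)
    rationale: str = ""

@dataclass
class DecisionGraph:
    decisions: list[Decision] = field(default_factory=list)

    def branch_points(self) -> list[Decision]:
        """Nodes with real alternatives: the parts worth visualizing."""
        return [d for d in self.decisions if d.rejected]

graph = DecisionGraph([
    Decision("state management?", "context provider", ["prop drilling"],
             "three levels of pass-through components"),
    Decision("effect cleanup?", "no cleanup needed", [],
             "no subscription to tear down"),
])
print(len(graph.branch_points()))  # 1
```

A visualization layer would essentially render `branch_points()` as the nodes of the graph and collapse everything else, which is what makes traces readable for learners rather than just for debuggers.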

If Entire ships a good visualization layer, I could see this becoming standard in engineering onboarding programs. The “7/10 for potential” rating feels right – the raw material is incredibly valuable, but the UX needs to catch up to make it accessible to people who are not power users of git and CLI tools.

I want to pick up on something Alex mentioned that I think deserves its own discussion: the “model biases” visible in reasoning traces.

From a data science perspective, this is actually a goldmine. If you can aggregate reasoning traces across multiple engineers and multiple projects, you can start to identify systematic patterns in how AI agents make decisions. Some of these patterns are useful (consistent security practices, accessibility defaults) and some are problematic (preferring certain libraries, architectures, or coding styles regardless of context).

At my company we have started tracking what we call “agent drift” – the tendency for AI coding agents to slowly shift the architectural patterns in a codebase toward whatever the model was most heavily trained on. For example, we noticed that Claude Code consistently prefers functional patterns over object-oriented ones in TypeScript, even in codebases that have an established OOP architecture. Over time, this creates an inconsistency in the codebase that is hard to attribute to any single commit.

If Checkpoints data were aggregated with proper statistical rigor, you could build dashboards that show:

  • Which model biases are affecting your codebase most
  • How agent behavior changes between model versions
  • Whether different engineers’ prompting styles produce measurably different code quality outcomes
  • Where the agent consistently makes suboptimal decisions that get caught in review

This is exactly the kind of analysis that turns raw observability data into actionable engineering intelligence. But it requires the semantic search and analytics layer that Alex notes is still missing.
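As a sketch of what that aggregation might look like, here is a toy drift report over invented records. Parsing real traces into (category, choice) pairs is the hard part the missing analytics layer would have to solve; the records below are made up for illustration:

```python
# Toy aggregation over checkpoint records to surface "agent drift":
# for each kind of decision, how often the agent picks each pattern.
# The record format is invented; real traces would need parsing first.
from collections import Counter
from itertools import groupby

records = [  # (decision category, pattern the agent chose)
    ("ts_style", "functional"), ("ts_style", "functional"),
    ("ts_style", "oop"), ("ts_style", "functional"),
    ("http_client", "fetch"), ("http_client", "axios"),
]

def drift_report(records):
    """Per category, the share of each chosen pattern across sessions."""
    report = {}
    for category, group in groupby(sorted(records), key=lambda r: r[0]):
        counts = Counter(choice for _, choice in group)
        total = sum(counts.values())
        report[category] = {c: n / total for c, n in counts.items()}
    return report

print(drift_report(records)["ts_style"]["functional"])  # 0.75
```

A dashboard would watch these shares over time and across model versions; a category sliding toward one pattern in a codebase with an established opposite convention is exactly the drift described above.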

My concern is the same one I have with most developer analytics: the data is only valuable if you ask the right questions. Most teams will collect reasoning traces and never analyze them, the same way most teams collect code coverage metrics and never actually use them to improve quality. The tooling needs to surface insights proactively, not just store data for manual querying.

The 5/10 current utility score feels generous to me for anything beyond personal learning. The potential is significant, but “potential” does not ship features.

Everyone is discussing the benefits, but let me bring up a security concern with the learning/onboarding use case Maya and Alex described.

If you are encouraging junior engineers to review senior engineers’ checkpoints as a learning exercise, you are giving juniors access to the full reasoning context of senior engineer sessions. Those sessions might include:

  • Prompts that reference internal security architecture (“refactor the auth middleware that handles our JWT rotation”)
  • Agent interactions that expose how your secrets management works
  • Reasoning traces that describe vulnerability mitigations in detail
  • Tool call logs that show what internal services the engineer was accessing

In a healthy security culture, access to this information would be scoped by role and need-to-know. But if checkpoints are stored on a git branch that anyone with repo access can read, you have effectively given every engineer in the organization a detailed map of your security implementation, written in natural language that is far easier to understand than the code itself.

This is not hypothetical. At my previous company, we had an incident where a junior engineer accidentally exposed internal architecture details in a public repository. It took us weeks to assess the damage. Now imagine that exposure includes not just the code but the reasoning behind every security decision.

I am not saying Entire is a bad tool. I am saying it needs granular access controls on the checkpoint branch before it should be used in any environment where security matters. And the “hidden branch” architecture makes this harder, not easier – in git, branch-level access control is limited and varies by hosting platform.
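For what it is worth, the policy being argued for is easy to express once you know which file paths a session touched. The sketch below is purely hypothetical, since Checkpoints exposes no such filter today; the glob list and function are invented to illustrate the idea:

```python
# Hypothetical need-to-know filter: given the file paths an agent
# session touched, decide whether its trace is safe to share broadly.
# This is not a Checkpoints feature; it illustrates the access-control
# policy being argued for.
from fnmatch import fnmatch

SENSITIVE_GLOBS = ("*auth*", "*secrets*", "infra/vault/*")

def trace_is_shareable(touched_paths: list[str]) -> bool:
    """False if any touched path matches a sensitive glob."""
    return not any(fnmatch(p, g)
                   for p in touched_paths for g in SENSITIVE_GLOBS)

print(trace_is_shareable(["src/ui/Button.tsx"]))           # True
print(trace_is_shareable(["src/middleware/auth_jwt.py"]))  # False
```

Even a coarse filter like this would have to run at capture time, before the trace lands on the shared branch; redacting after the fact does not help once the ref has been pushed.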

Alex, in your testing, did you see any options for filtering what gets captured? For example, can you exclude sessions that touch certain file paths (like auth or secrets management) from the checkpoint data?