I spent the weekend actually trying out Entire’s Checkpoints CLI and wanted to share what the experience is like from a working engineer’s perspective, not the 30,000-foot CTO/VP view.
Setup and first impressions
Checkpoints currently supports Claude Code and Gemini CLI. I tested it with Claude Code since that is my primary agent. Installation is straightforward – it is a CLI tool that you run alongside your existing agent workflow. You do not need to change how you use Claude Code at all. After a session, Checkpoints captures the reasoning trace and stores it on a hidden branch.
The “zero friction” claim is mostly accurate. Once installed, you forget it is there. It does not interfere with your normal git workflow, and your branch history stays clean. The reasoning traces live on entire/checkpoints/v1 and you only see them when you explicitly look.
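Since the traces are ordinary git objects, you can poke at the hidden branch with stock git commands. Here is a minimal sketch in a throwaway repo: the branch name is the one Checkpoints uses, but session.json and its contents are placeholders I invented for the demo, not the branch's actual layout.

```shell
# Sketch: reading a trace file off a hidden branch without checking it out.
# entire/checkpoints/v1 is the real branch name; session.json is a made-up
# placeholder standing in for whatever Checkpoints actually stores.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you
git commit -q --allow-empty -m "normal work"   # your regular history
git checkout -q -b entire/checkpoints/v1       # stand-in for the trace branch
echo '{"prompt":"refactor to hooks"}' > session.json
git add session.json
git commit -q -m "checkpoint"
git checkout -q -                              # back to your normal branch
# Inspect a trace without touching your working tree:
git show entire/checkpoints/v1:session.json
```

The `branch:path` form of git show is what keeps this zero-friction: your checked-out branch and working tree never change while you browse the traces.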
What it actually captures
For each agent session, Checkpoints records:
- The prompts you gave the agent
- The agent’s reasoning chain (what it considered, what it rejected)
- Tool calls (file reads, writes, shell commands)
- The resulting code changes with before/after diffs
- Metadata about which model version was used
The detail level is impressive. I ran a session where I asked Claude Code to refactor a React component from class-based to functional with hooks, and the checkpoint included the agent’s reasoning about which lifecycle methods to map to which hooks, edge cases it considered, and why it chose useEffect cleanup patterns over alternatives.
Where it actually helped me
Here is the surprising part: I found the reasoning traces most valuable not for auditing but for learning. I have been using Claude Code for about a year, and I thought I understood how it approaches problems. But looking at the reasoning traces, I noticed patterns I had missed:
- The agent was considering performance implications I never explicitly asked about
- It was making accessibility decisions in my React code that I would have missed
- Some of its “choices” were actually model biases – it consistently preferred certain patterns over equally valid alternatives
This visibility changed how I prompt. I started giving more specific instructions about trade-offs I wanted the agent to optimize for, and the output quality improved noticeably.
Where it falls short
A few real problems I hit:
- Storage bloat: After a week of normal usage, the checkpoint data was about 3x the size of my actual codebase. For a solo developer that is fine, but at team scale with 50+ engineers it is going to be a significant storage issue.
- No search yet: You can browse checkpoints chronologically, but there is no semantic search. I wanted to ask “show me every time the agent made a security-related decision” and that is not possible today. Dohmke has said this is coming with the semantic reasoning layer, but right now you are scrolling through raw traces.
- Context window limits: The reasoning traces are only as good as what the agent actually logged. Some sessions had gaps where the agent’s reasoning was truncated due to context window limits. You get a partial picture, which might be worse than no picture if you are relying on it for compliance.
- Multi-agent workflows: I sometimes use Claude Code and Cursor in the same project. Checkpoints only captures the Claude Code sessions. If I switch to Cursor mid-task, there is a gap in the trace that makes the audit trail incomplete.
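To put the storage point in numbers, here is a back-of-envelope projection. The 3x ratio is my own one-week measurement; the codebase size, head count, and the assumption that trace overhead scales linearly per engineer are illustrative guesses, not anything Entire publishes.

```python
# Back-of-envelope checkpoint storage projection. The 3x ratio is my own
# one-week measurement; linear per-engineer scaling is an assumption.
def projected_trace_gb(codebase_gb: float, engineers: int, ratio: float = 3.0) -> float:
    """Assume each engineer accrues ratio x the codebase size in traces."""
    return codebase_gb * ratio * engineers

# Under those assumptions, a 2 GB repo with 50 engineers carries
# roughly 300 GB of trace data after a comparable period of use:
print(projected_trace_gb(2.0, 50))  # 300.0
```

Even if per-engineer overhead is lower in practice because of shared history, the shape of the problem is clear: trace storage grows with head count while the codebase does not.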
The developer experience verdict
Honestly? As an individual developer, Checkpoints is a nice-to-have, not a must-have. The learning benefits are real but subtle. Where I see the real value is at the team level – if I can look at a teammate’s checkpoint to understand why their AI agent made a particular architectural decision in a PR I am reviewing, that significantly improves code review quality.
But it needs search, it needs multi-agent support, and the storage issue needs a solution before it is viable for serious team adoption. I give it a 7/10 for potential and a 5/10 for current utility.
Has anyone else hands-on tested this? Curious if your experience matches mine.