I’ve been deep in the weeds building pre-deployment cost gates for our LLM infrastructure, and I want to share what we’ve built. This is a technical deep-dive for anyone actually implementing this.
Context: We’re running inference for large language models where costs can spike from $10K to $100K+ per month if we’re not careful. A single misconfigured deployment can burn through serious money before anyone notices. We needed gates that actually work.
Architecture Overview:
┌─────────────────┐
│    Developer    │
│   Pushes Code   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ GitHub Actions  │
│ CI/CD Pipeline  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐    ┌──────────────────┐
│ Cost Estimation │───▶│  Policy Engine   │
│   (Infracost)   │    │      (OPA)       │
└─────────────────┘    └────────┬─────────┘
                                │
                      ┌─────────┴─────────┐
                      │                   │
                      ▼                   ▼
                ┌──────────┐         ┌──────────┐
                │  ALLOW   │         │  BLOCK   │
                │  Deploy  │         │ + Notify │
                └──────────┘         └──────────┘
                                          │
                                          ▼
                                   ┌─────────────┐
                                   │ Approval Bot│
                                   │   (Slack)   │
                                   └─────────────┘
Component 1: Cost Estimation Engine
We use Infracost for Terraform infrastructure and custom scripts for application-level costs.
Terraform cost estimation:
# .github/workflows/cost-check.yml
- name: Checkout base branch
  uses: actions/checkout@v3
  with:
    ref: '${{ github.event.pull_request.base.ref }}'

- name: Generate base cost estimate
  run: |
    infracost breakdown --path=. \
      --format=json \
      --out-file=/tmp/infracost-base.json

- name: Checkout PR branch
  uses: actions/checkout@v3

- name: Generate PR cost estimate
  run: |
    infracost breakdown --path=. \
      --format=json \
      --out-file=/tmp/infracost-pr.json

- name: Generate cost diff
  run: |
    # --path is the current (PR) state, --compare-to is the baseline
    infracost diff \
      --path=/tmp/infracost-pr.json \
      --compare-to=/tmp/infracost-base.json \
      --format=json \
      --out-file=/tmp/infracost-diff.json
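Downstream steps pull the monthly delta out of that diff JSON. A minimal sketch, assuming the diff output exposes a top-level `diffTotalMonthlyCost` string field (that's what recent Infracost versions emit; verify against your version):

```python
import json

def monthly_cost_delta(diff_path):
    """Return the monthly cost change (USD) from an Infracost diff JSON.

    Assumes a top-level `diffTotalMonthlyCost` field; Infracost
    serializes costs as strings, and a missing field means no change.
    """
    with open(diff_path) as f:
        diff = json.load(f)
    return float(diff.get("diffTotalMonthlyCost") or 0)
```

We feed this number, alongside the per-resource breakdown, into the policy input.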
Application-level cost estimation (custom):
For things like Lambda, API calls, and LLM inference that Infracost can’t estimate:
# cost_estimator.py

def estimate_lambda_cost(config):
    """Estimate Lambda costs based on memory, duration, invocations."""
    memory_gb = config['memory_mb'] / 1024
    duration_sec = config['estimated_duration_ms'] / 1000
    invocations_per_month = config['estimated_invocations']

    # Lambda compute pricing: $0.0000166667 per GB-second
    compute_cost = memory_gb * duration_sec * invocations_per_month * 0.0000166667

    # Request cost: $0.20 per 1M requests
    request_cost = (invocations_per_month / 1_000_000) * 0.20

    return compute_cost + request_cost

def estimate_llm_cost(config):
    """Estimate LLM inference costs."""
    model = config['model']  # e.g., 'gpt-4', 'claude-3'
    estimated_tokens_per_month = config['estimated_tokens']

    pricing = {
        'gpt-4': {'input': 0.03 / 1000, 'output': 0.06 / 1000},
        'claude-3': {'input': 0.015 / 1000, 'output': 0.075 / 1000}
    }

    # Assume a 60/40 input/output token split
    input_tokens = estimated_tokens_per_month * 0.6
    output_tokens = estimated_tokens_per_month * 0.4

    return (input_tokens * pricing[model]['input'] +
            output_tokens * pricing[model]['output'])
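As a sanity check on the Lambda math, here's the arithmetic for a hypothetical workload (512 MB function, 200 ms average duration, 1M invocations/month; the numbers are illustrative, not from our production data):

```python
# Hypothetical workload: 512 MB, 200 ms average, 1M invocations/month
memory_gb = 512 / 1024            # 0.5 GB
duration_sec = 200 / 1000         # 0.2 s
invocations = 1_000_000

compute = memory_gb * duration_sec * invocations * 0.0000166667  # ≈ $1.67
requests = (invocations / 1_000_000) * 0.20                      # $0.20
total = compute + requests
print(f"${total:.2f}/month")  # prints $1.87/month
```

Cheap in isolation, which is exactly why aggregate limits (below) matter more than per-function ones for serverless.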
Component 2: Policy Engine (OPA)
We run Open Policy Agent to evaluate costs against our policies:
# policies/cost_limits.rego
package cost

import future.keywords.contains
import future.keywords.if
import future.keywords.in

# Default monthly limits (USD) by environment
default_limits := {
    "dev": 1000,
    "staging": 2000,
    "prod": 5000
}

# Get environment from resource tags
get_environment(resource) := env if {
    env := resource.tags.environment
}

# Deny if monthly cost exceeds the environment limit
deny contains msg if {
    some resource in input.resources
    cost := resource.monthly_cost
    env := get_environment(resource)
    limit := default_limits[env]
    cost > limit
    not resource.tags.cost_approved
    msg := sprintf(
        "Resource %s costs $%.2f/month, exceeds %s limit of $%.2f",
        [resource.name, cost, env, limit]
    )
}

# Deny if the aggregate deployment cost is too high
deny contains msg if {
    total_cost := sum([r.monthly_cost | some r in input.resources])
    total_cost > 10000
    not input.deployment.tags.high_cost_approved
    msg := sprintf(
        "Total deployment cost $%.2f exceeds $10,000 threshold",
        [total_cost]
    )
}

# Require a cost_center tag on every resource
deny contains msg if {
    some resource in input.resources
    not resource.tags.cost_center
    msg := sprintf("Resource %s missing required cost_center tag", [resource.name])
}

# Warn about expensive instance types in dev
warn contains msg if {
    some resource in input.resources
    resource.type == "aws_instance"
    resource.tags.environment == "dev"
    resource.instance_type in ["m5.4xlarge", "r5.4xlarge", "c5.4xlarge"]
    msg := sprintf(
        "Resource %s uses expensive instance type %s in dev environment",
        [resource.name, resource.instance_type]
    )
}
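For reference, the input document these rules evaluate against looks like this (values are illustrative; the shape mirrors the normalized format our estimator emits). This particular resource would trip both the dev-limit deny and the instance-type warn:

```json
{
  "deployment": {"tags": {}},
  "resources": [
    {
      "name": "inference-gpu-node",
      "type": "aws_instance",
      "instance_type": "m5.4xlarge",
      "monthly_cost": 1400.00,
      "tags": {"environment": "dev", "cost_center": "ml-platform"}
    }
  ]
}
```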
Component 3: CI/CD Integration
Putting it all together in GitHub Actions:
- name: Run cost estimation
  run: |
    python scripts/cost_estimator.py \
      --terraform-plan=tfplan \
      --output=costs.json

- name: Evaluate cost policies
  run: |
    # conftest exits non-zero on failures; don't fail this step,
    # the explicit gate below handles blocking
    conftest test costs.json \
      --policy=policies/ \
      --output=json \
      --fail-on-warn=false \
      > policy-results.json || true

- name: Post PR comment with results
  uses: actions/github-script@v6
  with:
    script: |
      const fs = require('fs');
      const results = JSON.parse(fs.readFileSync('policy-results.json', 'utf8'));
      const comment = generateCostComment(results);
      await github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body: comment
      });

- name: Block if policies failed
  run: |
    # conftest's JSON output is an array of results, one per tested file
    if jq -e 'map(.failures | length) | add > 0' policy-results.json; then
      echo "Cost policies failed, blocking deployment"
      exit 1
    fi
Component 4: Exception Workflow (Slack Bot)
When a deployment is blocked, we notify via Slack with one-click approval:
# slack_approval_bot.py
from slack_bolt import App

# Reads SLACK_BOT_TOKEN / SLACK_SIGNING_SECRET from the environment
app = App()

@app.command("/cost-approve")
def handle_approval_request(ack, command, client):
    ack()
    pr_url = command['text'].split()[0]
    justification = ' '.join(command['text'].split()[1:])
    cost_amount = get_cost_from_pr(pr_url)

    # Determine approver based on cost
    approver = determine_approver(cost_amount)

    client.chat_postMessage(
        channel=approver_slack_channel(approver),
        text=f"Cost approval request for {pr_url}",
        blocks=[
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*Cost Approval Needed*\n\n"
                            f"PR: {pr_url}\n"
                            f"Estimated cost: ${cost_amount}/month\n"
                            f"Justification: {justification}\n"
                            f"Requested by: {command['user_name']}"
                }
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Approve"},
                        "style": "primary",
                        "value": f"approve:{pr_url}",
                        "action_id": "approve_cost"
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Deny"},
                        "style": "danger",
                        "value": f"deny:{pr_url}",
                        "action_id": "deny_cost"
                    }
                ]
            }
        ]
    )

@app.action("approve_cost")
def handle_approval(ack, body, client):
    ack()
    # split(':', 1) — the PR URL itself contains colons
    pr_url = body['actions'][0]['value'].split(':', 1)[1]

    # Add cost-approved tag to PR
    add_tag_to_pr(pr_url, "cost-approved")

    # Notify requester
    client.chat_postMessage(
        channel=get_requester_channel(pr_url),
        text=f"✅ Cost approval granted for {pr_url}. You can retry deployment."
    )
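The `determine_approver` helper above routes by cost tier. A minimal sketch — the thresholds and role names here are illustrative examples, not our production values:

```python
def determine_approver(cost_amount):
    """Route an approval request by estimated monthly cost (USD).

    Illustrative tiers: small overruns go to a team lead, larger
    ones escalate up the management chain.
    """
    if cost_amount < 2_000:
        return "team-lead"
    if cost_amount < 10_000:
        return "engineering-manager"
    return "vp-engineering"
```

Keeping the tiers few and the routing deterministic matters more than getting the thresholds exactly right; ambiguity about who approves is what slows the workflow down.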
Technical Challenges & Solutions:
Challenge 1: Accurate serverless cost estimation
Problem: Hard to estimate Lambda costs without knowing invocation patterns.
Solution:
- Use historical data from similar functions
- Conservative estimates (assume high-end of expected range)
- Monitor actual vs estimated, improve model over time
Challenge 2: Policy evaluation performance
Problem: Running Infracost + OPA added 45 seconds to CI/CD initially.
Solution:
- Cache Infracost pricing data (refresh daily, not every build)
- Run policy evaluation in parallel with other CI/CD steps
- Optimize Rego policies (avoid unnecessary iterations)
- Now down to 18 seconds
Challenge 3: Handling multi-cloud deployments
Problem: We use AWS, GCP, and on-prem K8s. Different cost models.
Solution:
- Normalize costs to “monthly cost” abstraction
- Cloud-specific estimators feed into unified format:
{
  "resources": [
    {
      "name": "api-server",
      "type": "compute",
      "cloud": "aws",
      "monthly_cost": 450.00,
      "breakdown": {...}
    }
  ]
}
- Policies evaluate the normalized format and don't care which cloud a resource runs on
Challenge 4: Cost estimation for usage-based services
Problem: API Gateway, LLM calls, etc. have variable costs.
Solution:
- Estimate based on historical usage patterns
- Show cost ranges: “Estimated: $200-$800/month”
- Policy handles ranges:
deny contains msg if {
    some resource in input.resources
    limit := default_limits[resource.tags.environment]
    cost_max := resource.cost_estimate.max
    cost_max > limit
    msg := sprintf("Resource %s worst-case cost $%.2f exceeds limit $%.2f",
        [resource.name, cost_max, limit])
}
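Producing those ranges in the first place can be as simple as taking percentiles of historical monthly usage. A hedged sketch — the percentile choices (p25/p95) and nearest-rank method are illustrative, not our exact model:

```python
def cost_range(monthly_usage_history, unit_price):
    """Derive a {min, max} monthly cost range from observed usage samples.

    Uses nearest-rank percentiles: p25 as the optimistic floor,
    p95 as the conservative ceiling.
    """
    samples = sorted(monthly_usage_history)

    def percentile(p):
        # nearest-rank index into the sorted samples
        idx = min(len(samples) - 1, int(round(p * (len(samples) - 1))))
        return samples[idx]

    return {
        "min": percentile(0.25) * unit_price,
        "max": percentile(0.95) * unit_price,
    }
```

The p95 ceiling is deliberately pessimistic: we'd rather occasionally ask for an approval we didn't need than let a usage spike through the gate.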
Monitoring & Iteration:
We track:
- Estimation accuracy: Compare estimated vs actual costs monthly
- False positive rate: Blocked deployments that should have passed
- Approval time: How long exception workflows take
- Developer satisfaction: Regular surveys on the process
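The estimation-accuracy check is just a signed percentage error per resource, aggregated monthly (sketch; the helper name is ours for illustration):

```python
def estimation_error_pct(estimated, actual):
    """Signed error: positive means over-estimated, negative means under."""
    return (estimated - actual) / actual * 100

# e.g. estimated $500/month against a $400 actual bill is a +25% over-estimate
```

Signed (rather than absolute) error matters here: systematic over-estimation inflates the false-positive rate, systematic under-estimation defeats the gate.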
Current stats after 4 months:
- Estimation accuracy: ±25% for infrastructure, ±40% for serverless
- False positive rate: 4.2% (down from 18% initially)
- Average approval time: 22 minutes
- Developer NPS: 7/10 (up from 5/10)
Lessons Learned:
- Start with read-only policies: Show violations without blocking, build confidence
- Invest in developer experience: Clear error messages matter more than policy sophistication
- Monitor actual vs estimated: Feedback loop improves estimation
- Keep policies simple initially: Add complexity gradually
- Fast approval workflow is critical: If approval takes >2 hours, people find workarounds
Open Source:
I’m planning to open-source our cost estimation scripts and OPA policies. Would include:
- Terraform cost policy library
- Application-level cost estimators (Lambda, API Gateway, LLM)
- GitHub Actions workflow templates
- Slack approval bot code
- Policy testing framework
Would this be useful? What else should I include?
Questions I’m still working through:
- How to handle cost trends vs absolute costs? (A single deployment is cheap, but ten of them are not)
- Best way to estimate costs for new services we haven’t run before?
- How to integrate with FinOps tools (CloudHealth, Kubecost, etc.)?
- Should policies block on estimated cost or committed cost (RIs, savings plans)?
Anyone else building this? Let’s compare notes.