Last month, researchers at SentinelLABS published findings that should alarm anyone running AI infrastructure: 175,000 Ollama instances are exposed to the public internet across 130 countries, the vast majority without any form of authentication. Ollama, the increasingly popular tool for running large language models locally, listens on 127.0.0.1 in a plain install, but the official Docker image sets OLLAMA_HOST to 0.0.0.0, and most developers simply never change it.
Why This Matters
An exposed Ollama instance isn’t just an information leak — it’s a multi-vector attack surface:
- Free GPU compute: Attackers can use your expensive GPU hardware to run their own inference workloads. You pay the electricity and hardware depreciation; they get free AI.
- Model access: Any models loaded on the server are accessible — including fine-tuned models that may contain proprietary data or trade secrets baked into their weights.
- Code execution via tool-calling: If the Ollama instance has tool-calling (function calling) enabled, attackers can instruct the model to execute arbitrary functions — reading files, making network requests, running system commands.
- Network pivot point: A compromised Ollama server on your internal network gives attackers a beachhead to discover and attack other internal services.
Operation Bizarre Bazaar
SentinelLABS documented a threat operation they dubbed “Operation Bizarre Bazaar,” where threat actors built an actual marketplace called silver.inc that resells stolen GPU compute time harvested from compromised Ollama (and other AI serving) instances. Customers pay a fraction of what AWS, GCP, or Azure charge for GPU inference, and their workloads run silently on someone else’s infrastructure. The operators built automated tooling that continuously scans for exposed Ollama instances, tests for authentication, and adds vulnerable servers to their compute pool. It’s essentially a botnet, but instead of DDoS attacks, it’s running AI inference.
The Shadow AI Infrastructure Problem
This is part of a broader pattern I’m calling shadow AI infrastructure. Developers spin up Ollama, vLLM, text-generation-inference, LocalAI, and other model-serving tools for experimentation and prototyping. They never intend these services to be internet-facing. But then:
- They deploy to a cloud VM with a public IP and don’t configure security groups properly
- They publish the container port with Docker (for example -p 11434:11434), which binds to every host interface, including the public one, unless an address is specified
- They set up port forwarding on their home router to access their model from their phone
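- They expose the service on Kubernetes (for example through a LoadBalancer or NodePort) with no authentication at the ingress and no network policy restricting who can reach it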
The result is thousands of AI inference endpoints scattered across the internet with zero access control. This is the shadow IT problem all over again, except the stakes are higher because GPUs are expensive and the attack surface is novel.
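A quick way to catch the first two failure modes is to check what address a service is actually told to listen on. The sketch below is illustrative only: bind_scope is a hypothetical helper, not part of Ollama, and it does not claim to reproduce how Ollama itself parses OLLAMA_HOST. It simply classifies a host or host:port string using Python's standard ipaddress module.

```python
import ipaddress


def bind_scope(bind: str) -> str:
    """Classify a listen address as loopback, all-interfaces, specific-address, or unknown.

    Hypothetical helper for auditing config values such as OLLAMA_HOST.
    """
    value = bind.strip()
    # Drop an optional scheme such as http://
    if "://" in value:
        value = value.split("://", 1)[1]

    if value.startswith("["):
        # Bracketed IPv6, e.g. [::1]:11434
        host = value[1:].split("]", 1)[0]
    elif value.count(":") == 1:
        # host:port
        host = value.split(":", 1)[0]
    else:
        # Bare hostname, bare IPv4, or bare IPv6
        host = value

    if host == "localhost":
        return "loopback"
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; a hostname would need resolving to judge it.
        return "unknown"
    if ip.is_loopback:
        return "loopback"
    if ip.is_unspecified:
        # 0.0.0.0 or :: means "listen on every interface".
        return "all-interfaces"
    return "specific-address"


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "0.0.0.0:11434", "[::1]:11434", "10.0.0.5:11434"):
        print(candidate, "->", bind_scope(candidate))
```

Anything that comes back as all-interfaces deserves a second look at the firewall or security group in front of it; a loopback result only tells you about the process itself, not about a container port published on the host.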
The Tool-Calling Attack Vector
Perhaps the most alarming finding: nearly half of the exposed Ollama instances have tool-calling enabled. This means an attacker doesn’t need a traditional exploit to achieve remote code execution. They simply send a prompt to the model asking it to:
- Read the contents of /etc/passwd or ~/.ssh/id_rsa
- Make HTTP requests to internal services (SSRF)
- Write files to disk
- Execute shell commands through a code interpreter tool
The model happily complies because it has no concept of authorization boundaries. It’s been given tools and told to use them — it doesn’t distinguish between a legitimate user and an attacker. Every exposed LLM with tool-calling enabled is effectively an unauthenticated RCE endpoint with a natural language interface.
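Because the model cannot enforce authorization, the code that executes tool calls has to. Here is a minimal sketch of that idea, assuming an application layer (not the model server) is what actually runs the tools. ToolGate, ToolDenied, and the caller identifiers are hypothetical names for illustration; how a caller gets authenticated is out of scope and assumed to happen upstream.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional, Set


class ToolDenied(Exception):
    """Raised when a caller is not allowed to run the requested tool."""


@dataclass
class ToolGate:
    """Runs a tool only if the authenticated caller was explicitly granted it."""

    tools: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self.tools[name] = fn

    def grant(self, caller: str, tool_name: str) -> None:
        self.grants.setdefault(caller, set()).add(tool_name)

    def dispatch(self, caller: Optional[str], tool_name: str, arguments: Dict[str, Any]) -> Any:
        # The decision depends on who is asking, never on what the model said.
        if caller is None:
            raise ToolDenied("unauthenticated caller")
        if tool_name not in self.tools:
            raise ToolDenied(f"unknown tool: {tool_name}")
        if tool_name not in self.grants.get(caller, set()):
            raise ToolDenied(f"{caller} may not call {tool_name}")
        return self.tools[tool_name](**arguments)


if __name__ == "__main__":
    gate = ToolGate()
    gate.register("add", lambda a, b: a + b)
    gate.grant("alice", "add")
    print(gate.dispatch("alice", "add", {"a": 2, "b": 3}))
    try:
        gate.dispatch(None, "add", {"a": 2, "b": 3})
    except ToolDenied as err:
        print("denied:", err)
```

The design choice is deny by default: a tool the model asks for runs only when an explicit grant exists for the authenticated caller, so a prompt from an anonymous client can request anything and still execute nothing.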
Remediation Checklist
If you’re running Ollama or any other local AI inference tool, here’s what you need to do immediately:
- Bind to localhost only: Set OLLAMA_HOST=127.0.0.1 in your environment. This single change eliminates the majority of exposure.
- Use a reverse proxy with authentication: If you need remote access, put Ollama behind nginx, Caddy, or Traefik with proper auth (OAuth2, mTLS, or at minimum basic auth with strong credentials).
- Scan your public IP ranges: Run nmap -p 11434 against your organization’s public IPs. Port 11434 is Ollama’s default. You might be surprised what you find.
- Disable tool-calling if not needed: If you’re only using the model for text generation, disable function calling entirely to eliminate the RCE vector.
- Network segmentation: AI inference servers should be on their own network segment, isolated from production infrastructure and sensitive data stores.
- Monitor for anomalous usage: Track GPU utilization and API request patterns. A sudden spike in inference requests at 3 AM is a strong indicator of unauthorized access.
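An open port alone doesn't tell you whether the service behind it answers without credentials. As a follow-up to the nmap scan, the sketch below requests Ollama's model listing endpoint (GET /api/tags) and classifies the result. The probe and classify functions and their result labels are my own illustrative names, and the script assumes plain HTTP on the target port; run it only against hosts you own or are authorized to test.

```python
import http.client
import json
import urllib.error
import urllib.request

# Ignore any system proxy settings so the probe talks to the host directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def classify(status: int, body: bytes) -> str:
    """Interpret the response to GET /api/tags."""
    if status in (401, 403):
        return "auth-required"
    if status != 200:
        return "not-ollama"
    try:
        data = json.loads(body)
    except ValueError:
        return "not-ollama"
    # An unauthenticated Ollama server answers with {"models": [...]}.
    if isinstance(data, dict) and isinstance(data.get("models"), list):
        return "exposed"
    return "not-ollama"


def probe(host: str, port: int = 11434, timeout: float = 3.0) -> str:
    """Return 'exposed', 'auth-required', 'not-ollama', or 'unreachable'."""
    url = f"http://{host}:{port}/api/tags"
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return classify(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions; classify by status code.
        return classify(err.code, b"")
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return "unreachable"


if __name__ == "__main__":
    # Placeholder address; replace with hosts from your own inventory.
    for target in ("127.0.0.1",):
        print(target, probe(target))
```

A result of exposed means anyone who can reach that address can list, and almost certainly use, the models on it; auth-required suggests a proxy is doing its job.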
What This Means for “Run AI Locally”
The “local AI” movement is built on the premise that running models on your own hardware is more private and secure than sending data to cloud APIs. That’s true in principle — but only if you actually secure your local deployment. The security posture of local AI tools is years behind web application security. There’s no WAF equivalent, no rate limiting middleware, no auth ecosystem, no standardized security headers. We’re essentially in the “Apache server in 2002” era of AI infrastructure security.
Have you audited your organization’s AI infrastructure for exposed endpoints? I’d be willing to bet most companies have at least one instance they don’t know about.