175,000 Ollama Servers Are Exposed to the Public Internet — Attackers Are Reselling Your GPU Compute

Last month, researchers at SentinelLABS published findings that should alarm anyone running AI infrastructure: 175,000 Ollama instances are exposed to the public internet across 130 countries, the vast majority without any form of authentication. Ollama, the increasingly popular tool for running large language models locally, listens on 127.0.0.1 in a native install, but the official Docker image sets OLLAMA_HOST to 0.0.0.0, and most developers who publish that port never change it.

Why This Matters

An exposed Ollama instance isn’t just an information leak — it’s a multi-vector attack surface:

  1. Free GPU compute: Attackers can use your expensive GPU hardware to run their own inference workloads. You pay the electricity and hardware depreciation; they get free AI.
  2. Model access: Any models loaded on the server are accessible — including fine-tuned models that may contain proprietary data or trade secrets baked into their weights.
  3. Code execution via tool-calling: If the Ollama instance has tool-calling (function calling) enabled, attackers can instruct the model to execute arbitrary functions — reading files, making network requests, running system commands.
  4. Network pivot point: A compromised Ollama server on your internal network gives attackers a beachhead to discover and attack other internal services.

Operation Bizarre Bazaar

SentinelLABS documented a threat operation they dubbed “Operation Bizarre Bazaar,” where threat actors built an actual marketplace called silver.inc that resells stolen GPU compute time harvested from compromised Ollama (and other AI serving) instances. Customers pay a fraction of what AWS, GCP, or Azure charge for GPU inference, and their workloads run silently on someone else’s infrastructure. The operators built automated tooling that continuously scans for exposed Ollama instances, tests for authentication, and adds vulnerable servers to their compute pool. It’s essentially a botnet, but instead of DDoS attacks, it’s running AI inference.

The Shadow AI Infrastructure Problem

This is part of a broader pattern I’m calling shadow AI infrastructure. Developers spin up Ollama, vLLM, text-generation-inference, LocalAI, and other model-serving tools for experimentation and prototyping. They never intend these services to be internet-facing. But then:

  • They deploy to a cloud VM with a public IP and don’t configure security groups properly
  • They run Docker with -p 11434:11434, which publishes the port on every host interface (including the public one) unless a bind address is given
  • They set up port forwarding on their home router to access their model from their phone
  • They expose the service on Kubernetes (for example through a LoadBalancer or NodePort Service) with no authenticating ingress or network policy in front of it

The result is thousands of AI inference endpoints scattered across the internet with zero access control. This is the shadow IT problem all over again, except the stakes are higher because GPUs are expensive and the attack surface is novel.

The Tool-Calling Attack Vector

Perhaps the most alarming finding: nearly half of the exposed Ollama instances have tool-calling enabled. This means an attacker doesn’t need a traditional exploit to achieve remote code execution. They simply send a prompt to the model asking it to:

  • Read the contents of /etc/passwd or ~/.ssh/id_rsa
  • Make HTTP requests to internal services (SSRF)
  • Write files to disk
  • Execute shell commands through a code interpreter tool

The model happily complies because it has no concept of authorization boundaries. It’s been given tools and told to use them — it doesn’t distinguish between a legitimate user and an attacker. Every exposed LLM with tool-calling enabled is effectively an unauthenticated RCE endpoint with a natural language interface.
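
Since the model will not refuse on its own, any check has to live in the tool code itself. Here is a minimal sketch of one such check for the SSRF case: a guard that a hypothetical HTTP-fetch tool could call before making a request. The function names (is_internal, resolve_host, url_is_allowed) are illustrative and not part of Ollama or any tool framework.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_internal(ip_text: str) -> bool:
    """True for addresses an internet-facing tool should never reach."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_unspecified
    )


def resolve_host(host: str) -> list[str]:
    """Resolve a hostname to the sorted set of IP addresses it maps to."""
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})


def url_is_allowed(url: str, resolver=resolve_host) -> bool:
    """Reject non-HTTP(S) URLs and any URL whose host resolves to an internal address."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = resolver(parts.hostname)
        # Every resolved address must be external, not just the first one.
        return bool(addresses) and not any(is_internal(a) for a in addresses)
    except (OSError, ValueError):
        return False  # unresolvable or malformed hosts are refused
```

A guard like this only helps if the tool then connects to the address it just checked. Resolving a second time at request time leaves room for DNS rebinding.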

Remediation Checklist

If you’re running Ollama or any other local AI inference tool, here’s what you need to do immediately:

  1. Bind to localhost only: Set OLLAMA_HOST=127.0.0.1 in your environment. In Docker, publish the port on loopback instead (-p 127.0.0.1:11434:11434). This single change eliminates the majority of exposure.
  2. Use a reverse proxy with authentication: If you need remote access, put Ollama behind nginx, Caddy, or Traefik with proper auth (OAuth2, mTLS, or at minimum basic auth with strong credentials).
  3. Scan your public IP ranges: Run nmap -p 11434 against your organization’s public IPs. Port 11434 is Ollama’s default. You might be surprised what you find.
  4. Disable tool-calling if not needed: If you’re only using the model for text generation, disable function calling entirely to eliminate the RCE vector.
  5. Network segmentation: AI inference servers should be on their own network segment, isolated from production infrastructure and sensitive data stores.
  6. Monitor for anomalous usage: Track GPU utilization and API request patterns. A sudden spike in inference requests at 3 AM is a strong indicator of unauthorized access.
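
An open port alone does not tell you whether the API answers without credentials. As a rough follow-up to the port scan in step 3, the sketch below sends a single unauthenticated GET to /api/tags (Ollama's model-listing endpoint) and checks whether a model list comes back. The helper names are my own, and it should only be pointed at hosts you are authorized to test.

```python
import http.client
import json
import urllib.error
import urllib.request


def looks_like_open_ollama(status: int, body: bytes) -> bool:
    """True if an unauthenticated GET /api/tags returned a model listing."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False  # not JSON, e.g. a login page from a proxy
    return isinstance(payload, dict) and isinstance(payload.get("models"), list)


def probe(host: str, port: int = 11434, timeout: float = 3.0) -> bool:
    """Send one credential-free request and report whether the API answered."""
    url = f"http://{host}:{port}/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return looks_like_open_ollama(resp.status, resp.read())
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Closed port, timeout, 401/403 from a proxy, or a non-HTTP service.
        return False
```

Calling probe("203.0.113.10") on each address that nmap reported open separates instances that sit behind an authenticating proxy from those that answer anyone.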

What This Means for “Run AI Locally”

The “local AI” movement is built on the premise that running models on your own hardware is more private and secure than sending data to cloud APIs. That’s true in principle — but only if you actually secure your local deployment. The security posture of local AI tools is years behind web application security. There’s no WAF equivalent, no rate limiting middleware, no auth ecosystem, no standardized security headers. We’re essentially in the “Apache server in 2002” era of AI infrastructure security.

Have you audited your organization’s AI infrastructure for exposed endpoints? I’d be willing to bet most companies have at least one instance they don’t know about.

Binding to 0.0.0.0 out of the box in the Docker image is a design choice that prioritizes developer convenience over security, the same mistake web frameworks made years ago. Rails, for example, switched its development server from 0.0.0.0 to localhost in version 4.2 after developers kept accidentally exposing development servers to the internet. Ollama and similar tools need to follow suit. The default should be secure, and opening it up should require a deliberate, documented configuration change.

We did a scan of our cloud infrastructure after reading the SentinelLABS report and found 3 Ollama instances that a data science team had spun up on EC2 without telling anyone. Two of the three were publicly accessible, with no authentication, no logging, and no monitoring, running on p3.2xlarge instances at roughly $3/hour each. They’d been running for 6 weeks. That’s about 1,000 hours, or approximately $3,000 in compute per instance, that could have been accessed by anyone on the internet. When we checked the Ollama logs, we found requests from IP addresses that didn’t belong to anyone on our team. Someone had already found them.
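
For anyone wanting to repeat the log check, here is a rough sketch of the idea. It assumes Gin-style access lines with pipe-separated status, latency, client IP, and request fields, so treat the regular expression as an assumption and adjust it to whatever your server actually writes. The trusted CIDR list is hypothetical.

```python
import ipaddress
import re
from collections import Counter

# Assumed access-log shape, for example:
# [GIN] 2025/01/10 - 03:12:45 | 200 | 1.2s | 203.0.113.7 | POST "/api/generate"
LINE_RE = re.compile(
    r'\|\s*(?P<status>\d{3})\s*\|[^|]*\|\s*(?P<ip>[0-9a-fA-F:.]+)\s*\|\s*'
    r'(?P<method>[A-Z]+)\s+"(?P<path>[^"]*)"'
)


def unknown_clients(lines, trusted_cidrs):
    """Count requests per client IP that falls outside every trusted network."""
    networks = [ipaddress.ip_network(cidr) for cidr in trusted_cidrs]
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not an access-log line
        try:
            ip = ipaddress.ip_address(match["ip"])
        except ValueError:
            continue  # field matched the pattern but is not a valid address
        if not any(ip in net for net in networks):
            hits[str(ip)] += 1
    return hits
```

Feeding it the server log and your VPN and office ranges gives a per-address request count for everything else, which is usually enough to tell whether someone outside the team has been using the endpoint.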

The immediate response was threefold:

  1. Shut down the exposed instances and rotated all credentials on the affected machines
  2. Created a Terraform module for AI inference that includes VPC isolation, an ALB with OIDC authentication, CloudWatch monitoring for anomalous usage, and automatic security group rules that block all inbound traffic except from our VPN CIDR
  3. Added Ollama’s default port (11434) to our continuous external attack surface monitoring alongside the usual suspects (database ports, admin panels, etc.)

The Terraform module has been the real win. Data scientists can still spin up Ollama instances whenever they want — they just have to use our module, which handles all the networking and auth automatically. Deployment time went from “spin up an EC2 and install Ollama” (10 minutes, insecure) to “run terraform apply with our module” (15 minutes, secure by default). The five-minute difference is negligible, and now every instance is properly isolated, authenticated, and monitored.

The bigger takeaway: your cloud security tooling probably isn’t looking for AI infrastructure ports. Most CSPM tools have rules for exposed databases, SSH, RDP, and web servers. Almost none check for port 11434 (Ollama), 8000 (vLLM’s default), or the ports text-generation-inference is typically served on (3000 for the launcher’s default, 80 inside the official container). Update your scanning rules.
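
If your CSPM tool can't be extended, a custom check is not much code. The sketch below assumes you have already exported inbound firewall rules into a simple provider-neutral shape. The IngressRule class and its field names are hypothetical, and the watched-port map is whatever your own serving tools listen on.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    """Hypothetical, provider-neutral view of one inbound firewall rule."""
    resource: str
    cidr: str
    from_port: int
    to_port: int


def find_exposed_ai_ports(rules, watched_ports):
    """Return (resource, port, service) for watched ports open to the whole internet."""
    findings = []
    for rule in rules:
        # A prefix length of 0 means 0.0.0.0/0 or ::/0, i.e. every address.
        if ipaddress.ip_network(rule.cidr).prefixlen != 0:
            continue
        for port, service in sorted(watched_ports.items()):
            if rule.from_port <= port <= rule.to_port:
                findings.append((rule.resource, port, service))
    return findings
```

Called with a map such as {11434: "Ollama"}, it also catches wide port ranges that happen to include the watched port, which is where these exposures tend to hide.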

The tool-calling attack vector is what scares me the most about this. With function calling enabled, an attacker doesn’t need a traditional exploit — they don’t need to find a buffer overflow, an injection vulnerability, or a misconfigured permission. They just send a natural language prompt to the model: “Read the contents of /etc/passwd” or “Make an HTTP POST request to http://internal-api.corp:8080/admin/users with the following JSON body.” The model happily complies because it has no concept of authorization boundaries. It was given tools and instructions to be helpful — it doesn’t distinguish between a legitimate user request and an attacker’s exploitation.

This is fundamentally different from every other RCE vector we’ve dealt with. Traditional RCE requires technical sophistication — crafting payloads, understanding memory layouts, bypassing protections. LLM-based RCE through tool-calling requires only the ability to type English sentences. The barrier to entry for attackers just dropped to zero.

What we need is the same kind of sandboxing for AI tool-calling that we have for browser JavaScript. Consider how browsers handle this:

  • JavaScript can’t read arbitrary files from the filesystem
  • Network requests are governed by CORS policies
  • APIs require explicit user permission (camera, microphone, location)
  • Code runs in a sandbox with well-defined boundaries

The model should have a permission manifest that explicitly declares:

  • What functions it can call (allowlist, not blocklist)
  • What network access it has (specific hosts and ports only)
  • What files it can read and write (scoped to specific directories)
  • What system commands it can execute (ideally none, but if needed, a strict allowlist)
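
To make the manifest idea concrete, here is one possible shape for it, written as a deny-by-default object that the tool-dispatch layer consults before acting. Nothing here is an existing standard or an Ollama feature. ToolManifest and its fields are illustrative names, and the path check relies on Path.is_relative_to, which needs Python 3.9 or later.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ToolManifest:
    """Hypothetical manifest: anything not listed is denied."""
    functions: frozenset = frozenset()   # names of callable tools
    hosts: frozenset = frozenset()       # allowed (hostname, port) pairs
    read_dirs: tuple = ()                # directories tools may read from
    commands: frozenset = frozenset()    # executable names, ideally empty

    def allows_call(self, name: str) -> bool:
        return name in self.functions

    def allows_host(self, host: str, port: int) -> bool:
        return (host.lower(), port) in self.hosts

    def allows_read(self, path: str) -> bool:
        # resolve() collapses ".." segments and symlinks before the containment check
        target = Path(path).resolve()
        return any(target.is_relative_to(Path(d).resolve()) for d in self.read_dirs)

    def allows_command(self, argv0: str) -> bool:
        return argv0 in self.commands
```

The point of the design is that the model never sees or edits the manifest. The dispatcher checks it on every call, so a prompt asking for /etc/passwd fails at allows_read regardless of how persuasive the wording is.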

Without this kind of sandboxing, every exposed LLM with tool-calling is essentially an unauthenticated RCE endpoint with a natural language interface. The attack surface isn’t a CVE you can patch — it’s a fundamental architectural flaw in how we’re deploying AI tools.

I’ve started drafting a proposal for an “AI Tool-Calling Security Standard” that borrows heavily from the principle of least privilege and the browser security model. If anyone wants to collaborate on this, reach out — I think the community needs a reference architecture before this gets worse.

This is a governance failure, not just a security failure. Those 175,000 Ollama instances aren’t sitting on the public internet because developers are careless. They’re there because organizations haven’t built the infrastructure governance to handle the AI era: developers can spin up AI infrastructure without going through any approval process, any security review, or any architectural oversight.

We saw the same pattern in the early cloud era. Developers spun up EC2 instances, S3 buckets, and RDS databases outside of IT’s purview because the official process was too slow. The industry solved this with landing zones, service catalogs, and guardrails — pre-approved environments that gave developers freedom within safe boundaries. The AI infrastructure era needs the exact same thing.

Here’s what we built internally:

We created an “AI Platform” team (3 engineers, reporting to the platform engineering org) that provides approved, pre-configured environments for model serving. The offering includes:

  • Ollama on Kubernetes with network policies that restrict ingress to internal CIDR ranges only
  • vLLM deployments behind an authenticated API gateway with rate limiting and usage tracking
  • A self-service portal where any developer can request a new model-serving environment, choose their model, and have it provisioned in under 30 minutes
  • Automatic monitoring that alerts on unusual inference patterns, unexpected model loads, and resource utilization anomalies
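
The alerting on unusual inference patterns does not have to be sophisticated to be useful. As an illustration only (not our production rule), the sketch below flags any hour whose request count sits well above the trailing 24-hour average. The function name, window, and thresholds are all assumptions to tune against your own traffic.

```python
from statistics import mean, pstdev


def spike_hours(hourly_counts, window=24, threshold=3.0, min_std=1.0):
    """Return indexes of hours whose count exceeds the trailing mean by `threshold` std devs."""
    flagged = []
    for i in range(window, len(hourly_counts)):
        history = hourly_counts[i - window:i]
        baseline = mean(history)
        # The floor keeps a perfectly flat history from flagging tiny changes.
        spread = max(pstdev(history), min_std)
        if hourly_counts[i] > baseline + threshold * spread:
            flagged.append(i)
    return flagged
```

One known weakness of a trailing window is that a sustained abuse period inflates its own baseline, so it works best paired with a hard ceiling on requests per client.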

Developers can deploy any model they want — we don’t restrict model choice. But the deployment goes through our standard infrastructure pipeline with security scanning, network isolation, cost tracking, and monitoring. No more docker run -p 11434:11434 ollama/ollama on a public-facing VM.

The adoption was initially met with resistance. Engineers said we were “slowing them down” and “adding bureaucracy to experimentation.” But after we shared the SentinelLABS data and showed them what happens when Ollama instances get compromised — including the financial cost of stolen GPU compute and the legal exposure from data exfiltration — the resistance evaporated. Engineers understood that the 30-minute provisioning time was a small price for not becoming part of someone else’s GPU botnet.

The shadow AI problem is the shadow IT problem all over again. Every lesson we learned about cloud governance in 2015-2020 applies directly to AI infrastructure governance in 2025-2026. The organizations that recognize this early and build the guardrails now will avoid the painful incident-driven learning curve that the rest will go through.