Vulnerabilities | May 7, 2026 | 12 min read

CVE-2026-7482: Bleeding Llama and the Critical Ollama Memory Leak

Cyera Research disclosed Bleeding Llama, a critical unauthenticated memory disclosure vulnerability in Ollama before version 0.17.1. By supplying a malicious GGUF model file to an exposed Ollama server, attackers can trigger a heap out-of-bounds read and recover sensitive heap memory including prompts, API keys, tokens, and environment variables.

Local AI tools have changed how developers, researchers, and companies work with large language models. Instead of sending prompts to a cloud-hosted AI provider, teams can run models on their own machines or private infrastructure. One of the most popular tools for this is Ollama, an open-source platform for running large language models locally.

That local-first model gives users more control, but it also creates a security challenge. Once a tool meant for local use becomes exposed over a network, especially without authentication, it can become a serious risk.

That is why CVE-2026-7482, also known as Bleeding Llama, matters.

Cyera Research disclosed Bleeding Llama as a critical unauthenticated memory disclosure vulnerability in Ollama. The issue affects Ollama versions before 0.17.1 and can allow a remote attacker to leak sensitive memory from an exposed Ollama server. The CVE record describes it as a heap out-of-bounds read vulnerability in the GGUF model loader.

What is Ollama?

Ollama is an open-source platform that allows users to run large language models on their own hardware. Developers use it to run models such as Llama, Mistral, and others without relying entirely on external AI providers. Cyera describes Ollama as widely used in enterprise environments as a self-hosted AI inference engine, with significant adoption across GitHub and Docker Hub.

For many teams, Ollama is attractive because it feels private. A developer can run a model locally, test prompts, analyze code, or build an internal AI workflow without sending everything to a third-party service.

But "local" does not always stay local.

In practice, teams often expose Ollama over a network so other tools, teammates, services, or containers can access it. That is where the risk grows. A service that was safe on localhost can become dangerous when it is reachable from an internal network or the public internet.

What is CVE-2026-7482?

CVE-2026-7482 is a critical vulnerability in Ollama before version 0.17.1. The official CVE record says the vulnerability exists in the GGUF model loader and involves the /api/create endpoint accepting an attacker-supplied GGUF file with declared tensor offset and size values that exceed the actual file length. During quantization, Ollama can read past the allocated heap buffer.

That technical description matters, but the practical explanation is simpler: an attacker can provide a malicious model file that tricks Ollama into reading memory it should not read.
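To make the missing check concrete, here is a minimal Python sketch of the kind of validation a model loader needs before it touches tensor data. It is not Ollama's code and it does not parse real GGUF files: the TensorInfo class, the data_start parameter, and the function name are hypothetical, chosen only to show the comparison between a declared byte range and the actual file length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorInfo:
    """Hypothetical, simplified view of one tensor entry in a model file."""
    name: str
    offset: int  # declared byte offset, relative to the start of the tensor data
    size: int    # declared byte length of the tensor


def validate_tensor(t: TensorInfo, data_start: int, file_length: int) -> None:
    """Reject a tensor whose declared byte range is not fully inside the file."""
    if t.offset < 0 or t.size < 0:
        raise ValueError(f"{t.name}: negative offset or size")
    end = data_start + t.offset + t.size
    if end > file_length:
        raise ValueError(
            f"{t.name}: declared range ends at byte {end}, "
            f"but the file is only {file_length} bytes long"
        )


# A 128-byte file whose tensor data starts at byte 28 (values are illustrative).
validate_tensor(TensorInfo("ok", offset=0, size=100), data_start=28, file_length=128)
try:
    validate_tensor(TensorInfo("bad", offset=64, size=4096), data_start=28, file_length=128)
except ValueError as err:
    print("rejected:", err)
```

The point of the sketch is that the file's metadata is attacker-controlled input, so every declared offset and size has to be checked against what was actually received before any read happens.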

That memory may contain sensitive information from the Ollama process. According to Cyera and other reporting, this can include prompts, chat messages, system prompts, environment variables, API keys, tokens, secrets, proprietary code, and other data flowing through the local AI service.

The vulnerability is especially concerning because it requires no authentication: if the Ollama API is reachable over the network, an attacker needs no credentials to exploit it. All versions before 0.17.1 are affected.

Why is it called Bleeding Llama?

The name Bleeding Llama is a reference to the nature of the vulnerability: sensitive memory can "bleed" out of the Ollama process. The "Llama" part refers to the model ecosystem commonly associated with local AI tools such as Ollama.

The name also echoes earlier memory disclosure vulnerabilities, where attackers could read data from memory that should have remained private. In this case, the sensitive data is not just generic server memory. It may include the exact information users trusted their local AI system to keep private.

How the Vulnerability Works

Bleeding Llama is not a normal software crash. It is a memory disclosure bug.

At a high level, the attack works like this:

1. An attacker creates or supplies a malicious GGUF model file.
2. The file contains manipulated metadata, including tensor offset and size values.
3. Ollama processes the file through its model creation or quantization flow.
4. Because of insufficient bounds checking, Ollama reads past the memory region it should access.
5. Sensitive heap memory from the Ollama process can become embedded into the resulting model artifact.
6. The attacker can then retrieve or exfiltrate that artifact.
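The steps above can be illustrated with a toy simulation. Python itself does not over-read buffers, so the sketch below fakes a process heap as one flat bytearray in which an uploaded buffer happens to sit next to unrelated data. Everything here (the arena, the fake key, the function names) is invented for illustration and says nothing about Ollama's real memory layout.

```python
# Toy model of a process heap: one flat arena holding two neighbouring regions.
UPLOAD = b"TENSORDATA"                      # the 10 bytes the attacker actually sent
NEIGHBOUR = b"API_KEY=sk-example-not-real"  # unrelated data that sits right after it
arena = bytearray(UPLOAD + NEIGHBOUR)


def read_unchecked(offset: int, size: int) -> bytes:
    """Trust the declared size, as a loader without bounds checks would."""
    return bytes(arena[offset:offset + size])


def read_checked(offset: int, size: int, buffer_len: int = len(UPLOAD)) -> bytes:
    """Refuse any read that leaves the buffer that was actually allocated."""
    if offset < 0 or size < 0 or offset + size > buffer_len:
        raise ValueError("declared tensor range exceeds the uploaded buffer")
    return bytes(arena[offset:offset + size])


# The file claims its tensor is 32 bytes long, but only 10 bytes were uploaded.
leaked = read_unchecked(0, 32)
print(leaked)  # starts with the upload, then continues into the neighbouring bytes
```

In the unchecked path the extra bytes come from whatever lives next to the buffer, which is why the leaked content depends on what the process handled recently.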

Cyera says the attack can be performed with only a small number of unauthenticated API calls when the Ollama service is exposed. SecurityWeek similarly reported that exploitation may require only three unauthenticated API calls and could allow attackers to extract sensitive data from exposed deployments.

This is serious because the vulnerable component handles model files, and model files are often treated as trusted assets in AI workflows. Bleeding Llama shows that model ingestion itself can become an attack surface.

What Data Could Be Exposed?

The exact data exposed depends on how Ollama is deployed and what has recently passed through the Ollama process. Possible leaked data may include:

User prompts
System prompts
Chat history fragments
Environment variables
API keys
Access tokens
Internal service credentials
Proprietary source code
Customer data included in prompts
Tool outputs routed through AI agents
Internal documents or summaries
Personally identifiable information
Health, financial, or business-sensitive data, depending on usage

Cyera specifically warns that data in memory may include prompts, environment variables, API keys, tokens, and other secrets.

This matters because AI systems often sit at the center of sensitive workflows. Developers paste stack traces, source code, configuration files, logs, and credentials into prompts more often than security teams would like. Internal AI assistants may also receive customer support tickets, contracts, database outputs, and private documentation.

If those interactions pass through a vulnerable Ollama server, they may become part of the memory an attacker can read.

Who is Affected?

The affected product is Ollama before version 0.17.1. You should consider yourself potentially affected if:

You run Ollama older than 0.17.1.
Your Ollama API is reachable from other machines.
You exposed Ollama using OLLAMA_HOST=0.0.0.0.
You published port 11434 through Docker.
You exposed Ollama through Kubernetes, a reverse proxy, a tunnel, or a cloud firewall rule.
You allow untrusted users or services to call the Ollama API.
Your Ollama server processes sensitive prompts, code, secrets, documents, or tool outputs.

The highest-risk scenario is an Ollama instance directly reachable from the public internet without authentication or a firewall.

SecurityWeek reported Cyera's warning that roughly 300,000 Ollama deployments may be exposed and at risk of information theft. That number should be treated as a reported exposure estimate, but the underlying point is clear: many teams have deployed Ollama in ways that make it network-accessible.

Why Network Exposure Changes Everything

Ollama is commonly used as a local developer tool. If it is bound only to localhost and only trusted local users can reach it, the risk is much smaller.

But many real deployments are not that simple. Teams expose Ollama so it can be used by web apps, internal chat interfaces, AI coding tools, agent frameworks, retrieval-augmented generation systems, containerized applications, remote developers, shared GPU servers, and internal automation systems.

Once exposed, Ollama becomes an API service. If that service does not have authentication, authorization, rate limiting, network segmentation, and monitoring, it becomes a high-value target.

Bleeding Llama is a reminder that "local AI" does not automatically mean "secure AI." Local tools still need the same security discipline as any other server-side system.

Why This Vulnerability Is Especially Important for AI Systems

Traditional web application vulnerabilities often expose databases, files, or user sessions. AI infrastructure has a different risk profile.

An AI inference server may process a mix of sensitive data from many systems. It might receive source code from developers, credentials from automation tools, private documents from a retrieval system, internal tickets from support workflows, or personal data from customer-facing applications. That makes the AI service a concentration point for sensitive information.

Bleeding Llama is important because it targets the memory of that concentration point. Even if the attacker cannot directly access the database, Git repository, ticketing system, or secrets manager, they may still recover fragments of data that passed through the AI server.

This is why memory disclosure vulnerabilities in AI infrastructure deserve serious attention. They do not only threaten the model. They threaten the data moving through the model runtime.

Severity and Scoring

Cyera's advisory lists CVE-2026-7482 as CVSS 9.1 Critical. Some third-party reporting has cited a score of 9.3, but Cyera's own advisory uses 9.1. Whichever score is used, the practical severity is high for several reasons:

Exploitation can be remote.
Authentication may not be required.
The vulnerable endpoint can process attacker-controlled model data.
The impact is sensitive information disclosure.
Exposed Ollama deployments are common.
The leaked data may include credentials and secrets.

How to Check If You Are Vulnerable

First, check your Ollama version:

ollama --version

If the version is older than 0.17.1, upgrade immediately.
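If you need to run this check across many machines, the comparison can be scripted. The sketch below is a minimal example that assumes the version output contains a plain X.Y.Z number somewhere in the text; the sample string passed in at the end is an assumed output shape, not a guaranteed one. Versions are compared as integer tuples because comparing them as strings would wrongly rank 0.9.0 above 0.17.1.

```python
import re

FIXED = (0, 17, 1)  # first patched release according to the advisory


def parse_version(text: str) -> tuple:
    """Pull the first X.Y.Z triple out of text such as `ollama --version` output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_vulnerable(version_output: str) -> bool:
    """True if the reported version is older than the fixed release.

    Pre-release suffixes (for example "-rc1") are ignored by this sketch.
    """
    return parse_version(version_output) < FIXED


print(is_vulnerable("ollama version is 0.16.3"))  # assumed output format
```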

Next, check whether Ollama is listening beyond localhost. On Linux:

ss -ltnp | grep 11434

On macOS, which does not ship ss, or on Linux as an alternative:

lsof -nP -iTCP:11434 -sTCP:LISTEN

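If Ollama is listening on 0.0.0.0 (some tools show this as *:11434 or, for IPv6, [::]:11434), it may be reachable from other machines depending on firewall and routing rules. An address of 127.0.0.1 or [::1] means it is bound to localhost only.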
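For scripted audits, the "is this address localhost-only?" decision can be sketched in a few lines of Python. The function below is a hypothetical helper that takes a host:port string copied from the listening-socket output and uses the standard ipaddress module; anything it cannot parse is treated as exposed, which is a deliberately cautious assumption.

```python
import ipaddress


def is_loopback_only(listen_addr: str) -> bool:
    """Return True if a listen address like '127.0.0.1:11434' is loopback-only."""
    host, _, _port = listen_addr.rpartition(":")
    host = host.strip("[]")  # IPv6 addresses are often printed as [::1]:11434
    if host in ("", "*"):
        return False  # wildcard or missing host: assume all interfaces
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # unknown form: treat as exposed to be safe


for addr in ("127.0.0.1:11434", "0.0.0.0:11434", "[::]:11434", "*:11434"):
    print(addr, "loopback only" if is_loopback_only(addr) else "possibly exposed")
```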

Also check Docker port mappings:

docker ps

Look for mappings such as 0.0.0.0:11434->11434/tcp, which means the service may be exposed on all network interfaces of the host.
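The same check can be applied to the text in the ports column of docker ps. The sketch below is a hypothetical helper that assumes entries shaped like the mapping shown above, separated by commas; it flags any entry that publishes the container's port 11434 on a wildcard host address.

```python
def published_on_all_interfaces(ports_column: str, container_port: int = 11434) -> bool:
    """Return True if the container port is published on a wildcard host address."""
    for entry in ports_column.split(","):
        entry = entry.strip()
        if "->" not in entry:
            continue  # port is exposed by the image but not published to the host
        host_side, container_side = entry.split("->", 1)
        if container_side.split("/")[0] != str(container_port):
            continue  # a different container port
        host_ip = host_side.rpartition(":")[0].strip("[]")
        if host_ip in ("0.0.0.0", "::"):
            return True
    return False


print(published_on_all_interfaces("0.0.0.0:11434->11434/tcp"))    # wildcard binding
print(published_on_all_interfaces("127.0.0.1:11434->11434/tcp"))  # loopback binding
```

A mapping bound to 127.0.0.1 keeps the API reachable only from the host itself, which is usually what a single-developer setup needs.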

For Kubernetes, check services and ingress rules:

kubectl get svc -A | grep ollama
kubectl get ingress -A

Also review cloud firewall rules, security groups, load balancers, reverse proxies, and tunnels.

How to Fix CVE-2026-7482

The primary fix is straightforward: upgrade Ollama to version 0.17.1 or later.

After upgrading, restart the Ollama service and confirm the version:

ollama --version

But patching alone is not enough if the service remains exposed without controls. You should also harden the deployment.

Recommended Mitigation Steps

1. Upgrade immediately. Install Ollama 0.17.1 or later. Treat older versions as vulnerable.

2. Restrict network access. Do not expose Ollama directly to the public internet. Limit access to trusted hosts, private networks, VPNs, or internal service meshes.

3. Avoid unauthenticated public APIs. If remote access is required, place Ollama behind an authenticated reverse proxy, gateway, or internal access control layer.

4. Review OLLAMA_HOST. Check whether Ollama has been configured to listen on all interfaces with OLLAMA_HOST=0.0.0.0. Only use broad binding when you also have strong network and authentication controls.

5. Rotate secrets. If an affected Ollama instance was exposed to untrusted networks, rotate any secrets that may have existed in the Ollama process environment or passed through prompts. This includes API keys, tokens, cloud credentials, database passwords, and internal service credentials.

6. Audit logs and model activity. Look for suspicious calls to model creation, upload, quantization, or push-related endpoints.

7. Reduce sensitive data in prompts. Do not paste secrets, credentials, or raw production data into prompts unless absolutely necessary. Use redaction, tokenization, and least-privilege workflows.

8. Isolate the Ollama process. Run Ollama in a container, VM, or restricted environment with limited access to secrets and files. Avoid loading broad environment variables into the process.

9. Monitor exposed assets. Use external attack surface management, cloud inventory, or basic port scanning to identify systems exposing port 11434.

10. Treat AI infrastructure as production infrastructure. Apply standard controls: authentication, authorization, logging, patching, network segmentation, egress restrictions, and incident response procedures.
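For step 9, a basic reachability check does not need a dedicated scanner. The following Python sketch simply attempts a TCP connection to port 11434 on each host in a list you supply; the function names and the example host list are hypothetical, and it should only be pointed at systems you are authorized to test. An open port shows that something is listening there, not that it is Ollama or that it is vulnerable.

```python
import socket


def port_is_open(host: str, port: int = 11434, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name not resolved
        return False


def find_exposed(hosts, port: int = 11434):
    """Return the subset of hosts that accept connections on the given port."""
    return [host for host in hosts if port_is_open(host, port)]


if __name__ == "__main__":
    # Replace with hosts from your own inventory.
    print(find_exposed(["127.0.0.1"]))
```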

Incident Response Guidance

If you discover that a vulnerable Ollama instance was exposed to the internet or an untrusted network, treat it as a potential data exposure event.

A practical response plan would include:

Take the instance offline or restrict access immediately.
Upgrade to Ollama 0.17.1 or later.
Preserve logs for investigation.
Identify what data passed through the Ollama service.
Rotate secrets available to the process or included in prompts.
Review outbound network activity.
Look for unusual model creation, upload, or push behavior.
Notify internal security, privacy, or compliance teams if regulated data may have been exposed.
Rebuild the service with safer network and authentication controls.
Document lessons learned and update deployment standards.
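When reviewing logs for unusual model creation, upload, or push behavior, a small filter can narrow down the lines worth reading. The sketch below assumes each log line contains the HTTP method followed by the request path; that format, the sample lines, and the helper name are hypothetical and need adapting to whatever your server or reverse proxy actually writes. The /api/create path comes from the CVE description, while /api/blobs and /api/push are other Ollama API paths assumed here to correspond to upload and push activity.

```python
import re

WATCHED_PATHS = ("/api/create", "/api/blobs", "/api/push")
REQUEST = re.compile(r'\b(POST|PUT)\s+"?(/api/[^\s"]*)')


def suspicious_requests(log_lines):
    """Return log lines whose request path starts with one of the watched paths."""
    hits = []
    for line in log_lines:
        match = REQUEST.search(line)
        if match and match.group(2).startswith(WATCHED_PATHS):
            hits.append(line.rstrip("\n"))
    return hits


# Made-up lines in a hypothetical format, using a documentation-range IP address.
sample = [
    'client=203.0.113.7 POST "/api/create" status=200',
    'client=127.0.0.1 POST "/api/generate" status=200',
    'client=203.0.113.7 POST "/api/push" status=200',
]
for line in suspicious_requests(sample):
    print(line)
```

Matches are a starting point for review, not proof of exploitation: legitimate users also create and push models.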

The most important mindset is this: if a vulnerable instance was reachable by untrusted users, assume memory contents may have been exposed.

Lessons for Local AI Security

Bleeding Llama is bigger than one bug in one tool. It highlights several broader lessons for AI infrastructure.

Local does not always mean private. A local AI tool is private only if it stays local or is protected properly. Once exposed over a network, it becomes a service that needs real security controls.

Model files are part of the attack surface. AI systems often ingest models, adapters, embeddings, prompts, documents, and tool outputs. Any of those inputs can become a path to exploitation if parsers and loaders are vulnerable.

Prompts can contain secrets. Organizations often underestimate how much sensitive information flows through AI systems. Prompt data should be treated as sensitive application data.
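One way to act on this is to scrub obvious secrets before a prompt ever reaches the model server. The following Python sketch is a deliberately simple, pattern-based redactor; the two regular expressions and the redact function are illustrative assumptions, and pattern matching like this will miss many secret formats, so it complements rather than replaces keeping credentials out of prompts.

```python
import re

# NAME=value or NAME: value, where NAME ends in KEY, TOKEN, SECRET, or PASSWORD.
KEY_VALUE = re.compile(
    r"(?i)\b([A-Z0-9_]*(?:KEY|TOKEN|SECRET|PASSWORD))(\s*[=:]\s*)(\S+)"
)
# HTTP-style bearer credentials.
BEARER = re.compile(r"(?i)\b(Bearer\s+)([A-Za-z0-9._~+/=-]{8,})")


def redact(text: str) -> str:
    """Replace values that look like credentials with a placeholder."""
    text = KEY_VALUE.sub(r"\1\2[REDACTED]", text)
    text = BEARER.sub(r"\1[REDACTED]", text)
    return text


print(redact("Why does this fail? OPENAI_API_KEY=sk-example-not-real"))
```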

AI services need authentication. A powerful local model server without authentication may be acceptable for a single developer laptop. It is not acceptable as an exposed enterprise service.

Memory safety still matters. Even in modern AI infrastructure, old classes of bugs such as out-of-bounds reads remain highly relevant. When memory bugs occur in systems that process sensitive data, the impact can be severe.

Final Takeaways

CVE-2026-7482, or Bleeding Llama, is a critical reminder that local AI infrastructure must be secured like any other production system.

The vulnerability affects Ollama versions before 0.17.1 and can allow unauthenticated attackers to leak sensitive memory from exposed Ollama servers. The leaked data may include prompts, system instructions, API keys, tokens, environment variables, proprietary code, and other sensitive information.

The immediate action is clear: upgrade Ollama to 0.17.1 or later.

But the deeper lesson is just as important. Do not expose local AI tools without authentication, network controls, logging, and a clear understanding of what data flows through them. Local AI can improve privacy, but only when deployed with care.

Sources

Cyera Research — Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama