Anthropic’s new model tier finds 27-year-old bugs. Who can actually use Claude Mythos Preview?
Claude Mythos Preview is Anthropic’s most capable model, gated to security researchers. Here is what it can do, and why it is not publicly available
Anthropic has built a model it will not release. Claude Mythos Preview is its most capable AI to date, already finding security flaws that have sat undetected in major operating systems for decades. Here is what it can do, why access is gated, and what it means if you work in software or security.
The Short Version
- Claude Mythos Preview matters because AI systems can sound confident while hiding important limits.
- The useful question is what the system can prove, not just what it can produce.
- Good AI use depends on context, verification and sensible boundaries.
- The practical answer is to treat the tool as support for judgement, not a replacement for it.
What The Term Really Means
Claude Mythos Preview sits in a new tier above Opus in Anthropic’s model family, described by the company as larger and more intelligent than the Opus line. Architecture and parameter count have not been published. The model accepts text and image inputs, produces text output, carries a one-million-token context window, and supports extended reasoning. It launched alongside Project Glasswing, a joint initiative pairing the model with security teams working on real-world software vulnerabilities.
The UK government frontier AI paper is useful background because it explains why capability, reliability and risk need to be judged together.
How it compares to what you can actually use. Nothing on the market does quite what Claude Mythos Preview does in a controlled defensive security setting. The comparison below uses the figures Anthropic published at launch, which benchmarked Mythos against Claude Opus 4.6. Claude Opus 4.7 has since been released as the current publicly available flagship model with improved scores across several benchmarks, but is not included in Anthropic’s Glasswing system card.
For everyone else, Claude Opus 4.7 remains the most capable generally available Claude model; pricing and specs are at anthropic.com/pricing . Project Glasswing is aimed at organisations that build or maintain critical software infrastructure. The programme is not designed for individual developers or small teams without a security research mandate.
A useful way to test claude mythos preview is to ask what evidence the system used and what it left out. AI output is strongest when the source and the task are both clear.
Why The Details Matter
Founding partners include Amazon Web Services, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks, alongside more than 40 additional organisations that build or maintain critical infrastructure software. Anthropic provides $100 million in usage credits for programme participants, plus $4 million in donations to open-source security foundations: $2.5 million to Alpha-Omega and the OpenSSF via the Linux Foundation, and $1.5 million to the Apache Software Foundation. Access to the model runs through the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
Note: Claude Opus 4.7 is now Anthropic’s current publicly available flagship model and has improved on several of the Opus 4.6 figures above. The table uses the Opus 4.6 figures published in Anthropic’s Glasswing system card because those are the direct comparisons Anthropic ran against Mythos Preview.
Every capability benchmark cited in Anthropic’s announcement was produced by Anthropic on its own model. The AISI’s independent evaluation covers cybersecurity tasks only; read the full report at aisi.gov.uk before drawing conclusions about general capability. Broader access to Mythos, if it comes, will likely remain gated. Anthropic plans new safeguards alongside a future Opus release, and a Cyber Verification Program for security professionals is in development.
The next step is to check whether the answer can be verified outside the model. If the claim matters, it needs a source, a calculation or a human review.
Where People Get Misled
The benchmark picture. The capability benchmarks come almost entirely from Anthropic running its own model under its own testing conditions. According to Anthropic’s published data at anthropic.com/glasswing , Mythos Preview scores 93.9% on SWE-bench Verified against 80.8% for Claude Opus 4.6. On SWE-bench Pro, a harder variant designed to reduce memorisation risk, the gap widens: 77.8% versus 53.4%. Anthropic states that memorisation screens were applied and the lead holds after excluding flagged problems. On Terminal-Bench 2.0, which tests sustained agentic coding over long sessions, Mythos scores 82.0% versus 65.4% for Opus 4.6.
The misconception worth naming. Most people reading about Mythos assume it represents a general leap forward across all AI tasks. The benchmark data does not quite support that framing. On GPQA Diamond, a hard graduate-level reasoning test, Mythos scores 94.6% against Opus 4.6’s 91.3%, a 3.3-percentage-point margin. On BrowseComp, the gap is 86.9% versus 83.7%. These are real but incremental improvements, not a step change in general capability.
Images: Anthropic / Project Glasswing. Used for editorial reporting purposes.
That habit keeps the tool useful without giving it more authority than it deserves.
How To Test The Claim
The UK AI Security Institute (AISI) provided the only independent evaluation, covering cybersecurity tasks. In expert-level capture-the-flag tests that no model could complete before April 2025, Mythos achieved a 73% success rate. The AISI also ran The Last Ones (TLO), a 32-step corporate network attack simulation estimated to require around 20 human hours. Mythos was the first model to complete it from start to finish, doing so in 3 of 10 attempts, averaging 22 of 32 steps. Claude Opus 4.6 averaged 16 steps in the same test. (Source: aisi.gov.uk , 13 April 2026.)
Where Mythos is genuinely in a different category from every other publicly tested model is cybersecurity: autonomous vulnerability discovery and exploitation at a scale and sophistication that an independent body, the AISI, confirmed cannot yet be matched. Anthropic itself notes that standard benchmarks mostly saturate at Mythos’s capability level, which is why its testing shifted to real-world novel tasks. A model that is modestly better at general reasoning but categorically more capable at autonomous cyberattack is a different kind of product from a simple upgrade.
The Practical Risks
Anthropic’s own Red Team tests, reported at red.anthropic.com/2026/mythos-preview/ , describe further results. Mythos developed 181 working JavaScript shell exploits from Firefox 147 engine vulnerabilities; Opus 4.6 produced 2 from several hundred attempts under the same conditions. On an internal OSS-Fuzz test across approximately 1,000 repositories, Mythos achieved full control-flow hijack , the highest severity tier , on 10 fully patched targets. All of these results are vendor-reported by Anthropic.
Source: anthropic.com/glasswing , April 2026. All benchmark scores in the system card are vendor-reported by Anthropic on its own model. Independent evaluation by the AISI covers cybersecurity tasks only.
What To Watch Next
Real-world bugs found during the programme include a 27-year-old OpenBSD remote crash flaw, a 16-year-old integer overflow in FFmpeg’s H.264 codec, chained Linux kernel vulnerabilities escalating user access to machine control, and a 17-year-old FreeBSD NFS remote code execution flaw that Mythos exploited in under half a day at a cost of under $1,000.
What this means for you. If you work in software security, infrastructure, or open-source maintenance, the practical starting point is anthropic.com/glasswing , where Anthropic describes the programme and how to register interest. A public report on vulnerabilities found and fixed is due within 90 days of the programme’s launch.
A Worked Example
Imagine a reader is looking at claude mythos preview and trying to decide whether it matters in practice. The first mistake would be to accept the label without checking the details behind it.
A better approach is to list the claim, the evidence, the cost and the downside. If any one of those is unclear, the decision needs more work before it deserves confidence.
That small pause changes the whole exercise. Instead of reacting to a headline, the reader is testing whether the idea survives contact with real constraints.
What This Means For You
The useful point is not to memorise every detail of claude mythos preview. It is to know which questions make the topic safer to use.
Start with the plain-English version, then compare it with the evidence. The related Cristoniq guides on What is RAG, and why does it matter for business AI? and What is reasoning in AI? are good next checks.
If the idea still makes sense after that, you have a better basis for action. If it only works when the awkward details are ignored, that is the answer.
In Plain English
Claude Mythos Preview is not a magic phrase. It is a practical idea that needs context before it becomes useful.
The simple rule is to ask what the term means, what problem it solves, and what new risk it creates.
When those answers are clear, the topic becomes easier to judge. When they are vague, slow down.