4 May 2026: AI Beats Doctors in Harvard Triage Test, GPT-5.5 Hits Mythos Cyber Bar (AM)
OpenAI's o1 beat physician baselines at emergency triage in a Harvard study, while AISI confirms GPT-5.5 matches Mythos Preview at the cybersecurity frontier.
Harvard medical AI research, a fresh Oxford study on chatty models, a new GPT-5.5 cybersecurity benchmark and the first US state ban on nudification apps all point the same way: AI is being trusted with steadily bigger decisions, and the rules that govern it are being written on the fly.
An OpenAI model outperformed two human physicians at first-touch diagnosis in a new Harvard study published in Science this week. A Harvard Medical School and Beth Israel team ran 76 real emergency room cases through attending physicians and through OpenAI’s o1 and GPT-4o. At the initial triage step, where information is thin and pressure is high, o1 returned the exact diagnosis, or a very close one, 67% of the time, against 55% and 50% for the two human doctors.
Lead author Arjun Manrai of Harvard said the model “eclipsed both prior models and our physician baselines”, but the team stressed this is not a green light for live use and called for proper clinical trials.
The headline has been challenged. Kristen Panthagani, a US emergency physician, noted that the human comparison group were internal medicine attendings rather than ER doctors, and that an ER physician’s first job is to rule out whatever could kill the patient, not to name the final diagnosis. Adam Rodman of Beth Israel told The Guardian there is “no formal framework right now for accountability” around AI diagnoses. For NHS Trusts piloting clinical AI, the takeaway is the one the paper itself makes: a study on retrospective text data is not the same as live patient care.
Oxford researchers found that AI models tuned to be warmer and more empathetic make more factual errors, especially when users push back against correct answers. The study, published in Nature this week by the Oxford Internet Institute, took four open-weight models, including Meta’s Llama 3.1 and a Mistral release, and fine-tuned each to use caring personal language, validate users’ feelings and lean on inclusive pronouns. Both the warmer and the original versions were then run through hundreds of prompts with clear right or wrong answers, including medical knowledge and disinformation tests.
The warmer versions were noticeably more likely to validate a user’s incorrect belief, agree with conspiracy theories and miss medical facts, even though the style instructions were meant to leave meaning untouched. The same training that makes a chatbot feel friendly also makes it more likely to tell you what you want to hear. For UK users of ChatGPT, Claude or Copilot, the friendlier the tone, the more important it is to double-check anything involving money, health or the law.
OpenAI’s GPT-5.5 has matched Anthropic’s heavily hyped Mythos Preview on the UK AI Safety Institute’s cybersecurity benchmark, raising the bar across the frontier model field. Across 95 Capture the Flag challenges spanning reverse engineering to cryptography, AISI found GPT-5.5 passed 71.4% of the hardest “Expert” tasks against 68.6% for Mythos Preview. On a 32-step simulated network intrusion, GPT-5.5 succeeded in 3 of 10 attempts against 2 of 10 for Mythos; no previous model had completed it even once.
The takeaway from AISI is that Mythos Preview’s much-discussed cyber capability was not a one-off but a sign that long-horizon autonomy, reasoning and coding are improving across the board. For UK security teams, that means assuming any current frontier model can complete multi-step intrusion sequences, and planning red-team exercises with that as the baseline.
Minnesota became the first US state to ban so-called nudification apps, exposing developers to fines of up to $500,000 for distributing tools that generate explicit AI images of real people. The new state law treats the apps themselves as unlawful instruments, not just their output, going further than existing federal proposals, which focus on the act of sharing fake images. Major app stores have already pulled several offending tools in response.
For UK readers this is a useful preview. The Online Safety Act’s revised illegal content codes are expected to designate AI-generated intimate images as priority offences this summer. Minnesota’s law gives Ofcom a working template, and gives victims clearer grounds to demand takedowns when the same apps surface in UK app stores.
KC Green, the artist who drew the “This is fine” dog, says US AI sales startup Artisan put his cartoon on a subway billboard without permission. A Bluesky post shows the ad, which redraws the smiling dog saying “my pipeline is on fire” and tells passers-by to “Hire Ava the AI BDR”. Green said he did not agree to the use, called it “stolen like AI steals”, and asked followers to vandalise the posters. Artisan told TechCrunch it has “a lot of respect” for Green and is reaching out.
The story is small, but it sums up the legal grey zone UK creators face online. There is no UK case law that clearly says using a copyrighted image to train or inspire an AI ad is infringement, and until there is, artists will keep finding their work in places they did not authorise.
Worth Watching
- ChatGPT. Best for: general text queries, coding help and reasoning tasks. The same OpenAI o1 model benchmarked against doctors this week is in millions of consumer pockets.
- Le Chat (Mistral). Best for: a European-hosted AI assistant for privacy-sensitive work. The Mistral family sits at the centre of this week’s Oxford warmth study and offers an EU-domiciled alternative.
- Claude (Anthropic). Best for: long-context writing, research and careful reasoning. Claude’s Mythos Preview was just matched by GPT-5.5 on AISI cybersecurity tests, putting both at the frontier this week.
Here is everything else worth knowing from this morning’s AI news.
- ChatGPT Images 2.0 takes off in India: India is now the biggest market for OpenAI’s revamped image generator, while global engagement is up only around 1%, per Sensor Tower data. [30 Apr]
- Replit’s Amjad Masad on the Cursor deal: The Replit chief talked publicly about the Cursor partnership, the fight with Apple over App Store rules, and why he would rather not sell. [1 May]
- Oscars bar AI-generated actors and screenplays: The Academy confirmed that performances must be by humans, with consent, and that screenplays must be human-authored. The UK’s BAFTA has not yet followed. [2 May]
- Pentagon adds Nvidia, Microsoft, AWS to classified-AI roster: The DoD’s expanded vendor list further sidelines Anthropic over its mass-surveillance and autonomous-weapons guardrails. [1 May]
- DeepClaude open-sourced on DeepSeek V4 Pro: An open-source agent loop replicates Claude Code-style behaviour on DeepSeek’s V4 Pro model, claiming a 17x cost reduction. [3 May]
- Apple SHARP runs in the browser: A developer ported Apple’s single-image 3D Gaussian splatting model to run entirely in a browser tab via ONNX Runtime. [3 May]
- The best AI dictation apps tested and ranked: TechCrunch published a fresh ranking of voice-to-text AI apps, with Wispr Flow leading the pack for everyday writing and coding. [2 May]
This is a daily news update for informational purposes only. AI products and policies change rapidly. Verify details directly with providers before making decisions. Nothing here is financial or legal advice.
AI Daily is Cristoniq’s daily guide to developments in artificial intelligence, published every morning.