Tests Show AI Isn’t as Good as They Said It Was

PLUS: The FBI is after Archive.today, another voice-recording wearable, and an AI artist hits the major charts

Last week, I read a few articles about how artificial intelligence is being introduced into high-impact fields like healthcare and education. I’ve always believed that integrating new technology into the workforce can support productivity and improve the way we work. Nothing is worse than joining a new company that still operates like Dunder Mifflin from The Office.

That said, many of these new AI systems are being given responsibilities that require a high level of accuracy, expertise, and ethical judgment. Recent studies suggest that the tools used to evaluate AI performance often fail to reflect how these systems actually operate in real-world environments. As AI takes on a bigger role in professional settings, the gap between how it’s tested and how it’s used is becoming more important to understand, especially when it comes to safety, reliability, and effectiveness. It’s easy to let AI do the work, but even if the technology tells us exactly what it’s capable of, we should still maintain a healthy amount of skepticism.

One study from Scale AI and the Center for AI Safety tested how well AI agents could handle real freelance work such as writing, coding, design, and data analysis. These were actual jobs that people get paid to do. The results weren’t great. Even the best systems completed only about 2.5% of the tasks at a quality level acceptable to clients. Most of the work was incomplete, inaccurate, or didn’t meet professional standards. The few wins came from simple creative tasks, not projects that required planning, multi-step reasoning, or juggling multiple tools. Human freelancers, by comparison, produced better results, averaging 11.5 hours per task.

Another study from the Oxford Internet Institute looked at 445 common AI benchmarks, the same ones companies often point to when claiming progress. Most of them had serious flaws. Only 16% used reliable statistical comparisons, and nearly half didn’t define what they were measuring. Terms like “reasoning” and “helpfulness” were used loosely, if they were defined at all. Many benchmarks were based on narrow datasets or tasks so simple they barely reflected real-world use. The researchers said we’re not measuring what we think we are and called for basic fixes like clear definitions, shared standards, and stronger statistical practices.
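To make “reliable statistical comparison” a little more concrete: the kind of fix the Oxford researchers are asking for is, at minimum, reporting whether a score gap between two models could just be noise. The sketch below isn’t from either study and the numbers are made up; it’s one simple way (a paired bootstrap on per-question results) to check whether a benchmark gap is actually meaningful before anyone claims model A “beats” model B.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-question results (1 = correct, 0 = wrong) for two models
# answering the same 200 benchmark questions. In a real evaluation these
# would come from the benchmark harness, not a random generator.
n_questions = 200
model_a = rng.binomial(1, 0.72, size=n_questions)
model_b = rng.binomial(1, 0.68, size=n_questions)

observed_gap = model_a.mean() - model_b.mean()

# Paired bootstrap: resample questions (keeping each model's answer to the
# same question together) and recompute the accuracy gap many times.
n_boot = 10_000
gaps = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n_questions, size=n_questions)
    gaps[i] = model_a[idx].mean() - model_b[idx].mean()

lo, hi = np.percentile(gaps, [2.5, 97.5])

print(f"Model A accuracy: {model_a.mean():.1%}")
print(f"Model B accuracy: {model_b.mean():.1%}")
print(f"Observed gap: {observed_gap:+.1%}")
print(f"95% bootstrap interval for the gap: [{lo:+.1%}, {hi:+.1%}]")
# If that interval includes zero, the benchmark hasn't shown a real difference.
```

If the interval straddles zero, a leaderboard jump of a few points is indistinguishable from noise, which is exactly the kind of claim the Oxford review found most benchmarks can’t support.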

Even with these limitations, AI is being folded into fields that demand accuracy and expertise. In medical education, Google Cloud and Adtalem Global Education are launching credentials to train students on using AI in clinical settings with a focus on ethics and safety. In biotech, Anthropic’s new Claude for Life Sciences helps with literature reviews and regulatory writing. AI is moving deeper into high-stakes environments while questions about its reliability remain.

Together, these studies show how big the gap is between how AI is tested and how it’s actually used. The systems that ace lab benchmarks often stumble when faced with real-world tasks that require judgment, flexibility, or real understanding. As AI gets baked into more parts of professional life, we need better ways to evaluate how it performs outside the lab. Otherwise, it’s hard to tell whether these tools are really delivering on what they promise.

Tech News

Government Intervention

  • OpenAI published a “Teen Safety Blueprint” with five standards for AI products aimed at minors, including parental controls and age-appropriate design. The document shows the company trying to shape global rules before regulators do. (OpenAI)

    • The five points for AI companies to follow:

      1. Identify teens on the platform and treat them age-appropriately.

      2. Reduce risks to minors by prohibiting depictions of suicide or self-harm, intimate content, and violent content.

      3. If there is any doubt about a user’s age, default to an under-18 experience.

      4. Provide parental controls that allow families to manage their children’s accounts.

      5. Incorporate features based on the latest research on youth and AI.

  • Seven lawsuits filed in California accuse OpenAI’s ChatGPT of encouraging dangerous conversations that led to self-harm and suicide. The cases represent one of the first major legal tests of liability for psychological harm caused by AI. (Social Media Victims)

  • The FBI subpoenaed domain registrar data from Archive.today to identify ownership, showing growing attention to anonymous archiving and platform accountability. I will be keeping an eye on this story and the future of the site. (The Verge)

  • In Texas, Attorney General Ken Paxton sued Roblox, calling it a “breeding ground for predators” and accusing it of misleading parents about safety. (Game Developers)

  • In the UK, officials are investigating whether Chinese-made Yutong buses can be remotely disabled after Norway found similar capabilities. (Financial Times)

Cool New Things

  • The Sandbar Stream Ring, priced around $249, records and transcribes voice notes directly into an app. It ships in 2026 and marks a shift toward wearables built for productivity rather than fitness, though I would say the creepy Friend pendant was trying to do that too. I think we are going to see a lot more problems involving secret recordings made by wearable devices. (Sandbar)

  • Posha is a $1,500 countertop robot chef that uses sensors and computer vision to cook full meals with minimal human input. Early testers say it can handle complex recipes like creamy Tuscan chicken and paneer tikka masala without supervision. (Puck)

Music

  • Xania Monet became the first AI-generated artist to chart on a major radio list. Her creator, Telisha “Nikki” Jones, used the generative music tool Suno to turn poetry into songs that landed on the Adult R&B Airplay chart. Monet has already signed a multi-million dollar deal with Hallwood Media. (MusicRadar)

  • Universal Music Group signed a deal with AI music platform Udio. The agreement settles copyright disputes and allows Udio to use UMG’s catalog for a new AI-music platform launching in 2026. (Universal Music Group)

Creator Economy

  • Podcasters are starting to use AI voice clones from tools like ElevenLabs to translate episodes, fill missing dialogue, or simulate guest appearances. The tools make scaling easier and help reach global audiences, but they also raise questions about authenticity. (New York Times)

  • TikTok Shop said it rejected roughly 70M product listings and banned 700K sellers in the first half of 2025 after uncovering large-scale fraud powered by generative AI. Scammers used AI to create fake brands and listings that looked legitimate. (Mashable)

  • Prediction: BookTok is about to explode even more, thanks to Kindle Translate, Amazon’s new AI-powered translation service that lets authors instantly publish in multiple languages. No waiting on human translators means more global books, faster, and more content for creators to talk about. (Amazon)
