And they released it anyway.
News Flash, Buddy
Apple’s recent stab at AI, Apple Intelligence, has largely been disappointing. In particular, its news summaries faced such widespread criticism for botching headlines and reporting false information that this week Apple paused the entire program until it can be fixed.
None of this should be surprising. Such AI “hallucinations” are a problem inherent to all large language models that nobody’s solved yet, if it’s even solvable at all. But releasing its own AI model looks especially reckless when you consider that Apple engineers warned about the tech’s gaping deficiencies.
That warning came in a study released last October. The yet-to-be-peer-reviewed work, which tested the mathematical “reasoning” of some of the industry’s top LLMs, added to the consensus that AI models don’t actually reason.
“Instead,” the researchers concluded, “they attempt to replicate the reasoning steps observed in their training data.”
Math Is Hard
To test the AI models, the researchers had them attempt thousands of math problems from the widely used benchmark GSM8K dataset. A typical question is as follows: “James buys 5 packs of beef that are 4 pounds each. The price of beef is $5.50 per pound. How much did he pay?” Some questions are a tad more complicated, but it’s nothing that a well-educated middle schooler can’t solve.
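For reference, the sample question boils down to two multiplications. A quick Python check of the arithmetic, using the figures quoted in the question, looks like this:

```python
# Figures come straight from the sample GSM8K-style question above.
packs = 5               # packs of beef
pounds_per_pack = 4     # pounds per pack
price_per_pound = 5.50  # dollars per pound

total_pounds = packs * pounds_per_pack       # 5 * 4 = 20 pounds
total_cost = total_pounds * price_per_pound  # 20 * 5.50 = 110.0
print(f"James pays ${total_cost:.2f}")       # -> James pays $110.00
```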
The way the researchers exposed these gaps in the AI models was shockingly easy: they simply changed the numbers in the questions. That guards against data contamination, ensuring the AIs haven’t seen these exact problems before in their training data, without actually making the problems any harder.
This alone caused a minor but notable drop in accuracy in every one of the 20 tested LLMs. But when the researchers took things a step further by also changing the names and adding in irrelevant details (in a question about counting fruits, for example, remarking that a handful of them were “smaller than usual”), the performance drop was, in the researchers’ own wording, “catastrophic”: as high as 65 percent.
The drops varied between models, but even the cleverest of the bunch, OpenAI’s o1-preview, plummeted by 17.5 percent. (Its predecessor, GPT-4o, fell by 32 percent.)
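To make the setup concrete, here is a rough sketch in Python of the kind of question perturbation described above. It is purely illustrative, not the researchers’ actual code or dataset: a question’s name and numbers get re-sampled, and an irrelevant detail can optionally be tacked on, without the underlying math getting any harder.

```python
import random

# Illustrative sketch of the perturbation idea, not the study's actual code:
# the question becomes a template whose name and numbers are re-sampled, and
# an irrelevant detail can be appended without changing what it takes to solve it.
NAMES = ["James", "Sofia", "Liam", "Priya"]

def make_question(add_irrelevant_detail: bool = False):
    name = random.choice(NAMES)
    packs = random.randint(2, 9)
    pounds_per_pack = random.randint(2, 6)
    price_per_pound = round(random.uniform(3.0, 8.0), 2)

    question = (
        f"{name} buys {packs} packs of beef that are {pounds_per_pack} pounds each. "
        f"The price of beef is ${price_per_pound:.2f} per pound. "
        f"How much did {name} pay?"
    )
    if add_irrelevant_detail:
        # A detail with no bearing on the answer, in the spirit of the
        # "smaller than usual" fruit example from the study.
        question += f" {name} also looked at the chicken but didn't buy any."

    answer = round(packs * pounds_per_pack * price_per_pound, 2)
    return question, answer

q, a = make_question(add_irrelevant_detail=True)
print(q)
print(f"Expected answer: ${a:.2f}")
```

A model that genuinely reasons shouldn’t care which name or numbers it gets, or whether a throwaway detail is tacked on; one that pattern-matches against problems it has effectively memorized apparently does.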
Copy Cat
And so the takeaway is harsh.
“This reveals a critical flaw in the models’ ability to discern relevant information for problem-solving, likely because their reasoning is not formal in the common sense term and is mostly based on pattern matching,” the researchers wrote.
Put another way, AI is very good at appearing smart, and will often give you the right answer! But once it can’t copy someone’s homework word-for-word, it struggles — big time.
You’d think this would raise serious questions about trusting an AI model to regurgitate headlines (swapping words around without actually understanding how that changes the overall meaning), but apparently not. Apple knew about the serious flaws that every single LLM to date has shown and released its own model anyway. Which, to be fair, is the modus operandi of the entire AI industry.
More on AI: Horrendous New Startup Uses AI Agents to Flood Reddit With Posts Shilling Clients’ Products