No One Wants Apple To Scrape Their Websites for AI Training

Ouch.

Stop Sign

Wired reports that a slew of major websites, including influential news publishers and top social media platforms, are blocking Apple’s web crawler from scraping their pages for AI training content.

Per the report, media companies that have altered their robots.txt files to lock Applebot out include The New York Times, The Atlantic, The Financial Times, Gannett, Vox Media, and Condé Nast. On the social media side, Facebook, Instagram, and Tumblr all confirmed that they’ve blocked Apple from scraping their sites, as did the enduring internet elder Craigslist.

Robots.txt files are becoming an increasingly fascinating place to study the digital politics of AI. Some of these companies — including Vox, Condé Nast, and The Atlantic — have inked content licensing deals with OpenAI; The New York Times, meanwhile, has drawn a clear line in the sand on AI, and is actively suing OpenAI for copyright infringement. Facebook and Instagram are both owned by Meta, one of Apple’s competitors in the AI field, while platforms built on user content like Tumblr and Craigslist are sitting on some very lucrative troves of quality data. Meanwhile, in the background, Apple has already entered a deal with OpenAI to integrate the chatbot ChatGPT into “Apple experiences.”

In short, the AI industry is intensely competitive, particularly regarding access to high-quality, human-made training material. And as the tentative bonds between AI companies and data wells like journalistic bodies or social media sites continue to take shape, where and how bots like Apple’s are allowed to roam offers an interesting glimpse into AI-focused decision-making — on the publisher side, and on behalf of AI companies as well.

Coal Mines

According to Wired, these websites have specifically blocked “Applebot-Extended,” a web crawler that, per an Apple blog post, explicitly provides web publishers with the choice to “opt out of their website content being used to train Apple’s foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.” An Apple spokesperson confirmed to Wired that blocking Applebot-Extended doesn’t ward off the OG Applebot from trawling a website, and instead prevents any scraped data from being used to train Apple’s AI models.

Applebot, in contrast, scrapes data for Apple’s Siri and Spotlight — a distinction that seems to speak to some caution on Apple’s behalf regarding copyright and IP protection in the AI era.
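For a rough sense of what such an opt-out looks like in practice, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt contents, the example.com URL, and the is_allowed helper are all hypothetical, made up for illustration rather than taken from any publisher's actual file. It only shows how a generic parser reads the rules, not how Apple's own systems process them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not copied from any real site.
# It opts out of Applebot-Extended while leaving every other agent unrestricted.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the sample rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story"  # placeholder URL
    for agent in ("Applebot-Extended", "Applebot"):
        print(agent, is_allowed(agent, url))
```

Under these sample rules, the check comes back False for Applebot-Extended and True for plain Applebot, which mirrors the split Apple's spokesperson described: the original crawler can still visit, while the Extended rule signals that the content should stay out of AI training.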

The NYT isn’t the only company or group suing AI makers, and it may well be in Apple’s best interest to avoid scraping any controversial or currently-in-litigation data, especially if it’s already tapped OpenAI to fill in some of its product gaps. Call it the billion-dollar canary in the coal mine.

More on AI and copyright: Amid New York Times Lawsuit, ChatGPT Is Citing Plagiarized Versions of NYT Articles on an Armenian Content Mill
