News publishers limit Internet Archive access due to AI scraping concerns
Outlets like The Guardian and The New York Times are scrutinizing digital archives as potential backdoors for AI crawlers.
By Andrew Deck and Hanaa’ Tameez | Nieman Journalism Lab | Jan. 28, 2026, 3:09 p.m.

As part of its mission to preserve the web, the Internet Archive operates crawlers that capture webpage snapshots. Many of these snapshots are accessible through its public-facing tool, the Wayback Machine. But as AI bots scavenge the web for training data to feed their models, the Internet Archive’s commitment to free information access has turned its digital library into a potential liability for some news publishers.

When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance that AI companies might scrape its content via the nonprofit’s repository of over one trillion webpage snapshots.

Related article: “The Wayback Machine’s snapshots of news homepages plummet after a ‘breakdown’ in archiving projects,” Andrew Deck, October 21, 2025.

Specifically, Hahn said The Guardian has taken steps to exclude itself from the Internet Archive’s APIs and to filter its article pages out of the Wayback Machine’s URLs interface. The Guardian’s regional homepages, topic pages, and other landing pages will continue to appear in the Wayback Machine.

In particular, Hahn expressed concern about the Internet Archive’s APIs. “A lot of these AI businesses are looking for readily available, structured databases of content,” he said. “The Internet Archive’s API would have been an obvious place to plug their own machines into and suck out the IP.” (He admits the Wayback Machine itself is “less risky,” since the data is not as well-structured.)

As news publishers try to safeguard their content from AI companies, the Internet Archive is also getting caught in the crosshairs. The Financial Times, for example, blocks any bot that tries to scrape its paywalled content, including bots from OpenAI, Anthropic, Perplexity, and the Internet Archive. The majority of FT stories are paywalled, according to director of global public policy and platform strategy Matt Rogerson. As a result, usually only unpaywalled FT stories…
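Hahn’s worry about the Internet Archive’s APIs comes down to how little effort structured, programmatic access demands. As a rough illustration (not tied to any particular publisher, and using the Wayback Machine’s public CDX search endpoint with a placeholder domain), a short Python script is enough to enumerate archived captures of a site:

```python
# Minimal sketch: listing archived captures via the Wayback Machine's
# public CDX search API. The query parameters and the example domain are
# illustrative; real usage would respect the Archive's rate limits and terms.
import json
import urllib.parse
import urllib.request

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def list_snapshots(url_pattern: str, limit: int = 5) -> list[dict]:
    """Return basic metadata for archived captures matching url_pattern."""
    params = urllib.parse.urlencode({
        "url": url_pattern,          # e.g. "example.com/news/*"
        "output": "json",            # JSON output: first row is the header
        "limit": str(limit),
        "filter": "statuscode:200",  # keep only successful captures
    })
    with urllib.request.urlopen(f"{CDX_ENDPOINT}?{params}") as resp:
        rows = json.load(resp)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]

if __name__ == "__main__":
    for snap in list_snapshots("example.com/*"):
        # Each capture can be replayed at
        # https://web.archive.org/web/<timestamp>/<original>
        print(snap["timestamp"], snap["original"])
```

Excluding article pages from interfaces like this is the kind of step The Guardian describes: it removes the convenient, structured entry point that Hahn worries AI companies would otherwise plug into.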
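The bot blocking the Financial Times describes is commonly implemented, at least in part, through robots.txt rules aimed at specific crawler user agents, backed by server- or CDN-level enforcement for bots that ignore them. The sketch below runs Python’s standard robots.txt parser against a hypothetical policy; the user-agent tokens are the ones publicly associated with OpenAI, Anthropic, Perplexity, and the Internet Archive’s crawler, but the rules themselves are invented for illustration and are not the FT’s actual configuration.

```python
# Hedged sketch: a hypothetical robots.txt that disallows specific AI and
# archiving crawlers while leaving other agents alone. Illustrative only;
# robots.txt binds only crawlers that choose to honor it, so publishers
# typically pair it with server-side blocking.
import urllib.robotparser

HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(HYPOTHETICAL_ROBOTS_TXT.splitlines())

for agent in ("GPTBot", "ClaudeBot", "PerplexityBot", "ia_archiver", "SomeOtherBot"):
    allowed = parser.can_fetch(agent, "https://news.example.com/2026/paywalled-article")
    print(f"{agent:15} may fetch: {allowed}")
```

Run as written, this prints False for the four named crawlers and True for the catch-all agent, which is the asymmetry publishers like the FT are aiming for: AI and archiving bots are turned away while ordinary crawling continues.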
Source: Hacker News | Original Link