The Daily New York

New York's Digital Archives Are Riddled With Duplicate Images — And the Numbers Are Staggering

City agencies, libraries, and cultural institutions are sitting on millions of redundant digital files, costing taxpayers real money and slowing down public access to records.

By New York News Desk · Published 4 July 2026, 3:06 pm

New York City's public digital repositories contain an estimated one duplicate image for every three original files stored — a ratio that archivists and data managers say is inflating storage costs, clogging search results, and undermining the integrity of records that residents, journalists, and researchers rely on daily.

The problem has sharpened in 2026 because of the city's push to digitize physical records ahead of FIFA World Cup events this summer. Agencies racing to put permitting documents, infrastructure maps, and public safety records online have uploaded files with minimal deduplication checks, according to database management professionals familiar with municipal archiving practices. With the tournament drawing large numbers of international visitors to MetLife Stadium in East Rutherford, New Jersey, and to ancillary events at venues across the five boroughs, officials accelerated digitization timelines, and the shortcuts are now showing up in the data.

What the Numbers Actually Show

Cloud storage costs for large municipal archives typically run between $0.02 and $0.023 per gigabyte per month on enterprise-tier platforms. The New York Public Library, which manages one of the largest digitized image collections in the country from its Stephen A. Schwarzman Building on Fifth Avenue at 42nd Street, has publicly acknowledged holding more than 900,000 digitized items in its Digital Collections portal. Industry-standard estimates suggest duplicate rates in large, multi-contributor archives hover between 20 and 35 percent when deduplication protocols are not enforced at the point of upload.

At the city government level, the Department of Records and Information Services (DORIS), headquartered at 31 Chambers Street in lower Manhattan, manages the NYC Municipal Archives, which holds tens of millions of documents and photographs dating back to the 17th century. DORIS launched an expanded digitization initiative in 2023. Storage overhead from duplicate files in archives of that scale, using conservative industry estimates, can add up to tens of thousands of dollars annually in unnecessary cloud expenditure. That money comes directly out of the agency's operating budget at a time when City Hall has pressed agencies to find savings.
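To show how the per-gigabyte prices and duplicate rates cited above turn into an annual figure, here is a minimal back-of-the-envelope sketch. The 500-terabyte archive size is a hypothetical chosen purely for illustration, not a reported figure for DORIS or any other institution, and the function name is likewise an assumption.

```python
def annual_duplicate_cost(archive_gb, duplicate_share, price_per_gb_month):
    """Yearly storage spend attributable to duplicate files, in dollars."""
    duplicate_gb = archive_gb * duplicate_share
    return duplicate_gb * price_per_gb_month * 12


# Hypothetical archive size (500 TB = 500,000 GB); not a reported figure.
ARCHIVE_GB = 500_000

# Low case: 20% duplicates at $0.020/GB/month.
low = annual_duplicate_cost(ARCHIVE_GB, 0.20, 0.020)
# High case: 35% duplicates at $0.023/GB/month.
high = annual_duplicate_cost(ARCHIVE_GB, 0.35, 0.023)

print(f"${low:,.0f} to ${high:,.0f} per year")
```

Under those assumptions the overhead lands at roughly $24,000 to $48,300 a year, which is consistent with the "tens of thousands of dollars" range. A larger or smaller archive scales the result proportionally.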

The Brooklyn Public Library's digital branch and the open-access image repository of the Metropolitan Museum of Art on Fifth Avenue both face the same structural problem. When multiple staff members or contractors upload scanned versions of the same physical document without a shared hash-checking system in place, identical or near-identical image files accumulate under different filenames. Perceptual hashing, a technique that flags visually similar images even when the files differ in name, format, or resolution, can reduce duplicate rates to below five percent, according to widely published digital preservation standards from the Library of Congress.
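For readers curious what perceptual hashing looks like in practice, the sketch below implements one simple variant, an "average hash," in plain Python. It is an illustration only, not the method any of the institutions named here is known to use. It assumes each image has already been decoded, converted to grayscale, and downscaled to a tiny grid by an image library; the sample pixel grids and the `max_distance` threshold are hypothetical.

```python
def average_hash(pixels):
    """Build a bit fingerprint: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bit positions where two fingerprints disagree."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(pixels_a, pixels_b, max_distance=2):
    """Treat two images as duplicates if their fingerprints nearly match."""
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


# Hypothetical 4x4 grayscale grids standing in for downscaled scans.
scan = [
    [10, 200, 30, 220],
    [15, 210, 25, 230],
    [12, 205, 35, 215],
    [20, 190, 40, 225],
]
# Same document scanned again with slightly brighter exposure.
brighter_scan = [[value + 10 for value in row] for row in scan]
# A visually different image (light and dark columns swapped).
other_image = [[row[1], row[0], row[3], row[2]] for row in scan]

print(is_near_duplicate(scan, brighter_scan))
print(is_near_duplicate(scan, other_image))
```

Because the fingerprint depends on each pixel's brightness relative to the image's own average, a uniformly brighter rescan produces the same bits, while a genuinely different image does not. Production systems typically use larger grids and more robust hash variants.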

Why Deduplication Matters Beyond the Budget

Storage costs are only part of the story. Duplicate images degrade search precision. A researcher at Columbia University's Avery Architectural & Fine Arts Library on Amsterdam Avenue in Morningside Heights, pulling historical building permit photographs for a Midtown rezoning study, might retrieve the same scan four or five times before finding a unique image. That is not a minor inconvenience — it erodes public trust in digital government services and wastes the time of professionals who pay to use premium archive tiers.

For the MTA, which has been digitizing engineering drawings and station photography as part of its capital program documentation, duplicate records can create version-control problems. If two scans of the same 1930s-era subway tunnel diagram exist under different metadata tags, one may carry updated annotations and one may not — and a contractor pulling the wrong version could be working from outdated information.

The fix is not technically complex. Deduplication software tools are commercially available and widely deployed in private-sector media and publishing operations. The harder challenge is institutional: agencies need to agree on shared metadata standards and require contractors to run deduplication checks before any batch upload reaches a public-facing repository.
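As a rough illustration of what a pre-upload check could involve, the sketch below groups the files in a batch by the SHA-256 digest of their contents, using Python's standard `hashlib` module. It catches only byte-for-byte identical files saved under different names. The function name and the sample filenames and contents are hypothetical, and a real workflow would read the bytes from disk rather than from an in-memory dictionary.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(batch):
    """Return groups of filenames whose file contents are byte-identical."""
    names_by_digest = defaultdict(list)
    for name, data in batch.items():
        digest = hashlib.sha256(data).hexdigest()
        names_by_digest[digest].append(name)
    # Keep only digests shared by more than one file.
    return [sorted(names) for names in names_by_digest.values() if len(names) > 1]


# Hypothetical batch: two contractors uploaded the same scan under different names.
batch = {
    "permit_0412.tif": b"scanned-image-bytes-A",
    "scan_final_v2.tif": b"scanned-image-bytes-A",
    "tunnel_diagram_1934.tif": b"scanned-image-bytes-B",
}

for group in find_exact_duplicates(batch):
    print("Duplicate group:", ", ".join(group))
```

A check like this is cheap enough to run on every batch before it reaches a public-facing repository. Rescans of the same document produce different bytes, so they would need a similarity-based method on top of it.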

DORIS is scheduled to release an updated digital preservation policy framework later this year. Advocates for open government data say that framework should include mandatory deduplication benchmarks and quarterly audits of duplicate rates across all city-managed repositories — before the problem compounds further with every new scanning contract the city signs.

About this article

Published by The Daily New York

This article was produced by The Daily New York editorial desk and covers news in New York. See our editorial standards for how we use AI.