The Daily New York

New York City's Digital Archives Are Riddled With Duplicate Images, and the Numbers Show How Bad It's Gotten

Across city agencies and cultural institutions, redundant digital files are eating up storage budgets and slowing public access to historical records.

By New York News Desk · Published 4 July 2026, 2:40 pm

New York City's public digital repositories are carrying a hidden weight: tens of thousands of duplicate image files scattered across agency servers, library systems, and municipal databases, according to records and technology assessments reviewed by The Daily New York. The problem isn't new, but its scale and cost are finally drawing attention from administrators trying to rationalize bloated IT budgets ahead of a tight city fiscal year.

Why now? The Adams administration's push to digitize city records accelerated sharply after 2022, when the Department of Records and Information Services, headquartered at 31 Chambers Street in lower Manhattan, launched a multi-phase scanning initiative aimed at putting millions of historical documents online. More files digitized faster means more opportunities for duplication — the same photograph ingested twice under different file names, scanned once by an archivist and once by a contractor, or uploaded to two platforms simultaneously without a reconciliation check.

The Numbers Behind the Problem

Industry benchmarks for large municipal digitization programs suggest that duplicate image rates can run anywhere from 12 percent to more than 30 percent of total file counts, depending on how aggressively deduplication software is deployed. For a city the size of New York, where the Municipal Archives alone holds more than 2 million photographic items, even a conservative 15 percent duplication rate would translate to at least 300,000 redundant files consuming server space that costs real money to maintain.

Cloud storage pricing for enterprise government contracts typically runs between $0.02 and $0.05 per gigabyte per month. A single high-resolution archival scan can run 50 megabytes or more. Do the arithmetic across a repository of, say, 500,000 duplicate files at that size, about 25 terabytes in all, and the unnecessary storage bill comes to roughly $500 to $1,250 a month, or $6,000 to $15,000 a year, recurring for as long as the files stay in place. Small numbers individually. Significant ones collectively, particularly when the city's Department of Information Technology and Telecommunications, known as DoITT, is already managing a capital technology budget measured in the hundreds of millions.
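The storage arithmetic can be checked with a short sketch. The inputs are the illustrative figures quoted above (500,000 duplicates, 50 megabytes per scan, $0.02 to $0.05 per gigabyte per month); the function name and the decimal conversion of 1 GB = 1,000 MB are assumptions made for this example, not city billing rules.

```python
def monthly_storage_cost(file_count, mb_per_file, usd_per_gb_month):
    """Return the monthly cost in USD of storing file_count files."""
    # Assumes decimal units: 1 GB = 1,000 MB.
    total_gb = file_count * mb_per_file / 1000
    return total_gb * usd_per_gb_month


# Illustrative inputs taken from the figures quoted in the article.
low = monthly_storage_cost(500_000, 50, 0.02)
high = monthly_storage_cost(500_000, 50, 0.05)

print(f"${low:,.0f} to ${high:,.0f} per month")
print(f"${low * 12:,.0f} to ${high * 12:,.0f} per year")
```

Larger scans or a higher duplicate count scale the result linearly, so doubling either input doubles the bill.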

The New York Public Library, with branches stretching from the Stephen A. Schwarzman Building on Fifth Avenue and 42nd Street to the Mott Haven branch in the South Bronx, faces a parallel version of the same challenge in its Digital Collections portal. The library has digitized more than 900,000 items and made them publicly accessible online, a figure it has cited in its own annual reports. Keeping that catalogue clean — free of redundant entries that confuse researchers and slow search performance — requires ongoing curatorial labor that competes with other budget priorities.

What Deduplication Actually Costs — and What It Saves

Perceptual hashing, the technology most commonly used to identify near-duplicate images even when file names differ, has become markedly cheaper in recent years. Software licenses for enterprise-grade tools now start around $10,000 annually for mid-sized deployments, and larger municipal contracts are typically negotiated at a significantly lower per-unit price. A one-time remediation project for a repository of 500,000 images can often be completed in under three months with a team of two or three data technicians, according to published case studies from peer institutions in Chicago and London.
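For readers curious how perceptual hashing works, here is a minimal sketch of one simple variant, the average hash. It is not the method used by any particular vendor or city agency. The image is assumed to be a grid of grayscale brightness values whose sides divide evenly by the hash size, and the test images are synthetic.

```python
def average_hash(pixels, hash_size=8):
    """Compute an average hash of a grayscale image.

    pixels is a list of rows of 0-255 brightness values; width and
    height are assumed to be multiples of hash_size.
    """
    height, width = len(pixels), len(pixels[0])
    bh, bw = height // hash_size, width // hash_size
    # Shrink to hash_size x hash_size by averaging each block of pixels.
    small = []
    for by in range(hash_size):
        for bx in range(hash_size):
            block = [pixels[y][x]
                     for y in range(by * bh, (by + 1) * bh)
                     for x in range(bx * bw, (bx + 1) * bw)]
            small.append(sum(block) / len(block))
    mean = sum(small) / len(small)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in small:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


# Synthetic 16x16 images (illustrative only, not archive data).
original = [[x * 16 for x in range(16)] for _ in range(16)]
brighter = [[min(255, v + 10) for v in row] for row in original]  # same picture, lighter scan
different = [[y * 16 for _ in range(16)] for y in range(16)]      # unrelated picture

near = hamming_distance(average_hash(original), average_hash(brighter))
far = hamming_distance(average_hash(original), average_hash(different))
print(near, far)
```

A small distance suggests two files show the same picture even if their names and exact bytes differ; the cutoff that counts as "small" is a tuning decision each institution would have to make.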

The Brooklyn Public Library, which runs its own digital archive separate from the NYPL system, began a quiet internal audit of its digitized photograph collection in early 2026 specifically to address redundancy before migrating to a new content management platform later this year. The library has not published findings yet.

For city residents and researchers who rely on these archives — genealogists searching 19th-century immigration records at the Municipal Archives, historians pulling building permit photographs from the Department of Buildings' online portal at 280 Broadway — duplicate images mean cluttered search results, slower load times, and occasional confusion when two versions of the same document carry conflicting metadata.

The practical fix is neither glamorous nor politically charged: run deduplication software, audit the results, establish intake protocols that flag redundant files before they enter the system, and fund the staff to maintain those protocols going forward. Several city agencies are reportedly in early planning stages for exactly that kind of housekeeping. The question is whether the budget to do it properly survives the next round of municipal cuts.
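An intake check of the kind described above can be sketched in a few lines. This is a hypothetical illustration, not any agency's actual system: the class name, file names, and sample bytes are invented for the example. It catches only byte-for-byte identical files; near-duplicates, such as a rescan of the same photograph, would need a perceptual comparison as well.

```python
import hashlib


class IntakeRegistry:
    """Hypothetical intake check: remembers a SHA-256 digest per accepted file."""

    def __init__(self):
        # Maps digest -> name of the first file accepted with that content.
        self._seen = {}

    def check(self, name, data):
        """Return the name of an earlier identical file, or None if data is new."""
        digest = hashlib.sha256(data).hexdigest()
        earlier = self._seen.get(digest)
        if earlier is None:
            self._seen[digest] = name
        return earlier


registry = IntakeRegistry()
first = registry.check("archivist_scan_0001.tif", b"example image bytes")
second = registry.check("contractor_upload_77.tif", b"example image bytes")
third = registry.check("other_scan.tif", b"different image bytes")

print(first, second, third)
```

Here the second upload is flagged as a copy of the first despite its different file name, while the third passes as new.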

About this article

Published by The Daily New York

This article was produced by The Daily New York editorial desk and covers news in New York. See our editorial standards for how we use AI.
