New York City's public digital repositories are carrying a hidden weight: tens of thousands of duplicate image files scattered across agency servers, library systems, and municipal databases, according to records and technology assessments reviewed by The Daily New York. The problem isn't new, but its scale and its cost are finally drawing attention from administrators trying to rationalize bloated IT budgets ahead of a tight city fiscal year.
Why now? The Adams administration's push to digitize city records accelerated sharply after 2022, when the Department of Records and Information Services, headquartered at 31 Chambers Street in lower Manhattan, launched a multi-phase scanning initiative aimed at putting millions of historical documents online. More files digitized faster means more opportunities for duplication — the same photograph ingested twice under different file names, scanned once by an archivist and once by a contractor, or uploaded to two platforms simultaneously without a reconciliation check.
The Numbers Behind the Problem
Industry benchmarks for large municipal digitization programs suggest that duplicate image rates can run anywhere from 12 percent to over 30 percent of total file counts, depending on how aggressively deduplication software is deployed. For a city the size of New York, where the Municipal Archives alone holds more than 2 million photographic items, even a conservative 15 percent duplication rate would translate to at least 300,000 redundant files consuming server space that costs real money to maintain.
Cloud storage pricing for enterprise government contracts typically runs between $0.02 and $0.05 per gigabyte per month. A single high-resolution archival scan can run 50 megabytes or more. Do the arithmetic across a repository of, say, 500,000 duplicate files at that size, roughly 25 terabytes in all, and the unnecessary storage bill comes to about $500 to $1,250 a month, or as much as $15,000 a year, recurring for as long as the files sit on the servers. Small numbers individually. Significant ones collectively, particularly when the city's Department of Information Technology and Telecommunications, known as DoITT, is already managing a capital technology budget measured in the hundreds of millions.
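For readers who want to check the storage math, here is a rough back-of-envelope sketch using the figures cited above. The function and variable names are illustrative, and the use of decimal units (1 gigabyte = 1,000 megabytes) is an assumption; actual contract billing may differ.

```python
# Back-of-envelope monthly storage cost for duplicate scans.
# Inputs are the article's illustrative figures, not measured data.
# Assumption: decimal units, 1 GB = 1,000 MB.

def monthly_cost(file_count, mb_per_file, usd_per_gb_month):
    """Return the monthly storage cost in dollars."""
    total_gb = file_count * mb_per_file / 1000
    return total_gb * usd_per_gb_month

DUPLICATE_FILES = 500_000   # hypothetical count of redundant files
MB_PER_SCAN = 50            # size of one high-resolution scan

low = monthly_cost(DUPLICATE_FILES, MB_PER_SCAN, 0.02)
high = monthly_cost(DUPLICATE_FILES, MB_PER_SCAN, 0.05)

print(f"Monthly: ${low:,.0f} to ${high:,.0f}")
print(f"Yearly:  ${low * 12:,.0f} to ${high * 12:,.0f}")
```

Larger scans or higher duplicate counts scale the result linearly, so doubling either input doubles the bill.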
The New York Public Library, with branches stretching from the Stephen A. Schwarzman Building on Fifth Avenue and 42nd Street to the Mott Haven branch in the South Bronx, faces a parallel version of the same challenge in its Digital Collections portal. The library has digitized more than 900,000 items and made them publicly accessible online, a figure it has cited in its own annual reports. Keeping that catalogue clean — free of redundant entries that confuse researchers and slow search performance — requires ongoing curatorial labor that competes with other budget priorities.
What Deduplication Actually Costs — and What It Saves
Perceptual hashing, the technology most commonly used to identify near-duplicate images even when file names, formats, or resolutions differ, has become markedly cheaper in recent years. Software licenses for enterprise-grade tools now start around $10,000 annually for mid-sized deployments, with larger municipal contracts negotiated at significantly lower per-unit rates. A one-time remediation project for a repository of 500,000 images can often be completed in under three months with a team of two or three data technicians, according to published case studies from peer institutions in Chicago and London.
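To give a sense of how perceptual hashing works, the sketch below implements average hashing, one of the simplest variants. It is not the method of any particular commercial tool. As a simplifying assumption, images are represented as plain grids of grayscale values rather than decoded from real files, and the function names and the threshold of 5 differing bits are illustrative choices.

```python
# Minimal average-hash sketch, one simple form of perceptual hashing.
# Assumption: an image is a list of rows of grayscale values (0-255).
# A production tool would decode and resize real image files first.

def average_hash(pixels, size=8):
    """Shrink a grayscale grid to size x size and return a bit string."""
    height, width = len(pixels), len(pixels[0])
    # Nearest-neighbour downsample: take one source pixel per output cell.
    small = [
        pixels[(r * height) // size][(c * width) // size]
        for r in range(size)
        for c in range(size)
    ]
    mean = sum(small) / len(small)
    # Each bit records whether a cell is brighter than the mean.
    return "".join("1" if value > mean else "0" for value in small)

def hamming_distance(hash_a, hash_b):
    """Count the positions where two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))

def is_near_duplicate(pixels_a, pixels_b, threshold=5):
    """Treat two images as near-duplicates if few hash bits differ."""
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= threshold

# Hypothetical test images: a gradient, a slightly brighter copy,
# and an inverted (negative) version.
gradient = [[x * 8 + y * 8 for x in range(16)] for y in range(16)]
brighter = [[min(255, v + 10) for v in row] for row in gradient]
inverted = [[255 - v for v in row] for row in gradient]

print(is_near_duplicate(gradient, brighter))  # brightness shift only
print(is_near_duplicate(gradient, inverted))  # very different image
```

Because the hash compares each region to the image's own average brightness, a uniformly brightened copy produces the same bits, while a byte-for-byte checksum of the two files would not match.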
The Brooklyn Public Library, which runs its own digital archive separate from the NYPL system, began a quiet internal audit of its digitized photograph collection in early 2026 specifically to address redundancy before migrating to a new content management platform later this year. The library has not published findings yet.
For city residents and researchers who rely on these archives — genealogists searching 19th-century immigration records at the Municipal Archives, historians pulling building permit photographs from the Department of Buildings' online portal at 280 Broadway — duplicate images mean cluttered search results, slower load times, and occasional confusion when two versions of the same document carry conflicting metadata.
The practical fix is neither glamorous nor politically charged: run deduplication software, audit the results, establish intake protocols that flag redundant files before they enter the system, and fund the staff to maintain those protocols going forward. Several city agencies are reportedly in early planning stages for exactly that kind of housekeeping. The question is whether the budget to do it properly survives the next round of municipal cuts.
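An intake protocol of the kind described can be sketched in a few lines. The example below is a hypothetical illustration, not any agency's actual system: it keeps a SHA-256 digest for every accepted file and flags a new upload whose bytes match one already on record. The class name and file names are invented for the example.

```python
import hashlib

class IntakeRegistry:
    """Hypothetical intake check: one SHA-256 digest per accepted file."""

    def __init__(self):
        self._seen = {}  # digest -> name of the first accepted file

    def submit(self, name, data):
        """Return the name of an earlier identical file, or None if accepted."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._seen:
            return self._seen[digest]  # duplicate: report the original
        self._seen[digest] = name
        return None

registry = IntakeRegistry()
scan_bytes = b"example image bytes"  # stand-in for real file contents

first = registry.submit("archivist_scan_001.tif", scan_bytes)
second = registry.submit("contractor_upload_A.tif", scan_bytes)
third = registry.submit("unrelated_photo.tif", b"different bytes")

print(first, second, third)
```

A check like this catches only byte-identical files. A rescan or recompressed copy of the same photograph would pass through, which is why near-duplicate detection still needs a perceptual comparison and a human audit of the results.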