New York's Duplicate Image Problem: The Numbers That Reveal a Hidden Digital Crisis
City agencies and cultural institutions are sitting on millions of redundant digital files, and the cost of storing them is climbing fast.

New York City's sprawling network of municipal agencies collectively manages what is estimated to be tens of millions of digital image files, and a growing share of them are exact or near-exact duplicates: copies piled on copies across servers that taxpayers fund by the terabyte. The scale of the problem, drawn from public records requests and agency budget filings reviewed this spring, points to a digital housekeeping failure with real dollar consequences.
The timing matters. With the city pushing hard on technology modernization under the Adams administration's NYC Digital Services initiative, and with the Department of Citywide Administrative Services expanding its cloud migration contracts through fiscal year 2027, the cost of storing redundant data is no longer an abstraction. Cloud storage is not free. Redundant image files inflate storage bills, slow down retrieval systems, and in some cases compromise the integrity of public-facing databases used by residents and journalists alike.
Duplicate image bloat is not unique to government. But New York's institutional footprint makes it unusually acute. The New York Public Library system, which spans 92 branch locations across Manhattan, the Bronx, and Staten Island, digitized more than 900,000 items as of its last publicly reported archive count. Digitization workflows, when not carefully managed, routinely generate three to five derivative copies per original scan — thumbnails, web-optimized versions, archival masters — and version-control failures can mean those copies multiply further across backup systems.
At the municipal level, the NYC Department of Buildings maintains a property image database tied to its DOB NOW construction permit portal. Each permit application can include dozens of attached photographs. The department processed more than 180,000 permit applications in fiscal year 2024, according to figures published in its annual report. Even a conservative duplication rate of 15 percent across attached images represents hundreds of thousands of redundant files, all of them occupying storage that is billed per gigabyte per month under the city's Microsoft Azure and AWS contract arrangements.
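The arithmetic behind that estimate can be sketched in a few lines of Python. The 180,000 applications and the 15 percent rate come from the figures above; the images-per-application values are assumptions chosen only to illustrate what "dozens" might mean, and the function name is hypothetical.

```python
def estimated_duplicates(applications: int,
                         images_per_application: int,
                         duplication_rate: float) -> int:
    """Rough count of redundant image files under the stated assumptions."""
    return round(applications * images_per_application * duplication_rate)


APPLICATIONS_FY2024 = 180_000   # from the department's annual report, as cited above
DUPLICATION_RATE = 0.15         # the "conservative" rate used in the article

# Assumed values for "dozens" of photographs per application (not reported figures).
for images_per_app in (12, 24, 36):
    dupes = estimated_duplicates(APPLICATIONS_FY2024, images_per_app, DUPLICATION_RATE)
    print(f"{images_per_app} images per application -> {dupes:,} duplicate files")
```

Under these assumptions the estimate ranges from 324,000 to 972,000 duplicate files, which is consistent with "hundreds of thousands."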
The Metropolitan Transportation Authority faces a parallel version of the issue. The MTA's capital program, currently a $68 billion plan running through 2029, includes substantial investment in surveillance and inspection camera infrastructure across the subway system. Camera feeds generate still-image captures flagged for review. Without automated deduplication protocols built into the ingestion pipeline, duplicate frames stack up rapidly — a known challenge in any high-volume image capture environment.
Industry benchmarks offer a useful frame. Research published by the data management firm Veritas in 2023 found that across large enterprise organizations, between 30 and 40 percent of stored data is redundant, obsolete, or trivial — a category that includes duplicate images. Applied conservatively to a city the size of New York, even a 20 percent redundancy rate across municipal digital storage would represent a significant and unnecessary budget line.
The Brooklyn Public Library, which operates separately from the NYPL system and serves 60 branch locations across Kings County, has been building out its digital archive through the BPL Digital Collections program. Archivists there have publicly discussed — in conference presentations, not in statements to this reporter — the challenge of deduplication when source materials arrive from multiple donor collections with overlapping content.
Storage costs vary widely depending on contract tier and data classification. Cloud object storage currently runs between roughly $0.004 per gigabyte per month for archival tiers and $0.023 for standard tiers, depending on the provider and how often the data is accessed. A single high-resolution image file from a DSLR camera can run 25 to 50 megabytes. Multiply that across millions of duplicated files and the monthly carrying cost becomes material. It is not transformative on its own, but it is a genuine inefficiency in a city that is simultaneously cutting library hours and arguing over transit funding.
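To put rough numbers on that carrying cost, the short Python sketch below multiplies the per-gigabyte prices and file sizes quoted above by a hypothetical count of one million duplicate files. That count is an assumption for illustration, not a reported figure, and the calculation uses decimal gigabytes (1,000 MB) and ignores retrieval, request, and backup charges.

```python
def monthly_cost_usd(file_count: int, avg_file_mb: float, price_per_gb_month: float) -> float:
    """Monthly storage cost for file_count files of avg_file_mb megabytes each."""
    total_gb = file_count * avg_file_mb / 1000  # decimal GB, a simplification
    return total_gb * price_per_gb_month


DUPLICATE_FILES = 1_000_000  # assumed for illustration only

for size_mb in (25, 50):              # file-size range quoted in the article
    for price in (0.004, 0.023):      # price range quoted in the article
        cost = monthly_cost_usd(DUPLICATE_FILES, size_mb, price)
        print(f"{size_mb} MB files at ${price}/GB-month: ${cost:,.2f} per month")
```

On these assumptions, every million duplicate files costs somewhere between $100 and $1,150 a month, depending on file size and storage tier. The figure scales linearly with the number of files.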
The practical path forward for agencies and institutions involves automated deduplication software integrated at the point of ingestion — tools that hash incoming files and flag matches before they ever hit long-term storage. Several vendors offer this capability, and it is standard practice in private-sector media companies. For New York's public institutions, the barrier is less technological than procedural: updating intake workflows, retraining staff, and building deduplication requirements into future vendor contracts. The city's next round of digital infrastructure RFPs, expected in late 2026, would be a logical place to mandate it.
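The hash-and-flag approach can be illustrated with a minimal Python sketch using the standard library's SHA-256 implementation. It is not any vendor's or agency's actual system; the class and function names are hypothetical, and the index lives in memory, where a production pipeline would presumably use a database.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


class IngestIndex:
    """Hypothetical in-memory index mapping content hashes to the first file seen."""

    def __init__(self) -> None:
        self._seen = {}  # digest -> Path of the first file with that content

    def check(self, path: Path) -> Optional[Path]:
        """Return the earlier file with identical bytes, or None if the file is new."""
        digest = file_digest(path)
        existing = self._seen.get(digest)
        if existing is None:
            self._seen[digest] = path  # first occurrence: record it and let it through
        return existing
```

An intake script would call `check()` on each incoming file and hold back any file for which it returns an earlier path. A cryptographic hash of this kind catches only byte-for-byte copies. Near-exact duplicates, such as a resized or re-compressed version of the same photograph, would require a different technique, for example perceptual hashing.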
About this article
Published by The Daily New York


