The Daily New York

How New York's Digital Archives Ended Up Full of Duplicates — and What the City Is Doing About It

Decades of rushed scanning projects, siloed agency databases, and inconsistent file-naming conventions left city records bloated with redundant images. Now officials are trying to clean house.

By New York News Desk · Published 4 July 2026, 2:36 pm

Photo: Vlada Karpovich on Pexels

New York City's public-facing digital archives contain hundreds of thousands of duplicate photograph files, a problem that has quietly inflated storage costs, slowed database searches, and hampered the work of researchers, journalists, and city planners who rely on those records. The issue cuts across at least a dozen municipal agencies, from the Department of Buildings to the New York City Municipal Archives on Chambers Street in Lower Manhattan, and it did not happen overnight.

The roots of the problem stretch back to the early 2000s, when city agencies began mass-digitizing paper records under pressure from post-9/11 continuity-of-government mandates. Speed mattered more than standardization. Each agency bought its own scanning hardware, used its own file-naming schema, and uploaded to its own server environment. Nobody was coordinating across the five boroughs.

How the Mess Accumulated

The Department of City Planning, headquartered at 120 Broadway, ran one digitization effort. The Landmarks Preservation Commission, based on Vesey Street, ran another. The Parks Department digitized decades of construction and maintenance photographs independently. When the city later tried to federate these collections into unified portals — including the NYC Open Data platform, which launched publicly in 2012 — agencies simply uploaded what they had, duplicates included.

Migration events made things worse. Every time a city agency refreshed its content management system — and several did so multiple times between 2010 and 2020 — images were exported and re-imported without deduplication checks. A single photograph of, say, the Greenpoint waterfront in Brooklyn could end up stored under four different filenames across three different agency databases, each version with slightly different metadata tags. For archivists, that fragmentation turns a simple image search into a time-consuming manual audit.

The problem also has a dollar figure attached, even if the precise citywide total is difficult to pin down. Cloud storage is not free. The city's contract with its primary cloud infrastructure vendor, renewed in fiscal year 2024 under a deal administered through the Department of Information Technology and Telecommunications (DoITT), since renamed the NYC Office of Technology and Innovation, runs into the tens of millions of dollars annually. Storage optimization studies conducted by peer municipalities have found that duplicate image files can account for 15 to 30 percent of total digital asset storage volume, according to published findings from the Urban Libraries Council. If New York's ratio falls anywhere in that range, the redundancy is costing real money.

The Push Toward Systematic Cleanup

The Adams administration's NYC Digital Equity and Open Data team has flagged image deduplication as a priority within a broader data-quality initiative that gained momentum in late 2025. The Municipal Archives on Chambers Street began piloting automated hash-matching software across a subset of its photograph collection, which runs to more than two million items. The technique generates a fingerprint for each image file and flags files whose fingerprints match or nearly match. Early results from that pilot, described in internal agency communications reviewed by staff, suggested the tool could flag candidates for review far faster than human archivists, who would have needed years to do the same work manually.
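To illustrate the general idea of hash matching, here is a minimal Python sketch of the exact-duplicate case. It is not the software used in the Municipal Archives pilot, which the article does not name. The file names and byte contents are hypothetical, and SHA-256 is an assumed choice of fingerprint. Files with identical bytes produce identical digests, so grouping by digest surfaces copies stored under different names.

```python
import hashlib
from collections import defaultdict


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def group_exact_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents are byte-for-byte identical.

    Only groups with more than one member are returned, sorted so the
    output is stable from run to run.
    """
    by_digest: dict[str, list[str]] = defaultdict(list)
    for name, data in files.items():
        by_digest[fingerprint(data)].append(name)
    return sorted(sorted(names) for names in by_digest.values() if len(names) > 1)


# Hypothetical records: the same image bytes stored under different names
# in different agency systems, plus one unrelated image.
sample = {
    "dcp/greenpoint_waterfront_001.tif": b"IMAGE-A",
    "parks/GPT-WF-1987.tif": b"IMAGE-A",
    "lpc/scan_00412.tif": b"IMAGE-A",
    "parks/mccarren_pool.tif": b"IMAGE-B",
}
duplicate_groups = group_exact_duplicates(sample)
```

A byte-level digest like this catches only identical files. A re-scan or re-export of the same photograph changes the bytes and would not match.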

The Brooklyn Public Library's digital collections team, based at its Central Branch on Grand Army Plaza, has separately been working through a similar deduplication process for borough-level historical photographs donated or transferred from community organizations. Librarians there have noted that the challenge is not just identical files but near-duplicates, such as slightly different crops or scans of the same original print, which automated tools handle less reliably and which still require human review.
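One common way to approach near-duplicates is a perceptual hash, which fingerprints how an image looks instead of its bytes. The Python sketch below uses a simple "average hash" and is illustrative only. It assumes each image has already been reduced to a small grayscale grid, the pixel values are made up, and the distance threshold of 2 bits is arbitrary. It is not a description of the library's actual tooling.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Compute an average hash from a small grayscale grid.

    Each bit is 1 when the pixel is brighter than the grid's mean value.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: list[list[int]], b: list[list[int]],
                      max_distance: int = 2) -> bool:
    """Flag two grids as near-duplicates if their hashes differ in few bits."""
    return hamming_distance(average_hash(a), average_hash(b)) <= max_distance


# Hypothetical 4x4 grayscale grids (0 = black, 255 = white).
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 198, 208],
    [11, 21, 202, 212],
]
# A slightly brighter re-scan of the same print.
rescanned = [[value + 5 for value in row] for row in original]
# An unrelated image: bright on the left, dark on the right.
different = [list(reversed(row)) for row in original]

same_print = is_near_duplicate(original, rescanned)
unrelated = is_near_duplicate(original, different)
```

A uniform brightness shift leaves this hash unchanged, but a crop moves content between grid cells and can change many bits at once. That is consistent with the librarians' point that flagged near-duplicates still need a human check.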

For New Yorkers who use these archives — genealogists tracing family histories in the Bronx, architects pulling historical facade photographs for Landmarks applications, documentary filmmakers researching mid-century streetscapes in Harlem — the practical payoff of a cleaner database would be faster, more reliable search results and fewer dead-end links to redundant or mislabeled files.

The city has not yet published a formal timeline or budget for a comprehensive deduplication program across all agencies. Advocates for open government data have been pressing the Office of Technology and Innovation for a public roadmap. Until one arrives, archivists at institutions like the Municipal Archives are continuing the work file by file, one borough at a time.

About this article

Published by The Daily New York

This article was produced by The Daily New York editorial desk and covers news in New York. See our editorial standards for how we use AI.
