The Daily New York

New York's Digital Archives Are Drowning in Duplicate Images — and the Numbers Are Staggering

From the Municipal Archives on Chambers Street to the Brooklyn Public Library, institutions are confronting a data crisis hiding in plain sight.

By New York News Desk · Published 4 July 2026, 3:16 pm

Photo: jimmy teoh on Pexels

New York City's public institutions collectively store tens of millions of digital image files across networked servers, hard drives, and cloud repositories — and a growing share of that inventory is exact or near-exact duplicates consuming storage budgets, slowing retrieval systems, and fouling the metadata records that archivists depend on. The problem has a name, a measurable cost, and, increasingly, a lobby of vendors promising to fix it.

The timing matters. With the 2026 FIFA World Cup now underway across host cities including New York, city agencies and cultural institutions have been racing to digitize historical photo collections, promotional materials, and event documentation at an accelerated pace. That sprint has made the underlying duplication problem significantly worse. Archivists and records managers at several Manhattan institutions say bulk ingest operations — where thousands of files are uploaded at once without deduplication screening — can push duplicate rates in a single collection above 30 percent.

What the Data Actually Shows

Industry benchmarks published by AIIM, the information management trade association, suggest that between 20 and 40 percent of files in large unmanaged digital repositories are duplicates or near-duplicates. For a city the size of New York, running archival systems that span agencies from the Department of Records and Information Services to the New York Public Library's Digital Collections division, that range translates into a storage burden measurable in petabytes. At enterprise cloud storage rates, typically about $0.02 to $0.023 per gigabyte per month on major platforms as of mid-2026, each petabyte of duplicated files costs roughly $20,000 to $23,000 a month, so even modest duplication rates carry five-figure monthly costs at institutional scale.
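The arithmetic behind that estimate can be sketched in a few lines of Python. The 5-petabyte repository size below is a hypothetical figure chosen for illustration, not a number reported by any institution; the duplicate shares and per-gigabyte prices are the ranges cited above.

```python
def monthly_duplicate_cost(total_tb: float, duplicate_share: float,
                           usd_per_gb_month: float) -> float:
    """Monthly cost, in US dollars, of storing the duplicated share of a repository."""
    duplicate_gb = total_tb * 1000 * duplicate_share  # decimal units: 1 TB = 1,000 GB
    return duplicate_gb * usd_per_gb_month


# Hypothetical 5 PB (5,000 TB) repository, priced at the low and high ends
# of the duplicate-share and storage-price ranges.
low = monthly_duplicate_cost(5000, 0.20, 0.02)
high = monthly_duplicate_cost(5000, 0.40, 0.023)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

Under those assumptions, duplicates alone would cost roughly $20,000 to $46,000 a month, before any backup copies are counted.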

The Municipal Archives, located at 31 Chambers Street in Lower Manhattan, holds more than 2.2 million photographs according to its publicly listed collection data. The New York Public Library's digital holdings across its Stephen A. Schwarzman Building and branch network run to tens of millions of items. The Brooklyn Public Library's digitization program, based at the Central Library on Grand Army Plaza, has been expanding steadily since 2019. Each of these institutions runs its own asset management infrastructure, and none has a shared deduplication protocol with the others.

The practical consequences are not abstract. When a researcher queries an image database and receives 14 near-identical scans of the same 1940s photograph of Times Square, varying only in resolution or file format, sorting through those results wastes time and obscures which version carries the authoritative metadata. For archivists processing rights clearances, duplicate records can trigger redundant legal reviews. For IT staff managing backup cycles, every undetected duplicate is copied again in each backup retained, multiplying its storage cost.

The Tools — and the Gap Between Them

Deduplication software has existed for years. Tools using perceptual hashing, a technique that generates a fingerprint for an image from its visual content rather than its raw file bytes, can flag near-duplicates even when files have been resized, lightly cropped, or re-saved in a different format. Several platforms targeting cultural institutions, including products marketed by companies such as Preservica and Axiell, incorporate these functions. The New York City Department of Records and Information Services issued updated digital preservation guidelines in 2023, though those guidelines address file format standards more than active deduplication workflows.
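To make the idea concrete, here is a minimal sketch of one simple perceptual hash, the average hash, in plain Python. It is an illustration under stated assumptions, not the method used by any vendor named above: the image is assumed to be already decoded into rows of grayscale values from 0 to 255 and to be at least 8 by 8 pixels, and the function names are hypothetical.

```python
def average_hash(pixels: list[list[int]], size: int = 8) -> int:
    """Return a size*size-bit average hash of a grayscale image.

    `pixels` is a list of rows of 0-255 values; the image is assumed
    to be at least `size` pixels wide and tall.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(size):
        y0, y1 = r * height // size, (r + 1) * height // size
        for c in range(size):
            x0, x1 = c * width // size, (c + 1) * width // size
            # Shrink the image by averaging every pixel in this grid cell.
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        # One bit per cell: 1 if the cell is brighter than the overall mean.
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


# Synthetic test images: a 32x32 horizontal gradient, the same picture
# upscaled to 64x64 by pixel doubling, and an unrelated vertical gradient.
original = [[x * 8 for x in range(32)] for _ in range(32)]
upscaled = [[original[y // 2][x // 2] for x in range(64)] for y in range(64)]
different = [[y * 8 for _ in range(32)] for y in range(32)]

same_picture = hamming_distance(average_hash(original), average_hash(upscaled))
other_picture = hamming_distance(average_hash(original), average_hash(different))
print(same_picture, other_picture)
```

In this toy case the upscaled copy produces the same hash as the original (a distance of 0), while the unrelated image differs in 32 of the 64 bits. A real audit would flag pairs whose distance falls below a chosen threshold, and that threshold has to be tuned to the collection.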

The gap is largely procedural, not technological. Bulk upload pipelines at many institutions still lack a mandatory deduplication gate before files enter the master repository. Fixing that requires both software configuration and staff training — a combination that competes for budget with more visible infrastructure priorities.
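In its simplest form, such a gate might look like the sketch below. This version screens only for byte-identical files, using a SHA-256 digest from Python's standard hashlib module; catching resized or reformatted copies would require a perceptual hash on top of it. The IngestGate class and its in-memory index are hypothetical stand-ins for whatever an institution's asset management system actually provides.

```python
import hashlib


class IngestGate:
    """Hypothetical pre-ingest check that rejects byte-identical files."""

    def __init__(self) -> None:
        # Maps a content digest to the name of the first file accepted with it.
        self.index: dict[str, str] = {}

    def submit(self, name: str, data: bytes) -> tuple[bool, str]:
        """Return (accepted, name of the file that holds this content)."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.index:
            # Same bytes already in the repository: block the upload.
            return False, self.index[digest]
        self.index[digest] = name
        return True, name


gate = IngestGate()
first = gate.submit("scan_0001.tif", b"example image bytes")
repeat = gate.submit("scan_0001_copy.tif", b"example image bytes")
other = gate.submit("scan_0002.tif", b"different image bytes")
print(first, repeat, other)
```

Here the second upload is rejected and pointed back to scan_0001.tif, while the third, with different content, is accepted. The design choice is that the check runs before a file enters the master repository, so duplicates never need to be cleaned up afterward.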

For institutions looking to act before duplication rates climb further, records managers point to three concrete starting points: run a full perceptual hash audit on existing collections before the next major ingest, establish a single canonical-version policy for multi-format files, and require deduplication sign-off as part of any new digitization contract. Several vendors offer free audits on collections up to 500,000 files — a threshold that covers most individual New York branch or departmental repositories, if not the flagship collections. The audit alone typically surfaces enough redundant storage to justify the workflow changes that follow.
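The second of those starting points, a canonical-version policy, can be expressed as a simple ranking rule. The sketch below is one hypothetical policy, not a standard: it assumes lossless formats outrank lossy ones and that, within a format, the largest image wins. The record fields, identifiers, and format ranking are all illustrative.

```python
from dataclasses import dataclass

# Hypothetical preference order: lossless master formats outrank lossy copies.
FORMAT_RANK = {"tiff": 0, "png": 1, "jpeg": 2}


@dataclass(frozen=True)
class ImageRecord:
    identifier: str
    file_format: str
    width: int
    height: int


def pick_canonical(group: list[ImageRecord]) -> ImageRecord:
    """Choose one authoritative record from a group of duplicates."""
    return min(
        group,
        key=lambda rec: (
            # Unknown formats rank after every listed format.
            FORMAT_RANK.get(rec.file_format.lower(), len(FORMAT_RANK)),
            -(rec.width * rec.height),  # larger images win within a format
            rec.identifier,             # stable tie-breaker
        ),
    )


duplicates = [
    ImageRecord("times-square-web", "jpeg", 1024, 768),
    ImageRecord("times-square-master", "tiff", 6000, 4500),
    ImageRecord("times-square-access", "tiff", 3000, 2250),
]
canonical = pick_canonical(duplicates)
print(canonical.identifier)
```

With these sample records the rule selects times-square-master, the largest TIFF. The other versions can then be linked to it as derivatives rather than catalogued as separate items.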

About this article

Published by The Daily New York

This article was produced by The Daily New York editorial desk and covers news in New York. See our editorial standards for how we use AI.