
Archiving High Quality Pictures Using Photoprism

Introduction

I recently faced the challenge of scanning and archiving a large number of old family photos. This article describes the challenges I ran into along the way, the solution I chose, and its quirks.

Criteria

The primary goal is future-proof digitization and archiving of at least hundreds of color and black and white photographs. Given this quantity, it is appropriate to consider a secondary goal, namely the management of these photos through a media library.

This media library should, in particular, automatically generate quick previews for convenient browsing, e.g. over the web on machines with limited computing resources (using JPEG and down-scaling). At the same time, it should offer basic organization and metadata handling, such as people, dates, locations, and descriptions of what is actually in the photo. The last criterion is the ability to search this data efficiently. In the nice-to-have category we can include AI models for face recognition, object recognition, etc.

Let’s go back to the notion of future-proof, which is crucial for this solution. In this context, a future-proof solution is one that future generations will still be able to operate, i.e. one that remains usable in 50 years, perhaps in an emulated environment.

Requirements

Let’s convert the criteria defined in the previous sections into specific requirements that we can use to choose a suitable solution. First, let’s divide the task into two parts.

The scan itself and its technology

Let’s start with the hardware, i.e. the scanner itself. Quality yet affordable photo scanners nowadays allow scanning in color mode with 48-bit color depth at resolutions up to 4800 DPI, and can store the result in several formats ranging from PNG and PDF to TIFF. Given the developments in the screen market in recent years, I consider at least 2400 DPI to be a future-proof (50+ years in this context) resolution; ideally 4800 DPI if the scanning time allows it. This criterion will follow us through the rest of the solution.

The data format is also an important criterion. An absolutely essential requirement is no compression or lossless compression, to preserve the original images. Only two formats - TIFF and PNG - made the final cut. In the end I chose TIFF, which was created in 1986 as the first universal, unifying format for scanned documents and is therefore exactly suited to the desired purpose. As will become clear later, the choice of format also had a major influence on the technical design, since media-library support for TIFF is generally weaker.

  1. At least 2400 DPI, ideally 4800 DPI (this resolution is primarily used for smaller photos)
  2. Colour depth ideally 48-bit (mind that even black and white photos are scanned in colour mode!)
  3. Universal file format with no compression or lossless compression
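
For illustration, this is roughly what such a scan looks like when driven from the command line with SANE's scanimage. The option names and their availability depend on the scanner backend, so treat this as a sketch rather than a recipe:

    # 48-bit color (16 bits per channel), 4800 DPI, lossless TIFF output;
    # supported options vary by scanner/backend (check `scanimage --help`)
    scanimage --mode Color --depth 16 --resolution 4800 \
        --format=tiff > scan_0001.tiff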

Media and archiving

There are many more criteria for the software that will manage these images. Again, a future-proof solution that will be reproducible even 50 years from now is absolutely essential. This implies, without any doubt, an open-source self-hosted solution that can be built in-house, perhaps in an emulated environment.

If we were talking about a hypothetical closed-source solution, an interesting candidate would be Apple Photos, which handles face recognition very well, for example, and is strongly user-friendly overall.

However, we are now interested in an open-source solution that also fully supports TIFF. As we will see later, these criteria will be quite limiting for the selection.

  1. Open-source
  2. Full TIFF support
  3. Automatic thumbnail generation (e.g. JPEG)
  4. Manage properties (people, locations, dates, labels, tags)
  5. Duplicate search
  6. Backups (including restore)
  7. (nice-to-have) Web environment, user management
  8. (nice-to-have) AI face recognition

A very nice comparison is offered by the documentation of LibrePhotos. Unfortunately, there is no perfect solution, but one comes pretty close - Photoprism. The pitfalls of the chosen solution will be discussed in the following chapters.

Image Size

The choice of DPI and color depth has, of course, a direct influence not only on the scanning time but also on the file size. When scanning in 4800 DPI mode, a single A5-sized scan takes me up to about 25 minutes; in other cases we are around 5-10 minutes. For small photos, however, 4800 DPI is more of a necessity than a nice-to-have from my perspective.

File size is also critical. In practice, I see that when scanning in 4800 DPI mode, a single image is between 10 and 200 MB, averaging around 35 MB. This corresponds to resolutions of roughly up to 300 MP, depending of course on the physical size of the photo. For comparison, today's latest-generation iPhone shoots at up to 48 MP. So this is a category above any "ultra high resolution", and I'll point out straight away that we are hitting the technical limits of conventional desktop devices here.
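
A quick way to sanity-check what the scanner actually produced is ImageMagick's identify (assuming ImageMagick is installed; the format string below is just one possible choice):

    # print dimensions, megapixels, bit depth per channel and file size
    identify -format "%wx%h px, %[fx:w*h/1000000] MP, %z bit/channel, %b\n" scan_0001.tiff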

The very first issue is plain browsing, for which I use the native macOS apps Finder and Preview. Neither is built for this resolution from a performance standpoint: they will quietly consume up to 350 GB of memory just by viewing (no editing) and then crash. Unless it is a bug or a memory leak, it is quite obvious that no ordinary workstation can provide such computing resources nowadays. This is where the need to automatically generate JPEG previews for everyday viewing becomes apparent, so that the original TIFF is kept purely for archival purposes and for future generations.

Please note that any work with these images requires at least the following resources:

  • 8 CPUs
  • 32 GB RAM

Sadly, running such a media library on a home Raspberry Pi is simply not possible. In its case, quality storage would also be added to the requirements, as the default SD card would wear out rather quickly.

Photoprism

Installation Tips

There’s not much to add to the installation procedure; Photoprism's documentation covers it perfectly. Let me just add a short summary of my own setup.

For the installation I originally chose Docker Compose for Photoprism itself and the SQL database. Given the critical compute resource requirements, this choice may seem illogical because of the virtualization overhead. However, my original intent was to easily deploy to a custom server that would be accessible to the rest of the family over a VPN. Docker Compose thus provides an easily isolated and virtually out-of-the-box setup, including a reverse proxy.

Photoprism has the huge advantage of exposing the main folder with all the original images (called originals). I recommend mounting this folder in some easily accessible place. In my case, it sits right on my iCloud Drive, which makes at least rudimentary backups easy. Likewise, keep the import folder accessible and use it for bulk uploads; I find the web interface rather impractical for bulk uploading and hardly use it.
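
For reference, a stripped-down sketch of the kind of docker-compose.yml I mean. The MariaDB backend, passwords and host paths here are illustrative; check the current Photoprism documentation for the full recommended file:

    services:
      photoprism:
        image: photoprism/photoprism:latest
        depends_on:
          - mariadb
        ports:
          - "2342:2342"
        environment:
          PHOTOPRISM_ADMIN_PASSWORD: "change-me"
          PHOTOPRISM_DATABASE_DRIVER: "mysql"
          PHOTOPRISM_DATABASE_SERVER: "mariadb:3306"
          PHOTOPRISM_DATABASE_NAME: "photoprism"
          PHOTOPRISM_DATABASE_USER: "photoprism"
          PHOTOPRISM_DATABASE_PASSWORD: "change-me"
        volumes:
          # originals mounted from an easily accessible place (here: a folder
          # synced to iCloud Drive); import is used for bulk uploads
          - "/path/to/icloud/PhotoArchive/originals:/photoprism/originals"
          - "./import:/photoprism/import"
          - "./storage:/photoprism/storage"
      mariadb:
        image: mariadb:11
        environment:
          MARIADB_DATABASE: "photoprism"
          MARIADB_USER: "photoprism"
          MARIADB_PASSWORD: "change-me"
          MARIADB_ROOT_PASSWORD: "change-me"
        volumes:
          - "./database:/var/lib/mysql"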

By the way, Photoprism makes backups very easy, and I recommend trying them out before it is too late :-).
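
As a sketch, the index backup and restore can be driven from the built-in CLI inside the running container (flag names may differ between Photoprism releases, so verify with --help first); the originals folder itself is just plain files and can be copied with any file-level tool:

    # back up the index database and album metadata, then restore them
    docker compose exec photoprism photoprism backup -a -i
    docker compose exec photoprism photoprism restore -a -i
    # the originals are ordinary files, so e.g. rsync (or iCloud Drive sync)
    # covers the photos themselves
    rsync -a ./originals/ /path/to/backup/originals/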

Limitations

Just as we ran into limitations when working with the photos themselves, we also run into several limitations of Photoprism. Out of the box, even Photoprism cannot cope with images like these, even when sufficient computing resources are available.

The primary problem is thumbnail generation, for which Photoprism internally uses ImageMagick. This package has its own internal image size limits that cannot be set via the Photoprism interface, so there is no choice but to modify the Docker image (the preferred option) or manually adjust the ImageMagick settings inside the running container.

  • Raise the maximum file size and resolution limits in Photoprism
  • Disable dynamic (on-demand) generation of photo thumbnails
  • Edit or disable the ImageMagick internal limits (see /etc/ImageMagick-6/policy.xml)
  • Set the JPEG thumbnail quality to a maximum of 90

Otherwise, the photo thumbnail generation will fail, which can be seen in the Photoprism log.
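
In practice that means something along these lines; the variable names and policy values are illustrative and change between Photoprism and ImageMagick releases, so double-check them against the current documentation:

    # Photoprism side -- environment overrides in docker-compose.yml
    # (names/defaults vary by release; see the config options reference):
    #   PHOTOPRISM_ORIGINALS_LIMIT: "2000"    # max file size in MB
    #   PHOTOPRISM_RESOLUTION_LIMIT: "900"    # max image resolution in megapixels
    #   PHOTOPRISM_THUMB_UNCACHED: "false"    # no on-demand thumbnail generation
    #   PHOTOPRISM_JPEG_QUALITY: "90"         # thumbnail quality cap

    # ImageMagick side -- raise the resource limits in policy.xml, e.g. as an
    # extra layer in a custom Dockerfile (values sized to the available RAM):
    sed -i \
      -e 's/name="memory" value="[^"]*"/name="memory" value="8GiB"/' \
      -e 's/name="map" value="[^"]*"/name="map" value="16GiB"/' \
      -e 's/name="width" value="[^"]*"/name="width" value="64KP"/' \
      -e 's/name="height" value="[^"]*"/name="height" value="64KP"/' \
      -e 's/name="area" value="[^"]*"/name="area" value="1GP"/' \
      /etc/ImageMagick-6/policy.xml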

Conclusion

The chosen solution has proven functional and meets the defined requirements, despite the significant technical problems caused by the computational demands of such high-quality photographs. Even the nice-to-have requirements were mostly met.

The only major problem seems to be the handling of people (face tagging and recognition) in Photoprism, which does not (yet) allow manually marking people in an image. It is therefore necessary to rely solely on the AI model, which is very unreliable, even for face detection alone. In most images (about 80%), it fails to detect even a clearly visible face, and there is then no alternative way to tag that face.

Recognizing the same person across photos, even nearly identical ones, fails almost completely. It succeeds in perhaps only a few cases, and even then with a very high error rate (roughly 50%). I therefore have to rate the people-management features as inadequate, but at present I have not found a better solution meeting the same criteria. Since Photoprism is open source, it is always possible to contribute an improvement to this feature.