Archivematica is not a digital repository - it doesn't provide storage of digital files. It's used to process and package files before they go to storage. It's built on digital preservation standards. It runs on Linux. It's open source and incorporates a couple dozen open source tools.
Some of the things it does:
Processed files are then packaged (in a 7zip file). Archivematica produces an XML file for each submission that includes all of the extracted information about the files, preservation metadata, and brief descriptive metadata.
Question: Can you clarify why the newer version of the iPhone image file was flagged in Archivematica?
Answer: Archivematica runs on an older version of Linux because of the complexity of its suite of tools. For this demo, Zack just removed that file from the set and reprocessed the remaining files. In real life, he would convert files outside of Archivematica if he had to. The Library of Congress Recommended Formats web site is a good source for learning about file formats he's never worked with and about which preservation format is best.
Question: Related to the previous question. Oftentimes in the open source world, updates take a toll on the software's performance. How engaged/active is the Archivematica community? Has this been discussed? Is there a space for discussions/questions ("hey, Archivematica isn't able to work with this file type...")?
Answer: I haven't seen any community discussions about the new iPhone .heic format because it's only been out about a year, so archives probably aren't having to preserve those files yet. But the Archivematica community is very active. Sometimes my questions are unique to my workflow and I don't get any answers. But in other cases, I get a lot of answers and help from the community when it comes to file identification and customizing rules to handle different formats. Archivematica is based on so many different tools (dependencies), but they do a good job of keeping it updated on a regular basis.
Question: You said Amazon Glacier checks the health of the files. Have you ever received notification that something is wrong?
Answer: There is no reporting from Amazon. Zack periodically pulls things back and tests them, and he's never had a problem. He has experienced bit rot on our Amazon EBS server (not Glacier; think of EBS as more like Google Drive). Unlike Glacier, EBS has no fixity checking. It hasn't been a lot: 5 or 6 files out of 900,000 (newspaper files). Many cloud storage services just provide storage; they do not provide monitoring or repair. Glacier is not transparent, which is a criticism of using it (transparency is important in digital preservation), but it does offer monitoring and repair, and Amazon claims 99.999999999% durability, meaning you should get out what you put in. The monitoring is at the package level. The package is one file, and that's what is checked and, if necessary, repaired from the other copies of the package.
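For storage that does no fixity checking of its own, the kind of periodic test described above can be approximated by recording a checksum for each file and comparing it later. The following is only a minimal sketch in Python, not how Glacier or Archivematica does it; the function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's relative path (hypothetical layout) to its checksum."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(folder, manifest):
    """Re-hash the files and return (changed, missing) relative paths."""
    root = Path(folder)
    changed, missing = [], []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            missing.append(rel_path)
        elif sha256_of(target) != expected:
            changed.append(rel_path)
    return changed, missing
```

The manifest would be built once when files are first stored and kept somewhere separate from the files. Running check_fixity later lists any file whose contents have changed (possible bit rot) or that has gone missing. A check like this only detects problems; repair still depends on having another good copy.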
Question: Have any of you seen born digital content being loaded onto NY Heritage or being submitted to the Dark Archive? It's something that I've been thinking about as my organization considers born digital acquisitions, digital archives, digital preservation policies, etc.
Answer: Most of what we take into the Dark Archive was created through digitization projects. There might be some born digital PDFs and audio files. But they are likely coming. Once you all start collecting them, we'll be here to help preserve them.
Question: I was surprised to see that the JPEG in your demo set got converted to a TIFF. I remember that was the default Archivematica rule/process a few years ago. But I thought the community decided to change the rule and just duplicate to JPEG, because TIFF creates a large file and you don't gain any quality from going to TIFF (the image has already been compressed).
Answer: We kept that rule. I felt better about keeping that rule. Amazon Glacier storage is cheap enough that we can support that (and we don't take in a lot of JPEGs anyway.)
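To get a rough sense of the storage cost of the JPEG-to-TIFF rule discussed above, the size of an uncompressed image can be estimated from its pixel dimensions. This is a back-of-the-envelope sketch in Python; it assumes the TIFF is stored uncompressed as 8-bit RGB, and the example dimensions are hypothetical rather than taken from the demo set.

```python
def uncompressed_image_bytes(width, height, channels=3, bits_per_channel=8):
    """Approximate pixel-data size of an uncompressed image, ignoring headers."""
    return width * height * channels * bits_per_channel // 8


# Hypothetical 12-megapixel photo (4032 x 3024 pixels, 8-bit RGB).
size_bytes = uncompressed_image_bytes(4032, 3024)
print(f"{size_bytes / 1_000_000:.1f} MB")  # prints "36.6 MB"
```

A JPEG of the same photo is typically only a few megabytes, so the TIFF copy can be many times larger without adding any image quality.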
After Q and A, Zack mentioned that he recently set up a Network Attached Storage (NAS) device. It holds two hard drives with 1.8 TB of storage. One drive mirrors the other, so if one goes bad, there's another copy. It does check for digital rot. It's more expensive than a regular hard drive, but could be a good solution for you all locally. It connects to our network and I can see when people have changed files.
Someone set up a Synology 220J at home. Even if you're not tech savvy, it's easy to use and they walk you through everything. It does health checks. I've learned a lot about how servers work doing this at home.
Final note about Archivematica: it works much better in a local environment than in a virtual one, because you're moving a lot of data around. A local setup is also safer, because Archivematica runs on an older version of Linux.
Question: We really need to be storing files in multiple places, rather than keeping them in one place. I was assured by my IT department that our new cloud storage is secure, but I'm going to put a copy in Southeastern's Dark Archive for more peace of mind.
Future meeting topic: Web Archiving, Archive-It (and other tools). Wayback Machine can be used to make sure certain web pages are captured before they disappear.