--- blog title line (top) ---

Tuesday, November 06, 2012

How long will our documents last?

We accumulate more and more documents on our computers and on the Internet. Text, photos, music, videos. It feels as if it will all last forever. But what will we be able to read 20 years from now? Can we ensure that our data remains accessible "forever"?

The other day I was remembering some of the LP records that I used to listen to at my parents' place. I have been regularly looking for a CD edition of my favorite ones, like these oboe concertos by Heinz Holliger, but I am afraid that many of them will never be re-edited. I also still have a few favorite tracks on audio cassettes, but I no longer have a player. As I don't even know their titles, I will never find them again. They seem to be dying with the technology.

On a larger scale, administrations, and especially state administrations, have the duty to preserve information "forever", or at least for long enough. What if one day your bank or the social health care system told you that, according to their system, you do not exist? No, really, this must remain a science fiction scenario. This is even more true for national libraries, by definition.

Recently I was reading an interesting post from Luis Villa, who inherited his grandfather's autobiography. After enriching it with photographs and other documents, he wants to preserve it long enough to be able to pass it on later to his own grandchildren. So he is wondering what the right ‘Format of Forever’ is.

Don't you worry,
Digital information is forever

An analog system, like the one that plays sound recorded on audio cassettes, always includes random noise. When you play a cassette, noise is added to the signal. When you copy a cassette, the noise is recorded on the copy too. The more you re-copy, the more noise you add on top of the original signal, to the point that the noise becomes dominant. At that point the information is lost.

In contrast, in a digital system information is represented by discrete values (digits). Because it is just a number, it does not degrade, and you can make any number of exact, unaltered copies. If we can copy it indefinitely, we can preserve it forever. Isn't it wonderful?
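
To make the difference concrete, here is a small Python sketch. It is only an illustration under my own assumptions: the "recording" is a plain sine wave, each analog copy adds a little Gaussian noise (the noise level and the number of generations are arbitrary), and the digital "recording" is just a few bytes.

```python
import math
import random


def analog_copy(signal, noise_level, rng):
    """Copy a signal the analog way: every sample picks up a little random noise."""
    return [sample + rng.gauss(0.0, noise_level) for sample in signal]


def digital_copy(data):
    """Copy digital data: the copy holds exactly the same numbers."""
    return bytes(data)


def rms_error(original, copy):
    """Root-mean-square difference between two signals of equal length."""
    total = sum((a - b) ** 2 for a, b in zip(original, copy))
    return math.sqrt(total / len(original))


def copy_generations(generations=50, noise_level=0.01, seed=0):
    """Copy a copy of a copy... and report what happens to each kind of data."""
    rng = random.Random(seed)  # fixed seed, so the run is repeatable

    # Analog: each generation is made from the previous (already noisy) copy.
    original = [math.sin(2 * math.pi * i / 100) for i in range(1000)]
    analog = original
    errors = []
    for _ in range(generations):
        analog = analog_copy(analog, noise_level, rng)
        errors.append(rms_error(original, analog))

    # Digital: each generation is an exact copy of the previous one.
    digital_original = bytes(range(256))
    digital = digital_original
    for _ in range(generations):
        digital = digital_copy(digital)

    return errors, digital == digital_original


if __name__ == "__main__":
    errors, digital_intact = copy_generations()
    print("analog error after 1 copy:  ", errors[0])
    print("analog error after 50 copies:", errors[-1])
    print("digital copy still identical:", digital_intact)
```

The analog error keeps growing from one generation to the next, while the last digital copy is still byte-for-byte identical to the first.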

My digital data is safe, forever.
(photo by Renaud d'Avout, used under CC BY-SA)

With the rise of personal computers and digital data formats, many types of information have been digitized. This includes all forms of text documents, audio records, photos and videos, and many more. Our music library, our mail, our photo albums, our family videos, we have them forever.

With the recent development of online storage and so-called cloud technologies, more and more of our digital information is available from any computer. And since the smartphone boom, we can now access it from virtually anywhere, right out of our pocket.

I have a little story

At my first job (long ago), they had an old HP workstation (from even longer ago), which was running a piece of software (from longer ago still). This software was a key advantage for the business. At the time when I joined the company, HP had just dropped maintenance for this hardware. Following Murphy's law, the workstation collapsed. I will let you imagine what happened to the business.

Hardware is clearly not "of Forever". Information technology evolves at such a speed that we can see several hardware changes in a lifetime. If you want to preserve data and software over time, it must be independent from the hardware.

Actually, no: HP kindly agreed to fix the workstation. They did excellent work, very quick and very expensive. So the workstation restarted. Of course it was only temporary; we had to move to other hardware. Unfortunately, different hardware often also means different binaries. It quickly became obvious that our software would not run on any potential new platform. Without the workstation, not only were we unable to create new projects but, even worse, we could no longer access our archived projects, which we had a legal obligation to keep for 20 years. It seemed to be the end of the story.

Binary software dies with the hardware; it is not "of Forever". In the same way, if your data is written in a closed format which only binary software can read, then your data will die with the hardware.

Fortunately, the software had been developed specifically for the company, and I was told that the source code of the software should be somewhere on the workstation. With the source code, we can rebuild the software on a new platform. Bingo!

Source code is more "of Forever" than binary. When you acquire software that is essential for your long-lasting business, you must also acquire the source code. If the software is delivered only in binary form, you can't ensure that you will be able to run it "Forever".

Not quite, in fact. I did not find "the" source code, but many of them. There were a dozen folders in different places, each one containing a variation of the code. Of course there was no information to distinguish one from another. Because the software was a calculation engine certified by a national authority, it was essential to find the exact source code matching the binary that we had been running and that had been certified.

There are two ways to lose data:
  1. by losing your unique copy of the data. (the obvious way)
  2. by disseminating many copies of the data, including variations, without any control, so you no longer know which one is which. (the psychotic way)
Document variations under proper revision control are more "of Forever".
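
When you are facing a dozen uncontrolled folders like this, a first step can at least be automated: finding which copies are strictly identical. The sketch below is my own illustration, not what we did at the time; the function names are made up. It only tells you which trees have exactly the same files and contents. It cannot tell you which one matches a given binary.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def tree_digest(root):
    """Return one SHA-256 digest summarizing every file under root.

    Two folders get the same digest only if they contain the same
    relative file names with the same contents.
    """
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Mix in the relative name, then a hash of the file content.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()


def group_identical_trees(roots):
    """Group folders that are exact copies of each other."""
    groups = defaultdict(list)
    for root in roots:
        groups[tree_digest(root)].append(str(root))
    return sorted(sorted(group) for group in groups.values())
```

Calling group_identical_trees() on the list of candidate folders returns one group per distinct variation, which reduces the number of source trees left to examine by hand.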

I thought that it would be easy to sort out by compiling all the variations and comparing the new binaries to the certified one. Well, no. There was no compiler on the workstation. Without the exact same compiler, we could not rebuild the exact same binary that we had been running. My only alternative was to compile the various source trees with the compiler on the new platform and run comparative tests against the certified binary on the HP workstation.

This ended up being long and tedious work, especially as it turned out that the source code was written in a non-standard dialect of the programming language, which produced thousands of warnings and errors.

There is a virtually infinite number of ways to store information on a disk. By using standard formats, we allow different programs to deal with the same information. Standard computer languages, standard file formats in general, and especially open file formats are more "of Forever".

After dealing with the compilation errors, I conducted rigorous comparative tests on the various configurations. In the end I could identify a source tree that matched the certified binary closely enough. The new platform was now able to create new projects.

I was left to deal with the archived projects. They were stored on big tapes, which only the dying workstation could read. We had a full cupboard of tapes. As it was taking over half an hour to read one tape, it took over two weeks to read them all and transfer the data into a more modern storage medium. During the transfer, we realized that a few tapes were not readable anymore. As we had no other backup, the data was lost.

In the end, the full cupboard of tapes fitted on two CDs. This was a great improvement in access time and physical storage space. Also we were now able to make backup copies.

Storage media have a limited lifespan, and they may become obsolete with time. Storage media are not "of Forever". Migrating data to another medium may not be trivial, and it may require a significant amount of time and money.

That was it for the migration and the final switch off of the workstation. It was really time to migrate the system.


What did we learn?

Clearly, despite the fact that we can copy it indefinitely, digital data is not simply "of Forever". It involves hardware and software, and there are many possible failure points, especially due to the rapid obsolescence of electronic systems. Preserving data "Forever" requires a process that might be costly in time and resources. Moreover, the older a system gets, the harder it is to migrate.

Software is essential to process our data. If we do not have the right software, we are not able to open and process our data. So if we want to preserve data, we must also preserve the related software. Because binary software can be tied to a specific hardware platform, and because hardware platforms last only a few years, it is essential to have access to the software's source code.

Software source code often has multiple branching revisions, and our data itself may have several revisions. To avoid getting lost among revisions, both data and software must be under proper revision control.

Our data is encoded in a specific file format. It is essential that this format be clearly specified so that we can access our data "Forever". It is better if it is an open format, and best of all when Free Software exists to read and write it. A file format itself is not "of Forever" either, which is one more reason to make sure that we can migrate our data to another format.

Digital data turns out to be very fragile. It is very easy to lose our data forever. We have all experienced data loss: files erased by mistake, by faulty software, or by a hard drive crash. Regular backups are the only way to overcome data loss. And to avoid losing your backups as well, they should be stored in a remote location. You do backups, right?
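
A backup is only useful if it can actually be read back. As a hedged illustration (the function names are mine, and a real backup tool does much more), here is a small Python sketch that checks that every file of a source folder exists in the backup folder with the same content, by comparing SHA-256 checksums.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """List the source files that are missing from, or differ in, the backup.

    Returns a list of (relative path, "missing" or "differs") pairs.
    An empty list means every source file has an identical backup copy.
    """
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = src.relative_to(source_dir)
        dst = backup_dir / relative
        if not dst.is_file():
            problems.append((relative.as_posix(), "missing"))
        elif file_sha256(src) != file_sha256(dst):
            problems.append((relative.as_posix(), "differs"))
    return problems
```

Running such a check from time to time is also a way to notice that a storage medium is starting to fail while another good copy still exists.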

Our data is stored on... data storage devices, like hard drives, CDs, etc. These devices have a relatively short lifetime in years, while good paper lasts centuries. They also appear to be very fragile. As a consequence we must be prepared to regularly migrate our data to other storage devices.

One more aspect to consider, which is not covered by my little story, is the legal right to deal with both the data and the related software. It includes the legal right to use it, to copy it, and to transfer it to another person, for example to pass it down to a descendant. There may be very strong and narrow restrictions. Proprietary software often prohibits copying and transfer. In contrast, Free Software grants unlimited usage, including copying and redistribution. The license to use e-books is another example of strongly restricted rights: basically, you can't give an e-book away, even to your descendants. It's just a limited one-person usage license; you don't own anything.

To sum it up

Digital data preservation requirements

  1. Ownership
     - legal right to use the data and related software
     - legal right to give, to pass down
  2. File format
     - a clearly specified file format
     - preferably an open format when it exists
  3. Related software
     - the software that deals with the data: parsers, readers, converters, etc.
     - and their source code
     - the compilers and runtime environments (source code too)
  4. Revision control
     - if we make multiple variations of the data, they must be under proper revision control
  5. Backups
     - to overcome data loss, or the loss of its physical support
     - regular backups
     - stored in different geographical locations
  6. Data migration
     - maintain awareness
     - migrate storage media (vs. lifespan, obsolescence)
     - migrate data formats
     - may not be trivial
       - information may be missing for the new format
     - may have a significant cost
       - migration time
       - cost of physical support, online storage, etc.

We care more about what is unique

In the introduction, I mentioned Luis' project to preserve his grandfather's autobiography for his own grandchildren. Because a book is mostly text and pictures, it can easily be digitized as plain text files, together with images in a format like PNG.

Plain text is basically the most robust digital format, especially if the text is in English and therefore uses only ASCII characters. To add some formatting to the text, several lightweight markup languages have been developed in the last 10 years. Any of them satisfies the requirements for long-term preservation. Because they are light and just text, they are easy to parse, and writing a new parser is at the level of a first-year computer science student. PNG images are widely used all over the Internet, with multiple software implementations, including Free Software ones, which makes the format a safe, future-proof bet.
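
To give an idea of how little is needed, here is a sketch of such a parser in Python. Note that the markup is a made-up toy subset, not any of the existing lightweight markup languages: a line starting with "# " is a heading, and blank lines separate paragraphs.

```python
def parse_light_markup(text):
    """Parse a toy markup into a list of (kind, content) blocks.

    Rules (invented for this example):
      - a line starting with "# " is a heading;
      - consecutive non-empty lines form one paragraph;
      - a blank line ends the current paragraph.
    """
    blocks = []
    paragraph = []

    def flush():
        # Emit the paragraph collected so far, if any.
        if paragraph:
            blocks.append(("paragraph", " ".join(paragraph)))
            paragraph.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            flush()
        elif stripped.startswith("# "):
            flush()
            blocks.append(("heading", stripped[2:].strip()))
        else:
            paragraph.append(stripped)
    flush()
    return blocks


if __name__ == "__main__":
    sample = "# My life\n\nI was born\nin a small village.\n\nThen I moved."
    for kind, content in parse_light_markup(sample):
        print(kind, "->", content)
```

Even if every existing tool for such a format disappeared, the text would remain readable as it is, and a parser of this size could be rewritten in any future language.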

However, some documents also have an emotional value. This does not fit in any digital format. What we gained with infinite exact copies, we lost in emotional value. It's not THE book anymore, it's just one copy among an infinite number of possible exact clones. Digital data on a CD or a USB key carries very little emotional value, while we care a lot about a beautiful old book. We care more about what is unique.

Old books (Photo by Maguis & David, used under CC BY)

To sow dreams in youngsters' heads

Paper is a proven, long enough "of Forever" support. For such a document with emotional value, I would have it printed on good paper and bind it myself. It's a bit of work and it takes time, but the content may deserve it.

Here is my secret, said the fox, It is the time you have wasted for your rose that makes your rose so important. It is the time I have wasted for my rose said the little prince, so that he would be sure to remember.

Antoine de Saint-Exupéry, The Little Prince, Chapter XXI.

To carry on with the idea of building emotional value, I think that I would hide the book in the attic, or maybe in a chest buried in the garden (and put an old map with clues in the attic). Then I would just let the grandchildren vaguely hear about an old book telling the story of their great-grandfather, but this might just be a legend, as nobody has found it, so far.

We could also engrave it in stone, although this might be a work... "of Forever".
