Evolution of personal file management and why it is not complete

Are you 100% happy with the way you organize your personal files, how you manage backups, and how you synchronize all the devices that you own? If so, you may skip the rest of this article, but if you are at least slightly unhappy with existing solutions for personal file storage management, then I encourage you to continue reading. Although I am happy with existing synchronization, encryption, and backup solutions (no matter whether they run in the cloud, at home, or on a private server), I am still annoyed that we are forced to think in terms of different locations, and that operation and upgrade management is not as easy as it could and should be. To better understand what I mean and how datagnan - a personal file storage management platform that enters its public beta early next year - is addressing these issues, let's take a brief look at the history of personal file storage management and at the reasons behind the development and evolution of the tools that we have today.

History of personal file storage management

Once upon a time, when personal computers had just found their way into our homes, we stored our files (documents, spreadsheets, etc.) on local hard drives. At that time, it was common to have only a single personal computer at home, if any. As there was only one computer, there was no need for file synchronization. Instead, we used floppy disks either to back up our documents or to hand digital documents over to friends. Some folks also used tape drives to archive or back up their data. Over time, the size of our personal files increased, and we consequently adopted new storage media such as CD-R(W), DVD-R(W), or memory cards. With the introduction of USB 2.0 in 2000, the set of storage options was further extended by USB flash drives and portable hard drives. Yet, the usage scenario remained unchanged: such drives were and still are used to extend local storage, to back up data, to transfer data from one computer to another, or to have a single storage location that can be attached to multiple systems - though only to one at a time, and only if all systems are able to work with the filesystem used on the drive.

Of course, there were situations in which the above solutions were not applicable or simply inconvenient, for instance if you had multiple personal computers or laptops at home from which you wanted to access your files, or if you wanted to provide access to several users. The environment had grown into a network of single computers, and hence network-attached storage (NAS) devices came into play that allowed concurrent access over a local network. This introduced only two technical requirements: all client computers need to (1) understand the proper network protocol (e.g. NFS, SMB, WebDAV, or FTP), and (2) know the address under which the NAS system is reachable. With NAS, filesystem details were abstracted, and the need to configure backups on every single client computer was eliminated as well, thanks to redundant arrays of inexpensive disks (RAID) inside the NAS.

While NAS systems are a good option if you want to access your files from your local network only, they were, at the beginning of the 21st century, not suited to provide convenient access from outside of the local network, e.g. when you are not at home. This shortcoming had several reasons: the network protocols in use (SMB, NFS) were not secure enough at that time, Internet Service Providers blocked the respective ports, or it was necessary to configure port forwarding in your router. Furthermore, the upstream bandwidth of typical internet connections was very limited, and Internet Service Providers assigned dynamic IP addresses to their customers. The only alternatives left were to use USB flash drives and portable hard drives, or to upload files to a private server using FTP, SSH, or any other network protocol. Neither option is very convenient. Out of this situation, in 2007, Drew Houston started to work on the well-known service called Dropbox, which has changed the way we store and share data today. With Dropbox, the contents of a local folder are continuously monitored, and changes (file or folder creations, modifications, and deletions) are automatically mirrored to servers operated by Dropbox as well as synchronized to all your computers. The benefits: (1) you can access your files no matter where you are (as you do with webmail services), (2) you receive an automatic backup in a remote location, and (3) you can access and modify your files when you are offline - changes are synchronized as soon as you are online again. And most importantly, the term "location" was eliminated. No network drive anymore, no remembering of addresses, just a plain folder on the local hard drive. Though Dropbox was among the first providers of that kind, alternative cloud storage providers emerged - e.g. SugarSync, Box.net, Google Drive, or Microsoft OneDrive - and they provide more or less the same functionality: location abstraction, convenient synchronization, and online backup.
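To make the idea of folder monitoring a bit more concrete, here is a minimal sketch in Python. It is not how Dropbox or any other provider actually works; it only illustrates the principle of comparing two snapshots of a folder to find created, modified, and deleted files. The function names (snapshot, diff, demo) are made up for this illustration, and empty folders are ignored for brevity.

```python
import hashlib
import os
import tempfile


def snapshot(root):
    """Map each file path (relative to root) to a SHA-256 digest of its contents."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            state[os.path.relpath(full, root)] = digest
    return state


def diff(old, new):
    """Compare two snapshots and return the changed paths, grouped by kind."""
    return {
        "created": sorted(p for p in new if p not in old),
        "modified": sorted(p for p in new if p in old and old[p] != new[p]),
        "deleted": sorted(p for p in old if p not in new),
    }


def write(root, name, text):
    """Helper for the demo: (over)write a small text file inside root."""
    with open(os.path.join(root, name), "w", encoding="utf-8") as fh:
        fh.write(text)


def demo():
    """Create a throwaway folder, change it, and report what changed."""
    with tempfile.TemporaryDirectory() as root:
        write(root, "a.txt", "first version")
        write(root, "b.txt", "will be deleted")
        before = snapshot(root)

        write(root, "a.txt", "second version")   # modification
        os.remove(os.path.join(root, "b.txt"))   # deletion
        write(root, "c.txt", "brand new")        # creation
        after = snapshot(root)

        return diff(before, after)


if __name__ == "__main__":
    print(demo())
```

A real synchronization client would presumably rely on change notifications from the operating system instead of rescanning the whole folder, but the resulting list of changes is the same kind of information that then has to be mirrored to the other side.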

When personal data is stored on remote servers, questions regarding privacy pop up almost immediately, and not only since the Snowden revelations in 2013. To encrypt and protect the data stored in the cloud, new startups - e.g. BoxCryptor, CloudFogger or CryptSync - developed additional tools that ensure that all data is encrypted before it is uploaded to cloud providers. In the most convenient case, these tools intercept file system calls and perform transparent on-the-fly encryption and decryption - transparent in the sense that no user interaction is required. Some cloud storage providers implement client-side encryption themselves, e.g. Wuala (which is shutting down its services) or SpiderOak.

A different path to deal with privacy concerns was taken by existing NAS system manufacturers, e.g. Western Digital, AVM, Synology, or QNAP, and new startups such as ownCloud. They either extended the functionality of existing NAS systems or developed new platforms, all with the aim of providing functionality and usability similar to the offerings of cloud storage providers. The difference: in contrast to cloud storage providers, your files stay on your devices and you essentially run your personal cloud. The usability varies. Configuration and setup require at least basic networking knowledge, you need to perform identity and access management, some systems require that you manually set up port forwarding in your router in order to be accessible from outside, and others lack proper synchronization and support for offline access, i.e. the ability to work on your files when you are not connected to the system. And this is probably not even a comprehensive list of the limitations.

There are also software-only solutions on the market, most prominently BitTorrent Sync. With BitTorrent Sync, your files also remain on your devices, but, and this is the most significant difference to NAS systems, no central coordination unit is needed. There is also no network drive that you need to connect to, nor an IP address or hostname that you need to remember. Synchronization is based on local folders that you select, and it is even possible to distinguish between metadata and file contents. The benefit is that you can browse the complete tree of files and folders without actually having all file contents available locally.

Why today’s offerings are not solving the problem completely

At first glance, it looks as if the latest tools (ownCloud, SpiderOak, BitTorrent Sync, etc.) provide everything necessary, but do they really deliver? What has changed over the past 20 years? And which problems have just been moved to a different level? My two major comments are:

  1. We are still thinking in terms of different locations.
  2. There is still no easy operation and upgrade management.

Yes, we are still distinguishing between local hard drives, network drives, folders that are synchronized through cloud providers or peer-to-peer technologies, folders that are shared with friends, folders that are included in backups, etc. In other words, before we store a file, we think about where and how we are going to use it later on. Not that we want to think about that, but we have to, because existing solutions still expect us to. How often have you asked yourself whether you already moved a file to your Dropbox or whether it is still on your NAS or local hard drive? Or how many different root folders with pictures do you have?

Operating storage requires effort, too. Let's consider NAS systems as an example: if your NAS system (not the drives inside) gets damaged, you have to deal with the situation manually, e.g. you have to configure the replacement NAS system and transfer the existing data to it (sometimes it is sufficient to put the old hard drives into the new NAS system, but most often it's not). My mother can't handle that herself. I experienced such a situation again just recently, when I had to help a friend migrate his data from his old NAS to his new NAS. As his laptop could not mount the file system on the hard drives, we had to boot an Ubuntu system from USB before we could mount the drives and copy the data to the new NAS system. The same applies to upgrading: if you are already using all hard drive slots of your NAS, you either buy a bigger NAS with more slots, or you buy a second one that is mounted as another network drive. Which brings us back to my first comment. Can't we do better?

As Satya Nadella, CEO of Microsoft, pointed out a few weeks ago when showcasing the new Surface Pro 4 and Surface Book: "as devices come and go, you persist". Yes, devices come and go, but it is not only you who persists; your data persists as well. Hence, migration from old devices to new devices should be trivial, and not only with cloud storage solutions.

How it should be

A personal file storage management system should not bother us with the concept of different storage locations. I am aware that this concept is already deeply carved into our way of thinking and that it might take some time to get rid of it again, but I really believe that the link between “when and where we use our files” and “how we organize them in our file system” should be broken. Instead, the system should assist us in our core activity: creating, accessing and modifying personal data. Consequently, users should be able to express their access requirements, e.g. "enable offline access for this file or folder" or "pay special attention to these files, they are important to me" and not waste time on identifying and carrying out the file operations that satisfy their requirements. Preferably, the system should learn the access patterns and identify requirements automatically.
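As a rough illustration of what "expressing access requirements" could mean in practice, consider the following Python sketch. It is purely hypothetical and not the datagnan implementation: the Requirement class, its two flags, and the resolve function are names invented for this example. The assumption is that a requirement attached to a folder also applies to everything below it, unless a more specific rule exists.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Requirement:
    """Hypothetical per-path requirement, as stated by the user."""
    offline: bool = False     # "enable offline access for this file or folder"
    important: bool = False   # "pay special attention to these files"


def resolve(rules, path):
    """Return the rule of the path itself or, failing that, of its closest ancestor.

    rules maps absolute POSIX-style paths to Requirement objects.
    If no rule matches, the default (no special treatment) is returned.
    """
    p = PurePosixPath(path)
    for candidate in [p, *p.parents]:
        rule = rules.get(str(candidate))
        if rule is not None:
            return rule
    return Requirement()


if __name__ == "__main__":
    rules = {
        "/photos": Requirement(important=True),
        "/photos/2015": Requirement(offline=True, important=True),
    }
    for path in ["/photos/2015/beach.jpg", "/photos/2014/snow.jpg", "/music/song.mp3"]:
        print(path, resolve(rules, path))
```

The point of the sketch is the division of labor: the user only states what matters ("offline", "important"), and the system derives from these statements which file operations have to happen on which device.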

Operation management should be easy and comprehensible for everybody. Adding (or removing) storage devices should be possible on the fly and as simple as adding (or removing) a LEGO brick to (or from) a LEGO assembly. The same requirement applies to the replacement of storage systems, which effectively translates to fault tolerance and to having no single coordination unit that can destroy the whole setup once it is damaged or no longer working.
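One way to picture the "LEGO brick" requirement is a placement rule that every device can evaluate on its own, without a coordinator. The following Python sketch uses rendezvous hashing (also known as highest-random-weight hashing) for that purpose. This is a technique picked for illustration only, not necessarily what datagnan does; the device names and the default of two replicas are assumptions as well.

```python
import hashlib


def score(device, path):
    """Deterministic pseudo-random weight for a (device, file) pair."""
    digest = hashlib.sha256(f"{device}|{path}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place(devices, path, replicas=2):
    """Pick the devices that should hold a copy of the file.

    Every device ranks all known devices by their score for this file and
    keeps the top entries. No central unit is needed, because the ranking
    depends only on the device names and the file path.
    """
    ranked = sorted(devices, key=lambda d: score(d, path), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    devices = ["laptop", "desktop", "nas", "phone"]
    path = "/documents/tax-2015.pdf"
    holders = place(devices, path)
    print("copies on:", holders)

    # Simulate the loss of the first holder: the surviving copy stays where
    # it is, and only one new copy has to be created on another device.
    remaining = [d for d in devices if d != holders[0]]
    print("after removal:", place(remaining, path))
```

The useful property here is that removing a device never reshuffles the copies held by the remaining devices: the relative ranking of the survivors does not change, so only the lost copies have to be recreated. Adding a device behaves the same way in reverse.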

And last but not least, the file storage management system should be compatible with legacy applications that expect data to be stored on local filesystems. Hence, the system should operate transparently at the local file system level instead of introducing new and exotic APIs.

Conclusion

Two years ago I started to think about what the file storage management system sketched above could look like on a technical level. Since the beginning of 2014, I have been evaluating the different aspects and technology stacks that need to be mastered in order to implement the system, amongst others: peer-to-peer synchronization algorithms, automatic network discovery methods, how to inject non-existing files and folders into the local file system on major operating system platforms, identity and access management, policy evaluation and management, and smart identification of differences in binary files. Today, after nearly two years of design and implementation, I am happy to say that I am close to releasing a public beta of the personal file storage management system named "datagnan" - pronounced like the famous d'Artagnan from the historical novel "The Three Musketeers" by Alexandre Dumas.

If I have caught your interest, subscribe to the newsletter and help spread the word by sharing this article or the project with others, in the spirit of the motto "one for all, all for one".

     
