Ending the Confusion - A Simple Explanation of Data Deduplication


Data deduplication, or is it de-duplication? The very fact that my spell checker flags it every time I type it is a hint that the term needs some clarification.

It goes by many names: commonality factoring, single instance store, common file elimination, referential integrity, and the list goes on. What do all of these terms have in common?

Two things come to mind. First, none of them set off alarms in my word processor, which means they can't be used in a technical discussion. When we invent new technology, we need to invent new words for it. Second, at a high level, they all refer to various methods to reduce the amount of data being stored.

The data reduction field is still struggling with many terms and a lack of standardization. I will define the basic degrees of deduplication available today and their advantages as they relate to data backup.

Full Backup

A full backup is a complete copy of the data for every file and every server, every time you run a backup. This method is used because it is straightforward and simple. If you want to recover data from four days ago, you retrieve that tape (or tapes) and begin your restore; the restore process involves one tape or one set of tapes. The drawback is that it requires the longest backup window of all the methods.
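To make the comparison concrete, here is a minimal Python sketch of what a full backup amounts to: every file is copied on every run, whether or not it has changed. The paths and the function name are purely illustrative and not taken from any real backup product.

```python
import shutil
from pathlib import Path

def full_backup(source_dir: str, backup_dir: str) -> int:
    """Copy every file, every time: simple, but the longest backup window."""
    copied = 0
    for src in Path(source_dir).rglob("*"):
        if src.is_file():
            dest = Path(backup_dir) / src.relative_to(source_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)  # recreate the directory tree
            shutil.copy2(src, dest)                          # copy data and timestamps
            copied += 1
    return copied

# Example (hypothetical paths): full_backup("/data", "/backups/full-sunday")
```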

As backup administrators run up against shrinking backup windows, they usually move to differential backups.

Differential Backup

This is where you take a weekly or master backup, and each successive backup after that includes all the changes since the master. That means as the week goes on, you are backing up more and more data each night to capture all of the changes. When it is time to recover, you only need your master and the differential for the day you are recovering to, so a Friday recovery requires only two tapes. This is a basic level of deduplication, introduced with tape decades ago.
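Continuing the illustrative Python, a differential backup simply selects every file whose modification time is newer than the weekly master backup, which is why each nightly run grows as the week goes on. Real products track changes in their own catalogs; the timestamp comparison here is just an assumption for the sketch.

```python
from pathlib import Path

def differential_candidates(source_dir: str, master_backup_time: float) -> list[Path]:
    """Select every file changed since the weekly master backup.

    Every nightly run compares against the same master timestamp,
    so Thursday's differential is larger than Monday's.
    """
    return [
        path for path in Path(source_dir).rglob("*")
        if path.is_file() and path.stat().st_mtime > master_backup_time
    ]
```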

Incremental Backup

Differential backups may cause your Thursday night backup windows to be too long. That is where tape-based incremental backups come in. After your master, each tape stores only the files that have changed since the last backup. The trade-off here is recovery: if you are recovering Friday's backup, you need to take the master, let's say Sunday's, and then apply each tape in order until you reach Friday to complete a full restore. This deduplicates more data than the differential approach, but it can be very time-consuming and is further complicated when tapes are taken off-site each night.
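The incremental variant uses almost the same selection rule, except that each run compares against the previous night's backup rather than the master, and that is exactly why a restore has to replay the whole chain in order. A minimal sketch, with a plain dictionary standing in for each tape:

```python
from pathlib import Path

def incremental_candidates(source_dir: str, last_backup_time: float) -> list[Path]:
    """Select only the files changed since the previous backup (master or incremental)."""
    return [
        path for path in Path(source_dir).rglob("*")
        if path.is_file() and path.stat().st_mtime > last_backup_time
    ]

def restore_from_chain(master: dict[str, bytes], incrementals: list[dict[str, bytes]]) -> dict[str, bytes]:
    """Rebuild Friday's state: start from Sunday's master, then apply every
    nightly incremental in order; each "tape" in the chain is required."""
    state = dict(master)
    for nightly in incrementals:   # oldest first
        state.update(nightly)
    return state
```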

Deduplication Backup

Enter the various levels of disk-based deduplication technology. At a basic level, they can be broken down into three types: single instance, block level and byte level. Some of the earliest deduplication technologies came out of the wide area file services segment, where reducing the data sent allowed better use of limited bandwidth and minimized the expense of wide area networks.

Single Instance Deduplication Backup

An email application example is the best way to describe single instance deduplication. If you email a 10 MB attachment to 100 employees in your company, that could equate to 1,000 MB of data. In this case, a single instance of the attached file is stored, and the other recipients receive the email with a "pointer" to that file. This reduces the storage on the email server to the original 10 MB. Now let's say everyone loves the file and they all save it in their personal directory on a server. You now have 1,000 MB on your file server. A backup solution that uses single instance or common file elimination will store only a single copy of that file, which reduces your backup window, network traffic and backup storage from 1,000 MB down to a single 10 MB copy. All other references to that file are stored as pointers. They are still available for recovery, but use almost no storage.

This method also eliminates the need to back up common files, like operating systems and applications, when doing full server backups. And since I am using a disk-to-disk example, the "which tape has which version" question goes away; versions are tracked in the metadata of the backup solution.
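A minimal sketch of the single-instance idea: file contents are hashed, one copy of each unique content is stored, and every other occurrence is kept only as a pointer to that stored copy. The class and method names are hypothetical; real products add indexing, verification and expiry on top of this.

```python
import hashlib

class SingleInstanceStore:
    """Store one copy of each unique file; every other reference is a pointer."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}     # content hash -> file data, stored once
        self.pointers: dict[str, str] = {}    # file path    -> content hash

    def backup_file(self, path: str, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:          # first time this content is seen
            self.blobs[digest] = data
        self.pointers[path] = digest          # duplicates cost only a pointer

    def restore_file(self, path: str) -> bytes:
        return self.blobs[self.pointers[path]]

# 100 users saving the same 10 MB file consume roughly 10 MB of backup storage, not 1,000 MB.
```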

Block-Level Deduplication Backup

Block-level data deduplication computes a unique hash value for every block of data. Block size varies by application, ranging from 4 KB to 56 KB. Some applications use a fixed block size while others use variable block sizing. Generally, a smaller block size will find more commonality and reduce the data by a greater amount. Block-level deduplication can also be applied globally across many backup sets.

The tradeoff of a smaller block size is greater processing and I/O overhead. Breaking 100 GB of data into 8 KB blocks produces roughly 12 million chunks of data, and reconstructing all of those chunks slows restore times. Some technologies have configuration settings that increase restore performance by producing "sub-masters" within the data set, which allows for faster restores but requires additional storage. You must consider this additional overhead when evaluating deduplication methods.
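Here is a rough sketch of fixed-size block-level deduplication, assuming an 8 KB block size: each block is hashed, only blocks with previously unseen hashes are stored, and a "recipe" of hashes is kept so the data can be rebuilt. It also makes the chunk-count overhead visible, since a restore has to fetch and reassemble every chunk in the recipe.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # 8 KB; smaller blocks find more duplicates but create more chunks

def dedupe_blocks(data: bytes, store: dict[str, bytes]) -> list[str]:
    """Split data into fixed-size blocks, store each unique block once,
    and return the recipe of hashes needed to rebuild the data."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:    # only previously unseen blocks consume storage
            store[digest] = block
        recipe.append(digest)
    return recipe

def reassemble(recipe: list[str], store: dict[str, bytes]) -> bytes:
    """A restore must fetch and concatenate every chunk, the I/O cost of small blocks."""
    return b"".join(store[digest] for digest in recipe)

# 100 GB at 8 KB per block is roughly 12.5 million chunks to hash, index and reassemble.
```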

Byte-Level Deduplication Backup

Byte-level data deduplication performs a byte-by-byte comparison of the current data stream against the bytes it has seen before. This is a much more accurate comparison and finds much more commonality in the data sets. Most byte-level deduplication approaches are also content aware, meaning the solution is engineered specifically to understand the backup application's data stream so it can identify information like file name, file type and date/time stamp.

Because comparisons at this level are resource intensive, the deduplication is usually done after the backup occurs (called post-processing) rather than in-line, which is the norm with block-level deduplication. This means backups complete at full disk performance, but require additional storage to cache the backups while they are processed. In addition, the byte-level deduplication process is usually limited to a single backup set and is not generally applied globally across backup sets.

In many cases, byte-level technology keeps the most recent generation as a master, which significantly improves restore times, since ninety percent of all restores are of the most recent generation. Restore time is an important factor when evaluating any backup solution.
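As a much-simplified sketch of that post-processing flow: the nightly backup lands on disk at full speed, and a later pass compares it byte for byte against the most recent generation (kept as the master), retaining only the ranges that differ. Real content-aware engines parse the backup application's stream format; this example captures only the compare-after-the-fact idea.

```python
def byte_level_delta(master: bytes, new_backup: bytes) -> list[tuple[int, bytes]]:
    """Compare the new generation byte-by-byte against the most recent master
    and keep only the ranges that differ (offset, replacement bytes)."""
    deltas: list[tuple[int, bytes]] = []
    start = None
    for i in range(max(len(master), len(new_backup))):
        old = master[i] if i < len(master) else None
        new = new_backup[i] if i < len(new_backup) else None
        if old != new:
            if start is None:
                start = i                                   # a differing run begins
        elif start is not None:
            deltas.append((start, new_backup[start:i]))     # a differing run ends
            start = None
    if start is not None:
        deltas.append((start, new_backup[start:]))
    return deltas

# The newest backup stays intact as the master, so the most common restore
# (the latest generation) needs no reconstruction at all.
```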

All vendors have their own approach to deduplication. Some use storage appliances, others offer software-only solutions, and still others provide complete end-to-end replacements for existing backup solutions.

One thing is for sure: without data deduplication, remote backup and automated business continuity solutions across corporate and public wide area networks would not be a reality today.
