Back in January of this year, we published a blog post titled Quarantine your Stale Data about the need to quarantine stale (or grey) data. In it, we described a conversation with Alan Daley, Research Director of Gartner Research, about the problems his clients were having managing stale data – files that, for whatever reason, become less valuable to the end user over time.
Most organizations eventually retire (shut down) aging business applications for several reasons, including cost reduction, application consolidation, risk reduction, and new regulatory and eDiscovery requirements. However, one question is not usually addressed early on: what should be done with the associated application data?
As I discussed in Part 1 of this blog series, approximately 30% of your enterprise data is grey data: unstructured data that is low-touch or abandoned but that, for various reasons, your legal department is not willing to dispose of. This data can be content from departed employees, data that has aged beyond the standard retention periods but still needs to be retained due to extenuating circumstances, eDiscovery data sets from past cases, or content considered corporate history. The question is: how do you distinguish grey data from truly valueless data?
First Tackle the Obvious
Referring back to the CGOC numbers from Part 1, you should be able to quickly determine what data is subject to legal hold, is subject to regulatory retention, or has obvious business value, based on how your organization handles data generally. The difficult part is separating the grey data from the valueless data. In Part 1, I suggested that culling for valueless data is not the best strategy. Let me clarify: culling only for obviously valueless data is not a best practice.
To begin the culling process, first concentrate on those files that are obviously valueless such as:
- Duplicate files: There can be large numbers of duplicates in the file shares, document repositories, and PSTs spread around the enterprise.
- Revisions: A final document is often preceded by several revisions. The revisions usually include structural changes, edits, added content, and comments. The question is: are the revisions important when determining value? For aging files, the answer in most cases is no.
- Aging backups: Backups of both desktops and servers/storage beyond a certain age are almost always valueless. Ask yourself the following question: what could I possibly do with an email system backup from seven years ago? In reality, backups are for disaster recovery purposes and should only be kept for short periods of time, e.g. 3 months; otherwise they become useless.
- Aging system files and system reports: Again, what value does a system report from 3 years ago have?
- Non-business-related or personal MP3s and video files: These files can take up large amounts of enterprise storage. Send an email to employees saying that they have two weeks to move these files off company assets, and that at the end of the two weeks all files matching these profiles will be deleted.
This is not an exhaustive list, but you get the idea; use common sense here.
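As a rough illustration of the first bullet, here is a minimal Python sketch of one way to find duplicate files on a share by comparing content hashes. The function names (`file_digest`, `find_duplicates`) and the choice of SHA-256 are my own assumptions for the example, not part of any particular product; it only flags byte-identical files and will not catch near-duplicates or revisions.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> List[List[Path]]:
    """Group files under `root` whose contents are byte-identical."""
    # Cheap first pass: only files of the same size can be duplicates.
    by_size: Dict[int, List[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    # Second pass: hash only the files that share a size with another file.
    by_digest: Dict[str, List[Path]] = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        for path in same_size:
            by_digest[file_digest(path)].append(path)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Calling `find_duplicates(Path("/path/to/share"))` (a placeholder path) returns one list per set of identical files. Deciding which copy to keep is still a policy question, so treat the output as a review list rather than a delete list.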
The Not So Obvious
The next step is to create a policy for determining which low-touch or grey data, out of the vast majority of unstructured data in the enterprise, still rises to the level of retention. After disposing of the obvious, begin culling on other data points such as:
- Last accessed date: If data is new or relatively new, then it most likely belongs to current employees and still has a relatively high probability of review/reference (refer to the Lifecycle of Grey Data blog). It’s never a good strategy to delete relatively new content without the owner’s knowledge. Employees can waste huge amounts of productive time searching for a file they are sure they created just a month ago.
- Target Custodians: For both legal and historical reasons, companies should develop a list of employees whose data will not be culled or deleted for any reason; for example, the CEO, the General Counsel (GC), or specific engineers developing IP.
- Departed Employees: Data from departed employees, such as mailbox content, email archive content, file system content, cloud data, and data from their workstations, should be collected and held for a period of time defined by corporate legal. This data can be instrumental if wrongful termination lawsuits are later filed. It is most easily collected as the employee is actually leaving the company or shortly after.
- Author-less Content: On rare occasions, data files will not have an easily discernible author. In this event, keyword filtering can help determine content value.
- PSTs: Again, the ownership of a PST can sometimes be difficult to determine. Opening the PST (if it’s not password protected) can help you quickly establish ownership.
The above bullets are the most productive culling points, but many others can exist depending on your specific industry.
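The culling points above can be sketched as a simple rule chain. This is only an illustration under stated assumptions: the `FileRecord` fields, the category labels, and the 365-day "recent" threshold are all hypothetical and would come from your own policy, and the order in which the rules fire is a design choice, not a standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Set


@dataclass(frozen=True)
class FileRecord:
    path: str
    owner: Optional[str]   # None when no author can be determined
    last_accessed: date


def classify(
    record: FileRecord,
    today: date,
    target_custodians: Set[str],
    departed_employees: Set[str],
    recent_days: int = 365,  # hypothetical threshold; set it by policy
) -> str:
    """Return a hypothetical handling category for one file."""
    if record.owner is None:
        return "keyword-review"      # author-less content: filter by keyword
    if record.owner in target_custodians:
        return "keep"                # never culled, e.g. CEO or GC
    if record.owner in departed_employees:
        return "hold-departed"       # held for the period legal defines
    if today - record.last_accessed <= timedelta(days=recent_days):
        return "keep"                # recently touched, likely still in use
    return "grey-candidate"          # low-touch data to review for retention
```

For example, a file owned by a current employee and last opened five years ago falls through every rule and comes out as `"grey-candidate"`, while anything owned by someone on the target custodian list is kept regardless of age.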
Next Steps After Categorization – Store It
So what should you do with this grey data after you have finished the filtering/categorization? Obviously, you began the process in order to keep it. The questions are: for how long and where?
You should develop a policy for handling grey data. First, create high-water-mark retention periods, meaning the longest period any applicable requirement could demand; for example, your local statute of limitations for employee wrongful termination lawsuits.
Second, establish a secure, low-cost repository that can be managed and searched when needed. This repository should also include in-place legal hold and retention/disposition functionality so that this grey data can eventually be disposed of.
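To make the high-water-mark idea concrete, here is a minimal sketch of the arithmetic, assuming retention periods are expressed in whole years. The function names and the periods used in the usage note are hypothetical placeholders; your legal department supplies the real values.

```python
from datetime import date
from typing import Dict


def high_water_mark_years(periods: Dict[str, int]) -> int:
    """The longest applicable retention period wins."""
    return max(periods.values())


def disposition_date(held_since: date, retention_years: int) -> date:
    """Earliest date the data may be disposed of."""
    target_year = held_since.year + retention_years
    try:
        return held_since.replace(year=target_year)
    except ValueError:
        # February 29 in a non-leap target year: fall back to February 28.
        return held_since.replace(year=target_year, day=28)


def may_dispose(
    held_since: date, retention_years: int, today: date, on_legal_hold: bool
) -> bool:
    """A legal hold always blocks disposition, whatever the dates say."""
    if on_legal_hold:
        return False
    return today >= disposition_date(held_since, retention_years)
```

With hypothetical periods of `{"wrongful_termination": 3, "tax": 7}`, the high-water mark is 7 years, so data held since 1 January 2015 becomes eligible for disposal on 1 January 2022, unless it is on legal hold.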
Microsoft Azure as the Managed Grey Data Repository
Archive2Azure is Archive360’s Compliance Storage Solution for long-term storage and management of unstructured grey data in the Microsoft Azure platform. The Archive2Azure solution leverages Microsoft Azure’s low-cost ‘cool’ storage as an alternative to expensive on-premises enterprise storage. Azure costs as little as $0.02 per GB per month and eliminates all the expensive overhead costs of traditional on-premises storage.
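For a back-of-the-envelope feel for that rate, the sketch below multiplies a data volume by the $0.02 per GB per month figure quoted above. The function names and the 50 TB example volume are assumptions for illustration only; the rate is the one stated in this post, not a current price quote, and the sketch ignores any other charges.

```python
def monthly_storage_cost(size_gb: float, rate_per_gb: float = 0.02) -> float:
    """Monthly cost in dollars at a flat per-GB rate (default: rate quoted above)."""
    if size_gb < 0 or rate_per_gb < 0:
        raise ValueError("size and rate must be non-negative")
    return round(size_gb * rate_per_gb, 2)


def annual_storage_cost(size_gb: float, rate_per_gb: float = 0.02) -> float:
    """Twelve months at the same flat rate."""
    return round(12 * size_gb * rate_per_gb, 2)
```

As an example, 50 TB of grey data (taken as 50,000 GB) works out to $1,000 per month, or $12,000 per year, at the quoted rate.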
Archive2Azure also provides automated retention, indexing on demand, encryption, search, review, and production – all important components of a low-cost, searchable storage solution. Given the clear cost advantages of the Azure cloud, it’s no surprise many companies are looking to Azure and Archive2Azure for grey data management and storage.
Yesterday I conducted a webinar for the Association of Legal Administrators titled Tomorrow’s Information Governance. One of the questions I received was about determining what is actually grey data, what should be kept and what is truly valueless data that should be disposed of. I thought it was a great question so I will address it here.
In recent months I have been struck by the explosion of high performance storage solutions - Nutanix, Pure Storage, Nimble, Simplivity, and Datrium just to name a few. These new storage solutions are pushing the performance envelope for the modern data center and virtual server hosts. What I find very interesting is the corresponding impact on actual storage expectations.
As corporate data continues to pile up within the enterprise, a frequently asked question, at least around the IT water cooler, is why all of this data is accumulating instead of being deleted. Employees create, send, and receive approximately 20 MB of data per day. The vast majority of this data is retained because employees feel that they will need to reuse or reference it at a later date, so it accumulates on local storage, on file shares, in the email system and archive, and lately, in employee corporate and private clouds (Figure 1). In fact, 70 to 80% of corporate unstructured data is unindexed, unmanaged, and invisible to IT.
Organizations habitually over-retain information, especially unstructured electronic information, for many reasons. However, many organizations simply have not addressed what to do with this data, so they fall back on relying on individual employees to decide what should be kept, for how long, and what should be disposed of. On the opposite end of the grey data management spectrum, a minority of organizations have tried centralized enterprise content management systems and have found them difficult to use. In these cases, employees find ways around these complex systems by keeping huge amounts of data locally on their workstations, on enterprise file shares, on removable media, in cloud accounts, or on rogue SharePoint sites that are used as “data dumps” with little or no records management or IT supervision. Much of this information is transitory, expired, or of questionable business value. Because of this lack of active management, information continues to accumulate. This build-up raises the cost of storage as well as the risk associated with eDiscovery. In some cases, the company’s General Counsel actively stops grey data “clean up” processes for fear of being accused of destroying evidence in a future case.
It may surprise you to know that 10%, 20%, or even 30% of enterprise data can be classified as grey or inactive data, mainly from ex-employees, that has accumulated over the years as employees left the company. Many organizations simply haven’t defined policies around what to do with departing employee data. A minority of organizations will remove the hard disk from the departing employee’s laptop or desktop computer and place it in a cabinet for a year or longer. This policy is usually driven by the corporate legal department just in case the employee later files a wrongful termination lawsuit. This process is an attempt to address the issue but doesn’t really take into account the other possible data repositories where employee data can reside including file systems, email systems, email archives, removable media, or cloud repositories.