For centuries, records and information managers have had to rely on end users to take the first three steps in information governance:
- Decide whether a document should be retained
- Decide how long it should be kept (retention period)
- Move the document somewhere for safekeeping and management
Over the last 15 to 20 years, many companies have marketed and sold “records management systems” that would supposedly make information management much easier. However, these systems didn’t address the three steps above: they still relied on end users to initiate the process and to decide on the importance of the content.
And now, with the increasing volume and velocity of electronic data being created and received in modern organizations of all sizes, end users don’t have enough time to make intelligent decisions on all of the information they come into contact with on a daily basis. The universal truth is that corporate information has far surpassed our ability to manage it.
Even today, organizations try to manage just their regulatory and legal records, which amount to approximately 6% of total corporate information. The rest they leave to end users, which, given human nature and the sheer amount of data involved, means it is not really managed at all and becomes unmanaged, low-touch, or inactive data.
The holy grail for information governance has always been error-free, intelligent automation that takes end users out of the process.
At the Microsoft Inspire first-day keynote, Microsoft’s CEO, Satya Nadella, talked about intelligent automation to begin addressing information governance issues. (In my opinion, the weakest link is the reliance on end users.) He spoke about intelligent cloud platforms, intelligent archiving, and predictive intelligence that addresses system issues before they occur. Another huge target for predictive intelligence is to anticipate and make decisions on content: is it subject to regulatory compliance, should it be kept, for how long, in what location, are there geographic regulations limiting its location and access, is there a litigation hold, and so on. Automating those decisions would relieve end users of that responsibility.
Machine Learning and Predictive Coding – the First Step
Years ago, I worked in the eDiscovery industry where we successfully established predictive coding as a time-saving and cost-reducing technology to automate the process of first pass culling and review of eDiscovery data sets.
Before predictive coding, companies gathered huge amounts of potentially relevant documents based on simple keywords, and then paid teams of lawyers and paralegals to read and make a decision on each document, which in many cases totaled millions or tens of millions of documents. As you can imagine, this process drove the cost of eDiscovery up. In fact, several years ago, the average cost of a single eDiscovery was approximately $1.5 million - not including the actual trial or judgments.
Supervised machine learning (the most common type of machine learning in use) enabled those collecting eDiscovery result sets to train computers to recognize relevant content and “meaning” based on examples supplied to them. The process included iterative training cycles that gave the system feedback on its error rates: which of the documents it marked as responsive were correct and which were wrong. A project could involve 2, 10, 30, 50, or more training cycles; usually, the more cycles, the lower the error rate. For eDiscovery you wanted an error rate of less than 2%, as opposed to manual culling, which could average 20% to 50%. Once practitioners could demonstrate mathematically low, consistent error rates, courts finally began to accept predictive coding as a reliable and acceptable tool for eDiscovery.
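To make the training-cycle idea concrete, here is a minimal sketch in Python. It is not how any commercial predictive coding product works; it assumes a toy perceptron-style word-weight model, and the sample documents, labels, and function names (`tokenize`, `score`, `train_cycles`) are all hypothetical. In each cycle the model codes every document, the reviewer's labels expose its mistakes, and the word weights are nudged accordingly. For brevity, the error rate is measured on the same documents used for training; a real review would measure it on a separate sample.

```python
from collections import defaultdict

# Hypothetical reviewer-coded sample: (text, is_responsive)
LABELED_DOCS = [
    ("contract breach damages claim", True),
    ("lunch menu friday", False),
    ("breach of contract notice", True),
    ("team lunch schedule claim", False),
    ("damages claim settlement", True),
    ("friday parking schedule", False),
]


def tokenize(text):
    """Split a document into lowercase word tokens."""
    return text.lower().split()


def score(weights, text):
    """Sum the learned weights of the words in a document."""
    return sum(weights[token] for token in tokenize(text))


def train_cycles(labeled_docs, cycles):
    """Run a fixed number of feedback cycles and record each cycle's error rate."""
    weights = defaultdict(int)  # every word starts with weight 0
    error_rates = []
    for _ in range(cycles):
        # 1. The model codes every document using its current weights.
        mistakes = [
            (text, label)
            for text, label in labeled_docs
            if (score(weights, text) > 0) != label
        ]
        # 2. The reviewer's labels tell us how often it was wrong.
        error_rates.append(len(mistakes) / len(labeled_docs))
        # 3. Feedback: push word weights toward the correct coding.
        for text, label in mistakes:
            step = 1 if label else -1
            for token in tokenize(text):
                weights[token] += step
    return weights, error_rates


weights, error_rates = train_cycles(LABELED_DOCS, cycles=4)
print(error_rates)
```

On this toy data the recorded error rate falls from cycle to cycle until the model agrees with the reviewer on every document, which is the pattern the iterative cycles are meant to produce.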
The Next Step – Predictive Information Governance (PIG)
There were a few companies that offered a semi-automated version of this content intelligence, relying on black-box algorithms and massive software installations. They were relatively successful in recognizing and categorizing documents correctly but still relied on individuals training the software each time, and they were hugely expensive. The key to truly automating not just records management but information governance is unsupervised machine learning and the cloud.
Unsupervised Machine Learning Takes the Human Factor Out
Just as the name implies, unsupervised machine learning (computers teaching themselves) takes the iterative training cycles out of the process and allows the system to automatically categorize, store, apply retention/disposition to, and manage content as it flows through the system. Soon, once the error rates are low enough, truly automated predictive information governance can be realized.
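As a rough illustration of categorizing content without training examples, the sketch below groups documents purely by how many words they share. It is a deliberately simple stand-in (greedy clustering on Jaccard similarity), not the method used by Azure or any archiving product; the sample documents, the `0.3` threshold, and the function names are assumptions made for illustration.

```python
# Hypothetical unlabeled documents arriving in the system
DOCUMENTS = [
    "invoice payment due march",
    "invoice payment overdue reminder",
    "employee onboarding checklist",
    "new employee onboarding schedule",
]


def tokens(text):
    """Represent a document as the set of its lowercase words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(documents, threshold=0.3):
    """Greedily group documents; the first document in a group acts as its seed."""
    clusters = []
    for doc in documents:
        for group in clusters:
            # Join the first existing group whose seed is similar enough.
            if jaccard(tokens(doc), tokens(group[0])) >= threshold:
                group.append(doc)
                break
        else:
            # No similar group found, so this document starts a new one.
            clusters.append([doc])
    return clusters


clusters = cluster(DOCUMENTS)
for number, group in enumerate(clusters, start=1):
    print(number, group)
```

No one told the system which documents are invoices and which are HR material; the groups emerge from the content itself. Assigning a retention policy to each group would be a separate step.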
The Cloud Spreads the Cost
Logically, a public cloud platform that could provide this machine learning/predictive technology to all would spread the cost of the technology and enable its use by everyone for a small fraction of the cost of an on-premises system.
Microsoft’s Cloud and Azure services are bringing the information governance industry much closer to the holy grail – Predictive Information Governance. As part of Azure’s services, machine learning is available to help organizations build advanced analytics and self-adapting security, among other things, and in the future (I hope) automated data categorization and governance.
Archive360 and Microsoft
Archive2Azure, the first native compliant information governance/archiving solution built on top of Azure, is perfectly positioned to take advantage of this new technology. Call us to talk about the future of cloud archiving and information governance.
To learn more about supervised versus unsupervised machine learning, check out the Forbes article: Supervised V Unsupervised Machine Learning -- What's The Difference?