A version of this post appeared on CMSWire.
One thing that comes through loud and clear in discussions with clients is that people are puzzled about how to address their shared drive mess without massive amounts of human effort.
We’ve been hearing for years how so-called auto-classification software was going to be the silver bullet for our information management woes, and it has been far from that. But rather than focusing on the complexities of auto-classification tools and techniques, let's discuss some of the most basic tools and techniques for beginning to address your shared drive mess. In most cases, this will get you reductions of 20 to 30 percent, but it can get you as much as 70 to 80 percent, depending on your organization’s information management practices.
Before we get into the details of cleaning up shared drive content using software tools, let’s clear the air about the “auto” in auto-classification. It’s a misnomer that implies no human effort is involved -- that we just have to point the tool at our content, and it will figure out what it is while we kick back and have a bagel and coffee with our feet up.
In reality, these tools require varying levels of human involvement to be successful, because while they can accurately and rapidly tell you all sorts of things about your content, actually acting on this information requires knowledge of the wider context (business, legal, compliance, etc.) of the content.
In this way, auto-classification tools are less like a Roomba (i.e. turn them on and they get to work, no matter what the space is like that you put them in) and more like a robot that makes car parts (i.e. although they come out of the box with a set of capabilities, they need to be extensively trained in order to apply those capabilities to successfully making a given part).
With this narrower (and more accurate) meaning of “auto” in mind, let’s take a look at one approach to using these tools to begin cleaning up your shared drive mess.
The First 100 Pounds
The approach I’m going to present here is like losing the first 100 pounds for someone grossly overweight: some very basic changes (e.g. not drinking soda, walking 20 minutes a day) can have a dramatic result -- much more so than the kinds of activity required to lose the last 5 pounds, which can refuse to come off even if you’re training for a marathon.
In terms of shared drive content, “losing the first 100 pounds” can typically be accomplished through use of the most basic of classification tools: file analytics software.
File analytics software looks at the wrapper of the content -- the file system metadata -- to provide insight into that content, e.g. which files are PDFs or how many haven’t been viewed in the last 5 years. Auto-classification software typically does this as well, in addition to cracking open the files to perform a variety of actions on the content itself, from simple full-text indexing to more complex techniques like semantic or vector analysis.
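To make the “wrapper” concrete, here is a minimal sketch of a metadata-only scan over a plain directory tree. The function name `profile_share` and the 5-year staleness default are my own for illustration; real file analytics products do far more (reporting, scheduling, scale-out), but the raw inputs are these same file-system attributes.

```python
import time
from collections import Counter
from pathlib import Path

def profile_share(root, stale_years=5):
    """Summarize a directory tree using only file system metadata:
    counts by extension, plus files not modified in `stale_years`.
    Uses st_mtime rather than st_atime, since access times are often
    disabled or unreliable on shared storage."""
    cutoff = time.time() - stale_years * 365 * 24 * 3600
    by_ext = Counter()
    stale = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        by_ext[path.suffix.lower() or "(none)"] += 1
        if path.stat().st_mtime < cutoff:
            stale.append(path)
    return by_ext, stale
```

Note that nothing here opens a file’s contents -- which is exactly why this level of analysis is fast and cheap compared to full auto-classification.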
For losing the first 100 pounds, however, the added complexity and overhead of auto-classification tools are not worth it. It’s far better to use file analytics tools to find content eligible for clean-up, using the following criteria:
- Obvious junk: There are 100 or so commonly accepted “junk” file types that file analytics tools can find. Pick a list appropriate for your organization and then use it to drive the first wave of purging.
- Content aging: File analytics tools can also tell you about content aging: not just how old content is, but when it was last updated and viewed. Depending on your organizational context, you may be able to simply pick a cut-off date and purge, e.g. “all content older than X years or that hasn’t been accessed in Y years”. Or, if you’re heavily litigated or subject to stringent recordkeeping requirements, you’ll need to apply these kinds of purges on a department-by-department (or possibly even workgroup-by-workgroup) basis.
- Duplication: Duplicate files are easy to find, but it can be difficult to know what to do with them. You can’t simply delete the dupes without risking operational impact when users can’t find their documents (after all, they don’t know they’re duplicates; they're just their documents). So in order to reap the benefits of de-duping, you’ll need to invest in a tool that leaves a stub to the original file when deleting dupes (if your file analytics tool doesn’t offer that functionality).
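As a toy illustration of the stubbing idea above -- not how any particular product implements it -- the sketch below detects duplicates by content hash and replaces each duplicate with a small text pointer to the surviving copy. The function name and stub format are mine; commercial tools typically move content to a repository and leave proper shortcuts instead of plain-text breadcrumbs.

```python
import hashlib
from pathlib import Path

def dedupe_with_stubs(root):
    """Detect duplicate files by SHA-256 content hash and replace each
    duplicate with a small text stub pointing at the kept copy, so users
    browsing the share can still find 'their' document."""
    seen = {}      # content hash -> path of the first (kept) copy
    stubbed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            # Leave a breadcrumb instead of silently deleting.
            path.write_text(f"Moved: see {seen[digest]}\n")
            stubbed.append(path)
        else:
            seen[digest] = path
    return stubbed
```

The design point the sketch is meant to show: de-duplication is technically trivial (a hash comparison), and the hard part is the operational choice of what to leave behind so users aren’t stranded.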
I’ve seen very high purge rates (roughly 80 percent) at organizations simply by addressing obvious junk files on shared drives. But a more realistic estimate for the average organization is somewhere closer to 20 to 30 percent, which is still a huge improvement.
Content aging and duplication offer additional purge opportunities, but these two categories require more finesse because of retention requirements and some technical challenges. However, even if you applied them very narrowly, it would be hard to imagine you couldn’t get another 10 percent from each. Combined with the low end of what you can expect for junk files, you’re at 40 to 50 percent of your shared drives -- which is nothing to sneeze at.
The Final Word
Here’s hoping this discussion has given you an idea of the kinds of things you can do to address your shared drive problem, without having to spend thousands of hours sifting through files or hundreds of thousands of dollars on complex auto-classification software. For a reasonable amount of time and money, you can lose the first 100 pounds -- which is not to say you should stop there. For many organizations, getting serious about the next 100 pounds makes a great deal of sense. And for a few organizations, losing the last 5 pounds will, as well. But no matter what kind of organization you are, losing the first 100 makes sense, and the method I’ve outlined here is one way to do just that.