Clean up Your Content Repository for Better Information Security

You should know by now that building stronger walls against data breaches is no longer enough to protect corporate information. You also need to clean up your content repositories.

The day-to-day practice of the CISO now embraces information management.

Today, the day-to-day practice of the Chief Information Security Officer (CISO) includes addressing information management and cleaning up corporate repositories so that they contain as little sensitive data as possible.

I’ve been posting on Doculabs’ information management program framework, i.e., what you need to do to execute information management successfully and help minimize the impact of a breach. The five components of the framework are:

This post provides details on the penultimate item on the list: content cleanup.

After you set your information management policies, you need to clean up content.

With your defensible disposition playbook and policies in place and aligned, and your procedures defined, the next step is to clean up your content. For some organizations, this cleanup is a standalone purge. For others, it is part of the preparation for a content migration.

No matter what industry you’re in or what size your organization is, you’ll need tools to help in the effort. It’s not reasonable to expect end users to comb through their content manually to purge junk or stale data or to identify sensitive content. The volume of unstructured data at most organizations is too large for that to be manageable.

For content cleanup, file analytics tools are best.

The tools available fall into two categories: file analytics and auto-classification. By and large, auto-classification tools are too cumbersome to use effectively, because training the software to recognize and classify document types takes enormous effort. File analytics tools, by contrast, are more ready for prime time and easier to use because they rely on regular expressions to classify content. For instance, they use character patterns, such as ###-##-####, to identify documents containing Social Security numbers. Or they may use hits against a dictionary of customer names and health terms to identify protected health information (PHI).
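To make the pattern-matching idea concrete, here is a minimal Python sketch of the two techniques just described: a regular expression for the ###-##-#### shape and a dictionary lookup for PHI. It is an illustration only, not how any particular product works. The names `find_ssns` and `looks_like_phi` and the two small dictionaries are hypothetical, and the "customer plus health term" rule is an assumption about how dictionary hits might be combined.

```python
import re

# Matches the ###-##-#### shape only; it does not check that the
# number is a validly issued Social Security number.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical dictionaries for illustration. A real scan would load
# customer names and health terms from the organization's own sources.
CUSTOMER_NAMES = {"jane doe", "john smith"}
HEALTH_TERMS = {"diagnosis", "prescription", "treatment"}


def find_ssns(text):
    """Return every substring shaped like a Social Security number."""
    return SSN_PATTERN.findall(text)


def looks_like_phi(text):
    """Flag text that mentions both a known customer and a health term."""
    lowered = text.lower()
    has_customer = any(name in lowered for name in CUSTOMER_NAMES)
    has_health_term = any(term in lowered for term in HEALTH_TERMS)
    return has_customer and has_health_term
```

A pattern this simple will produce false positives (any nine digits grouped 3-2-4 will match), which is one reason scan results usually get a human review before anything is purged.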

File Analytics Tools

  • Examine the "wrapper" of a file, i.e., its properties, such as date created, date last accessed, or file extension
  • Analyze the file's contents, using regular expressions to find patterns

Auto-classification Tools

  • Examine the properties of a file
  • Analyze the file's contents
  • Can also "learn" to identify documents through the use of machine learning, natural language processing, and other advanced capabilities
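Examining a file's "wrapper" means reading its properties without opening its contents. The Python sketch below shows roughly what that looks like using only the standard library. The function names `file_properties` and `scan_tree` are hypothetical. Creation date is left out because it is not available in a portable way across operating systems, and last-accessed times are only as reliable as the underlying file system makes them.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def file_properties(path):
    """Collect "wrapper" properties for one file without opening it."""
    info = os.stat(path)
    return {
        "path": str(path),
        "extension": Path(path).suffix.lower(),
        "size_bytes": info.st_size,
        # Access times can be stale or frozen on volumes that disable
        # access-time updates, so treat this value as approximate.
        "last_accessed": datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
        "last_modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }


def scan_tree(root):
    """Yield properties for every file found under the root folder."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield file_properties(os.path.join(dirpath, name))
```

Calling `list(scan_tree("/some/share"))` on a folder you can read would return one dictionary per file. The path here is a placeholder.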

A range of file analytics tools is available to help you analyze your content. Just about all of them use the same OEM engines under the hood. That means that in terms of raw file analytics power, they’re all about the same.

File analytics tools differ in their user interface and their ability to integrate with different systems.

Where file analytics tools differ is in the user interface and in their ability to integrate with different systems. Given the variety of systems deployed at any organization, integration with other systems is probably the most important criterion when selecting a tool. You want to make sure that whatever tool you choose can reliably connect to the repositories you have in play.

Junk, stale and sensitive content; problematic security and access.

But whatever tool you ultimately choose, the results of a scan of your repositories are likely to be something like the following, which we’ve observed at dozens of client sites over the last 10 years:

  • 30 to 70 percent “junk” content (able to be removed immediately)
  • 20 to 40 percent stale content (defined as older than 3 years, based on date last accessed)
  • 5 to 10 TB of stale sensitive content (able to be quarantined immediately with no operational impact)
  • 20 to 60 percent of content with problematic security and access (e.g., global access, or access for end users who have no need for the content)
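As a rough illustration of how scan results get sorted into buckets like these, the Python sketch below applies the "older than 3 years, based on date last accessed" definition of stale content. Everything else is an assumption made for the example: the junk rule based on file extension, the `sensitive` flag on each record, and the choice to report percentages by file count rather than by volume. The security and access bucket is left out because it depends on each repository's permission model.

```python
from collections import Counter
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=3 * 365)  # "older than 3 years", approximated in days
JUNK_EXTENSIONS = {".tmp", ".bak", ".log"}  # hypothetical junk rule


def bucket_for(record, now):
    """Assign one scan record to a single bucket, checking junk first."""
    if record["extension"] in JUNK_EXTENSIONS:
        return "junk"
    is_stale = now - record["last_accessed"] > STALE_AFTER
    if is_stale and record["sensitive"]:
        return "stale_sensitive"
    if is_stale:
        return "stale"
    return "active"


def summarize(records, now):
    """Return the percentage of records (by count) in each bucket."""
    counts = Counter(bucket_for(r, now) for r in records)
    total = sum(counts.values())
    return {bucket: 100 * n / total for bucket, n in counts.items()}
```

Each record is a dictionary with `extension`, `last_accessed`, and `sensitive` keys. Passing `now` in explicitly, instead of reading the clock inside the function, keeps the result repeatable from one run to the next.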

Data migrations are more efficient when you reduce your unstructured data footprint.

The value of classifying your content into these buckets and then acting on these buckets by purging, archiving, etc., is that you reduce your overall unstructured data footprint significantly (by anywhere from 30 to 90 percent).

Doing so not only improves the efficiency of a data migration, but it also reduces the overall risk posed by your unstructured data. That's because you have less junk and stale data to distract you, as well as less sensitive data to protect.


Rich Medina
Joe Shepley
I’m VP and Practice Lead, focusing on developing Doculabs’ InfoSec practice and its applications in a wide range of industries.