Records Retention and Sensitive Data Identification

When it comes to your organization’s information security, risk surface is a critical factor. The less data available for bad actors to find and pilfer, the lower the risk to the organization.

But how do you actually go about identifying the information that can be deleted, as well as the information that should be archived, and—most important of all—the sensitive data that requires greater security? And when you decide to undertake such a content cleanup, where do you start? It’s not always intuitive to determine which files to tackle first.

One way is to separate files into different categories. And while there are plenty of potential buckets, there’s one approach we’ve helped our most successful clients implement—an approach that provides a “first cut” at the data, making it easier to identify data that can be deleted.

The easiest files to tackle are “junk” files. Next comes your collection of records. Last comes the personal and transient information.

What to Do with the Junk

Junk is easy. It’s the content you really don’t need, the stuff that’s just filling up space. Typically, you can get your stakeholders together and agree on a definition of what, now and going forward, constitutes “junk” within your organization.

One place to start is with files that have extensions such as .tmp or .exe. Another approach is to get rid of empty files, that is, files less than 1 byte in size. Then there are also music libraries. It just depends on how aggressive your organization wants to be and how wide a net you’re willing to cast.
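As a rough illustration of that first cut, here is a minimal Python sketch that walks a folder and lists junk candidates by extension and by size. The extension list and the function name find_junk_candidates are assumptions made for the example; your own list should come out of the stakeholder agreement described above. The sketch only reports candidates and deletes nothing.

```python
from pathlib import Path

# Assumed junk rules for illustration; the real list is whatever
# your stakeholders agree on.
JUNK_EXTENSIONS = {".tmp", ".exe"}


def find_junk_candidates(root):
    """Return (path, reason) pairs for files matching the assumed junk rules."""
    candidates = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if path.suffix.lower() in JUNK_EXTENSIONS:
            candidates.append((path, "extension"))
        elif path.stat().st_size == 0:
            # A file of less than 1 byte is an empty file.
            candidates.append((path, "empty"))
    return candidates
```

Reviewing the resulting list with the file owners before deleting anything keeps the net from being cast wider than the organization intended.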

What to Do with the Records

These are the files you keep based upon pre-determined retention schedules. The challenge is to identify which files are truly records, which is not always easy to do. You can use tools or scripts to ferret these out, but in many cases you have to interview the users of those files.

The goal is to create an inventory of your records and where they’re stored—a process that, until recently, was highly manual. This is where ECM systems come in—e.g. IBM FileNet, OpenText, or Microsoft SharePoint.

Most organizations have some kind of retention schedule that automated software can use as a set of rules to help categorize information. The problem, though, is that many companies have saddled themselves with retention schedules that are overly complex—in many instances, relics of the days when those organizations retained their records on paper, and a complex schedule helped manage and organize all those hard copies.

It’s no longer sustainable (or, I would argue, even necessary) to have records categories that number in the hundreds. With appropriate metadata tagging of your electronic documents, you can streamline the number of records categories. And with fewer categories, you’re far more likely to get users to comply with the records retention schedule. Of course, there’s no “ideal” number of categories. But in our experience, if you can get it down to between 30 and 50 categories, you’re doing well.

Let’s say your department is now managing 30 categories that call for deletion between 3 and 7 years after creation (and maybe some timeframes in between). You could radically reduce the number of categories by giving everything a retention period of 7 years, and just call it a day.
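To make the arithmetic concrete, here is a small Python sketch of that consolidation. The category names and periods are invented for the example. The one rule it encodes is that the merged bucket has to take the longest of the original periods, so that nothing is deleted earlier than its original schedule required.

```python
# Hypothetical categories and retention periods in years, for illustration only.
schedule = {
    "invoices": 7,
    "expense reports": 5,
    "vendor correspondence": 3,
    "project plans": 4,
}


def consolidate(schedule, merged_name="general business records"):
    """Collapse every category into one bucket kept for the longest period."""
    return {merged_name: max(schedule.values())}


print(consolidate(schedule))
```

The trade-off is that the shorter-lived categories are now kept longer than before, which is the price of a schedule users can actually follow.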

What to Do with the Personal and Transient Information

After you’ve dealt with the junk and identified and managed the records, the next large bucket is the personal and the transient information—i.e. everything that’s left over that would not be considered a record. (Some organizations use different categories: critical business, reference and personal information.)

Of these two sub-buckets, the personal is easier to define. Personal content is content that only one user may need access to, and in many cases, it’s information unrelated to business activities—things like a kids’ soccer schedule, a nice performance review from a manager, or photos from a work team event.

This is stuff that literally no one else would need. As we all know, people can be very territorial about these personal files. Our recommendation here is not to legislate how long an employee can keep such a file, but that whatever the user keeps be within a prescribed amount of file space that you’ve allotted to the individual user—e.g. 2 GB of storage. Then if the user exceeds this, it’s up to them to figure out what to delete to get back within the prescribed limit.
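One way to monitor such an allotment is sketched below in Python. It assumes each user has a folder directly under a shared home directory, which is an assumption about your layout and not something every environment will match. The 2 GB default mirrors the example figure above, and the function names are made up for the illustration.

```python
from pathlib import Path

QUOTA_BYTES = 2 * 1024**3  # 2 GB, the example allotment


def folder_size(folder):
    """Total size in bytes of all files under a folder."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())


def users_over_quota(home_root, quota_bytes=QUOTA_BYTES):
    """Map each over-quota user folder name under home_root to its size."""
    over = {}
    for user_dir in Path(home_root).iterdir():
        if user_dir.is_dir():
            size = folder_size(user_dir)
            if size > quota_bytes:
                over[user_dir.name] = size
    return over
```

The report only tells you who is over the limit; deciding what to delete stays with the user.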

It’s an approach that lets you enforce retention rules through policy, training, and monitoring—and it helps keep corporate records from getting saved with personal data.

Transient files, on the other hand, are copies of business-related documents such as templates, policies, weekly reports, and department presentations. In many cases, these are early drafts of final files. With this stuff, it’s more difficult to know how long to keep it.

When it comes to managing these transient files, we recommend one of two approaches. The first is to set a fixed time to keep all transient data. If you’re uncomfortable setting a fixed time (for fear it’s too short or too long), the second is to use a file analytics software tool such as Varonis Systems’ Data Classification Framework or STEALTHbits Technologies’ StealthAUDIT. Using such a tool to scan your repositories, you can develop an aging curve for the documents in those repositories.

If you’re struggling to define retention for transient documents across your organization, these tools are useful to help show the point in the document lifecycle where content access/modification typically drops off. And you can monitor the proportion of the repository content by date last accessed or date last modified. All of this is information that can help you zero in on an appropriate retention period for transient documents in your organization—whether that winds up being 3 years, or 2 years, or even just 1 year.
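For a sense of what an aging curve looks like, here is a minimal Python sketch that computes one from file system timestamps. It is not how the commercial tools work internally; it simply reports, for each age in years, the fraction of files last modified within that time. The function name aging_curve and its parameters are assumptions for the example, and it uses date last modified because last-accessed timestamps are not reliably maintained on every file system.

```python
import time
from pathlib import Path

SECONDS_PER_YEAR = 365 * 24 * 3600


def aging_curve(root, max_years=5, now=None):
    """Return [(years, fraction of files last modified within that many years)]."""
    now = time.time() if now is None else now
    ages = [
        (now - p.stat().st_mtime) / SECONDS_PER_YEAR
        for p in Path(root).rglob("*")
        if p.is_file()
    ]
    if not ages:
        return []
    return [
        (years, sum(age <= years for age in ages) / len(ages))
        for years in range(1, max_years + 1)
    ]
```

If the curve flattens out near 1.0 at, say, the 2-year mark, that suggests very little content is touched after 2 years, which is one input into choosing a retention period.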

Note, though, that in some cases transient information must be treated as if it were a record. This is usually the case when there are defined retention periods (say, for HIPAA compliance) or when the information contains personally identifiable information (PII) or other sensitive data. In these scenarios, regulatory compliance would dictate that a draft version of a document might also constitute a record.

But it gets still more complicated. Just because a file has PII or sensitive data doesn’t inherently make that file a record. Say you’re in HR and you want to define a benefits package for the coming year. You export data from a billing system, you want to do analysis on rate changes, and you manipulate all the data to help you come up with the next year’s benefits package.

All the preparatory work—even though it contains PII—is not necessarily a record. That’s just data that you’re using as the basis for your conclusion. The benefits package, with individuals’ information, would be a record. But the analysis and how you got there would not be.

Of course, there are cases where all the analysis in the world won’t affect the retention schedule. Doculabs has clients in the mining industry. Any record of an employee fatality must be kept forever. We also have clients in the utilities sector, and for records pertaining to the management of a facility such as a power plant or a sub-station, most regulations specify that virtually all records be kept for the life of the asset.

Retention’s Role in InfoSec

Undertaking this bucketing and prioritization process is critical to improving an organization’s information security. Go through the sequence: Designate the junk, then the records, then the personal and the transient. Get rid of what you don’t need. Make your pile of information smaller, and then protect the information assets that are really important to your business. Reduce the target. Reduce the risk surface.

In the end, it’s all about retention—those decisions about how long your organization will keep files. The more content you can winnow down, the greater your capacity for event monitoring, data loss prevention, access monitoring—and security. Not to mention you’ve just made your CISO’s job easier!



Rich Medina
Jim Polka
I’m a Principal Consultant. My expertise is in security-based information management and strategic deployment of ECM technologies.