Working with Organizational Dark Data

By Paul Chin

Originally published in Intranet Journal (07-Jul-2005)

How easy is it to manage your organization's content when you don't know where a lot of this content is? You have a sense—theoretically, at least—that there are vast, undiscovered caches of information scattered throughout the organization. What you don't know is how much of it there is, where it is, and who has it.

While companies laud the size, efficiency, and design of their intranet, we have to wonder how much of their content is really contained on the intranet. The fact of the matter is, visible intranet content only accounts for a small percentage of an organization's total knowledge assets. There's often a large unseen—and in some cases, unknown—portion of corporate content that never reaches the general user community. This is what's known as dark data.

What is Dark Data?

The name might sound menacing, like some black cloud that seeps into your office and ties your shoelaces to the leg of a desk. But dark data has much more value than its moniker affords. It borrows its name from the cosmological theory of dark matter, defined as "non-luminous matter not yet directly detected by astronomers that is hypothesized to exist because the visible matter in the universe is insufficient to account for various observed gravitational effects."

In plain English: You can't see it directly but you know something is out there because it's affecting the movement of other things. Quite vague, but that's the nature of dark matter.

Intranets gained their fame in the 1990s, but before that (and to this day for organizations without a formal content management system) much of an organization's core content was stored in all sorts of different media:

All of this information was managed, and I use the term "managed" very loosely, in a relatively informal manner. There was no single, centralized repository for all of this information. Employees either had to ask the "informationally privileged" for assistance or had to dig through large corporate file servers to retrieve what they were looking for. And because this information was so spread out, they would have to repeat this process several times at several locations with several people before completing their task.

When intranets emerged as a corporate content management tool, developers and content owners attempted to port and consolidate all of this dispersed content into a centralized environment that could be easily navigated by even the casual user. But how successful was this exercise? How much of this content truly made it onto the corporate intranet?

Like dark matter, no one truly knows. You can't port what you can't see. But this content does exist because—although you can't see it directly in your intranet—you can see its effects in many corporate efforts. Content most users have never seen is referenced during presentations, in conversations, and in e-mails. It's out there; it just hasn't been harnessed by the intranet.

The Effects of Dark Data on Organizations and Users

Dark data comes in all shapes and sizes, and some is more useful than others. But to fully understand the concept of dark data you need to be aware of its two main classes:

The exact amount of dark data within an organization will vary depending on the company and how long it has been operating. Organizations that have been in business for decades—before the advent of digital content management—will likely have more dark data than those that have only been in business for the last several years.

But the term dark data describes more than the hidden nature of this content within an organization. It can also describe the state of users who rely on the intranet as their central information source. In other words, if they don't know that this "invisible" content exists, they themselves are, in a sense, in the dark as well.

When organizations try to run a comprehensive intranet without making an effort to locate as much dark data as possible, they run the risk of duplicating both effort and content.

Without the availability and integration of dark data into an intranet, you might end up spending time and effort re-doing something that has already been done. The information that you're trying so hard to collect and process could very well already exist in a spreadsheet or database in another department. This leads not only to data duplication but also to content inconsistencies.

Since the creator of an original piece of dark data and the creator of the content redux are usually different people, the two versions, although they cover the same subject matter, may have slight variances. If the original dark data is ever discovered, intranet owners will be left scratching their heads wondering which is the more accurate. And to make matters worse, if the dark data's originator is no longer with the company, there's no way to compare notes.
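To make the idea of "slight variances" concrete, here is a minimal sketch of how two versions of the same figures might be compared once the original turns up. It assumes both versions have already been read into simple name-to-value mappings; the function name and the sample figures are invented for illustration and do not come from any particular tool.

```python
def find_discrepancies(original, recreated):
    """Compare two name -> value mappings of the same subject matter.

    Returns a dict of {name: (original_value, recreated_value)} for every
    entry whose values differ or that exists in only one version.
    A value of None means the entry is absent from that version.
    """
    discrepancies = {}
    for name in sorted(set(original) | set(recreated)):
        old_value = original.get(name)
        new_value = recreated.get(name)
        if old_value != new_value:
            discrepancies[name] = (old_value, new_value)
    return discrepancies


# Hypothetical figures: a rediscovered spreadsheet vs. the intranet redux.
original = {"Q1 revenue": 120, "Q2 revenue": 135, "Headcount": 42}
recreated = {"Q1 revenue": 120, "Q2 revenue": 153, "Offices": 3}

for name, (old_value, new_value) in find_discrepancies(original, recreated).items():
    print(f"{name}: original={old_value!r}, recreated={new_value!r}")
```

A report like this only shows where the versions disagree. Deciding which one is correct still requires someone who knows the content.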

Processing Dark Data

While dark data is an invaluable addition to an intranet, it doesn't come without its challenges. Since dark data exists outside the corporate intranet, and often outside the knowledge of intranet owners, finding and porting this content to an intranet can be quite time-consuming.

There are two major issues associated with the consolidation of dark data into an intranet: discovery and integration.

Discovery is the process of finding and aggregating all applicable dark data to be included in an organization's intranet. But this process isn't an exact science. And it can be further complicated by content owners who, for various reasons, aren't willing to share their pool of information. In these cases dark data isn't hidden by obscurity, but rather deliberately concealed.

Although organizations with large amounts of heterogeneous content types can install a search appliance such as Google's Mini (for small- to medium-sized businesses) or Search Appliance (for large businesses), it does nothing to address dark data that exists in the form of hardcopies or user knowledge that has no corporeal home.

In order to maximize your chances of uncovering useful dark data, you need to run an information audit (the issue of running information audits will be explored in greater detail in an upcoming article). But don't try to create a single group to perform this audit. Instead, have individual intranet content owners conduct the audit from within their department or workgroup. Those who know their content best should be the ones responsible for the audits.
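For the electronic portion of a departmental audit, a content owner might start with a simple inventory of what is sitting on a shared drive. The following is a rough sketch under stated assumptions, not a full audit method: it assumes the content lives in an ordinary directory tree that the auditor can read, and the function name is made up for this example. It cannot see hardcopies or knowledge that exists only in people's heads.

```python
import os
from collections import Counter
from pathlib import Path


def inventory_by_type(root):
    """Count the files under `root`, grouped by lowercase file extension.

    Files without an extension are grouped under "(no extension)".
    """
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            extension = Path(name).suffix.lower() or "(no extension)"
            counts[extension] += 1
    return counts


if __name__ == "__main__":
    # Inventory the current directory as a stand-in for a departmental share.
    for extension, count in inventory_by_type(".").most_common():
        print(f"{extension}: {count}")
```

A large count of spreadsheets or local databases in one department is a reasonable hint about where to look first.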

But the true key to dark data discovery lies in users' perception of the intranet. When users begin to see it as a productive, long-term business tool and not a flavor of the day, they will be more likely to share knowledge and bring some of this dark data to light. Without this cooperation your chances of uncovering dark data will be greatly diminished.

Once dark data has been discovered, your biggest decision will be what to do with this content, that is, how far to take your integration process. Will all of your heterogeneous content types be left in their native formats and simply linked to from your intranet, or will this content be converted to one standard Web-based format? There's a case to be made for both.

Leaving all dark data in its native format is certainly the quickest method, but you might be left with content inconsistencies in the long run. Intranets provide the entire user community with read-only content, much like Internet content. It's the content owners' responsibility to manage and update their intranet content. If dark data were to be left in its native format, say an Excel spreadsheet, users would most likely have to download the file from the intranet for local viewing. When this happens, there's a danger that the original file will be modified and re-circulated into the organization's information stream. The file could be changed and e-mailed from user to user, each making their own modifications, until there are dozens of copies. And when everyone is done with their copy, there will be dramatic differences between those files and the original "production" version sitting on the intranet.
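One way to spot re-circulated copies that no longer match the production version is to compare checksums. The sketch below assumes you can get at both the production file and the suspect copies as ordinary files; the function names are hypothetical. Note that a byte-level checksum flags any change at all, including a file that was merely re-saved, so a mismatch is a prompt to look closer rather than proof that the content changed.

```python
import hashlib


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_drifted_copies(production_path, copy_paths):
    """Return the copies whose bytes differ from the production file."""
    reference = file_digest(production_path)
    return [path for path in copy_paths if file_digest(path) != reference]
```

Calling find_drifted_copies with the intranet's production file and a list of copies collected from users returns only the copies that have diverged.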

Converting dark data to a standard intranet format is ideal but can become very effort-intensive, especially if the source format is not consistent with intranet content standards. Dark data contained within databases, applications, or hardcopies can prove particularly difficult to convert. Dark data integration can include manually reformatting content, integrating applications with your intranet, and digitizing hardcopy documents via OCR (optical character recognition).

Closing Thoughts

The value of dark data isn't in question, but the trick is in finding it. It can be hidden in small corners of the company or long forgotten on some dusty old file server. And for all the value it can provide to the entire user community, dark data remains accessible only to a privileged minority, if even that.

Unfortunately, there's no quick fix or silver bullet. The actions required to find and integrate dark data into an intranet depend on the amount and complexity of this content. If you have a sense that you're in possession of large quantities of usable dark data, focus your attention on finding this content rather than re-inventing the wheel.

But however you decide to approach this process of discovery and integration, the most important consideration when dealing with dark data is not to allow it to turn into a black hole, never to be seen again.

Copyright © 2005 Paul Chin. All rights reserved.
Reproduction of this article in whole or part in any form without prior written permission of Paul Chin is prohibited.