The Big Data phenomenon is at least partly about answering questions – hard questions, and ideally, questions that predict the future in their own fuzzy way. But there is (at least) one question Big Data doesn’t necessarily have an answer for. It’s a question that comes up regularly and may be among the most feared questions heard in meetings:
“Where did this data come from?”
As we take on more and more Big Data projects, we will be required to answer this question and its variations (“Where did that field come from?” and “Did you include..?”) more and more often. This is partly because the question has roots in good data management. But more importantly, it’s driven by a simple reality: insights tend to come from bringing together various datasets, and each dataset has its own potential issues. For example:
- Source of the data. Where did the data originate? In a large enterprise, there typically are many copies, variations, or subsets of each interesting dataset.
- Scope of the data. Which data should be included? In a normalized, storage-centric data model, excluding or including a row using query criteria can dramatically change the results of analysis.
- Timeliness of the data. What is the time frame of the data? Rollups, summaries, and aggregates, in particular, may need to be understood thoroughly before they can be used for forecasting.
- Lineage of the data. Has the data been redacted? Cleansed? Profiled? The processing, enrichment, and/or other enhancements that have been applied to the data need to be understood.
- Access to the data. Who has access to the data? To all of it? To some of it? Access should be carefully controlled, with permissions granted only to those with a legitimate need for each portion of the data.
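The questions above can be captured as a lightweight provenance record attached to each dataset. Here is a minimal sketch in Python; the field names and values are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetProvenance:
    # Source: where the data originated (a hypothetical system name)
    source_system: str
    # Scope: the query criteria that selected these rows
    scope_filter: str
    # Timeliness: the period the data covers
    period_start: str
    period_end: str
    # Lineage: ordered processing steps applied so far
    lineage: list = field(default_factory=list)
    # Access: roles permitted to read this dataset
    allowed_roles: set = field(default_factory=set)

    def record_step(self, step: str) -> None:
        """Append a processing step so the lineage stays auditable."""
        self.lineage.append(step)

# Example: an extract from a hypothetical CRM system
prov = DatasetProvenance(
    source_system="crm_extract",
    scope_filter="region = 'EMEA' AND status = 'active'",
    period_start="2014-01-01",
    period_end="2014-12-31",
    allowed_roles={"analyst", "data_steward"},
)
prov.record_step("deduplicated on customer_id")
prov.record_step("masked email addresses")
```

With a record like this traveling alongside the data, “Where did this data come from?” has an answer that doesn’t depend on anyone’s memory.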
Knowledgent has seen these challenges delay and derail projects for years. Nothing has brought them to the forefront like Big Data projects. These new efforts promise great insight but present difficulties in their execution. They combine internal and external data of all types, in huge volumes, often at rapid velocities – all being analyzed with new platforms, tools, and (potentially) people as well.
So it’s actually all the more important, right now, to build governance into the Big Data strategy and ensure that the data is trustworthy and the insight actionable. At a minimum, that means:
- A coherent architecture that provides the right level of governance – minimally, the ability to have separate, controlled access to raw, refined and “gold standard” data
- Processes for ingesting datasets and tracking their enrichment, linking, and combination
- Processes and tools for identifying problematic data (such as PII) and removing, redacting, or masking it (this is particularly important for unstructured, raw sources)
- Automatic profiling of data quality and linking of that profile to the dataset
- Processes for inventorying data and related information (like original SQL view, enrichment history, and quality profile) so that others can find and use data without re-creating it
- An operating model that defines the people, tools, and processes to do all of this on an ongoing, scalable basis, with metrics showing progress and identifying challenges
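To make one of these items concrete, identifying and masking problematic data in raw, unstructured sources can start with simple pattern matching. The sketch below assumes just two PII types (email addresses and US Social Security numbers); a real project would need a much broader, validated catalog of patterns:

```python
import re

# Hypothetical patterns for two common PII types; real deployments
# need a broader catalog (names, addresses, account numbers, ...).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_pii(text: str) -> str:
    """Replace detected PII with fixed tokens, keeping the text readable."""
    text = EMAIL.sub("[EMAIL]", text)
    text = SSN.sub("[SSN]", text)
    return text

raw = "Contact jane.doe@example.com, SSN 123-45-6789, re: order 4417."
print(mask_pii(raw))
# Contact [EMAIL], SSN [SSN], re: order 4417.
```

Masking rather than deleting preserves the surrounding context for analysis, and the masking step itself should be recorded in the dataset’s lineage so downstream users know the data has been altered.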
One important tactic is to agree on which will come first: governing or modeling. It can be a chicken-and-egg problem. We recommend governing the data prior to modeling, even though ungoverned data is harder to assess up front. Governing the raw data, at least, starts the process of collecting useful information, and that collection should absolutely include the outputs and artifacts from discovery and exploration during the modeling phase.
How are you governing your Big Data? What challenges are you running into? Share your thoughts in the comments!