Many in the industry hold the opinion that Data Governance and Management are activities abstracted from the technologies and data patterns they serve. This is somewhat true, but as technological advances deliver on the promise of speed, time-to-market, machine learning, and AI, the face of data management will change.
Why are Data Lakes building momentum? The Data Lake pattern is being adopted by larger organizations to solve widespread problems caused by a lack of Data Consolidation, Data Standardization, single-source Data Availability, and Process & Data Change Management Agility. Data Lakes are often built to leapfrog brittle legacy technology, with the goal of creating a new, modern data paradigm that can support larger volumes of data at faster rates of ingestion and processing.
The most differentiating concept of the Lake is its ingestion pattern. A data lake ingests large amounts of data “as-is” from source systems into a raw zone, where the data is cataloged, subjected to data quality monitoring, and readied for curation. Unlike the ETL and ESB patterns, the Lake stores iterations of “originating” data as it moves from a raw state through curation and on to a modeled state.
Each iteration of the data is an asset that can be served up for data science, business analysis, and advanced analytics. Therefore, using the right iteration of the data – raw, curated, transformed, and organized – requires designating a “fit for purpose” classification so that data can be consumed for the right purpose. Unlike data warehouses, which govern a well-established structure, data lakes require governance across every iteration of the data; these iterations correspond to the zones of the Lake, typically named the raw, curated, organized, and “for purpose” zones. Each zone requires differing forms of governance.
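The zone-by-zone iteration described above can be sketched as a simple catalog model. This is a minimal illustration, not a reference implementation: the zone names come from the text, while the asset fields, purpose tags, and helper function are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

# Zone names taken from the text: raw, curated, organized, "for purpose".
class Zone(Enum):
    RAW = "raw"
    CURATED = "curated"
    ORGANIZED = "organized"
    FOR_PURPOSE = "for_purpose"

@dataclass(frozen=True)
class DataAsset:
    """One iteration of an originating dataset as it moves through the Lake."""
    name: str
    zone: Zone
    fit_for: frozenset  # hypothetical "fit for purpose" tags

def assets_fit_for(catalog, purpose):
    """Return every iteration classified as fit for the given purpose."""
    return [asset for asset in catalog if purpose in asset.fit_for]

# Example: the same originating dataset cataloged once per zone,
# each iteration carrying its own fit-for-purpose classification.
catalog = [
    DataAsset("orders", Zone.RAW, frozenset({"data_science"})),
    DataAsset("orders", Zone.CURATED, frozenset({"data_science", "business_analysis"})),
    DataAsset("orders", Zone.ORGANIZED, frozenset({"business_analysis", "advanced_analytics"})),
]
```

A consumer asking for data-science-ready assets would receive the raw and curated iterations here, while business analysis would be steered toward the curated and organized ones.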
For example, the Raw zone is where intensive Data Quality Monitoring might occur. The curation zone is where governance of reference data, standardization, and business rules for transformation is applied. Data Lakes will often maintain an enterprise model and will therefore require intensive governance around semantics, modeling, and ontology. Why is this important? It allows a business to analyze data for trends, quality, behavior, and meaning long before data is reorganized and modeled for enterprise consumption.
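Raw-zone data quality monitoring of the kind mentioned above can be as simple as profiling completeness on ingested records. The following is a minimal sketch under assumed conditions: records arrive as plain dictionaries, and the field names, the completeness rule, and the 0.95 threshold are all illustrative, not prescriptive.

```python
def completeness(records, field):
    """Fraction of records where the field is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def monitor_raw_zone(records, required_fields, threshold=0.95):
    """Flag any required field whose completeness falls below the threshold."""
    return {
        field: completeness(records, field)
        for field in required_fields
        if completeness(records, field) < threshold
    }

# Example: three raw records as-is from a source system.
records = [
    {"order_id": "1", "customer": "a"},
    {"order_id": "2", "customer": ""},
    {"order_id": "3", "customer": "b"},
]

# "customer" is only two-thirds complete, so it is flagged; "order_id" is not.
issues = monitor_raw_zone(records, ["order_id", "customer"])
```

Running checks like this in the raw zone, before any curation, is what lets the business see quality and behavior trends long before the data is remodeled for enterprise consumption.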