The cloud can help unlock significant business value via data democratization
Most organizations are moving their data to the cloud to reduce IT maintenance and increase business agility, since cloud platforms can dynamically scale both horizontally and vertically based on need. Once this data is in the cloud, enterprises can also begin unlocking its value by deriving insights from automated AI and ML services.
Data offerings from cloud vendors include databases like SQL Server but have also grown to include data lakes, cloud warehouses with ETL stacks, and analytical workbenches that cover everything from data preparation to processing data at scale. However, moving data storage and processing to the cloud also creates complexity in actively managing and governing data, often because data is distributed and processed across hybrid or multi-cloud landscapes. Within this context, it is important to discover where petabytes of data exist, along with their meaning and characteristics. This can be accomplished with a catalog that integrates discovered data and its metadata across hybrid storage systems.
[box type=”shadow” align=”” class=”” width=””]For someone hearing about a catalog for the first time: a catalog is a tool that discovers data assets and helps business and data analysts understand the structure of the data elements they want to use for their specific projects. As a core capability, a catalog crawls any database, whether relational, NoSQL, or graph, and surfaces useful information about it.[/box]
Some of this useful information includes:
- Table and column names
- Modeled data types
- Inferred data types
- Patterns and their frequency
- Data length with minimum and maximum thresholds
- Minimum and maximum data values
- Frequency of values and their distribution
- End-to-end lineage across data sources, along with transformation or derivation rules
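For illustration, here is a minimal, hypothetical sketch in Python of how a catalog might compute several of these column-level statistics during profiling. The function name and the simple type-inference rule are assumptions for the example, not any vendor's actual implementation:

```python
from collections import Counter

def profile_column(name, values):
    """Compute catalog-style statistics for a single column of string values."""
    lengths = [len(v) for v in values]
    # Naive type inference: integer if every value parses as an int, else string
    inferred = "int" if all(v.lstrip("-").isdigit() for v in values) else "string"
    return {
        "column": name,
        "inferred_type": inferred,
        "min_length": min(lengths),
        "max_length": max(lengths),
        "min_value": min(values),
        "max_value": max(values),
        "value_frequencies": dict(Counter(values)),  # frequency distribution
    }

stats = profile_column("customer_id", ["1001", "1002", "1001", "1003"])
print(stats)
```

A real catalog computes these statistics at scale across every table it crawls, but the idea is the same: derive metadata from the data itself rather than relying on documentation.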
The vast majority of on-premise catalog solutions cannot scan diverse storage environments, and in particular do not support scanning cloud object storage by default. For distributed assets managed across multiple environments, cloud vendor solutions offer connectors that provide more insight into the physical location and semantic meaning of data assets. Let’s take a look at a few examples of cloud-based catalogs to contrast them with traditional on-premise catalogs.
Managing metadata from cloud storage with a catalog
To pick an example, the AWS Glue Data Catalog can scan relevant storage, such as S3, RDS, and Redshift, to build out the physical characteristics of tables and files. Further, references to data used as sources and targets in Glue jobs are stored as metadata in the catalog, which helps keep track of where data came from as it moves through its lifecycle. Microsoft offers a similar service, Azure Purview.
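As a sketch of how this catalog metadata can be read programmatically, the snippet below uses the boto3 Glue client's `get_tables` API to list each table and its physical location. The database name "sales" in the commented usage is a placeholder:

```python
def list_catalog_tables(glue_client, database_name):
    """Return (table_name, storage_location) pairs from a Glue Data Catalog database."""
    tables = []
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            # Physical location (e.g. an S3 path) lives in the StorageDescriptor
            location = table.get("StorageDescriptor", {}).get("Location", "")
            tables.append((table["Name"], location))
    return tables

# Against a real AWS account (credentials configured):
#   import boto3
#   glue = boto3.client("glue")
#   for name, location in list_catalog_tables(glue, "sales"):  # "sales" is a placeholder
#       print(name, location)
```

Passing the client in as a parameter keeps the helper easy to test and reuse across accounts and regions.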
Let’s examine how the Glue Data Catalog works with a scanner, also called a crawler.
- The crawler connects to the data store of choice in AWS, such as S3 or Redshift. Connection properties include the data store settings, data paths, and so on.
- You choose the type of crawling required: full folder scans or incremental folder scans.
- The crawler infers a schema from your data stores.
- The crawler writes metadata to the Data Catalog.
- It creates definitions for databases and tables.
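The steps above can be sketched with the boto3 `create_crawler` and `start_crawler` calls. The crawler name, IAM role, catalog database, and S3 path used below are placeholders for illustration:

```python
def create_and_start_crawler(glue_client, name, role_arn, database, s3_path):
    """Create a Glue crawler over an S3 path, then start a scan."""
    glue_client.create_crawler(
        Name=name,
        Role=role_arn,                               # IAM role the crawler assumes
        DatabaseName=database,                       # catalog database for discovered tables
        Targets={"S3Targets": [{"Path": s3_path}]},  # the data store to crawl
        # Full scan; use "CRAWL_NEW_FOLDERS_ONLY" for incremental folder scans
        RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVERYTHING"},
    )
    glue_client.start_crawler(Name=name)

# Against a real AWS account (credentials configured):
#   import boto3
#   create_and_start_crawler(boto3.client("glue"), "demo-crawler",
#                            "arn:aws:iam::123456789012:role/GlueRole",  # placeholder role
#                            "sales", "s3://bucket/data/")
```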
Which operating model is suited to managing cloud metadata?
An organization may have started with an on-premise data catalog that ingests and manages metadata centrally. As transformations move to the cloud, one choice is to extend that catalog with connectors that can scan cloud storage, so metadata continues to be managed centrally; this approach can be less costly upfront. Alternatively, each cloud service provider offers its own catalog, such as Glue or Purview, that ingests metadata in a distributed fashion. This choice lets organizations rely on the cloud provider to scan its native storage and capture lineage, without depending on third-party providers’ connectivity.
Similarly, business metadata can be managed either centrally or in a distributed way. Since employees who have used an on-premise catalog for a few years are accustomed to its interface, a central operating model may be preferred for business metadata even when the technical metadata is managed in a distributed way.
The benefits of managing metadata from cloud storage
Catalogs like Glue are not designed to be glossaries by default; rather, they are used to maintain schemas. Recent technological advancements and changes in public policy, such as the GDPR and CCPA regulations, have dramatically increased the use of metadata in recent years. Enterprise metadata functions simplify the data landscape, democratize the search for data assets, and manage schema drift.
[box type=”shadow” align=”” class=”” width=””]The science of data discovery is closely tied to the doctrine of data democratization. It answers questions such as “Where does data exist physically?”, whether in schemas as objects and instances as elements, or when searching for data across single or multiple application systems-of-record and systems of reference such as data lakes or cloud warehouses. Metadata management also requires analysts to input business information at a rapid pace, so for analysts to define data, technical metadata from databases must be available instantly.[/box]
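As a sketch of programmatic discovery, the Glue Data Catalog exposes a `search_tables` API in boto3 that matches free text against table names and metadata. The helper below and the search text "customer" are illustrative:

```python
def find_tables(glue_client, text):
    """Search the Glue Data Catalog for tables whose metadata matches free text."""
    response = glue_client.search_tables(SearchText=text)
    return [table["Name"] for table in response["TableList"]]

# Against a real AWS account (credentials configured):
#   import boto3
#   print(find_tables(boto3.client("glue"), "customer"))  # "customer" is a placeholder query
```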
The operating model can be defined by including the right stakeholders and processes, including toll-gates at a change advisory board or change committee. Data, too, has a lifecycle, POSMAD (Plan, Obtain, Store/Share, Maintain, Apply, Decay), which, when documented, helps bring out the data lineage. Being able to trace lineage makes it easier to conduct impact analysis and maintain data pipelines.
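As a simple illustration of why documented lineage enables impact analysis, the sketch below walks a hypothetical lineage graph to find every downstream dataset affected by a change. The dataset names are made up for the example:

```python
# Hypothetical lineage: each dataset maps to the datasets derived from it
lineage = {
    "crm.customers": ["lake.customers_raw"],
    "lake.customers_raw": ["warehouse.dim_customer"],
    "warehouse.dim_customer": ["reports.churn_dashboard"],
}

def downstream(dataset, graph):
    """Impact analysis: return every dataset reachable downstream of `dataset`."""
    impacted, stack = set(), [dataset]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

# A change to crm.customers ripples through the raw copy, the dimension
# table, and the dashboard built on top of it
print(sorted(downstream("crm.customers", lineage)))
```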
[box type=”shadow” align=”” class=”” width=””]For accelerated delivery of features, even the more advanced agile management models such as Scrum, Kanban, DAD, and FDD can benefit from curating and using definitions of data. Data governance ensures a balance between hosting and serving metadata, allowing metadata to meet most use-case requirements. As data governance formalizes the active management of metadata through a specific operating rhythm and set of processes, it becomes much easier to integrate it into project life cycles when planning for data changes or usage.[/box]
To summarize, what are the basic benefits of managing metadata in a catalog?
There are many benefits of managing metadata; some are highlighted below.
- Increased availability of intelligence about data, bringing better context to insights
- Reduced turnaround time to find answers during analysis
- Increased efficiency of subject-matter experts in producing information for impact analysis
- Removal of ambiguity in relationships among data across the landscape
- Simplified views of data through meaning, identified redundancy, and relationships
[author title=”” image=”http://”] Contributed by Tejasvi Addagada – Data Operations, Governance & Privacy Head, Axis Bank [/author]