Big Data projects are quickly becoming the norm at large enterprises that want to bring together their huge data stores and gain competitive insights. Most of these projects start with research, conducted in-house and coupled with help from external consultants and experts. Several build the infrastructure as a test bed to weigh performance or features. Some want performance better than that of their traditional warehouses. Some want faster results through quicker integrations. Others take a use-case-based implementation approach rather than an infrastructure-based ‘build it and they will come’ approach. Ultimately, upon completion of the first few use cases or implementations, businesses want to expand and get more out of the infrastructure. This is when data stores that started as pristine Data Lakes may quickly end up as Data Swamps. Without good data governance, duplicated data, missing data and other data problems multiply quickly. Users find it difficult to trust the data.
Enter the Data Catalog. When implemented correctly, a data catalog holds definitions of the data, information about where the data resides (systems, files, reports, etc.), where it originates, who to contact about it, and so on. Having this information helps both those who build solutions and those who use them. Business users trying to build reports out of big data repositories will want to know where the data came from and how it is defined. Similarly, developers will want the data definitions in order to transform the data as required by the business. Often data or business analysts build a data catalog for an individual project and share it within their teams, typically as spreadsheets or SharePoint lists. It is a foundation for data projects. Depending on the type of data catalog, once the project completes, it is either forgotten or actively used. Commercial big data systems have a built-in metastore that captures metadata and uses it for data processing. This metastore is not business friendly, nor does it have all the information required to function as a proper data catalog. Likewise, data warehouses and databases have built-in system tables that show metadata about the data. They too lack the business definitions.
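To make the idea concrete, here is a minimal sketch of what a single catalog record might hold, based on the items listed above: a definition, where the data resides, where it originates, and who to contact. The class and field names (CatalogEntry, DataCatalog, locations, origin, contact) are assumptions made for illustration, not the schema of any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CatalogEntry:
    """One catalog record. All field names here are hypothetical."""
    name: str                   # name of the data element
    definition: str = ""        # business definition, in plain language
    locations: List[str] = field(default_factory=list)  # systems, files, reports
    origin: str = ""            # system where the data originates
    contact: str = ""           # who to ask about this data


class DataCatalog:
    """A tiny in-memory catalog keyed by case-insensitive element name."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Re-registering the same name overwrites the earlier entry.
        self._entries[entry.name.lower()] = entry

    def lookup(self, name: str) -> Optional[CatalogEntry]:
        # Returns None when the element has not been catalogued.
        return self._entries.get(name.lower())

    def missing_definitions(self) -> List[str]:
        # Elements with technical metadata but no business definition,
        # which is the gap a metastore or system table leaves open.
        return sorted(
            e.name for e in self._entries.values() if not e.definition.strip()
        )
```

A spreadsheet or SharePoint list with the same columns serves the same purpose. The point of the sketch is only that each data element carries its business definition and ownership alongside its technical location.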
A poorly managed or non-existent data catalog causes confusion and data errors in reports, systems and integrations. Businesses relying on this poorly defined data may end up making poor decisions. This can be avoided by having a good data catalog. It need not be exhaustive from the get-go. It can start small and grow over time. A centralized catalog that has proper controls and is accessible to everyone is a great start. Developers can build quickly and accurately using the catalog definitions. Report builders and users can better understand the data used in their reports. As its usage grows, data & analytics programs and data governance programs will find the catalog indispensable. Looking back, you may wonder where you would be on this data journey without a data catalog as a guiding map.