Introduction
Enterprises have seen a significant shift in how data is consumed, marked by an exponential increase in both the volume and diversity of available information in recent years. Statista predicts that global data consumption will surpass 180 zettabytes in 2025. If a high-resolution image is around 5 megabytes, a zettabyte could store about 200 trillion photos. With this rapid increase in available data, businesses have found the need to utilize it to obtain deeper insights, improve their decision-making processes and stimulate innovation. But within this massive realm of information lies a lesser-known aspect called dark data.
What is dark data
According to Gartner, dark data is defined as “information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes”. Dark data represents the unexplored portion of collected information and is not being actively used to derive insights or make decisions.
Imagine the usual office setup with a huge filing cabinet containing many drawers. Some drawers are opened regularly, and the information stored in them is used every day for reports or decision-making. But there are also drawers that are rarely opened, holding files that were put away months ago; people may have forgotten what’s inside and why it was stored. This is like dark data – often-overlooked information that hides within the organization’s data stores and remains unexamined. A few examples of dark data include: inactive or historical data, unused logs and metadata, raw sensor data, backup and archived information, and customer interactions and feedback.
What makes the data lose its light
Dark data exists primarily because of the sheer volume of collected data, which often includes unstructured, unused or unanalyzed information. Let’s name a few reasons why dark data accumulates, which we can easily remember as S-I-L-O.
Shifting data sources
As technology advances, new data sources come into play, such as social media platforms, streaming services and digital media, geospatial data and sensor data. These varied sources introduce a new range of formats, structures and complexities, making them challenging for organizations to integrate, manage and analyze. Instead of being utilized, this information adds to the accumulation of dark data within the organization.
Inadequate data management strategy
This can happen when organizations fail to implement effective data classification or tagging systems. Without proper categorization, identifying relevant data for analysis becomes challenging; valuable information gets overlooked and buried within the organization’s systems, eventually becoming part of the dark data.
Lack of the right technological tools and storage capabilities
If there is not enough storage space or scalable infrastructure to accommodate an organization’s growing data, some information may remain uncollected and unprocessed, contributing to the expansion of dark data.
Obsolete data
Obsolete data is information that has become outdated over time due to changes in business rules, customer preferences and market conditions. When obsolete information is retained without regular review or purging, it adds to the realm of dark data.
What nightmare is brought by dark data
What’s the problem with all this information going dark? With so much data being generated, businesses are increasingly shifting towards cloud storage because of the advantages it offers, such as scalability and flexibility, cost efficiency, and enhanced security. In a chart published by Jack Flynn in his article, 60% of corporate data was already stored in the cloud in 2022, up from 30% in 2015. He also shared that by 2025, a projected 175 zettabytes of data will be stored in the cloud. Additionally, according to Gartner, spending on public cloud services is predicted to reach $679 billion in 2024. With dark data accumulating and hiding in organizations’ cabinets, its impact on cloud spending will be significantly felt.
Cloud service providers base their fees on the amount of storage used; holding onto excessive dark data can lead to increased storage expenses. In a survey result shared by Veritas, companies are spending an average of $26M per year on storage attributed to dark data alone.
How to escape the shadow of dark data
So how can we deal with dark data in our organization?
Handling dark data involves implementing strategies to identify, categorize and either utilize or dispose of the unexplored information effectively. As an organization, we need to conduct a thorough analysis of stored data to distinguish actively used, valuable information from dormant or underutilized dark data. For instance, AWS Glue Data Catalog is a fully managed metadata catalog service that lets us build a centralized metadata repository, which in turn helps in discovering, profiling and cataloging data across various data sources and formats.

Datahub, an open-source centralized metadata management tool, is another platform we can leverage to understand and manage metadata. Some of Datahub’s notable features include data discovery, search across data platforms, viewing data schemas and usage, editing data descriptions and tags, surfacing the most-used queries and visualizing data lineage. In a Datahub Town Hall last Jan 4, Airtel described how Datahub and the transformation tool dbt serve as a self-describing platform that allows their stewards to document, tag and describe their data products comprehensively. With these metadata management platforms, organizations can catalog data, identify which information has low or no usage and pinpoint which data sources are not used in any data pipeline. By doing so, we can assess whether data needs to be retained or is ready for proper disposal.
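As a minimal sketch of what that catalog-driven triage could look like, the Python snippet below uses boto3 to scan a hypothetical Glue database (the database name and the 180-day staleness threshold are assumptions, not part of the original article) and flags tables whose last recorded access is old or missing:

```python
import datetime

import boto3

# Hypothetical database name and staleness threshold - adjust for your environment.
DATABASE = "analytics_raw"
STALE_AFTER_DAYS = 180

glue = boto3.client("glue")
cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=STALE_AFTER_DAYS)

paginator = glue.get_paginator("get_tables")
stale_tables = []
for page in paginator.paginate(DatabaseName=DATABASE):
    for table in page["TableList"]:
        # LastAccessTime is only populated by some sources; a missing value
        # is itself a hint that usage of this table is not being tracked.
        last_access = table.get("LastAccessTime")
        if last_access is None or last_access < cutoff:
            stale_tables.append(table["Name"])

print(f"Candidate dark-data tables in {DATABASE}: {stale_tables}")
```

A report like this is only a starting point; whether a flagged table is truly dark data or simply a yearly-used dataset still needs a human (or a retention policy) to decide.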
What data needs to be retained, and for how long, is governed by a data retention policy. Retention policy is often influenced by legal, regulatory and business requirements. Defined retention periods help organizations differentiate between actively utilized data and information that has outlived its usefulness. Data that exceeds its retention period can be flagged for cleanup. Tools like Apache Gobblin, a data management framework with retention capabilities, can be used to configure and manage data retention for different datasets; Gobblin workflows can include steps for archiving or deleting data based on retention policies. Another service available in the market is AWS Data Lifecycle Manager. If you are on the cloud and using Amazon EBS volumes, you can create automated lifecycle policies that handle snapshot creation and retention for backup and recovery purposes through the AWS Data Lifecycle Manager service. With the help of these kinds of tools and services, organizations can reduce storage and the costs associated with it by automatically archiving data or cleaning up information that is no longer required.
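As a hedged illustration of such an automated policy, the sketch below creates an AWS Data Lifecycle Manager policy with boto3; the IAM role ARN, the Backup=true tag and the 7-snapshot retention count are placeholders you would replace with your own values:

```python
import boto3

dlm = boto3.client("dlm")

# A minimal sketch: snapshot EBS volumes tagged Backup=true once a day and
# keep only the last 7 snapshots, so old backups do not pile up as dark data.
response = dlm.create_lifecycle_policy(
    ExecutionRoleArn="arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole",
    Description="Daily EBS snapshots with a 7-snapshot retention",
    State="ENABLED",
    PolicyDetails={
        "ResourceTypes": ["VOLUME"],
        "TargetTags": [{"Key": "Backup", "Value": "true"}],
        "Schedules": [
            {
                "Name": "DailySnapshots",
                "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
                "RetainRule": {"Count": 7},
            }
        ],
    },
)
print("Created lifecycle policy:", response["PolicyId"])
```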
To mitigate cloud storage costs further, we can leverage cloud storage solutions that offer tiered storage options (e.g. hot and cold storage tiers) to store data cost-effectively based on its access frequency and performance requirements. Hot storage is designed for frequently accessed data that requires low latency and high performance, such as real-time analytics, live streaming and active databases. Cold storage, on the other hand, is optimized for data that is infrequently accessed or has long-term retention requirements. It is generally cheaper because the data does not need immediate access but still has to be kept for compliance, backup or archival purposes. Amazon S3 Glacier and Amazon EBS Cold HDD (sc1) volumes, to name a few, provide cold storage options. Moreover, compression and deduplication can be used to minimize storage usage and improve space utilization. Dark data, which is often dormant or rarely accessed, can be moved to the cold storage tier to reduce the overall cost of the storage infrastructure.
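One common way to implement that tiering on S3 is a lifecycle rule that transitions dormant objects to Glacier. The sketch below assumes a hypothetical bucket and an "archive/" prefix, with a 90-day transition and a 5-year expiration chosen purely for illustration:

```python
import boto3

s3 = boto3.client("s3")

# Objects under the "archive/" prefix move to Glacier after 90 days and are
# deleted after 5 years - dormant data stops paying hot-storage prices.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "move-dormant-data-to-cold-storage",
                "Status": "Enabled",
                "Filter": {"Prefix": "archive/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```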
We can also set up monitoring services or tools to continuously track data usage, processing workflows and resource utilization. AWS CloudWatch offers metrics to track the usage of AWS resources, including storage, databases and compute instances. By closely monitoring resource usage, we can detect resources that are underutilized and potentially serving as a source of dark data.
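For example, the sketch below pulls 30 days of the built-in S3 storage metric for a hypothetical bucket from CloudWatch; a bucket that keeps growing while its request metrics stay flat is a good candidate for the dark-data review described above:

```python
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

# Daily storage size of a (hypothetical) bucket over the last 30 days.
end = datetime.datetime.now(datetime.timezone.utc)
start = end - datetime.timedelta(days=30)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/S3",
    MetricName="BucketSizeBytes",
    Dimensions=[
        {"Name": "BucketName", "Value": "my-analytics-bucket"},
        {"Name": "StorageType", "Value": "StandardStorage"},
    ],
    StartTime=start,
    EndTime=end,
    Period=86400,  # S3 storage metrics are reported once per day
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), f"{point['Average'] / 1e9:.1f} GB")
```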
Last on this list is the adaptation and continuous improvement of data management strategies. This involves refining current practices and policies and adapting to changing business needs and technological advancements. It can be done through regular reviews and assessments of the business glossary and business needs by the stakeholders, continuous alignment of data management practices with evolving business goals and requirements, and training sessions that educate employees about data management best practices, emphasizing responsible data handling, proper data classification and adherence to data retention policies.
The strategies above can help diminish dark data and consequently reduce its impact on cloud spending. This approach not only cuts down unnecessary expenses but also ensures better data governance, compliance and the potential extraction of valuable insights from the retained data.