Data lifecycle management is a complex but important topic. Even though raw storage keeps getting cheaper over time, it is still important to build a data platform that stores data efficiently. By efficient, I mean in terms of both cost and performance. It is necessary to build a data architecture that allows fast access to data but, on the other hand, also stores data in a cost-effective way. These two goals conflict: cost-effective storage is often slow and thus won't deliver much throughput, whereas high-performance storage is expensive to build. However, the real question is whether it is really necessary to keep all data on high-performance tiers. To answer that, you need to measure the value of your data and decide how much of it belongs on which kind of storage.
How to manage the data lifecycle
The data architect is in charge of storing data efficiently – both in terms of performance and cost. The architect also needs to take care of data lifecycle management. Some years ago, the answer was to put all relevant data into the data warehouse. Since this was too expensive for most data, data was moved to HDFS (Hadoop) in recent years. But with the cloud, we now have more diverse options. We can store data in message buffers (such as Kafka), on HDFS (disk-based) systems and in cloud-based object stores. Especially the latter provides even more options: starting out as general-purpose cloud storage, these services have evolved over the last years into premium object stores (with high performance), general-purpose storage and cheap archive stores. This gives more flexibility to store data even more cost-effectively. Data would typically be demoted from in-memory systems (e.g. Kafka instances) or premium storage to general-purpose storage or even to archive stores. The data architect now has the possibility to store data in the most effective way (which arguably makes a Kappa architecture unnecessary – the cloud prefers Lambda!).
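To make the demotion path a bit more concrete, here is a minimal sketch of how it can be automated with object-store lifecycle rules. It uses AWS S3 via boto3 as one example; the bucket name, prefix, day counts and storage classes are assumptions for illustration, and other clouds offer equivalent features (e.g. Azure Blob access tiers, Google Cloud Storage classes).

```python
# Sketch: demote objects to cheaper storage tiers as they age.
# Assumes a hypothetical bucket "my-data-lake" and prefix "events/";
# the day thresholds should follow the value of your data, not these defaults.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "demote-event-data",
                "Filter": {"Prefix": "events/"},
                "Status": "Enabled",
                "Transitions": [
                    # premium/hot -> infrequent access after 30 days
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    # infrequent access -> archive after 90 days
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                # delete once the data has lost (almost) all of its value
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```

Once such a rule is in place, the demotion happens automatically; the architect's job shifts from moving data around to choosing the right thresholds.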
But this adds additional pressure to the data architect's job. How does the data architect figure out the value of the data in order to decide where to store it? I recently came across a very interesting article introducing something called "the half-life of data". Basically, the article describes how fast data loses its value, which makes it easier to judge where to store the data. For those who want to read it, the article can be found here.
What is the half-life of data?
The half-life of data basically categorises data into three different value types:
- Strategic data: companies use this data for strategic decision-making. The data still has high value even after some days, so it should be easy and fast to access.
- Operational data: the data still has some value after a few hours but then loses it. It should be kept available for a few hours up to a few days, then demoted to cheaper storage.
- Tactical data: the data has value only for a few minutes up to a few hours. Its value is lost quickly, so it should either be stored on very cheap storage or even deleted.
There is also an interesting infographic that illustrates this.
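One way to operationalise these categories is to assign each dataset an assumed half-life and let the remaining value drive the tier decision. The following sketch is purely illustrative – the half-life numbers, thresholds and tier names are my own assumptions, not figures from the article.

```python
# Illustrative sketch: map a dataset's age and assumed half-life to a storage tier.
# All half-lives and thresholds below are assumptions for demonstration only.

def remaining_value(age_hours: float, half_life_hours: float) -> float:
    """Fraction of the original value left after age_hours (exponential decay)."""
    return 0.5 ** (age_hours / half_life_hours)

# Assumed half-lives per category (hypothetical numbers)
HALF_LIFE_HOURS = {
    "strategic": 7 * 24,   # value persists for days
    "operational": 12,     # value fades within hours to days
    "tactical": 0.5,       # value fades within minutes to hours
}

def storage_tier(category: str, age_hours: float) -> str:
    value = remaining_value(age_hours, HALF_LIFE_HOURS[category])
    if value > 0.5:
        return "premium"          # fast, expensive storage
    elif value > 0.1:
        return "general-purpose"  # standard object storage
    elif value > 0.01:
        return "archive"          # cheap, slow storage
    return "delete"               # not worth keeping

# Example: operational data that is two days old
print(storage_tier("operational", age_hours=48))  # -> "archive"
```

The exact numbers matter less than the exercise itself: once you make the decay of value explicit, the storage decision largely falls out of it.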
What do you think? What is your take on it? How do you measure the value of your data? How do you handle your data lifecycle in your company?
This post is part of the "Big Data for Business" tutorial. In this tutorial, I explain various aspects of handling data right within a company.