Data lifecycle: What is the half-life of data and how does it affect your architecture?

Data lifecycle management is complex and important to get right. Although storage keeps getting cheaper in absolute terms, it is still important to build a data platform that stores data efficiently. By efficient, I mean both in cost and in performance. A data architecture has to allow fast access to data and, at the same time, store data cost-effectively. These two goals conflict: cost-effective storage is often slow and delivers little throughput, while high-performance storage is often expensive to build. The real question, however, is whether all data really needs to sit in high-performance storage. To answer it, you need to measure the value of your data and decide how much of it belongs in each kind of storage.

How to manage the data lifecycle

The data architect is in charge of storing data efficiently, both in performance and in cost, and also needs to take care of data lifecycle management. Some years ago, the answer was to put all relevant data into the data warehouse. Since this was too expensive for most data, data has been put into HDFS (Hadoop) in recent years. But with the cloud, we now have more diverse options. We can store data in message buffers (such as Kafka), on HDFS (disk-based) and in cloud-based object stores. The latter in particular provides even more options. Starting out as general-purpose cloud storage, object stores have evolved over the last years into premium object stores (with high performance), general-purpose storage and cheap archive stores. This gives more flexibility to store data even more cost-effectively. Data would typically be demoted from in-memory (e.g. via instances on Kafka) or premium storage to general-purpose storage or even to archive stores. The data architect now has the possibility to store data in the most effective way (thus making a Kappa Architecture useless; the cloud prefers Lambda!).
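The demotion path described above can be sketched as a simple age-based rule. This is only a minimal illustration: the tier names follow the text, but the age thresholds (24 hours, 30 days) and the function name `storage_tier` are assumptions made up for this example and need to be tuned to your own platform and pricing.

```python
from datetime import timedelta

# Hypothetical age limits per tier, ordered from fastest/most expensive
# to slowest/cheapest. The thresholds are illustrative assumptions only.
TIERS = [
    (timedelta(hours=24), "message buffer / premium object store"),
    (timedelta(days=30), "general-purpose object store"),
]
ARCHIVE_TIER = "archive store"


def storage_tier(age: timedelta) -> str:
    """Return the first tier whose age limit the data has not yet exceeded."""
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    # Anything older than the last limit is demoted to the archive.
    return ARCHIVE_TIER


if __name__ == "__main__":
    for age in (timedelta(minutes=5), timedelta(days=3), timedelta(days=400)):
        print(age, "->", storage_tier(age))
```

In practice you would rarely code this yourself, since most cloud object stores offer lifecycle rules that move objects between tiers by age. The sketch only makes the decision logic explicit.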

But this adds pressure to the data architect’s job. How would the data architect figure out the value of the data in order to decide where to store it? I recently came across a very interesting article introducing something called “the half life of data”. Basically, this article describes how fast data loses value, which makes it easier to judge where to store the data. For those who want to read it, the article is linked below.

What is the half life of data?

The half life of data basically categorises data into 3 different value types:

  • Strategic Data: companies use this data for strategic decision making. The data still has high value after some days, so it should be easy and fast to access.
  • Operational Data: the data still has some value after some hours, but then loses value. It should be kept available for some hours up to a few days and then be demoted to cheaper storage.
  • Tactical Data: the data has value only for some minutes up to a few hours. The value is lost fast, so the data should either be stored in very cheap storage or even be deleted.
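To make the idea of a half life more tangible, here is a small sketch that models the value of data as exponential decay. Both the decay model and the half-life figures per category are assumptions chosen for illustration; they are not taken from the article, and the names `HALF_LIFE_HOURS`, `remaining_value` and `hours_until_fraction` are hypothetical.

```python
import math

# Hypothetical half-lives in hours per category (illustrative only;
# measure your own data to get real figures).
HALF_LIFE_HOURS = {
    "tactical": 0.5,
    "operational": 6.0,
    "strategic": 72.0,
}


def remaining_value(age_hours: float, half_life_hours: float) -> float:
    """Fraction of the original value left after age_hours,
    assuming the value halves every half_life_hours."""
    if half_life_hours <= 0:
        raise ValueError("half_life_hours must be positive")
    return 0.5 ** (age_hours / half_life_hours)


def hours_until_fraction(fraction: float, half_life_hours: float) -> float:
    """Age in hours at which only `fraction` of the original value is left."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return half_life_hours * math.log2(1 / fraction)


if __name__ == "__main__":
    for category, half_life in HALF_LIFE_HOURS.items():
        after_one_day = remaining_value(24, half_life)
        print(f"{category}: {after_one_day:.4f} of the value left after 24 hours")
```

With such a model, a demotion rule could read "move data to cheaper storage once less than a quarter of its value is left", which `hours_until_fraction(0.25, half_life)` turns into a concrete age.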

There is also an interesting infographic that illustrates this:

The half life of data: https://nucleusresearch.com/research/single/guidebook-measuring-the-half-life-of-data/

What do you think? What is your take on it? How do you measure the value of your data? How do you handle your data lifecycle in your company?

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.

AI is killing us all … not literally. Is AI dangerous?

I have to admit: I am having a really hard time with AI services and vendors’ sales pitches about AI. Currently, the term AI is a hype without limits; I hear people talking about AI without a clue what it actually is and how it works. I don’t want to be mean, but sales people are currently calling things “AI” that are nothing more than a rules engine. As already stated in my post on Advanced Analytics predictions, I tend to call this “rules based AI”. A really smart one ;). So, is AI dangerous at all?

AI isn’t as smart as you might think

So why is AI creating so much trouble for all of us? It is mainly the sales people who now promise us the magic AI thing. I recently heard a sales pitch where the seller told me: “you know, AI is this thing our magicians make impressive stuff with”. I was really overwhelmed and didn’t know how to react. The only thing that came to my mind was to ask him whether their AI is already “rule based”. He lit up, looked at me with a winning grin and told me: “Yes, we have a world-class rules based AI”. I didn’t ask any further, since it would have led nowhere. However, I was really honoured to be a magician now.

I basically don’t fall for such sales pitches, since I can easily tell real AI apart. Only a few vendors get it done; most others have simply renamed their rules engine to “AI”. But imagine what happens when you frequently deal with business units. They are not so deep into technology, and sales people now promise them the Swiss army knife. I constantly get confronted with questions and have to explain the mess that has been created there. This creates a lot of work and overload for an analytics department that should be delivering business results.

One demand from my side: could we please end this bullshit bingo about “AI”?

As always, I am looking forward to your feedback and thoughts about this topic 🙂

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you remain sceptical, I wouldn’t recommend Terminator. There, the question remains: is AI dangerous at all?
