Read about Big Data and what is necessary to implement it in your company

To get the most out of your data strategy in an enterprise, it is necessary to cluster the different user types that arise. All of them are users of data, but with different needs and demands on it. In my opinion, they differ mainly in their level of expertise. Basically, I see three different user types:

Three degrees of Data Access

The user types differ in how they use data and in how many of them there are. Let’s start with the lower part of the pyramid – Business Users.

Business Users

The first layer are the business users. These are users who need data for their daily decisions, but are rather consumers of the data. They look at different reports to make decisions on their business topics. They could be in Marketing, Sales or Technology – depending on the company itself. These users would start with pre-defined reports, but in the long run would rather go for customized reports. Self-service BI is a great fit for that. These users are experienced in interpreting data for their business goals and in asking questions of their data. This could be about reviewing the performance of a campaign, weekly or monthly sales reports, … They create huge load on the underlying systems without understanding the implementation and complexity underneath – and they don’t have to. From time to time, they start digging deeper into their data and thus become power users – our next level.

Power Users

Power Users often emerge from Business Users. A power user is typically a person who is close to the business and understands the needs and processes around it. However, they also have a solid technical understanding (or gained it in the process of becoming power users). They have some level of SQL know-how or know the basics of other scripting tools. They often work with the business users (even in the same department) on solving business questions. They also work closely with Data Engineers on accessing data sources and integrating new ones, and they use self-service analytics tools to get a basic level of data science done. However, they aren’t data scientists, though they might move in this direction if they invest significant time. This brings us to the next level – the data scientists.

Data Scientists

This is the top level of our pyramid. Data scientists are a minority – there are far more business users and power users. However, they work on more challenging topics than the previous two groups. They work closely with power users and business users and might still be in the same department, but not necessarily. They work with advanced tools such as R and Python and fine-tune the models the power users built with self-service analytics tools, or translate the business questions raised by the business users into algorithms.

Often, these three groups develop in different directions. However, it is necessary that all of them work together – as a team – in order to make data projects a success.

Honestly, those data scientists are doing a great job. Literally, they are saving all industries from a strong decline. And those heroes, they are doing all of that alone. Alone? Not fully.

There are some poor guys who support their success: those who are called Data Engineers. Actually, when looking at projects so far, a huge majority of tasks has been carried out by these guys (and girls) that hardly anyone is talking about. All the fame seems to go to the data scientists, while the data engineers aren’t receiving any credit. I remember one of the many meetings with C-Level executives I had, where I explained the structure of a team dealing with data. Everyone in the board room agreed on “we need data scientists”. Then, one of the executives raised the question: “But what are these data engineers about? Do we really need them, or could we maybe have more data scientists instead?” I kept on explaining and they accepted it, but I had the feeling that they still wanted to go with more Data Scientists than Engineers eventually. This comes out of the trend and hype we see around data scientists. Everyone knows that they are important, but data-driven projects only succeed when a team with mixed skills and know-how comes together.

None of the data-driven projects I have seen so far would have worked without data engineers. They are relevant for many different things, but mainly – and in an ideal world – they work in close cooperation with data scientists. If a company’s data maturity is high, the data engineer prepares the data for the data scientist and then works with the data scientist again on putting the algorithm into production. I saw a lot of projects where the latter wasn’t working: the first step (data preparation) was successful, but the later step (automation) was never done.

But there are more roles involved: one role, which is rather a specialization of the data engineer, is the data system engineer. This is often not a dedicated role, but carried out by data engineers. Here, we basically talk about infrastructure preparation and set-up for the data scientists or engineers. Another role is the data architect, who ensures a company-wide approach to data. And of course there are data owners and data stewards.

I stated it several times, but it is worth stating it over and over again: data science isn’t a one (wo)man show, it is ALWAYS a team effort.

When Kappa first appeared as an architecture style (introduced by Jay Kreps), I was really fond of this new approach. I carried out several projects that went with Kafka as the main “thing”, without the trade-offs of Lambda. But the more complex projects got, the more I figured out that it isn’t the answer to everything and that we ended up with Lambda again … somehow.

First of all, what is the benefit of Kappa and the trade-off with Lambda? It all started with Jay Kreps questioning the Lambda Architecture in a blog post. Basically, with the different layers in the Lambda Architecture (Speed Layer, Batch Layer and Serving Layer) you need to use different tools and programming languages. This leads to code complexity and the risk that you end up with inconsistent versions of your processing logic. A change to the logic on one layer requires changes on the other layer as well. Complexity is one thing we want to remove from our architecture at all times, so we should also do it with Data Processing.

The Kappa Architecture came with the promise to put everything into one system: Apache Kafka. The speed at which data can be processed with it is tremendous, and the simplicity is great as well. You only need to change code once, not twice or three times as with Lambda. This also lowers labour costs, as fewer people are necessary to produce and maintain code. Also, all our data is available at our fingertips, without the major delays of batch processing. This brings great benefits to business units, as they don’t need to wait forever for processing.

However, my initial statement was about something else – that I mistrust the Kappa Architecture. I implemented this architecture style in several IoT projects, where we had to deal with sensor data. There was no question whether Kappa was the right thing, as we were in a rather isolated use-case. But as soon as you have to look at a Big Data architecture for a large enterprise (and not only at isolated use-cases), you end up with one major issue around Kappa: cost.

In use-cases where data doesn’t need to be available within minutes, Kappa seems to be overkill. Especially in the cloud, Lambda brings major cost benefits through Object Storages in combination with automated processing capabilities such as Azure Databricks. In enterprise environments, cost does matter and an architecture should also be cost efficient. This also holds true when it comes to the half-life of data, which I recently wrote about. Basically, data that loses its value fast should be stored on cheap storage systems from the very beginning.

An easy way to compare Kappa to Lambda is the cost per terabyte stored or processed. As a scenario, we want to store 32 TB. With a Kappa Architecture running 24/7, we would have to spend an estimated $16,000 per month (no discounts, no reserved instances, pay-as-you-go pricing; E64 instances with 64 cores per node, 432 GB RAM and E80 SSDs attached with 32 TB per disk). If we used Lambda and only processed once per day, we would need 32 TB on a Blob Store, which costs $680 per month. On top of that, we would use the cluster above for processing with Spark for 1 hour per day: $544. Summing up, this amounts to $1,224 per month – a cost ratio of about 1:13.
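The arithmetic above can be written down as a small sketch. It uses only the monthly figures quoted in the text; the constant and function names are my own and purely illustrative, and the prices are the author's estimates rather than current list prices.

```python
# Monthly figures quoted in the text (USD, pay-as-you-go, no discounts).
KAPPA_CLUSTER_24_7 = 16_000  # cluster running around the clock
BLOB_STORAGE_32_TB = 680     # 32 TB on a blob store
SPARK_1H_PER_DAY = 544       # same cluster, used one hour per day


def lambda_monthly_cost(storage: float, processing: float) -> float:
    """Lambda cost = cheap storage plus short-lived processing."""
    return storage + processing


def cost_ratio(kappa: float, lam: float) -> float:
    """How many times more expensive the always-on setup is."""
    return kappa / lam


lambda_total = lambda_monthly_cost(BLOB_STORAGE_32_TB, SPARK_1H_PER_DAY)
ratio = cost_ratio(KAPPA_CLUSTER_24_7, lambda_total)
print(f"Lambda: {lambda_total} USD/month, Kappa costs about {ratio:.0f}x as much")
```

Changing the processing window (for example two hours per day instead of one) only changes the processing figure, which makes it easy to see where the break-even point for a given use-case lies.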

However, this is a very simple calculation and it can still be optimised on both sides. In the broader enterprise context, Kappa is only a specialisation of Lambda and won’t exist entirely on its own. The choice between Kappa and Lambda can only be made per use-case, and this is what I recommend you do.

… this is at least what I hear often. When talking to people who are data-minded, they would argue “false but true”. Business units often press for quick-and-dirty data delivery and thus force IT units to deliver data in an ad-hoc manner, without governance and in bad quality. As a result, business projects are carried out inefficiently and without a 360 degree view of the data. Business units often trigger inefficiency in data and thus projects fail – more or less digging their own hole.

The issue with data governance is simple: you hardly see it in the P&L if you did it right. At least, you don’t see it directly. If your data is in bad shape, you might see it in other results, such as failing projects and bad results in projects which use data. Often the business is blamed for bad results, even though the data was the weak point. It is therefore very important to apply a comprehensive data governance strategy in the entire company (and not just in one division or business unit). Governance consists of several topics that need to be addressed:

  • Data Security and Access: data needs to stay secure and storages need to implement a high level of security. Access should be easy but secure. Data Governance should enable self-service analytics and not block it.
  • One common data storage: Data should be stored under the same standards across the company. A limited number of storages should cover all needs, and different storage techniques should be connected. No silos should exist.
  • Data Catalog: It should be possible to see what data is available in the company and how to access it. A data catalog should make it possible to browse different data sources and see what is inside (as long as one is allowed to access this data)
  • Systems/Processes using data: it should be tracked and audited what systems and processes access data. If there are changes to data, it should be possible to see what systems and processes might be affected by it.
  • Auditing: An audit log should be available, especially to see who accessed data when
  • Data quality tracking: it should be possible to track the quality of datasets under specific items. These could be: accuracy, timeliness, correctness, …
  • Metadata about your data: Metadata about the data itself should be available. You should know what can be inside your data and your Metadata should describe your data precisely.
  • Master data: you should have a golden record about all your data. This is challenging and difficult, but should be the target

Achieving this is very complex, but it is possible if the company implements a good data strategy.
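To make the data quality tracking item from the list more concrete, here is a minimal sketch of two such metrics. Everything in it is an assumption for illustration: the `Record` fields, the 30-day freshness window and the deliberately naive e-mail rule are hypothetical and not taken from any specific governance tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Sequence


@dataclass
class Record:
    """Hypothetical customer record; the fields are illustrative only."""
    customer_id: str
    email: str
    updated_at: datetime


def timeliness(records: Sequence[Record], now: datetime, max_age: timedelta) -> float:
    """Share of records that were updated within max_age of now."""
    if not records:
        return 0.0
    fresh = sum(1 for r in records if now - r.updated_at <= max_age)
    return fresh / len(records)


def correctness(records: Sequence[Record], is_valid: Callable[[Record], bool]) -> float:
    """Share of records that pass a validation rule."""
    if not records:
        return 0.0
    return sum(1 for r in records if is_valid(r)) / len(records)


def looks_like_email(record: Record) -> bool:
    # Deliberately naive rule, just enough to show the idea.
    return "@" in record.email and "." in record.email.split("@")[-1]


now = datetime(2024, 1, 31)
records = [
    Record("c1", "anna@example.com", datetime(2024, 1, 30)),
    Record("c2", "bob@example", datetime(2024, 1, 29)),
    Record("c3", "", datetime(2024, 1, 25)),
    Record("c4", "dora@example.org", datetime(2023, 11, 1)),
]

scores = {
    "timeliness": timeliness(records, now, max_age=timedelta(days=30)),
    "correctness": correctness(records, looks_like_email),
}
print(scores)
```

Scores like these could be stored next to each dataset in the data catalog, so that users see the quality of a source before they build on it.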

Although storing data is getting cheaper over time in absolute terms, it is still important to build a data platform that stores data in an efficient way. By efficient, I mean both cost and performance wise. It is necessary to build a data architecture that allows fast access to data, but on the other hand also stores data in a cost effective way. The two goals are somewhat conflicting, because cost effective storage is often slow and thus won’t deliver much throughput. Highly performant storage, in contrast, is often expensive to build. However, the real question is whether it is necessary to store all data in high performing systems. Therefore, it is necessary to measure the value of your data and decide how much of it belongs in which kind of storage.

The Data Architect is in charge of storing data efficiently – both in performance and cost. Some years ago, the answer was to put all relevant data into the data warehouse. Since this was too expensive for most data, data was put into HDFS (Hadoop) in recent years. But with the cloud, we now have more diverse options. We can store data in message buffers (such as Kafka), on HDFS systems (disk based) and in cloud-based object stores. Especially the latter provide even more options. Starting out as general purpose cloud storage, those services have evolved over the last years into premium object stores (with high performance), general purpose storage and cheap archive stores. This gives more flexibility to store data even more cost effectively. Data would typically be demoted from in-memory (e.g. via instances on Kafka) or premium storage to general purpose storage or even to archive stores. The data architect now has the possibility to store data in the most effective way (which makes a Kappa Architecture useless – cloud prefers Lambda!).

But this now adds additional pressure to the data architect’s job. How would the data architect figure out the value of the data in order to decide where to store it? I recently came across a very interesting article introducing something called “the half life of data”. Basically, this article describes how fast data loses value, which makes it easier to judge where to store the data. For those who want to read it, the source is linked below.

The half life of data basically categorises data into 3 different value types:

  • Strategic Data: companies use this data for strategic decision making. Data still has high value after some days, so it should be easy and fast to access.
  • Operational Data: data still has some value after some hours, but then loses value. Data should be kept available for some hours to a maximum of days, then it should be demoted to cheaper storage.
  • Tactical Data: data has value only for some minutes to maximum of hours. Value is lost fast, so it should either be stored in a very cheap storage or even deleted.

There is also an interesting infographic that illustrates this:

The half life of data: https://nucleusresearch.com/research/single/guidebook-measuring-the-half-life-of-data/
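The three categories can be turned into a simple storage rule. The following is only a sketch: the tier names follow the premium, general purpose and archive stores mentioned earlier, while the concrete time windows are my own placeholder assumptions and would have to be measured per dataset.

```python
from datetime import timedelta
from enum import Enum


class Category(Enum):
    TACTICAL = "tactical"
    OPERATIONAL = "operational"
    STRATEGIC = "strategic"


class Tier(Enum):
    PREMIUM = "premium"
    GENERAL_PURPOSE = "general purpose"
    ARCHIVE = "archive"


# Assumed windows in which data of each category is still "hot".
# These numbers are placeholders, not values from the article.
HOT_WINDOW = {
    Category.TACTICAL: timedelta(hours=1),
    Category.OPERATIONAL: timedelta(days=1),
    Category.STRATEGIC: timedelta(days=7),
}


def choose_tier(category: Category, age: timedelta) -> Tier:
    """Pick a storage tier from the data category and the age of the data."""
    if age <= HOT_WINDOW[category]:
        return Tier.PREMIUM
    if category is Category.TACTICAL:
        # Tactical data loses its value fast: archive it (or delete it).
        return Tier.ARCHIVE
    return Tier.GENERAL_PURPOSE


print(choose_tier(Category.TACTICAL, timedelta(minutes=10)))
print(choose_tier(Category.OPERATIONAL, timedelta(days=3)))
```

A rule like this could run as a scheduled job that demotes data between tiers; most cloud object stores also offer lifecycle policies that cover the simple age-based part of it.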

What do you think? What is your take on it? How do you measure the value of your data?

One topic every company is currently discussing at a high level is marketing automation. It is a key factor in digitalising a company’s marketing approach. With Marketing Automation, marketing has the chance to become much more precise and to the point. No more unnecessary marketing spend, every cent spent wisely – and no advertisement overload. At least this is the promise from vendors, if we all lived in a perfect world. But what does it take to live in this perfect marketing world? DATA.

One disclaimer upfront: I am not a marketing expert. I try to enable marketing to achieve these goals by the utilisation of our data – next to other tasks. Data is the weak point in Marketing Automation. If you have bad data, you will end up having bad Marketing Automation. Data is the engine or the oil for Marketing Automation. But why is it so crucial to get the data right for it?

Until now, data was rarely seen as a strategic asset within companies. It was rather treated like something that you have to store somewhere. So it ended up being stored in silos within different departments, which makes access hard and connections difficult. Also, governance was and still is neglected. When data scientists start to work with data, they often fight governance issues: what is inside the data, why is it structured in a specific way and what should the data tell us? This process often takes weeks and is expensive. Some industries (e.g. banks) are more mature, but are also struggling with this. In the last years, a lot of companies built data warehouses to consolidate their view on the data. Data warehouses are heavily outdated and overly expensive nowadays, and to this day most DWHs are poorly structured. In the last years, companies started to shift data to datalakes (initially Hadoop) to get a 360° view. Economically, this makes perfect sense, but a holistic customer model remains a challenge there as well. It takes quite some time and resources to build this. The newest hype in marketing is Customer Data Platforms (CDPs). So far, their value hasn’t been proven. Most of them are an abstraction layer that makes data handling easier for marketers. However, integrating the data into a CDP is challenging in itself, and there is a high risk of creating another data silo.

In order to enable Marketing Automation with data, the following steps are necessary:

  • Get your data house in order. Build your data assets on open standards to change technology and vendor if necessary. Don’t lock in your data to one vendor
  • Do the first steps in small chunks, closely aligned with Marketing – in an agile way. Customer journeys are often dedicated to specific data sources and thus a full-blown model isn’t necessary. However, make sure that the model stays extensible and the big picture is always available. A recommendation is to use a NoSQL store such as Document stores for the model.
  • Keep the data processing on the datalake. The abstraction layer (I call it Customer 360) interacts with the datalake and uses tools out of it.
  • Governance needs to be done in the first steps – as it is far too difficult to do it at a later stage. Establish a data catalog for easy retrieval, search and data quality metrics/scoring.
  • Establish a central identity management and household management. It is necessary to have a “golden record” of a customer, with all necessary entities linked to that customer.

With Marketing Automation, we basically differentiate 2 different types of data (so, a Lambda Architecture is my recommendation for it):

  • Batch data. This kind of data doesn’t change frequently – such as Customer Details. This data also contains data about models that run on larger datasets and thus require time-series data. Typically, analytical models run on that data are promoted as KPIs or fields to the C360 model
  • Event data. Data that needs to feed into Marketing Automation platforms fast. This could be a product a customer bought. Once this has happened, ads for that product become unnecessary and should be removed (otherwise, you would lose money).
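The event data case can be sketched in a few lines. This is an in-memory illustration only: the `AdSuppressor` class and its method names are hypothetical, and in a real set-up the purchase events would arrive from a stream and the ad state would live in the Marketing Automation platform.

```python
from collections import defaultdict


class AdSuppressor:
    """Keeps track of scheduled ads per customer and reacts to purchase events."""

    def __init__(self) -> None:
        # customer_id -> set of product_ids that are currently advertised
        self._active_ads = defaultdict(set)

    def schedule_ad(self, customer_id: str, product_id: str) -> None:
        self._active_ads[customer_id].add(product_id)

    def on_purchase(self, customer_id: str, product_id: str) -> bool:
        """Handle a purchase event; return True if an ad was removed."""
        ads = self._active_ads.get(customer_id)
        if ads and product_id in ads:
            ads.remove(product_id)
            return True
        return False

    def active_ads(self, customer_id: str) -> set:
        # Return a copy so callers cannot change the internal state.
        return set(self._active_ads.get(customer_id, set()))


suppressor = AdSuppressor()
suppressor.schedule_ad("customer-1", "bike")
suppressor.schedule_ad("customer-1", "helmet")

removed = suppressor.on_purchase("customer-1", "bike")
print(removed, suppressor.active_ads("customer-1"))
```

The batch side would periodically recompute slower-moving fields (such as KPIs on the C360 model), while a handler like `on_purchase` covers the part that has to happen right away.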

This is just a high-level view on that, but handling data right for marketing is getting more and more important. And, you need to get your own data in order – you can’t outsource this task.

Let me know what challenges you have had with this so far. As always, I am looking forward to discussing this with you 🙂

Every company (or at least most of them) talks about digital transformation today and treats data as a main asset for it. The question is where to store this data. In a traditional database? In a DWH?

I think we should take a step back to answer this question. First of all, a Datalake is not a single piece of software. It consists of a large variety of Platforms, where Hadoop is a central one, but not the only one – it includes other tools such as Spark, Kafka, … and many more. Also, it includes relational Databases – such as PostgreSQL for instance. If we look at how truly digital companies such as Facebook, Google or Amazon solve these problems, then the technology stack is also clear; in fact, they heavily contribute to and use Hadoop & similar technologies. So the answer is clear: you don’t need overly expensive DWHs any more.

However, many C-Level executives might now say: “but we’ve invested millions in our DWH over the last years (or even decades)”. Here the question is getting more complex. How should we treat our DWH? Should it be replaced or should the DWH become the single source of truth and should the Datalake be ignored? In my opinion, both options aren’t valid:

First, replacing a DWH and moving all data to a Datalake will be a massive project that ties up too many resources in a company. Finding people with adequate skills isn’t easy, so this can’t be the solution. In addition, hundreds of business KPIs have been built on the DWH, and a lot of units within large enterprises base their decisions on these. Moving them to a Datalake will most likely break (important) business processes. Also, previous investments would be wiped out. So a big-bang replacement is clearly a no-go.

Second, keeping everything in the DWH is not feasible. Modern tools such as Python, Tensorflow and many more aren’t well supported by proprietary software (or at least get support with a delay). From a skills perspective, most young professionals coming from university have skills in technologies such as Spark, Hadoop and the like, so the skills shortage can be solved more easily by moving towards a Datalake. I speak at a large number of international conferences; whenever I ask the audience if they want to work with proprietary DWH databases, no hands go up. If I ask them if they want to work with Datalake technologies, everyone raises their hand. The fact is that employees choose the company they want to work for, not vice versa. We have a skills shortage in this area, and anyone ignoring or not accepting that is simply wrong. Also, a DWH is way more expensive than a Datalake. So this option is not a valid one either.

So what is my recommendation or strategy? For large, established enterprises, it is a combination of both steps, but with a clear path towards replacing the DWH in the long run. I am not a supporter of complex, long-running projects that are hard to control and track. Replacing the DWH should be a vision, not a project. This can be achieved by agile project management, combined with a long-term strategy: new projects are solely done by Datalake technologies. All future investments and platform implementations must use the Datalake as the single source of truth. Once existing KPIs and processes are renewed, it must be ensured that these technologies are implemented on the Datalake and that the data gets shifted to the Datalake from the DWH. To make this succeed, it is necessary to have a strong Metadata management and data governance in place, otherwise the Datalake will be a very messy place – and thus become a data swamp.

In the last weeks, I outlined several Big Data benefits by industry. In the next posts, I want to outline use-cases where Big Data is relevant in any company, focusing on business functions.

This post’s focus: Logistics.

Big Data is a key driver for logistics. By logistics, I mean both companies that provide logistics solutions and companies that take advantage of logistics. On the one hand, Big Data can significantly improve the supply chain of a company. For years – or even decades – companies have relied on “just in time” delivery. However, “just in time” wasn’t always “just in time”. In many cases, the time an item spent in stock was simply reduced, but the item still needed to be stored somewhere – either in a temporary warehouse on-site or in the delivery trucks themselves. The first approach is capital intensive, since these warehouses need to be built (and extended in case of growth). The second approach is to keep the delivery vehicles waiting, which creates expenses on the operational side: each minute a driver has to wait costs money. With analytics, just in time delivery can be further improved and optimized to lower costs and increase productivity.

Another key driver for Big Data in logistics is route optimization. Algorithms can improve routes and make them faster. This lowers costs and also reduces the environmental impact. But this is not the end of the possibilities: routes can also be optimized in real-time. This includes traffic prediction and jam avoidance. Real-time algorithms can calculate not only the fastest route, but also the most environmentally friendly route and the cheapest route. This again lowers costs and saves time for the company.
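As an illustration of optimizing the same route for different goals, here is a minimal sketch using Dijkstra's shortest-path algorithm, which is my choice for the example and not necessarily what logistics companies run in production. The road network, the node names and the `minutes` and `co2_kg` values are made up; a real-time system would refresh these weights from traffic data.

```python
import heapq

# Each edge carries several weights so the same graph can be optimized
# for time or for emissions. All values are invented for illustration.
GRAPH = {
    "depot": [("A", {"minutes": 10, "co2_kg": 5}), ("B", {"minutes": 15, "co2_kg": 2})],
    "A": [("customer", {"minutes": 10, "co2_kg": 5})],
    "B": [("customer", {"minutes": 10, "co2_kg": 2})],
    "customer": [],
}


def best_route(graph, start, goal, metric):
    """Dijkstra's algorithm: cheapest path from start to goal for one metric."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        total, node, path = heapq.heappop(queue)
        if node == goal:
            return total, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weights in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(
                    queue, (total + weights[metric], neighbour, path + [neighbour])
                )
    return float("inf"), []  # goal is not reachable


fastest = best_route(GRAPH, "depot", "customer", "minutes")
greenest = best_route(GRAPH, "depot", "customer", "co2_kg")
print("fastest:", fastest)
print("greenest:", greenest)
```

In this toy network the fastest and the most environmentally friendly route differ, which is exactly the trade-off a company has to decide on.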

Header Image by  Nick Saltmarsh / CC BY

This post’s focus: Customer Services.

Big Data offers several benefits for customer services. A key benefit can be seen in the IT help desk. IT help desk applications can be greatly improved by Big Data. Analysing past incidents and calls, their occurrence and their impact can help a lot with handling future calls. On the one hand, a knowledge base can be built to give employees or customers a starting point. For challenging cases, training can be developed to reduce the number of tickets opened. This reduces costs on the one side and improves customer acceptance on the other side.

Big Data can have a large impact here. When a customer feels treated well, the customer is very likely to come back and buy more at the company. Big Data can serve as an enabler here.

This post’s focus: Sales.

Last week I outlined Marketing possibilities (and downsides) with Big Data. Sales is very similar to Marketing, and often the two come together. However, I would say Sales needs to be covered separately. In this post, I won’t discuss the Sales opportunities of Big Data for webshops and the like. Today, I want to focus on Big Data opportunities that respect privacy but still have an impact.

Last year, I attended a conference where a company outlined their big data case. It was about analysing receipts issued in their chain stores. The data from the receipts included no personal details such as credit card number, bonus card number and the like. It was only about what was in the basket. With the help of that, they could figure out which products get more attention at a specific store and how this differs from other stores. This data was joined with open data from public sources and other data about demographics. They could also find out that specific products get bought together with other products – which means that if customer X buys product C, the customer is very likely to buy product D. An example of that is that if you buy a skirt, you are also likely to buy a top.
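The “if product C, then likely product D” finding is classic market basket analysis. Below is a minimal sketch of the two basic measures, support and confidence. The baskets are invented sample data and do not come from the company mentioned above; note that no customer identifiers are needed at all.

```python
# Invented sample baskets: each set is one anonymous receipt.
BASKETS = [
    {"skirt", "top"},
    {"skirt", "top", "belt"},
    {"skirt"},
    {"top"},
    {"jeans", "belt"},
]


def support(baskets, *items) -> float:
    """Share of all baskets that contain every one of the given items."""
    if not baskets:
        return 0.0
    return sum(1 for b in baskets if set(items) <= b) / len(baskets)


def confidence(baskets, antecedent, consequent) -> float:
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)


skirt_and_top = support(BASKETS, "skirt", "top")
skirt_then_top = confidence(BASKETS, "skirt", "top")
print(f"support={skirt_and_top:.2f}, confidence={skirt_then_top:.2f}")
```

On real receipt data, one would compute these measures for all frequent product pairs (for example with the Apriori algorithm) and compare the results per store.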

The example above focused on analysing data for fashion stores. However, most stores can benefit from Big Data. I recently had the chance to talk to the CIO of a large supermarket chain. They also have some Big Data algorithms that improve their chain stores. The company’s policy is to respect their customers’ privacy, so they don’t work with personal data. They could detect when a neighbourhood changes – e.g. because a university was built. They could see that other products were in demand and changed the assortment of goods accordingly.

There are many opportunities where Big Data can improve Sales, and as shown in these two examples, they don’t necessarily need to violate someone’s privacy.