Learn what is necessary for Big Data management and how you can implement Big Data projects in your company.

To get the most out of your data strategy in an enterprise, it is necessary to cluster the different user types that arise. All of them are users of data, but with different needs and demands on it. In my opinion, they differ in their level of expertise. Basically, I see three different user types for data access within a company.


Three degrees of Data Access

Basically, the user types differ in how they use data and in their number. Let's start with the lowest part of the pyramid – the business users.

Business Users

The first layer is the business users. These are people who need data for their daily decisions but are primarily consumers of it. They look at different reports to make decisions on their business topics, and they could sit in Marketing, Sales or Technology – depending on the company. Typically, these users start with pre-defined reports but, in the long run, move towards customized reports; self-service BI is a great fit for that. These users are experienced in interpreting data for their business goals and in asking questions of their data – reviewing the performance of a campaign, weekly or monthly sales reports, and so on. They create a heavy load on the underlying systems without understanding the implementation and complexity underneath – and they don't have to. From time to time, they start digging deeper into their data and thus become power users – our next level.

Power Users

Power Users often emerge from Business Users. This is typically a person who is close to the business and understands the needs and processes around it, but who also has a solid technical understanding (or gained it on the way to becoming a power user). They have some level of SQL know-how or know the basics of other scripting tools. They often work with the business users (sometimes in the same department) on solving business questions, and they work closely with Data Engineers on accessing and integrating new data sources. They also use self-service analytics tools to get a basic level of data science done. However, they aren't data scientists – though they might move in that direction if they invest significant time. This brings us to the next level – the data scientists.
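To make this tangible: a typical power-user task is turning a raw data export into a recurring report. A minimal sketch in Python with pandas – the orders.csv file and its column names are made up for illustration:

```python
import pandas as pd

# Hypothetical raw export of order data
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Aggregate revenue per calendar week -- the kind of question a power
# user answers directly instead of waiting for a new pre-defined report
weekly_sales = (
    orders
    .set_index("order_date")
    .resample("W")["revenue"]
    .sum()
    .rename("weekly_revenue")
)

print(weekly_sales.tail(8))  # the last eight weeks for the business review
```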

Data access for Data Scientists

This is the top level of our pyramid. People working as data scientists aren't in the majority – business users and power users are far more numerous. However, they work on more challenging topics than the previous two groups. They also work closely with power users and business users; they might still sit in the same department, but not necessarily. They work with advanced tools such as R and Python, fine-tune the models the power users built with self-service analytics tools, or translate the business questions raised by the business users into algorithms.

Often, these three groups develop in different directions. However, it is necessary that all of them work together – as a team – in order to make data projects a success. With data access, it is also necessary to incorporate role-based access controls.
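Role-based access control can start very simple. Here is a minimal sketch in Python, mapping the three user types of the pyramid to permissions – the roles and permission names are illustrative assumptions, not a fixed standard:

```python
# Illustrative role-to-permission mapping for the three user types
ROLE_PERMISSIONS = {
    "business_user": {"read_reports"},
    "power_user": {"read_reports", "query_warehouse"},
    "data_scientist": {"read_reports", "query_warehouse", "read_raw_data"},
}

def is_allowed(role: str, permission: str) -> bool:
    """Check whether a given role grants a given permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())

assert is_allowed("power_user", "query_warehouse")
assert not is_allowed("business_user", "read_raw_data")
```

In a real enterprise, this mapping would of course live in a central identity system rather than in code, but the principle stays the same.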

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.

Honestly, data scientists are doing a great job. To hear some tell it, they are single-handedly saving entire industries from a steep decline. And those heroes do all of that alone. Alone? Not quite.

The Data Scientist needs the Data Engineer

There are some poor souls who support their success: those called Data Engineers. A huge share of the work is carried out by these guys (and girls), yet hardly anyone talks about them. All the fame seems to go to the data scientists, while the data engineers aren't receiving any credit.

I remember one of the many meetings with C-level executives. When I explained the structure of a team dealing with data, everyone in the boardroom agreed: "we need data scientists". Then one of the executives raised the question: "But what are these data engineers about? Do we really need them, or could we maybe have more data scientists instead?"

I kept on explaining and they accepted it. But I had the feeling that they still wanted to go with more Data Scientists than Engineers eventually. This comes from the trend and hype around data scientists: everyone knows that they are important. But data-driven projects only succeed when a team with mixed skills and know-how comes together.

A Data Science team needs at least the same number of Data Engineers

In all the data-driven projects I have seen so far, nothing would have worked without data engineers. They are relevant for many different things, but mainly – in an ideal world – they work in close cooperation with data scientists. If a company's data maturity is high, the data engineer prepares the data for the data scientist and then works with the data scientist again on putting the algorithm back into production. I saw a lot of projects where the latter never happened: the first step (data preparation) was successful, but the later step (automation) was never done.

But there are more roles involved: one role, which is rather a specialization of the data engineer, is the data systems engineer. This is often not a dedicated role but is carried out by data engineers; it basically covers infrastructure preparation and set-up for the data scientists and engineers. Another role is the data architect, who ensures a company-wide approach to data – and of course there are data owners and data stewards.

I have stated it several times, but it is worth repeating: data science isn't a one-(wo)man show, it is ALWAYS a team effort.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. Another interesting article about the data science team setup can be found here.

A current trend in AI is not so much a technical one – it is rather a societal one. Technologies around AI, Machine Learning and Deep Learning are getting more and more complex, which makes it ever harder for humans to understand what is happening and why a prediction turns out the way it does. The current approach of "throwing data in, getting a prediction out" doesn't necessarily work here. It is somewhat dangerous to build knowledge and make decisions based on algorithms that we don't understand. To solve this problem, we need explainable AI.

What is explainable AI?

Explainable AI is getting even more important with new developments in the AI space such as AutoML. With AutoML, the system takes over most of the data scientist's work, so it needs to be ensured that everyone understands what's going on with the algorithms and why a prediction happens exactly the way it does. So far (and without AutoML), data scientists were basically in charge of the algorithms – at least there was someone who could explain an algorithm. Note: that didn't prevent bias, nor will AutoML. With AutoML, where the tuning and algorithm selection are done more or less automatically, we need to ensure that some vital and relevant documentation of the predictions is available.
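One lightweight way to produce such documentation is model-agnostic feature importance. A minimal sketch with scikit-learn's permutation importance – the dataset here is synthetic, standing in for whatever an AutoML pipeline actually trained on:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: how much does shuffling one feature hurt the score?
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature_{i}: importance {importance:.3f}")
```

The same kind of report can be attached to whatever model an AutoML run selects, giving at least a baseline answer to "why does the prediction look like this".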

And one last note: this isn't an argument against AutoML and the tools that provide it – I believe that the democratisation of AI is an absolute must and a good thing. However, we need to ensure that it stays – explainable!

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. A comprehensive article about explainable AI can also be found on Wikipedia.

Data itself, and Data Science especially, is one of the drivers of digitalisation. Many companies have experimented with Data Science over the last years and gained significant insights and learnings from it. Often, people dealing with statistics started to do this magic thing called data science, and technical units used machine learning and the like to further improve their businesses. However, for many other units within traditional companies, all of this seems like magic – and dangerous. So how do you include those not dealing with the topic in detail and thus de-mystify it? What does it take to become data driven?

How to become data driven

First of all, Machine Learning and Data Science aren't the revolution. Units started implementing them in order to gain new insights and improve their business results; often, the know-how is also acquired via business projects with consulting companies. The newer and more complex a topic is, the higher the risk that people will object to it. The reasons for that are fear and misunderstanding – or no understanding at all.

When you are deep in the topic of data and data science, you might be treated with fame by some – mainly by those who think you are a magician. However, you will also be rejected by others. Both are poisonous, in my opinion. The first group will try to get very close to you and expect a lot. However, you are often not capable of meeting their expectations, and after a while they get frustrated by far too high expectations.

In corporate environments, it is very important to filter this group at the very beginning. You need to clearly state what they can expect and what they can't. It is also important to tell them what they won't get – saying "no" is very important with this group as well. Being transparent with them is essential in order to keep them as close supporters in a growing environment. You will depend a lot on these people if you want to succeed, so be clear with them.

People fear digitalisation

The other group – which, in digitalisation, I would say is by far the bigger one – will meet you with fears and doubts, and it is highly important that you cover them well. You can easily recognise people in this group by their not being open towards your topics. Some will probably actively refuse them; others might be less active and just poison the climate. But be aware: they usually don't do it because they hate you for some reason.

They are just acting human and are either afraid, feel excluded, or have other doubts about you and your unit. It is essential to work on a communication strategy for this group and pro-actively include them. Bringing clarity and de-mystifying your topic in easy terms is vital. Create a lot of comparisons to your traditional business and keep it simple. Once you have gained their trust and interest, you can go much deeper into your topic and provide learning paths and skill development for these people.

If you succeed in that, you will have created strong supporters who come up with great ideas to improve your business even further. Keep in mind: just because you are in a "hot topic" like big data and data science and might be treated like a rock star by some, others are also great at what they do – and it all boils down to this: we are just humans.

No digitalisation without a data strategy

Digitalisation needs trust to succeed. If you fail to build trust and don't include the human aspect, your digitalisation and data strategy is doomed to fail – independent of the budget and C-level support you might have for your initiative. So make sure to work on that, with high focus! Becoming data driven is the driver of digitalisation in your company!

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. Another article I like about data driven organisations can be found on Forbes.

Agility is everywhere in the enterprise nowadays. Most companies want to become more agile, and at C-level, too, there are huge expectations of agility. However, I've seen many analytics (and Big Data) projects being the complete opposite: neither agile nor successful. The reasons varied: the setup of the data lake with expensive hardware took years, not months, and operating and maintaining these systems turned out to be very inefficient. So what can be done for agile data science projects?

The demand for agile data science projects

A lot of companies have also expressed their demand for agile analytics. But in fact, with analytics (and big data), we moved away from agility towards a complex, waterfall-like approach. What was worse was the approach of starting agile analytics and then not sticking to it (ending up somewhere in between).

However, a lot of companies have also realised that agility can only be achieved with (Biz)DevOps and the cloud – there is hardly any way around this – plus close cooperation between data engineering and data science. One important question for agile data science projects is the methodology: Kanban or Scrum?

I would say that this question is a "luxury" problem: if a company has to answer it, it is already at a very high maturity level with data. My thoughts on this topic (which, again, is an "it depends" thing) are:

When to select Kanban or Scrum for Data projects

  • Complexity: if the data project is more complex, Scrum might be the better choice. A lot of data science projects are one-person projects (with support from data engineers and DevOps at some stages) that run for a few weeks and not always full-time. In this case (lower complexity), Kanban is the more suitable approach. Often, the data scientist even works on several projects in parallel, as the load per project isn't high at all. For projects with higher complexity, I would recommend Scrum.
  • Integration/Productization: if the integration effort is high (e.g. into existing processes, systems and the like), I would rather recommend Scrum. More people are involved and the complexity is immediately higher. If the focus is on Data Engineering, or at least that part is very large, it is often delivered with Scrum.

I guess there could be many more indicators, so I am looking forward to your comments on it 🙂

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. You might also read this discussion about Scrum for Data Scientists.

Digitalisation has been a key driver for companies over the last two years. However, many companies forget that the oil for the digitalisation engine is data. Most companies have no data strategy in place, or at best a very blurry one. A lot of digitalisation strategies fail, often due to the lack of proper treatment and management of data. In this blog post, I will describe the most common errors I have seen so far in my experience. Disclaimer: I won't offer answers as of now, but it is relevant to give you an insight into what you should probably avoid doing. The following steps will help you destroy your data strategy.

Step 1: Hire Data Scientists. Really: you need them

Being a Data Scientist is a damn sexy job – it is even considered the sexiest job of the 21st century. So why should you not have one? Or two, or three? Don't worry – just hire them. They do the magic and solve almost all of your problems around data. Don't think about it, just do it. If your digitalisation strategy has no Data Scientist, it isn't complete. Think about what they can or should do later.

In my experience, this happened a lot in recent years. Only a few industries (e.g. banking) have real experience with them, as it comes naturally there. Over the last years I saw Data Scientists joining companies without a clear strategy. These Data Scientists then had to deal with severe issues:

  • Lack of data availability. They often have issues getting to the data in the first place: long processes, siloed systems and commodity systems prevent them from doing so.
  • Poor data quality. Once they get to the data and want to start working with it, things become even more complex: no governance, no description of the data, poor overall quality.

So what most companies miss is the counterpart every data scientist needs: a Data Engineer. Without them, data scientists often can't deliver anything.

But with this, I have actually described a state that is almost advanced. Often, companies hire data scientists (at high salaries!) and then have them do plain BI tasks like reporting. I saw this often, and people got frustrated. The frustration led them to leave their jobs after just a few months. The company was left with no learnings and no business benefits. So it clearly failed.

Step 2: Deliver & Work in silence. Let nobody know what you are doing

Digitalisation is dangerous and disruptive. It will lead to major changes in companies. This is fact, not fiction – and you don't need science to figure it out. So why should you talk about it? Just do it, let the other units continue doing their jobs and don't disrupt them.

Digitalisation is a complex topic, and humans by nature tend to interpret. People will interpret things about this topic so that they fit their comfort zone. This leads to diverging strategies and approaches, creating even more failed projects and a lot of uncertainty.

The approach here should be consistent communication within the company that takes away the fear from different units. Digitalisation is disruptive by nature, but do it with the people, not against them!

Step 3: Build even more silos to destroy the data strategy

Step 2 will most likely lead to silos. A digital company should be capable of building and running its digital products, services and solutions on its own, but there is always a high risk that different business units will create data silos. This means there will never be a holistic view of all your data. Integration is tough later on and will burn a lot of money. For business units, it is often a quick win to implement one solution or another, but backwards integration of these solutions – especially when it comes to data – is very tricky.

A lot of companies have no 360-degree view of their data. This is due to the mere fact that business units often confront IT departments with "we need this tool now, please integrate it". This causes problems, since IT departments are often understaffed anyway. So a swamp is created in the IT landscape, leading to an even bigger swamp of data. Integration then never really happens, as it is too expensive. Will you become digital this way? Clearly not.

Step 4: Build a sophisticated structure when the company isn't sophisticated with this topic yet

Data Scientists tend to sit in business units. For a data-driven enterprise, this is exactly how it should be. However, only a small percentage of companies are data driven. I would argue that traditional companies aren't data driven – only the Facebooks, Googles and Amazons of our world are.

However, traditional companies now tend to copy this setup: business units hire data scientists, who are then disconnected from other units and only loosely connected via internal communities. A distributed layout of your company in terms of data only makes sense once the company has reached a high level of maturity. In my opinion, it needs to be steered from a central unit first. As the maturity improves, it can be decentralised step by step and eventually put back fully into the business units.

One more thing: put digitalisation very close to the CEO of the company. It needs to have some firepower, as there will always be obstacles.

In my experience, I’ve seen quite a lot of failures when it comes to where to place data units. In my opinion, it only makes sense in a technical unit or – if available – in the digitalisation unit. However, it should never be in business functions. You will definitely succeed and destroy the data strategy with this.

Step 5: Don’t invest into people to destroy your data strategy

Last but not least: never invest in people. Especially Data Scientists – they should be really happy to have a job with you, so why would you also invest in them and give them education?

This is also a challenge I see a lot in companies. They simply don't treat their employees well, and those who are in high demand (like Data Scientists) then tend to leave fast. This is one of the key failures in data-driven strategies: keeping people is key to a successful strategy, and a lot of companies don't manage it well. Not investing in people is probably one of the most effective ways to destroy a data strategy.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. Now it is about time to twist it around and destroy your competitors with data.

… this is at least what I hear often. A lot of people working in the data domain call this "false but true". Business units often press for fast, dirty data delivery and thus force IT units to deliver this kind of data ad hoc, with a lack of governance and in bad quality. This ends with business projects being carried out inefficiently and without a 360-degree view of the data. Business units often trigger this inefficiency themselves, and thus projects fail – they are more or less digging their own hole.

The issue with data governance is simple: if you do it right, you hardly see it in the P&L – at least not directly. If your data is in bad shape, you will see it in other results, such as failing projects and bad outcomes in projects that use data. Often the business is blamed for bad results, even though the data was the weak point. It is therefore very important to apply a comprehensive data governance strategy across the entire company (and not just one division or business unit). Governance consists of several topics that need to be addressed:

What is data governance about?

  • Data Security and Access: data needs to stay secure, and storage needs to implement a high level of security. Access should be easy but secure. Data governance should enable self-service analytics, not block it.
  • One common data storage: data should be stored under the same standards across the company. A specific, small number of storages should cover all needs, different storage technologies should be connected, and no silos should exist.
  • Data Catalog: it should be possible to see what data is available in the company and how to access it. A data catalog should make it possible to browse different data sources and see what is inside (as long as one is allowed to access this data) – see the sketch after this list.
  • Systems/processes using data: there should be clear tracking of data access. If data changes, it should be possible to see which systems and processes might be affected.
  • Auditing: an audit log should be available, especially to see who accessed which data when.
  • Data quality tracking: it should be possible to track the quality of datasets along specific dimensions, such as accuracy, timeliness and correctness.
  • Metadata about your data: metadata about the data itself should be available. You should know what can be inside your data, and your metadata should describe your data precisely.
  • Master data: you should have a golden record for all your data. This is challenging and difficult, but it should be the target.
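To illustrate the data catalog and quality-tracking points, here is a minimal sketch of what a catalog entry with quality scores could look like in Python – all field names and values are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One dataset as it could appear in a company-wide data catalog."""
    name: str
    owner: str                     # the data owner / steward responsible for it
    description: str               # what is inside the data
    downstream_systems: list[str]  # processes affected when this data changes
    quality: dict[str, float] = field(default_factory=dict)  # scores 0.0 .. 1.0

customer_orders = CatalogEntry(
    name="customer_orders",
    owner="sales-data-steward",
    description="All customer orders since 2015, one row per order line.",
    downstream_systems=["monthly-sales-report", "churn-model"],
    quality={"accuracy": 0.97, "timeliness": 0.88, "correctness": 0.95},
)

# A simple governance check: flag quality dimensions below a threshold
flagged = {dim: score for dim, score in customer_orders.quality.items() if score < 0.9}
print(flagged)  # {'timeliness': 0.88}
```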

Achieving all of this is very complex, but it can be done if the company implements a good data strategy. The benefits of data governance are many.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.

One topic every company is currently discussing at a high level is marketing automation and marketing data. It is a key factor in the digitalisation of a company's marketing approach. With Marketing Automation, we have the chance that marketing gets much more precise and to the point: no more unnecessary marketing spend, every cent spent wisely – and no advertisement overload. So far, that is the vendors' promise – if we all lived in a perfect world. But what does it take to live in this perfect marketing world? DATA.

What is so hot about Marketing data?

One disclaimer upfront: I am not a marketing expert. I try to enable marketing to achieve these goals through the use of our data – next to other tasks. Data is the weak point in Marketing Automation: if you have bad data, you will end up with bad Marketing Automation. Data is the engine – or the oil – of Marketing Automation. But why is it so crucial to get the data right for it?

Until now, data was never seen as a strategic asset within companies. It was rather treated like something that you have to store somewhere, so it ended up being stored in silos within different departments – making access hard and connections difficult. Governance was, and still is, neglected. When data scientists start to work with data, they often fight governance issues: what is inside the data, why is the data structured in a specific way, and what should the data tell us? Overcoming this often takes weeks and is expensive.

Some industries (e.g. banks) are more mature, but they are struggling with this as well. Over the years, a lot of companies built data warehouses to consolidate their view of the data. Data warehouses are heavily outdated and overly expensive nowadays, and most DWHs are still poorly structured. In recent years, companies started to shift data to data lakes (initially Hadoop) to get a 360° view. Economically, this makes perfect sense, but a holistic customer model is a challenge there as well – it takes quite some time and resources to build.

The newest hype in marketing is Customer Data Platforms (CDPs). The value of CDPs isn't proven yet, but most of them are an abstraction layer intended to make data handling easier for marketeers. However, integrating the data into the CDP is challenging in itself, and there is a high risk of creating yet another data silo.

In order to enable Marketing Automation with data, the following steps are necessary (a sketch of the resulting customer model follows the list):

  • Get your data house in order. Build your data assets on open standards so you can change technology and vendor if necessary. Don't lock your data in with one vendor.
  • Take the first steps in small chunks, closely aligned with Marketing – in an agile way. Customer journeys are often tied to specific data sources, so a full-blown model isn't necessary at the start. However, make sure that the model stays extensible and the big picture is always available. A recommendation is to use a NoSQL store, such as a document store, for the model.
  • Keep the data processing on the data lake; the abstraction layer (I call it Customer 360) interacts with the data lake and uses its tools.
  • Do governance from the first steps onwards. It is too difficult to add at a later stage. Establish a data catalog for easy retrieval, search and data quality metrics/scoring.
  • Establish central identity management and household management. A 360-degree view of the customer helps a lot.
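Since the recommendation above is a document store, here is a minimal sketch of what one Customer 360 document could look like – all field names and values are invented for illustration:

```python
import json

# One customer document in a hypothetical "customer_360" collection.
# Slow-changing master data sits next to model outputs promoted as KPI fields.
customer_360_doc = {
    "customer_id": "c-102938",
    "identities": {"email": "jane@example.com", "crm_id": "4711", "web_cookie": "ab12"},
    "household_id": "h-5541",
    "master_data": {"first_name": "Jane", "country": "AT", "segment": "retail"},
    "kpis": {"churn_score": 0.12, "lifetime_value": 1830.50},  # from batch models
    "last_events": [
        {"type": "purchase", "ts": "2020-05-04T10:15:00Z", "campaign": "spring"}
    ],
}

print(json.dumps(customer_360_doc, indent=2))
```

The document stays extensible: a new customer journey just adds fields instead of forcing a schema migration.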

With Marketing Automation, we basically differentiate between two types of data, which is why a Lambda Architecture is my recommendation for it (a sketch of the event path follows the list):

  • Batch data. This kind of data doesn't change frequently – such as customer details. It also covers models that run on larger datasets and thus require time-series data. Analytical models run on this data are promoted as KPIs or fields to the C360 model.
  • Event data. Data that needs to feed into Marketing Automation platforms fast. If, for example, a purchase has happened, unnecessary ads should be removed immediately (otherwise, you would lose money).
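Here is a sketch of that event path – the suppress_ads function stands in for whatever API your Marketing Automation platform actually offers, so it is an assumption, not a real call:

```python
# Hypothetical speed-layer handler for the Lambda Architecture's event path.
def suppress_ads(customer_id: str, product_id: str) -> None:
    # Placeholder for the real marketing-platform API call
    print(f"suppressing ads for {product_id} towards {customer_id}")

def handle_event(event: dict) -> None:
    """React to a single event from the stream within seconds, not nightly."""
    if event["type"] == "purchase":
        # The customer already bought; further ads for this product waste budget
        suppress_ads(event["customer_id"], event["product_id"])

handle_event({"type": "purchase", "customer_id": "c-102938", "product_id": "p-77"})
```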

What’s next?

This is just a high-level view, but handling data right for marketing is getting more and more important. And you need to get your own data in order – you can't outsource this task.

Let me know what challenges you have had with this so far – as always, I'm looking forward to discussing it with you 🙂

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you want to learn more about Marketing Automation, I recommend reading this article.

There are several things people discuss when it comes to Hadoop, and some of these discussions go wrong. First, there is a small number of people who believe that Hadoop is a hype that will end at some point. They often come from a strong DWH background and won't accept (or simply ignore) the new normal. But there are also two major camps with opposite claims: the first states that Hadoop is cheap because it is open source, the second states that Hadoop is expensive because it is very complicated. (Note: by Hadoop, I also include Spark and the like.)

Neither the one nor the other is true.

First, you can download it for free and install it on your systems. This makes it basically free in terms of licenses, but not in terms of running it. With vanilla Hadoop, you have to think about hotfixes, updates, services, integration and many more tasks that get very complicated. You end up spending many dollars on Hadoop experts to solve your problems. Remember: you haven't solved any business problem or question yet, as you are busy running the system! You spend dollars and dollars on expensive operational topics instead of spending them on creating value for your business.

Now the opposite claim: Hadoop is expensive. Is it? In the past years I saw a lot of Hadoop projects that went more or less badly. Costs were always higher than expected, and the project timeframe was never kept. Hadoop experts command high incomes as well, which makes consulting hours even more expensive. Plus, you probably won't even find them on the market, as they can pick their projects. So you have two major problems: high implementation cost and low resource availability.

The pain of cluster sizing

Another factor relevant to the cost discussion is cluster utilization. In many projects I saw one trend: when the discussion about cluster sizing starts, there are two main options: (a) sizing the cluster for the highest expected utilization or (b) making the cluster smaller than the highest expected utilization. If you select (a), you have another problem: the cluster might be under-utilized. What I often saw with my clients is the following: 20% of the time they have full utilization on the cluster, but 80% of the time the utilization is below 20%. This means your cluster is very expensive when it comes to the business case calculation. If you select (b), you lose business agility and your projects/analytics might require long compute times.
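To see why option (a) hurts the business case, here is the arithmetic for the utilization pattern above as a quick sketch – the monthly cluster cost is a made-up number, and 20% is taken as the upper bound of the quiet periods:

```python
# Made-up monthly cost for a cluster sized to peak load
monthly_cost = 100_000.0  # dollars, illustrative only

# Pattern from the text: 20% of the time at 100%, 80% of the time at (up to) 20%
avg_utilization = 0.20 * 1.00 + 0.80 * 0.20  # = 0.36

effective_cost = monthly_cost / avg_utilization
print(f"average utilization: {avg_utilization:.0%}")                 # 36%
print(f"cost per fully-used cluster-month: ${effective_cost:,.0f}")  # $277,778
```

In other words, every fully used cluster-month effectively costs almost three times its sticker price – and that is the optimistic bound.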

At the beginning of this article, I promised to show that Hadoop is still cost-effective. So far, I have only stated that it might be expensive, which would mean it isn't. Hadoop is still cost-effective – but I will give you the solution in my next blog post, so stay tuned 😉

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.

I am happy to announce the development we did over the last months within Teradata: a light-weight process model for Big Data Analytics projects called "RACE". The model is agile and distills the know-how of more than 25 consultants who worked on over 50 Big Data Analytics projects in recent months. Teradata also co-developed CRISP-DM, the industry-leading process for data mining. Now we have created a new process for agile projects that addresses the new challenges of Big Data Analytics.

Where does the ROI come from?

This was one of the key questions we addressed when developing RACE. The economics of Big Data discovery analytics are different from traditional Integrated Data Warehousing economics: ROI comes from discovering insights in highly iterative projects run over very short time periods (usually 4 to 8 weeks). Each meaningful insight or successful use case that can be actioned generates ROI, and the total ROI is the sum over all successful use cases. Competitive advantage is therefore driven by the capability to produce both a high volume of insights and creative insights that generate a high ROI.
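As a toy illustration of this "sum over use cases" economics – all numbers below are invented:

```python
# Invented ROI figures for a handful of short discovery use cases
use_case_roi = {
    "churn-drivers": 120_000,   # actioned insight -> generates ROI
    "basket-analysis": 45_000,  # actioned insight
    "sensor-anomalies": 0,      # no actionable insight -> no ROI
}

total_roi = sum(use_case_roi.values())
print(f"total ROI: ${total_roi:,}")  # the sum over all successful use cases
```

The lever is therefore throughput: the more 4-to-8-week iterations you can run, the more chances you have at high-ROI insights.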

What is the purpose of RACE?

RACE is built to deliver a high volume of use cases, focusing on speed and efficiency of production. It fuses data science, business knowledge and creativity to produce high-ROI insights.

What does the process look like?


RACE – an agile process for Big Data Analytic Projects

The process itself is divided into several short phases:

  • Roadmap. This is an optional first step (but heavily recommended) to build a roadmap of where the customer wants to go in terms of Big Data.
  • Align. Use cases are detailed and the data is confirmed.
  • Create. Data is loaded, prepared and analyzed; models are developed.
  • Evaluate. Recommendations for the business are given.

In the next couple of weeks we will publish much more on RACE, so stay tuned!