
One of the claims vendors frequently make is “Agile Analytics”. In pitches to business units, they often promise that agile analytics would take them only a few weeks. However, this isn’t necessarily true, since they can easily gloss over the hardest part of “agile” analytics: data access, retrieval and preparation. On the one hand, this creates “bad blood” within a company: business units might ask why their internal department takes so long (and there has most likely been some history to get the emotions going). On the other hand, the problem is worth solving, because agile analytics is still possible – if done right.

In my opinion, several aspects are necessary for agile analytics. First, it is about culture; second, about organisation; and third, about technology. Let’s start with culture.

Culture

The company must be silo-free. That sounds easy; in fact, it is very difficult. Different business units use data as a “weapon”, and that weapon can easily go thermo-nuclear: if you own the data, you can create your own truth. Marketing could craft their own view of the market in terms of reach, sales could tweak their numbers (until overall performance is measured by controlling), … Business units might therefore resist giving data away and try to keep it under their ownership. However, data should be a company-wide good, available to all units (of course on a need-to-know basis and in compliance with legal and regulatory standards). This can only be achieved if the data unit is close to the CEO or another powerful board member. Once this is in place, it becomes much easier to move towards self-service analytics.

Organisation

As with culture, it is necessary to organise yourself for agile analytics. This is more about the internal structure of an organisation (e.g. the data unit). There is no silver bullet for this; it very much depends on the overall culture of a company. However, certain aspects have to be fulfilled:

  • BizDevOps: I outlined this in one of my previous posts, and I insist that this approach is necessary for many things around data. One of them is agile analytics, since handovers of tasks are always complicated. End-to-end responsibility is crucial for agile analytics.
  • Data Governance: There is no way around it; either do it or forget about anything close to agile analytics. It is necessary to keep security and privacy under control and to let users access data easily but securely. It is also very important to log what is going on (SOX!) – a minimal sketch of what such logging could look like follows this list.
  • Self-Service Tools: Have tools available that let you access data without complex processes. More on this under “Technology”.
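To make the governance point tangible, here is a minimal, purely illustrative sketch of audit logging around data access in Python. The decorator, dataset names and log format are all assumptions, not a reference to any specific tool.

```python
import functools
import getpass
import logging

# Assumed: a central audit log that compliance (e.g. for SOX) can review.
logging.basicConfig(filename="data_access_audit.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

def audited(dataset: str):
    """Log who accessed which dataset before the access happens."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logging.info("user=%s dataset=%s func=%s",
                         getpass.getuser(), dataset, func.__name__)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@audited("sales.monthly_revenue")
def load_monthly_revenue():
    ...  # the actual data retrieval would go here
```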

Technology

Last but not least, agile analytics is delivered via technology. Technology is just an enabler, though: if you don’t get the previous two right, you will most likely fail here, even if you invest millions. You will need tools that handle security and privacy, but also a clear and easy-to-use metadata repository (let’s face it: a data catalog!). You also need tools that allow easy access to data via a data science workbench, a fully functional data lake and a data abstraction layer. That sounds like quite a lot – and it is. The good news is that most of it comes for free, as these are mainly open-source tools. At some point you might need an enterprise license, but cost-wise it is still manageable. And remember one thing: technology comes last. If you don’t fix culture and organisation, you won’t be able to deliver.
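As a minimal sketch of what “easy access” through such a stack can look like, assuming a Spark-based data lake with curated Parquet datasets (the paths and column names are made up):

```python
from pyspark.sql import SparkSession

# Assumed: a Spark-based data lake holding curated Parquet datasets.
spark = SparkSession.builder.appName("self-service-access").getOrCreate()

# A data abstraction layer would hide physical storage behind logical
# names; here a plain path stands in for that mapping.
orders = spark.read.parquet("s3a://datalake/curated/orders")

# An analyst-style question answered without any ETL project:
orders.groupBy("order_month").sum("revenue").show()
```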

Agility is almost everywhere, and it is also making its way into other hyped domains such as Data Science. One thing I like in this respect is the combination with DevOps, as it streamlines the process and creates end-to-end responsibility. However, I strongly believe that it doesn’t make much sense to exclude the business. In the case of Analytics, I would argue for BizDevOps.

Basically, Data Science needs a lot of business integration and spans different domains and functions. I have outlined several times, in different posts here, that Data Science isn’t a job done by Data Scientists alone. It is teamwork and thus needs different people. The concept of BizDevOps makes this easy to explain; let’s have a look at the following picture, and I will outline the interdependencies afterwards:

BizDevOps for Data Science

There must be exactly one person who takes end-to-end responsibility – ranging from business alignment to translation into an algorithm, and finally to making it productive and operating it. This is the typical BizDevOps workflow. The person taking this end-to-end responsibility is typically a project or program manager working in the data domain. The three steps are outlined in the figure above; let’s now have a look at each of them.

Biz

The program manager for Data (you could also call this person the “Analytics Translator”) works closely with the business – marketing, fraud, risk, shop floor, … – to gather their business requirements and needs. This person also has a great understanding of what is feasible with the internal data, in order to be capable of “translating a business problem into an algorithm”. At this stage, it is mainly about the use case and not so much about tools and technologies; those come in the next step. Up to this point, Data Scientists aren’t necessarily involved yet.

Dev

This phase is all about implementing the algorithm and working with the data. The program manager mentioned above has already aligned with the business and produced a detailed description. Data Scientists and Data Engineers now join in: Data Engineers fetch and prepare the data, and they work with the Data Scientists on finding the answer to the business question. There are several iterations and feedback loops back to the business as more and more answers arrive. This process should only take a few weeks – ideally 3 to 6. Once the results are satisfying, it moves on to the next phase: bringing it into operation.
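To make the Dev phase concrete, here is a minimal sketch of one such iteration, assuming a churn-style classification question and a feature table already prepared by the Data Engineers (all file and column names are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Assumed: the Data Engineers delivered a prepared feature table.
df = pd.read_parquet("features/churn_features.parquet")
X, y = df.drop(columns=["churned"]), df["churned"]

# Hold out data so each iteration can be judged honestly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# A first, simple model; business feedback drives the next iteration.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```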

Ops

This phase is about operating the algorithms that were developed. The data engineer is in charge of integrating them into the live systems. The business unit typically wants to see the result as a (continuously) calculated KPI, or as some other action that creates impact. Continuous improvement of the models also happens here, since the business might come up with new ideas. In this phase, the data scientist isn’t involved anymore; it is the data engineer or a dedicated DevOps engineer, alongside the program manager.
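As a sketch of what “operating” the model can mean in practice – assuming the Dev-phase model was persisted with joblib and fresh data arrives as files (all paths, column names and the threshold are made up):

```python
import joblib
import pandas as pd

# Assumed: the model from the Dev phase was persisted with joblib.
model = joblib.load("models/churn_model.joblib")

def score_batch(path: str) -> float:
    """Score a fresh batch and return the KPI the business watches."""
    batch = pd.read_parquet(path)
    risk = model.predict_proba(batch)[:, 1]
    # The illustrative KPI: share of customers above a risk threshold.
    return float((risk > 0.8).mean())

# A scheduler (cron, Airflow, ...) would call this periodically.
print(f"High-risk share: {score_batch('landing/customers_today.parquet'):.1%}")
```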

Eventually, once the project is done (I dislike “done”, because in my opinion a project is never done), the entire workflow moves into a continuous integration (CI) process.
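A sketch of what such a CI step could check, assuming a pytest-based pipeline and the persisted model from above (paths, columns and the threshold are assumptions):

```python
import joblib
import pandas as pd

# Assumed: a small, versioned validation set lives next to the code.
MODEL_PATH = "models/churn_model.joblib"
DATA_PATH = "tests/validation_set.parquet"

def test_model_quality_gate():
    """CI gate: a retrained model must not fall below a minimum accuracy."""
    model = joblib.load(MODEL_PATH)
    data = pd.read_parquet(DATA_PATH)
    predictions = model.predict(data.drop(columns=["churned"]))
    accuracy = (predictions == data["churned"]).mean()
    assert accuracy >= 0.75  # threshold to be agreed with the business
```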

To get the most out of a data strategy in an enterprise, it is necessary to cluster the different user types that arise. All of them are users of data, but with different needs and demands on it. In my opinion, they differ in their level of expertise. Basically, I see three different user types:

Three degrees of Data Access

Basically, the user types differ in how they use data and in how many of them there are. Let’s start with the lower part of the pyramid: Business Users.

Business Users

The first layer are the business users. These are basically users that need data for their daily decisions but are mostly consumers of data. They look at different reports to make decisions on their business topics; they could sit in Marketing, Sales or Technology, depending on the company itself. Typically these users start with pre-defined reports but, in the long run, would rather go for customised reports – which is where self-service BI shines. These users are experienced in interpreting data for their business goals and in asking questions of their data, for example reviewing the performance of a campaign, weekly or monthly sales reports, … They create a huge load on the underlying systems without understanding the implementation and complexity underneath – and they don’t have to. From time to time, they start digging deeper into their data and thus become power users – our next level.

Power Users

Power Users often emerge from Business Users. This is typically a person that is close to the business and understands the needs and processes around it, but also has a great technical understanding (or gained it on the way to becoming a power user). They have some level of SQL know-how or know the basics of other scripting tools. They often work with the business users (even in the same department) on solving business questions. They also work closely with Data Engineers on accessing and integrating new data sources, and they use self-service analytics tools to do a basic level of data science. However, they aren’t data scientists, though they might move in that direction if they invest significant time. This brings us to the next level – the data scientists.

Data Scientists

This is the top level of our pyramid. People working as data scientists aren’t in the majority – there are far more business users and power users. However, they work on more challenging topics than the previous two, and they work closely with power users and business users. They might still be in the same department, but not necessarily. They use advanced tools such as R and Python to fine-tune the models the power users built with self-service analytics tools, or to translate the business questions raised by the business users into algorithms.

Often, these three groups develop in different directions. However, all of them need to work together – as a team – to make data projects a success.

Agility is everywhere in the enterprise nowadays. Most companies want to become more agile, and at C-level, too, there are huge expectations around agility. However, I’ve seen many analytics (and Big Data) projects that were the complete opposite: neither agile nor successful. The reasons varied: setting up the data lake on expensive hardware took years, not months, and operating and maintaining these systems turned out to be very inefficient. Many companies expressed their demand for agile analytics, but in fact, with analytics (and Big Data), we moved away from agility towards a complex, waterfall-like approach. Even worse is claiming to do agile analytics and then not sticking to it (ending up somewhere in between).

However, a lot of companies have also realised that agility can only be achieved with (Biz)DevOps and the cloud – there is hardly any way around this – plus close cooperation between data engineering and data science. Once these requirements are in place, another important question arises: Kanban or Scrum?

I would say this question is a “luxury” problem: if a company has to answer it, it is already at a very high level of data maturity. My ideas on this topic (which, again, is an “it depends” thing) are:

  • Complexity: if the data project is complex, Scrum might be the better choice. A lot of data science projects are one-person projects (with support from data engineers and DevOps at some stages), run for only a few weeks and are not always full-time. In this case (lower complexity), Kanban is the most suitable approach; often the data scientist even works on several projects in parallel, as the load per project isn’t high at all. For projects of higher complexity, I would recommend Scrum.
  • Integration/Productization: if the integration effort is high (e.g. into existing processes, systems and the like), I would rather recommend going with Scrum. More people are involved and the complexity is immediately higher. My experience is that the larger the data engineering part, the more likely it is that the project is delivered with Scrum.

I guess there could be many more indicators, so I am looking forward to your comments 🙂

I am happy to announce what we have developed over the last months within Teradata: a lightweight process model for Big Data Analytics projects called “RACE”. The model is agile and bundles the know-how of more than 25 consultants who have worked on over 50 Big Data Analytics projects in recent months. Teradata also co-developed CRISP-DM, the industry-leading process for data mining; now we have created a new process for agile projects that addresses the new challenges of Big Data Analytics.

Where does the ROI come from?

This was one of the key questions we addressed when developing RACE. The economics of Big Data Discovery Analytics are different from traditional Integrated Data Warehousing economics. ROI comes from discovering insights in highly iterative projects run over very short time periods (usually 4 to 8 weeks). Each meaningful insight or successful use case that can be actioned generates ROI, and the total ROI is the sum over all successful use cases. Competitive advantage is therefore driven by the capability to produce both a high volume of insights and creative insights that generate a high ROI.
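Spelled out as a simple formula (an illustrative reading, where $V_i$ is the value generated by actioning use case $i$, $C_i$ the cost of its short iteration, and $n$ the number of successful use cases):

$$\mathrm{ROI}_{\text{total}} = \sum_{i=1}^{n} \left( V_i - C_i \right)$$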

What is the purpose of RACE?

RACE is built to deliver a high volume of use cases, focusing on speed and efficiency of production. It fuses data science, business knowledge and creativity to produce high-ROI insights.

What does the process look like?

RACE – an agile process for Big Data Analytic Projects

The process itself is divided into several short phases:

  • Roadmap. This is an optional first step (but highly recommended) to build a roadmap of where the customer wants to go in terms of Big Data.
  • Align. Use cases are detailed and data is confirmed.
  • Create. Data is loaded, prepared and analyzed. Models are developed.
  • Evaluate. Recommendations for the business are given.

In the next couple of weeks we will publish much more on RACE, so stay tuned!