Posts

Agility is almost everywhere, and it is also making its way into other hyped domains such as Data Science. One thing I like in this respect is the combination with DevOps, as it streamlines the process and creates end-to-end responsibility. However, I strongly believe it doesn't make much sense to exclude the business. In the case of Analytics, I would argue for BizDevOps.

Basically, Data Science needs a lot of business integration and works across different domains and functions. I have outlined several times, in different posts here, that Data Science isn't a job done by Data Scientists alone. It is teamwork, and thus needs different people. The concept of BizDevOps makes this easy to explain; let's have a look at the following picture, and I will outline the interdependencies afterwards:

BizDevOps for Data Science

Basically, there must be exactly one person who takes end-to-end responsibility, ranging from business alignment to translation into an algorithm and finally to making it productive and operating it. This is the typical BizDevOps workflow. The person taking this end-to-end responsibility is typically a project or program manager working in the data domain. The three steps are outlined in the figure above; let's now have a look at each of them.

Biz

The program manager for Data (you could also call this person the "Analytics Translator") works closely with the business, whether marketing, fraud, risk, shop floor, …, to gather their business requirements and needs. This person also has a good understanding of what is feasible with the internal data, in order to be capable of "translating a business problem into an algorithm". At this stage, it is mainly about the use case and not so much about tools and technologies; those come in the next step. Up to this point, Data Scientists aren't necessarily involved yet.

Dev

In this phase, it is all about implementing the algorithm and working with the data. The program manager mentioned above has already aligned with the business and produced a detailed description. Data Scientists and Data Engineers are now brought in. Data Engineers start to prepare and fetch the data, and they work with Data Scientists on finding and retrieving the answer to the business question. There are several iterations and feedback loops back to the business as more and more answers arrive. This process should only take a few weeks, ideally 3-6. Once the results are satisfying, it moves on to the next phase: bringing it into operation.

Ops

This phase is about operating the algorithms that were developed. The Data Engineer is in charge of integrating them into the live systems. The business unit typically wants to see the result as a (continuously) calculated KPI or some other action that creates impact. Continuous improvement of the models also happens here, since the business might come up with new ideas. In this phase, the Data Scientist isn't involved anymore; it is the Data Engineer or a dedicated DevOps engineer, alongside the program manager.

Eventually, once the project is done (I dislike "done" because in my opinion a project is never done), this entire process moves into a CI process.

I use booking.com a lot for my bookings, but one thing that constantly bugs me is the e-mails after a booking, stating: "The prices for [CITY YOU JUST BOOKED] just dropped again!". Really, booking.com?! It has happened to me several times already that I booked a hotel and a few hours later received a message that the prices in this city had just dropped.

So, I am wondering whether this happens on purpose or rather by accident. If we assume it happens on purpose, I would question that purpose. Basically, you buy something and then get told: haha, it just got cheaper, we got you? No, I don't think so. I believe it is rather the opposite: an accident.

I expect that booking.com has issues with either data silos or simply with the speed of their data. Either there is no connection between the ordering system and the campaigning system, so the data doesn't flow between the two, or the systems simply aren't built to handle this in real time; after all, the messages only reach me after some time. When you order something on booking.com, the system is probably optimized for getting the order through and for sending and receiving information from their (hotel) partners, but the data in the CRM and marketing systems that create the ads isn't updated.

My guess is that once you look for a hotel, booking.com tracks that you looked at a specific city. This is added to their user database, and the marketing automation tool is updated. The order process, however, seems to be totally decoupled from this and doesn't deliver its data fast enough. Most likely, their marketing automation is set to "aggressive" once you have looked up a city and sends recommendations often. This leads to a discrepancy (or rather, a lack of consistency) between their systems. For me, this is also a great example of eventual consistency in database design: at some point, all of booking.com's systems will be up to date, and the retargeting stops. However, their "eventual" arrives very, very late 🙂
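The suspected setup can be sketched in a few lines of Python. This is a toy simulation under my assumptions above (all class and function names are hypothetical): the order system and the marketing system are decoupled, booking events are replicated to marketing only later, and in the meantime the marketing side still fires price alerts.

```python
import time

# Hypothetical sketch: two decoupled systems, with asynchronous replication
# of booking events from orders to marketing (eventual consistency).

class OrderSystem:
    def __init__(self):
        self.bookings = []

    def book(self, user, city):
        event = {"user": user, "city": city, "ts": time.time()}
        self.bookings.append(event)
        return event

class MarketingSystem:
    def __init__(self):
        self.interests = set()       # (user, city) pairs the user looked at
        self.known_bookings = set()  # bookings replicated from the order system

    def track_interest(self, user, city):
        self.interests.add((user, city))

    def sync_booking(self, event):
        # replication is asynchronous, e.g. a delayed batch job
        self.known_bookings.add((event["user"], event["city"]))

    def should_send_price_alert(self, user, city):
        # "aggressive" retargeting: alert for every city of interest
        # that marketing does not (yet) know was already booked
        return (user, city) in self.interests and (user, city) not in self.known_bookings

orders, marketing = OrderSystem(), MarketingSystem()
marketing.track_interest("me", "Vienna")
event = orders.book("me", "Vienna")

alert_before_sync = marketing.should_send_price_alert("me", "Vienna")  # stale view: True
marketing.sync_booking(event)  # consistency arrives "eventually"
alert_after_sync = marketing.should_send_price_alert("me", "Vienna")   # now False
```

Until `sync_booking` runs, the marketing side happily sends the "prices just dropped" mail for a city you already booked, which matches the behaviour I keep observing.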

Let me know what experiences you had.

… this is at least what I often hear. Basically, when talking to data-minded people, they would argue "false but true". Business units often press for dirty data delivery, forcing IT units to deliver this kind of data in an ad-hoc manner, with a lack of governance and in bad quality. This ends up with business projects being carried out inefficiently and lacking a 360-degree view on the data. Business units thus often trigger inefficiency with data, and projects fail; they are more or less digging their own hole.

The issue with data governance is simple: you hardly see it in the P&L if you did it right. At least, you don't see it directly. If your data is in bad shape, you might see it in other results, such as failing projects or bad outcomes in projects that use data. Often the business is blamed for bad results, even though the data was the weak point. It is therefore very important to apply a comprehensive data governance strategy across the entire company (and not just one division or business unit). Governance consists of several topics that need to be addressed:

  • Data Security and Access: data needs to stay secure and storages need to implement a high level of security. Access should be easy but secure. Data Governance should enable self-service analytics and not block it.
  • One common data storage: Data should be stored under the same standards across the company. A defined number of storages should cover all needs, and different storage techniques should be connected. No silos should exist.
  • Data Catalog: It should be possible to see what data is available in the company and how to access it. A data catalog should make it possible to browse different data sources and see what is inside (as long as one is allowed to access this data)
  • Systems/Processes using data: it should be tracked and audited what systems and processes access data. If there are changes to data, it should be possible to see what systems and processes might be affected by it.
  • Auditing: An audit log should be available, especially to see who accessed which data and when.
  • Data quality tracking: it should be possible to track the quality of datasets along specific dimensions, such as accuracy, timeliness, correctness, …
  • Metadata about your data: Metadata about the data itself should be available. You should know what can be inside your data, and your metadata should describe your data precisely.
  • Master data: you should have a golden record for all your data. This is challenging and difficult, but it should be the target.
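To make a few of these topics concrete, here is a hedged sketch of a minimal data catalog entry that combines metadata, access roles, an audit log and per-dimension quality scores. All field names and the scoring rule are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass, field

# Hypothetical, minimal data catalog entry: metadata, ownership/access,
# audit log and quality scores per dimension (all names are illustrative).

@dataclass
class CatalogEntry:
    name: str
    owner: str
    description: str
    allowed_roles: set
    quality: dict = field(default_factory=dict)    # dimension -> score in [0, 1]
    audit_log: list = field(default_factory=list)  # (user, action) tuples

    def record_access(self, user, action):
        # auditing: keep track of who did what with this dataset
        self.audit_log.append((user, action))

    def quality_score(self):
        # naive overall score: average over the tracked dimensions
        return sum(self.quality.values()) / len(self.quality) if self.quality else 0.0

entry = CatalogEntry(
    name="crm.customers",
    owner="data-office",
    description="Customer master data from the CRM system",
    allowed_roles={"analyst", "data-engineer"},
    quality={"accuracy": 0.9, "timeliness": 0.6, "completeness": 0.8},
)
entry.record_access("alice", "read")
print(round(entry.quality_score(), 2))  # 0.77, the average of the three dimensions
```

Even such a simple structure already supports browsing (name, description), access control (roles), auditing and quality tracking, which covers a good part of the list above.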

Achieving all of this is very complex, but it is possible if the company implements a good data strategy.

One topic every company is currently discussing at a high level is marketing automation. It is a key factor in digitalising a company's marketing approach. With Marketing Automation, we have the chance that marketing gets much more precise and to the point: no more unnecessary marketing spend, every cent spent wisely, and no advertisement overload. So far, this is the vendors' promise, assuming we all lived in a perfect world. But what does it take to live in this perfect marketing world? DATA.

One disclaimer upfront: I am not a marketing expert. Among other tasks, I try to enable marketing to achieve these goals through the utilisation of our data. Data is the weak point in Marketing Automation: if you have bad data, you will end up with bad Marketing Automation. Data is the engine, or the oil, of Marketing Automation. But why is it so crucial to get the data right?

Until now, data was never seen as a strategic asset within companies. It was rather treated as something you have to store somewhere, so it ended up being stored in silos within different departments, making access hard and connections difficult. Governance was, and still is, neglected. When Data Scientists start to work with data, they often fight governance issues: what is inside the data, why is the data structured in a specific way, and what should the data tell us? Overcoming this often takes weeks and is expensive. Some industries (e.g. banks) are more mature, but they are struggling with this as well.

In recent years, many companies built data warehouses to consolidate their view on the data. Data warehouses are heavily outdated and overly expensive nowadays, and most of them are still poorly structured. Companies then started to shift data to data lakes (initially Hadoop) to get a 360° view. Economically, this makes perfect sense, but a holistic customer model is a challenge there too; it takes quite some time and resources to build. The newest hype in marketing is Customer Data Platforms (CDPs). Their value isn't proven yet, but most of them are an abstraction layer to make data handling easier for marketeers. However, integrating the data into a CDP is challenging in itself, and there is a high risk of creating yet another data silo.

In order to enable Marketing Automation with data, the following steps are necessary:

  • Get your data house in order. Build your data assets on open standards so you can change technology and vendor if necessary. Don't lock your data in to one vendor.
  • Take the first steps in small chunks, closely aligned with Marketing, in an agile way. Customer journeys are often tied to specific data sources, so a full-blown model isn't necessary. However, make sure the model stays extensible and the big picture is always available. A recommendation is to use a NoSQL store, such as a document store, for the model.
  • Keep the data processing on the data lake; the abstraction layer (I call it Customer 360) interacts with the data lake and uses tools from it.
  • Governance needs to be done from the first steps, as it is far too difficult to add at a later stage. Establish a data catalog for easy retrieval and search, and for data quality metrics/scoring.
  • Establish central identity management and household management. It is necessary to have a "golden record" of a customer, with all relevant entities linked to the customer.
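The "golden record" from the last step can be illustrated with a small sketch. Here, several source records share an identity key and are merged into one Customer-360 document; the survivorship rule (first non-empty value wins) and all field names are assumptions for the example.

```python
# Hypothetical sketch of building a "golden record": source records for
# the same customer are linked via a shared customer_id and merged into
# one Customer-360 document (merge rule and field names are illustrative).

source_records = [
    {"source": "crm",  "customer_id": "c-1", "name": "Jane Doe", "email": None},
    {"source": "shop", "customer_id": "c-1", "email": "jane@example.com"},
    {"source": "web",  "customer_id": "c-1", "last_seen_city": "Vienna"},
]

def golden_record(records):
    merged = {}
    for rec in records:
        for key, value in rec.items():
            if key == "source":
                continue
            # simple survivorship rule: first non-empty value wins
            if value is not None and key not in merged:
                merged[key] = value
    return merged

c360 = golden_record(source_records)
print(c360["email"])  # filled from the shop system, not the empty CRM field
```

In practice, the hard part is the identity resolution itself (deciding that the three records belong to the same person); the merge afterwards is the easy bit.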

With Marketing Automation, we basically differentiate two types of data (which is why a Lambda Architecture is my recommendation here):

  • Batch data. This kind of data doesn't change frequently, such as customer details. It also includes data for models that run on larger datasets and thus require time-series data. Typically, the analytical models that run on this data are promoted as KPIs or fields to the C360 model.
  • Event data. Data that needs to feed into Marketing Automation platforms fast. This could be a product a customer just bought; once this happens, the now unnecessary ads should be removed (otherwise, you would lose money).

This is just a high-level view, but handling data right for marketing is getting more and more important. And you need to get your own data in order; you can't outsource this task.

Let me know what challenges you had with this so far, as always – looking forward to discuss this with you 🙂

Agility is an important factor in Big Data applications. Rys (2011) describes three different agility factors: model agility, operational agility and programming agility.

Data agility

Model agility describes how easy it is to change the data model. Traditionally, in SQL systems it is rather hard to change a schema. Other systems, such as non-relational databases, allow easy changes. If we look at key/value stores such as DynamoDB (Amazon Web Services, 2013), changing the model is very easy. Databases behind fast-changing systems such as social media applications, online shops and others require model agility, as updates to such systems occur frequently, often weekly or even daily (Paul, 2012).
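Model agility in a schema-less store can be shown with a toy example. Records are plain key/value documents, so a new attribute can be added per item without a schema migration; in a SQL system, the same change would typically require an ALTER TABLE plus a migration of existing rows. The store here is simulated with a dict; the attribute names are made up.

```python
# Sketch of model agility in a schema-less key/value store (simulated
# here with a plain dict): new attributes need no schema migration.

users = {
    "u1": {"name": "Alice"},
    "u2": {"name": "Bob"},
}

# "changing the model": a new record simply carries an extra attribute
users["u3"] = {"name": "Carol", "premium": True}

# old records stay valid; readers handle the missing attribute explicitly
def is_premium(user_id):
    return users[user_id].get("premium", False)

print(is_premium("u3"))  # True
print(is_premium("u1"))  # False, the old record never had the field
```

The flexibility comes at a price: the schema effectively moves into the reading code, which must tolerate records written against every historical version of the model.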

In distributed environments, it is often necessary to change operational aspects of a system. New servers are added frequently, often with different characteristics such as operating system and hardware. Database systems should stay tolerant to such operational changes, as this is a crucial factor for growth.

Database systems should also support software developers. This is where programming agility comes into play. Programming agility describes the approach that the database and all associated SDKs should ease the life of a developer working with the database, and should furthermore support fast development.

Whenever we talk about Big Data, one core topic is often left out: data quality. All the data in the world doesn't really help us if its quality is poor. There are several key dimensions data should fulfil in terms of quality.

Relevance – Data should contain a relevant subset of the reality to support the tasks within a company.

Correctness – Data should be very close to reality and correct.

Completeness – There should be no gaps in data sets, and data should be as complete as possible.

Timeliness – Data should be up-to-date.

Accuracy – Data should be accurate enough to serve the needs of the enterprise.

Consistency – Data should be consistent across systems and not contradict itself.

Understandability – Data should be easy to interpret. If it is not possible, data should be explained by metadata.

Availability – Data should be available at any time.
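Two of these dimensions, completeness and timeliness, lend themselves to simple automated checks. Below is a hedged sketch over a tiny record set; the thresholds, field names and scoring are illustrative assumptions, not a standard metric.

```python
from datetime import date

# Illustrative checks for two quality dimensions on a small record set:
# completeness (share of non-empty values) and timeliness (share of
# records updated after a cut-off date).

records = [
    {"id": 1, "email": "a@example.com", "updated": date(2024, 1, 10)},
    {"id": 2, "email": None,            "updated": date(2023, 5, 1)},
    {"id": 3, "email": "c@example.com", "updated": date(2024, 1, 12)},
]

def completeness(records, field):
    # share of records with a non-empty value for the given field
    filled = sum(1 for r in records if r.get(field))
    return filled / len(records)

def timeliness(records, newer_than):
    # share of records updated after the cut-off date
    fresh = sum(1 for r in records if r["updated"] > newer_than)
    return fresh / len(records)

print(round(completeness(records, "email"), 2))         # 0.67
print(round(timeliness(records, date(2024, 1, 1)), 2))  # 0.67
```

Dimensions such as relevance, understandability or correctness are much harder to score automatically; they usually require metadata and human judgement rather than a one-line check.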