
Agile ways of working are almost everywhere, and they are increasingly adopted in other hyped domains such as Data Science. One thing I like in this respect is the combination with DevOps, as it streamlines the process and creates end-to-end responsibility. However, I strongly believe it doesn't make sense to exclude the business. In the case of Analytics, I would argue for BizDevOps.

There is huge demand for Data DevOps nowadays. Data Science requires a lot of business integration and cuts across different domains and functions. I have outlined several times, in different posts here, that Data Science isn't a job done by Data Scientists alone. It is teamwork and thus needs different people. The concept of BizDevOps makes this easy to explain; let's have a look at the following picture, and afterwards I will outline the interdependencies it shows.

The process for Data Science: BizDevOps is the answer

Figure: BizDevOps for Data Science

There must be exactly one person who takes end-to-end responsibility, ranging from business alignment to translation into an algorithm and finally to making it productive by operating it. This is the typical BizDevOps workflow. The person taking this end-to-end responsibility is typically a project or program manager working in the data domain. The three steps are outlined in the figure above; let's now have a look at each of them.

Data DevOps: Biz

The program manager for Data (you could also call this person the "Analytics Translator") works closely with the business, be it marketing, fraud, risk, the shop floor, …, to gather their requirements and needs. This person also has a good understanding of what is feasible with the company's internal data, in order to be capable of "translating a business problem into an algorithm". At this stage it is mainly about the use case and not so much about tools and technologies; those come in the next step. Up to this point, Data Scientists aren't necessarily involved yet.

Data DevOps: Dev

This phase is all about implementing the algorithm and working with the data. The program manager mentioned above has already aligned with the business and written a detailed description. Data Scientists and Data Engineers now join the work: Data Engineers prepare and fetch the data and work with the Data Scientists on finding the answer to the business question. There are several iterations and feedback loops back to the business as more and more answers arrive. This process should only take a few weeks, ideally 3 to 6. Once the results are satisfactory, the work moves to the next phase: bringing it into operation.

Data DevOps: Ops

This phase is about operating the algorithms that were developed. The data engineer is in charge of integrating them into the live systems. The business unit typically wants to see the result as a continuously calculated KPI or as some other action that creates impact. Continuous improvement of the models also happens here, since the business will come up with new ideas. In this phase the data scientist is no longer involved; it is the data engineer or a dedicated DevOps engineer alongside the program manager.

Eventually, once the project is done (I dislike "done" because, in my opinion, a project is never done), the entire workflow moves into a continuous integration (CI) process.
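To make the Ops step a bit more tangible, here is a minimal sketch of a batch scoring job that a scheduler (cron, a workflow tool, or a CI pipeline) could run periodically. The file names, the model format and the churn score are assumptions for illustration only, not something prescribed by the process above.

```python
# Minimal sketch of an "Ops" batch scoring job (hypothetical file names and columns).
# A scheduler such as cron or a workflow tool would run this script periodically.
import pandas as pd
import joblib

def score_batch(model_path: str, input_path: str, output_path: str) -> None:
    model = joblib.load(model_path)    # model trained in the "Dev" phase
    data = pd.read_csv(input_path)     # fresh data prepared by data engineering
    features = data.drop(columns=["customer_id"])
    data["churn_score"] = model.predict_proba(features)[:, 1]
    # The business consumes this as a continuously updated KPI table.
    data[["customer_id", "churn_score"]].to_csv(output_path, index=False)

if __name__ == "__main__":
    score_batch("model.joblib", "customers_today.csv", "churn_kpi.csv")
```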

This post is part of the "Big Data for Business" tutorial. Our focus was on Data DevOps and BizDevOps. In this tutorial, I explain various aspects of handling data right within a company. I also recommend reading about the concept of DevOps.

The three degrees of data access

To get the most out of your data strategy in an enterprise, it is necessary to cluster the different user types that arise. All of them are users of data, but with different needs and demands. In my opinion, they differ mainly in their level of expertise. Basically, I see three different user types for data access within a company:

Figure: Data access on three different levels

The user types differ in how they use data and in how many of them there are. Let's start with the lower part of the pyramid: the Business Users.

Business Users

The first layer are the business users. These are users who need data for their daily decisions but are mainly consumers of data. They look at different reports to make decisions on their business topics and could sit in Marketing, Sales or Technology, depending on the company. Typically they start with pre-defined reports, but in the long run they ask for customized ones; self-service BI is a great fit here. These users are experienced in interpreting data for their business goals and asking questions of their data, for example reviewing the performance of a campaign or weekly and monthly sales reports. They create a huge load on the underlying systems without understanding the implementation and complexity underneath, and they don't have to. From time to time they start digging deeper into their data and thus become power users, our next level.

Power Users

Power users often emerge from business users. A power user is typically someone who is close to the business and understands the needs and processes around it, but who also has a solid technical understanding (or gained it on the way to becoming a power user). They have some SQL know-how or know the basics of other scripting tools; a small sketch of a typical query follows below. They often work with the business users (sometimes in the same department) on solving business questions, and they work closely with Data Engineers on accessing and integrating new data sources. They also use self-service analytics tools to get a basic level of data science done. They aren't data scientists, but they might move in that direction if they invest significant time. This brings us to the next level: the data scientists.
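As promised above, here is a small sketch of such a power-user query, using Python on top of SQL. The database file and the orders table with its columns are invented for this example.

```python
# Sketch of a typical power-user query: campaign performance by week.
# The "sales.db" database and the "orders" table with its columns are hypothetical.
import sqlite3
import pandas as pd

conn = sqlite3.connect("sales.db")
weekly_revenue = pd.read_sql_query(
    """
    SELECT strftime('%Y-%W', order_date) AS week,
           campaign,
           SUM(revenue) AS total_revenue
    FROM orders
    GROUP BY week, campaign
    ORDER BY week
    """,
    conn,
)
print(weekly_revenue.head())
```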

Data access for Data Scientists

This is the top level of our pyramid. Data scientists aren't the majority; business users and power users are far more numerous. However, data scientists work on more challenging topics than the previous two groups, and they work closely with power users and business users, possibly, but not necessarily, in the same department. They use advanced tools such as R and Python to fine-tune the models the power users built with self-service analytics tools, or to translate the business questions raised by the business users into algorithms.
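To make the fine-tuning part a bit more concrete, here is a minimal sketch in Python with scikit-learn. The feature names and the churn label are assumptions for illustration, not part of any real project.

```python
# Minimal sketch of a data scientist refining a model (hypothetical features and label).
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("customers.csv")   # data prepared by data engineering
X = df[["tenure_months", "monthly_spend", "support_tickets"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Grid search replaces the manual tuning a power user would do in a self-service tool.
search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"n_estimators": [100, 300], "max_depth": [2, 3]},
    cv=5,
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```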

Often these three groups develop in different directions. However, all of them need to work together, as a team, to make data projects a success. When granting data access, it is also necessary to incorporate role-based access controls.
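A very rough sketch of what role-based access control for these three user types could look like (the roles and permissions are made up for illustration):

```python
# Toy role-based access control for the three user types (hypothetical permissions).
ROLE_PERMISSIONS = {
    "business_user": {"read_reports"},
    "power_user": {"read_reports", "run_sql"},
    "data_scientist": {"read_reports", "run_sql", "read_raw_data"},
}

def is_allowed(role: str, action: str) -> bool:
    return action in ROLE_PERMISSIONS.get(role, set())

assert is_allowed("power_user", "run_sql")
assert not is_allowed("business_user", "read_raw_data")
```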

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.

As outlined in an earlier post here, becoming a data scientist requires a lot of knowledge.
To recap, a data scientist needs knowledge in different IT domains:

  • General understanding of distributed systems and how they work. This includes Linux administration skills as well as hardware-related skills such as networking.
  • Knowledge of Hadoop or similar technologies. This builds on top of the former but is somewhat different and requires a more software-focused skill set.
  • Strong statistical and mathematical knowledge. This is necessary to actually work on the required tasks and to figure out how they translate into real algorithms.
  • Presentation skills. Everything else is worth little if the findings in the data can't be presented; management won't see the point if the person can't present the data in an appropriate way.

In addition, there are some other skills necessary:

  • Knowledge of the legal situation. The legal basics differ from country to country; although the European Union provides a common legal framework for its member states, there are still differences.
  • Knowledge of societal impacts. It is also necessary to understand how society might react to data analysis; especially in marketing, it is essential to handle this correctly.

Since more and more IT companies are looking for the ideal data scientist, they should first ask who could possibly master all of these skills. The answer might well be: no single person can. Someone who is great at distributed systems and Hadoop may well struggle with translating questions into algorithms and finally presenting the results.
Data Science is a team effort rather than something one person can handle alone. Therefore, it is necessary to build a team that can address all of these challenges.

Data is often stored in one system, while the analytical systems sit somewhere else. In this tutorial, we will look at the challenges of moving data for analysis.

Moving data for analysis

Another issue with Big Data is pointed out by (Alexander, Hoisie, & Szalay, 2011): data can't easily be moved for analysis. With Big Data, we often deal with several terabytes or more, and moving that amount over a network connection is difficult or even impossible.
If real-time data is analyzed, it is practically impossible to move it to another cluster, since the data would be outdated or not yet available by the time it arrives. Real-time analysis is needed in fraud protection, for instance; if the data first has to be moved to another cluster, it might already be too late.
With traditional databases this wasn't much of a problem, since the data often amounted to a few gigabytes in a single database. With Big Data, data comes in various formats, at high volume and at high velocity; satisfying all of these factors while moving the data to another cluster may simply not be possible.
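To get a feeling for the numbers, here is a back-of-the-envelope calculation. The data volume and link speed are example values, not figures from the cited paper.

```python
# Back-of-the-envelope transfer time for moving data to another cluster.
# Example values: 10 TB of data over a dedicated 10 Gbit/s link.
data_bytes = 10 * 10**12          # 10 TB
link_bits_per_s = 10 * 10**9      # 10 Gbit/s
seconds = data_bytes * 8 / link_bits_per_s
print(f"{seconds / 3600:.1f} hours")   # roughly 2.2 hours at full line rate
```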

What are the challenges?

(Alexander, Hoisie, & Szalay, 2011) describe several factors that make moving data to another cluster challenging: high-flux data, structured and unstructured data, real-time decisions and data organization.
High-flux data is data that arrives in real time. If it must be analyzed, this also has to happen in real time, because the data might be gone or modified later. In Big Data applications, data arrives structured as well as unstructured.
Decisions on data must often be made in real time. In a stream of financial transactions, for example, an algorithm has to decide on the spot whether a transaction needs more detailed analysis. If not all data can be stored, an algorithm must also decide which records are kept at all. Data organization is another challenge when it comes to moving data.
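A minimal sketch of such a real-time decision on a transaction stream could look like this; the fields and thresholds are invented for illustration.

```python
# Toy real-time decision loop over a transaction stream (hypothetical fields and thresholds).
from typing import Dict, Iterable

def flag_for_detailed_analysis(tx: Dict) -> None:
    print("flagged:", tx)

def store(tx: Dict) -> None:
    pass  # e.g. append to a message queue or database

def route_transactions(stream: Iterable[Dict]) -> None:
    for tx in stream:
        if tx["amount"] > 10_000 or tx["country"] != tx["card_country"]:
            flag_for_detailed_analysis(tx)   # expensive path, runs on few records
        elif tx["amount"] > 100:
            store(tx)                        # keep for later batch analysis
        # everything else is dropped to limit storage volume

route_transactions([{"amount": 12_000, "country": "AT", "card_country": "DE"}])
```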

I hope you enjoyed this part of the tutorial about big data technology. This post is part of the Big Data Tutorial; make sure to read the entire series.

Along with data consistency comes another challenge: data concurrency. What it means is described in this tutorial.

What are the challenges with data concurrency?

Data needs to be partitioned if it can't be stored on a single system. With Big Data applications, we aren't talking about small storage systems but about distributed ones. Data might be partitioned over hundreds or thousands of nodes, and the database must scale out to that demand.

Data partitioning is a key concept for databases and serves Big Data applications well. However, if data is distributed over several servers, it might take a while until all nodes are informed about a change.

To avoid concurrency issues, data must be locked. This can result in poor database performance if the database is to be kept consistent at all times. One solution is to give up strict consistency in favor of data partitioning and availability. This trade-off is described in more detail later in this tutorial, when we focus on the CAP theorem.

How does this play out?

Let's imagine a web shop. There are two users in our example, User A and User B, and both want to buy a product P of which exactly one item is in stock. User A sees this and proceeds with the checkout, and so does User B. They complete their orders at about the same time.

The database in our example is designed to prefer partitioning (and availability) over consistency, so both users get an acknowledgement that their order was processed. Now we have -1 items in stock, since no database trigger or other mechanism told us that we had run out. We either have to tell one user to "forget" the order or find a way to deliver the item to both.
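A toy simulation of this race condition could look as follows. The two in-memory "replicas" only stand in for the missing coordination between nodes; they are not meant as a real database.

```python
# Toy illustration of the overselling race on an eventually consistent store:
# both users read stock = 1 from their replica before either write propagates.
stock_replica_a = {"product_p": 1}   # User A's view
stock_replica_b = {"product_p": 1}   # User B's view (not yet synchronized)

seen_a = stock_replica_a["product_p"]
seen_b = stock_replica_b["product_p"]

orders = []
if seen_a > 0:
    orders.append("User A")
    stock_replica_a["product_p"] = seen_a - 1
if seen_b > 0:
    orders.append("User B")
    stock_replica_b["product_p"] = seen_b - 1

# After the replicas reconcile, two units were sold but only one existed.
true_stock = 1 - len(orders)
print("confirmed orders:", orders)      # ['User A', 'User B']
print("actual items left:", true_stock) # -1
```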

In any case, one user will likely be unhappy. Some web shops solve this in a non-technical way: they tell the user "sorry, we are unable to deliver in time" and offer the option to cancel the order or accept a voucher. There is no simple technical solution to this.

How to solve data concurrency issues?

In most cases, it will cost the company money. If the web shop used a system built for strict consistency, it might run into database outages, and users might not buy products because the web site is simply "not available". The web shop can lose money either through users who were unable to buy because of delays in the database or through consistency issues.

In the case of an outage, users might not return at all, annoyed by the "bad performance of the website" and the "inability to process the order", whereas people are more likely to return and buy other products if they receive a voucher for an issue caused by data partitioning and concurrency.

I hope you enjoyed this part of the tutorial about big data technology. This post is part of the Big Data Tutorial; make sure to read the entire series.

Another challenge for Big Data is the use of different storage systems. This creates a lot of variety in the data and thus increases complexity. In this tutorial, we will discuss it.

What are the problems of different storage systems?

A main factor in Big Data is the variety of data. Data not only changes over time (e.g. a web shop that starts out selling books may later also sell cars) but also comes in different formats, and databases must support this.
Companies don't store all their data in one single database but in several, and different APIs consume different formats such as JSON, XML or others. Facebook, for instance, uses MySQL, Cassandra and HBase to store its data: three different storage systems (Harris, 2011) (Muthukkaruppan, 2010), each serving a different need.
(Helland, 2011) describes the challenges for datastores with four key principles:

  • unlocked data
  • inconsistent schema
  • extract, transform and load
  • too much to be accurate

What are these aspects about?

Unlocked data refers to the fact that data is traditionally locked for updates, whereas Big Data systems can't rely on locking; this may cause problems, and unlocked data in turn leads to semantic changes in a database. With inconsistent schema, (Helland, 2011) describes the challenge of data coming from different sources in different formats: the schema needs to be flexible enough to deal with extensibility, because, as stated earlier, businesses change over time and so does the data schema.
Extract, transform and load matters greatly in Big Data systems, since data comes from many different sources and needs to be brought into place in a specific system. Too much to be accurate describes the "velocity" problem of Big Data applications: by the time a result is calculated, the data the calculation was built on may already have changed. (Helland, 2011) states that you might not be able to be accurate at all and can only approximate results.
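To illustrate the "inconsistent schema" point, here is a small sketch that normalizes records arriving in different formats with different fields. The field names are invented.

```python
# Sketch: normalizing records that arrive with inconsistent schemas (invented fields).
import json
import xml.etree.ElementTree as ET

json_record = json.loads('{"id": 1, "name": "Alice", "country": "AT"}')
xml_record = ET.fromstring(
    "<customer><id>2</id><name>Bob</name><segment>B2B</segment></customer>"
)

def normalize(record) -> dict:
    """Map heterogeneous inputs onto one flexible dict; unknown fields are kept."""
    if isinstance(record, dict):
        return dict(record)
    return {child.tag: child.text for child in record}

customers = [normalize(json_record), normalize(xml_record)]
print(customers)
# [{'id': 1, 'name': 'Alice', 'country': 'AT'}, {'id': '2', 'name': 'Bob', 'segment': 'B2B'}]
```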

I hope you enjoyed this part of the tutorial about big data technology. This post is part of the Big Data Tutorial; make sure to read the entire series.

Big Data needs big storage, and storage is in the end a physical device. Until now, most storage devices have been hard disks that require mechanical movement. In this tutorial we will discuss the storage performance challenges.

What are the storage performance challenges?

A common enterprise hard drive available today (December 2012) runs at 15,000 revolutions per minute (rpm) (Seagate, 2013), and a desktop hard drive at some 7,200 rpm. Either way, there is significant latency until the read head is in place. The mechanical approach to storage has been around for decades, and scientists as well as engineers complain about storage performance.

Main memory has always been faster than hard disk storage, and network speeds also exceed what hard disks can deliver. (Anthes, 2012) states that disk-based storage is about 10 to 100 times slower than a network and about 1,000 times slower than main memory. This means there is a significant bottleneck when delivering data from disk-based storage to an application.
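The rotational latency alone already illustrates the gap. A quick calculation based on the drive speeds mentioned above (the DRAM figure in the comment is a typical order of magnitude, not taken from the cited sources):

```python
# Average rotational latency: on average, half a revolution passes before the data
# reaches the read head.
for rpm in (15_000, 7_200):
    seconds_per_rev = 60 / rpm
    avg_latency_ms = seconds_per_rev / 2 * 1000
    print(f"{rpm} rpm -> ~{avg_latency_ms:.1f} ms average rotational latency")
# 15,000 rpm -> ~2.0 ms; 7,200 rpm -> ~4.2 ms.
# Typical DRAM access is on the order of 100 ns, i.e. tens of thousands of times faster.
```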

As Big Data is about storing and analyzing data, this is a major challenge for Big Data applications. It doesn't help much to have enough compute power to analyze the data if the disks simply can't deliver it fast enough.

Data is distributed

Supercomputers nowadays are usually measured in cores and teraflops (Top 500 Supercomputers Site, 2012). That is fine if you want to do heavy calculations, such as on the human genome, but it tells us nothing about disk performance when we want to store or analyze data. (Zverina, 2011) cites Allan Snavely, who proposes to include disk performance in such metrics as well:

“I’d like to propose that we routinely compare machines using the metric of data motion capacity, or their ability to move data quickly” – Allan Snavely

Allan Snavely also stated that with increasing data size it becomes harder to find data: hard disks keep growing in capacity, but access times stay the same.

This can be illustrated easily: imagine an external hard disk with a capacity of 1 TB, running at 7,200 rpm with a 16 MB cache. It stores 1,000 videos of 1 GB each, which fills the entire disk. As your video collection grows, you switch to a 2 TB disk.

Once this disk is full, you won't be able to transfer the videos to another system in the same time as you did with the 1 TB drive; the 2 TB disk will need roughly twice as long. While compute performance keeps growing, the performance of accessing data stays about the same, and relative to the growth of data and storage capacity it even gets worse. Allan Snavely (Zverina, 2011) describes this with the following statement:

“The number of cycles for computers to access data is getting longer – in fact disks are getting slower all the time as their capacity goes up but access times stay the same. It now takes twice as long to examine a disk every year, or put another way, this doubling of capacity halves the accessibility to any random data on a given media.”
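A back-of-the-envelope calculation makes the point of the video example concrete. The sustained transfer rate of 100 MB/s is an assumption for illustration, not a figure from the source.

```python
# Time to read a full disk end-to-end at an assumed sustained rate of ~100 MB/s.
rate_mb_per_s = 100
for capacity_tb in (1, 2):
    seconds = capacity_tb * 1_000_000 / rate_mb_per_s
    print(f"{capacity_tb} TB -> ~{seconds / 3600:.1f} hours to read completely")
# 1 TB -> ~2.8 hours, 2 TB -> ~5.6 hours: capacity doubles, and so does the scan time.
```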

How to overcome these challenges?

In the same article, Snavely suggests including the following metrics in a computer's performance figures: DRAM, flash memory, and disk capacity.

But what can enterprises do to achieve higher throughput in their systems? There is already some research on this, and most sources point towards solid-state disks (SSDs) as storage. SSDs are becoming commodity hardware in high-end personal computers, but they are not that common in servers and distributed systems yet.

SSDs normally offer better performance but lower capacity, and the price per GB is higher. For large-scale databases that need performance, SSDs might be the better choice. The San Diego Supercomputer Center (SDSC) built a supercomputer, called "Gordon", with SSDs; it can handle data up to 100 times faster than with conventional drives (Zverina, 2011).

Another prototype, called "Moneta" (Anthes, 2012), used phase-change memory to boost I/O performance. It was about 9.5 times faster than a conventional RAID system and about 2.8 times faster than a flash-based RAID system.

There is significant research around this topic, as storage performance is a problem for large-scale, data-centric systems such as today's Big Data applications.

I hope you enjoyed this part of the tutorial about big data technology. This post is part of the Big Data Tutorial; make sure to read the entire series.