With the consistency of data, another challenge arrives: data concurrency. What it means is described in this tutorial.
What are the challenges with data concurrency?
Data needs to be partitioned if it can’t be stored on a single system. With Big Data applications, we don’t talk about small storages but rather about distributed systems. Data might be partitioned over hundred or thousand of nodes and the database must scale out to that demand.
Data partitioning is a key concept for databases and it serves as well in Big Data applications. However, if data is distributed over some servers, it might take a while until all nodes are informed about the changes.
To avoid concurrency issues, the data must be locked. This might result in a poor database performance if the database should be kept consistent at all time. One solution is to forget about data consistency in favor of data partitioning. This approach is described in detail in section 1.6.2 when we will focus on the CAP-Theorem.
How does this play out?
Let’s imagine a Web shop. There are 2 users in our sample; both of them (let’s call them User A and User B) want to buy a Product P. There is exactly one item on stock. User A sees this and proceeds with the checkout, as well as User B. They complete the order at about the same time.
The Database in our sample is designed in a way that partitioning is preferred over consistency and both Users get the acknowledgement that their Order was processed. Now we would have -1 items in stock since no database trigger or any other command told us that we ran out of items. We either have to tell one User to “forget” the order or have to find a way to deliver the item to both users.
In any case, one user might get angry. Some web shops solved this issue in a non-technical way: they tell the user “sorry, we are unable to deliver in time” and give them the option to cancel the order or take a voucher. However, there is no simple technical solution to that.
How to solve data concurrency issues?
In most cases, it will cost money to the company. If the web shop would use a system built for consistency, it might run into database outages. Users might not buy products at their web site since the web site is simply “not available”. The web shop can either loose money by users that were unable to buy products because of delays in the database or by consistency issues.
In the case of web shop outage, users might not return and buy products since they are annoyed about the “bad performance of the website” and “inability to process the order”, whereas people would return and buy other products if they get a voucher because of issues that came with data partitioning and concurrency.
I hope you enjoyed the first part of this tutorial about transformable and filterable data. This tutorial is part of the Big Data Tutorial. Make sure to read the entire tutorials.