Errol and I had the idea of doing a “Data Hike” back in 2020. However, Covid restrictions didn’t allow us to meet, so both of us ended up hiking on our own – Errol at the beautiful lakes of Sweden, me at the beautiful alpine lakes of Austria. Things have gotten easier now, though, and so we finally made it last weekend!

Our hike took us to the beautiful and scenic area of Semmering in Lower Austria. The place is not just known for its scenery, but also as a region the Austrian emperors frequently visited. The hike took us through forests, over alpine pastures and even up a small summit (Großer Pfaff). We touched on several topics around data – and one that is rather remote from it.

The Data Mesh Hike

An aerial picture of the Stuhleck
The Data Hike in Austria

One of the key topics we talked about was the “Data Mesh”. This trend, started in 2019 by Zhamak Dehghani, is getting popular fast. Both Errol and I had our own ideas about it and how to approach it. We were aligned on the view that it isn’t just a trend but something that will make our lives easier and change how we deal with data. We talked about some companies we both know that are already thinking in that direction. We also highlighted the need for more know-how on this topic, since it is new and still emerging.

The Datalake

Talking about the Data Mesh while we had a break

We didn’t really pass a lake during our hike – only a swamp. Fittingly, that is what many datalakes eventually became: data swamps. We discussed how this happened and why it ended like this; we argued that it was most likely because companies focused on implementing a complex technology and fully ignored the fact that good governance is necessary. All the tech geeks discussed the fancy technology around the datalake – e.g. Spark, Hive, Kafka, … – but nobody wanted to “pitch” for governance. Our final thought was: the datalake is dead. However, we both agreed that this only applies to the legacy technology, not the concept behind it.

Heads up in the Cloud

Once we reached the top of the Großer Pfaff, we had a marvelous view over the area. This led us to another topic: the Cloud. In the Cloud, you reach another level of maturity by using microservices and the like. It isn’t necessary anymore to build a big datalake – it already exists and is readily available. Using services such as S3 is easy and straightforward. There is no need for a complex HDFS-based solution.
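
As a small aside, here is a sketch of how little code that takes – assuming boto3 is installed and credentials are configured; the bucket name and key below are made up purely for illustration:

import boto3

s3 = boto3.client("s3")

# write a small object to a (hypothetical) bucket ...
s3.put_object(Bucket="my-data-hike-bucket",
              Key="notes/semmering.txt",
              Body=b"Data mesh > data swamp")

# ... and read it back
obj = s3.get_object(Bucket="my-data-hike-bucket", Key="notes/semmering.txt")
print(obj["Body"].read().decode("utf-8"))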

It’s the business, stupid!

We made it to our final destination – Großer Pfaff

After a four-and-a-half-hour hike, we reached our car and rewarded ourselves with some nice Topfenknödel. I made my pitch for “Kasnocken”, but they aren’t served in this part of Austria. When we asked ourselves why we did this hike, the answer was simple: data needs to serve a business need. There is no need to go for complex technology – keep it simple (with microservices) and deliver value!

If you are also interested in joining a hike with us, hit me up! One thing is clear to us: this was just the beginning of a series of Data Hikes (or maybe even Data Skis?)

The last years have been exciting for telcos: 5G is the “next big thing” in communications. It promises ultra-high speed with low latency – our internet speed will never be the same again. I worked in the telco business until recently, but I would say that the good times for telcos will soon be gone. Elon Musk is going to destroy this industry and shake it up entirely.

Why will Elon Musk disrupt the Telco industry?

Before we get to that answer, let’s first have a look at what one of his companies is currently “building”. You might have heard of SpaceX. Yes – these are the folks capable of shooting rockets into orbit and landing them again, which significantly reduces the cost per launch. Even NASA is relying on SpaceX. And it doesn’t stop there: Elon Musk is telling us how to get to the Moon (again) and even bring the first people to Mars. This is really visionary, isn’t it?

However, with all these Moon and Mars plans, there is one thing we tend to overlook: SpaceX is bringing a lot of satellites into orbit. Some of them are for other companies, but a significant number are for SpaceX itself. They have launched some 1,700 satellites and are already the largest satellite operator. But what are these satellites for? Well – you might already guess it: for providing satellite-powered internet. Initially, the satellite network was positioned as a solution for areas with poor coverage. Recently, however, the service (named “Starlink“) announced that it now offers global coverage.

One global network …

Wait, did I just write “global coverage”? That’s “insane”. One company can provide internet for each and every person on the planet, regardless of where they are – all 7.9 billion people in the world. That is a huge market to address! What is even more impressive is the cost at which they can build this network. Right now, they have something like 1,700 satellites out there. Each Falcon 9 rocket (which they own!) can transport around 40 of these satellites. All together, the per-satellite launch cost for SpaceX would be around $300,000. According to Morgan Stanley, SpaceX might need well below 60 billion dollars to build a satellite internet of around 30,000 satellites – a far higher number than the 1,700 already up there. However, think about speed and latency: right now, with 1,700 satellites out, Starlink is offering around 300 Mbit/s with 20 ms latency. That is already great compared to 4G, where you merely get up to 150 Mbit/s. Curious what happens once all 30,000 are up? I would expect something like 1 Gbit/s and very low latency. Then it would be a strong competitor to 5G.
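
To make the numbers a bit more tangible, here is a quick back-of-the-envelope calculation in Python, using only the rough figures quoted above (my assumptions from this post, not official SpaceX numbers):

satellites_per_launch = 40
cost_per_satellite = 300_000        # USD, launch share per satellite
target_constellation = 30_000       # satellites in the full constellation

implied_launch_cost = satellites_per_launch * cost_per_satellite
launches_needed = target_constellation / satellites_per_launch
total_launch_cost = target_constellation * cost_per_satellite

print(f"Implied cost per Falcon 9 launch: ${implied_launch_cost:,}")       # $12,000,000
print(f"Launches needed for 30,000 satellites: {launches_needed:,.0f}")    # 750
print(f"Total launch cost: ${total_launch_cost / 1e9:.0f} billion")        # $9 billion

The pure launch share is thus only a fraction of the 60 billion dollar estimate mentioned above.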

Again, the cost …

Morgan Stanley estimated the cost for this network to be around 60 billion USD. That is quite a lot of money for Starlink to gather. It sounds like a lot, but it isn’t. Let’s compare it to 5G again: Accenture estimates that the 5G network for the United States alone will cost some 275 billion – one market. Compare the 60 billion of Starlink – a global network addressing 7.9 billion people – with the U.S., where you can address 328 million people. That is more than 20 times the market, at a fraction of the cost! Good night, 5G.
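
The same napkin math for the market comparison, again with the figures quoted above:

starlink_cost = 60e9       # USD, Morgan Stanley estimate, global network
us_5g_cost = 275e9         # USD, Accenture estimate, United States only
world_population = 7.9e9
us_population = 328e6

print(f"Addressable market: {world_population / us_population:.0f}x larger")              # ~24x
print(f"Network cost: {us_5g_cost / starlink_cost:.1f}x lower")                           # ~4.6x
print(f"Cost per addressable person, Starlink: ${starlink_cost / world_population:.2f}")  # ~$7.59
print(f"Cost per addressable person, US 5G: ${us_5g_cost / us_population:.0f}")           # ~$838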

Internet of things via satellites rather than 5G

5G might also lose the race for the future of IoT applications. Just think about autonomous cars: one key issue there is steady connectivity. 5G might not be available everywhere, and connectivity might be bad in many sparsely populated regions – it simply doesn’t pay off for telcos to build 5G everywhere. Starlink, in contrast, will be everywhere. So large IoT applications will rather go for Starlink. Imagine Ford or Mercedes having one partner to negotiate with rather than 50 different telco providers around the globe. It makes things easier from both a technical and a commercial point of view.

Are Telcos doomed?

I would say: not yet. Starlink is at a very early stage and still in beta. There might be some issues coming up. However, telcos should definitely be afraid. I was in the business until recently, and most telco executives aren’t thinking much about Starlink. If they do, they laugh at it. But remember what happened to the automotive industry? Yep, we are all going electric now. Automotive executives used to laugh at Tesla – a low-volume, niche player, they said. And now? Tesla is more valuable than any other automotive company in the world and produces cars at scale.

However, one thing is different: automotive companies could still attach themselves to the new normal. Building a car is not just about the engine; it is also a lot about the process, the assembly lines and the like. All major car manufacturers now offer electric cars and can build them competitively with Tesla. Starlink vs. 5G will be different: telco companies can’t build rockets. Elon Musk will disrupt another industry – again!

This post is an off-topic post from my Big Data tutorials


In my previous post, I gave an introduction to Python Libraries for Data Engineering and Data Science. In this post, we will have a first look at NumPy, one of the most important libraries to work with in Python.

NumPy is the simplest library for working with data. It is often reused by other libraries such as Pandas, so it is necessary to understand NumPy first. The focus of the library is on easy transformations of vectors, matrices and arrays, and it provides a lot of functionality for that. But let’s get our hands dirty with the library and have a look at it!

Before you get started, please make sure to have the Sandbox set up and ready.

Getting started with NumPy

First of all, we need to import the library. This works with the following import statement in Python:

import numpy as np

This should now give us access to NumPy’s functionality. Let us first create a two-dimensional array with 15 values in it. In NumPy, this works with the “arange” function: we provide 15 as the number of items and then reshape the result to 3×5:

vals = np.arange(15).reshape(3,5)
vals

This should now give us an output array with 2 dimensions, where each row contains 5 values. The values range from 0 to 14:

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

NumPy contains a lot of different constants and functions. To use pi, you simply import “pi” from numpy:

from numpy import pi
pi

We can now use pi for further work and calculations in Python.
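
As a quick, made-up example, pi combines nicely with NumPy arrays – here we compute the areas of circles with radii 1 to 4 in one go:

radii = np.arange(1, 5)
areas = pi * radii**2
areas

This should return array([ 3.14159265, 12.56637061, 28.27433388, 50.26548246]).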

Simple Calculations with NumPy

Let’s create a new array with 5 values:

vl = np.arange(5)
vl

An easy calculation is raising each value to a power. This works with the “**” operator:

nv = vl**2
nv

Now, this should give us the following output:

array([ 0,  1,  4,  9, 16])

The same applies to “3”, if we want to raise everything in the array to the power of 3:

nn = vl**3
nn

And the output should be similar:

array([ 0,  1,  8, 27, 64])

Working with Random Numbers in NumPy

NumPy’s “random” module contains a function – also called “random” – to create random numbers. It takes the dimensions of the array to fit the numbers into; we use a 3×3 array:

nr = np.random.random((3,3))
nr *= 100
nr

Please note that “random” returns numbers between 0 and 1, so in order to get larger numbers we need to “stretch” them – we thus multiply by 100. The output should look something like this:

array([[90.30147522,  6.88948191,  6.41853222],
       [82.76187536, 73.37687372,  9.48770728],
       [59.02523947, 84.56571797,  5.05225463]])

Your numbers should be different, since we are working with random numbers here. We can do the same with a 3-dimensional array:

n3d = np.random.random((3,3,3))
n3d *= 100
n3d

Here too, your numbers will be different, but the overall “structure” should look like the following:

array([[[89.02863455, 83.83509441, 93.94264059],
        [55.79196044, 79.32574406, 33.06871588],
        [26.11848117, 64.05158411, 94.80789032]],

       [[19.19231999, 63.52128357,  8.10253043],
        [21.35001753, 25.11397256, 74.92458022],
        [35.62544853, 98.17595966, 23.10038137]],

       [[81.56526913,  9.99720992, 79.52580966],
        [38.69294158, 25.9849473 , 85.97255179],
        [38.42338734, 67.53616027, 98.64039687]]])
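
As a side note – not needed for the rest of this tutorial – if you want whole numbers in a given range right away, NumPy also provides “randint”, which saves you the multiplication step:

ni = np.random.randint(0, 100, size=(3, 3))
ni

This returns a 3×3 array of random integers between 0 and 99.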

Other means to work with Numbers in Python

NumPy provides several other options to work with data. There are several aggregation functions available that we can use. Let’s now look for the maximum value in the previously created array:

n3d.max()

In my example, this would return 98.6; you would get a different number, since the array is random. It is also possible to return the maximum along a specific axis of the array. We therefore add the keyword “axis” to the “max” function:

n3d.max(axis=1)

This now returns the maximum values along the given axis of the array. In my example, the result looks like this:

array([[93.94264059, 79.32574406, 94.80789032],
       [63.52128357, 74.92458022, 98.17595966],
       [81.56526913, 85.97255179, 98.64039687]])

Another option is to calculate the sum. We can do this over the entire array or by providing the axis keyword:

n3d.sum(axis=1)
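
If we leave out the axis keyword, “sum” aggregates over the entire array and returns a single number:

n3d.sum()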

In the next sample, we make the output look a bit prettier by rounding the numbers to 2 digits:

n3d.round(2)

Iterating arrays in Python

Often, it is necessary to iterate over items. In NumPy, this can be achieved with the built-in iterator, which we get from the “nditer” function. It takes the array to iterate over, and we can then use it in a for loop:

for val in np.nditer(n3d):
    print(val)

The above sample iterates over all values in the array and prints each of them. If we want to modify the items within the array, we need to set the flag “op_flags” to “readwrite”. This enables us to modify the array while iterating over it. In the next sample, we iterate over each item and replace it with its value modulo 3:

# round to whole numbers first to keep the example readable
n3d = n3d.round(0)

# "readwrite" allows us to modify the array while iterating over it
with np.nditer(n3d, op_flags=['readwrite']) as it:
    for i in it:
        i[...] = i % 3

n3d

These are the basics of NumPy. In our next tutorial, we will have a look at Pandas: a very powerful dataframe library.

If you liked this post, you might consider the tutorial about Python itself. It gives you a great insight into the Python language, also for Spark. If you want to know more about Python, you should consider visiting the official page.

In some of my previous posts, I shared my thoughts on the data mesh architecture. The data mesh was originally introduced by Zhamak Dehghani in 2019 and is now enjoying huge popularity in the community. As one of the main ideas of the data mesh is the distributed nature of data, it also leads to a domain-driven design of the data itself. A data circle enables this design.

What is a data circle?

A data circle is a data model that is tailored to its use-case domain. It should follow the approach of the architectural quantum from the microservice architecture: the domain model should contain only the information relevant for the purpose it is built for and no additional data. Also, each circle could or should run within its own environment (e.g. database), and the technology should be selected for the best use of the data. A circle might easily be confused with a data mart built within a data warehouse. However, several data circles don’t necessarily “live” within one (physical) data warehouse – they may use different technologies and are highly distributed.

Each company will have several data circles in place, each tailored to the specific needs of its use-cases. When modelling data with data circles, unnecessary information is skipped, as the circle will – at some point – be connectable with other data circles in the company. Where we previously built our data models in a very comprehensive way (e.g. via the data warehouse), we now build them in a distributed way.

Examples from the telco and insurance industries

If we take for example a telco company, data circles might be:

  • The customer data circle: containing the most important customer data
  • The network data circle: containing information about the network
  • The CDR data circle: containing information about calls conducted

If we look at the insurance industry, data circles might be:

  • The customer data circle: containing the most important customer data
  • The claims data circle: containing the data about past claims
  • The health data circle: containing health-related information

Focusing back on the telco company: the data about the customer might be stored in a relational model within an RDBMS. Network data, however, might be stored in a graph for better spatial analysis, and CDR data might be stored in a file-based setup. For each domain, the best technology is selected and the best model is designed. The same holds true for other industries.

Several data circles make up the design

Different business units will build their own data circles to fit their demands. This, however, makes it necessary to create a central repository that sticks it all together: a hub connecting all the circles. The hub stores information about the connectivity of the different circles. Imagine the network data model again – you might want to connect the network data with customer data. There must be a way to connect this data while still keeping its distributed aspects. The hub serves as a central data asset management tool and a one-stop shop for employees within the company to find the data they need.

A data hub connecting different data circles
Circles connected via a hub

The data hub also allows users to connect to and analyse the data they want to access, for example with tools such as Jupyter. The hub takes care of the connectivity to the data and thus provides an API for all users. A data hub is all about data governance.
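
To make this a bit more tangible, here is a minimal, purely illustrative sketch of what such a hub could look like as a simple registry in Python. All class and field names are made up for this example – a real hub would of course be a fully-fledged data catalog and governance tool:

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataCircle:
    """One domain-owned data circle and how to reach it."""
    name: str
    owner: str                 # the business unit that owns the data
    technology: str            # e.g. RDBMS, graph database, object storage
    connection_uri: str
    linked_circles: List[str] = field(default_factory=list)


class DataHub:
    """Central registry that connects the distributed circles."""

    def __init__(self) -> None:
        self._circles: Dict[str, DataCircle] = {}

    def register(self, circle: DataCircle) -> None:
        self._circles[circle.name] = circle

    def link(self, a: str, b: str) -> None:
        # record that two circles can be joined, without moving any data
        self._circles[a].linked_circles.append(b)
        self._circles[b].linked_circles.append(a)

    def find(self, name: str) -> DataCircle:
        # one-stop shop for employees looking for data
        return self._circles[name]


# usage, following the telco example from above
hub = DataHub()
hub.register(DataCircle("customer", "Marketing", "RDBMS", "postgres://customer-db"))
hub.register(DataCircle("network", "Network Operations", "graph database", "bolt://network-graph"))
hub.link("customer", "network")
print(hub.find("customer").linked_circles)   # ['network']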

What’s next?

I recommend reading the other articles I’ve written about the data mesh architecture. It is fairly easy to get started with this architectural style, and data circles contribute to it.

In my last post, I presented the concept of the data mesh. One thing that is often discussed in regard to the data mesh is how to build an agile architecture with data and microservices. To understand where this needs to go, we must first discuss the architectural quantum.

What is the architectural quantum?

The architectural quantum is the smallest possible item that needs to be deployed in order to run an application. It is described in the book “Building Evolutionary Architectures“. The traditional approach with data lakes was to have a monolith, so the smallest entity was the datalake itself. The goal of the architectural quantum is to reduce complexity. With microservices, this is achieved by decoupling services and making small entities in a shared-nothing approach.

The goal is simplification, and this can only be achieved if there is little to no shared service involved. The original expectation with SOA was to share commonly used infrastructure in order to reduce development effort. However, it led to higher, rather than lower, complexity: when a change to a shared item was necessary, all teams depending on that item had to be involved. With the shared-nothing architecture, items are copied rather than shared and then used independently of each other.

Focus on the business problem

Each solution is designed around the business domain, not around the technical platform. Consequently, this also means that the technology fitting the business problem should be chosen. As of now, most IT departments have a narrow view of technology, so they try to fit the business problem to the technology until the business problem becomes a technical problem. It should be the other way around.

With the data mesh and the architectural quantum, we focus fully on the domain. Since the goal is to reduce complexity (small quantum size!), we don’t re-use technology but select the appropriate one. The data mesh thus only works well if a large set of tools is available, which is typically the case with large cloud providers such as AWS, Microsoft Azure or Google Cloud. Remember: you want to find a solution to the business problem, not create a technical problem.

Why we need data microservices

After microservices, it is about time for data microservices. There are several things that we need to change when working with data:

  • Focus on the business. We don’t solve technical problems, we need to start solving business problems.
  • Reduce complexity. Don’t think in tech porn. Simplify the architecture, don’t over-complicate.
  • Don’t build it. It already exists in the cloud and is less complicated to use than building and running it on your own.
  • No monoliths. We built them for data for decades – replacing a DWH with a datalake didn’t work out.

It is just about time to start doing so.

If you want to learn more about the data mesh, make sure to read the original description of it by Zhamak Dehghani in this blog post.


The datalake has been a significant design concept over the last years whenever we talked about big data and data processing. In recent months, a new concept – the data mesh – has gotten significant attention. But what is the data mesh and how does it impact the datalake? Will it mean a sudden death for the datalake?

The data divide

The data mesh was first introduced by Zhamak Dehghani in this blog post. It is a concept based on different challenges when handling data. Some of the arguments Zhamak is using are:

  • The focus on ETL processes
  • Building a monolith (aka Datalake or Data warehouse)
  • Not focusing on the business

According to her, this leads to the “data divide”. Based on my experience, I can fully subscribe to the data divide. Building a datalake isn’t state of the art anymore, since it focuses too much on building a large system over months to years, while business priorities are moving targets that shift during this timeframe. Furthermore, it locks scarce resources (data engineers) into infrastructure work when they should be creating value.

The datalake was often perceived as a “solution” to this problem. But it was only a technical answer to a non-technical problem: one monolith (the data warehouse) was replaced with another (the datalake). IT folks argued over which was the better solution, but after years of arguing, implementation and failed projects, companies figured out that not much had changed. But why?

The answer to this is simple

The traditional (so-called monolithic) approach focuses on building ETL processes. The challenge behind that is that BI units, which are often remote from the business, don’t have a clue about the business. The teams of data engineers often work in the dark, fully decoupled from the business. The original goal of centralised data units was to harmonise data and remove silos. However, what was created was quite different: unusable data. Nobody had an idea of what was in the data, why it was produced and for what purpose. If there is no understanding of the business process itself, there is hardly an understanding of why the data comes in a specific format and the like.

I like comparisons to the car industry, which is currently in full disruption: traditional car makers focused on improving gas-powered engines. Then Elon Musk came along with Tesla and built a far better car with great acceleration and way lower consumption. That is real change. The same is valid for data: replacing a technology that didn’t work with another technology won’t solve the problem – the process is the problem.

The Data mesh – focus on what matters

Here comes the data mesh into play. It is based loosely on some aspects that we already know:

  • Microservices architecture
  • Service meshes
  • Cloud

One of the concepts of the data mesh that I really like is its focus on the business and its simplicity. Basically, it asks for an architectural quantum, meaning the simplest architecture necessary to run the use case. There are several tools available to use, and it shifts the focus away from building a monolith where a use case might run at some point in time towards doing the use case and using the tools that are available for it to run. And, hey, in the public cloud we have tons of tools for any use case one might imagine, so there is no need to build this platform. Again: focus on the business.

Another aspect that I really like about the data mesh is the shift of responsibility towards the business. With that, I mean the data ownership. Data is provided from the place where it is created. Marketing creates their marketing data and makes sure it is properly cleaned, finance their data and so on. Remember: only business knows best why data is created and for what purpose.

But what is the future role of IT?

So, does the data mesh require all data engineers, data scientists and the like to now move to business units? I would say: it depends. Basically, the data mesh requires engineering to work in multi-disciplinary teams with the business. This changes the role of IT to a more strategic one – but it requires IT to deploy the right people to the projects.

Also, IT needs to ensure governance and standards are properly set. The data mesh concept will fail if there is no smart governance behind it. There is a high risk of creating more data silos and thus doing the data strategy no good. If you would like to read more about data strategy, check out this tutorial on data governance.

Also, I want to stress one thing: the data mesh replaces neither the data warehouse nor the data lake. Tools used and built in those can be reused.

There is still much more to the data mesh. This is just my summary and thoughts on this very interesting concept. Make sure to read Zhamak’s post on it as well for the full details!