
When Kappa first appeared as an architecture style (introduced by Jay Kreps), I was really fond of this new approach. I carried out several projects with Kafka as the central component, avoiding the trade-offs of Lambda. But the more complex the projects got, the more I realised that Kappa isn’t the answer to everything and that we ended up with Lambda again … somehow.

Kappa vs. Lambda Architecture

First of all, what is the benefit of Kappa and what is the trade-off with Lambda? It all started with Jay Kreps questioning the Lambda Architecture in his blog post. With the different layers of the Lambda Architecture (Speed Layer, Batch Layer and Serving Layer), you need to use different tools and programming languages. This leads to code complexity and the risk of ending up with inconsistent versions of your processing logic: a change to the logic on one layer requires changes on the other layer as well. Complexity is something we want to remove from our architecture at all times, so we should do the same with data processing.
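
To make this duplication concrete, here is a minimal sketch, assuming PySpark and illustrative paths and topic names; in practice the two layers are often implemented in entirely different frameworks, which makes keeping them in sync even harder:

```python
# Minimal sketch of Lambda-style duplication, assuming PySpark and
# illustrative paths/topics. In real Lambda stacks the two layers are
# often different frameworks (e.g. MapReduce + Storm), worsening the drift.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lambda-duplication").getOrCreate()

def enrich(df):
    # The shared business rule that must stay in sync across both layers.
    return df.withColumn("is_error", F.col("status_code") >= 500)

# Batch layer: reprocesses the full history from cheap storage.
history = spark.read.json("s3a://logs/history/")
enrich(history).write.mode("overwrite").parquet("s3a://logs/batch-view/")

# Speed layer: the same rule again, now against a live Kafka stream
# (requires the spark-sql-kafka connector on the classpath).
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "logs")
    .load()
    .select(F.from_json(F.col("value").cast("string"),
                        "status_code INT").alias("r"))
    .select("r.*")
)
enrich(stream).writeStream.format("memory").queryName("speed_view").start()
```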

The Kappa Architecture came with the promise to put everything into one system: Apache Kafka. The speed at which data can be processed with it is tremendous, and the simplicity is great as well. You only need to change code once, not twice or three times as with Lambda. This leads to cheaper labour costs too, as fewer people are necessary to maintain and produce code. Also, all our data is available at our fingertips, without the major delays of batch processing. This brings great benefits to business units, as they no longer need to wait forever for processing.
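
The core idea fits in a few lines; a minimal sketch, assuming the kafka-python client and an illustrative topic and broker:

```python
# Minimal sketch of the Kappa idea, assuming kafka-python and an
# illustrative topic/broker. Reprocessing simply means replaying the
# immutable log from offset zero under a new consumer group.
import json
from kafka import KafkaConsumer

def process(event):
    # Placeholder for the single processing path; replace with real logic.
    print(event.get("sensor_id"), event.get("value"))

consumer = KafkaConsumer(
    "sensor-events",
    bootstrap_servers="broker:9092",
    group_id="processing-v2",          # new group = fresh offsets
    auto_offset_reset="earliest",      # start from the oldest record
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# One code path serves "historical" and "live" data alike: the consumer
# first works through the backlog, then keeps running on new events.
for record in consumer:
    process(record.value)
```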

So what is the problem with the Kappa Architecture?

However, my initial statement was about something else: that I have come to mistrust the Kappa Architecture. I implemented this architecture style in several IoT projects, where we had to deal with sensor data. There was no question whether Kappa was the right choice, as we were working on a rather isolated use-case. But as soon as you have to look at a Big Data architecture for a large enterprise (and not only at isolated use-cases), you end up with one major issue around Kappa: cost.

In use-cases where data doesn’t need to be available within minutes, Kappa seems to be overkill. Especially in the cloud, Lambda brings major cost benefits through object storage in combination with automated processing capabilities such as Azure Databricks. In enterprise environments cost does matter, and an architecture should also be cost-efficient. This also holds true when it comes to the half-life of data, which I recently wrote about: data that loses its value fast should be stored on cheap storage systems from the very beginning.

Cost of Kappa Architecture

An easy way to compare Kappa to Lambda is a comparison per terabyte stored or processed. Let’s use a scenario where 32 TB are stored. With a Kappa Architecture running 24/7, we would spend an estimated $16,000 per month (pay-as-you-go pricing, no discounts or reserved instances; E64 instances with 64 cores and 432 GB RAM per node, plus E80 SSDs with 32 TB per disk). If we used Lambda and only processed once per day, we would need 32 TB on a blob store, which costs $680 per month. Taking the cluster above for processing with Spark and using it one hour per day adds $544. Summing up, this equals $1,224 per month – a cost ratio of roughly 1:13.
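
For transparency, here is the same back-of-the-envelope calculation as a few lines of Python, using the prices quoted above:

```python
# Back-of-the-envelope comparison with the prices quoted above
# (pay-as-you-go, no discounts or reserved instances).
kappa_cluster_24_7 = 16_000       # $/month, cluster running around the clock
blob_storage_32tb = 680           # $/month, 32 TB on a blob store
spark_one_hour_daily = 544        # $/month, same cluster used ~1 h/day
lambda_total = blob_storage_32tb + spark_one_hour_daily

print(f"Kappa : ${kappa_cluster_24_7:,}/month")
print(f"Lambda: ${lambda_total:,}/month")
print(f"Ratio : 1:{kappa_cluster_24_7 / lambda_total:.0f}")  # ~1:13
```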

However, this is a very simple calculation, and it can still be optimised on both sides. In the broader enterprise context, Kappa is only a specialisation of Lambda and won’t exist on its own all the time. Kappa vs. Lambda can only be decided per use-case, and that is what I recommend you do.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.

One of my 5 predictions for 2019 is about Hadoop. Basically, I expect that a lot of projects won’t opt for Hadoop as a full-blown solution anymore. Why is that? What is the future of Hadoop?

What happened to the future of Hadoop?

One of the most exciting pieces of news in 2018 was the merger between Hortonworks and Cloudera. The two main competitors joining forces? How could this happen? I believe that much of it didn’t come out of strength, or because the two somehow started to “love” each other, but rather out of economic calculations. The competition isn’t Hortonworks vs. Cloudera anymore (it wasn’t even before the merger); it is rather Hadoop vs. new solutions.

These solutions are highly diversified – Apache Spark is one of the top competitors. But there are also other platforms such as Apache Kafka, some NoSQL databases such as MongoDB, and TensorFlow emerging. One could argue that all of that is included in a Cloudera or Hortonworks distribution, but it isn’t as simple as that. The founders of Spark and Kafka provide their own distributions of their stacks, more lightweight than the complex Hadoop stack. In several use-cases it is simply not necessary to have a full-blown solution; a lightweight one will do.

The Cloud is the real threat to Hadoop

But the real threat comes from something else: the cloud. Hadoop has always run better on bare metal, and both pre-merger companies still argue that this is the case. Other solutions such as Spark perform better in the cloud and are built for it. This is the real threat to Hadoop, since the cloud is simply something that won’t go away now, with most companies switching to it.

Object stores provide a great and cheap alternative to HDFS, and the management of object stores is way easier. I only call them an alternative here because object stores still miss several enterprise features. However, I expect that the large cloud providers such as AWS and Microsoft will invest significantly in this space and deliver great additions to their object stores as early as this year. Object stores in the cloud will catch up fast – and probably surpass HDFS functionality by 2020. If this happens and the cost benefits remain better than bare-metal Hadoop, there is really no need for it anymore.
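
As an illustration of what replacing HDFS with an object store looks like in practice, here is a minimal sketch, assuming PySpark with the s3a connector and illustrative bucket names:

```python
# Minimal sketch, assuming PySpark with the hadoop-aws (s3a) connector and
# illustrative bucket names; on Azure the path would be abfss:// instead.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("objectstore-instead-of-hdfs")
    # Credentials from the environment; the provider class is an assumption
    # about the deployment, not a requirement.
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
    .getOrCreate()
)

# Read straight from the object store -- no HDFS cluster to operate.
events = spark.read.parquet("s3a://my-data-lake/events/")
daily = events.groupBy("event_date").count()
daily.write.mode("overwrite").parquet("s3a://my-data-lake/aggregates/daily/")

# The compute cluster behind this job can be created per run and torn down
# afterwards; storage and compute scale independently.
spark.stop()
```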

On the analytics layer, the cloud is also way superior. Running dynamic Spark jobs against data in object stores (or managed NoSQL databases) is impressive. You no longer have to manage clusters, which takes a lot of pain and headache away from large IT departments and will increase performance and development speed. Another disadvantage I see for the leading Hadoop vendors is their sales force: salespeople are better compensated for on-premise deals, so they try to steer companies away from the cloud – which isn’t the best strategy in 2019.

What about enterprise adoption of Hadoop?

However, there is still some hope around enterprise integration, which is often handled better by Hadoop distributions. And even though the entire world is moving to the cloud, there are still many legacy systems running on-premise. Also, after the HWX/Cloudera merger, their stated mission became to be the leading company for big data in the cloud. If they fully execute on this, I am sure there is a huge market ahead of them – and the threats described initially could even be averted. Let’s see what 2019 and 2020 will bring in this respect and what the future of Hadoop might look like.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.

Another year has passed, and 2018 was a really great year for the cloud. We can now really say that the cloud is becoming a commodity and common sense. After years of arguing over why the cloud is useful, this discussion is now gone: nobody doubts the benefits of the cloud anymore. In the next year, most developments that already started in 2018 will continue and intensify. My predictions for 2019 won’t be revolutionary, but rather the trends we will see over the short period of this year. Therefore, my 5 predictions for 2019 are:

1. Strong growth in the cloud will continue, but it won’t be hyper growth anymore

In the past years, companies such as Amazon and Microsoft saw significant growth rates in their cloud business. These numbers will still go up by large, double-digit growth rates for all major cloud providers (not just those two). However, overall growth will be slower than in previous years as the market matures. To win market share, it will no longer be enough to simply offer cloud products; it now takes a significant market presence and sales force in every market to win against the competition. Also, more companies are now looking for a dual-vendor cloud strategy in order to overcome potential vendor lock-in. To make this easier for customers, cloud companies will offer more open source products in the cloud, which gives them additional arguments against vendor lock-in.

2. PaaS Solutions will now see significant uptake, driven by containerisation

Containerised services have been around for some years now, and services such as AWS Lambda or Azure Functions are really great solutions for building web-based services. However, many (traditional) companies still struggle with this approach and with how to use these services. As software engineers and architects experiment with them, they will bring them back into enterprise environments. Therefore, significant growth in these services will happen over the next months. This will partially eat up the share of pure IaaS services, but it is the path by which the cloud’s service stack becomes more mature as well.

3. The number of domain-specific SaaS solutions will grow significantly

Just like software products once emerged on the Windows platform, or apps emerged on the iPhone and on Android, new SaaS solutions have emerged in the cloud over the last years. This trend will now speed up, with more domain-specific platforms (e.g. for finance, marketing or the like) becoming more popular and widely accepted. Business functions in traditional companies won’t question the cloud the way IT departments do, so these solutions will see fast growth. IT departments have to accept this development and shouldn’t block it.

4. Cloud will become local

All major cloud providers are continuing to invest significantly in data centers around the world, and this trend won’t stop in 2019. The first data centers were built in the US; later on we saw data centers in APAC and Europe. Until a few years ago, all major cloud providers had two data centers in Europe, simply called “North Europe”, “West Europe” or the like. In the last one or two years, this changed towards data centers dedicated to specific markets (but still usable by others). These markets now include Germany, the UK, France, Sweden, Italy and Switzerland. Many more data centers will emerge, also covering smaller markets as maturity grows. The first data centers have opened in Africa, a very interesting market with huge but still underestimated potential. As for Europe, I see the CEE and SEE markets not covered well and would expect a dedicated CEE or SEE data center to open in the next one to two years.

5. Google will catch up fast in the cloud, mainly driven by its strength in the AI space

When it comes to the cloud, the #3 in the market is definitely Google. They entered the market somewhat later than AWS and Microsoft did. However, they offer a very interesting portfolio and competitive pricing. A key strength of Google is its AI and analytics services, as the company itself is very data-driven. Google knows how to handle and analyse data far better than the other two, so it is very likely that Google will use this advantage to gain share from its competitors. I am excited about the next Google I/O and what will be shown there in terms of analytics and AI.

These are just my ideas about the Cloud and what we will see in the next year. What is your opinion? Where do you agree or disagree? Looking forward to your comments!

Now you probably think: has Mario gone crazy? In this post, I will explain why the cloud is not the future.

First, let’s have a look at the economic facts of the cloud. If we look at the share prices of companies providing cloud services, it is rather easy to say: those shares are skyrocketing! (Leaving aside recent drops in some shares, which are market dynamics rather than real valuations.) The same goes for overall company performance: the income of companies providing cloud services has increased a lot. Have a look at the major cloud providers such as AWS, Google, Oracle or Microsoft: they now make quite a lot of their revenue with cloud services. So obviously, my initial statement seems to be wrong here. Why did I choose it then? Still crazy?

Let’s look at this from another angle: it might all be about technology, right? I was recently playing with AWS API Gateway and AWS Lambda. Wow, how easy it is to write a great API! I could program an API for an Android app in a few hours, and deployment was easy. Remember when you first had to deploy your full stack for this? Make sure all libraries were set up, and the like? Another example: data analytics. Currently, much of this is moving from “classical” Hadoop-backed HDFS to decoupled architectures (object stores as the “data lake” and Spark for compute/analytics). This also clearly favours the cloud, because storage and compute can be scaled individually and utilisation is easier to handle: when you need more compute power, you spin up new instances and disconnect them again when you are done. This simply can’t be done on-premise or in a private cloud, since the available capacity is calculated to match some corporate requirement.
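
To give an impression of how little code such an API needs, here is a sketch of a handler behind API Gateway’s Lambda proxy integration; the route and field names are illustrative, not the API I actually built:

```python
# Sketch of a minimal API Gateway + Lambda handler (proxy integration).
# Route and field names are illustrative assumptions.
import json

def lambda_handler(event, context):
    # Proxy integration delivers query parameters in the event dict.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```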

But what else? Let’s look at how new applications and services are developed. Nowadays, almost every service is developed “cloud first”, which means it isn’t available without the cloud, or only becomes available at a very late stage and with substantial delay. So if you want to stay ahead in innovation, it is necessary to embrace the cloud. And please don’t tell me that you would rather wait because it isn’t necessary to be among the first to move. Answer: of course it is fine to wait – until your business is dead ;).

So there are no real points against the cloud – why then did I formulate the title like this? Provocation? Clickbait? No: the cloud is not the future, it is the present!

As 2016 is around the corner, the question is what this year will bring for Big Data. Here are my top assumptions for the year to come:

  • The growth of relational databases will slow down, as more companies will evaluate Hadoop as an alternative to the classic RDBMS
  • The Hadoop stack will get more complicated, as more and more projects are added. It will almost take a team to understand what each of these projects does
  • Spark will lead the market for handling data. It will change the entire ecosystem again.
  • Cloud vendors will add more and more capability to their solutions to deal with the increasing demand for workloads in the cloud
  • We will see a dramatic increase of successful use-cases with Hadoop, as the first projects come to a successful end

What do you think about my predictions? Do you agree or disagree?

2016 is around the corner and the question is what the next year might bring. I’ve added my top 5 predictions that could become relevant for 2016:

  • The Cloud war will intensify. Amazon and Azure will lead the space, followed (with quite some distance) by IBM. Google and Oracle will stay far behind the leading 2+1 Cloud providers. Both Microsoft and Amazon will see significant growth, with Microsoft’s growth being higher, meaning that Microsoft will continue to catch up with Amazon
  • More PaaS solutions will arrive. All major vendors will provide PaaS solutions on their platforms for different use-cases (e.g. the Internet of Things). These solutions will become more industry-specific (e.g. a solution specific to manufacturing workflows, …)
  • Vendors currently not using the cloud will see declines in their income, as more and more companies move to the cloud
  • Cloud data centers will more often be outsourced by the leading providers to local companies, in order to comply with local legislation
  • Big Data in the cloud will grow significantly in 2016, as more companies move these kinds of workloads to the cloud

What do you think? What are your predictions?

Despite all those collaboration and cloud services, a lot of us have found that working together has not become much easier since their introduction. As every organisation today uses its own infrastructure, either self-hosted or an online service, the borders have only moved; they have not become transparent where needed. The walls between collaborating organisations are as strong as ever.

SPHARES is here to change this.

“We are allowing sharing like Dropbox, but between different systems – even hosted on your own systems.” – Dietmar Gombotz, CEO of SPHARES

SPHARES is a small start-up team of five from Vienna with the mission to make working life and collaboration much easier by providing a tool that lets you integrate different work environments without having to actually change tools.

It works as a service integrator between different systems in the background. The sync engine lets you transparently share data with colleagues who use different systems (or even the same one) as you.

It currently supports one-way and two-way synchronisation between heterogeneous systems.

“Our goal is to make sharing between organisations as easy as sitting beside each other in the same office, even at the same desk.” – Hannes Schmied, BizDev SPHARES

Overview SPHARES

SPHARES either runs on your own server or is hosted online for you on a dedicated virtual machine. It allows you to integrate your partners directly via your own server, where you control the environment. Even if you have a virtual machine from us, we will not have access to the users’ data – and neither will you: we secured the communication with double encryption.

Current use-cases SPHARES focuses on

  • Marketing Agencies for collaborator integration
  • Tax advisors in the digital age
  • Unique System Integration for integrating bigger solutions
  • Technology provision for platforms

SPHARES provides the system either under a service agreement with a monthly fee – including all costs for the license, updates and support handling via the web interface – or as a technology license with a one-time fee plus maintenance.

If you are interested, simply drop the team a line at office@sphares.com and they will get back to you ASAP.

Self-driving cars are gaining more and more momentum. In 2014, Tesla introduced the “Autopilot” feature for its Model S, which allows autonomous driving. The technology for self-driving cars has been around for years though – there are other reasons why it is still not here: it is mainly a legal question, not a technical one.

However, autonomous systems will be here a few years from now, and they will have a positive impact on cloud computing and big data. The use-cases were already partially described in an earlier post on smart cities, but there are several others. One positive effect of self-driving cars is improved safety: sensors need milliseconds to react to threats, whereas humans need about a second. This leaves more time for better reactions. Autonomous systems can also communicate with other cars and warn them in advance; this is called Vehicle-to-Vehicle communication. But communication also happens with infrastructure (called Vehicle-to-Infrastructure communication). A street, for instance, can warn the car about problems ahead – e.g. that the road surface itself is getting worse.
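
Purely as an illustration of what such a warning could carry – real V2V communication uses standardised message sets (e.g. ETSI ITS, SAE J2735), not this ad-hoc schema:

```python
# Illustrative only: one possible shape of a vehicle-to-vehicle hazard
# warning. Field names are assumptions, not a standardised message format.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class HazardWarning:
    sender_id: str
    lat: float
    lon: float
    hazard: str       # e.g. "sudden_braking", "road_surface_damage"
    timestamp: float  # seconds since epoch

warning = HazardWarning("car-42", 48.2082, 16.3738, "sudden_braking", time.time())
payload = json.dumps(asdict(warning))  # would be broadcast to nearby cars
print(payload)
```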

The car’s IT itself doesn’t need the cloud and big data – but the services around it will heavily use cloud and big data services.

Self-driving cars also bring a side-effect: smart logistics. Smart logistics means fully automated logistics devices that drive without the need for a driver and deliver goods to a destination. This can start in China with a truck that brings a container to a ship. The ship is also fully automated and operates independently. It sails to New York, where the goods are picked up by a self-driving truck again. The truck brings the container to a distribution center, where robots unload it and drones deliver the goods to the customers. All of this is handled by cloud and big data systems that often operate in real time.

According to various sources, we are in the middle of the so-called 4th industrial revolution. This revolution is basically led by a very high degree of automation and IT systems. Until recently, IT played mainly a supporting role in industry, but with new technologies this role will change dramatically: IT will lead the industry. Industry 4.0 (or Industrie 4.0) is mainly led by Germany, which places a high bet on the topic. Germany’s industrial output is high, and in order to maintain its global position, German industry has to – and will – change dramatically.

Let’s first look at the past industrial revolutions:

  • The first industrial revolution took place in the 18th century, when the mechanical loom was introduced.
  • The second industrial revolution took place in the early 20th century, when assembly lines were introduced.
  • The third industrial revolution took place in the 1970s and 1980s, when machines could work on repeatable tasks and robots were first introduced.

The 4th industrial revolution is now led dramatically by the IT industry. It is not only about supporting the assembly lines but about replacing them. Customers can define their own products and make them truly individual. Designers can offer templates in online stores, and the product then knows how it will be produced: it selects the factory in which it will be produced and tells the machines how it should be handled.

Everything in this process is fully automated. It starts with ordering something online. The transportation process is automated as well – autonomous systems deliver individual parts to the factories, and this goes well beyond traditional just-in-time delivery. This is also a democratisation of design: just like individuals can now publish their books as e-books without a publisher, designers can offer their designs on new online platforms. This opens new opportunities for designers as well as customers.

As with smart homes and smart cities, this not only produces a lot of data – it also requires sophisticated back-end systems in the cloud that take care of these complex processes. Business processes need to be adjusted to the new challenges, and they are more complex than ever. This can’t be handled by one single system; it needs a complex system running in the cloud.

By Dietmar Gombotz, CEO and Founder of Sphares

With the introduction and growth of different cloud and software-as-a-service offerings, a rapid transition driven by the blending of professional and personal space has taken shape. Not only are users all over the world using modern, flexible, new products like Dropbox at home, they want the same usability and “ease of use” in the corporate world. This of course conflicts with internal policies and external compliance requirements, especially when data is shared through such tools.

I will focus mainly on the aspect of sharing data (usually in the form of files, though it could be other data objects like calendar information or CRM data).

Many organizations have not yet formulated a consistent and universal strategy for how to handle this aspect of their daily work. We assume an organizational structure where data sharing with clients/partners/suppliers is a regular process, which will surely be the case in more than 80% of all businesses nowadays.

There are different strategies to handle this:

No Product Policy
The most well-known policy is to not allow the usage of modern tools and to keep to internal infrastructure or in-house-built tools.

Pro: data storage is 100% transparent, no need for further clarification

Con: an unrealistic expectation, especially in fields with a lot of data sharing; email will be used to transfer data to partners anyway, so the data ends up distributed across multiple places and stages


One-Product Policy

The most widely used proactive policy is to define one solution (e.g. “we use Google Drive”), for which a business account is taken out or which can be installed on your own hardware (ownCloud, …).

Pro: data storage can be defined, employees have access to a working solution, clarifications are not needed

Con: partners need accounts on this system and have to make an extra effort to integrate it into their processes

Product-As-You-Need

Often seen at small shops: they use whatever their partners are using and get accounts whenever their partners propose a solution. They usually have a preferred product, but will switch whenever the client wants to use something else.

Pro: no need for adjustment on the partner’s side

Con: dozens of accounts, often tied to private accounts with no central control; data will be copied into internal systems, as with email


Usage of Aggregation Services

The organization uses the “Product-As-You-Need” approach combined with aggregation tools like JoliCloud or CloudKafe.

Pro: no need for adjustment on the partner’s side; one view of the data on the company’s side

Con: data still sits in dozens of systems and on private accounts (no central control); integration into processes is not possible, as the data stays on the different systems


Usage of Rule-Engines

There are a couple of rule engines like IFTTT (If This Then That) or Zapier that can help you connect different tools and trigger actions, much like the filter rules you are used to in e-mail inboxes. In combination with a preferred tool, this can be a valid way to get data pre-processed and put into your system; a toy sketch of the idea follows after the pro/con below.

Pro: Rudimentary integration with different systems, employees stay within their system

Con: usually one-way actions, so updates do not get back to your partners; usually set up on a per-user basis, so no central control is possible
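
Conceptually, such a rule is just a trigger paired with an action. Here is a toy sketch in Python; both functions are stand-ins for real service APIs:

```python
# A toy version of the "if this then that" idea: a rule is a trigger
# paired with an action. Both functions here are stand-ins; a real
# engine would call the respective services' APIs.
import time

def new_files_in_cloud_share():
    # Stand-in trigger: a real rule would poll or subscribe to the service.
    return ["contract_v2.pdf"]

def copy_into_internal_system(filename):
    # Stand-in action: a real rule would push into the preferred tool.
    print(f"copied {filename} into the internal system")

RULES = [(new_files_in_cloud_share, copy_into_internal_system)]

for _ in range(3):  # a real engine would loop forever
    for trigger, action in RULES:
        for item in trigger():
            action(item)
    time.sleep(1)
```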

Service Integration

Service integration allows the sharing of data via an intermediate layer. There are solutions that synchronise data (SPHARES), thereby ensuring data consistency. Additionally, there are services that connect to multiple cloud storage facilities to retrieve data (Zoho CRM).

Pro: data is integrated into processes; everybody stays within the system they use

Con: additional cost for the integration service