Any platform that handles growing amounts of data needs scalable data processing. In today’s tutorial we will look at what scalability means for Big Data applications.
Scalability is another factor of Big Data Applications described by (Rys, 2011). Whenever we talk about Big Data, we are almost always talking about systems that have to scale. Each Big Data Application should therefore be built in a way that makes scaling easy. (Rys, 2011) describes several needs for scaling: user load scalability, data load scalability, computational scalability and scale agility.
What is scalable data?
The figure illustrates the different needs for scalability in Big Data environments as described by (Rys, 2011). Many applications such as Facebook (Fowler, 2012) have a lot of users. Applications should support a large user base and should stay resilient to errors when they see unexpectedly high user numbers. Various techniques can be applied to support different needs such as fast data access. A factor that often, but not only, comes with a high number of users is the data load.
(Rys, 2011) describes that this data can be produced by a few users or by many. However, sensors and other devices that do not directly relate to users can also produce large datasets. Computational scalability is the ability to scale computation to large datasets. Data is often analyzed, and this needs compute power on the analysis side. Distributed algorithms such as MapReduce require a lot of nodes in order to run queries and analyses with acceptable performance.
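To make the MapReduce idea mentioned above more concrete, here is a minimal single-process sketch of a word count in Python. It is only an illustration of the map, shuffle and reduce steps, not how a real framework such as Hadoop is implemented. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for this example. In a real cluster, each chunk would be mapped on a different node, which is where the need for many nodes comes from.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    """Run map, shuffle and reduce over a list of text chunks.

    Here the chunks are processed one after another in a single process;
    a distributed framework would hand each chunk to a separate worker.
    """
    mapped = []
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical sample input, split into three chunks.
    sample = ["big data needs scale", "scale out not up", "data data data"]
    print(word_count(sample))
```

Because the map step looks at each chunk independently, adding more nodes lets more chunks be processed at the same time. Only the shuffle step has to move data between nodes.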
Scale agility describes the ability to change the environment of a system. This basically means that new instances, such as compute nodes, can be added or removed on demand. This requires a high level of automation and virtualization and is very similar to what can be done in cloud computing environments. Several platforms such as Amazon EC2, Windows Azure, OpenStack, Eucalyptus and others enable this level of self-service, which is a great support for scale agility in Big Data environments.
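As a rough illustration of adding and removing instances on demand, the following Python sketch computes how many compute instances a system would want for a given load. It does not call any cloud provider API. The function name `desired_instances`, its parameters and the default limits are all hypothetical, and a real autoscaler would use more signals than a single request rate.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance,
                      min_instances=1, max_instances=20):
    """Return the number of instances needed for the current load.

    All parameters are illustrative assumptions:
    - requests_per_second: the load currently observed
    - capacity_per_instance: requests per second one instance can handle
    - min_instances / max_instances: limits that keep scaling within bounds
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    # Round up so that a partially used instance is still provisioned.
    needed = math.ceil(requests_per_second / capacity_per_instance)
    # Clamp the result to the allowed range.
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    for load in (0, 950, 5000):
        print(load, "req/s ->", desired_instances(load, 100), "instances")
```

An automation layer could run such a calculation periodically and then ask the platform to start or stop instances until the actual number matches the desired one.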
I hope you enjoyed the first part of this tutorial about scalable data. This tutorial is part of the Big Data Tutorial. Make sure to read the entire tutorial series.