Data is often stored in one system, while the analytical systems that process it run somewhere else. This tutorial looks at the challenges of moving data for analysis.

Moving data for analysis

Another issue with Big Data is described by (Alexander, Hoisie, & Szalay, 2011): data can't be moved easily for analysis. Big Data workloads often involve terabytes or more, and moving that much data over a network connection is slow and sometimes impractical.
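To get a feeling for the numbers, here is a rough back-of-the-envelope sketch in Python. It is not taken from the cited paper; the function name, the decimal terabyte definition and the 80% link efficiency are assumptions chosen only for illustration.

```python
def transfer_time_hours(size_terabytes: float,
                        link_gbit_per_s: float,
                        efficiency: float = 0.8) -> float:
    """Estimate how many hours it takes to move data over a network link.

    Assumptions (illustrative only): 1 TB = 10**12 bytes, and only a
    fraction `efficiency` of the nominal link speed is usable in practice.
    """
    size_bits = size_terabytes * 1e12 * 8            # terabytes -> bits
    usable_bits_per_s = link_gbit_per_s * 1e9 * efficiency
    return size_bits / usable_bits_per_s / 3600      # seconds -> hours


if __name__ == "__main__":
    for link in (1, 10):
        hours = transfer_time_hours(10, link)
        print(f"10 TB over {link} Gbit/s: about {hours:.1f} hours")
```

Under these assumptions, 10 TB over a 1 Gbit/s link takes about 27.8 hours, and still close to 3 hours over a 10 Gbit/s link. Real transfers depend on many more factors, such as protocol overhead, disk speed and competing traffic.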

If real-time data is analyzed, moving that amount of data to another cluster is often impractical, because by the time the transfer finishes the data may be outdated or no longer available. Real-time data analysis is also necessary in fraud protection: if the data first has to be moved to another cluster, the result might arrive too late.

In traditional databases, this wasn’t that hard, since the data often amounted to a few gigabytes in a single database. With Big Data, data comes in various formats, at high volume and at high velocity. Handling all of these factors while also moving the data to another cluster might not be possible.

What are the challenges?

(Alexander, Hoisie, & Szalay, 2011) describe some factors that make moving data to another cluster challenging: high-flux data, structured and unstructured data, real-time decisions and data organization.

High-flux data describes data that arrives in real time. If the data must be analyzed, the analysis also has to happen in real time, because the data might be gone or modified at a later point. In Big Data applications, data will arrive structured as well as unstructured.

Decisions on data must often be made in real time. If there is a data stream of financial transactions, an algorithm must decide in real time whether the data needs more detailed analysis. If not all data is stored, an algorithm must also decide whether a given piece of data is stored or not. Data organization is another challenge when it comes to moving data.
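The real-time decision described above could be sketched roughly as follows. This is only a minimal illustration: the `Transaction` type, the `triage` function and both thresholds are hypothetical, and a real fraud system would use far richer rules or models than a simple amount check.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class Transaction:
    """Hypothetical, simplified financial transaction."""
    account: str
    amount: float


def triage(stream: Iterable[Transaction],
           review_threshold: float = 10_000.0,
           store_threshold: float = 100.0) -> Iterator[Tuple[Transaction, str]]:
    """Decide per transaction, as it arrives, what happens to it.

    Both thresholds are made-up values for illustration only.
    """
    for tx in stream:
        if tx.amount >= review_threshold:
            yield tx, "review"    # needs more detailed analysis
        elif tx.amount >= store_threshold:
            yield tx, "store"     # keep, but no further analysis
        else:
            yield tx, "discard"   # not stored at all


if __name__ == "__main__":
    incoming = [
        Transaction("A-1", 25.0),
        Transaction("B-2", 450.0),
        Transaction("C-3", 18_000.0),
    ]
    for tx, decision in triage(incoming):
        print(tx.account, decision)
```

Because `triage` is a generator, each transaction is decided on as soon as it is read from the stream, without first collecting the data or moving it somewhere else.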

I hope you enjoyed the first part of this tutorial about transformable and filterable data. This tutorial is part of the Big Data Tutorial. Make sure to read the entire series.

4 replies
  1. thom says:

    What about architectures that allow for redundant storing of incoming data in all places where it needs to be, be it for analysis or for the sake of DR? Would this be a solution? Are there applications of such an idea? I mean, it would still slow down the real-time availability process, but in essence it would serve the matter, no?

  2. Peter Fretty says:

    According to a recent IDG survey, this is a very real challenge that most IT leaders will admit their organizations fail to excel at. I think a lot of the problem is attributed to maturity and the lack of embracing solid strategies.

    Peter Fretty, IDG blogger working on behalf of SAS
