As with any cloud-hosted system where data analytics is a major focus, one of the questions that arises early on is “how do I effectively get this massive amount of data to the cloud?” After all, it is Big Data that we’re talking about! Collectively, this “cloud data ingestion” often includes real-time data, such as with IoT systems, and batch data bound for a data lake or other storage target.

Real-time data is all about getting essential data elements to the cloud as fast as possible for real-time analytics and reporting operations. Consider that potentially millions of devices are contributors, so performance and scale are key factors. Batch, on the other hand, is all about getting massive amounts of data content into the cloud so that it can be transformed later for application-specific needs. In this situation a strategy for high-fidelity movement of all data, not just application-specific elements, is important.
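To make the contrast concrete, here is a minimal sketch of the kind of compact, application-specific payload a device might emit on the real-time path. The field names and structure are purely illustrative assumptions, not a fixed schema; the full raw record would travel separately on the batch path.

```python
import json
import time
import uuid

def make_telemetry_event(device_id: str, temperature_c: float) -> str:
    """Build a compact payload for the real-time path.

    Only the fields needed for live analytics are included, keeping each
    message small so millions of devices can send them at high rates.
    All field names here are illustrative, not a required schema.
    """
    event = {
        "eventId": str(uuid.uuid4()),   # de-duplication key downstream
        "deviceId": device_id,
        "timestamp": time.time(),       # epoch seconds keeps payloads lean
        "temperatureC": temperature_c,
    }
    return json.dumps(event)

payload = make_telemetry_event("sensor-042", 21.7)
print(payload)
```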

Given these categories of data and their applications, a pattern that many cloud analytics solutions tend to embrace is the “Lambda Architecture”. This pattern addresses the requirements for both batching and streaming of data to the cloud, as illustrated in the diagram below. The Speed Layer in this diagram uses an Event Hub to move data payloads to the cloud, sometimes storing them in an intermediate data layer such as blob storage or a NoSQL database. The Event Hub, along with Stream Analytics, can be used to serve real-time analytics and reporting functions.


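To give a feel for what the Speed Layer computes, here is a toy, in-memory stand-in for the kind of tumbling-window aggregation Stream Analytics performs over an Event Hub stream. The real service does this continuously and at scale over a query language; this sketch just groups timestamped readings into fixed windows and averages each one.

```python
from collections import defaultdict

def tumbling_window_averages(events, window_seconds=60):
    """Group (timestamp, value) events into fixed windows and average each.

    A simplified, batch-style illustration of a streaming tumbling-window
    aggregate; real stream processors emit each window's result as soon
    as the window closes rather than collecting everything in memory.
    """
    buckets = defaultdict(list)
    for ts, value in events:
        # Window index: which fixed-size interval this timestamp falls into.
        buckets[int(ts // window_seconds)].append(value)
    return {
        window * window_seconds: sum(vals) / len(vals)
        for window, vals in sorted(buckets.items())
    }

events = [(0, 10.0), (30, 20.0), (65, 40.0)]
print(tumbling_window_averages(events))  # one average per 60-second window
```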
The Batch Layer of this pattern handles bulk uploads of data, which often manifest as very large data files. For batch uploads of large files, I’ve found that “chunking” the file into smaller, more manageable pieces, uploading them to a cloud Web App, and then pushing the data to a data lake does the trick nicely.
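The chunking step can be sketched in a few lines. This splits a large file into fixed-size parts that could each be uploaded independently; the 4 MB default, the output directory, and the part-naming scheme are all assumptions for illustration, and the cloud side would reassemble or stream the parts into the data lake.

```python
import os

def chunk_file(path, chunk_size=4 * 1024 * 1024, out_dir="chunks"):
    """Split a large file into fixed-size part files for upload.

    Returns the list of part paths in order. Chunk size and naming are
    illustrative; pick values that suit your upload endpoint's limits.
    """
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            data = src.read(chunk_size)
            if not data:  # end of file
                break
            part_path = os.path.join(
                out_dir, f"{os.path.basename(path)}.part{index:05d}"
            )
            with open(part_path, "wb") as dst:
                dst.write(data)
            parts.append(part_path)
            index += 1
    return parts
```

Zero-padded part numbers keep the files in upload order under a simple lexicographic sort, which makes reassembly on the cloud side trivial.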

Another option to keep in mind for particularly large, historical data sets (tens of terabytes, or even petabytes of data) is Microsoft’s Azure Import/Export service, which transfers massive data assets to the cloud in a single shot.