A big data analytics pipeline-based workload specification (8 steps)

The eight-step pipeline described below is a typical use case for current big data workloads. You can use it to calibrate the processes in your own shop.

Step 1: Collect “user” interaction data and ingest it into the big data platform(s). User interaction logs are collected in time order, as close as possible to the systems that enable these interactions, such as web servers, call centers, or any other medium that allows such interaction. If the “user” is a machine, the syslog/sensor data collector aggregates these interaction events on local storage very near the individual collectors. These “logs” are then ingested into the big data platforms in the same format the collector produced, or with very few transformations, such as timestamp corrections. The logs are ordered by the time they were recorded, i.e., by timestamp, at some granularity.
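
As a minimal sketch of the "very few transformations" mentioned in Step 1, the snippet below normalizes a collector's local timestamp to UTC and leaves the rest of the line untouched. The log line layout (timestamp, source, payload separated by spaces) and the function name ingest_line are assumptions made for illustration; the source does not prescribe a format.

```python
from datetime import datetime, timezone

def ingest_line(line):
    """Split one raw log line and normalize its timestamp to UTC.

    Assumed (hypothetical) layout: "<ISO timestamp with offset> <source> <payload>".
    The payload is kept exactly as the collector wrote it.
    """
    raw_ts, source, payload = line.rstrip("\n").split(" ", 2)
    local = datetime.strptime(raw_ts, "%Y-%m-%dT%H:%M:%S%z")
    return {
        "ts": local.astimezone(timezone.utc),  # timestamp correction
        "source": source,
        "payload": payload,                    # no further parsing at ingest
    }

record = ingest_line("2013-03-01T12:00:00+0530 web01 GET /news/42")
```

Parsing of the payload is deliberately left for later steps, which matches the late-binding approach described in Step 3.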

Step 2: Reorder the logs/events according to the entity of interest, with secondary ordering by timestamp. Thus, a syslog collection is initially ordered by local timestamps, which are converted to global timestamps, and is then ordered (or sessionized) by machine identifier. Similarly, users' click/view streams are initially ordered as the website (httpd) logs recorded them, but have to be reordered and sessionized by end user, identified by, say, a browser cookie or other user identification information.
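
The reordering in Step 2 can be sketched as a group-by on the entity followed by a sort on the timestamp, with a new session started whenever the gap between consecutive events is too large. The 30-minute inactivity threshold and the (entity_id, ts_seconds, payload) tuple layout are assumptions for this example, not something the source specifies.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold, not from the source

def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group events by entity, order them by time, and cut sessions at long gaps.

    events: iterable of (entity_id, ts_seconds, payload) tuples in any order.
    Returns {entity_id: [session, ...]}, where each session is a list of
    events in ascending timestamp order.
    """
    by_entity = defaultdict(list)
    for event in events:
        by_entity[event[0]].append(event)

    sessions = {}
    for entity, evs in by_entity.items():
        evs.sort(key=lambda e: e[1])        # secondary ordering by timestamp
        current = [evs[0]]
        entity_sessions = [current]
        for prev, ev in zip(evs, evs[1:]):
            if ev[1] - prev[1] > gap:       # long silence: start a new session
                current = []
                entity_sessions.append(current)
            current.append(ev)
        sessions[entity] = entity_sessions
    return sessions
```

On a real platform this would be a distributed shuffle keyed by the entity identifier; the in-memory version only shows the ordering logic.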

Step 3: Join the “fact tables” with various other “dimension tables.” This involves parsing the event data (other than timestamp/user identification) and extracting the event-specific information that forms the features of each event. This is the step that incorporates the late binding often encountered in big data applications, since the features to be extracted may differ from one application to another. For example, on a news aggregator site, this may involve distilling the URL of a news item into news topics, or, in the case of machine logs, identifying the specific machine services indicated by the log message.
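
A small illustration of the join in Step 3, using the news aggregator example: each event (a fact row) is looked up in a URL-to-topics dimension table at read time. The table contents, the field names, and the function extract_features are hypothetical.

```python
# Hypothetical dimension table mapping a news URL to its topics.
URL_TOPICS = {
    "/news/42": ["politics"],
    "/news/77": ["sports", "local"],
}

def extract_features(event, dimension=URL_TOPICS):
    """Left-join one event against the dimension table.

    The raw event is left unchanged; features are bound late, so a
    different application can pass a different dimension table.
    URLs missing from the table get an empty topic list.
    """
    enriched = dict(event)
    enriched["topics"] = dimension.get(event["url"], [])
    return enriched

click = {"user": "u1", "ts": 1362119400, "url": "/news/77"}
features = extract_features(click)
```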

Step 4: Identify the events of interest that one plans to correlate with other events in the same session for each entity. For an ad-funded service, the target events are ad clicks; for datacenter management systems, the target events are abnormalities in the logs, such as machine failures; for Facebook-like systems, the target events are “likes”; in Twitter-like systems, the target events are re-tweets/favorites; in LinkedIn-like systems, the target events are connections being established; in various subscriber-based organizations, the target events are opening an account, signing up for notification lists, etc.
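
One way to picture Step 4 is as a predicate applied to every event in a session: a session is labeled positive if it contains at least one target event. The event layout and the names label_sessions and is_ad_click are assumptions for this sketch.

```python
def label_sessions(sessions, is_target):
    """Pair each session with True if any of its events is a target event."""
    return [(session, any(is_target(e) for e in session)) for session in sessions]

def is_ad_click(event):
    # Hypothetical predicate for an ad-funded service.
    return event.get("type") == "ad_click"

sessions = [
    [{"type": "view"}, {"type": "ad_click"}],
    [{"type": "view"}, {"type": "view"}],
]
labeled = label_sessions(sessions, is_ad_click)
```

Swapping the predicate (machine failure, "like", re-tweet, new connection) adapts the same labeling to the other services listed above.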


Step 5: Build a model for favorable/unfavorable target events based on the past session information. Various modeling techniques are employed in this step and, depending upon the sophistication of the modeling team and the platform, increasingly complex models may be built. However, quite often, an ensemble of simple models is the preferred approach.
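
The source does not say how the simple models in Step 5 are combined. As one possible illustration, the sketch below uses an unweighted majority vote over boolean rules; the rules themselves and the feature names are made up for the example.

```python
def majority_vote(models):
    """Combine boolean models into one that predicts True when more than
    half of the members predict True (unweighted majority vote)."""
    def ensemble(features):
        votes = sum(1 for model in models if model(features))
        return votes * 2 > len(models)
    return ensemble

# Three deliberately simple, hypothetical rules over session features.
rules = [
    lambda f: f["pages_viewed"] >= 3,
    lambda f: f["minutes_on_site"] >= 5,
    lambda f: f["returning_user"],
]
predict = majority_vote(rules)
likely = predict({"pages_viewed": 4, "minutes_on_site": 2, "returning_user": True})
```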

Step 6: Score the models built in the previous step against the hold-out data. Hold-out data is the portion of the dataset available for training that is set aside and used only to validate the models, not to train them.
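
A hold-out split as described in Step 6 can be sketched as follows. Assigning entities to the hold-out set by hashing their identifier, the 20% fraction, and accuracy as the validation metric are all choices made for this example; the source names none of them.

```python
import hashlib

def is_holdout(entity_id, holdout_fraction=0.2):
    """Deterministically assign an entity to the hold-out set by hashing its id."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < holdout_fraction * 2**64

def split(examples, holdout_fraction=0.2):
    """Split examples so that each entity lands entirely in train or in hold-out."""
    train, holdout = [], []
    for ex in examples:
        target = holdout if is_holdout(ex["entity_id"], holdout_fraction) else train
        target.append(ex)
    return train, holdout

def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the recorded label."""
    if not examples:
        raise ValueError("no examples to score")
    correct = sum(1 for ex in examples if model(ex["features"]) == ex["label"])
    return correct / len(examples)
```

Hashing the entity identifier keeps the split stable across reruns and prevents one entity's sessions from appearing on both sides of the split.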

Step 7: Assuming validation against the hold-out data passed, apply the models to the initial entities that did not produce the target event. For example, on the news aggregation site, since the target event was a click on a news item, this model-scoring step is applied to all users who did not click on any news item they were shown.
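
Step 7 amounts to a filter followed by scoring. In the sketch below, the entity record layout (entity_id, features, had_target_event) and the toy model are hypothetical.

```python
def score_non_converters(entities, model):
    """Score only the entities whose sessions did not contain the target event.

    Returns {entity_id: score}; entities that already produced the
    target event are skipped.
    """
    return {
        e["entity_id"]: model(e["features"])
        for e in entities
        if not e["had_target_event"]
    }

users = [
    {"entity_id": "u1", "features": {"sports": 1.0}, "had_target_event": True},
    {"entity_id": "u2", "features": {"sports": 0.5}, "had_target_event": False},
]
scores = score_non_converters(users, lambda f: f.get("sports", 0.0) * 2)
```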

Step 8: Publish the model for each user to the online serving system so that the model can be applied to that user's activities in real time.
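
To make Step 8 concrete, the sketch below treats the online serving system as a key-value store and the per-user model as a set of linear weights. Both are assumptions: a plain dict stands in for the store, and the key scheme, JSON encoding, and function names are invented for the example.

```python
import json

def publish_model(store, user_id, weights):
    """Write one user's model (here: feature weights) to the serving store."""
    store[f"model:{user_id}"] = json.dumps(weights)

def score_online(store, user_id, features):
    """Apply the published model to a live event; None if no model was published."""
    raw = store.get(f"model:{user_id}")
    if raw is None:
        return None
    weights = json.loads(raw)
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

store = {}  # stand-in for the online serving system
publish_model(store, "u2", {"sports": 2.0, "politics": -1.0})
live_score = score_online(store, "u2", {"sports": 1.0, "politics": 1.0})
```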


  •  Chaitanya Baru, Milind Bhandarkar, Raghunath Nambiar, Meikel Poess, and Tilmann Rabl. Benchmarking Big Data Systems and the BigData Top100 List. Big Data. March 2013, 1(1): 60-64. doi:10.1089/big.2013.1509.