Monday, August 26, 2013

BlinkDB - query engine with bounded response times and errors

One problem most people with large data sets face is, unsurprisingly, that the tools that work reasonably well today might not do so in a few weeks or months. The reason is simple: more data means more compute time. The tools may slow down to the point where end users become dissatisfied or stop using them altogether. In many cases this means more work and frustration for the team responsible for maintaining the tools. Enter BlinkDB. BlinkDB is a funded project at the University of California, Berkeley. Its purpose in life is best described by its tag line:

"Queries with Bounded Errors and Bounded Response Times on Very Large Data".

BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade-off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas:
(1) An adaptive optimization framework that builds and maintains a set of multi-dimensional samples from original data over time, and
(2) A dynamic sample selection strategy that selects an appropriately sized sample based on a query’s accuracy and/or response time requirements.
We have evaluated BlinkDB on the well-known TPC-H benchmarks, a real-world analytic workload derived from Conviva Inc. and are in the process of deploying it at Facebook Inc. (Source: BlinkDB homepage)
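
To make the two ideas above a little more concrete, here is a minimal Python sketch of answering an aggregate query from a sample and reporting an error bar, then choosing the smallest sample that meets an accuracy requirement. This is not BlinkDB's code or API: the function names (`approximate_mean`, `pick_sample_size`), the candidate sample sizes, and the use of a plain uniform sample with a normal-approximation 95% confidence interval are all assumptions made for illustration. BlinkDB itself maintains precomputed multi-dimensional samples rather than sampling at query time.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (z = 1.96, normal approximation).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


def pick_sample_size(values, max_relative_error,
                     sizes=(100, 1_000, 10_000, 100_000)):
    """Return (n, estimate, margin) for the smallest candidate sample size
    whose error bar is within max_relative_error of the estimate.

    Falls back to scanning all the data (margin 0.0) if no candidate fits.
    """
    for n in sizes:
        if n > len(values):
            break
        estimate, margin = approximate_mean(values, n)
        if margin <= max_relative_error * abs(estimate):
            return n, estimate, margin
    return len(values), statistics.fmean(values), 0.0


# Hypothetical data standing in for a large table column.
data_rng = random.Random(42)
data = [data_rng.gauss(100, 15) for _ in range(200_000)]

n, estimate, margin = pick_sample_size(data, max_relative_error=0.02)
print(f"sample size {n}: mean = {estimate:.2f} +/- {margin:.2f}")
```

The point of the sketch is the trade: a tighter error bound forces a larger sample and therefore more work, while a looser bound lets the query touch far fewer rows.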

Query Examples




Statistical Error Convergence
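
As a rough illustration of what error convergence means here, the following Python sketch estimates a mean from progressively larger uniform samples and reports the width of the error bar at each size. It is a toy under stated assumptions, not BlinkDB output: the data is synthetic, the helper name `error_by_sample_size` is hypothetical, and the error bar is a normal-approximation 95% confidence interval. For a simple mean, the margin shrinks roughly in proportion to one over the square root of the sample size, so a 16x larger sample gives about a 4x tighter error bar.

```python
import math
import random
import statistics


def error_by_sample_size(values, sizes, z=1.96, seed=1):
    """For each sample size, return (n, estimate, margin), where margin is
    the half-width of an approximate 95% confidence interval for the mean."""
    rng = random.Random(seed)
    rows = []
    for n in sizes:
        sample = rng.sample(values, n)
        margin = z * statistics.stdev(sample) / math.sqrt(n)
        rows.append((n, statistics.fmean(sample), margin))
    return rows


# Synthetic data standing in for a large table column.
data_rng = random.Random(3)
data = [data_rng.gauss(50, 10) for _ in range(100_000)]

for n, estimate, margin in error_by_sample_size(data, (100, 1_600, 25_600)):
    print(f"n = {n:>6}: mean = {estimate:.2f} +/- {margin:.3f}")
```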


This may be the way to deal with ever-growing piles of data when speed is of the essence and the results only need to fall within a statistical error band. For life-critical and other high-stakes queries, where exact answers matter, this may not be the best option.