Monday, August 29, 2016

Comparison of MPI, MapReduce and Dryad

Big data is only as good as the value it can deliver to its owners in a timely manner. Hence the speed of search, that is, how quickly the key piece of information can be found, is a constant topic of discussion in any big data shop.

Following are five high-level ways to find information in a big data store. We will focus on the parallel computing option, since it delivers the most value (bang for the buck).
  • Bloom Filter: a Bloom Filter consists of a bit array and a series of Hash functions. The principle is to store Hash values of the data, rather than the data itself, in the bit array, which is in essence a bitmap index that uses Hash functions for lossy compressed storage of data. It has the advantages of high space efficiency and high query speed, but it also has disadvantages: it can misrecognize items (a false positive reports an item as present when it was never stored), and a standard Bloom Filter does not support deletion.
  • Hashing: a method that transforms data into shorter, fixed-length numerical values or index values. Hashing offers rapid reading and writing and high query speed, but it is hard to find a sound Hash function.
  • Index: an index is an effective way to reduce the cost of disk reads and writes and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. Its disadvantage is the additional cost of storing index files, which must be maintained dynamically whenever the data is updated.
  • Trie: also called a trie tree or prefix tree, a variant of Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of a Trie is to use the common prefixes of character strings to reduce string comparisons as far as possible, and so improve query efficiency.
  • Parallel Computing: in contrast to traditional serial computing, parallel computing uses several computing resources simultaneously to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, so as to achieve co-processing. Classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad.
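
To make the Bloom Filter item in the list above concrete, here is a minimal Python sketch. It is an illustration only, not taken from the cited survey: the class name `BloomFilter`, the method `might_contain`, the sizes (1024 bits, 3 hashes), and the use of salted SHA-256 to simulate several hash functions are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter sketch: k hash positions per item in an m-bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are arbitrary illustration values, not tuned for any workload.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Simulate k hash functions by salting one SHA-256 hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Only bits are set; the item itself is never stored.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True only means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
for word in ["mpi", "mapreduce", "dryad"]:
    bf.add(word)
print(bf.might_contain("dryad"))  # True: an added item is never missed
```

Because different items can set the same bits, a lookup for an item that was never added may occasionally return True, and clearing bits to delete one item could erase evidence of another. These are the two disadvantages mentioned in the list.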

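The decompose-and-assign idea behind parallel search can be sketched in a few lines of Python. This is a single-machine illustration with a thread pool, not MPI, MapReduce, or Dryad themselves; the function names `search_chunk` and `parallel_search`, the sample records, and the worker count are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_chunk(args):
    """Map step: scan one chunk and return the global indices that match."""
    offset, chunk, key = args
    return [offset + i for i, record in enumerate(chunk) if key in record]


def parallel_search(records, key, workers=4):
    # Decompose: split the records into roughly equal contiguous chunks.
    size = max(1, -(-len(records) // workers))  # ceiling division
    tasks = [(start, records[start:start + size], key)
             for start in range(0, len(records), size)]
    # Assign: each chunk is searched independently by a worker.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(search_chunk, tasks)
        # Reduce step: merge per-chunk hits; map() preserves input order.
        return [idx for hits in partial for idx in hits]


records = ["alpha mpi", "beta", "gamma mapreduce", "delta mpi dryad", "epsilon"]
print(parallel_search(records, "mpi"))  # [0, 3]
```

In CPython, threads mainly show the structure of the approach rather than a real speedup for CPU-bound scanning. Swapping in `ProcessPoolExecutor`, or distributing the chunks across machines as the models named above do, is where the gain would come from.
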
    Comparison of MPI, MapReduce and Dryad

    • Chen, M., Mao, S. & Liu, Y. Mobile Netw Appl (2014) 19: 171. doi:10.1007/s11036-013-0489-0