Monday, August 29, 2016

Comparison of MPI, MapReduce and Dryad

Big data is only as good as the value it can deliver, in a timely manner, to its owners. Hence the speed of searching the data, or of finding the key piece of information, is always a topic of discussion in any big data shop.

Following are the top five high-level ways to find information in a big data store. We will focus on the parallel computing option, since that delivers the most value (bang for the buck).
  • Bloom Filter: a Bloom Filter consists of a series of hash functions. The principle is to store hash values of the data, rather than the data itself, in a bit array, which is in essence a bitmap index that uses hash functions to perform lossy, compressed storage of the data. It has the advantages of high space efficiency and high query speed, but the disadvantages of false positives and no support for deletion (a minimal sketch appears after this list).
  • Hashing: a method that essentially transforms data into shorter, fixed-length numerical or index values. Hashing offers rapid reading and writing and high query speed, but it is hard to find a sound hash function.
  • Index: an index is always an effective way to reduce the cost of disk reads and writes and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in technologies that manage semi-structured and unstructured data. The disadvantage is the additional cost of storing the index files, which must be maintained dynamically as the data is updated.
  • Trie: also called a trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word-frequency statistics. The main idea of a trie is to exploit common prefixes of strings to reduce string comparisons as much as possible, thereby improving query efficiency.
  • Parallel Computing: compared to traditional serial computing, parallel computing refers to using several computing resources simultaneously to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, so as to achieve co-processing. Presently, the classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see the word-count sketch after this list).
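
To make the Bloom Filter bullet concrete, here is a minimal Python sketch. The bit-array size, the number of hash functions, and the use of salted SHA-1 digests are arbitrary choices for illustration, not a recommendation.

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: stores hash positions in a bit array, not the data itself."""

        def __init__(self, num_bits=1024, num_hashes=3):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = [False] * num_bits

        def _positions(self, item):
            # Derive one bit position per hash by salting the item with the hash index.
            for i in range(self.num_hashes):
                digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = True

        def might_contain(self, item):
            # False positives are possible; false negatives are not.
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("user:42")
    print(bf.might_contain("user:42"))   # True
    print(bf.might_contain("user:999"))  # almost certainly False, though a false positive is possible

And to illustrate the parallel computing bullet, here is a toy word-count sketch in the map/reduce style: the problem is decomposed into chunks, each chunk is counted in a separate process, and the partial results are merged. This is only a small stand-in for what MPI, MapReduce, or Dryad do at scale.

    from collections import Counter
    from multiprocessing import Pool

    def map_count(chunk):
        """Map step: count words in one chunk of lines."""
        counts = Counter()
        for line in chunk:
            counts.update(line.split())
        return counts

    def reduce_counts(partials):
        """Reduce step: merge the per-chunk counts."""
        total = Counter()
        for partial in partials:
            total.update(partial)
        return total

    if __name__ == "__main__":
        lines = ["big data is big", "data about data", "mpi mapreduce dryad"]
        chunks = [lines[0:1], lines[1:2], lines[2:3]]   # decompose the problem
        with Pool(processes=3) as pool:                 # assign parts to separate processes
            partial_counts = pool.map(map_count, chunks)
        print(reduce_counts(partial_counts))            # combine the partial results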

    Comparison of MPI, MapReduce and Dryad
     Reference

    • Chen, M., Mao, S. & Liu, Y. Mobile Netw Appl (2014) 19: 171. doi:10.1007/s11036-013-0489-0


Tuesday, August 23, 2016

Machine Vision and Big Data

I started my computing career with machine vision. It was my first large-sized project in school a few years ago (or was it a few decades ago, but who is counting?). Anyway, the project was to detect defects in computer chips with machine vision. We employed a simple neural network learning algorithm that tried to determine whether the image of the chip matched the image of a good specimen stored in the system. So what we were trying to solve was comparing the captured image with one fixed image and deciding whether there was a match within an acceptable level of deviation. It was not perfect, but it worked. It was unfortunate that the company that provided us with the grant money decided not to deploy the solution in production, for reasons best known to the company.
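
That matching step boils down to measuring how far a captured image deviates from a fixed reference and accepting the match if the deviation is small enough. The sketch below is not the original project code, just a hedged illustration of the idea using a mean absolute pixel difference and an arbitrary threshold.

    import numpy as np

    def matches_reference(captured, reference, max_deviation=0.05):
        """Return True if the captured image deviates from the reference
        by no more than max_deviation (mean absolute difference on a 0..1 scale)."""
        captured = captured.astype(float) / 255.0
        reference = reference.astype(float) / 255.0
        deviation = np.abs(captured - reference).mean()
        return deviation <= max_deviation

    # Toy example with random 64x64 grayscale "images".
    rng = np.random.default_rng(0)
    good_chip = rng.integers(0, 256, size=(64, 64))
    slightly_noisy = np.clip(good_chip + rng.integers(-5, 6, size=(64, 64)), 0, 255)

    print(matches_reference(slightly_noisy, good_chip))                   # True: within tolerance
    print(matches_reference(rng.integers(0, 256, (64, 64)), good_chip))   # False: unrelated image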

Anyway, today I came across a nicely written short editorial piece in the International Journal of Computer Vision titled simply "Big Data." The author describes the new techniques now available to machine/computer vision, mostly driven by advances in machine translation and speech-to-text processing. Both of those areas came into their own once large training data sets were used, and that was possible only because of general advances in storing, retrieving, and processing big data sets. For more details please refer to the article here.

The current state of image recognition is best illustrated by the two examples below. In each set, the learning engine was asked to choose the image in column 1 or 2 that looked like the images in the remaining columns (3-5). So for the first set the learning engine needed to identify a horse, and in the second set a face. As you can see, the system was able to do so correctly. Of course this does not mean that the same algorithm would be able to identify a tree if given a bunch of pictures with shrubs and the like. But it's a promising start.

Machine Vision to identify a horse and a face in a set of images


Reference
  • Rubinstein, M., Liu, C. & Freeman, W.T. Int J Comput Vis (2016) 119: 23. doi:10.1007/s11263-016-0894-5

Saturday, August 13, 2016

If you have worked for more than 5 years you are less likely to pay your loan than people who just started working!

This is an excellent post on how to use IBM Watson Analytics on Lending Club data. For those who may not know, Lending Club is one of the largest platforms for peer-to-peer (P2P) lending. It promises higher returns to lenders and a better experience for borrowers. There were several key findings from this analysis; the top three are listed below.

  1. Default rate is highest in Nevada (maybe because people tend to use the money at the slot machines or game tables of Sin City instead of putting it to good use for the purpose they got the loan for in the first place). Any guesses for which state has the lowest? Well, it's Wyoming! Who would have guessed that. :) Go Cowboys and Cowgirls!
  2. Another key finding was the influence of the reason for the loan on default rates. People tend to default half as often on loans for auto and wedding expenses as they do on loans for renewable energy. Never trusted those green types.
  3. And the last but most alarming finding was that folks who have worked for more than 5 years tend to default more than people just starting out. This is interesting, since you tend to make more money over time in any profession (no matter which profession it is). I guess work makes you less honest. If you have other reasons why this might be, feel free to tweet me at @shankar_sahai.

I wonder how default rates vary along the gender axis.

Note
For transparency, Lending Club provides loan data to the public. This data can be downloaded from the Lending Club website.
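
If you want to poke at the downloaded data yourself, here is a minimal pandas sketch of the state-level default rate calculation. The file name, the column names (loan_status, addr_state), and the set of status values treated as defaults are assumptions on my part; check them against the data dictionary of the release you download.

    import pandas as pd

    # Assumed file and column names; adjust to the actual Lending Club release.
    loans = pd.read_csv("LoanStats.csv", low_memory=False)

    DEFAULT_STATUSES = {"Charged Off", "Default"}  # assumption: treat these statuses as defaults
    loans["defaulted"] = loans["loan_status"].isin(DEFAULT_STATUSES)

    # Default rate per state, highest first.
    default_rate_by_state = (
        loans.groupby("addr_state")["defaulted"]
        .mean()
        .sort_values(ascending=False)
    )
    print(default_rate_by_state.head())   # is NV really at the top?
    print(default_rate_by_state.tail())   # and WY near the bottom?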

Monday, December 22, 2014

Big data deployment interdependencies

Over the last few years, since MapReduce/Hadoop spawned a whole new era of big data, we have seen many vendors, community-driven tools, products, and libraries explode onto the market. Even for the most ardent followers of the space, and especially for a casual observer or a new entrant to the big data field, making sense of all the tools and technologies is a nightmare -- bad enough for many to reconsider their choice of profession. But fear no more. Below are two things that will help you make sense of the big data jungle out there. First is a table of the technologies and their purpose/function. Second is the interdependency chart of the various technologies.

Functional classification of existing higher-level Big Data applications and libraries
Table - Classification of Big Data Technologies

Big data deployment interdependencies


If you are a big data novice, start with the table and figure out the technology options that are in play for your situation. Once you have narrowed down the options, research some more. Hint: Infoivy.com is a good starting place.

Deciding which options to use is never easy, but once you have made the decision, use the dependency chart to figure out the deployment stack you would need for the options chosen. Word of caution: not all stacks can be deployed -- at least not yet. So before you jump into the big data pit, check with people who might know more than you, but make your own decisions. Remember, all decisions come with a price -- the most expensive one being time, especially if you are on a tight timeline. Choose wisely!

Credit:
Deciphering Big Data Stacks: An Overview of Big Data Tools by Tomislav Lipic, Karolj Skala, Enis Afgan

Saturday, December 13, 2014

Performance testing results for tools used in big data transfers

Many organizations face the everyday task of moving big data from one deployment to another, and there are many options for doing so. Recently I came across a paper that presents the results of network performance testing with common commercial tools used for data transfer. The 10 Gb/s network topology used:
Network topology for performance testing
They also conducted the tests with background noise thrown into the mix for a more realistic simulation. Following is a chart showing how adding more background noise affects the performance of the tools.


Following is the abstract from the paper that summarizes the results of the study.
We present performance and fairness analysis of two TCP-based (GridFTP and FDT) and one UDP-based (UDT) big data transfer protocols. We perform long-haul performance experiments using a 10 Gb/s national network, and conduct fairness tests in our 10 Gb/s local network. Our results show that GridFTP with jumbo frames provides fast data transfers. GridFTP is also fair in sharing bandwidth with competing background TCP flows.

Reference
Performance and Fairness Issues in Big Data Transfers by Se-young Yu, Nevil Brownlee, and Aniket Mahanti, University of Auckland, New Zealand

Thursday, October 16, 2014

Machine learning suites comparison (SAS, R, Pentaho, Mahout, Hadoop and others)

If you have wondered how the different machine learning suites available compare, you have come to the right place. Some of the suites covered are R, SAS, Mahout, Hadoop, Spark, Pentaho, and others. The table below shows the tradeoffs of the suites in an easy-to-read fashion.

Comparison of machine learning suites
Hint: it's not hard to see that Hadoop is a good choice. Click here for more details.

Thursday, October 9, 2014

Comparison of NoSQL platforms for CAP (Consistency, Availability and Partition tolerance)

We are all aware of the CAP theorem, and if you are not, please refer to the Wikipedia page. As your NoSQL instance grows and you have to deal with multiple nodes, you will need to prioritize the CAP dimensions for your use case. This is needed because of the famous theorem, which states that you can only get two of the three guarantees. Many a time you will realize that the NoSQL option that made sense when you started no longer does. With just one node/partition you can get all three guarantees! But with multi-partition deployments you have a decision to make.

A decision to change NoSQL platforms should not be taken lightly; the transition from one to another takes a lot of planning, effort, and time. Anyway, if you do need to choose a new NoSQL platform, here is a table of the CAP options/guarantees that you get with each of the NoSQL platforms. Choose wisely!

CAP options for NoSQL platforms
A good rule of thumb while making your selection is that CP platforms are good for OLTP type usage and AP platforms are better suited for web applications.

Reference
Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks by Alberto Fernández et al.

Monday, September 29, 2014

3 Useful tips for executives planning to make use of big data

Many companies embark on a big data journey for all the good reasons but do not necessarily end up in a good place after having spent large chunks of corporate change on it. Generally there could be many reasons for this, and it would take a team of experts to analyze each situation and come up with the root cause(s). This post is mostly targeted at executives who want to make use of big data and want to ensure their staff has the right high-level guidance at the onset, so that they can manage the transition to big data in an effective and timely manner, since being on time and on budget are the two most important metrics for an executive.

Following are the three hints -
  1. Paying attention to the data stream - Businesses generate data in real time, and the tools that make use of that data should also work in real time. Looking at a snapshot of data in a warehouse, for example, might throw up some interesting observations, but you should be able to get the same guidance in real time. Hence, focus on Streaming Analytics.
  2. Relying on data scientists as opposed to data analysts - Data scientists are trained not only in data analytics but also possess good technical/programming skills. Without hands-on technical skills, analysts become dependent on IT staff, thereby slowing down the entire process. Many companies are either hiring people with, or acquiring companies that possess, these advanced skills.
  3. Moving analytics from IT into core business functions - If you have an analytics department or group tucked away in the outer reaches of your company, it's time to make some changes. It should be front and center in your main business organizations. Following are the three main reasons for this recommendation:
  • Avoids having to replicate or sync multiple data marts/warehouses/stores, which helps at the technical level. Big red flag: it is easy to underestimate the cost of having to do so.
  • Functional groups tend to trust their own findings/analysis more than when they come from an external adjunct group such as IT. This is just part of corporate culture.
  • Saves time, since it removes the need to replicate data, which otherwise leads to time being wasted waiting for the replication and analysis process to finish before results are known.
For the reasons listed above, executives would be well advised to have each functional group that can make use of big data create a focused team of data scientists within the group. If you have a central group of data analysts, go ahead and disband it. It's best for your corporate goals.

For further reading and helpful reference click here

Sunday, September 7, 2014

NoSQL with MongoDB in 24 hours - Book Review

NoSQL with MongoDB in 24 hours book cover
So, you have a burning desire to learn about NoSQL and MongoDB and can go without sleep for 24 hours? I have the perfect book for you. The book is titled "Sams Teach Yourself NoSQL with MongoDB in 24 Hours." Well, you would have to not only give up sleep for 24 hours but also either be productive enough for 24 hours straight or be caffeinated enough for that period to pull this off.

The book has 24 chapters - one per hour - so brush up on your speed-reading skills. It has great code examples (but only for learning purposes; do not try to "borrow" code for anything real) and useful "tips" and "cautions" planted throughout. Below is the introduction from the book.

MongoDB/NoSQL book introduction

Each chapter (which represents one hour of learning) has a Q&A section at the end. I strongly encourage readers not to go on to the next chapter without spending time on it. If you don't spend time on the Q&As (like I tried to), you will find yourself going back to earlier chapters for reference more than you would care to. On that note, my disclaimer: I did *not* read the book in 24 hours straight, but I wanted to. Anyway, I heartily recommend the book to anyone who wishes to become proficient with NoSQL concepts along with a solid foundation in MongoDB. The book is available on Google Books for electronic download. Bon Appétit!
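
As a small taste of the kind of workflow the book walks you through, here is a minimal PyMongo sketch of the basic document operations (connect, insert, query). The database and collection names are made up for illustration, and this is my own toy example, not code from the book.

    from pymongo import MongoClient

    # Connect to a local MongoDB instance (assumes mongod is running on the default port).
    client = MongoClient("mongodb://localhost:27017")
    db = client["bookstore_demo"]          # hypothetical database name
    reviews = db["reviews"]                # hypothetical collection name

    # Insert a document; MongoDB creates the database and collection lazily.
    reviews.insert_one({"title": "Sams Teach Yourself NoSQL with MongoDB in 24 Hours",
                        "hours_to_read": 24,
                        "recommended": True})

    # Query it back by a field value.
    doc = reviews.find_one({"recommended": True})
    print(doc["title"], doc["hours_to_read"])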

Friday, August 1, 2014

Cost of too much data

Saw this interesting infographic and am sharing it with my readers.

The Cost Of Too Much Data Infographic

Saturday, July 12, 2014

Smarty Pins - Google Map Game

Google Maps is good for just about anything that starts with "Where," but now it can also provide an answer to a very different question: "How do I kill time?" The game is called "Smarty Pins" (yes, that is the best name the highly paid smarty pants at Big G came up with). The game is simple and yet engaging, and another good way to make use of the truckloads of big data that drive it.
Smarty Pins

The options for the player to choose from -
smarty pins options
Enjoy!

Sunday, June 22, 2014

Linux distributions to package Hadoop (Seems like it)

There are several commercial and community-based Hadoop distributions out there, but so far none of the major Linux flavors (Ubuntu, Debian, Fedora) has decided to bundle one. Imagine the ease for an end user who could use simple command-line tools to install and run Hadoop. No more wading through bureaucratic delays or lengthy purchase processes to get permission to install and run it.

A couple of the larger Hadoop companies control the fate of what people currently consider the "standard" Hadoop package. It will be interesting to see how the dynamics change once the major Linux operating system vendors make Hadoop available as a base RPM. Doing some research, I did find the following Fedora project, https://fedoraproject.org/wiki/Changes/Hadoop, which suggests that an effort is underway to do just that. If Fedora takes the lead, it is just a matter of time before other distributions do the same. Please note that I have not seen any official comment about this in the press, so this could well be a false alarm.

Following is taken from the web page mentioned above and clearly shows the intent of the project. Cannot wait for this to become real for us all.

Detailed Description

Apache Hadoop is a widely used, increasingly complete big data platform, with a strong open source community and growing ecosystem. The goal is to package and integrate the core of the Hadoop ecosystem for Fedora, allowing for immediate use and creating a base for the rest of the ecosystem.

Benefit to Fedora

The Apache Hadoop software will be packaged and integrated with Fedora. The core of the Hadoop ecosystem will be available with Fedora and provide a base for additional packages.

Scope

  • Proposal owners:
      • Note: target is Apache Hadoop 2.2.0
      • Package all dependencies needed for Apache Hadoop 2.x
      • Package the Apache Hadoop 2.x software
  • Other developers: N/A (not a System Wide Change)
  • Release engineering: N/A (not a System Wide Change)
  • Policies and guidelines: N/A (not a System Wide Change)
 Image - Courtesy of rubixdesignandrepair.com

Sunday, June 15, 2014

Not all data mining packages are created equal (Comparison of 6 major free tools)

For anyone looking to compare the characteristics, pros, and cons of the six most commonly used free data mining software tools, please refer to a great expert paper titled "An overview of free software tools for general data mining" by A. Jović, K. Brkić and N. Bogunović, Faculty of Electrical Engineering and Computing, University of Zagreb / Department of Electronics, Microelectronics, Computer and Intelligent Systems, Zagreb, Croatia. The six tools extensively covered in the paper are the following:
  • RapidMiner
  • R
  • Weka
  • KNIME
  • Orange
  • scikit-learn
The paper contains a comparison of the implemented algorithms covering all areas of data mining, such as,
  • classification
  • regression
  • clustering
  • associative rules
  • feature selection
  • evaluation criteria
  • visualization
Also covered in the paper are advanced and specialized research topics, such as,
  • big data
  • data streams
  • text mining
In short, it is a treasure trove of great information for either a novice or an expert who might just be wondering whether he or she chose the right tool.

Comparison of free data mining tools (R, RapidMiner, Weka and more)
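
Since scikit-learn is one of the six tools and classification is one of the compared areas, here is a minimal scikit-learn sketch to show the kind of task the paper benchmarks across tools. The dataset and classifier choice are mine, purely for illustration.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # A small, classic classification task.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Decision trees are one of the algorithm families compared across the tools.
    clf = DecisionTreeClassifier(max_depth=3, random_state=42)
    clf.fit(X_train, y_train)

    print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))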

Sunday, June 8, 2014

The best players in world cup soccer 2014 (statistical analysis)

The soccer World Cup is around the corner, and there are many great ways to analyze the numbers available for the tournament itself, the teams playing in it, the venues where the games are being played, and the players representing the national sides. I share the image below - simple and player/position focused - for the readers of this blog. Hover on the image and then click on the circles that appear to get more information on the best players by position in the tournament (statistically speaking, of course).



Reference
LiveMint: The world cup in numbers

Thursday, May 29, 2014

6 questions to ask your NoSQL vendor

Many people I talk to are considering making the switch from traditional RDBMSes to NoSQL-style data stores. There are many inherent differences between the two, and if you are at the point of considering a NoSQL data store, chances are that you have done your homework or are being pushed to this bridge by your management or customers. Hence I will make no attempt to present high-level pros and cons of each type of data store.

Anyway, before you make the final leap, here are 6 questions that you must ask the NoSQL vendor before making your selection. All of the questions are related to the consistency of data read from the data store, and it should not come as a surprise to anyone that data consistency is the single most critical element of any database. If you answered "performance," chances are you are grossly overestimating your data, since most companies have nowhere near the amount of data needed to push any of the leading NoSQL offerings into the "red zone."

Here is the list of questions-
  1. What is the probability of observing an accurate value a fixed amount of time, say t seconds, after a write occurs, termed the freshness confidence?
  2. What percentage of reads observe a value other than what is expected, quantified as the percentage of unpredictable data?
  3. How much time is required for an updated value to be visible to all subsequent reads? This is termed the inconsistency window (see the sketch after this list for one way to estimate it).
  4. What is the probability of a read observing a value fresher than the previous read for a specific data item, termed monotonic read consistency?
  5. What is the mean age of a value read from the updated data item? This might be quantified in terms of versions or time.
  6. How different is the value of a data item from its actual value? For example, with a member with 1000 friends, a solution may return 998 friends for her whereas a different solution may return 20 friends. An application may prefer the first.
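As one illustration, here is a hedged sketch of how you might estimate the inconsistency window from question 3 against a key-value store. The store object and its read/write methods are hypothetical placeholders; substitute your vendor's client calls.

    import time
    import uuid

    def measure_inconsistency_window(store, key, timeout=10.0, poll_interval=0.01):
        """Write a unique value, then poll reads until one returns it.
        Returns the seconds elapsed before the new value became visible,
        or None if it never did within the timeout.

        `store` is a hypothetical client exposing write(key, value) and
        read(key) -> value; adapt these calls to your NoSQL vendor's API.
        This is a rough, single-client estimate; a thorough benchmark would
        read from every replica."""
        new_value = str(uuid.uuid4())
        start = time.monotonic()
        store.write(key, new_value)

        while time.monotonic() - start < timeout:
            if store.read(key) == new_value:
                return time.monotonic() - start
            time.sleep(poll_interval)
        return None

Repeating this for many keys and reporting percentiles gives a distribution rather than a single number, which is usually more informative than one measurement.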
Reference

Benchmarking Correctness of Operations in Big Data Applications, Sumita Barahmand and Shahram Ghandeharizadeh, Database Laboratory Technical Report 2014-05, Computer Science Department, USC, Los Angeles, California 90089-0781. http://dblab.usc.edu/Users/papers/ConsistencyTechReport.pdf