Sunday, December 16, 2018

World of Unstructured Clinical Research Data

Research in the bio-sciences relies on unstructured data just as much as, or maybe even more than, any other field, such as machine learning or AI. This blog post does a tremendous job of laying out, in simple terms, the challenges ahead for anyone working in clinical trials and clinical research. The thought leaders are now working with companies such as which have solutions that can help a clinical researcher navigate these treacherous waters with a platform offering the following features.
  1. Better data storage options: There are a lot more options available to institutions today enabling data storage of both structured and unstructured data in a manner that efficiently leverages methods for processing and analyzing across the combined data-pool. Cloud platforms have made this an easier and more economical option for institutions of all types and sizes.
  2. More efficient opportunities for integration: A federated approach is often an efficient option for leveraging data across multiple sources while allowing the individual systems to remain optimized for their respective functions. Opportunities for system integration have increased across the board, helped in part by more widespread availability of Application Programming Interfaces (APIs) and the continuing evolution of global data exchange standards.
  3. Improved data processing technologies: Newer storage and processing technologies for Big Data e.g. Optical Character Recognition (OCR), Natural Language Processing (NLP), machine learning and others, make it easier to rapidly process large amounts of data while also integrating patterns and information from unstructured data into decision-making pathways and other analytical solutions.
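As a toy illustration of point 3, here is a minimal sketch of pulling structured fields out of a free-text clinical note with regular expressions. The field names, abbreviations, and patterns are all hypothetical; production clinical NLP pipelines use trained models and curated medical vocabularies rather than hand-written regexes.

```python
import re

def extract_vitals(note: str) -> dict:
    """Pull a few structured fields out of a free-text clinical note.

    Illustrative only: real clinical NLP relies on trained models and
    curated terminologies, not a handful of regexes.
    """
    patterns = {
        "blood_pressure": r"BP[:\s]+(\d{2,3}/\d{2,3})",
        "heart_rate": r"HR[:\s]+(\d{2,3})",
        "temperature": r"Temp[:\s]+(\d{2,3}(?:\.\d)?)",
    }
    extracted = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, note, flags=re.IGNORECASE)
        extracted[field] = match.group(1) if match else None
    return extracted

note = "Pt presents with fatigue. BP: 128/84, HR: 72, Temp: 98.6. Denies chest pain."
print(extract_vitals(note))
```

Once fields like these are extracted, they can be stored alongside the structured trial data and queried together, which is exactly the combined-pool analysis point 1 describes.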


Saturday, November 3, 2018

“Show me the Money” - Where the conversions at?

Many of us remember the famous line from the movie Jerry Maguire. For those who do not, click here (Video Length: 2:32 min). In the marketing world, many marketers love to see clicks, impressions, and views, but what they really want to count are conversions (the closest thing to tracking money coming in, from an advertiser's perspective).

Why is it so hard to get a consistent picture on conversions?

Many advertisers use Google Analytics to aggregate conversions from different channels. While Google Analytics is a free, efficient, and easy-to-use tool for keeping track of all of your online activities, it is very important to understand what it actually tracks and what it doesn't.

There are three main reasons Google Analytics may show different conversion numbers compared to other providers. We’ll deep dive into those in more detail below.
  • Credits last paid click regardless of channel
  • Has no ability to track provider view-through conversions
  • Lacks capability to track cross-device conversions
(Note: I have used Facebook for my examples below but the same is true for many other channels)

#1 - Facebook attributes a conversion to the last click the user made on a Facebook ad (click-through or post-click conversion), or, if no clicks happened, to the last ad they've seen before converting (view-through or post-view conversion). Google Analytics, on the other hand, gives credit to the last paid click by default, regardless of channel, if any paid clicks happened.

#2 - One of Facebook's advantages over its competition is the ability to link actions to users instead of cookies. In practice this means you can track and target the same user across all their browsers and devices as long as they are signed in to Facebook. Google Analytics, on the other hand, relies solely on cookies which means all tracking happens inside the same browser where the cookie was dropped.

#3 - To understand this point, let's consider the example of a user who clicks on a Facebook ad on their smartphone and finds something they want. Later that night they go directly to the same site on their computer to finalize the purchase. Facebook counts a cross-device conversion, but Google Analytics shows the source as direct traffic.
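The accounting difference in #1 can be made concrete with a toy attribution model. Everything here (the channel names, the journey, the paid-channel set) is invented for illustration and is not how either platform's internals actually work, but it shows why both platforms can claim the same conversion.

```python
# Toy attribution comparison: the same user journey credited two ways.
# Touchpoints are (channel, event) pairs in chronological order.

journey = [
    ("facebook", "view"),     # user sees a Facebook ad, no click
    ("google_ads", "click"),  # later clicks a paid search ad
    ("email", "click"),       # unpaid email click
    ("direct", "visit"),      # finally converts via a direct visit
]

PAID_CHANNELS = {"facebook", "google_ads"}

def last_paid_click(journey):
    """Google-Analytics-style default: credit the last *paid click*."""
    for channel, event in reversed(journey):
        if event == "click" and channel in PAID_CHANNELS:
            return channel
    return "direct"

def platform_attribution(journey, platform):
    """Platform-style (e.g. Facebook): credit its own last click,
    falling back to its own last view (view-through)."""
    for channel, event in reversed(journey):
        if channel == platform and event == "click":
            return platform
    for channel, event in reversed(journey):
        if channel == platform and event == "view":
            return platform
    return None

print(last_paid_click(journey))                   # google_ads gets the conversion
print(platform_attribution(journey, "facebook"))  # facebook claims it too (view-through)
```

One conversion, two reports, both "correct" by their own rules: this is exactly why the dashboards never agree.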

Which numbers should I use?

To get a more accurate view of all their digital channels, many advertisers have chosen to go with a more robust tracking tool, like DoubleClick by Google. The advantage of these tools is that you can track impressions, and the conversions that result from them, across all channels. But this does require separately tagging each ad you make in each channel, which can be a tedious operation.

How can Improvado help?

When you have multiple channels but still crave a single dashboard that shows you all the data from multiple sources, you should turn to Improvado. We can bring together data from close to 100 sources in one dashboard and put you in control. You can view conversion data, as well as myriad other metrics, across multiple dimensions.

Tuesday, September 11, 2018

Customer Success Best Practices for a Startup

I recently recorded the following interview for Helpware, a leading online publication covering topics that matter to CXO staff at startups. It was an interesting conversation, and some great points were made during the talk. Hopefully you enjoy the talk as much as I enjoyed recording it. If there are other topics you want to hear about, do leave a comment below.


Monday, August 29, 2016

Comparison of MPI, MapReduce and Dryad

Big data is only as good as the value it can deliver, in a timely manner, to its owners. Hence the speed of data search, or of finding the key piece of info, is always a topic of discussion in any big data shop.

Following are the top 5 high-level ways to find info in a big data store. We will focus on the parallel search option, since it delivers the most value (bang for the buck).
  • Bloom Filter: a Bloom Filter consists of a series of hash functions. The principle is to store hash values of the data, rather than the data itself, in a bit array, which is in essence a bitmap index that uses hash functions to perform lossy, compressed storage of data. It has advantages such as high space efficiency and high query speed, but also disadvantages: false positives (mis-recognition) and no support for deletion.
  • Hashing: a method that transforms data into shorter, fixed-length numerical or index values. Hashing offers rapid reading and writing and high query speed, but it is hard to find a sound hash function.
  • Index: indexing is always an effective method to reduce the expense of disk reads and writes and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in technologies that manage semi-structured and unstructured data. However, indexes carry the additional cost of storing index files, which must be maintained dynamically as data is updated.
  • Trie: also called a prefix tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word-frequency statistics. The main idea of the trie is to use common prefixes of character strings to reduce string comparisons to the greatest extent possible, improving query efficiency.
  • Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, achieving co-processing. Classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad.
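To make the Bloom Filter bullet concrete, here is a minimal sketch in Python. It follows the description above: hash positions are stored in a bit array instead of the data itself, lookups can return false positives but never false negatives, and items cannot be deleted. The sizes and hashing scheme are arbitrary choices for illustration.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: stores hash positions in a bit array
    rather than the data itself. Membership tests may yield false
    positives but never false negatives; items cannot be deleted."""

    def __init__(self, size: int = 1024, num_hashes: int = 3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item: str):
        # Derive several hash positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("patient-123")
print(bf.might_contain("patient-123"))  # True, guaranteed
print(bf.might_contain("patient-999"))  # almost certainly False
```

The space efficiency comes from never storing the items; the price is that a "yes" answer only means "probably", and removing an item would clear bits shared with other items, which is exactly the deletion weakness noted above.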

    Comparison of MPI, MapReduce and Dryad

    • Chen, M., Mao, S. & Liu, Y. Mobile Netw Appl (2014) 19: 171. doi:10.1007/s11036-013-0489-0

Tuesday, August 23, 2016

Machine Vision and Big Data

I started my computing career with machine vision. It was my first large-sized project in school a few years ago (or was it a few decades ago, but who is counting). Anyway, the project was to detect defects in computer chips with machine vision. We employed a simple neural-network learning algorithm that tried to determine whether the image of the chip matched the image of a good specimen stored in the system. So, what we were trying to solve was comparing the captured image with one fixed image and deciding if there was a match within an acceptable level of deviation. It was not perfect, but it worked. It was unfortunate that the company that provided us with the grant money decided not to deploy the solution in production, for reasons best known to the company.
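The match-within-a-tolerance idea described above can be sketched in a few lines. This is a deliberately crude stand-in (mean absolute pixel difference against a known-good image, with an arbitrary tolerance), not the actual algorithm from that project.

```python
def matches_template(image, template, tolerance=10.0):
    """Crude defect check: compare a captured image against a
    known-good template and accept if the mean absolute pixel
    difference is within tolerance. Images are same-sized 2D
    lists of grayscale values (0-255)."""
    total_diff = 0
    count = 0
    for row_img, row_tmpl in zip(image, template):
        for a, b in zip(row_img, row_tmpl):
            total_diff += abs(a - b)
            count += 1
    return (total_diff / count) <= tolerance

good_chip = [[200, 200], [200, 200]]
captured_ok = [[198, 203], [201, 199]]   # minor sensor noise
captured_bad = [[200, 200], [60, 200]]   # a dark defect region

print(matches_template(captured_ok, good_chip))   # True
print(matches_template(captured_bad, good_chip))  # False
```

A fixed template with a global tolerance breaks down as soon as lighting, alignment, or the part itself varies, which is precisely why the big-data-trained approaches discussed below are such a step forward.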

Anyway, today I came across this nicely written short editorial piece in the International Journal of Computer Vision titled simply Big Data. The author describes the new techniques now available to the area of machine/computer vision, driven mostly by advances in machine translation and speech-to-text processing. Both these areas came into their own once large training data sets were used, and this was possible only with general advances in storing, retrieving, and processing big data sets. For more details please refer to the article here.

The current state of image recognition is best illustrated by the two examples below. In each set, the learning engine was asked to choose an image in column 1 or 2 that looked like the images in the remaining columns (3-5). So, for the first set the learning engine needed to identify a horse, and in the second set a face. And as you can see, the system was able to do so correctly. Of course, this does not mean that the same algorithm would be able to identify a tree if given a bunch of pictures with shrubs, etc. But it's a promising start.

Machine Vision to identify a horse and a face in a set of images

  • Rubinstein, M., Liu, C. & Freeman, W.T. Int J Comput Vis (2016) 119: 23. doi:10.1007/s11263-016-0894-5

Saturday, August 13, 2016

If you have worked for more than 5 years you are less likely to pay your loan than people who just started working!

This is an excellent post on how to use IBM Watson Analytics on Lending Club data. For those who may not know, Lending Club is one of the largest platforms for peer-to-peer (P2P) lending. It promises higher returns to lenders and a better experience for borrowers. There were several key findings from this analysis. The top 3 are listed below-

  1. Default rate is highest in Nevada (maybe because people tend to use the money at the slot machines or game tables of Sin City instead of putting it to the good use for which they got the loan in the first place). Any guesses for which state has the lowest? Well, it's Wyoming! Who would have guessed that. :) Go Cowboys and Cowgirls!
  2. Another key finding was the influence of the reason for the loan on default rates. People tend to default half as much on loans for auto and wedding expenses as they do on loans for renewable energy. Never trusted those green types.
  3. And the last but most alarming finding was that folks who have worked for more than 5 years tend to default more than people just starting out. This is interesting, since you tend to make more money over time in any profession (no matter which profession it is). I guess work makes you less honest. If you have other reasons why this might be, feel free to tweet me at @shankar_sahai

I wonder how default rates vary across gender.

For transparency, Lending Club provides loan data to the public. This data can be downloaded from the Lending Club website.
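For readers who want to reproduce finding #1 from the downloaded data themselves, the core computation is just a grouped ratio. The sketch below uses invented field names and toy records, not the actual Lending Club schema.

```python
from collections import defaultdict

def default_rate_by_state(loans):
    """Compute per-state default rates from loan records.
    Each record is a dict with 'state' and 'status'; these field
    names and values are illustrative, not actual Lending Club
    column names."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for loan in loans:
        totals[loan["state"]] += 1
        if loan["status"] == "Default":
            defaults[loan["state"]] += 1
    return {s: defaults[s] / totals[s] for s in totals}

loans = [
    {"state": "NV", "status": "Default"},
    {"state": "NV", "status": "Fully Paid"},
    {"state": "WY", "status": "Fully Paid"},
    {"state": "WY", "status": "Fully Paid"},
]
print(default_rate_by_state(loans))  # {'NV': 0.5, 'WY': 0.0}
```

The same grouping works for any dimension in the file (loan purpose, employment length, and so on), which is all findings #2 and #3 amount to.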

Monday, December 22, 2014

Big data deployment interdependencies

Over the last few years, since MapReduce/Hadoop spawned a whole new era of big data, we have seen many vendors, community-driven tools, products, and libraries explode onto the market. To even the most ardent followers of the space, and especially to a casual observer or new entrant to the big data field, making sense of all the tools and technologies is a nightmare-- bad enough for many to reconsider their choice of profession. But fear no more. Below are two things that will help you make sense of the big data jungle out there. First is a table of the technologies and their purpose/function. Second is the interdependency chart of the various technologies.

Table - Classification of Big Data Technologies

Big data deployment interdependencies

If you are a big data novice, start with the table and figure out the technology options that are in play for your situation. Once you have narrowed down the options, research some more. Hint- is a good starting place.

Deciding which options to use is never easy, but once you have made the decision, use the dependency chart to figure out the deployment stack you would need for the options chosen. Word of caution - not all stacks can be deployed, at least not yet. So, before you jump into the big data pit, check with people who might know more than you, but make your own decisions. Remember, all decisions come with a price -- the most expensive one being time, especially if you are on a tight timeline. Choose wisely!

Deciphering Big Data Stacks: An Overview of Big Data Tools by Tomislav Lipic, Karolj Skala, Enis Afgan

Saturday, December 13, 2014

Performance testing results for tools used in big data transfers

Many organizations face the everyday task of moving big data from one deployment to another. There are many options for doing so. Recently I came across a paper that presents the results of network performance testing with common commercial tools used for data transfer. The 10 Gb/s network topology used -
Network topology for performance testing
They also conducted the tests with background noise thrown into the mix for a more realistic simulation. Following is a chart showing how adding more background noise affects the performance of the tools.

Following is the abstract from the paper that summarizes the results of the study.
We present performance and fairness analysis of two TCP-based (GridFTP and FDT) and one UDP-based (UDT) big data transfer protocols. We perform long-haul performance experiments using a 10 Gb/s national network, and conduct fairness tests in our 10 Gb/s local network. Our results show that GridFTP with jumbo frames provides fast data transfers. GridFTP is also fair in sharing bandwidth with competing background TCP flows.

Performance and Fairness Issues in Big Data Transfers by Se-young Yu, Nevil Brownlee, and Aniket Mahanti, University of Auckland, New Zealand
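A quick back-of-the-envelope calculation shows why protocol efficiency matters at this scale: even on a perfect 10 Gb/s link, a single terabyte takes over 13 minutes to move, and any shortfall from line rate stretches that proportionally.

```python
def ideal_transfer_seconds(size_terabytes: float, link_gbps: float,
                           efficiency: float = 1.0) -> float:
    """Back-of-the-envelope transfer time: size in TB over a link in
    Gb/s, scaled by an assumed protocol efficiency (1.0 = perfect)."""
    bits = size_terabytes * 8e12           # 1 TB = 8 * 10^12 bits
    return bits / (link_gbps * 1e9 * efficiency)

# 1 TB over an ideal 10 Gb/s link:
print(ideal_transfer_seconds(1, 10))       # 800.0 seconds (~13 minutes)
# The same transfer if the protocol only achieves 60% of line rate:
print(ideal_transfer_seconds(1, 10, 0.6))  # ~1333 seconds
```

The 60% figure is an arbitrary example, not a number from the paper, but it illustrates why jumbo frames and protocol choice show up so clearly in the results.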

Thursday, October 16, 2014

Machine learning suites comparison (SAS, R, Pentaho, Mahout, Hadoop and others)

If you have wondered about comparing the different machine learning suites available, you have come to the right place. Some of the suites covered are R, SAS, Mahout, Hadoop, Spark, Pentaho, and others. The table below shows the tradeoffs of the suites in an easy-to-read fashion.

Comparison of Machine Learning Suites
Hint- It's not hard to see that Hadoop is a good choice. Click here for more details.

Thursday, October 9, 2014

Comparison of NoSQL platforms for CAP (Consistency, Availability and Partition tolerance)

We are all aware of the CAP theorem, and if you are not, please refer to the Wikipedia page. As your NoSQL instance grows and you have to deal with multiple nodes, you will need to prioritize the CAP dimensions for your use case. This is needed because of the famous theorem, which states that you can only get 2 out of the 3 guarantees. Many a time you will realize that the NoSQL option that made sense when you started no longer does, since with just one node/partition you can get all three guarantees! But with multi-partition deployments you have a decision to make.

The decision to change NoSQL platforms should not be taken lightly. Transitioning from one to another takes a lot of planning, effort, and time. Anyway, if you do need to choose a new NoSQL platform, here is a table of the CAP options/guarantees you get with each of the NoSQL platforms. Choose wisely!

CAP options for NoSQL platforms
A good rule of thumb while making your selection is that CP platforms are good for OLTP type usage and AP platforms are better suited for web applications.
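The trade-off behind the table can be sketched with a toy two-replica system. This is purely illustrative (real systems use quorums, leases, and conflict resolution), but it shows what a network partition forces you to choose between.

```python
class Replica:
    """A single copy of the data; real systems have many of these."""
    def __init__(self):
        self.value = None

def write(replicas, reachable, value, mode):
    # `reachable` marks which replicas this client can currently contact.
    # CP mode: refuse the write unless every replica is reachable
    #          (consistent, but unavailable during a partition).
    # AP mode: write to whatever is reachable (available, but replicas
    #          may disagree until the partition heals).
    targets = [r for r, ok in zip(replicas, reachable) if ok]
    if mode == "CP" and len(targets) < len(replicas):
        return "rejected"
    for r in targets:
        r.value = value
    return "accepted"

a, b = Replica(), Replica()
# Network partition: the client can reach replica a but not replica b.
print(write([a, b], [True, False], "x=1", mode="CP"))  # rejected
print(write([a, b], [True, False], "x=1", mode="AP"))  # accepted
print(a.value, b.value)  # x=1 None -> the replicas have diverged under AP
```

This is why the OLTP rule of thumb holds: a banking-style workload would rather reject the write than let two replicas disagree, while a web application would rather serve something slightly stale than serve nothing.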

Big Data with Cloud Computing: an insight on the computing environment, MapReduce,and programming frameworks by Alberto Fernández et al.

Monday, September 29, 2014

3 Useful tips for executives planning to make use of big data

Many companies embark on a big data journey for all the right reasons but do not necessarily end up in a good place after having spent large chunks of corporate change on it. Generally, there could be many reasons for this, but it would take a team of experts to analyze each situation and come up with the root cause(s). This post is mostly targeted at executives who have the desire to make use of big data and want to ensure their staff has the right high-level guidance at the onset, so that they can manage the transition to big data in an effective and timely manner, since being on time and on budget are the two most important metrics for an executive.

Following are the three hints -
  1. Paying attention to the data stream - Businesses generate data in real time, and the tools that make use of that data should be real time too. Looking at a snapshot of data in a warehouse, for example, might throw up some interesting observations, but you should be able to get the same guidance in real time. Hence, focus on streaming analytics.
  2. Relying on data scientists as opposed to data analysts - Data scientists are trained not only in data analytics but also possess good technical/programming skills. Without hands-on technical skills, analysts become dependent on IT staff, which slows down the entire process. Many companies are either hiring for these advanced skills or acquiring companies that possess them.
  3. Moving analytics from IT into core business functions - If you have an analytics department or group tucked away in the outer reaches of your company, it's time to make some changes. It should be front and center in your main business organizations. Following are the three main reasons for this recommendation-
  • It avoids having to replicate or sync multiple data marts/warehouses/stores, etc., which helps at the technical level. Big red flag - it is easy to underestimate the cost of having to do so.
  • Functional groups tend to trust their own findings/analysis more than those that come from an external adjunct group such as IT. This is just part of corporate culture.
  • It saves time, since it removes the need to replicate data, which otherwise wastes time waiting for the analysis process to finish before results are known.
For the reasons listed above, executives would be well advised to have each functional group that can make use of big data create a focused team of data scientists within the group. If you have a central group of data analysts, go ahead and disband it. It's best for your corporate goals.
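The streaming-analytics tip (#1) boils down to computing metrics as events arrive rather than querying a periodic snapshot. A minimal sketch, with an invented metric and window size:

```python
from collections import deque

class StreamingAverage:
    """Sliding-window average over a live event stream: the kind of
    always-current metric the streaming-analytics tip argues for,
    versus waiting on a periodic warehouse snapshot."""

    def __init__(self, window: int):
        self.window = deque(maxlen=window)  # old values fall off automatically

    def observe(self, value: float) -> float:
        self.window.append(value)
        return sum(self.window) / len(self.window)

orders_per_minute = StreamingAverage(window=3)
for v in [10, 14, 12, 40]:      # a sudden spike arrives in the stream
    latest = orders_per_minute.observe(v)
print(latest)                   # 22.0 -> the spike is visible immediately
```

A nightly batch job would surface that spike the next morning; the streaming version surfaces it on the very event that caused it.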

For further reading and helpful reference click here

Sunday, September 7, 2014

NoSQL with MongoDB in 24 hours - Book Review

NoSQL with MongoDB in 24 hours book cover
So, you have a burning desire to learn about NoSQL and MongoDB and can go without sleep for 24 hours? I have the perfect book for you. The book is titled "Sams Teach Yourself NoSQL with MongoDB in 24 Hours". Well, you would have to not only give up sleep for 24 hours but also either be productive enough for 24 hours straight, or be caffeinated enough for that time period, to pull this off.

The book has 24 chapters, one per hour. So brush up on your speed-reading skills. The book has great code examples (but only for learning purposes; do not try to "borrow" code for anything real). It also has useful "tips" and "cautions" planted throughout. Below is the introduction from the book.

MongoDB/NoSQL book introduction

Each chapter (which represents one hour of learning) has a Q&A section at the end. I strongly encourage readers to not go on to the next chapter without spending time on it. If you don't spend time on the Q&As (like I tried to) you will find yourself having to go back to earlier chapters for reference more than you would care for. On this note, my disclaimer- I did *not* read the book in 24 hours straight but wanted to. Anyway, I heartily recommend the book to anyone who wishes to become proficient with NoSQL concepts along with a solid foundation of MongoDB. The book is available on Google Books for electronic download. Bon Appétit!

Friday, August 1, 2014

Cost of too much data

Saw this interesting infographic and hence sharing with my readers.

The Cost Of Too Much Data Infographic

Saturday, July 12, 2014

Smarty Pins - Google Map Game

Google Maps is good for just about anything that starts with "Where", but it can now also answer many variations of "how do I kill time?". The game is called "Smarty Pins" (yes, that is the best name the highly paid smarty pants at Big G came up with). The game is simple and yet engaging. Another good way to make use of the truckloads of big data that drives it.
Smarty Pins

The options for the player to choose from -
smarty pins options

Sunday, June 22, 2014

Linux distributions to package Hadoop (Seems like it)

There are several commercial and community-based Hadoop distributions out there, but so far none of the major Linux flavors (Ubuntu, Debian, Fedora) has decided to bundle a Hadoop distribution. Imagine the ease for an end user who could use simple command-line tools to install and run Hadoop. No more wading through bureaucratic delays or lengthy purchase processes to get permission to install and run Hadoop.

A couple of the larger Hadoop companies control the fate of what people currently consider the "standard" Hadoop package. It will be interesting to see how the dynamics change once the major Linux operating system vendors make Hadoop available as a base RPM. Doing some research, I did find the following Fedora project - which leads me to suggest that an effort is underway to do just that. If Fedora takes the lead, it is just a matter of time before other distributions do the same. Please note that I have not seen any official comment about this in the press, so this could well be a false alarm.

Following is taken from the web page mentioned above and clearly shows the intent of the project. Cannot wait for this to become real for us all.

Detailed Description

Apache Hadoop is a widely used, increasingly complete big data platform, with a strong open source community and growing ecosystem. The goal is to package and integrate the core of the Hadoop ecosystem for Fedora, allowing for immediate use and creating a base for the rest of the ecosystem.

Benefit to Fedora

The Apache Hadoop software will be packaged and integrated with Fedora. The core of the Hadoop ecosystem will be available with Fedora and provide a base for additional packages.


  • Proposal owners:
  •    Note: target is Apache Hadoop 2.2.0
  •    Package all dependencies needed for Apache Hadoop 2.x
  •    Package the Apache Hadoop 2.x software
  • Other developers: N/A (not a System Wide Change)
  • Release engineering: N/A (not a System Wide Change)
  • Policies and guidelines: N/A (not a System Wide Change)
 Image - Courtesy of