Many of us who have come from the traditional database world have spent a lot of time working on data quality: identifying bad data, then rectifying or isolating it. The question for this post is: "Does data quality change in the world of big data?"
The answer is not simple. The use case for the data, along with its source, determines how much effort you should be willing to spend on fixing data quality woes. If the data (from log files) will be used to analyze code performance or something similar, it may be OK to ignore certain data quality issues. If the data (HTML content) comes from websites and will be used for search, ignoring some data quality issues may also make sense, especially when the data size is in the petabyte range or beyond.
But let's take the example of clickstream data that will be used by e-commerce platforms, or data that will be used for disease management or drug discovery. That data must be of the highest quality. You get no relief on the data quality front just because the data volumes are huge.
So, here are the steps to address data quality in the big data world:
- Assess your use cases and data sources, then decide on the most pragmatic approach for you. Do so in consultation with the consumers of the data.
- Document the data quality issues that your ETL code is either ignoring or remedying.
- Have a nightly script run against your data to ensure that no unexpected new data quality issues appear. Also check for any synchronization issues. (Hint: This is not easy to do.)
- Perform periodic, intensive data quality audits.
- Plan for data quality and assign an owner at the outset.
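To make the nightly check a little more concrete, here is a minimal sketch in Python. It is only an illustration, not a recommended implementation: the field names (`event_id`, `user_id`, `timestamp`), the issue names, and the `KNOWN_ISSUE_RATES` baseline are all hypothetical stand-ins for whatever your documented issues look like. The idea is to count issues in a batch of records and flag only those that are new or that exceed the rate you have already documented as acceptable.

```python
from collections import Counter

# Hypothetical baseline: issues already documented as ignored or remedied by
# the ETL code, mapped to the highest fraction of rows considered normal.
KNOWN_ISSUE_RATES = {"missing_user_id": 0.02, "duplicate_event_id": 0.01}


def count_issues(rows):
    """Count data quality issues in an iterable of dict records.

    Returns a (Counter of issue name -> count, total rows seen) pair.
    """
    issues = Counter()
    seen_ids = set()
    total = 0
    for row in rows:
        total += 1
        # A missing or empty user_id counts as an issue.
        if not row.get("user_id"):
            issues["missing_user_id"] += 1
        # Repeated event_id values count as duplicates (None is skipped).
        event_id = row.get("event_id")
        if event_id is not None:
            if event_id in seen_ids:
                issues["duplicate_event_id"] += 1
            seen_ids.add(event_id)
        # Timestamps are assumed to be non-negative numbers.
        ts = row.get("timestamp")
        if not isinstance(ts, (int, float)) or ts < 0:
            issues["bad_timestamp"] += 1
    return issues, total


def unexpected_issues(rows, known=KNOWN_ISSUE_RATES):
    """Return {issue: rate} for issues that are undocumented or above baseline."""
    issues, total = count_issues(rows)
    if total == 0:
        # An empty nightly batch is itself worth flagging.
        return {"empty_input": 1.0}
    flagged = {}
    for name, count in issues.items():
        rate = count / total
        # Undocumented issues have an implicit allowed rate of 0.
        if rate > known.get(name, 0.0):
            flagged[name] = rate
    return flagged
```

A scheduler such as cron could call `unexpected_issues` on the previous day's records and alert the data quality owner whenever the result is non-empty. At petabyte scale you would more likely run the same logic on a sample or inside your distributed processing engine rather than in a single Python process.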