But let's take the example of clickstream data used by an e-commerce platform: a small fraction of malformed events may be tolerable there. Data that will be used for disease management or drug discovery, by contrast, must be of the highest quality. You get no relief on the data quality front just because the data volumes are huge.
So, here are the steps to address data quality in the big data world:
- Assess your use cases and data sources, then decide on the most pragmatic approach for you. Do so in consultation with the consumers of the data.
- Document the data quality issues that your ETL code is either ignoring or remedying.
- Have a nightly script run against your data to ensure no unexpected new data quality issues appear. Also check for any synchronization issues. (Hint: this is not easy to do.)
- Perform periodic, intensive data quality audits.
- Plan for data quality and assign an owner at the outset.
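The nightly check mentioned in the steps above can be sketched as a small script. This is a minimal illustration, not a full framework: the field names (`user_id`, `event`, `ts`), thresholds, and sample rows are hypothetical, and cross-system synchronization checks are deliberately omitted since those depend on your specific pipeline.

```python
def run_quality_checks(rows, key_field, required_fields, min_rows):
    """Run basic nightly data quality checks on a batch of records.

    Returns a list of human-readable issue strings; an empty list
    means the batch passed all checks.
    """
    issues = []

    # Volume check: a sudden drop in row count often signals an
    # upstream ingestion or synchronization problem.
    if len(rows) < min_rows:
        issues.append(f"row count {len(rows)} below expected minimum {min_rows}")

    seen_keys = set()
    for i, row in enumerate(rows):
        # Uniqueness check on the business key.
        key = row.get(key_field)
        if key in seen_keys:
            issues.append(f"duplicate key {key!r} at row {i}")
        seen_keys.add(key)

        # Completeness check on required fields.
        for field in required_fields:
            if row.get(field) in (None, ""):
                issues.append(f"missing {field!r} at row {i}")

    return issues


# Hypothetical sample batch with two deliberate defects:
# a duplicate user_id and an empty 'event' field.
rows = [
    {"user_id": "u1", "event": "click", "ts": "2024-01-01T00:00:00"},
    {"user_id": "u1", "event": "", "ts": "2024-01-01T00:00:05"},
]
issues = run_quality_checks(rows, "user_id", ["event", "ts"], min_rows=1)
for issue in issues:
    print(issue)
```

In a real deployment this would run from a scheduler (e.g. cron or an orchestrator), and any non-empty issue list would page the data's designated owner, tying the nightly check back to the ownership step above.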