New solutions are emerging to address Hadoop security, but none is yet considered robust and trusted. So, if you are not an early adopter of new security technology, or have other good reasons to stay off the bleeding edge, it might be better to rely on tried and trusted practices for securing your Hadoop cluster.
Remember that the data managed in your cluster will grow over time. Even if you do not currently hold sensitive data such as Social Security numbers, Protected Health Information (PHI), or financial data, chances are that a business group will end up "dumping" such data into the cluster without your knowledge. Everyone is busy and things like this just happen, so be prepared for it. Assume from the start that your Hadoop cluster holds this kind of data, and put in place security practices and tools that can effectively protect it and allow you to get a good night's sleep. Red Bull and 5-hour Energy drinks only go so far.
What are some of the things you need to focus on?
System level security: Instead of using 777-type directory permissions, use 700 for all task and data folders. Limit the number of users who have access. Hire people only after doing background checks.
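As a rough illustration of the permissions advice, the following Python sketch walks a local directory tree and tightens every directory to 700. It assumes the task and data folders live on the local file system of a cluster node; the function name `tighten_permissions` is made up for this example, and permissions inside HDFS itself are managed separately (for instance with `hdfs dfs -chmod`).

```python
import os
import stat
from typing import List


def tighten_permissions(root: str, mode: int = 0o700) -> List[str]:
    """Set `mode` on `root` and every directory below it.

    Returns the list of directories whose permissions were changed.
    Hypothetical helper for illustration; it touches directories only,
    not the files inside them.
    """
    changed = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        current = stat.S_IMODE(os.stat(dirpath).st_mode)
        if current != mode:
            os.chmod(dirpath, mode)  # owner: rwx, group and others: none
            changed.append(dirpath)
    return changed
```

Running it a second time on the same tree should return an empty list, which makes it easy to use as a periodic check as well as a one-off fix.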
Network level security: Use encryption and do not transfer any data in the clear, even internally. Also ensure that every port on the Hadoop cluster is protected by your firewalls. Implement two-factor authentication for outside access. Paying for penetration testing is money well spent. Always.
Audit log analysis: Perform nightly analysis of your access logs. Any anomalies should be fully investigated and quickly resolved.
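A nightly check does not have to be elaborate to be useful. The sketch below counts denied requests per user and flags anyone at or above a threshold. It assumes each log line carries `key=value` fields such as `allowed=`, `ugi=` and `cmd=`, in the style of an HDFS audit log; the function names, the threshold, and the sample lines are all invented for illustration, so adjust the parsing to whatever your logs actually contain.

```python
import re
from collections import Counter

# Assumed format: whitespace-separated key=value pairs on each line.
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def denied_by_user(lines):
    """Count denied requests per user across the given log lines."""
    counts = Counter()
    for line in lines:
        fields = dict(FIELD_RE.findall(line))
        if fields.get("allowed") == "false":
            counts[fields.get("ugi", "unknown")] += 1
    return counts


def flag_anomalies(lines, threshold=3):
    """Return, sorted, the users with at least `threshold` denied requests."""
    counts = denied_by_user(lines)
    return sorted(user for user, n in counts.items() if n >= threshold)


# Made-up sample data, not real log output.
sample = [
    "allowed=false ugi=mallory ip=/10.0.0.9 cmd=open src=/data/payroll",
    "allowed=false ugi=mallory ip=/10.0.0.9 cmd=open src=/data/payroll",
    "allowed=false ugi=mallory ip=/10.0.0.9 cmd=listStatus src=/data",
    "allowed=true ugi=alice ip=/10.0.0.5 cmd=open src=/data/logs",
    "allowed=false ugi=alice ip=/10.0.0.5 cmd=open src=/data/payroll",
]
flagged = flag_anomalies(sample)
```

A count of denied requests is only one possible signal. Whatever the check flags still needs a person to investigate it.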
Delete staging data: If you are using FTPS or SFTP to transfer files from other servers to your big data cluster, delete each file from the staging area after it has been uploaded. You should still save metadata about the files, though, such as:
- Name of file
- Size/checksum of file
- IP of source server
- Headers (if available), but save no data from the files (not even a single line)
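The record-then-delete step for staging files could look something like the Python sketch below. It is a minimal example under several assumptions: the caller has already confirmed that the upload to the cluster succeeded, the source IP and any header line are supplied by the transfer process, and metadata is appended to a JSON-lines file. The function name `record_and_delete` and the field names are hypothetical.

```python
import hashlib
import json
import os


def record_and_delete(path, source_ip, metadata_log, headers=None):
    """Save metadata about a staged file, then delete the file.

    Call this only after the upload to the cluster has been verified.
    No content from the file is stored, only its name, size, checksum,
    source IP, and the optional header line passed in by the caller.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as staged:
        # Read in 1 MiB chunks so large files do not fill memory.
        for chunk in iter(lambda: staged.read(1 << 20), b""):
            digest.update(chunk)

    record = {
        "name": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "sha256": digest.hexdigest(),
        "source_ip": source_ip,
        "headers": headers,
    }
    with open(metadata_log, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")

    # os.remove unlinks the file; it does not overwrite the disk blocks.
    os.remove(path)
    return record
```

Writing the metadata before removing the file means a failure while logging leaves the staged file in place rather than losing track of it.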