Thrive School - Big data, Hadoop, MapReduce, Hive, Pig, Hbase Tutorials: Big Data

In my previous post, I talked about the need to secure big data environments and mentioned eight key areas of enterprise security. Let’s discuss these areas in a little more detail to understand the concepts behind the terms. That understanding is necessary to know what we intend to do under each vertical of enterprise security, and it forms a basis for requirements. Before that, I just want to share some stats from the PwC survey “The Global State of Information Security® Survey 2014”. The detailed report of this survey is available here.

Authentication

Authentication is the first level of security. It is one of the simplest terms to understand but one of the most complex things to implement. Authentication means proving that you are who you claim to be. Everyone has experienced it when giving a user name and password to a system they want to use. But there are several questions you may need to ask while implementing it; some of them are listed below.

Who (which server) will authenticate users?

When you log in from a client machine, do you want to send your password to the server for verification?

Do you want all communication between the client and the server to be encrypted?

How do you want to manage encryption keys?

How do you want to manage the list of all valid users?

Do you want to allow users to log in from any machine or device, such as their mobile phone?

Which applications (command line interfaces, commercial applications, and custom-built applications) will be allowed to access your systems?

How will you ensure that those applications use a secure method to establish a connection?

Do you want a single authentication for all your services?

How will you integrate it with your existing authentication system?
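
To make the question about sending passwords to the server a bit more concrete, here is a minimal sketch of a challenge-response login in Python, where the client proves it knows the password without transmitting it. Everything in it (the user store `USER_KEYS`, the fixed `SALT`, the function names) is my own assumption for illustration. Production systems usually rely on an established protocol such as Kerberos rather than hand-written code like this.

```python
import hashlib
import hmac
import secrets

SALT = b"demo-salt"  # illustrative only; a real system would use a random per-user salt


def derive_key(password: str, salt: bytes) -> bytes:
    """Derive a secret key from the password so the raw password is never stored."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


# Hypothetical server-side user store: user name -> derived key.
USER_KEYS = {"alice": derive_key("correct horse", SALT)}


def issue_challenge() -> bytes:
    """Server side: a fresh random nonce for every login attempt."""
    return secrets.token_bytes(16)


def client_response(password: str, challenge: bytes) -> bytes:
    """Client side: answer the challenge without sending the password itself."""
    key = derive_key(password, SALT)
    return hmac.new(key, challenge, hashlib.sha256).digest()


def verify(user: str, challenge: bytes, response: bytes) -> bool:
    """Server side: recompute the expected answer and compare in constant time."""
    key = USER_KEYS.get(user)
    if key is None:
        return False
    expected = hmac.new(key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

Note that in this sketch the stored key is as sensitive as the password itself, so the question of how you manage keys and the list of valid users does not go away.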

Authorization

Once you are authenticated, the system knows “who you are” and you can get in. What’s next? You will want to perform some task: execute programs, scripts, or commands, read or write files, or access services or other systems within the network. That’s where authorization plays its role: it controls what you are allowed to do after entering the system. In fact, authorization and authentication are tightly coupled in any system, and authorizing a user may trigger a need for further authentication. For example, you are authorized to execute a program that connects to a database, and then you need to authenticate to the database as well. This is a typical situation where single sign-on helps; without it you end up creating many credentials. The question is how you want to do this in your system and how you want to manage such passed-on authorizations in a secure manner. Do you want to embed such credentials within your applications? I have seen implementations where all application users share a single credential to connect to specific services such as databases, while user management is handled by another layer within the application. This approach can work as long as you implement a mechanism to trace database activity back to the individual application user.

Another consideration under this vertical is how you want to manage machine-to-machine or service-to-service authentication and authorization within your system. For example, in a Hadoop system, when a node wants to register itself as a data node, how do you check that it is authorized to be registered as a data node and, more importantly, how do you authenticate that it is the node it claims to be? Such issues are addressed under authorization.
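
As a rough illustration of the difference between the two checks, here is a small Python sketch in which users and machines are both treated as principals with explicit grants. The grant strings, the host name, and the `register_datanode` function are hypothetical names I made up for this example; this is not how Hadoop itself registers data nodes.

```python
# Hypothetical grants: which principal (user or machine) may perform which action.
GRANTS = {
    "alice": {"read:/data/sales", "run:report.sh"},
    "node-17.example.internal": {"register:datanode"},
}


def is_authorized(principal: str, action: str) -> bool:
    """Authorization: is this already-identified principal allowed to do this?"""
    return action in GRANTS.get(principal, set())


def register_datanode(principal: str, authenticated: bool) -> str:
    """Accept a registration only if the identity was verified and the grant exists."""
    if not authenticated:
        # Authentication failed: we do not even know who is asking.
        raise PermissionError("identity not verified")
    if not is_authorized(principal, "register:datanode"):
        # Authentication succeeded, but this principal has no such grant.
        raise PermissionError(f"{principal} may not register as a data node")
    return f"{principal} registered as data node"
```

The point of the sketch is the order of the checks: a node first has to prove it is the node it claims to be, and only then does the grant table decide whether it may join.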

Access Control

If you are familiar with any RDBMS technology, you probably already understand access control. Authorization tells you what you can do, for example which files you can read, but it does not extend to a fine-grained level. RDBMSs are excellent at providing such fine-grained access control. In a typical RDBMS, you can control which tables a user can read. It goes further and lets you restrict specific columns, and some databases let you restrict rows as well. Databases that cannot restrict rows directly provide it indirectly through views. Your big data environment is going to store data, and in that respect it is no different from your database. How can you imagine a system that stores data without such access control capabilities? It is an obvious requirement that you may not want everyone to see everything in a file. It doesn’t end there. You may want to build various access profiles, assign access rights to profiles, and then assign profiles to users, or you may want other ways of defining policy-based or role-based access control. You may also want policy-based resource management, for example who can use how much disk, CPU, and memory. Such things are considered under access control.
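
Here is a minimal sketch of the profile idea in Python: access rights (visible columns and a row filter) are attached to profiles, and profiles are assigned to users. The profile names, the sample `SALES` table, and the `visible_rows` function are assumptions of mine for illustration only; a real system would enforce this inside the storage or query layer.

```python
# Hypothetical profiles: each one lists visible columns and a row-level filter.
PROFILES = {
    "analyst": {
        "columns": {"region", "amount"},
        "rows": lambda row: True,  # may see every row
    },
    "emea_support": {
        "columns": {"customer", "region"},
        "rows": lambda row: row["region"] == "EMEA",  # restricted to one region
    },
}

# Profiles are assigned to users, not the other way round.
USER_PROFILES = {"alice": "analyst", "bob": "emea_support"}

# Sample data standing in for a file or table in the big data store.
SALES = [
    {"customer": "C1", "region": "EMEA", "amount": 100},
    {"customer": "C2", "region": "APAC", "amount": 250},
]


def visible_rows(user: str, table: list) -> list:
    """Return only the rows and columns that the user's profile allows."""
    profile = PROFILES[USER_PROFILES[user]]
    return [
        {col: val for col, val in row.items() if col in profile["columns"]}
        for row in table
        if profile["rows"](row)
    ]
```

With this data, `visible_rows("bob", SALES)` returns a single EMEA row without the `amount` column, which is roughly what a view would give you in an RDBMS.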

Encryption & Masking

Access control gives you the ability to restrict access to individual data items to specific users, but it doesn’t solve data security completely. There are many complex data security requirements. For example, PCI-DSS is a mandatory compliance requirement for anyone dealing with payment cardholder data. One such cardholder data element is the PAN (primary account number). Under PCI-DSS, you can store the PAN in your system, but you can’t reveal or render this number in readable form to anyone. That’s why card numbers are always printed on your receipt as *********wxyz. It’s a complex requirement. You want to store the PAN accurately, and you want your users to have read access to it, but you don’t want them to be able to read the actual number. How would you implement this in your system as a security rule? Yes, you guessed correctly. The answer is encryption.

There are other requirements related to privacy. There are some weird examples, such as a retailer knowing that a girl is pregnant before her father knows it. A mobile operator has enough data to work out where you spend most of your time and to identify your relationships. There are enough regulations and laws to protect people’s privacy, and your big data system will have to comply with them. Masking and tokenization are the most suitable techniques to take you closer to such compliance requirements.
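
To show what masking and tokenization mean in practice, here is a small Python sketch. `mask_pan` keeps only the last four digits for display, and the hypothetical `TokenVault` class swaps a PAN for a random token that has no mathematical relationship to it. Both names are my own, and the in-memory dictionaries stand in for what would be a hardened, access-controlled vault in a real deployment.

```python
import secrets


def mask_pan(pan: str) -> str:
    """Masking: hide everything except the last four digits for display."""
    digits = pan.replace(" ", "").replace("-", "")
    return "*" * (len(digits) - 4) + digits[-4:]


class TokenVault:
    """Tokenization: replace the PAN with a random token and keep the mapping private."""

    def __init__(self) -> None:
        self._token_to_pan = {}
        self._pan_to_token = {}

    def tokenize(self, pan: str) -> str:
        # Reuse the existing token so the same PAN always maps to the same token.
        if pan not in self._pan_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._pan_to_token[pan] = token
            self._token_to_pan[token] = pan
        return self._pan_to_token[pan]

    def detokenize(self, token: str) -> str:
        # Only callers with access to the vault can recover the real number.
        return self._token_to_pan[token]
```

For example, `mask_pan("4111 1111 1111 1111")` gives `************1111`. Analysts can then join and count on the token without ever seeing the real card number.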

Network Security

Network security is all about securing your network from unauthorized access. It means drawing a virtual boundary around your network and restricting all access and entry into the network to one or more very well secured gates. By doing this you control every movement of data and information in and out of your network. Firewalls, proxies, and gateways are the best answers to such requirements. Apart from these considerations, you may also have to protect data in transit while it is being transmitted over the network.
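
The “secured gate” idea boils down to a rule check at the boundary. Here is a toy Python sketch of such a check using the standard `ipaddress` module. The network ranges, the port numbers, and the `is_allowed` function are made-up values for illustration; a real firewall or gateway does far more than this.

```python
import ipaddress

# Hypothetical policy: only these source networks and ports may pass the gate.
ALLOWED_SOURCES = [
    ipaddress.ip_network("10.20.0.0/16"),
    ipaddress.ip_network("192.168.5.0/24"),
]
ALLOWED_PORTS = {22, 8443}


def is_allowed(source_ip: str, port: int) -> bool:
    """Admit a connection only if both the source address and the port are on the list."""
    addr = ipaddress.ip_address(source_ip)
    return port in ALLOWED_PORTS and any(addr in net for net in ALLOWED_SOURCES)
```

Everything not explicitly allowed is rejected, which is the default-deny stance you want at the edge of a cluster.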

System Security

What comes under system security? It is about file system security, software, patches, updates, and so on. Obviously, outdated software and missing patches and updates leave room for vulnerabilities in your system. You need a mechanism for deployments, as much automation as possible, methods to easily identify such issues, and a mechanism to fix them. The file system is another key part of Hadoop security and needs special attention. We all understand that HDFS is not a fully POSIX-compliant file system and that data is stored in blocks which, by default, are exposed to everyone who has access to the data nodes. You will have to secure block-level data. Encryption/decryption may be a good solution in this case, but there are many complexities to address with respect to Hadoop.
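
As one small example of an automated check, here is a Python sketch that walks a local directory on a node and reports files that any local user can read or write. The function name `find_world_accessible` and the idea of pointing it at a data directory are my assumptions; it only inspects local POSIX permission bits and says nothing about HDFS-level permissions.

```python
import os
import stat


def find_world_accessible(root: str) -> list:
    """Return files under root that are readable or writable by 'other' users."""
    findings = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            # S_IROTH / S_IWOTH are the read and write bits for 'other'.
            if mode & (stat.S_IROTH | stat.S_IWOTH):
                findings.append(path)
    return sorted(findings)
```

A scheduled job could run a check like this on every node and feed the findings into your monitoring system.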

Infrastructure Security

Infrastructure security is mostly about controlling physical access to your infrastructure, but it is not limited to actual physical access. Remote access to your systems is as good as physical access, except that it does not expose them to physical damage. This is the vertical where you will have to plan for disaster recovery, backups and restores, and business continuity.

Audit & Monitoring

Audit and monitoring is an extremely complex and vast area in terms of security considerations. You will need an easy and workable mechanism for checking that everything in your system is in place and working as expected. You need automated and manual mechanisms for discovering and alerting on unusual events and activities. Just look at the PwC survey results: current employees and trusted insider partners account for a major share of the likely sources of security incidents. In such an environment, you are not secure just because you have implemented enough security measures to cover the items discussed above. No security system is perfect, and you will need a monitoring system in place to track who is doing what and to detect unusual patterns of activity so that you can strengthen your system further.

Implementing such a monitoring system can be extremely complex in the absence of the right methodology and technology. To implement an effective monitoring system, you will have to enable extended logging, and before that you will have to understand where and what to log. Once all the required logging is in place, you face a new problem: collecting those logs from the individual machines and systems into a central repository. After that comes another set of problems. How will you draw information out of the raw data? What reports will you prepare? What are your KPIs and the thresholds that trigger alerts? Who will have access to the reports and alerts, and what actions should be taken? Logs and audit information are even more valuable when an incident occurs, because you need them for investigation and forensics. Without logs and traces, you can’t complete your root cause analysis (RCA), can’t understand what other vulnerabilities the incident has caused, and can’t evaluate the damage.
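
To give a feel for the “thresholds to trigger alerts” part, here is a minimal Python sketch that scans centrally collected log lines and flags users with too many failed logins. The log format (`timestamp host user event`), the `LOGIN_FAILED` event name, and the threshold of 3 are all assumptions I made for the example.

```python
from collections import Counter

FAILED_LOGIN_THRESHOLD = 3  # hypothetical KPI threshold


def parse_line(line: str) -> dict:
    """Parse one line of the assumed format: '<timestamp> <host> <user> <event>'."""
    timestamp, host, user, event = line.split(maxsplit=3)
    return {"timestamp": timestamp, "host": host, "user": user, "event": event}


def failed_login_alerts(lines, threshold=FAILED_LOGIN_THRESHOLD) -> list:
    """Return users whose failed-login count reaches the threshold."""
    counts = Counter(
        record["user"]
        for record in map(parse_line, lines)
        if record["event"] == "LOGIN_FAILED"
    )
    return sorted(user for user, n in counts.items() if n >= threshold)
```

The same pattern (parse, aggregate, compare against a threshold) extends to other events, but deciding which events matter and who receives the alert is the hard part.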

In this post I tried to discuss the basics of the various aspects of enterprise security that your implementation of a big data environment must address. The list is not limited to what I discussed above. There are many more considerations: for example, you will need a well-defined security policy and practice, you will have to train your people about security and the associated risks and obligations, and you will need a risk management practice in place. This could be a very long list, and there is enough material on the internet. But I wanted to summarize the key concerns of enterprises regarding security, and the best reference I found was in the TOGAF™ 9.1 documents. Here is a summary of the Generally Accepted Areas of Concern as per TOGAF™ 9.1.

Any implementation of a Big Data solution will have to address all of the above concerns to be successful.
Good luck with your implementations.

Wednesday, February 26, 2014

Big Data – Enterprise security requirements
