With the recent introduction of GDPR, data security is becoming a bigger concern for enterprises. In this post, you will learn about True Delegation and how it helps enterprises achieve their data security and governance requirements. You will also learn about the challenges, the Hadoop authorization landscape, and the unique solution that AtScale provides.
The term True Delegation comes from the fact that authorization is enforced without the need to configure the ad-hoc delegation functionality available in some SQL-on-Hadoop data engines. One example is Impala’s authorized_proxy_user_config startup option. With True Delegation, an authenticated user’s credentials flow seamlessly through to authorization enforcement checkpoints.
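For context, that Impala mechanism requires an administrator to list, at daemon startup, which proxy accounts may impersonate which end users. A hypothetical configuration fragment (the account and user names here are purely illustrative):

```
# impalad startup flag: allow the 'bi_service' proxy account to submit
# queries on behalf of users alice and bob (names are illustrative)
--authorized_proxy_user_config=bi_service=alice,bob
```

Every change to the user population means revisiting this kind of static configuration, which is part of what makes ad-hoc delegation brittle.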
Successful data security and governance rely on two key elements: authentication and authorization.
- Authentication is the process of verifying the identity of an end user.
- Authorization is the process of applying policies to the verified user to ensure that they are only able to access data that they are explicitly entitled to.
The Problem with Credential Delegation
For the purpose of this article, we’ll assume that the data being accessed by an end user resides in a Hadoop data lake. But keep in mind that the same principles apply regardless of where the big data ultimately resides.
A challenge with authorization policy enforcement arises when a system (Service A), to which an end user (Client) has already been authorized to connect, must connect to another service (Service B) to retrieve data on the end user’s behalf. This challenge is known as the double-hop problem because the end user goes through one system (first hop) to access data in a second system (second hop). What must be maintained are the original client’s credentials, not the credentials of the intermediate system, because the end user’s rights to access data are typically already established and should be leveraged to maintain adherence to security and governance rules.
Here is a quick breakdown of the double-hop problem in action.
- Steps 1 & 2: The end user (Client) logs on and is authenticated (1) and authorized (2) to access the networks and systems.
- Step 3: The end user accesses the first system (Service A) using their own credentials (first hop).
- Step 4: Service A uses its own credentials to access the system where the data resides (Service B) (second hop).
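The steps above can be sketched in a few lines of Python. This is a minimal illustration of the credential loss, not AtScale code; all service and user names are hypothetical:

```python
# Minimal sketch of the double-hop problem (all names hypothetical).
# Service A authenticates the client, but then connects onward to
# Service B as itself, so the client's identity is lost at the second hop.

def service_b_read(credentials: str, table: str) -> str:
    # Service B can only authorize against the credentials it was handed.
    return f"rows from {table}, authorized as '{credentials}'"

def service_a_query(client_credentials: str, table: str) -> str:
    # Hop 1: the client's credentials authenticate them to Service A...
    assert client_credentials == "alice"
    # Hop 2: ...but Service A connects to Service B with its own account.
    return service_b_read("service_a_account", table)

print(service_a_query("alice", "hr.salaries"))
# → rows from hr.salaries, authorized as 'service_a_account'
```

Because Service B only ever sees `service_a_account`, any policy restricting what alice may read in hr.salaries cannot be enforced.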
Figure 1 above visually shows this problem along with the painful result: the Client’s credentials are lost along the way as the connection passes through Service A and on to Service B.
The double-hop problem is a ‘problem’ because the data in a data lake is a mixture of data from multiple sources. This could be financial, marketing, HR, inventory... whatever data an organization chooses to throw in there. Authorization rules that limit what data end users can access in the originating systems (for example, preventing non-HR employees from seeing salary data) can be lost during the double hop, meaning end users could potentially access data in the data lake that they are not authorized to see.
In the Hadoop landscape, there is currently limited, if any, support for overcoming the Double-Hop problem. And to the extent that support is available, it is brittle and requires cumbersome configuration. This is especially problematic in the typical case where the end-user population is very dynamic.
Figure 2 above shows how AtScale’s True Delegation solves the double-hop problem and successfully maintains the end user’s credentials when connecting through the AtScale Engine (Service A) to a Hadoop cluster (Service B). By presenting the client’s credentials when connecting to Hadoop, we are able to leverage, and seamlessly integrate with, ANY authorization mechanism being used by Hadoop. This includes Sentry, Ranger, HDFS permissions, and HDFS ACLs.
Once AtScale connects to the SQL-on-Hadoop data engine, query processing progresses through its normal stages and delivers to the end user the data they requested (through whichever analytics or Business Intelligence tool they choose to use: Tableau, Excel, etc.). The ‘normal stages’ for AtScale True Delegation include checking with the authorization service to ensure the user accessing the data is authorized to do so. Unauthorized users will fail to access the data, much to the relief of the IT and data systems owners responsible for adhering to the latest data security standards. The beauty of True Delegation is that enterprises can now configure their authorization policies with the tool of their choice, and it will simply work without any additional management overhead in the AtScale system.
Figure 3 above shows the case where authorization and auditing are handled by Sentry. The Sentry plugin integrates into the data engine via Sentry’s plugin architecture, and Sentry’s audit trail is augmented by additional data available within the AtScale system. Ranger integration employs a similar architecture.
With the Authorization Service in place, Figure 4 shows how authorization checks are made after a submitted query has been parsed but before any data is accessed. Authorization checks are made against the client credentials provided by the connection to the SQL-on-Hadoop engine; with True Delegation, those credentials are those of the original end user working with a BI client! That end user will see data only if the authorization checks allow it, with the policies managed entirely through the customary tools of whichever authorization service is in use.
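That ordering, parse first, authorize against the delegated identity second, touch data last, can be sketched as follows. This is a hedged illustration only; the policy table, parser, and names are invented for this example and are not AtScale internals:

```python
# Sketch: authorization runs after parsing and before data access,
# against the *delegated* end-user identity. Policies are invented.
POLICIES = {
    "alice": {"sales.orders"},
    "bob": {"sales.orders", "hr.salaries"},
}

def parse(sql: str) -> str:
    # Toy parser: extract the table name from "SELECT ... FROM <table>".
    return sql.split("FROM")[1].split()[0]

def run_query(delegated_user: str, sql: str) -> str:
    table = parse(sql)                                   # 1. parse
    if table not in POLICIES.get(delegated_user, set()): # 2. authorize
        raise PermissionError(f"{delegated_user} may not read {table}")
    return f"data from {table}"                          # 3. access data

print(run_query("bob", "SELECT * FROM hr.salaries"))
# → data from hr.salaries
# run_query("alice", "SELECT * FROM hr.salaries") raises PermissionError
```

Because the check uses the delegated user rather than a shared service account, the second hop enforces exactly the policies the originating systems intended.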
Figure 4. Query Processing Stages
Authorization in the Adaptive Cache World
That’s all well and good when running ordinary queries against raw data in a data lake. However, there’s an additional challenge when queries employ AtScale’s Adaptive Cache.
As a quick reminder, AtScale Adaptive Cache allows business users to get interactive query performance no matter what type of data they access: aggregate or atomic. It is powered by the AtScale engine, and it is the first of its kind to provide speed at unlimited scale without requiring any data movement.
How do we know whether a particular client is authorized to access aggregated data? The challenge is that the special aggregate tables are created automatically by AtScale and their data is stored in tables managed by AtScale. Canary Queries come to the rescue. A Canary Query is a highly efficient version of the user’s query that is executed solely to establish the user’s authorization to see the data in question. Given a query that AtScale decides should draw data from the Adaptive Cache, AtScale first generates and executes a Canary Query that bypasses the Adaptive Cache and establishes whether the user is authorized to see that data. The Canary Query is generated in a form that forces all necessary authorization checks without actually retrieving any of the raw data. That’s what makes it so efficient, and the negligible cost is well worth the protection provided. Canary Queries are also useful for providing True Delegation when a customer’s preferred SQL-on-Hadoop engine does not integrate with the customer’s preferred authorization mechanism. For example, there is currently no integration of Sentry with Spark SQL; in this case, Canary Queries can be run against Hive to perform the authorization checks, and the full query can then be run against Spark.
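One common way to reference the same tables and columns as the user’s query, so the engine’s authorization hooks fire, while guaranteeing no rows come back, is to wrap the query with an always-empty result, such as a LIMIT 0. The sketch below illustrates that general idea; the rewrite rule is hypothetical and is not AtScale’s actual implementation, which would operate on a parsed query tree rather than on SQL text:

```python
# Sketch of a canary-style rewrite (illustrative only): keep the same
# tables/columns so authorization checks fire, but return zero rows.
def make_canary(user_sql: str) -> str:
    # A real implementation would manipulate the parsed query plan,
    # not the raw SQL string.
    return f"SELECT * FROM ({user_sql.rstrip(';')}) canary LIMIT 0"

user_sql = "SELECT region, SUM(amount) FROM sales.orders GROUP BY region;"
print(make_canary(user_sql))
# → SELECT * FROM (SELECT region, SUM(amount) FROM sales.orders GROUP BY region) canary LIMIT 0
```

If the canary succeeds against the raw tables, the full query can safely be answered from the Adaptive Cache; if it fails the authorization check, the user never sees the cached aggregates.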
AtScale has provided True Delegation to our customers since version 4.0. The AtScale True Delegation solution:
- Ensures enforcement of data authorization policies for BI users
- Seamlessly integrates with Ranger, Sentry, and HDFS authorization
- Extends enforcement into the AtScale Adaptive Cache
- Adds no management overhead
Some of the world’s largest companies, with extremely robust data security and governance requirements, encouraged us to provide this capability. Check out the white paper below on Delegated Authorization.