

This article is the second in a two-part series on integrating Databricks into AWS environments, analyzing the alternatives the product offers for architectural design. The first part covered topics related to architecture and networking; this second part covers security and general administration.
The contents of each article are as follows:
First delivery:
This delivery:
The first article can be visited at the following link [1].
This section will analyze the security of the information processed in Databricks from two perspectives: while the information is in transit and while the data is at rest.
This section covers the encryption of the data sent from the cluster to the buckets, so that policies can be defined on the buckets to reject any incoming unencrypted communication. The configuration required for the root bucket differs from that of the buckets where the data is stored (raw, prepared and trusted), since the root bucket is part of the DBFS.
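As a reference, the bucket-side policy could look like the following sketch, which denies any PutObject request that does not declare SSE-KMS encryption. The bucket name and the use of boto3 are assumptions made for illustration; the same JSON document can be attached through the console or through Terraform.

import json
import boto3

# Hypothetical bucket name used only for illustration
BUCKET = "my-datalake-raw-bucket"

# Deny any object upload that does not request SSE-KMS encryption
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
            },
        }
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))

Example of a bucket policy that rejects unencrypted uploads.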
In order to encrypt the communication from the client side, we must include a file called core-site.xml in each node of the cluster through an init script, either global or cluster-specific. This configuration applies to DBFS, since the root bucket is part of it.
In this file we must specify the algorithm to use and the KMS key, which can be a customer-managed key. An example of the content of this file could be the following:
<configuration>
  <property>
    <name>fs.s3a.server-side-encryption-algorithm</name>
    <value>SSE-KMS</value>
  </property>
  <property>
    <name>fs.s3a.server-side-encryption.key</name>
    <value>${kms_key_arn}</value>
  </property>
</configuration>
Definition of the algorithm and KMS key to be used in the root bucket encryption.
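A possible way to distribute this file to every node is to register it as a global init script through the Databricks Global Init Scripts API, as in the sketch below. The workspace URL, the token and the destination path of core-site.xml are assumptions and may differ in your runtime.

import base64
import requests

# Assumed values, for illustration only
HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"
KMS_KEY_ARN = "arn:aws:kms:eu-west-1:111111111111:key/<key-id>"

# Bash init script that writes the core-site.xml shown above on each node.
# The destination path is an assumption and may vary between runtimes.
script = f"""#!/bin/bash
cat > /databricks/hadoop/conf/core-site.xml <<EOF
<configuration>
  <property>
    <name>fs.s3a.server-side-encryption-algorithm</name>
    <value>SSE-KMS</value>
  </property>
  <property>
    <name>fs.s3a.server-side-encryption.key</name>
    <value>{KMS_KEY_ARN}</value>
  </property>
</configuration>
EOF
"""

resp = requests.post(
    f"{HOST}/api/2.0/global-init-scripts",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "core-site-sse-kms",
        "script": base64.b64encode(script.encode()).decode(),
        "enabled": True,
    },
)
resp.raise_for_status()

Example of registering the init script that distributes core-site.xml to the cluster nodes.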
In the case of the Data Lake buckets, we must add some Spark configuration to the cluster so that it sets up the encryption.
An example configuration could be the following, where at the bottom we include the same information as in the previous section, i.e. the encryption algorithm and key.
spark_conf = {
"spark.databricks.repl.allowedLanguages" : "python,sql",
"spark.databricks.cluster.profile" : "serverless",
"spark.databricks.passthrough.enabled" : true,
"spark.databricks.pyspark.enableProcessIsolation" : true,
"spark.databricks.hive.metastore.glueCatalog.enabled" : true,
"spark.hadoop.aws.region" : var.aws_region,
#caches for glue to run faster
"spark.hadoop.aws.glue.cache.db.enable" : true,
"spark.hadoop.aws.glue.cache.db.size" : 1000,
"spark.hadoop.aws.glue.cache.db.ttl-mins" : 30,
"spark.hadoop.aws.glue.cache.table.enable" : true,
"spark.hadoop.aws.glue.cache.table.size" : 1000,
"spark.hadoop.aws.glue.cache.table.ttl-mins" : 30,
#encryption
"spark.hadoop.fs.s3a.server-side-encryption.key" : var.datalake_key_arn
"spark.hadoop.fs.s3a.server-side-encryption-algorithm" : "SSE-KMS"
}
Definition of the algorithm and key to be used in the encryption of the Data Lake buckets.
This configuration can also be included through the Databricks UI in the Spark config section when creating the cluster.
There is also the possibility of adding an additional layer of security by encrypting the information exchanged between worker nodes when task execution requires it.
To facilitate this, you can make use of the AWS EC2 instance types that encrypt traffic between them by default. These are the following:
This section describes how to encrypt the data when it is at rest. The methods for encrypting this data are different for the Data Plane and the Control Plane, since they are managed by different entities (the client and Databricks, respectively).
For the encryption of our persistence layer within the Data Plane, it is necessary to create a key in KMS and register it in Databricks through the Account API. It is possible to choose whether to encrypt only the root bucket or both the root bucket and the EBS volumes. This differentiation is achieved through the policy attached to the key itself, either granting only the permissions needed to encrypt the buckets or also including the possibility of encrypting EBS.
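As an illustrative sketch, registering that key through the Account API could look like the following; the account id, the credentials, the key ARN and the alias are placeholders, and the payload should be checked against the Account API documentation for your account.

import requests

# Placeholder values used for illustration
ACCOUNT_ID = "<databricks-account-id>"
USER = "<account-admin-user>"
PASSWORD = "<account-admin-password>"
KEY_ARN = "arn:aws:kms:eu-west-1:111111111111:key/<key-id>"

resp = requests.post(
    f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}/customer-managed-keys",
    auth=(USER, PASSWORD),
    json={
        "aws_key_info": {"key_arn": KEY_ARN, "key_alias": "databricks-storage-key"},
        # STORAGE covers the root bucket (and EBS, if the key policy allows it)
        "use_cases": ["STORAGE"],
    },
)
resp.raise_for_status()
print(resp.json())  # Returns a customer_managed_key_id that can be referenced later

Example of registering a customer-managed key for Data Plane encryption through the Account API.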
The features of this functionality are as follows:
All the information about this type of encryption is in the following link [2].
This section describes how to encrypt all the elements hosted in the Control Plane (notebooks, secrets, SQL queries and query history).
By default the Control Plane hosts a Databricks Managed Key (DMK) for each workspace, an encryption key managed by Databricks. A user-managed Customer Managed Key (CMK) must then be created and registered through the Databricks Account API. Databricks then uses both keys to encrypt and decrypt the Data Encryption Key (DEK), which is stored as an encrypted secret.
The DEK is cached in memory for various read and write operations and is evicted from it at certain intervals so that the consequent operations require a new call to the client infrastructure to obtain the CMK again. Therefore, in the case of losing permissions to this key, once the refresh time interval is met, access to the elements hosted on the Control Plane will be lost.
This can be seen in more detail in the following image:
It is important to highlight that if you want to add an encryption key to an already created workspace, you will have to make an additional PATCH request through the Account API.
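A hedged sketch of that PATCH request is shown below; the workspace id and the customer_managed_key_id obtained when registering the key are placeholders, and the field name should be verified against the Account API of your deployment.

import requests

ACCOUNT_ID = "<databricks-account-id>"
WORKSPACE_ID = "<workspace-id>"
KEY_ID = "<customer-managed-key-id>"  # Returned when the CMK was registered

resp = requests.patch(
    f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}/workspaces/{WORKSPACE_ID}",
    auth=("<account-admin-user>", "<account-admin-password>"),
    # Attaches the key used to encrypt notebooks, secrets and queries (Control Plane)
    json={"managed_services_customer_managed_key_id": KEY_ID},
)
resp.raise_for_status()

Example of attaching an encryption key to an existing workspace.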
All the information about this type of encryption is in the following link [3].
The development of pipelines or workflows generally involves the need to pass critical or confidential information between the different phases of the process. As in other cloud environments, this information is usually stored in a Secret Scope to avoid compromising its integrity. An example of the information handled in these Secret Scopes could be the access credentials to a database.
Databricks enables end-to-end management of both Secret Scopes and secrets on AWS via the Databricks API. It should be noted that full Secret Scope management permissions (Manage) are granted by default unless otherwise declared at creation time (the permissions assignable to users are Read, Write and Manage).
For Premium or Enterprise subscriptions, Databricks offers the possibility to manage user permissions in a more granular way by means of the ACL (Access Control List), thus being able to assign different levels of permissions depending on the group of users.
By default these secrets are stored in an encrypted database managed by Databricks, and access to it can be managed through Secret ACLs. Alternatively, this type of information can be managed directly from AWS KMS by storing the key-value pairs there and controlling access to them via IAM.
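As an orientation, creating a scope, storing a secret and granting a read ACL through the Databricks Secrets API could look like the sketch below; the host, token, scope and group names are illustrative placeholders.

import requests

HOST = "https://<workspace-url>"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# 1. Create the Secret Scope (backed by the Databricks-managed encrypted database)
requests.post(
    f"{HOST}/api/2.0/secrets/scopes/create",
    headers=HEADERS,
    json={"scope": "datalake", "initial_manage_principal": "users"},
).raise_for_status()

# 2. Store a secret, e.g. the access credentials to a database
requests.post(
    f"{HOST}/api/2.0/secrets/put",
    headers=HEADERS,
    json={"scope": "datalake", "key": "db-password", "string_value": "<value>"},
).raise_for_status()

# 3. Grant READ permission to a group (granular ACLs require Premium or Enterprise)
requests.post(
    f"{HOST}/api/2.0/secrets/acls/put",
    headers=HEADERS,
    json={"scope": "datalake", "principal": "data-engineers", "permission": "READ"},
).raise_for_status()

Example of managing a Secret Scope, a secret and its ACL through the Databricks API.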
For all those administration tasks that do not need to be performed by a particular member of the organization, we can use the Service Principal.
A Service Principal is an entity created for use with automated tools, running jobs and applications, i.e. a non-nominal administration account. It is possible to restrict the accesses of this entity through permissions. A Service Principal cannot access Databricks through the UI, its actions are only possible through the APIs.
The recommended way to work with this entity is for an administrator to create a token on behalf of the Service Principal so that the Service Principal can authenticate with the token and perform the necessary actions.
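A sketch of that flow using the Token Management API could look like the following; the application id of the Service Principal and the admin token are placeholders.

import requests

HOST = "https://<workspace-url>"
ADMIN_TOKEN = "<admin-personal-access-token>"
SP_APPLICATION_ID = "<service-principal-application-id>"

# An administrator creates a token on behalf of the Service Principal
resp = requests.post(
    f"{HOST}/api/2.0/token-management/on-behalf-of/tokens",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    json={
        "application_id": SP_APPLICATION_ID,
        "lifetime_seconds": 3600,
        "comment": "Token for automated jobs",
    },
)
resp.raise_for_status()
sp_token = resp.json()["token_value"]  # The Service Principal authenticates with this value

Example of creating a token on behalf of a Service Principal.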
The following limitations have been encountered when working with a Service Principal:
Databricks, like other services that offer the possibility of deploying a serverless cluster, also offers auto-scaling as a common functionality. This functionality is completely managed by Databricks, which evaluates the different situations that must occur in order to decide whether to auto-scale the number of nodes in the cluster (scaling out) or the type of nodes (scaling up).
Databricks stands out from the competition in this functionality because it offers an optimized auto-scaling that achieves, on average, savings 30% higher than the rest of the solutions (although it also offers the possibility of implementing the standard version). The main difference between the optimized and the standard version is the incorporation of scale-down.
The reason other solutions do not include scale-down is that removing nodes may compromise the process: even if the node in question is not performing any task, it may contain cached information or shuffle files that will be needed by future processes on other nodes. This action is considered sensitive because removing a node that contains information used by other processes would force reprocessing, which would generate higher costs than the savings obtained from the scale-down.
In its latest update, Databricks adds a validation step when identifying the least required nodes during periods of cluster underutilization, taking into account the following:
This allows for a more aggressive scale-down, since the nodes that can be removed without jeopardizing the integrity of the process are identified at all times. Databricks still takes preventive measures in this type of scale-down and applies extra waiting times to validate with greater confidence that the action is correct (Job Cluster: 30”, All-Purpose Cluster: 2′ 30”).
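To benefit from this behaviour, auto-scaling simply needs to be enabled when defining the cluster. The sketch below does so through the Clusters API; the node counts, instance type and Spark version are illustrative values.

import requests

HOST = "https://<workspace-url>"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# Cluster definition with auto-scaling enabled (values are illustrative)
cluster_spec = {
    "cluster_name": "autoscaling-demo",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "m5.xlarge",
    "autoscale": {
        "min_workers": 2,   # Floor kept even when the cluster is underutilized
        "max_workers": 10,  # Ceiling reached during pronounced peaks
    },
    "aws_attributes": {"availability": "SPOT_WITH_FALLBACK"},
}

resp = requests.post(f"{HOST}/api/2.0/clusters/create", headers=HEADERS, json=cluster_spec)
resp.raise_for_status()

Example of enabling auto-scaling when creating a cluster through the Clusters API.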
When auto-scaling, whether scale-up or scale-down, both the degree of general CPU saturation/underutilization and that of the individual nodes are taken into account, and the following scenarios can be identified:
Databricks’ optimized auto-scaling is especially recommended for workloads with pronounced peaks, since the cluster scales out quickly to absorb the demand and, if the peak is momentary and the demand does not hold, those extra nodes would otherwise remain operational even though they will no longer process any information:
In the pursuit of overall cluster efficiency, Databricks introduces initiatives such as grouping nodes according to their usage, so that task assignment is as optimal as possible and other actions, such as auto-scaling, can be carried out more efficiently and without compromising the integrity of the processes.
This grouping of nodes according to usage is done for two main reasons, the prioritization in the assignment of tasks to nodes (priority order is indicated in the following image), and the correct identification of saturation or generalized underutilization of the cluster nodes. The grouping is as follows:
As mentioned above, there are two major advantages to be gained from this type of grouping:
As an alternative to this clustering, Databricks offers an external storage support (External Shuffle Service – ESS) in order to have the degree of saturation of the node as the only factor to take into account when deciding which node to eliminate in the scaling-down processes.
In order to further define the characteristics of the cluster and adapt better to the different use cases that may arise, there is an alternative called EC2 Fleet.
It is based on the ability to request instances with a wide range of options in terms of instance types (unlimited) and the Availability Zones where they should be purchased. It also allows you to define whether the instances are On-Demand or Spot, and in both cases to specify the maximum acquisition price:
The use of EC2 Fleet does not imply an extra cost, you only pay for the cost of the instances.
There are the following limitations related to this service:
It should be noted that this solution is in Private-Preview at the time of writing.
In order to deploy Spark clusters, Databricks relies on DBFS as a distributed storage system, which in turn relies on third-party storage such as S3 both to launch the cluster and to perform the required tasks.
DBFS allows storage sources such as S3 to be linked through mounts, which makes it possible to work with the files as if they were stored locally even though they physically remain in S3. This simplifies both authentication when querying the files and the paths used to reference them, avoiding having to load the files into DBFS each time they are consulted.
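A minimal sketch of such a mount from a notebook could look like the following, assuming the cluster already has an instance profile with access to the bucket; the bucket and mount point names are illustrative.

# Run from a Databricks notebook (dbutils is available there)
# Assumes the cluster's instance profile grants access to the bucket
dbutils.fs.mount(
    source="s3a://my-datalake-raw-bucket",  # Illustrative bucket name
    mount_point="/mnt/raw",
)

# The files can now be referenced through the mount point
display(dbutils.fs.ls("/mnt/raw"))
df = spark.read.parquet("/mnt/raw/sales/")

Example of mounting an S3 bucket on DBFS from a notebook.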
In addition, the information stored by the cluster when it is deployed, such as init scripts, generated tables, metadata, etc., can be kept in storage sources such as S3 so that it remains available even when the cluster is not operational.
When choosing how and where to deploy the metastore Databricks proposes the following alternatives:
It is important to note that a Customer Managed VPC is required for the external Hive and Glue options. See the first article [1] for more information.
The external metastore alternatives, either Hive or Glue, can be interesting when access to an existing metastore is needed, when these catalogs are to be centralized, or when more autonomy and control are sought. In addition, the implementation with Glue can facilitate access to this metadata from other AWS services, such as Athena, and also offers lower access latency, since the communication can take place over the AWS private network.
This section briefly describes where to access in detail the information about the costs that Databricks will charge the customer.
In order to access this information it is necessary to complete the necessary configuration to generate and host these logs within the infrastructure deployed in AWS. All this information can be found at the following link [4].
The logs are generated daily (the workspace must have been operational for at least 24 hours) and record the compute hours and the rate applied to those hours according to the cluster characteristics and the package (SKU) assigned by Databricks.
Field | Description |
---|---|
workspaceId | Workspace Id |
timestamp | Established frequency (hourly, daily,...) |
clusterId | Cluster Id |
clusterName | Name assigned to the Cluster |
clusterNodeType | Type of node assigned |
clusterOwnerUserId | Cluster creator user id |
clusterCustomTags | Customizable cluster information labels |
sku | Package assigned by Databricks in relation to the cluster characteristics. |
dbus | DBUs consumption per machine hour |
machineHours | Cluster deployment machine hours |
clusterOwnerUserName | Username of the cluster creator |
tags | Customizable cluster information labels |
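Once these CSVs are delivered to the configured S3 bucket, they can be analyzed like any other dataset. The sketch below aggregates DBU consumption per cluster and SKU with pandas; the bucket and object key are illustrative assumptions.

import boto3
import pandas as pd

# Illustrative location of the delivered billable usage logs
BUCKET = "my-billing-logs-bucket"
KEY = "billable-usage/csv/usage-2022-01.csv"

obj = boto3.client("s3").get_object(Bucket=BUCKET, Key=KEY)
usage = pd.read_csv(obj["Body"])

# Total DBUs and machine hours consumed per cluster and SKU
summary = (
    usage.groupby(["clusterName", "sku"])[["dbus", "machineHours"]]
    .sum()
    .sort_values("dbus", ascending=False)
)
print(summary)

Example of aggregating DBU consumption from the billable usage logs.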
This section will describe the different alternatives and tools that can be used to deploy a functional and secure Databricks environment on AWS.
This alternative is characterized by the execution of a CloudFormation script created by Databricks that makes it possible to create either a Databricks Managed VPC or a Customer Managed VPC.
For both modalities, Databricks creates a cross-account role with the necessary permissions (which differ for each alternative, those of the Customer Managed VPC option being more limited) and a Lambda function that, through calls to the Databricks APIs, provisions the necessary infrastructure, configures it and registers it within the Control Plane. All these calls use the cross-account role to authenticate against the customer's AWS account.
CloudFormation
CloudFormation is the infrastructure-as-code service from AWS. It would be possible to reuse the code created by Databricks in the Quickstart, but this alternative would be too intensive in API calls, as there are no specific CloudFormation modules for deploying Databricks.
Terraform
Terraform is an infrastructure-as-code tool that is agnostic to different cloud providers, i.e. it can be used with any of them.
Terraform has modules available for Databricks that simplify deployment and reduce the need for Databricks API calls but do not completely eliminate them. For example, API calls will have to be made in order to use Credentials Passthrough without an Identity Provider (SSO).
As part of this initiative, a public GitHub repository [5] has been created to deploy an environment using this tool.
The infrastructure to be deployed has been divided into different modules for simplicity, the main ones being the following:
Therefore, the high-level architecture diagram is as follows:
Full deployment can be accomplished in less than 5 minutes.
When estimating the average launch and availability time of the instances, we found it appropriate to differentiate them into subgroups, given the similarity of the times observed within each of them:
The following table specifies the minimum requirements regarding networking, compute and storage that, from our point of view, a client deployment must have. It is important to emphasize that this differs from the GitHub repository presented in the Deployment chapter.
The differences are as follows:
GROUP | SUBGROUP | MINIMUM REQUIREMENTS | EXTERNAL NEEDS |
---|---|---|---|
Networking | Subnets | 3 private subnets across 3 Availability Zones | N/A |
Networking | VPC Endpoints | S3 Gateway Endpoint, STS Interface Endpoint | N/A |
Networking | Private Links | VPC Endpoint for the cluster (Data Plane) to Control Plane connection, VPC Endpoint for the connection with the REST APIs (Control Plane), VPC Endpoint for the user to frontend connection (Control Plane) | Contact with an authorized Databricks support contact |
Networking | VPN | Site-to-site VPN to connect the customer's network with the transit VPC | Communication with the customer |
Compute | Minimal instances | Zero, with one auto-scaling group | N/A |
Storage | Buckets / Instances | S3 / EBS volumes | N/A |
Minimum requirements for Databricks deployment on AWS
[1] Databricks on AWS – Architectural Perspective (part 1) [link] (January 13, 2022)
[2] EBS and S3 Encryption on Data Plane [link]
[3] Control Plane Encryption [link]
[4] Billing Logs Access [link]
[5] Databricks Deployment on AWS – GitHub Repository Guide [link]
I started my career with the development, maintenance and operation of multidimensional databases and Data Lakes. From there I started to get interested in data systems and cloud architectures, getting certified in AWS as a Solutions Architect and entering the Cloud Native practice within Bluetab.
I am currently working as a Cloud Engineer developing a Data Lake with AWS for a company that organizes international sporting events.
During the last few years I have specialized as a Data Scientist in different sectors (banking, consulting,…) and the decision to switch to Bluetab was motivated by the interest in specializing as a Data Engineer and starting to work with the main Cloud providers (AWS, GCP and Azure).
I thus joined the Data Engineer practice group and have been collaborating on real data projects such as the current one with the Olympics.