Blog

Deploying Apache Spark on OBLV Deploy

A technical tutorial on how to deploy Apache Spark in a Trusted Execution Environment

7 mins read

Jan 30, 2025

As the volume of data generated by businesses and individuals continues to grow exponentially, so do the challenges in efficiently processing and securing that data. Big data tools like Apache Spark have revolutionised data processing by enabling large-scale distributed computation across clusters, making it possible to process vast datasets at unprecedented speeds.

However, handling such large volumes of sensitive data—whether it be financial transactions, personal information, or proprietary business data—comes with heightened security risks. Traditional data security methods are often insufficient to safeguard against modern threats, especially when data is being processed in real-time.

In this article, we’ll demonstrate how deploying Apache Spark on OBLV Deploy can address these security challenges. We’ll explore how OBLV Deploy provides Confidential Computing using secure enclaves to protect data in real-time processing environments, outline the steps we followed for secure deployment, and share best practices for leveraging this approach in your own big data projects.

The Critical Need for Secure Data Processing in Real-Time Analytics

In industries like healthcare, finance, and government, where Apache Spark is widely used for data analytics, the need for secure data processing is critical. Data must be protected across its entire lifecycle, which includes three fundamental states:

Data at Rest: This refers to data stored on physical drives or databases, typically protected through encryption mechanisms.
Data in Transit: While being transmitted across networks, data is safeguarded using protocols like Transport Layer Security (TLS), ensuring it cannot be intercepted or altered during transmission.
Data in Use: This is the most vulnerable state, as data is actively processed in memory and typically not encrypted. This is where secure enclaves come into play, isolating and encrypting the computation environment to prevent unauthorised access even if the infrastructure is compromised.

Without securing data in all three states, organisations risk exposure to breaches, leaks, and unauthorised access. Robust deployment solutions are essential to mitigate these risks, especially during real-time data processing tasks conducted by platforms like Apache Spark.

Why Use OBLV Deploy for Secure Data Processing?

OBLV Deploy is a robust platform that enhances security by leveraging secure enclaves, providing a secure, trusted execution environment for applications like Apache Spark. It allows users to process sensitive data in memory without exposing it to unauthorised actors. The platform’s seamless integration with existing infrastructure makes it a compelling solution for organisations aiming to safeguard their data while utilising the power of big data tools like Apache Spark.

In this article, we’ll demonstrate how Apache Spark can be deployed on OBLV Deploy, showcasing its ability to enhance data security throughout the computation process. We’ll cover the general setup, integration with an S3 filesystem for secure storage, and best practices for optimising your deployment. This demonstration emphasises the advantages of secure data processing without compromising performance or exposing sensitive information.

Understanding Secure Enclaves

What Are Secure Enclaves?

Secure enclaves are hardware-based isolated environments designed to protect code and data from outside access, including from the host operating system and even cloud administrators. Data processed within these enclaves is encrypted, and access to it is tightly restricted to the enclave itself. This isolation ensures that sensitive computations, such as encryption keys, personal data, or proprietary algorithms, remain secure, even in environments where the infrastructure may be compromised. In fields like finance and healthcare, where data confidentiality is paramount, secure enclaves are essential for protecting data while it is actively processed. By leveraging enclaves, organisations can confidently process sensitive data without exposing it to external threats.

Comparing OBLV Deploy and AWS Nitro Enclaves

OBLV Deploy:

OBLV Deploy is a versatile platform designed to facilitate secure computation using secure enclaves. It offers compatibility with various cloud providers and on-premise environments, allowing users to run applications securely without making significant changes to their codebase or toolchain. The platform abstracts the complexity of managing enclave technologies, making it easier for developers and IT operations teams to integrate secure enclaves into their workflow. OBLV Deploy provides strong in-memory data protection, preventing unauthorised access even by the underlying infrastructure. OBLV Deploy also enforces networking restrictions, allowing only authorised connections and providing an attested guarantee that neither the code nor the connectivity has been tampered with.

AWS Nitro Enclaves:

AWS Nitro Enclaves is a solution from Amazon Web Services (AWS) that enables users to create isolated environments within EC2 instances to process highly sensitive data. These enclaves have no persistent storage, no external networking, and are designed to reduce the attack surface for security-sensitive applications. Nitro Enclaves utilise the AWS Nitro Security Module (NSM), a specialised hardware component designed to perform cryptographic operations and attestation, ensuring that only trusted code is running within the enclave. However, Nitro Enclaves are limited to AWS infrastructure, meaning users need to operate within the AWS ecosystem.

Deploying Apache Spark on OBLV Deploy

Our demonstration showcases the process we followed when deploying Apache Spark on OBLV Deploy and connecting the Spark cluster to an S3 filesystem, where input files are securely stored using an AWS Key Management Service (KMS) key. The KMS key policy is configured so that the data can only be decrypted within the secure enclave, ensuring the highest level of protection.

To prepare for deploying Apache Spark on OBLV Deploy, we configured the following:

AWS EKS: A Kubernetes cluster, preferably AWS EKS for manageability, is required to deploy the application. We followed AWS’s setup documentation to create the EKS environment.
OBLV Deploy Stack: Once the EKS cluster was up and running, we installed OBLV Deploy following the official documentation, ensuring compatibility between the EKS cluster and OBLV Deploy.

With both components set up, we proceeded to configure the Docker image and Kubernetes manifests for Apache Spark.

Preparing Docker Image

To enable Apache Spark to interact with the S3 filesystem securely, we created a Docker image containing Apache Spark and additional dependencies. These dependencies allow Spark to read from and write to S3, ensuring compatibility with the AWS infrastructure.

Below is the Dockerfile we used for this image, which includes configurations for Spark and AWS S3 access. We are using Apache Spark v3.4.0.

# Dockerfile
FROM apache/spark:3.4.0

USER root

ARG SPARK_VERSION=3.4.0

ARG SPARK_CONNECT_VERSION=2.12

ADD https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.12/3.3.4/spark-hadoop-cloud_2.12-3.3.4.jar $SPARK_HOME/jars/

RUN chmod 644 $SPARK_HOME/jars/spark-hadoop-cloud_2.12-3.3.4.jar

ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar $SPARK_HOME/jars/ RUN chmod 644 $SPARK_HOME/jars/hadoop-aws-3.3.4.jar

ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.691/aws-java-sdk-bundle-1.12.691.jar $SPARK_HOME/jars/

RUN chmod 644 $SPARK_HOME/jars/aws-java-sdk-bundle-1.12.691.jar

After building the image, we pushed it to Docker Hub for deployment.

docker build -t oblivious/apache-spark:3.4.0-s3 .

docker push oblivious/apache-spark:3.4.0-s3

Kubernetes Manifests for Apache Spark

To streamline the deployment, we prepared Kubernetes manifests for Apache Spark, hosted in a private repository. These manifests specify the setup for Spark master and worker nodes, including configurations for secure outbound connections, the TCP Spark protocol, Spark UI, and authentication.

The enclaves are isolated confidential computing environments. We need to provide an exhaustive list of all authorised outbound connections in our manifests. In a Spark cluster the nodes communicate over each other through various ports. Since, we want our nodes to talk to each other over attested TLS connections, we will set the outbound connections as follows:

# master/ned.yaml

outboundConnections:
- fqdn:
value: spark-worker.default.svc.cluster.local
port: 12002
tls: false
redirects: false
- fqdn:
value: spark-worker.default.svc.cluster.local
port: 10000
tls: false
redirects: false
attestation: true
configMap: "spark-worker-oblv-cli-config"
- fqdn:
value: spark-worker.default.svc.cluster.local
port: 10001
tls: false
redirects: false
attestation: true
configMap: "spark-worker-oblv-cli-config"
- fqdn:
value: spark-worker.default.svc.cluster.local
port: 8081
tls: false
redirects: false
attestation: true
configMap: "spark-worker-oblv-cli-config"
- fqdn:
value: spark-worker.default.svc.cluster.local
port: 7076
tls: false
redirects: false
attestation: true
configMap: "spark-worker-oblv-cli-config"

# worker/ned.yaml

outboundConnections:
- fqdn:
value: spark-master.default.svc.cluster.local
port: 80
tls: false
redirects: false
- fqdn:
value: spark-master.default.svc.cluster.local
port: 8080
tls: false
redirects: false
attestation: true
configMap: "spark-master-oblv-cli-config"
- fqdn:
value: spark-master.default.svc.cluster.local
port: 7077
tls: false
redirects: false
attestation: true
configMap: "spark-master-oblv-cli-config"
- fqdn:
value: spark-master.default.svc.cluster.local
port: 10000
tls: false
redirects: false
attestation: true
configMap: "spark-master-oblv-cli-config"
- fqdn:
value: spark-master.default.svc.cluster.local
port: 10001
tls: false
redirects: false
attestation: true
configMap: "spark-master-oblv-cli-config"

The complete list can be found at -https://archive.apache.org/dist/spark/docs/3.4.0/security.html#configuring-ports-for-network-security.

Apply the Kubernetes manifests for Apache Spark

We have already prepared the manifests hosted in the GitHub repository. The repository contains two manifests, one for a Spark master and another for a Spark worker.

The manifests were applied using the following command:

kubectl apply -f master/ned.yaml worker/ned.yaml

We checked the status of our deployment using:

kubectl get ned

NAME 		READY 	AGE 	STATUS
spark-master 	1/1 	13d 	SCHEDULED
spark-worker 	1/1 	13d 	SCHEDULED

kubectl get pod

NAME 				 READY 	STATUS   RESTARTS 	  AGE
oblv-proxy-spark-master-** 2/2 		Running 0 		1m
spark-master-** 		 3/3 		Running 0		1m
spark-worker-** 		 3/3 		Running 0		1m

Connecting to the Apache Spark Cluster

With the deployments up and running, we connected to the Apache Spark cluster. Here’s how we established a secure connection:

Step 1. Download Configuration File: We retrieved the configuration file needed for CLI access to the Spark cluster.

kubectl get cm spark-master-oblv-cli-config -o jsonpath='{.data.config\.yaml}' &gt; /tmp/spark-master.yaml

Step 2. Connecting to the Enclave: Using the OBLV Deploy Client, we connected to the enclave, which ensures a secure session and limits access only to trusted users.

oblv connect -u (your domain) --config spark-master.yaml --local-port 3030 --auth-passthrough

Step 3. Accessing Spark UI: Finally, we accessed the Spark UI from a browser at http://localhost:3030, enabling secure monitoring and interaction with the deployed Apache Spark cluster.

In this setup, we also configured Apache Spark’s spark-defaults.conf file to handle specific cluster configurations, such as driver and worker settings, logging, and S3 endpoint connections. The file needs to be present in /opt/spark/conf directory.

These configurations ensure that Spark operates efficiently within the secure enclave.

spark.master spark://spark-master.default.svc.cluster.local:7077

spark.driver.bindAddress 0.0.0.0

spark.driver.host spark-client.default.svc.cluster.local

spark.driver.port 10000

spark.blockManager.port 10001

spark.master.ui.port 8080

spark.worker.ui.port 8081

spark.ui.reverseProxy true

spark.ui.reverseProxyUrl http://localhost:3030

spark.log.level ALL

spark.eventLog.enabled true

spark.eventLog.dir /opt/spark/spark-events

spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider

spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem

spark.hadoop.fs.s3a.path.style.access true

spark.hadoop.fs.s3a.endpoint https://s3.us-east-2.amazonaws.com

spark.hadoop.fs.s3a.endpoint.region us-east-2

spark.driver.extraJavaOptions -Dcom.amazonaws.services.s3.enableV4=true

Final Thoughts

Deploying Apache Spark on OBLV Deploy provides a powerful solution for processing large-scale data in a secure and compliant manner. By running Spark within a Trusted Execution Environment, we gain enhanced control over data security, especially for data in use, which is often the most vulnerable.

OBLV Deploy’s ability to integrate seamlessly with Kubernetes and support secure interactions with AWS services like S3 makes it an ideal platform for data-intensive applications where privacy is paramount. This approach ensures that even sensitive analytics workloads can be handled securely and efficiently, allowing organisations to leverage big data while maintaining strict privacy standards.

For more information on how OBLV Deploy can support your secure data processing needs, feel free to contact us here.

oblv deploy

apache spark

tutorial

Deploying Apache Spark on OBLV Deploy

A technical tutorial on how to deploy Apache Spark in a Trusted Execution Environment

The Critical Need for Secure Data Processing in Real-Time Analytics

Why Use OBLV Deploy for Secure Data Processing?

Understanding Secure Enclaves

What Are Secure Enclaves?

Comparing OBLV Deploy and AWS Nitro Enclaves

OBLV Deploy:

AWS Nitro Enclaves:

Deploying Apache Spark on OBLV Deploy

Preparing Docker Image

Kubernetes Manifests for Apache Spark

Apply the Kubernetes manifests for Apache Spark

Connecting to the Apache Spark Cluster

Final Thoughts

Products

Community

Resources

Legal

Products

Community

Resources

Legal

Products

Community

Resources

Legal