Deploying Apache Spark on OBLV Deploy
A technical tutorial on how to deploy Apache Spark in a Trusted Execution Environment
7 mins read
Jan 30, 2025

As the volume of data generated by businesses and individuals continues to grow exponentially, so do the challenges in efficiently processing and securing that data. Big data tools like Apache Spark have revolutionised data processing by enabling large-scale distributed computation across clusters, making it possible to process vast datasets at unprecedented speeds.
However, handling such large volumes of sensitive data—whether it be financial transactions, personal information, or proprietary business data—comes with heightened security risks. Traditional data security methods are often insufficient to safeguard against modern threats, especially when data is being processed in real-time.
In this article, we’ll demonstrate how deploying Apache Spark on OBLV Deploy can address these security challenges. We’ll explore how OBLV Deploy provides Confidential Computing using secure enclaves to protect data in real-time processing environments, outline the steps we followed for secure deployment, and share best practices for leveraging this approach in your own big data projects.
The Critical Need for Secure Data Processing in Real-Time Analytics
In industries like healthcare, finance, and government, where Apache Spark is widely used for data analytics, the need for secure data processing is critical. Data must be protected across its entire lifecycle, which includes three fundamental states:
- **Data at Rest**: data stored on physical drives or in databases, typically protected through encryption mechanisms.
- **Data in Transit**: while being transmitted across networks, data is safeguarded using protocols like Transport Layer Security (TLS), ensuring it cannot be intercepted or altered during transmission.
- **Data in Use**: the most vulnerable state, as data is actively processed in memory and typically not encrypted. This is where secure enclaves come into play, isolating and encrypting the computation environment to prevent unauthorised access even if the infrastructure is compromised.
Without securing data in all three states, organisations risk exposure to breaches, leaks, and unauthorised access. Robust deployment solutions are essential to mitigate these risks, especially during real-time data processing tasks conducted by platforms like Apache Spark.
Why Use OBLV Deploy for Secure Data Processing?
OBLV Deploy is a robust platform that enhances security by leveraging secure enclaves, providing a secure, trusted execution environment for applications like Apache Spark. It allows users to process sensitive data in memory without exposing it to unauthorised actors. The platform’s seamless integration with existing infrastructure makes it a compelling solution for organisations aiming to safeguard their data while utilising the power of big data tools like Apache Spark.
The sections that follow walk through deploying Apache Spark on OBLV Deploy: the general setup, integration with an S3 filesystem for secure storage, and best practices for optimising the deployment. The demonstration shows that data security can be enhanced throughout the computation process without compromising performance or exposing sensitive information.

Understanding Secure Enclaves
What Are Secure Enclaves?
Secure enclaves are hardware-based isolated environments designed to protect code and data from outside access, including from the host operating system and even cloud administrators. Data processed within these enclaves is encrypted, and access to it is tightly restricted to the enclave itself. This isolation ensures that sensitive computations, such as encryption keys, personal data, or proprietary algorithms, remain secure, even in environments where the infrastructure may be compromised. In fields like finance and healthcare, where data confidentiality is paramount, secure enclaves are essential for protecting data while it is actively processed. By leveraging enclaves, organisations can confidently process sensitive data without exposing it to external threats.
Comparing OBLV Deploy and AWS Nitro Enclaves
OBLV Deploy:
OBLV Deploy is a versatile platform designed to facilitate secure computation using secure enclaves. It offers compatibility with various cloud providers and on-premise environments, allowing users to run applications securely without making significant changes to their codebase or toolchain. The platform abstracts the complexity of managing enclave technologies, making it easier for developers and IT operations teams to integrate secure enclaves into their workflow. OBLV Deploy provides strong in-memory data protection, preventing unauthorised access even by the underlying infrastructure. OBLV Deploy also enforces networking restrictions, allowing only authorised connections and providing an attested guarantee that neither the code nor the connectivity has been tampered with.
AWS Nitro Enclaves:
AWS Nitro Enclaves is a solution from Amazon Web Services (AWS) that enables users to create isolated environments within EC2 instances to process highly sensitive data. These enclaves have no persistent storage, no external networking, and are designed to reduce the attack surface for security-sensitive applications. Nitro Enclaves utilise the AWS Nitro Security Module (NSM), a specialised hardware component designed to perform cryptographic operations and attestation, ensuring that only trusted code is running within the enclave. However, Nitro Enclaves are limited to AWS infrastructure, meaning users need to operate within the AWS ecosystem.
Deploying Apache Spark on OBLV Deploy
Our demonstration showcases the process we followed when deploying Apache Spark on OBLV Deploy and connecting the Spark cluster to an S3 filesystem, where input files are securely stored using an AWS Key Management Service (KMS) key. The KMS key policy is configured so that the data can only be decrypted within the secure enclave, ensuring the highest level of protection.
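To illustrate how such a key policy can bind decryption to the enclave, the fragment below uses the KMS attestation condition key available for Nitro-based enclaves. The account ID, role name, and image measurement are placeholders; the exact policy for an OBLV Deploy enclave should follow the OBLV Deploy documentation.

```json
{
  "Sid": "AllowDecryptOnlyFromAttestedEnclave",
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::111122223333:role/spark-enclave-role" },
  "Action": "kms:Decrypt",
  "Resource": "*",
  "Condition": {
    "StringEqualsIgnoreCase": {
      "kms:RecipientAttestation:ImageSha384": "<enclave-image-measurement>"
    }
  }
}
```

With a condition like this in place, a `Decrypt` call succeeds only when the request carries an attestation document whose image measurement matches the expected value, so the plaintext never leaves the enclave.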
To prepare for deploying Apache Spark on OBLV Deploy, we configured the following:
- **AWS EKS**: a Kubernetes cluster, preferably AWS EKS for manageability, is required to deploy the application. We followed AWS's setup documentation to create the EKS environment.
- **OBLV Deploy Stack**: once the EKS cluster was up and running, we installed OBLV Deploy following the official documentation, ensuring compatibility between the EKS cluster and OBLV Deploy.
With both components set up, we proceeded to configure the Docker image and Kubernetes manifests for Apache Spark.
Preparing Docker Image
To enable Apache Spark to interact with the S3 filesystem securely, we created a Docker image containing Apache Spark and additional dependencies. These dependencies allow Spark to read from and write to S3, ensuring compatibility with the AWS infrastructure.
Below is the Dockerfile we used for this image, which includes configurations for Spark and AWS S3 access. We are using Apache Spark v3.4.0.
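A minimal sketch of such a Dockerfile, assuming the official `apache/spark:3.4.0` base image and the Hadoop S3A connector jars (the jar versions are assumptions, matched to the Hadoop 3.3.4 release that Spark 3.4.0 builds against):

```dockerfile
# Sketch only: base image and jar versions are assumptions,
# not the exact image used in the original deployment.
FROM apache/spark:3.4.0

USER root

# Hadoop S3A connector and the matching AWS SDK bundle, so Spark can
# read from and write to s3a:// paths.
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar /opt/spark/jars/

# Bake the cluster configuration (covered later in this article)
# into the image.
COPY spark-defaults.conf /opt/spark/conf/spark-defaults.conf

USER spark
```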
After building the image, we pushed it to Docker Hub for deployment.
Kubernetes Manifests for Apache Spark
To streamline the deployment, we prepared Kubernetes manifests for Apache Spark, hosted in a private repository. These manifests specify the setup for Spark master and worker nodes, including configurations for secure outbound connections, the TCP Spark protocol, Spark UI, and authentication.
The enclaves are isolated confidential computing environments, so we need to provide an exhaustive list of all authorised outbound connections in our manifests. In a Spark cluster, the nodes communicate with each other over various ports. Since we want our nodes to talk to each other over attested TLS connections, we set the outbound connections as follows:
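The fragment below is illustrative only: the field names are placeholders, not the real OBLV Deploy manifest schema. The port numbers, however, correspond to the Spark settings used in our spark-defaults.conf later in this article.

```yaml
# Illustrative fragment; consult the OBLV Deploy documentation for the
# actual manifest schema. Ports match our Spark configuration.
outbound:
  - port: 7077    # spark.master            - master RPC endpoint
  - port: 8080    # spark.master.ui.port    - master web UI
  - port: 8081    # spark.worker.ui.port    - worker web UI
  - port: 10000   # spark.driver.port       - driver RPC
  - port: 10001   # spark.blockManager.port - shuffle/block transfers
```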
The complete list of ports can be found at https://archive.apache.org/dist/spark/docs/3.4.0/security.html#configuring-ports-for-network-security.
Apply the Kubernetes manifests for Apache Spark
We prepared the manifests and hosted them in a GitHub repository. The repository contains two manifests: one for the Spark master and another for a Spark worker.
The manifests were applied using the following command:
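Assuming the two manifest files from the repository (the filenames here are placeholders):

```shell
# Apply the Spark master manifest first, then the worker.
kubectl apply -f spark-master.yaml
kubectl apply -f spark-worker.yaml
```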
We checked the status of our deployment using:
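A standard check that the pods reached the Running state (resource names are placeholders matching the manifests above):

```shell
# List pods and confirm the master and worker are Running.
kubectl get pods

# Inspect the master's logs if a pod is not coming up.
kubectl logs deployment/spark-master
```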
Connecting to the Apache Spark Cluster
With the deployments up and running, we connected to the Apache Spark cluster. Here’s how we established a secure connection:
1. **Download the configuration file**: we retrieved the configuration file needed for CLI access to the Spark cluster.
2. **Connect to the enclave**: using the OBLV Deploy Client, we connected to the enclave, which ensures a secure session and limits access only to trusted users.
3. **Access the Spark UI**: finally, we accessed the Spark UI from a browser at http://localhost:3030, enabling secure monitoring and interaction with the deployed Apache Spark cluster.

In this setup, we also configured Apache Spark's spark-defaults.conf file to handle specific cluster configurations, such as driver and worker settings, logging, and S3 endpoint connections. The file needs to be present in the /opt/spark/conf directory.
These configurations ensure that Spark operates efficiently within the secure enclave.
```
spark.master spark://spark-master.default.svc.cluster.local:7077
spark.driver.bindAddress 0.0.0.0
spark.driver.host spark-client.default.svc.cluster.local
spark.driver.port 10000
spark.blockManager.port 10001
spark.master.ui.port 8080
spark.worker.ui.port 8081
spark.ui.reverseProxy true
spark.ui.reverseProxyUrl http://localhost:3030
spark.log.level ALL
spark.eventLog.enabled true
spark.eventLog.dir /opt/spark/spark-events
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.endpoint https://s3.us-east-2.amazonaws.com
spark.hadoop.fs.s3a.endpoint.region us-east-2
spark.driver.extraJavaOptions -Dcom.amazonaws.services.s3.enableV4=true
```
Final Thoughts
Deploying Apache Spark on OBLV Deploy provides a powerful solution for processing large-scale data in a secure and compliant manner. By running Spark within a Trusted Execution Environment, we gain enhanced control over data security, especially for data in use, which is often the most vulnerable.
OBLV Deploy’s ability to integrate seamlessly with Kubernetes and support secure interactions with AWS services like S3 makes it an ideal platform for data-intensive applications where privacy is paramount. This approach ensures that even sensitive analytics workloads can be handled securely and efficiently, allowing organisations to leverage big data while maintaining strict privacy standards.
For more information on how OBLV Deploy can support your secure data processing needs, feel free to contact us here.