Apache Spark has become the go-to big data processing engine for a wide range of use cases, including reading and processing Optimized Row Columnar (ORC) files from object storage in services like Oracle Cloud Infrastructure (OCI) Data Flow and OCI Big Data Service. However, performance issues can arise when reading large ORC files from cloud object storage. In this blog, we explore how we tackled an ORC file reading performance issue and achieved significant improvements by tuning specific Spark configurations.
When reading ORC files from OCI Object Storage in Spark clusters, you might encounter slow read times and long processing durations. This sluggish performance can create bottlenecks and adversely impact the overall efficiency of your data processing pipelines.
To address the ORC file reading performance issue in Spark on OCI Data Flow and OCI Big Data Service, we experimented with various configurations and identified the following settings that delivered remarkable improvements:
These configurations use the Hadoop Distributed File System (HDFS) connector version 188.8.131.52.3.2, and they’re compatible with Hadoop 3.3.x, Spark 3.2.1, and Scala 2.12.x. Let’s look at the configurations and understand how they help enhance ORC file reading performance.
fs.oci.io.read.ahead: Setting this configuration to true enables read-ahead functionality when accessing data from OCI Object Storage. Read-ahead allows Spark to fetch more data blocks ahead of time, reducing the latency in fetching subsequent blocks during data reading. This optimization minimizes the number of round trips required to fetch data from Object Storage.
fs.oci.io.read.ahead.blocksize: This configuration determines the size of the data blocks that Spark reads ahead of time. By setting it to "6291456" (6 MB), we found an optimal balance between read performance and memory utilization. Adjusting this value might require some experimentation based on your specific use case and data characteristics.
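To make this concrete, here's a minimal PySpark sketch showing one way to apply both read-ahead settings through Spark's standard `spark.hadoop.` passthrough prefix when building the session. The application name, bucket, namespace, and object path are hypothetical placeholders; substitute your own.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orc-read-tuning")  # hypothetical app name
    # Enable read-ahead in the OCI HDFS connector
    .config("spark.hadoop.fs.oci.io.read.ahead", "true")
    # Read-ahead block size of 6 MB (6,291,456 bytes)
    .config("spark.hadoop.fs.oci.io.read.ahead.blocksize", "6291456")
    .getOrCreate()
)

# Hypothetical bucket, namespace, and path; replace with your own
df = spark.read.orc("oci://my-bucket@my-namespace/path/to/data")
```

Because the `spark.hadoop.` prefix forwards these values into the Hadoop configuration, the same settings can instead be placed in `core-site.xml` (without the prefix) if you prefer cluster-wide defaults over per-application ones.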
In addition to the two primary configurations, you can experiment with a few more settings to further tune your Spark cluster for optimal ORC file reading performance, including the following examples:
fs.oci.rename.operation.numthreads: This configuration determines the number of threads used for rename operations when working with OCI Object Storage. Renaming files can be a resource-intensive operation, especially in distributed environments like Spark clusters. By default, this value is set to 1, meaning that rename operations run sequentially. By increasing the number of threads to 2, you allow multiple rename operations to run concurrently, potentially reducing the time required to rename files in your storage bucket.
fs.oci.io.read.ahead.blockcount: Similar to fs.oci.io.read.ahead.blocksize, which we discussed earlier, this configuration controls the number of data blocks to be read ahead of time from OCI Object Storage. By setting it to 4, Spark fetches four data blocks in advance, preloading data and reducing the time spent waiting for subsequent blocks during data reading. The optimal value for this setting can vary based on your specific use case, so experimentation is encouraged.
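As a sketch, these two supplementary settings can be passed the same way as the primary ones, using the values discussed above. The session name is again a hypothetical placeholder, and the right values for your workload may differ.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orc-read-tuning-extra")  # hypothetical app name
    # Use 2 threads for rename operations instead of the default 1
    .config("spark.hadoop.fs.oci.rename.operation.numthreads", "2")
    # Prefetch 4 data blocks ahead of the current read position
    .config("spark.hadoop.fs.oci.io.read.ahead.blockcount", "4")
    .getOrCreate()
)
```

Note that block count multiplies with block size: with a 6-MB block size, a block count of 4 asks the connector to keep roughly 24 MB of data in flight per stream, so watch executor memory if you raise either value.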
After implementing the above configurations, we observed a substantial reduction in ORC file reading times. Previously, it took approximately 20 minutes to read a 1.6-GB ORC file from OCI Object Storage. However, after applying the recommended settings, the read time decreased significantly to just one minute. This enhancement translates into substantial improvements in data processing efficiency and overall pipeline performance.
Efficient data processing is critical for successful big data applications, and Spark has proven to be a powerful tool for tackling such challenges. By fine-tuning specific configurations, such as fs.oci.io.read.ahead and fs.oci.io.read.ahead.blocksize, you can achieve substantial improvements in ORC file reading performance on OCI Data Flow, OCI Big Data Service, or even standard Spark clusters.
By sharing our experiences and insights through this post, we hope to help others optimize their Spark workloads for reading ORC files from OCI Object Storage and achieve similar performance gains. Remember that the effectiveness of these configurations can vary depending on the dataset and workload characteristics, so it is essential to experiment and tailor the settings to your specific use case.