
This article introduces Rclone and describes how to set up and configure Rclone to copy and sync files to Oracle Cloud Infrastructure Object Storage and Hadoop distributed file systems. After migrating your files to Oracle Cloud using Rclone, you can use Oracle Analytics Cloud to build enhanced data insights.

Rclone

Rclone is an open-source command-line tool for managing files on a range of storage systems. Refer to the Rclone documentation to find out which storage systems Rclone supports. Rclone can help you back up, mirror, and migrate files between storage systems, and more.

This article covers three use cases:

  • Copying files from a local file system to Oracle Cloud Infrastructure Object Storage
  • Copying files from a Hadoop Distributed File System to Oracle Cloud Infrastructure Object Storage
  • Copying files from a Hadoop Distributed File System to another Hadoop Distributed File System

Oracle Cloud Infrastructure Object Storage

Oracle Cloud Infrastructure (OCI) Object Storage is a secure, scalable, and cost-effective cloud storage solution which allows you to store an unlimited amount of structured and unstructured data of any content type, including analytic data and rich content, such as images and videos.

OCI Object Storage offers high durability, multiple access tiers, and seamless integration with Oracle Cloud services, such as Oracle Analytics Cloud. It also provides robust access control through OCI Identity and Access Management (IAM) to ensure secure data management. OCI Object Storage is ideal for backup, archival, and big data analytics, and it’s a key component of modern cloud strategies.

Hadoop Distributed File System

Hadoop Distributed File System (HDFS) is a scalable and fault-tolerant storage solution designed for big data processing. It splits large datasets into blocks, distributing them across multiple nodes for parallel processing. HDFS ensures high reliability by replicating data across nodes, making it resilient to hardware failures. As the backbone of Hadoop ecosystems, it supports high-throughput data access, ideal for analytics and large-scale data storage.

High-Level Flow for Data Migration

Fig. 1: Rclone Data Transfer Flow

Install Rclone

Install Rclone on a suitable machine. Follow the Rclone documentation for your operating system.

If you use a yum-based Linux distribution, such as Oracle Linux, the following command might be sufficient. For more detail, refer to the Rclone documentation.

	sudo yum install -y rclone
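
Before moving on to configuration, it can help to confirm that the rclone binary is on the PATH. The following is a minimal check, assuming a POSIX shell. The STATUS variable is only illustrative.

```shell
# Confirm that rclone is installed and report its version.
if command -v rclone >/dev/null 2>&1; then
  rclone version
  STATUS="installed"
else
  echo "rclone not found on PATH; revisit the installation step" >&2
  STATUS="missing"
fi
echo "rclone status: $STATUS"
```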

Configure Rclone for OCI Object Storage

  1. Log in to the machine where you installed Rclone.
  2. Run the command rclone config from a terminal.

    Fig. 2: Rclone Config

  3. Enter n (New remote) and enter a name for the new remote (OCI Object Storage).

    Fig. 3: New Remote

  4. Select the type of storage from the list. Enter the number associated with OCI Object Storage, for example, 38. The number can differ between Rclone versions, so check the list that is displayed.

    Fig. 4: Select OCI Object Storage

  5. Select your authentication provider (Auth Provider) from the list.
    1. Enter 2 if you use an OCI user and API key for authentication.
    2. Enter 3 to use an instance principal. If Rclone runs on an OCI compute instance, this avoids copying sensitive credentials to that machine.

    Fig. 5: Select Auth Provider

  6. Enter the OCI Object Storage namespace.

    Fig. 6: Provide OCI Object Storage Namespace

  7. Enter the OCID of the OCI Object Storage compartment.

    Fig. 7: Provide OCID of Compartment

  8. Enter the region where OCI Object Storage is deployed.

    Fig. 8: Provide Region

  9. Enter the endpoint for the OCI Object Storage API. Leave blank to use the default endpoint for the region.
  10. If you use an OCI user and API key for authentication, provide the following additional information:
    1. Create an OCI configuration file as described in the OCI Documentation, and copy the file to your machine.
    2. Provide the path to the OCI configuration file. Leave blank to use the default path: ~/.oci/config

      Fig. 9: Path of OCI Configuration File

    3. Enter the name of the profile in the configuration file. Leave blank to use the Default profile name.

      Fig. 10: Name of Profile

  11. Optionally, enter y (Yes) to edit the advanced configuration. You only need to do this if you require specific behavior or performance tuning. For example, you can edit one or more of the following properties:
    1. storage_tier – Select the Standard, InfrequentAccess, or Archive storage tier.

      Fig. 11: Storage Tier

    2. upload_cutoff

      Fig. 12: Upload Cutoff

    3. chunk_size

      Fig. 13: Chunk Size

    4. max_upload_parts

      Fig. 14: Max Upload Parts

    5. upload_concurrency

      Fig. 15: Upload Concurrency

    6. copy_cutoff

      Fig. 16: Copy Cutoff

    7. copy_timeout

      Fig. 17: Copy Timeout

    8. disable_checksum

      Fig. 18: Disable Checksum

    9. encoding

      Fig. 19: Encoding

    10. leave_parts_on_error

      Fig. 20: Leave Parts On Error

    11. attempt_resume_upload

      Fig. 21: Attempt Resume Upload

    12. no_check_bucket

      Fig. 22: No Check Bucket

    13. sse_customer_key_file

      Fig. 23: SSE Customer Key File

    14. sse_customer_key

      Fig. 24: SSE Customer Key

    15. sse_customer_key_sha256

      Fig. 25: SSE Customer Key SHA256

    16. sse_kms_key_id

      Fig. 26: SSE KMS Key Id

    17. sse_customer_algorithm

      Fig. 27: SSE Customer Algorithm

  12. Enter n (No) to exit the configuration.
  13. Verify that the configuration details displayed are correct and enter y (Yes this is OK) to save the configuration.
  14. Enter q (Quit) to exit the configuration.
  15. Validate your configuration by running the ls command on the target bucket in OCI Object Storage. For example, in the following ls command, os_with_api is the name of the remote and rclone_api is the name of the target bucket in OCI Object Storage.

    Fig. 28: Validate OCI Object Storage Remote
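
The answers given to rclone config are saved as a section in the Rclone configuration file. As a rough sketch, the section for a remote that uses an OCI user and API key might look like the one written below, followed by the validation command from step 15. The remote name os_with_api and the bucket rclone_api come from this article. The namespace, compartment OCID, region, and scratch file path are placeholders. The file is written to a scratch location so that an existing Rclone configuration isn't overwritten.

```shell
# Scratch config file (hypothetical path) so the real rclone config stays untouched.
CONF="${TMPDIR:-/tmp}/rclone-oci-example.conf"

# Section equivalent to the interactive answers; replace the placeholder values.
cat > "$CONF" <<'EOF'
[os_with_api]
type = oracleobjectstorage
provider = user_principal_auth
namespace = mynamespace
compartment = ocid1.compartment.oc1..example
region = us-ashburn-1
config_file = ~/.oci/config
config_profile = Default
EOF

# Validate the remote by listing the objects in the target bucket.
if command -v rclone >/dev/null 2>&1; then
  rclone --config "$CONF" ls os_with_api:rclone_api \
    || echo "listing failed; check the values in $CONF" >&2
else
  echo "rclone not found on PATH" >&2
fi
```

With an instance principal, provider would be instance_principal_auth and the config_file and config_profile lines wouldn't be needed.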

Configure Rclone for HDFS

This article uses a non-Kerberos Hadoop cluster. The Hadoop cluster nodes must be reachable from the machine where Rclone is configured.

  1. Log in to the machine where you installed Rclone.
  2. Run the command rclone config from a terminal.

    Fig. 29: Rclone Config

  3. Enter n (New remote) and enter a name for the new remote.

    Fig. 30: New Remote

  4. Select the type of storage from the list. Enter the number associated with HDFS, for example, 22. The number can differ between Rclone versions, so check the list that is displayed.

    Fig. 31: Provide HDFS Storage

  5. Enter a comma-separated list of name nodes, each in namenode:port format.

    Fig. 32: Provide Namenode and Port

  6. Enter the Hadoop username. Make sure the user has the required permissions on the HDFS folder you want to migrate.

    Fig. 33: Provide Hadoop Username

  7. Enter n (No) to complete the configuration.

    Alternatively, enter y (Yes) if you need to override the service_principal_name, data_transfer_protection, or encoding properties through the advanced configuration.

    Note: For a Kerberos cluster, you must update the advanced configuration.

  8. Verify the configuration details and enter y (Yes this is OK) to save the configuration.

    Fig. 34: Save Configuration

  9. Validate the configuration by running the ls command on a folder in the HDFS path. For example, in the following sample ls command, hdfs_cluster_1 is the remote name and /data/input is a path on the HDFS cluster.

    Fig. 35: Validate HDFS Remote
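
As with the OCI remote, the HDFS answers end up as a short section in the Rclone configuration file. The sketch below writes such a section to a scratch file and then runs the validation command from step 9. The remote name hdfs_cluster_1 and the path /data/input come from this article. The name node host names, the port 8020, the user name, and the scratch file path are assumptions to replace with your own values.

```shell
# Scratch config file (hypothetical path) so the real rclone config stays untouched.
CONF="${TMPDIR:-/tmp}/rclone-hdfs-example.conf"

# Section for a non-Kerberos cluster; name nodes are a comma-separated host:port list.
cat > "$CONF" <<'EOF'
[hdfs_cluster_1]
type = hdfs
namenode = namenode-1.example.com:8020,namenode-2.example.com:8020
username = hadoop_user
EOF

# Validate the remote by listing the files under an HDFS path.
if command -v rclone >/dev/null 2>&1; then
  rclone --config "$CONF" ls hdfs_cluster_1:/data/input \
    || echo "listing failed; check name node reachability and permissions" >&2
else
  echo "rclone not found on PATH" >&2
fi
```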

Copy Files Using Rclone

  1. Copy files from your local file system to OCI Object Storage.
    1. Run the ls command on the source local directory that you want to copy to the destination OCI Object Storage.

      Fig. 36: List Source Local Directory

    2. Run the copy command with the dry-run flag and review the output.

      Fig. 37: Copy Command

    3. If the output of the dry run looks correct, execute the copy command without the flag to copy the files.
  2. Copy files from your HDFS file system to OCI Object Storage.
    1. Run the ls command on the source HDFS directory that you want to copy to the destination OCI Object Storage.

      Fig. 38: List Source HDFS Directory

    2. Run the copy command with the dry-run flag and review the output.

      Fig. 39: Copy Command

    3. If the output of the dry run looks correct, execute the copy command without the flag to copy the files.
  3. Copy files from your HDFS file system to an HDFS file system on another Hadoop cluster.
    1. Run the ls command on the source HDFS directory that you want to copy to the destination HDFS.

      Fig. 40: List Source HDFS Directory

    2. Run the copy command with the dry-run flag and review the output.

      Fig. 41: Copy Command

    3. If the output of the dry run looks correct, execute the copy command without the flag to copy the files.
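
The three copy scenarios above follow the same pattern, so they can be sketched with one small helper. This is an illustration rather than the exact commands from the screenshots. preview_copy is a hypothetical helper name, hdfs_cluster_2 is an assumed second HDFS remote, and the local path and destination prefixes are placeholders. The helper prints the dry-run command and, if rclone is available, runs it. Nothing is transferred while --dry-run is present.

```shell
# Print and run a dry-run copy from a source to a destination.
preview_copy() {
  echo "rclone copy --dry-run $1 $2"
  if command -v rclone >/dev/null 2>&1; then
    rclone copy --dry-run "$1" "$2" || echo "dry run reported errors" >&2
  fi
}

# 1. Local file system to OCI Object Storage
preview_copy /home/opc/data os_with_api:rclone_api/local
# 2. HDFS to OCI Object Storage
preview_copy hdfs_cluster_1:/data/input os_with_api:rclone_api/hdfs
# 3. HDFS to HDFS on another cluster
preview_copy hdfs_cluster_1:/data/input hdfs_cluster_2:/data/input
```

Once the dry-run output looks right, running the printed command without --dry-run performs the actual copy.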

Considerations

  • Use the sync command if you want to make the destination identical to the source. Sync changes only the destination, and it might delete destination files that don’t exist in the source.
  • If you’re migrating production data, Oracle recommends that you test first using the dry-run option or run with the interactive flag.
  • The dry-run option doesn’t list all permission related issues. This means you might not see errors during the dry run but still get errors during the actual run.
  • If you’re migrating files for the first time and the destination folder is empty, the copy command might perform better than the sync command, due to the number of files.
  • If you’re copying files to OCI Object Storage from a data center or company network, consider configuring Rclone in the same network to avoid network-related issues.
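
Because sync can delete files, a cautious workflow is to preview first, then confirm interactively, and only then run unattended. The sketch below selects the flag from a MODE variable. MODE, FLAG, SRC, and DST are illustrative names, and the destination prefix is a placeholder. Any unrecognized mode falls back to a dry run.

```shell
SRC="hdfs_cluster_1:/data/input"
DST="os_with_api:rclone_api/input"   # destination prefix is a placeholder

# MODE selects how cautiously sync runs; the default is a preview.
MODE="${MODE:-preview}"
case "$MODE" in
  interactive) FLAG="--interactive" ;;  # ask for confirmation before destructive operations
  run)         FLAG="" ;;               # apply changes, including deletions
  *)           FLAG="--dry-run" ;;      # report changes without making them
esac

echo "rclone sync $FLAG $SRC $DST"
if command -v rclone >/dev/null 2>&1; then
  # FLAG is intentionally unquoted so that an empty value adds no argument.
  rclone sync $FLAG "$SRC" "$DST" || echo "sync reported errors" >&2
fi
```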

References

Oracle Solution Playbook: Learn More About Moving Data to Cloud-based Object Storage by Using Rclone

Rclone website: Rclone

Call to Action

Now that you’ve learned how to set up and use Rclone, try it yourself. Migrate data from your data center machines or data center Hadoop cluster to OCI Object Storage (or an OCI Big Data Service Hadoop cluster) using Rclone.

After migrating your data, connect Oracle Analytics Cloud to the destination OCI Object Storage (or Big Data Service Hadoop cluster), and start to build enhanced insights from the data. For details, see Connect to Dataflow SQL Endpoint from OAC and Connect to Big Data Service Spark and Hive from OAC.

If you have questions, post them in the Oracle Analytics Community and we’ll follow up with answers.
