Managing Your Cloud-Based Data Storage with Rclone

How to optimize data transfer across multiple object storage systems

Photo by Tom Podmore on Unsplash

As companies become increasingly dependent on cloud-based storage solutions, it is imperative that they have the appropriate tools and techniques for effective management of their big data. In previous posts (e.g., here and here) we explored several different methods for retrieving data from cloud storage and demonstrated their effectiveness on different types of tasks. We found that the optimal tool can vary based on the specific task at hand (e.g., file format, size of the data files, data access pattern) and the metrics we wish to optimize (e.g., latency, speed, or cost). In this post, we explore one more popular tool for cloud-based storage management, sometimes referred to as "the Swiss army knife of cloud storage": the rclone command-line utility. Supporting more than 70 storage service providers, rclone offers functionality similar to that of vendor-specific storage management applications such as AWS CLI (for Amazon S3) and gsutil (for Google Storage). But does it perform well enough to constitute a viable alternative? Are there situations in which rclone would be the tool of choice? In the following sections we will demonstrate rclone's usage, assess its performance, and highlight its value in a particular use case: transferring data between different object storage systems.

Disclaimers

This post is not, by any means, intended to replace the official rclone documentation. Nor is it intended to be an endorsement of rclone or of any of the other tools we mention. The best choice for your cloud-based data management will depend greatly on the details of your project and should be made following thorough, use-case-specific testing. Please be sure to re-evaluate the statements we make against the most up-to-date tools available at the time you read this.

The following command line uses rclone sync to sync the contents of a cloud-based object storage path with a local directory. This example uses the Amazon S3 storage service, but it could just as easily have used a different cloud storage service.

rclone sync -P \
    --transfers 4 \
    --multi-thread-streams 4 \
    S3store:my-bucket/my_files ./my_files

The rclone command has dozens of flags for tuning its behavior. The -P flag outputs the progress of the data transfer, including the transfer rate and overall time. In the command above we included two (of the many) controls that can impact rclone's runtime performance: the transfers flag determines the maximum number of files to download concurrently, and multi-thread-streams determines the maximum number of threads to use when transferring a single file. Here we have left both at their default values (4).
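As a minimal sketch of such tuning (the values below are illustrative assumptions, not recommendations), we might raise both settings on a machine with ample bandwidth and CPU. The optimal values depend on your runtime environment and should be found by experimentation:

rclone sync -P \
    --transfers 16 \
    --multi-thread-streams 8 \
    S3store:my-bucket/my_files ./my_files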

Rclone's functionality relies on the appropriate definition of the rclone configuration file. Below we demonstrate the definition of the remote S3store object storage location used in the command line above.

[S3store]
type = s3
provider = AWS
access_key_id =
secret_access_key =
region = us-east-1
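Once a remote is defined, it can be sanity-checked with rclone's listing commands before launching any large transfer (the bucket and path names below are the hypothetical ones used above):

# list the buckets visible to the S3store remote
rclone lsd S3store:

# list the objects under the path we intend to sync
rclone ls S3store:my-bucket/my_files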

Data Retrieval from Cloud Storage with Rclone

Now that we have seen rclone in action, the question that arises is whether it provides any value over the other cloud storage management tools on the market, such as the popular AWS CLI. In the next two sections we will evaluate the performance of rclone compared to some of its alternatives in two scenarios that we have explored in detail in our previous posts: 1) downloading a 2 GB file and 2) downloading hundreds of 1 MB files.

Use Case 1: Downloading a Large File

The command line below uses the AWS CLI to download a 2 GB file from Amazon S3. This is just one of the many methods we evaluated in a previous post. We use the Linux time command to measure the performance.

time aws s3 cp s3://my-bucket/2GB.bin .

The reported download time amounted to roughly 26 seconds (i.e., ~79 MB/s). Keep in mind that this value was measured on our own local PC and may vary greatly from one runtime environment to another. The equivalent rclone sync command appears below:

rclone sync -P S3store:my-bucket/2GB.bin .

In our setup, we found the rclone download time to be more than twice that of the standard AWS CLI. It is likely that this could be improved significantly through appropriate tuning of the rclone control flags.
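For example, one tuning experiment (the flag values here are illustrative assumptions, not measured recommendations) might increase the number of parallel streams per file and lower the file-size threshold above which rclone uses multi-threaded downloads:

rclone copy -P \
    --multi-thread-streams 16 \
    --multi-thread-cutoff 64M \
    S3store:my-bucket/2GB.bin .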

Use Case 2: Downloading a Large Number of Small Files

In this use case we evaluate the runtime performance of downloading 800 relatively small files of 1 MB each. In a previous blog post we discussed this use case in the context of streaming data samples to a deep-learning training workload and demonstrated the superior performance of s5cmd beast mode. In beast mode we create a file with a list of object-file operations, which s5cmd performs using multiple parallel workers (256 by default). The s5cmd beast mode option is demonstrated below:

time s5cmd --run cmds.txt

The cmds.txt file contains a list of 800 lines of the form:

cp s3://my-bucket/small_files/<file-name>.jpg <local-path>/<file-name>.jpg
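Such a list can be produced with a short shell loop. Here we assume, purely for illustration, that the objects are named 0.jpg through 799.jpg and should land in a local directory named my-local:

# generate one s5cmd copy operation per object
for i in $(seq 0 799); do
    echo "cp s3://my-bucket/small_files/${i}.jpg ./my-local/${i}.jpg"
done > cmds.txt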

The s5cmd command took an average of 9.3 seconds (averaged over ten trials).

Rclone supports functionality similar to s5cmd's beast mode via the files-from command line option. Below we run rclone copy on our 800 files with the transfers value set to 256 to match the default concurrency settings of s5cmd.

rclone copy -P --transfers 256 --files-from files.txt S3store:my-bucket /my-local

The files.txt file contains 800 lines of the form:

small_files/<file-name>.jpg
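Rather than writing this list by hand, we can derive it from the bucket itself. A sketch using rclone's lsf command, under the same naming assumptions as above:

# list object names and prefix them with their parent directory
rclone lsf --files-only S3store:my-bucket/small_files \
    | sed 's|^|small_files/|' > files.txt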

The rclone copy of our 800 files took an average of 8.5 seconds, slightly less than s5cmd (averaged over ten trials).

We acknowledge that the results demonstrated so far may not be enough to convince you to prefer rclone over your existing tools. In the next section we will describe a use case that highlights one of the potential advantages of rclone.

Data Transfer Between Object Storage Systems

These days it is not unusual for development teams to maintain their data in more than one object store. The motivation might be the need to protect against the possibility of a storage failure, or the decision to use data-processing offerings from multiple cloud service providers. For example, your solution for AI development might rely on training your models in AWS using data in Amazon S3 and running data analytics in Microsoft Azure using the same data stored in Azure Storage. Furthermore, you may want to maintain a copy of your data in a local storage infrastructure such as FlashBlade, Cloudian, or VAST. These circumstances require the ability to transfer and synchronize your data between multiple object stores in a secure, reliable, and timely fashion.

Some cloud service providers offer dedicated services for such purposes. However, these do not always address the specific needs of your project, or they may not allow you the level of control you desire. For example, Google Storage Transfer excels at speedy migration of all of the data within a specified storage folder, but does not (as of the time of this writing) support transferring a specific subset of files from within it.

Another option we could consider would be to apply our existing data management tools to this purpose. The problem is that tools such as AWS CLI and s5cmd do not (as of the time of this writing) support specifying different access settings and security credentials for the source and target storage systems. Thus, migrating data between storage locations requires transferring it via an intermediate (temporary) location. In the command below we combine the use of s5cmd and AWS CLI to copy a file from Amazon S3 to Google Storage through system memory, using Linux piping:

s5cmd cat s3://my-bucket/file \
    | aws s3 cp --endpoint-url https://storage.googleapis.com \
      --profile gcp - s3://gs-bucket/file

While this is a legitimate, albeit clumsy, way of transferring a single file, in practice we may need the ability to transfer many millions of files. To support this, we would need to add an additional layer for spawning and managing multiple parallel workers/processors. Things could get ugly pretty quickly.
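To make the point concrete, here is a minimal sketch of such an orchestration layer, fanning the piped copies out over 32 parallel workers with xargs. The file list, bucket names, and profile are the hypothetical ones used above, and the usual quoting caveats of combining xargs with sh -c apply:

# run 32 piped single-file copies in parallel, one per listed object path
cat files.txt | xargs -P 32 -I {} sh -c \
    's5cmd cat "s3://my-bucket/{}" \
        | aws s3 cp --endpoint-url https://storage.googleapis.com \
          --profile gcp - "s3://gs-bucket/{}"'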

Data Transfer with Rclone

In contrast to tools like AWS CLI and s5cmd, rclone enables us to specify different access settings for the source and the target. In the following rclone config file we add settings for Google Cloud Storage access. Note that access here goes through Google Storage's S3-compatible XML API with HMAC keys, which is why the s3 backend type is reused with the GCS provider:

[S3store]
type = s3
provider = AWS
access_key_id =
secret_access_key =

[GSstore]
type = s3
provider = GCS
access_key_id =
secret_access_key =
endpoint = https://storage.googleapis.com

Transferring a single file between storage systems has the same format as copying it to a local directory:

rclone copy -P S3store:my-bucket/file GSstore:gs-bucket/file
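After a transfer completes, rclone's check command can be used to verify that the source and target agree (it compares file sizes and, where available, hashes):

rclone check S3store:my-bucket GSstore:gs-bucket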

However, the true power of rclone comes from combining this feature with the files-from option described above. Rather than having to orchestrate a custom solution for parallelizing the data migration, we can transfer a long list of files using a single command:

rclone copy -P --transfers 256 --files-from files.txt \
    S3store:my-bucket GSstore:gs-bucket

In practice, we can further speed up the data migration by splitting the list of object files into smaller lists (e.g., of 10,000 files each) and running each list on a separate compute resource, as sketched below. While the precise impact of this kind of solution will vary from project to project, it can provide a significant boost to the speed and efficiency of your development.
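A minimal sketch of this sharding step, assuming the files.txt list and the remotes defined above, uses the standard split utility and launches one rclone process per shard (shown here on a single machine; in practice each shard would be assigned to its own compute resource):

# break the full list into shards of 10,000 paths each
split -l 10000 files.txt shard_

# launch one transfer per shard
for f in shard_*; do
    rclone copy --transfers 256 --files-from "$f" \
        S3store:my-bucket GSstore:gs-bucket &
done
wait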

Summary

In this post we explored cloud-based storage management using rclone and demonstrated its application to the challenge of maintaining and synchronizing data across multiple storage systems. There are undoubtedly many other solutions for data transfer, but there is no questioning the convenience and elegance of the rclone-based method.

This is just one of many posts we have written on the topic of maximizing the efficiency of cloud-based storage solutions. Be sure to check out some of our other posts on this important topic.
