Managing the Cloud Storage Costs of Big-Data Applications
A Simple Thought Experiment
Batch Data into Large Files
Use Tools that Enable Control Over Multi-part Data Transfer

Suggestions for Reducing the Expense of Using Cloud-Based Storage

Towards Data Science
Photo by JOSHUA COLEMAN on Unsplash

With the growing reliance on ever-increasing amounts of data, modern-day companies are more dependent than ever on high-capacity and highly scalable data-storage solutions. For many companies this solution comes in the form of a cloud-based storage service, such as Amazon S3, Google Cloud Storage, or Azure Blob Storage, each of which comes with a rich set of APIs and features (e.g., multi-tier storage) supporting a wide variety of data storage designs. Of course, cloud storage services also have an associated cost. This cost usually comprises several components, including the overall size of the storage space you use, as well as activities such as transferring data into, out of, or within cloud storage. The price of Amazon S3, for example, includes (as of the time of this writing) six cost components, each of which needs to be taken into consideration. It’s easy to see how managing the cost of cloud storage can get complicated, and designated calculators (e.g., here) have been developed to assist with this.

In a recent post, we expanded on the importance of designing your data and your data usage so as to reduce the costs associated with data storage. Our focus there was on using data compression as a way to reduce the overall size of your data. In this post we focus on a sometimes overlooked cost component of cloud storage: the cost of API requests made against your cloud storage buckets and data objects. We will show, by example, why this component is often underestimated and how it can become a significant portion of the cost of your big data application if not managed properly. We will then discuss a couple of simple ways to keep this cost under control.


Although our demonstrations will use Amazon S3, the contents of this post are just as applicable to any other cloud storage service. Please do not interpret our choice of Amazon S3, or of any other tool, service, or library we mention, as an endorsement of its use. The best option for you will depend on the unique details of your own project. Furthermore, please keep in mind that any design choice regarding how you store and use your data will have pros and cons that should be weighed carefully against the details of your own project.

This post includes a number of experiments that were run on an Amazon EC2 c5.4xlarge instance (with 16 vCPUs and “up to 10 Gbps” of network bandwidth). We share their outputs as examples of the comparative results you might see. Keep in mind that the outputs may vary greatly depending on the environment in which the experiments are run. Please do not rely on the results presented here for your own design decisions. We strongly encourage you to run these, as well as additional experiments, before deciding what is best for your own projects.

A Simple Thought Experiment

Suppose you have a data transformation application that acts on 1 MB data samples from S3 and produces 1 MB data outputs that are uploaded to S3. Suppose that you are tasked with transforming 1 billion data samples by running your application on an appropriate Amazon EC2 instance (in the same region as your S3 bucket in order to avoid data transfer costs). Now let’s assume that Amazon S3 charges $0.0004 per 1,000 GET operations and $0.005 per 1,000 PUT operations (as of the time of this writing). At first glance, these costs might seem so low as to be negligible compared to the other costs of the data transformation. However, a simple calculation shows that our Amazon S3 API calls alone will tally a bill of $5,400: $400 for the 1 billion GET calls plus $5,000 for the 1 billion PUT calls. This can easily be the most dominant cost factor of your project, even more than the cost of the compute instance. We will return to this thought experiment at the end of the post.
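The arithmetic behind this figure can be reproduced in a few lines. The sketch below assumes the per-1,000-request prices quoted above and exactly one GET and one PUT per sample; the variable names are ours and are only illustrative.

```python
# Prices quoted in the text (USD per 1,000 requests); check current S3 pricing.
GET_PRICE_PER_1000 = 0.0004
PUT_PRICE_PER_1000 = 0.005

num_samples = 1_000_000_000  # assumption: one GET and one PUT per sample

get_cost = num_samples / 1000 * GET_PRICE_PER_1000
put_cost = num_samples / 1000 * PUT_PRICE_PER_1000
total_cost = get_cost + put_cost

print(f"GET: ${get_cost:,.2f}  PUT: ${put_cost:,.2f}  total: ${total_cost:,.2f}")
```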

Batch Data into Large Files

The obvious way to reduce the cost of the API calls is to group samples together into larger files and run the transformation on batches of samples. Denoting our batch size by N, this strategy could potentially reduce our cost by a factor of N (assuming that multi-part file transfer is not used; see below). This technique would save money not only on the PUT and GET calls but on all of the cost components of Amazon S3 that depend on the number of object files rather than the overall size of the data (e.g., lifecycle transition requests).

There are a number of disadvantages to grouping samples together. For example, when you store samples individually, you can freely access any one of them at will. This becomes more challenging when samples are grouped together. (See this post for more on the pros and cons of batching samples into large files.) If you do opt for grouping samples together, the big question is how to choose the size N. A larger N could reduce storage costs but might introduce latency, increase the compute time, and, by extension, increase the compute costs. Finding the optimal number may require some experimentation that takes these and additional considerations into account.

But let’s not kid ourselves. Making this kind of change will not be easy. Your data may have many consumers (both human and artificial), each with their own particular set of demands and constraints. Storing your samples in separate files can make it easier to keep everyone happy. Finding a batching strategy that satisfies everyone will be difficult.
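To make the factor-of-N claim concrete, here is a minimal sketch of the API cost as a function of the batch size. It assumes the GET and PUT prices quoted in this post, one GET and one PUT per file, and no multi-part transfer; `api_cost` is a hypothetical helper, not part of any library.

```python
import math

GET_PRICE = 0.0004 / 1000  # USD per GET request, as quoted in this post
PUT_PRICE = 0.005 / 1000   # USD per PUT request, as quoted in this post

def api_cost(num_samples: int, batch_size: int) -> float:
    """Cost of one GET and one PUT per file when N samples share a file."""
    num_files = math.ceil(num_samples / batch_size)
    return num_files * (GET_PRICE + PUT_PRICE)

# The cost shrinks in proportion to the batch size N.
for n in (1, 10, 100, 1000):
    print(f"N={n:>5}: ${api_cost(1_000_000_000, n):,.2f}")
```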

Possible Compromise: Batched Puts, Individual Gets

A compromise you might consider is to upload large files of grouped samples while still enabling access to individual samples. One way to do this is to maintain an index file with the location of each sample (the file in which it is grouped, the start offset, and the end offset) and expose a thin API layer to each consumer that enables them to freely download individual samples. The API would be implemented using the index file and an S3 API that supports extracting specific byte ranges from object files (e.g., Boto3’s get_object function). While this kind of solution would not save any money on GET calls (since we are still pulling the same number of individual samples), the more expensive PUT calls would be reduced, since we would be uploading a smaller number of larger files. Note that this kind of solution places some limitations on the library we use to interact with S3, since it depends on an API that allows extracting partial chunks of the large file objects. In previous posts (e.g., here) we have discussed the different ways of interfacing with S3, many of which do not support this feature.

The code block below demonstrates how to implement a simple PyTorch dataset (with PyTorch version 1.13) that uses the Boto3 get_object API to extract individual 1 MB samples from large files of grouped samples. We compare the speed of iterating over the data in this manner with iterating over samples that are stored in individual files.
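The index-plus-range idea can be sketched as follows. This is only an illustration under our own assumptions: the `SampleLocation`, `build_index`, and `fetch_sample` names and the `<prefix>/<file_index>.bin` key layout are hypothetical, and the only S3 call used is `get_object` with its `Bucket`, `Key`, and `Range` parameters. The client is passed in, so any Boto3 S3 client (or a stub) can be used.

```python
from typing import Dict, List, NamedTuple

class SampleLocation(NamedTuple):
    key: str    # object key of the grouped file holding the sample
    start: int  # offset of the first byte of the sample
    end: int    # offset of the last byte of the sample (inclusive)

def build_index(sample_sizes: List[int], samples_per_file: int,
                prefix: str) -> Dict[int, SampleLocation]:
    """Map each sample id to its grouped file and byte offsets."""
    index = {}
    offset = 0
    for sample_id, size in enumerate(sample_sizes):
        if sample_id % samples_per_file == 0:
            offset = 0  # a new grouped file starts here
        key = f"{prefix}/{sample_id // samples_per_file}.bin"  # assumed layout
        index[sample_id] = SampleLocation(key, offset, offset + size - 1)
        offset += size
    return index

def fetch_sample(client, bucket: str, loc: SampleLocation) -> bytes:
    """Download one sample with a single ranged GET request."""
    response = client.get_object(
        Bucket=bucket, Key=loc.key,
        Range=f"bytes={loc.start}-{loc.end}")  # HTTP byte ranges are inclusive
    return response["Body"].read()
```

In practice the index itself would be stored somewhere consumers can read it (for example, as another object in the bucket), which is a design choice this sketch leaves open.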

import os, boto3, time, numpy as np
import torch
from torch.utils.data import Dataset
from statistics import mean, variance

KB = 1024
MB = KB * KB
GB = KB ** 3

sample_size = MB
num_samples = 100000

# modify to vary the size of the files
samples_per_file = 2000 # for 2GB files
num_files = num_samples//samples_per_file
bucket = ''
single_sample_path = ''
large_file_path = ''

class SingleSampleDataset(Dataset):
    def __init__(self):
        self.bucket = bucket
        self.path = single_sample_path
        self.client = boto3.client("s3")

    def __len__(self):
        return num_samples

    def get_bytes(self, key):
        # one GET request per sample file
        response = self.client.get_object(
            Bucket=self.bucket,
            Key=key
        )
        return response['Body'].read()

    def __getitem__(self, index: int):
        key = f'{self.path}/{index}.image'
        image = np.frombuffer(self.get_bytes(key),np.uint8)
        return {"image": image}

class LargeFileDataset(Dataset):
    def __init__(self):
        self.bucket = bucket
        self.path = large_file_path
        self.client = boto3.client("s3")

    def __len__(self):
        return num_samples

    def get_bytes(self, file_index, sample_index):
        # ranged GET: pull only the bytes of the requested sample
        # (the '.bin' key naming is an assumption; adapt it to your layout)
        start = sample_index * sample_size
        response = self.client.get_object(
            Bucket=self.bucket,
            Key=f'{self.path}/{file_index}.bin',
            Range=f'bytes={start}-{start + sample_size - 1}'
        )
        return response['Body'].read()

    def __getitem__(self, index: int):
        # consecutive samples are stored in the same file
        file_index = index // samples_per_file
        sample_index = index % samples_per_file
        image = np.frombuffer(self.get_bytes(file_index, sample_index),
                              np.uint8)
        return {"image": image}

# toggle between single sample files and large files
use_grouped_samples = True

if use_grouped_samples:
    dataset = LargeFileDataset()
else:
    dataset = SingleSampleDataset()

# set the number of parallel workers according to the number of vCPUs
dl = torch.utils.data.DataLoader(dataset, shuffle=True,
                                 batch_size=4, num_workers=16)

stats_lst = []
t0 = time.perf_counter()
for batch_idx, batch in enumerate(dl, start=1):
    if batch_idx % 100 == 0:
        # record the time taken by the last 100 batches
        t = time.perf_counter() - t0
        stats_lst.append(t)
        t0 = time.perf_counter()

mean_calc = mean(stats_lst)
var_calc = variance(stats_lst)
print(f'mean {mean_calc} variance {var_calc}')

The table below summarizes the speed of data traversal for different choices of the sample grouping size, N.

Impact of Different Grouping Strategies on Data Traversal Time (by Author)

Note that although these results strongly imply that grouping samples into large files has a relatively small impact on the performance of extracting them individually, we have found that the comparative results vary based on the sample size, the file size, the values of the file offsets, the number of concurrent reads from the same file, and so on. Although we are not privy to the internal workings of the Amazon S3 service, it is not surprising that considerations such as memory size, memory alignment, and throttling would impact performance. Finding the optimal configuration for your data will likely require a bit of experimentation.

One significant factor that could interfere with the money-saving grouping strategy we have described here is the use of multi-part downloading and uploading, which we will discuss in the next section.

Use Tools that Enable Control Over Multi-part Data Transfer

Many cloud storage service providers support the option of multi-part uploading and downloading of object files. In multi-part data transfer, files that are larger than a certain threshold are divided into multiple parts that are transferred concurrently. This is a critical feature if you want to speed up the data transfer of large files. AWS recommends using multi-part upload for files larger than 100 MB. In the following simple example, we compare the download time of a 2 GB file with the multi-part threshold and chunk size set to different values:
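As a conceptual illustration of what dividing a file into parts means, the sketch below computes the byte range covered by each part for a given chunk size. It is not how any particular library implements multi-part transfer; `part_ranges` is a hypothetical helper and the 20 MB file size is an arbitrary example.

```python
MB = 1024 * 1024

def part_ranges(file_size: int, chunk_size: int):
    """Return the inclusive (start, end) byte range of each part."""
    return [(start, min(start + chunk_size, file_size) - 1)
            for start in range(0, file_size, chunk_size)]

# A 20 MB file with 8 MB chunks: two full parts and one shorter final part.
parts = part_ranges(20 * MB, 8 * MB)
for start, end in parts:
    print(f"bytes={start}-{end}")
```

Each of these ranges corresponds to a separate request, which is why the chunk size matters for cost as well as for speed.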

import boto3, time
from boto3.s3.transfer import TransferConfig
from statistics import mean

KB = 1024
MB = KB * KB
GB = KB ** 3

s3 = boto3.client('s3')
bucket = ''
key = ''
local_path = '/tmp/2GB.bin'
num_trials = 10

for size in [8*MB, 100*MB, 500*MB, 2*GB]:
    print(f'multi-part size: {size}')
    stats = []
    for i in range(num_trials):
        # use the same value for the threshold and the chunk size
        config = TransferConfig(multipart_threshold=size,
                                multipart_chunksize=size)
        t0 = time.time()
        s3.download_file(bucket, key, local_path, Config=config)
        stats.append(time.time()-t0)
    print(f'multi-part size {size} mean {mean(stats)}')

The results of this experiment are summarized in the table below:

Impact of Multi-part Chunk Size on Download Time (by Author)

Note that the relative comparison will depend greatly on the test environment, and specifically on the speed and bandwidth of communication between the instance and the S3 bucket. Our experiment was run on an instance that was in the same region as the bucket. However, as the distance increases, so will the impact of using multi-part downloading.

With regard to the topic of our discussion, it is important to note the cost implications of multi-part data transfer. Specifically, when you use multi-part data transfer, you are charged for the API operation of each one of the file parts. Consequently, using multi-part uploading/downloading will limit the cost-saving potential of batching data samples into large files.

Many APIs use multi-part downloading by default. This is great if your primary interest is reducing the latency of your interaction with S3. But if your concern is limiting cost, this default behavior does not work in your favor. Boto3, for example, is a popular Python API for uploading and downloading files from S3. If not specified, Boto3 S3 APIs such as upload_file and download_file will use a default TransferConfig, which applies multi-part uploading/downloading with a chunk size of 8 MB to any file larger than 8 MB. If you are responsible for controlling the cloud costs in your organization, you might be unhappily surprised to learn that these APIs are being widely used with their default settings. In many cases, you may find these settings to be unjustified, and that increasing the multi-part threshold and chunk size values, or disabling multi-part data transfer altogether, has little impact on the performance of your application.
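To get a feel for what the defaults mean for request counts, here is a rough model. It assumes the 8 MB default threshold and chunk size mentioned above, counts one billed request per part, and ignores any additional requests a library may issue to start or complete a multi-part upload. The behavior exactly at the threshold is also an assumption, and `billed_requests` is a hypothetical helper.

```python
import math

MB = 1024 * 1024
DEFAULT_THRESHOLD = 8 * MB  # Boto3 default, as stated in the text
DEFAULT_CHUNKSIZE = 8 * MB  # Boto3 default, as stated in the text

def billed_requests(file_size: int,
                    threshold: int = DEFAULT_THRESHOLD,
                    chunksize: int = DEFAULT_CHUNKSIZE) -> int:
    """Simplified model: one request per part, or one if not multi-part."""
    if file_size <= threshold:  # boundary handling is an assumption
        return 1
    return math.ceil(file_size / chunksize)

file_size = 500 * MB
print(billed_requests(file_size))  # default settings -> 63
print(billed_requests(file_size, threshold=file_size,
                      chunksize=file_size))  # raised settings -> 1
```

Under this model, raising the threshold and chunk size to the file size turns 63 requests per transfer into a single request.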

Example: Impact of Multi-part File Transfer Size on Speed and Cost

In the code block below we create a simple multi-process transform function and measure the impact of the multi-part chunk size on its performance and cost:

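import os, boto3, time, math
from boto3.s3.transfer import TransferConfig
from multiprocessing import Pool
from statistics import mean, variance

KB = 1024
MB = KB * KB

sample_size = MB
num_files = 64
samples_per_file = 500
file_size = sample_size*samples_per_file
num_processes = 16
bucket = ''
large_file_path = ''
local_path = '/tmp'
num_trials = 5
cost_per_get = 4e-7
cost_per_put = 5e-6

for multipart_chunksize in [1*MB, 8*MB, 100*MB, 200*MB, 500*MB]:
    def empty_transform(file_index):
        # download a file and upload it back unchanged, using the same
        # multi-part settings in both directions
        # (the '.bin' and '.out.bin' key names are assumptions)
        s3 = boto3.client('s3')
        config = TransferConfig(
            multipart_threshold=multipart_chunksize,
            multipart_chunksize=multipart_chunksize
        )
        s3.download_file(bucket,
                         f'{large_file_path}/{file_index}.bin',
                         f'{local_path}/{file_index}.bin',
                         Config=config)
        s3.upload_file(f'{local_path}/{file_index}.bin',
                       bucket,
                       f'{large_file_path}/{file_index}.out.bin',
                       Config=config)

    stats = []
    for i in range(num_trials):
        with Pool(processes=num_processes) as pool:
            t0 = time.perf_counter()
            pool.map(empty_transform, range(num_files))
            transform_time = time.perf_counter() - t0
            stats.append(transform_time)

    # one GET and one PUT are charged for every part of every file
    num_operations = num_files*math.ceil(file_size/multipart_chunksize)
    transform_cost = num_operations * (cost_per_get + cost_per_put)
    print(f'chunk size {multipart_chunksize}')
    print(f'transform time {mean(stats)} variance {variance(stats)}')
    print(f'cost of API calls {transform_cost}')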

In this example we have fixed the file size to 500 MB and applied the same multi-part settings to both the download and the upload. A more complete analysis would vary the size of the data files and the multi-part settings.

In the table below we summarize the results of the experiment.

Impact of Multi-part Chunk Size on Data Transformation Speed and Cost (by Author)

The results indicate that up to a multi-part chunk size of 500 MB (the size of our files), the impact on the time of the data transformation is minimal. On the other hand, the potential savings in cloud storage API costs is significant: up to 98.4% when compared with using Boto3’s default chunk size (8 MB), since each 500 MB file is transferred in a single part rather than in 63. Not only does this example demonstrate the cost benefit of grouping samples together, but it also implies an additional opportunity for savings through appropriate configuration of the multi-part data transfer settings.

Conclusion

Let’s apply the results of our last example to the thought experiment we introduced at the top of this post. We showed that applying a simple transformation to 1 billion data samples would cost $5,400 if the samples were stored in individual files. If we were to group the samples into 2 million files, each with 500 samples, and apply the transformation without multi-part data transfer (as in the last trial of the example above), the cost of the API calls would be reduced to $10.80 (2 million GET calls plus 2 million PUT calls). At the same time, assuming the same test environment, the impact we would expect (based on our experiments) on the overall runtime would be relatively low. I would call that a pretty good deal. Wouldn’t you?

Summary

When developing cloud-based big-data applications, it is critical that we be fully aware of all the details of the costs of our activities. In this post we focused on the “Requests & data retrievals” component of the Amazon S3 pricing strategy. We demonstrated how this component can become a major part of the overall cost of a big-data application. We discussed two of the factors that can affect this cost: the manner in which data samples are grouped together and the way in which multi-part data transfer is used.

Naturally, optimizing just one cost component is liable to increase other components in a way that raises the overall cost. An appropriate design for your data storage will need to take into account all potential cost factors and will depend greatly on your specific data needs and usage patterns.

As usual, please feel free to reach out with comments and corrections.

