A comparative evaluation for AWS S3 development
In this tutorial, we will delve into the world of AWS S3 development with Python by exploring and comparing two powerful libraries: boto3 and awswrangler.
If you've ever wondered "What's the best Python tool to interact with AWS S3 buckets?" or "How do I perform S3 operations in the most efficient way?", then you've come to the right place.
Indeed, throughout this post, we will cover a variety of common operations essential for working with AWS S3 buckets, among which:
- listing objects,
- checking object existence,
- downloading objects,
- uploading objects,
- deleting objects,
- writing objects,
- reading objects (standard way or with SQL)
By comparing the two libraries, we will discover their similarities, differences, and optimal use cases for each operation. By the end, you'll have a clear understanding of which library is better suited to specific S3 tasks.
Moreover, for those who read to the very bottom, we will also explore how to leverage boto3 and awswrangler to read data from S3 using friendly SQL queries.
So let's dive in and discover the best tools for interacting with AWS S3, and learn how to perform these operations efficiently with Python using both libraries.
The package versions used in this tutorial are:
boto3==1.26.80
awswrangler==2.19.0
Also, three initial files containing randomly generated account_balances data have been uploaded to an S3 bucket named coding-tutorials:
You should be aware that various ways exist to establish a connection to an S3 bucket; in this case, we're going to use setup_default_session() from boto3:
# CONNECTING TO S3 BUCKET
import os
import io
import boto3
import awswrangler as wr
import pandas as pd

boto3.setup_default_session(aws_access_key_id='your_access_key',
                            aws_secret_access_key='your_secret_access_key')

bucket = 'coding-tutorials'
This method is handy because, once the session has been set, it can be shared by both boto3 and awswrangler, meaning that we won't need to pass any more secrets down the line.
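If you'd rather not rely on the default session, awswrangler functions generally also accept an explicit session object via the boto3_session parameter. A minimal sketch under that assumption:
# Alternative sketch: create a named session and pass it explicitly
session = boto3.Session(aws_access_key_id='your_access_key',
                        aws_secret_access_key='your_secret_access_key')

# awswrangler calls can take the session via boto3_session,
# falling back to the default session when it is omitted
objects = wr.s3.list_objects(f's3://{bucket}/', boto3_session=session)
print(objects)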
Now let's compare boto3 and awswrangler while performing various common operations, to find out which is the best tool for the job.
The full notebook including the code that follows can be found in this GitHub folder.
# 1 Listing Objects
Listing objects is probably the first operation we should perform when exploring a new S3 bucket, and it's an easy way to check whether a session has been correctly set.
With boto3, objects can be listed using:
boto3.client('s3').list_objects()
boto3.resource('s3').Bucket().objects.all()
print('--BOTO3--')

# BOTO3 - Preferred Method
client = boto3.client('s3')
for obj in client.list_objects(Bucket=bucket)['Contents']:
    print('File Name:', obj['Key'], 'Size:', round(obj['Size'] / (1024 * 1024), 2), 'MB')

print('----')

# BOTO3 - Alternative Method
resource = boto3.resource('s3')
for obj in resource.Bucket(bucket).objects.all():
    print('File Name:', obj.key, 'Size:', round(obj.size / (1024 * 1024), 2), 'MB')
Although both the client and resource classes do a decent job, the client class should be preferred, as it is more elegant and provides a lot of easily accessible low-level metadata as a nested JSON (among which, the object size).
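One caveat worth keeping in mind is that list_objects() returns at most 1,000 keys per call; for larger buckets, the usual pattern is a paginator. A minimal sketch, not part of the original example:
# Sketch: paginate through buckets containing more than 1,000 objects
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

for page in paginator.paginate(Bucket=bucket):
    for obj in page.get('Contents', []):
        print('File Name:', obj['Key'], 'Size:', round(obj['Size'] / (1024 * 1024), 2), 'MB')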
On the other hand, awswrangler only provides a single method to list objects:
wr.s3.list_objects()
Being a high-level method, this doesn't return any low-level metadata about the object, so that to find the file size we need to call:
wr.s3.size_objects()
print('--AWS_WRANGLER--')

# AWS WRANGLER
for obj in wr.s3.list_objects("s3://coding-tutorials/"):
    print('File Name:', obj.replace('s3://coding-tutorials/', ''))

print('----')

for obj, size in wr.s3.size_objects("s3://coding-tutorials/").items():
    print('File Name:', obj.replace('s3://coding-tutorials/', ''), 'Size:', round(size / (1024 * 1024), 2), 'MB')
The code above returns:
Comparison → Boto3 Wins
Although awswrangler is more straightforward to use, boto3 wins the challenge when it comes to listing S3 objects. In fact, its low-level implementation means that much more object metadata can be retrieved using one of its classes. Such information is incredibly useful when accessing an S3 bucket in a programmatic way.
# 2 Checking Object Existence
The ability to check object existence is required when we want further operations to be triggered depending on whether an object is already available in S3 or not.
With boto3, such checks can be performed using:
boto3.client('s3').head_object()
object_key = 'account_balances_jan2023.parquet'

# BOTO3
print('--BOTO3--')
client = boto3.client('s3')
try:
    client.head_object(Bucket=bucket, Key=object_key)
    print(f"The object exists in the bucket {bucket}.")
except client.exceptions.ClientError:
    # head_object() raises a generic ClientError (404) rather than NoSuchKey when the key is missing
    print(f"The object does not exist in the bucket {bucket}.")
Instead, awswrangler provides a dedicated method:
wr.s3.does_object_exist()
# AWS WRANGLER
print('--AWS_WRANGLER--')

# does_object_exist() returns a boolean rather than raising an exception
if wr.s3.does_object_exist(f's3://{bucket}/{object_key}'):
    print(f"The object exists in the bucket {bucket}.")
else:
    print(f"The object does not exist in the bucket {bucket}.")
The code above returns:
Comparison → AWSWrangler Wins
Let's admit it: the boto3 method name [head_object()] is not that intuitive.
Also, having a dedicated method is undoubtedly an advantage for awswrangler, which wins this match.
# 3 Downloading Objects
Downloading objects locally is incredibly easy with both boto3 and awswrangler, using the following methods:
boto3.client('s3').download_file()
or
wr.s3.download()
The only difference is that download_file() takes bucket, object_key and local_file as input variables, whereas download() only requires the S3 path and local_file:
object_key = 'account_balances_jan2023.parquet'

# BOTO3
client = boto3.client('s3')
client.download_file(bucket, object_key, 'tmp/account_balances_jan2023_v2.parquet')
# AWS WRANGLER
wr.s3.download(path=f's3://{bucket}/{object_key}', local_file='tmp/account_balances_jan2023_v3.parquet')
When the code is executed, both versions of the same object are indeed downloaded locally inside the tmp/ folder:
Comparison → Draw
We can consider both libraries to be equivalent as far as downloading files is concerned, so let's call it a draw.
# 4 Uploading Objects
The same reasoning applies when uploading files from the local environment to S3. The methods that can be employed are:
boto3.client('s3').upload_file()
or
wr.s3.upload()
object_key_1 = 'account_balances_apr2023.parquet'
object_key_2 = 'account_balances_may2023.parquet'

file_path_1 = os.path.dirname(os.path.realpath(object_key_1)) + '/' + object_key_1
file_path_2 = os.path.dirname(os.path.realpath(object_key_2)) + '/' + object_key_2
# BOTO3
client = boto3.client('s3')
client.upload_file(file_path_1, bucket, object_key_1)
# AWS WRANGLER
wr.s3.upload(local_file=file_path_2, path=f's3://{bucket}/{object_key_2}')
Executing the code uploads two new account_balances objects (for the months of April and May 2023) to the coding-tutorials bucket:
Comparison → Draw
That is one other draw. To date there’s absolute parity between the 2 libraries!
# 5 Deleting Objects
Let's now assume we wished to delete the following objects:
# SINGLE OBJECT
object_key = 'account_balances_jan2023.parquet'

# MULTIPLE OBJECTS
object_keys = ['account_balances_jan2023.parquet',
               'account_balances_feb2023.parquet',
               'account_balances_mar2023.parquet']
boto3 allows deleting objects one by one or in bulk, using the following methods:
boto3.client('s3').delete_object()
boto3.client('s3').delete_objects()
Both methods return a response including ResponseMetadata that can be used to verify whether objects have been deleted successfully or not. For instance:
- when deleting a single object, an HTTPStatusCode==204 indicates that the operation was completed successfully (if the object was present in the S3 bucket);
- when deleting multiple objects, a Deleted list is returned with the names of the successfully deleted items.
# BOTO3
print('--BOTO3--')
client = boto3.client('s3')

# Delete Single Object
response = client.delete_object(Bucket=bucket, Key=object_key)
deletion_date = response['ResponseMetadata']['HTTPHeaders']['date']

if response['ResponseMetadata']['HTTPStatusCode'] == 204:
    print(f'Object {object_key} deleted successfully on {deletion_date}.')
else:
    print('Object could not be deleted.')

# Delete Multiple Objects
objects = [{'Key': key} for key in object_keys]
response = client.delete_objects(Bucket=bucket, Delete={'Objects': objects})
deletion_date = response['ResponseMetadata']['HTTPHeaders']['date']

if len(object_keys) == len(response['Deleted']):
    print(f'All objects were deleted successfully on {deletion_date}')
else:
    print('Some objects could not be deleted.')
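For even more granular logging, the bulk-delete response may also carry an Errors list when some keys fail to delete; a brief sketch reusing the response variable from above:
# Sketch: surface per-object failures, if any, from the bulk-delete response
for err in response.get('Errors', []):
    print('Could not delete', err['Key'], '-', err.get('Message', 'unknown error'))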
On the other hand, awswrangler provides a single method that can be used for both single and bulk deletions:
wr.s3.delete_objects()
Since object_keys can be passed to the method via a list comprehension, instead of being converted to a dictionary first as before, using this syntax is a real pleasure.
# AWS WRANGLER
print('--AWS_WRANGLER--')

# Delete Single Object
wr.s3.delete_objects(path=f's3://{bucket}/{object_key}')

# Delete Multiple Objects
try:
    wr.s3.delete_objects(path=[f's3://{bucket}/{key}' for key in object_keys])
    print('All objects deleted successfully.')
except Exception:
    print('Objects could not be deleted.')
Executing the code above deletes the objects in S3 and then returns:
Comparison → Boto3 Wins
This is a tough one: awswrangler has a simpler syntax to use when deleting multiple objects, as we can simply pass the whole list to the method.
However, boto3 returns a lot of information in the response, which makes for extremely useful logs when deleting objects programmatically.
Because in a production environment low-level metadata is better than almost no metadata, boto3 wins this challenge and now leads 2-1.
# 6 Writing Objects
When it comes to writing files to S3, boto3 doesn't even provide an out-of-the-box method to perform such operations.
For instance, if we wanted to create a new parquet file using boto3, we'd first have to persist the object on the local disk (using the to_parquet() method from pandas) and then upload it to S3 using the upload_fileobj() method.
Differently from upload_file() (explored at point 4), the upload_fileobj() method is a managed transfer which will perform a multipart upload in multiple threads, if necessary (a configuration sketch follows the snippet below):
object_key_1 = 'account_balances_june2023.parquet'

# RUN THE GENERATOR.PY SCRIPT
df.to_parquet(object_key_1)

# BOTO3
client = boto3.client('s3')

# Upload the Parquet file to S3
with open(object_key_1, 'rb') as file:
    client.upload_fileobj(file, bucket, object_key_1)
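If you need finer control over when and how the multipart transfer kicks in, boto3 accepts a TransferConfig object; the thresholds below are illustrative values, not recommendations:
from boto3.s3.transfer import TransferConfig

# Sketch: tune the managed transfer (illustrative thresholds)
config = TransferConfig(multipart_threshold=8 * 1024 * 1024,  # switch to multipart above ~8 MB
                        max_concurrency=4)                     # use up to 4 threads

with open(object_key_1, 'rb') as file:
    client.upload_fileobj(file, bucket, object_key_1, Config=config)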
On the other hand, one of the main benefits of the awswrangler library (when working with pandas) is that it can be used to write objects directly to the S3 bucket (without saving them to the local disk), which is both elegant and efficient.
Furthermore, awswrangler offers great flexibility, allowing users to:
- apply specific compression algorithms like snappy, gzip and zstd;
- append to or overwrite existing files via the mode parameter when dataset=True (a sketch of append mode follows further below);
- specify one or more partition columns via the partition_cols parameter.
object_key_2 = 'account_balances_july2023.parquet'

# AWS WRANGLER
wr.s3.to_parquet(df=df,
path=f's3://{bucket}/{object_key_2}',
compression = 'gzip',
partition_cols = ['COMPANY_CODE'],
dataset=True)
Once executed, the code above writes account_balances_june2023 as a single parquet file, and account_balances_july2023 as a folder with 4 files already partitioned by COMPANY_CODE:
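As a quick sketch of the mode parameter mentioned above, a later month could be appended to the same partitioned dataset; df_august is a hypothetical DataFrame with the same schema as df:
# Sketch: append a hypothetical df_august to the existing partitioned dataset
wr.s3.to_parquet(df=df_august,
                 path=f's3://{bucket}/{object_key_2}',
                 compression='gzip',
                 partition_cols=['COMPANY_CODE'],
                 dataset=True,
                 mode='append')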
Comparison → AWSWrangler Wins
If working with pandas is an option, awswrangler offers a far more advanced set of operations for writing files to S3, particularly when compared to boto3, which in this case is not exactly the best tool for the job.
# 7.1 Reading Objects (Python)
A similar reasoning applies when attempting to read objects from S3 using boto3: since this library doesn't offer a built-in read method, the best option we have is to perform an API call (get_object()), read the Body of the response and then pass the parquet_object to pandas.
Note that the pd.read_parquet() method expects a file-like object as input, which is why we need to pass the content read from the parquet_object as a binary stream.
Indeed, by using io.BytesIO() we create a temporary file-like object in memory, avoiding the need to save the Parquet file locally before reading it. This in turn improves performance, especially when working with large files:
object_key = 'account_balances_may2023.parquet'

# BOTO3
client = boto3.client('s3')
# Read the Parquet file
response = client.get_object(Bucket=bucket, Key=object_key)
parquet_object = response['Body'].read()
df = pd.read_parquet(io.BytesIO(parquet_object))
df.head()
As expected, awswrangler instead excels at reading objects from S3, returning a pandas df as output.
It supports various input formats like csv, json, parquet and, more recently, delta tables. Also, passing the chunked parameter allows reading objects in a memory-friendly way (see the sketch after the snippet below):
# AWS WRANGLER
df = wr.s3.read_parquet(path=f's3://{bucket}/{object_key}')
df.head()

# wr.s3.read_csv()
# wr.s3.read_json()
# wr.s3.read_parquet_table()
# wr.s3.read_deltalake()
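As mentioned, the chunked parameter makes reads more memory-friendly; a minimal sketch, where the chunk size is just an illustrative value:
# Sketch: iterate over the object in chunks instead of loading it all at once
for chunk in wr.s3.read_parquet(path=f's3://{bucket}/{object_key}', chunked=100_000):
    print(len(chunk))  # each chunk is a pandas DataFrame of up to 100,000 rows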
Executing the read_parquet() snippet above (without chunking) returns a pandas df with May data:
Comparison → AWSWrangler Wins
Yes, there are ways around the lack of proper read methods in boto3.
However, awswrangler is a library conceived to read S3 objects efficiently, hence it also wins this challenge.
# 7.2 Reading Objects (SQL)
Those who managed to read until this point deserve a bonus, and that bonus is reading objects from S3 using plain SQL.
Let's suppose we wished to fetch data from the account_balances_may2023.parquet object using the query below (which filters data by AS_OF_DATE):
object_key = 'account_balances_may2023.parquet'
query = """SELECT * FROM s3object s
WHERE AS_OF_DATE > CAST('2023-05-13T' AS TIMESTAMP)"""
In boto3, this can be achieved via the select_object_content() method. Note how we should also specify the InputSerialization and OutputSerialization formats:
# BOTO3
client = boto3.client('s3')

resp = client.select_object_content(
    Bucket=bucket,
    Key=object_key,
    Expression=query,
    ExpressionType='SQL',
    InputSerialization={"Parquet": {}},
    OutputSerialization={'JSON': {}},
)

records = []

# Process the response
for event in resp['Payload']:
    if 'Records' in event:
        records.append(event['Records']['Payload'].decode('utf-8'))

# Concatenate the JSON records into a single string
json_string = ''.join(records)

# Load the JSON data into a pandas DataFrame
df = pd.read_json(json_string, lines=True)

# Print the DataFrame
df.head()
If working with a pandas df is an option, awswrangler also offers a very handy select_query() method that requires minimal code:
# AWS WRANGLER
df = wr.s3.select_query(
sql=query,
path=f's3://{bucket}/{object_key}',
input_serialization="Parquet",
input_serialization_params={}
)
df.head()
For both libraries, the returned df will look like this:
In this tutorial we explored 7 common operations that can be performed on S3 buckets and ran a comparative analysis between the boto3 and awswrangler libraries.
Both approaches allow us to interact with S3 buckets; however, the main difference is that the boto3 client provides low-level access to AWS services, while awswrangler offers a simplified, more high-level interface for various data engineering tasks.
Overall, awswrangler is our winner with 3 points (checking object existence, writing objects, reading objects) vs 2 points scored by boto3 (listing objects, deleting objects). Both the upload and download categories were draws and didn't assign points.
Despite the result above, the truth is that both libraries give their best when used interchangeably, each excelling at the tasks it has been built for.