Boto3 vs AWS Wrangler: Simplifying S3 Operations with Python

A comparative evaluation for AWS S3 development

Photo by Hemerson Coelho on Unsplash

In this tutorial, we will delve into the world of AWS S3 development with Python by exploring and comparing two powerful libraries: boto3 and awswrangler.

If you've ever wondered

"What's the best Python tool to interact with AWS S3 buckets?"

"How do I perform S3 operations in the most efficient way?"

then you've come to the right place.

Indeed, throughout this post, we will cover a number of common operations essential for working with AWS S3 buckets, among which:

  1. listing objects,
  2. checking object existence,
  3. downloading objects,
  4. uploading objects,
  5. deleting objects,
  6. writing objects,
  7. reading objects (standard way or with SQL)

By comparing the two libraries, we will discover their similarities, differences, and optimal use cases for each operation. By the end, you'll have a clear understanding of which library is better suited to specific S3 tasks.

Moreover, for those who read to the very bottom, we will also explore how to leverage boto3 and awswrangler to read data from S3 using friendly SQL queries.

So let's dive in and discover the best tools for interacting with AWS S3, and learn how to perform these operations efficiently with Python using both libraries.

The package versions used in this tutorial are:

  • boto3==1.26.80
  • awswrangler==2.19.0

Also, three initial files containing randomly generated account_balances data have been uploaded to an S3 bucket named coding-tutorials:
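The post doesn't show how these account_balances files were generated, so the snippet below is a purely hypothetical sketch: the column names (COMPANY_CODE, AS_OF_DATE) are assumed from the partitioning and SQL examples that appear later on.

# Hypothetical sketch of an account_balances generator (schema assumed, not from the original post)
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 100_000

df = pd.DataFrame({
    'ACCOUNT_ID': rng.integers(10_000, 99_999, n_rows),
    'COMPANY_CODE': rng.choice(['C1', 'C2', 'C3', 'C4'], n_rows),
    'AS_OF_DATE': pd.to_datetime('2023-01-01') + pd.to_timedelta(rng.integers(0, 31, n_rows), unit='D'),
    'BALANCE': rng.normal(50_000, 15_000, n_rows).round(2),
})

df.to_parquet('account_balances_jan2023.parquet')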

Although various ways exist to establish a connection to an S3 bucket, in this case we are going to use setup_default_session() from boto3:

# CONNECTING TO S3 BUCKET
import os
import io
import boto3
import awswrangler as wr
import pandas as pd

boto3.setup_default_session(aws_access_key_id='your_access_key',
                            aws_secret_access_key='your_secret_access_key')

bucket = 'coding-tutorials'

This method is handy because, once the session has been set, it can be shared by both boto3 and awswrangler, meaning that we won't need to pass any more secrets down the line.
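As a side note, and assuming a profile is already configured in ~/.aws/credentials, the same default session can also be set up without hard-coding keys (the profile and region names below are placeholders):

# Alternative setup: rely on a named AWS profile instead of explicit keys;
# both boto3 and awswrangler will then share this default session.
boto3.setup_default_session(profile_name='my-profile', region_name='eu-west-1')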

Now let's compare boto3 and awswrangler while performing various common operations, and find out which is the best tool for the job.

The full notebook, including the code that follows, can be found in this GitHub folder.

# 1 Listing Objects

Listing objects is probably the first operation we should perform while exploring a new S3 bucket, and it is an easy way to check whether a session has been correctly set.

With boto3, objects can be listed using:

  • boto3.client('s3').list_objects()
  • boto3.resource('s3').Bucket().objects.all()
print('--BOTO3--')
# BOTO3 - Preferred Method
client = boto3.client('s3')

for obj in client.list_objects(Bucket=bucket)['Contents']:
    print('File Name:', obj['Key'], 'Size:', round(obj['Size'] / (1024 * 1024), 2), 'MB')

print('----')

# BOTO3 - Alternative Method
resource = boto3.resource('s3')

for obj in resource.Bucket(bucket).objects.all():
    print('File Name:', obj.key, 'Size:', round(obj.size / (1024 * 1024), 2), 'MB')

Although both the client and resource classes do a decent job, the client class should be preferred, as it is more elegant and provides a lot of easily accessible low-level metadata as a nested JSON response (among which the object size).

On the other hand, awswrangler only provides a single method to list objects:

  • wr.s3.list_objects()

Being a high-level method, this doesn't return any low-level metadata about the objects, so to find the file size we need to call:

  • wr.s3.size_objects()

print('--AWS_WRANGLER--')
# AWS WRANGLER

for obj in wr.s3.list_objects("s3://coding-tutorials/"):
    print('File Name:', obj.replace('s3://coding-tutorials/', ''))

print('----')

for obj, size in wr.s3.size_objects("s3://coding-tutorials/").items():
    print('File Name:', obj.replace('s3://coding-tutorials/', ''), 'Size:', round(size / (1024 * 1024), 2), 'MB')

The code above returns:

Comparison → Boto3 Wins

Although awswrangler is more straightforward to use, boto3 wins the challenge when it comes to listing S3 objects. In fact, its low-level implementation means that much more object metadata can be retrieved using one of its classes. Such information is incredibly useful when accessing S3 buckets in a programmatic way.
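To make that concrete, here is a minimal sketch of how such metadata can be used programmatically, for instance to flag large or recently modified objects. Since list_objects returns at most 1,000 keys per call, a paginator over list_objects_v2 is used; the thresholds below are purely illustrative.

# Sketch: use the low-level metadata to flag objects larger than 10 MB
# or modified in the last 30 days (illustrative thresholds)
from datetime import datetime, timezone, timedelta

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

cutoff = datetime.now(timezone.utc) - timedelta(days=30)

for page in paginator.paginate(Bucket=bucket):
    for obj in page.get('Contents', []):
        if obj['Size'] > 10 * 1024 * 1024 or obj['LastModified'] > cutoff:
            print('Flagged:', obj['Key'], obj['Size'], obj['LastModified'])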

# 2 Checking Object Existence

The ability to check whether an object exists is required when we want additional operations to be triggered as a result of an object already being available in S3 or not (a short sketch at the end of this section illustrates the pattern).

With boto3, such a check can be performed using:

  • boto3.client('s3').head_object()
object_key = 'account_balances_jan2023.parquet'

# BOTO3
print('--BOTO3--')
from botocore.exceptions import ClientError

client = boto3.client('s3')

try:
    client.head_object(Bucket=bucket, Key=object_key)
    print(f"The object exists in the bucket {bucket}.")
except ClientError as e:
    # head_object raises a generic 404 ClientError (not NoSuchKey) when the key is missing
    if e.response['Error']['Code'] == '404':
        print(f"The object does not exist in the bucket {bucket}.")
    else:
        raise

Instead, awswrangler provides the dedicated method:

  • wr.s3.does_object_exist()
# AWS WRANGLER
print('--AWS_WRANGLER--')

# does_object_exist() returns a boolean, so a simple if/else is enough
if wr.s3.does_object_exist(f's3://{bucket}/{object_key}'):
    print(f"The object exists in the bucket {bucket}.")
else:
    print(f"The object does not exist in the bucket {bucket}.")

The code above returns:

Comparison → AWSWrangler Wins

Let's admit it: the boto3 method name [head_object()] is not that intuitive.

Also, having a dedicated method is undoubtedly an advantage for awswrangler, which wins this match.
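As promised, here is a minimal sketch of how such a check can gate further operations; the upload itself is just an illustrative example.

# Sketch: upload the file only if it is not already in the bucket
object_key = 'account_balances_jan2023.parquet'

if not wr.s3.does_object_exist(f's3://{bucket}/{object_key}'):
    wr.s3.upload(local_file=object_key, path=f's3://{bucket}/{object_key}')
else:
    print(f'{object_key} already exists in {bucket}, skipping upload.')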

# 3 Downloading Objects

Downloading objects locally is incredibly easy with both boto3 and awswrangler, using the following methods:

  • boto3.client('s3').download_file() or
  • wr.s3.download()

The only difference is that download_file() takes bucket, object_key and local_file as input variables, whereas download() only requires the S3 path and local_file:

object_key = 'account_balances_jan2023.parquet'

# BOTO3
client = boto3.client('s3')
client.download_file(bucket, object_key, 'tmp/account_balances_jan2023_v2.parquet')

# AWS WRANGLER
wr.s3.download(path=f's3://{bucket}/{object_key}', local_file='tmp/account_balances_jan2023_v3.parquet')

When the code is executed, both versions of the same object are indeed downloaded locally, inside the tmp/ folder:

Comparison → Draw

We can consider both libraries equivalent as far as downloading files is concerned, so let's call it a draw.

# 4 Uploading Objects

The same reasoning applies when uploading files from the local environment to S3. The methods that can be employed are:

  • boto3.client('s3').upload_file() or
  • wr.s3.upload()
object_key_1 = 'account_balances_apr2023.parquet'
object_key_2 = 'account_balances_may2023.parquet'

file_path_1 = os.path.dirname(os.path.realpath(object_key_1)) + '/' + object_key_1
file_path_2 = os.path.dirname(os.path.realpath(object_key_2)) + '/' + object_key_2

# BOTO3
client = boto3.client('s3')
client.upload_file(file_path_1, bucket, object_key_1)

# AWS WRANGLER
wr.s3.upload(local_file=file_path_2, path=f's3://{bucket}/{object_key_2}')

Executing the code uploads two new account_balances objects (for the months of April and May 2023) to the coding-tutorials bucket:

Comparison → Draw

This is another draw. So far there is absolute parity between the two libraries!

# 5 Deleting Objects

Let's now assume we wanted to delete the following objects:

# SINGLE OBJECT
object_key = 'account_balances_jan2023.parquet'

# MULTIPLE OBJECTS
object_keys = ['account_balances_jan2023.parquet',
               'account_balances_feb2023.parquet',
               'account_balances_mar2023.parquet']

boto3 allows us to delete objects one by one or in bulk, using the following methods:

  • boto3.client('s3').delete_object()
  • boto3.client('s3').delete_objects()

Both methods return a response including ResponseMetadata, which can be used to verify whether objects have been deleted successfully. For instance:

  • when deleting a single object, an HTTPStatusCode==204 indicates that the operation completed successfully (if the object was present in the S3 bucket);
  • when deleting multiple objects, a Deleted list is returned with the names of the successfully deleted items.
# BOTO3
print('--BOTO3--')
client = boto3.client('s3')

# Delete Single Object
response = client.delete_object(Bucket=bucket, Key=object_key)
deletion_date = response['ResponseMetadata']['HTTPHeaders']['date']

if response['ResponseMetadata']['HTTPStatusCode'] == 204:
    print(f'Object {object_key} deleted successfully on {deletion_date}.')
else:
    print('Object could not be deleted.')

# Delete Multiple Objects
objects = [{'Key': key} for key in object_keys]

response = client.delete_objects(Bucket=bucket, Delete={'Objects': objects})
deletion_date = response['ResponseMetadata']['HTTPHeaders']['date']

if len(object_keys) == len(response['Deleted']):
    print(f'All objects were deleted successfully on {deletion_date}.')
else:
    print('Some objects could not be deleted.')

On the other hand, awswrangler provides a single method that can be used for both single and bulk deletions:

  • wr.s3.delete_objects()

Since object_keys can simply be passed to the method inside a list comprehension, instead of being converted to a list of dictionaries first like before, using this syntax is a real pleasure.

# AWS WRANGLER
print('--AWS_WRANGLER--')

# Delete Single Object
wr.s3.delete_objects(path=f's3://{bucket}/{object_key}')

# Delete Multiple Objects
try:
    wr.s3.delete_objects(path=[f's3://{bucket}/{key}' for key in object_keys])
    print('All objects deleted successfully.')
except Exception:
    print('Objects could not be deleted.')

Executing the code above deletes the objects in S3 and then returns:

Comparison → Boto3 Wins

This is a hard one: awswrangler has an easier syntax to use when deleting multiple objects, as we can simply pass the full list to the method.

However, boto3 returns a lot of information in the response, which can serve as extremely useful logs when deleting objects programmatically.

Because in a production environment low-level metadata is better than almost no metadata, boto3 wins this challenge and now leads 2–1.

# 6 Writing Objects

When it comes to writing files to S3, boto3 doesn't even provide an out-of-the-box method to perform such an operation.

For instance, if we wanted to create a new parquet file using boto3, we would first have to persist the object on the local disk (using the to_parquet() method from pandas) and then upload it to S3 using the upload_fileobj() method.

Unlike upload_file() (explored at point 4), the upload_fileobj() method is a managed transfer, which will perform a multipart upload in multiple threads if necessary (a brief tuning sketch follows the snippet below):

object_key_1 = 'account_balances_june2023.parquet'

# RUN THE GENERATOR.PY SCRIPT TO CREATE THE df DATAFRAME

df.to_parquet(object_key_1)

# BOTO3
client = boto3.client('s3')

# Upload the Parquet file to S3
with open(object_key_1, 'rb') as file:
    client.upload_fileobj(file, bucket, object_key_1)
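If finer control over that managed transfer is needed, boto3 exposes a TransferConfig object that can be passed to upload_fileobj(); the threshold and concurrency values below are just illustrative.

# Sketch: tune the managed transfer behind upload_fileobj()
from boto3.s3.transfer import TransferConfig

# Switch to multipart uploads above 8 MB and use up to 4 threads
config = TransferConfig(multipart_threshold=8 * 1024 * 1024,
                        max_concurrency=4)

with open(object_key_1, 'rb') as file:
    client.upload_fileobj(file, bucket, object_key_1, Config=config)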

On the other hand, one of the main advantages of the awswrangler library (while working with pandas) is that it can be used to write objects directly to the S3 bucket (without saving them to the local disk), which is both elegant and efficient.

Furthermore, awswrangler offers great flexibility, allowing users to:

  • Apply specific compression algorithms like snappy, gzip and zstd;
  • Append to or overwrite existing files via the mode parameter when dataset=True (a brief sketch follows the output below);
  • Specify one or more partition columns via the partition_cols parameter.
object_key_2 = 'account_balances_july2023.parquet'

# AWS WRANGLER
wr.s3.to_parquet(df=df,
                 path=f's3://{bucket}/{object_key_2}',
                 compression='gzip',
                 partition_cols=['COMPANY_CODE'],
                 dataset=True)

Once executed, the code above writes account_balances_june2023 as a single parquet file, and account_balances_july2023 as a folder with 4 files already partitioned by COMPANY_CODE:
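As for the mode parameter mentioned in the list above, here is a minimal sketch of how it could be used on the same dataset, assuming we now wanted to append fresh rows rather than rewrite everything.

# Sketch: with dataset=True, mode controls how existing data is handled;
# 'append' adds new files, 'overwrite' replaces the whole dataset,
# 'overwrite_partitions' replaces only the partitions present in df
wr.s3.to_parquet(df=df,
                 path=f's3://{bucket}/{object_key_2}',
                 dataset=True,
                 mode='append',
                 partition_cols=['COMPANY_CODE'])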

Comparison → AWSWrangler Wins

If working with pandas is an option, awswrangler offers a much more advanced set of operations for writing files to S3, particularly compared to boto3, which in this case is not exactly the best tool for the job.

# 7.1 Reading Objects (Python)

A similar reasoning applies when attempting to read objects from S3 using boto3: since this library doesn't offer a built-in read method, the best option we have is to perform an API call (get_object()), read the Body of the response and then pass the parquet_object to pandas.

Note that the pd.read_parquet() method expects a file-like object as input, which is why we need to pass the content read from the parquet_object as a binary stream.

Indeed, by using io.BytesIO() we create a temporary file-like object in memory, avoiding the need to save the Parquet file locally before reading it. This in turn improves performance, especially when working with large files:

object_key = 'account_balances_may2023.parquet'

# BOTO3
client = boto3.client('s3')

# Read the Parquet file
response = client.get_object(Bucket=bucket, Key=object_key)
parquet_object = response['Body'].read()

df = pd.read_parquet(io.BytesIO(parquet_object))
df.head()

As expected, awswrangler instead excels at reading objects from S3, returning a pandas df as output.

It supports various input formats like csv, json, parquet and, more recently, delta tables. Also, passing the chunked parameter allows us to read objects in a memory-friendly way (see the sketch after the output below):

# AWS WRANGLER
df = wr.s3.read_parquet(path=f's3://{bucket}/{object_key}')
df.head()

# wr.s3.read_csv()
# wr.s3.read_json()
# wr.s3.read_parquet_table()
# wr.s3.read_deltalake()

Executing the code above returns a pandas df with May data:
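As for the chunked parameter mentioned above, a minimal sketch: when chunked=True, read_parquet returns an iterator of DataFrames instead of a single df, which keeps memory usage under control.

# Sketch: iterate over the object in chunks instead of loading it all at once
for chunk in wr.s3.read_parquet(path=f's3://{bucket}/{object_key}', chunked=True):
    print('Rows in chunk:', len(chunk))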

Comparison → AWSWrangler Wins

Yes, there are ways around the lack of a proper read method in boto3. Nonetheless, awswrangler is a library conceived to read S3 objects efficiently, hence it also wins this challenge.

# 7.2 Reading Objects (SQL)

Those who managed to read until this point deserve a bonus, and that bonus is reading objects from S3 using plain SQL.

Let's suppose we wanted to fetch data from the account_balances_may2023.parquet object using the query below (which filters data by AS_OF_DATE):

object_key = 'account_balances_may2023.parquet'
query = """SELECT * FROM s3object s
WHERE AS_OF_DATE > CAST('2023-05-13T' AS TIMESTAMP)"""

In boto3 this can be achieved via the select_object_content() method. Note how we should also specify the InputSerialization and OutputSerialization formats:

# BOTO3
client = boto3.client('s3')

resp = client.select_object_content(
    Bucket=bucket,
    Key=object_key,
    Expression=query,
    ExpressionType='SQL',
    InputSerialization={"Parquet": {}},
    OutputSerialization={'JSON': {}},
)

records = []

# Process the response
for event in resp['Payload']:
    if 'Records' in event:
        records.append(event['Records']['Payload'].decode('utf-8'))

# Concatenate the JSON records into a single string
json_string = ''.join(records)

# Load the JSON data into a Pandas DataFrame
df = pd.read_json(json_string, lines=True)

# Print the DataFrame
df.head()

If working with a pandas df is an option, awswrangler also offers a very handy select_query() method that requires minimal code:

# AWS WRANGLER
df = wr.s3.select_query(
    sql=query,
    path=f's3://{bucket}/{object_key}',
    input_serialization="Parquet",
    input_serialization_params={},
)
df.head()

For both libraries, the returned df will look like this:

In this tutorial we explored seven common operations that can be performed on S3 buckets and ran a comparative evaluation between the boto3 and awswrangler libraries.

Both approaches allow us to interact with S3 buckets; however, the main difference is that the boto3 client provides low-level access to AWS services, while awswrangler offers a simplified, higher-level interface for various data engineering tasks.

Overall, awswrangler is our winner with 3 points (checking object existence, writing objects, reading objects) vs 2 points scored by boto3 (listing objects, deleting objects). Both the upload and download categories were draws and didn't assign points.

Despite the result above, the truth is that both libraries give their best when used in combination, each excelling at the tasks it was built for.
