
Keeping things simple and costs to a minimum

ETL stands for Extract, Transform, and Load. An ETL pipeline is basically just a data transformation process: extracting data from one place, doing something with it, and then loading it back to the same place or somewhere else.
If you are working with natural language processing via APIs, which I'm guessing most will start doing, you can easily hit AWS Lambda's timeout threshold when processing your data, especially if at least one function exceeds 15 minutes. So, while Lambda is great because it's quick and really cheap, the timeout can be a hassle.
The alternative here is to deploy your code as a container that has the option of running as long as it needs to, and run it on a schedule. So, instead of spinning up a function as you do with Lambda, we can spin up a container to run in an ECS cluster using Fargate.
For clarification, Lambda, ECS and EventBridge are all AWS services.
Just as with Lambda, the cost of running a container for an hour or two is minimal. However, it's a bit more complicated than running a serverless function. But if you're reading this, then you've probably run into the same issues and are wondering what the easiest way to transition is.
I have created a very simple ETL template that uses Google BigQuery to extract and load data. This template will get you up and running within a few minutes if you follow along.
Using BigQuery is entirely optional, but I normally store my long-term data there.
Instead of building something complex here, I'll show you how to build something minimal and keep it really lean.
If you don't need to process data in parallel, you shouldn't need to include something like Airflow. I've seen a few articles out there that unnecessarily set up complex workflows which aren't strictly necessary for simple data transformation.
Besides, if you feel like you want to add on to this later, that option is yours.
Workflow
We'll build our script in Python since we're doing data transformation, then bundle it up with Docker and push it to an ECR repository.
From there, we can create a task definition using AWS Fargate and run it on a schedule in an ECS cluster.
Don't worry if this feels foreign; you'll understand all these services and what they do as we go along.
Technology
If you are new to working with containers, think of ECS (Elastic Container Service) as something that helps us set up an environment where we can run one or more containers simultaneously.
Fargate, on the other hand, helps us simplify the management and setup of the containers themselves using Docker images, which are referred to as tasks in AWS.
There's the option of using EC2 to set up your containers, but you would have to do quite a bit more manual work. Fargate manages the underlying instances for us, whereas with EC2 you are required to manage and deploy your own compute instances. Hence, Fargate is often referred to as the 'serverless' option.
I found a thread on Reddit discussing this, in case you're keen to read a bit about how users find EC2 versus Fargate. It can give you an idea of how people compare the two.
Not that I'm saying Reddit is the source of truth, but it's useful for getting a sense of user perspectives.
Costs
The primary concern I normally have is keeping the code running efficiently while also managing the total cost.
As we're only running the container when we need to, we only pay for the amount of resources we use. The price we pay is determined by several factors, such as the number of tasks running, the execution duration of each task, the number of virtual CPUs (vCPUs) used for the task, and memory usage.
But to give you a rough idea, at a high level, the total cost of running one task is around $0.01384 per hour in the EU region, depending on the resources you've provisioned.
If we compare this price with AWS Glue, we get a bit of perspective on whether it is good or not.
If an ETL job requires 4 DPUs (the default number for an AWS Glue job) and runs for an hour, it would cost 4 DPUs * $0.44 = $1.76. This cost is for just one hour and is significantly higher than running a simple container.
This is, of course, a simplified calculation, and the actual number of DPUs can vary depending on the job. You can look at AWS Glue pricing in more detail on their pricing page.
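If you want to sanity-check the Fargate side of that comparison yourself, here is a rough back-of-the-envelope calculation. The per-hour rates below are example values only (Fargate bills per vCPU-hour and per GB of memory per hour, and the exact rates vary by region), so look up the current prices for your region on the AWS pricing page before trusting the output.
# Rough Fargate cost estimate for one task; the rates are illustrative
# placeholders, not official AWS prices.
EXAMPLE_PRICE_PER_VCPU_HOUR = 0.04048
EXAMPLE_PRICE_PER_GB_HOUR = 0.004445

task_vcpu = 0.25       # matches the task definition we set up later
task_memory_gb = 0.5   # 512 MB
hours_per_day = 1      # the container only runs while the script runs

hourly = task_vcpu * EXAMPLE_PRICE_PER_VCPU_HOUR + task_memory_gb * EXAMPLE_PRICE_PER_GB_HOUR
monthly = hourly * hours_per_day * 30

print(f"~${hourly:.4f} per hour, ~${monthly:.2f} per month")
# With these example rates this lands around a cent per hour, in the same
# ballpark as the figure quoted above.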
For long-running scripts, setting up your own container and deploying it on ECS with Fargate makes sense, both in terms of efficiency and cost.
To follow this article, I've created a simple ETL template to help you get up and running quickly.
This template uses BigQuery to extract and load data. It will extract a few rows, do something simple and then load them back into BigQuery.
When I run my own pipelines I have other things that transform data (I use APIs for natural language processing that run for a few hours in the morning), but that's up to you to add on later. This is just to give you a template that will be easy to work with.
To follow along with this tutorial, the main steps will be as follows:
- Set up your local code.
- Set up an IAM user & the AWS CLI.
- Build & push the Docker image to AWS.
- Create an ECS task definition.
- Create an ECS cluster.
- Schedule your tasks.
In total, it shouldn't take you longer than 20 minutes to get through this using the code I'll give you. This assumes you have an AWS account ready; if not, add on 5 to 10 minutes.
The Code
First, create a new folder locally and move into it.
mkdir etl-pipelines
cd etl-pipelines
Make sure you have Python installed.
python --version
If not, install it locally.
Once you're ready, you can go ahead and clone the template I have already set up.
git clone https://github.com/ilsilfverskiold/etl-pipeline-fargate.git
When it has finished fetching the code, open it up in your code editor.
First, check the main.py file to see how I've structured the code and understand what it does.
Essentially, it will fetch all names containing "Doe" from a BigQuery table that you specify, transform these names and then insert them back into the same table as new rows.
You can go into each helper function to see how we set up the SQL query job, transform the data and then insert it back into the BigQuery table.
The idea, of course, is that you set up something more complex, but this is a simple test run to make the code easy to tweak.
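To give you a feel for that flow without opening the repository, here is a minimal sketch of what the query, transform and insert steps can look like with google-cloud-bigquery. The helper names and the TABLE_ID placeholder are illustrative, so the template's own functions may be structured a little differently.
from google.cloud import bigquery

# Illustrative sketch of the extract -> transform -> load flow.
TABLE_ID = "your-project.your_dataset.your_table"  # placeholder

def get_rows(client):
    # Extract: fetch every row whose name contains "Doe"
    query = f'SELECT name FROM `{TABLE_ID}` WHERE name LIKE "%Doe%"'
    return [dict(row) for row in client.query(query).result()]

def transform_rows(rows):
    # Transform: do something simple, here just uppercasing the names
    return [{"name": row["name"].upper()} for row in rows]

def insert_rows(client, rows):
    # Load: insert the transformed rows back into the same table
    errors = client.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"Insert failed: {errors}")

if __name__ == "__main__":
    client = bigquery.Client()
    insert_rows(client, transform_rows(get_rows(client)))
    print("New rows have been added")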
Setting Up BigQuery
If you want to proceed with the code I've prepared, you'll need to set up a few things in BigQuery. Otherwise you can skip this part.
Here are the things you'll need:
- A BigQuery table with a 'name' field of type string.
- A few rows in the table with the name "Doe" in them.
- A service account that has access to this dataset.
To get a service account, navigate to IAM in the Google Cloud Console and then to Service Accounts.
Once there, create a new service account.
Once it has been created, you'll need to give your service account BigQuery User access globally via IAM.
You will also need to give this service account access to the dataset itself, which you do directly in BigQuery via the dataset's Share button and then Add Principal.
After you've given it the appropriate permissions, make sure you go back to Service Accounts and download a key. This gives you a JSON file that you need to put in your root folder.
Now, the most important part is making sure the code has access to the Google credentials and is using the correct data table.
You'll want the JSON file you downloaded with the Google credentials in your root folder as google_credentials.json, and then you need to specify the correct table ID.
Now, you may argue that you shouldn't store your credentials locally, which is only right.
You can add the option of storing your JSON file in AWS Secrets Manager later. However, to start, this will be easier.
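For reference, pointing the BigQuery client at a service account key file usually only takes a couple of lines. This sketch assumes the file is named google_credentials.json as above; the template may wire this up slightly differently.
from google.cloud import bigquery
from google.oauth2 import service_account

# Load the downloaded key file and build an authenticated BigQuery client.
credentials = service_account.Credentials.from_service_account_file(
    "google_credentials.json"
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)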
Run ETL Pipeline Locally
We'll run this code locally first, just so we can see that it works.
So, set up a Python virtual environment and activate it.
python -m venv etl-env
source etl-env/bin/activate # On Windows use `etl-env\Scripts\activate`
Then install the dependencies. We only have google-cloud-bigquery in there, but ideally you'll have more dependencies.
pip install -r requirements.txt
Run the main script.
python main.py
This should log 'New rows have been added' in your terminal, confirming that the code works as intended.
The Docker Image
Now, to push this code to ECS we'll have to bundle it up into a Docker image, which means you'll need Docker installed locally.
If you don't have Docker installed, you can download it here.
Docker helps us package an application and its dependencies into an image, which can be easily recognized and run on any system. With ECS, we're required to bundle our code into Docker images, which are then referenced by a task definition and run as containers.
I have already set up a Dockerfile in your folder. You should be able to look at it there.
FROM --platform=linux/amd64 python:3.11-slim
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir -r requirements.txt
CMD ["python", "main.py"]
As you can see, I've kept this really lean since we're not connecting web traffic to any ports here.
We're specifying AMD64, which you may not need if you aren't on a Mac with an M1 chip, but it shouldn't hurt. This tells AWS the architecture of the Docker image so we don't run into compatibility issues.
Create an IAM User
When working with AWS, access needs to be specified. Most of the issues you'll run into are permission issues. We'll be working with the CLI locally, and for this to work we'll need to create an IAM user that will need quite broad permissions.
Go to the AWS console and navigate to IAM. Create a new user, add permissions and then create a new policy to attach to it.
I have specified the permissions needed in the aws_iam_user.json file in your code. You'll see a short snippet below of what this JSON file looks like.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"iam:CreateRole",
"iam:AttachRolePolicy",
"iam:PutRolePolicy",
"ecs:DescribeTaskDefinition",
...more
],
"Resource": "*"
}
]
}
You'll have to go into this file to get all the permissions you need to set; this is just a short snippet. I've included quite a few, which you may want to tweak to your own preferences later.
Once you've created the IAM user and added the correct permissions to it, you'll need to generate an access key. Select 'Command Line Interface (CLI)' when asked about your use case.
Download the credentials. We’ll use these to authenticate in a bit.
Set Up the AWS CLI
Next, we’ll connect our terminal to our AWS account.
If you don't have the CLI set up yet, you can follow the instructions here. It's really easy to set up.
Once you've installed the AWS CLI, you'll have to authenticate with the IAM user we just created.
aws configure
Use the credentials we downloaded for the IAM user in the previous step.
Create an ECR Repository
Now we can start with the DevOps of it all.
We'll first need to create a repository in Elastic Container Registry (ECR). ECR is where we can store and manage our Docker images. We'll be able to reference these images from ECR when we set up our task definitions.
To create a new ECR repository, run this command in your terminal. This will create a repository called bigquery-etl-pipeline.
aws ecr create-repository --repository-name bigquery-etl-pipeline
Note the repository URI you get back.
From here, we have to build the Docker image and then push it to this repository.
To do this, you can technically go into the AWS console and find the ECR repository we just created. There, AWS will let you see all the push commands you need to run to authenticate, build and push your Docker image to this ECR repository.
However, if you are on a Mac, I would advise you to specify the architecture when building the Docker image or you may run into issues.
If you are following along with me, start by authenticating your Docker client like so.
aws ecr get-login-password --region YOUR_REGION | docker login --username AWS --password-stdin YOUR_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com
Make sure you change the values for region and account ID where applicable.
Build the Docker image.
docker buildx build --platform=linux/amd64 -t bigquery-etl-pipeline .
This is where I have tweaked the command to specify the linux/amd64 architecture.
Tag the Docker image.
docker tag bigquery-etl-pipeline:latest YOUR_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com/bigquery-etl-pipeline:latest
Push the Docker image.
docker push YOUR_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com/bigquery-etl-pipeline:latest
If everything worked as planned, you'll see something like this in your terminal.
9f691c4f0216: Pushed
ca0189907a60: Pushed
687f796c98d5: Pushed
6beef49679a3: Pushed
b0dce122021b: Pushed
4de04bd13c4a: Pushed
cf9b23ff5651: Pushed
644fed2a3898: Pushed
Now that we have pushed the Docker image to an ECR repository, we can use it to set up our task definition with Fargate.
If you run into EOF issues here, it's most likely related to IAM permissions. Make sure you give it everything it needs, in this case full access to ECR to tag and push the image.
Roles & Log Groups
Remember what I told you before: the biggest issues you'll run into in AWS relate to roles between different services.
For this to flow neatly, we'll need to make sure a few things are in place before we start setting up a task definition and an ECS cluster.
To do this, we first need to create a task role (the role that will have access to services in the AWS ecosystem from our container) and then the execution role (so the container will be able to pull the Docker image from ECR).
aws iam create-role --role-name etl-pipeline-task-role --assume-role-policy-document file://ecs-tasks-trust-policy.json
aws iam create-role --role-name etl-pipeline-execution-role --assume-role-policy-document file://ecs-tasks-trust-policy.json
I have specified a JSON file called ecs-tasks-trust-policy.json in your local folder, which will be used to create these roles.
The script that we're pushing won't need permission to access other AWS services, so for now there is no need to attach policies to the task role. However, you may want to do that later.
For the execution role, though, we will need to give it ECR access to pull the Docker image.
To attach the AmazonECSTaskExecutionRolePolicy policy to the execution role, run this command.
aws iam attach-role-policy --role-name etl-pipeline-execution-role --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
Let's also create one last role while we're at it: a service role.
aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com
If you don't create the service role, you may end up with an error such as 'Unable to assume the service linked role. Please verify that the ECS service linked role exists' when you try to run a task.
The last thing we create is a log group. Creating a log group is necessary for capturing and accessing the logs generated by your container.
To create a log group, you can run this command.
aws logs create-log-group --log-group-name /ecs/etl-pipeline-logs
Once you've created the execution role, the task role, the service role and the log group, we can proceed to set up the ECS task definition.
Create an ECS Task Definition
A task definition is a blueprint for your tasks, specifying what container image to use, how much CPU and memory is required, and other configurations. We use this blueprint to run tasks in our ECS cluster.
I have already set up the task definition in your code at task-definition.json. However, you need to set your account ID as well as your region in there to make sure it runs as it should.
{
"family": "my-etl-task",
"taskRoleArn": "arn:aws:iam::ACCOUNT_ID:role/etl-pipeline-task-role",
"executionRoleArn": "arn:aws:iam::ACCOUNT_ID:role/etl-pipeline-execution-role",
"networkMode": "awsvpc",
"containerDefinitions": [
{
"name": "my-etl-container",
"image": "ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/bigquery-etl-pipeline:latest",
"cpu": 256,
"memory": 512,
"essential": true,
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/etl-pipeline-logs",
"awslogs-region": "REGION",
"awslogs-stream-prefix": "ecs"
}
}
}
],
"requiresCompatibilities": ["FARGATE"],
"cpu": "256",
"memory": "512"
}
Remember the URI we got back when we created the ECR repository? This is where we use it. Remember the execution role, the task role and the log group? We use them here as well.
If you've named the ECR repository, the roles and the log group exactly what I named mine, then you can simply change the account ID and region in this JSON; otherwise, make sure the URI is the correct one.
You can also set the CPU and memory here for what you'll need to run your task, i.e. your code. I've set 0.25 vCPU and 512 MB of memory.
Once you're satisfied, you can register the task definition from your terminal.
aws ecs register-task-definition --cli-input-json file://task-definition.json
Now you should be able to go into Amazon Elastic Container Service and find the task definition we've created under Task Definitions.
This task, i.e. the blueprint, won't run on its own; we need to invoke it later.
Create an ECS Cluster
An ECS Cluster serves as a logical grouping of tasks or services. You specify this cluster when running tasks or creating services.
To create a cluster via the CLI, run this command.
aws ecs create-cluster --cluster-name etl-pipeline-cluster
Once you run this command, you'll be able to see the cluster under ECS in your AWS console.
We'll attach the task definition we just created to this cluster when we run it in the next part.
Run Task
Before we can run the task, we need to get hold of the subnets available to us along with a security group ID.
We can do this directly in the terminal via the CLI.
Run this command in the terminal to get the available subnets.
aws ec2 describe-subnets
You'll get back an array of objects here, and you're looking for the SubnetId of each object.
If you run into issues here, make sure your IAM user has the appropriate permissions. See the aws_iam_user.json file in your root folder for the permissions the IAM user connected to the CLI will need. I'll stress this, because it's the main issue I always run into.
To get the security group ID, you can run this command.
aws ec2 describe-security-groups
You're looking for the GroupId here in the terminal.
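If you'd rather pull both values out programmatically than scan the JSON output, a small boto3 sketch like the one below does the same thing. It assumes boto3 is installed locally and picks up the same credentials the AWS CLI is configured with; it is not part of the container.
import boto3

# List subnet IDs and security group IDs in the configured region/account.
ec2 = boto3.client("ec2")

subnet_ids = [s["SubnetId"] for s in ec2.describe_subnets()["Subnets"]]
group_ids = [g["GroupId"] for g in ec2.describe_security_groups()["SecurityGroups"]]

print("Subnets:", subnet_ids)
print("Security groups:", group_ids)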
Once you have at least one SubnetId and a GroupId for a security group, we're ready to run the task to test that the blueprint, i.e. the task definition, works.
aws ecs run-task \
--cluster etl-pipeline-cluster \
--launch-type FARGATE \
--task-definition my-etl-task \
--count 1 \
--network-configuration "awsvpcConfiguration={subnets=[SUBNET_ID],securityGroups=[SECURITY_GROUP_ID],assignPublicIp=ENABLED}"
Do remember to change the names if you've named your cluster and task definition differently. Remember to also set your subnet ID and security group ID.
Now you can navigate to the AWS console to see the task running.
If you are having issues, you can look into the logs.
If it was successful, you should see a few transformed rows added to BigQuery.
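If you prefer checking the logs from your terminal rather than the console, a short boto3 sketch along these lines will print the most recent events from the log group we created earlier (again assuming boto3 is installed locally).
import boto3

# Print the latest events from the container's log group.
logs = boto3.client("logs")
LOG_GROUP = "/ecs/etl-pipeline-logs"

streams = logs.describe_log_streams(
    logGroupName=LOG_GROUP, orderBy="LastEventTime", descending=True, limit=1
)["logStreams"]

if streams:
    events = logs.get_log_events(
        logGroupName=LOG_GROUP, logStreamName=streams[0]["logStreamName"], limit=20
    )["events"]
    for event in events:
        print(event["message"])
else:
    print("No log streams yet, has the task started?")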
EventBridge Schedule
Now we've managed to set up the task to run in an ECS cluster. But what we're interested in is making it run on a schedule. This is where EventBridge comes in.
EventBridge will handle our scheduled events, and we can set this up using the CLI as well. However, before we set up the schedule, we first need to create a new role.
That's life when working with AWS: everything needs permission to interact with everything else.
In this case, EventBridge will need permission to call the ECS cluster on our behalf.
In the repository you have a file called trust-policy-for-eventbridge.json that I have already put there; we'll use this file to create the EventBridge role.
Paste this into the terminal and run it.
aws iam create-role \
--role-name ecsEventsRole \
--assume-role-policy-document file://trust-policy-for-eventbridge.json
We then need to attach a policy to this role.
aws iam attach-role-policy \
--role-name ecsEventsRole \
--policy-arn arn:aws:iam::aws:policy/AmazonECS_FullAccess
It needs at least ecs:RunTask, but here we've given it full access. If you prefer to limit the permissions, you can create a custom policy with just the necessary permissions instead.
Now let's set up the rule to schedule the task to run with the task definition every day at 5 a.m. UTC. This is usually the time I want it to process data for me, so if it fails I can look into it after breakfast.
aws events put-rule \
--name "ETLPipelineDailyRun" \
--schedule-expression "cron(0 5 * * ? *)" \
--state ENABLED
You should receive back an object with a field called RuleArn here. That's just to confirm that it worked.
The next step is to associate the rule with the ECS task definition.
aws events put-targets --rule "ETLPipelineDailyRun" \
--targets '[{"Id":"1","Arn":"arn:aws:ecs:REGION:ACCOUNT_NUMBER:cluster/etl-pipeline-cluster","RoleArn":"arn:aws:iam::ACCOUNT_NUMBER:role/ecsEventsRole","EcsParameters":{"TaskDefinitionArn":"arn:aws:ecs:REGION:ACCOUNT_NUMBER:task-definition/my-etl-task","TaskCount":1,"LaunchType":"FARGATE","NetworkConfiguration":{"awsvpcConfiguration":{"Subnets":["SUBNET_ID"],"SecurityGroups":["SECURITY_GROUP_ID"],"AssignPublicIp":"ENABLED"}}}}]'
Remember to set your own values here for region, account number, subnet and security group.
Use the subnets and security group we got earlier. You can set multiple subnets.
Once you've run the command, the task is scheduled for 5 a.m. every day, and you'll find it under Scheduled Tasks in the AWS console.
AWS Secrets Manager (Optional)
Keeping your Google credentials in the root folder isn't ideal, even if you've limited the Google service account's access to your datasets.
Here we can add the option of moving these credentials to another AWS service and then accessing them from our container.
For this to work, you'll need to move the credentials file to Secrets Manager, tweak the code so it can fetch the secret to authenticate, and make sure the task role has permission to access AWS Secrets Manager on your behalf.
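As a rough sketch, the code tweak could look something like this, assuming you've stored the contents of the key file as a secret (the secret name bigquery-service-account is just an example, and you'd also add boto3 to requirements.txt).
import json

import boto3
from google.cloud import bigquery
from google.oauth2 import service_account

def bigquery_client_from_secrets_manager():
    # Fetch the service account key stored in Secrets Manager.
    secrets = boto3.client("secretsmanager")
    response = secrets.get_secret_value(SecretId="bigquery-service-account")
    key_info = json.loads(response["SecretString"])

    # Build the BigQuery client from the in-memory key instead of a local file.
    credentials = service_account.Credentials.from_service_account_info(key_info)
    return bigquery.Client(credentials=credentials, project=credentials.project_id)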
Once you're done, you can simply push the updated Docker image to the ECR repo you set up before.
The End Result
Now you have a very simple ETL pipeline running in a container on AWS on a schedule. The idea is that you add to it to do your own data transformations.
Hopefully this was a useful piece for anyone transitioning to running their long-running data transformation scripts on ECS in a simple and cost-effective way.
Let me know if you run into any issues, in case there's something I missed including.
❤