
How to Build Simple ETL Pipelines With GitHub Actions


Photo by Roman Synkevych 🇺🇦 on Unsplash

If you're into software development, you'll know what GitHub Actions is. It's a utility by GitHub to automate dev tasks. Or, in popular language, a DevOps tool.

But people hardly use it for building ETL pipelines.

The first thing that comes to mind when discussing ETLs is Airflow, Prefect, or similar tools. They are, unquestionably, the best on the market for task orchestration. But many of the ETLs we build are simple, and hosting a separate tool for them is often overkill.

You can use GitHub Actions instead.

This article focuses on GitHub Actions. But if you're on Bitbucket or GitLab, you can use their respective alternatives too.

We can run our Python, R, or Julia scripts on GitHub Actions. So as a data scientist, you don't have to learn a new language or tool for this. You can even get email notifications when any of your ETL tasks fail.

You still get 2,000 minutes of computation per month on a free account. If you can estimate your ETL workload and it fits within this range, GitHub Actions is worth a try.

How do we start building ETLs on GitHub Actions?

Getting started with GitHub Actions is straightforward. You can follow the official docs, or use the simple steps below.

In your repository, create a directory at .github/workflows. Then create the YAML config file actions.yaml inside it with the following content.

name: ETL Pipeline

on:
  schedule:
    - cron: '0 0 * * *' # Runs at 12:00 AM UTC daily

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'

      - name: Extract data
        run: python extract.py

      - name: Transform data
        run: python transform.py

      - name: Load data
        run: python load.py

The above YAML automates an ETL (Extract, Transform, Load) pipeline. The workflow is triggered daily at 12:00 AM UTC, and it consists of a single job that runs on the ubuntu-latest environment (whatever version is available at the time).
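The schedule field takes a standard five-field cron expression, always evaluated in UTC. A couple of alternative schedules, as a small sketch:

on:
  schedule:
    - cron: '0 */6 * * *'   # every 6 hours, on the hour
    - cron: '30 1 * * 1-5'  # 1:30 AM UTC, Monday through Friday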

The steps in this configuration are simple.

The job has five steps: the first two check out the code and set up the Python environment, respectively, while the next three execute the extract.py, transform.py, and load.py Python scripts sequentially.

This workflow provides an automated and efficient way of extracting, transforming, and loading data every day using GitHub Actions.
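While you're still testing, waiting for the next scheduled run is tedious. GitHub Actions also supports a manual trigger that you can add alongside the schedule; a minimal sketch:

on:
  schedule:
    - cron: '0 0 * * *'
  workflow_dispatch: # adds a "Run workflow" button on the repository's Actions tab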

The Python scripts may vary depending on the scenario. Here's one of many ways.

# extract.py
# --------------------------------
import requests

response = requests.get("https://api.example.com/data")
with open("data.json", "w") as f:
    f.write(response.text)

# transform.py
# --------------------------------
import json

with open("data.json", "r") as f:
    data = json.load(f)

# Perform transformation
transformed_data = [item for item in data if item["key"] == "value"]

# Save transformed data
with open("transformed_data.json", "w") as f:
    json.dump(transformed_data, f)

# load.py
# --------------------------------
import json
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData

# Connect to the database
engine = create_engine("postgresql://myuser:mypassword@localhost:5432/mydatabase")

# Create metadata object
metadata = MetaData()

# Define table schema
mytable = Table(
    "mytable",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("column1", String),
    Column("column2", String),
)

# Create the table if it doesn't already exist
metadata.create_all(engine)

# Read transformed data from file
with open("transformed_data.json", "r") as f:
    data = json.load(f)

# Load data into the database (engine.begin() wraps the inserts in a committed transaction)
with engine.begin() as conn:
    for item in data:
        conn.execute(
            mytable.insert().values(column1=item["column1"], column2=item["column2"])
        )

The above scripts read data from a dummy API and push it to a Postgres database.

Things to consider when deploying ETL pipelines to GitHub Actions.

1. Security: Keep your secrets secure by using GitHub's secret store and avoid hardcoding secrets into your workflows.

Have you noticed that the sample code above has hardcoded database credentials? That's not right for a production system.

We have better ways to handle secrets, like database credentials, securely.

If you don't encrypt your secrets in GitHub Actions, they will be visible to anyone who has access to the repository's source code. This means that if an attacker gains access to the repository, or the repository's source code is leaked, the attacker will be able to see your secret values.

To protect your secrets, GitHub provides a feature called encrypted secrets, which lets you store your secret values securely in the repository settings. Encrypted secrets are only accessible to authorized users and are never exposed in plaintext in your GitHub Actions workflows.

Here's how it works.

In the repository settings sidebar, you'll find the secrets and variables section for Actions. You can create your secrets here.

Screenshot by the author.

Secrets created here are not visible to anyone. They're encrypted and can be used in the workflow. Even you can't read them back; you can only update them with a new value.

Once you've created the secrets, you can pass them to your scripts as environment variables in the GitHub Actions configuration. Here's how it works:

name: ETL Pipeline

on:
  schedule:
    - cron: '0 0 * * *' # Runs at 12:00 AM UTC daily

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      ...

      - name: Load data
        env: # Secrets passed to the script as environment variables
          DB_USER: ${{ secrets.DB_USER }}
          DB_PASS: ${{ secrets.DB_PASS }}
        run: python load.py

Now we can modify the Python scripts to read the credentials from environment variables.

# load.py
# --------------------------------
import json
import os
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData

# Connect to the database using credentials from environment variables
engine = create_engine(
    f"postgresql://{os.environ['DB_USER']}:{os.environ['DB_PASS']}@localhost:5432/mydatabase"
)
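If you want a clearer error than a bare KeyError when a secret isn't wired up correctly, you can add a small guard at the top of the script. A sketch, using the same variable names as the workflow above:

import os

# Fail early with a descriptive message if a required credential is missing
missing = [var for var in ("DB_USER", "DB_PASS") if var not in os.environ]
if missing:
    raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")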

2. Dependencies: Make sure to use the correct versions of dependencies to avoid any issues.

Your Python project may already have a requirements.txt file that specifies dependencies along with their versions. Or, for more sophisticated projects, you may be using modern dependency management tools like Poetry.

You should have a step that sets up your environment before you run the other pieces of your ETL (right after the Python setup step). You can do this by specifying the following in your YAML configuration.

- name: Install dependencies
  run: pip install -r requirements.txt
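For the scripts in this article, a minimal requirements.txt might look like the sketch below; the pinned versions and the psycopg2-binary Postgres driver are illustrative assumptions, not requirements of the approach.

# requirements.txt (illustrative versions)
requests==2.31.0
SQLAlchemy==2.0.25
psycopg2-binary==2.9.9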

3. Timezone settings: GitHub Actions uses the UTC timezone, and as of writing this post, you can't change it.

Thus, you must make sure you're using the correct timezone. You can use an online converter or manually adjust your local time to UTC before configuring the schedule.
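For instance, here's a small throwaway Python sketch (not part of the pipeline) for converting a desired local run time to UTC before writing the cron expression; it assumes Python 3.9+ for zoneinfo, and Asia/Kolkata is just an example timezone.

from datetime import datetime
from zoneinfo import ZoneInfo

# Desired local run time: 6:30 AM in India Standard Time (UTC+5:30)
local_run = datetime(2023, 6, 1, 6, 30, tzinfo=ZoneInfo("Asia/Kolkata"))
utc_run = local_run.astimezone(ZoneInfo("UTC"))
print(utc_run.strftime("%H:%M"))  # 01:00 -> cron: '0 1 * * *'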

The biggest caveat of GitHub Actions scheduling is the uncertainty in execution time. Even though you've configured it to run at a specific point in time, if demand is high at that moment, your job will be queued. Thus, there will be a short delay in the actual start time of the job.

If your job depends on exact execution time, GitHub Actions scheduling is probably not a good option. Using a self-hosted runner in GitHub Actions may help.
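A self-hosted runner only changes the runs-on label in the job definition; here's a minimal sketch, assuming you've already registered a runner for the repository under Settings → Actions → Runners.

jobs:
  etl:
    runs-on: self-hosted # your own machine instead of a GitHub-hosted VM
    steps:
      ...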

4. Resource Usage: Avoid overloading the resources provided by GitHub.

Although GitHub Actions, even on a free account, gives you 2,000 minutes of free run time, the rules change a bit if you use an OS other than Linux.

If you're using a Windows runner, you'll get only half of that; in a macOS environment, you'll get only one-tenth of it. In practice, that works out to roughly 1,000 Windows minutes or 200 macOS minutes per month on the free tier.

Conclusion

GitHub Actions is a DevOps tool. But we can use it to run any scheduled task. In this post, we've discussed how to create an ETL that periodically fetches data from an API and pushes it to a database.

For simple ETLs, this approach is easy to develop and deploy.

But scheduled jobs in GitHub Actions aren't guaranteed to run at the exact scheduled time. Hence, for time-bound tasks, this isn't suitable.
