
Collecting Data with Apache Airflow on a Raspberry Pi

A Raspberry Pi is All You Need

Towards Data Science
Raspberry Pi Zero (model 2021), Image source Wikipedia

Often, we need to collect data over a certain period of time. It might be data from an IoT sensor, statistics from social networks, or something else. For instance, the YouTube Data API allows us to get the number of views and subscribers for any channel at the present moment, but analytics and historical data are available only to the channel owner. Thus, if we want weekly or monthly summaries for these channels, we need to collect this data ourselves. In the case of an IoT sensor, there may be no API at all, and we also need to collect and save the data on our own. In this article, I’ll show how to configure Apache Airflow on a Raspberry Pi, which allows running tasks for a long period of time without involving any cloud provider.
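To make the YouTube example concrete, here is a minimal sketch of what a single "snapshot" of a channel's counters could look like. It assumes the `channels.list` endpoint of the YouTube Data API v3 with `part=statistics`; the function names `build_stats_url` and `parse_stats` are my own illustration, and the sample response is hand-written rather than real data.

```python
import json
from urllib.parse import urlencode

# Endpoint of the YouTube Data API v3 "channels.list" method.
API_URL = "https://www.googleapis.com/youtube/v3/channels"


def build_stats_url(channel_id: str, api_key: str) -> str:
    """Build a request URL that asks only for the 'statistics' part."""
    query = urlencode({"part": "statistics", "id": channel_id, "key": api_key})
    return f"{API_URL}?{query}"


def parse_stats(response_text: str) -> dict:
    """Extract the current counters from a channels.list JSON response."""
    items = json.loads(response_text).get("items", [])
    if not items:
        raise ValueError("channel not found in response")
    stats = items[0]["statistics"]
    # The API returns counters as strings, so convert them to integers.
    # The subscriber count can be hidden by the channel owner.
    subscribers = stats.get("subscriberCount")
    return {
        "views": int(stats["viewCount"]),
        "subscribers": int(subscribers) if subscribers is not None else None,
    }


if __name__ == "__main__":
    # Hand-written sample response (hypothetical channel, made-up numbers).
    sample = json.dumps(
        {"items": [{"id": "UC_x", "statistics": {"viewCount": "1200",
                                                  "subscriberCount": "34"}}]}
    )
    print(build_stats_url("UC_x", "YOUR_API_KEY"))
    print(parse_stats(sample))
```

Each call returns only the values at that moment, which is why the snapshots have to be requested and stored on a schedule if we want a history.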

Obviously, if you’re working for a big company, you will probably not need a Raspberry Pi. In that case, if you need an additional cloud instance, just create a Jira ticket for your MLOps department 😉 But for a pet project or a low-budget startup, it can be an interesting solution.

Let’s see how it works.

Raspberry Pi

What actually is a Raspberry Pi? For readers who have not been interested in hardware for the last 10 years (the first Raspberry Pi model was introduced in 2012), I can briefly explain that it is a single-board computer running full-fledged Linux. Typically, a Raspberry Pi has a 1GHz, 2–4-core ARM CPU and 1–8 GB of RAM. It’s small, cheap, and silent; it has no fans and no disk drive (the OS runs from a Micro SD card). A Raspberry Pi needs only a standard USB power supply; it can be connected to a network via Wi-Fi or Ethernet and run different tasks for months or even years.

For my data science pet project, I wanted to collect YouTube channel statistics over 2 weeks. For a task that requires only 30–60 seconds twice per day, a serverless architecture can be a perfect solution, and we can use something like Google Cloud Functions for that. But every tutorial from Google started with the phrase “enable billing in your project”. There may be free initial credits and free quotas provided by Google, but I didn’t want the extra headache of monitoring how much money I…
