Apache Airflow One Shot- Building End To End ETL Pipeline Using AirFlow And Astro
Based on Krish Naik's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
Apache Airflow plus Astro is presented as a practical way to automate an end-to-end ETL pipeline that pulls live weather data from an API, transforms it into structured key-value fields, and loads it into PostgreSQL—complete with scheduling, a visual DAG UI, and containerized deployment.
The workflow starts with the broader data-project lifecycle: requirements are gathered by domain experts and product stakeholders, data needs are identified by analysts and scientists, and Big Data engineering turns those requirements into pipelines that can run repeatedly. ETL is defined in concrete terms: extract pulls data from multiple sources (the example uses a weather API), transform reshapes and combines that data (the example converts the API response into structured fields such as temperature, wind speed/direction, weather code, and timestamps), and load writes the results into a target datastore (PostgreSQL, with the option of other stores like MongoDB, S3, or MySQL depending on project needs).
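To make the transform step concrete, here is a minimal sketch of flattening an Open-Meteo-style response into those fields. The function name and dictionary layout are illustrative, not the tutorial's exact code; the field names follow Open-Meteo's current_weather payload and should be checked against the actual API response.

```python
def transform_weather(api_response: dict) -> dict:
    """Flatten an Open-Meteo-style response into the structured fields to load."""
    current = api_response["current_weather"]  # key name per Open-Meteo's current_weather payload
    return {
        "latitude": api_response["latitude"],
        "longitude": api_response["longitude"],
        "temperature": current["temperature"],
        "windspeed": current["windspeed"],
        "winddirection": current["winddirection"],
        "weathercode": current["weathercode"],
        "timestamp": current["time"],
    }
```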
Because source data changes over time, the pipeline must run on a schedule rather than as a one-off script. That scheduling and orchestration is handled by Airflow, which runs tasks in a directed order using DAGs (Directed Acyclic Graphs). Each ETL step becomes a task: one task fetches data from the API, another transforms it, and a third loads it into the database. The DAG structure ensures execution order without loops, and Airflow’s UI provides task graphs, durations, and logs.
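One common way to express that mapping is Airflow's TaskFlow API; the sketch below shows the three-task structure and the daily schedule. The DAG id, start date, and default arguments are placeholders, and the tutorial's actual code may use a different style.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.decorators import task

default_args = {"owner": "airflow", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="weather_etl_pipeline",      # illustrative name
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    @task
    def extract_weather_data() -> dict:
        ...  # fetch the raw API response (see the hook sketch further down)

    @task
    def transform_weather_data(raw: dict) -> dict:
        ...  # reshape the response into structured fields

    @task
    def load_weather_data(record: dict) -> None:
        ...  # write the record into PostgreSQL

    # TaskFlow infers the extract -> transform -> load ordering from the data flow.
    load_weather_data(transform_weather_data(extract_weather_data()))
```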
Astro is introduced as a layer that manages Airflow more smoothly, including Docker-based development and deployment. The tutorial’s implementation uses the Astro CLI to scaffold an Airflow project, then creates a new DAG file (ETL_weather.py) under the dags folder. Airflow hooks integrate the external systems: an HTTP hook retrieves weather data from Open-Meteo via an endpoint built from latitude/longitude parameters, and a Postgres hook writes to a PostgreSQL table. The DAG is configured with default arguments and a daily schedule (schedule_interval set to "@daily"). The load step creates the target table if it doesn’t exist and then inserts the transformed records.
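The hook usage could look roughly like the sketch below. The connection IDs, coordinates, and column list are assumptions based on the description above (the weather_data table name comes from the notes); the tutorial's exact code may differ.

```python
from airflow.providers.http.hooks.http import HttpHook
from airflow.providers.postgres.hooks.postgres import PostgresHook

API_CONN_ID = "open_meteo_api"         # assumed HTTP connection pointing at https://api.open-meteo.com
POSTGRES_CONN_ID = "postgres_default"  # assumed Postgres connection
LATITUDE, LONGITUDE = "51.5074", "-0.1278"  # placeholder coordinates


def extract_weather_data() -> dict:
    """Fetch current weather through the HTTP connection defined in Airflow."""
    http_hook = HttpHook(http_conn_id=API_CONN_ID, method="GET")
    endpoint = f"/v1/forecast?latitude={LATITUDE}&longitude={LONGITUDE}&current_weather=true"
    response = http_hook.run(endpoint)
    response.raise_for_status()
    return response.json()


def load_weather_data(record: dict) -> None:
    """Create the target table if it doesn't exist, then insert one transformed record."""
    pg_hook = PostgresHook(postgres_conn_id=POSTGRES_CONN_ID)
    conn = pg_hook.get_conn()
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS weather_data (
            latitude FLOAT,
            longitude FLOAT,
            temperature FLOAT,
            windspeed FLOAT,
            winddirection FLOAT,
            weathercode INT,
            timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
    """)
    cursor.execute(
        """
        INSERT INTO weather_data
            (latitude, longitude, temperature, windspeed, winddirection, weathercode)
        VALUES (%s, %s, %s, %s, %s, %s);
        """,
        (record["latitude"], record["longitude"], record["temperature"],
         record["windspeed"], record["winddirection"], record["weathercode"]),
    )
    conn.commit()
    cursor.close()
```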
Local execution is containerized. A Docker Compose file spins up PostgreSQL alongside the Airflow stack, with persistent storage via a volume so data survives restarts. Running “astro dev start” brings up the Airflow web UI (default admin/admin in the setup) and reveals a broken DAG until missing imports and Airflow connections are fixed. After adding the required Python imports (notably for JSON/request handling) and configuring two Airflow connections—one for PostgreSQL (hosted in the Docker container) and one for the HTTP API—the DAG turns green and runs successfully.
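The PostgreSQL service and its persistent volume in the Compose file might look roughly like this configuration sketch; the image tag, credentials, and volume name are placeholders rather than the tutorial's exact values.

```yaml
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: postgres
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data   # named volume keeps data across restarts

volumes:
  postgres_data:
```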
Once triggered, Airflow logs confirm the API fetch, transformation output is visible in XCom, and the database receives new rows. Verification is done using a database client (DBeaver) to query the weather_data table and observe records updating across multiple runs. For deployment, the approach is straightforward: point the HTTP and Postgres connection settings to AWS endpoints, and the same DAG logic continues writing into the AWS-hosted PostgreSQL database.
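Besides querying from DBeaver, a quick sanity check can also be run from inside the Airflow environment using the same Postgres connection; this is a minimal sketch that assumes the connection ID, table, and columns from the earlier sketches.

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Read back the most recent rows written by the DAG (assumes the weather_data
# table and "postgres_default" connection used in the sketches above).
pg_hook = PostgresHook(postgres_conn_id="postgres_default")
rows = pg_hook.get_records(
    "SELECT latitude, longitude, temperature, timestamp "
    "FROM weather_data ORDER BY timestamp DESC LIMIT 5;"
)
for row in rows:
    print(row)
```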
Cornell Notes
The pipeline automates ETL for weather data using Apache Airflow orchestrated through Astro. Airflow turns ETL steps into tasks inside a DAG: extract weather from Open-Meteo via an HTTP hook, transform the response into structured fields, then load into PostgreSQL using a Postgres hook. The DAG runs on a schedule (@daily) and provides a UI with task graphs, logs, and XCom outputs for intermediate results. Local development uses Astro CLI plus Docker Compose to run Airflow and PostgreSQL together with persistent storage. After configuring Airflow connections (Postgres host/credentials and the API base URL), the DAG executes successfully and inserts new rows into the weather_data table, which can be verified with DBeaver.
Why does the tutorial emphasize ETL running as a scheduled workflow rather than a one-time script?
How does Airflow’s DAG structure map to the ETL steps in this implementation?
What role do Airflow hooks and connections play in making the pipeline portable?
How does the tutorial verify that data actually landed in PostgreSQL?
What does containerization add to the development workflow here?
What changes are needed to move from local execution to AWS deployment?
Review Questions
- What are the three tasks created inside the DAG, and what does each one do?
- How do Airflow connections prevent hardcoding environment-specific details like database hostnames?
- Why is a Docker Compose volume important for the PostgreSQL container in this setup?
Key Points
1. ETL is implemented as three ordered Airflow tasks: extract from an API, transform into structured fields, then load into PostgreSQL.
2. Airflow schedules repeated runs using DAG configuration such as schedule_interval='@daily', which fits continuously changing data sources.
3. Astro streamlines Airflow development and deployment by managing the Airflow environment through Docker-based tooling.
4. HTTP and Postgres hooks rely on Airflow connections; missing or incorrect connections cause DAG failures until fixed.
5. Docker Compose runs PostgreSQL locally with persistent storage so database state survives restarts.
6. Airflow’s UI, logs, and XCom make it possible to debug each ETL stage and confirm intermediate outputs.
7. Deployment to AWS mainly requires updating connection endpoints so the same DAG writes into an AWS PostgreSQL database.