
Apache Airflow One Shot- Building End To End ETL Pipeline Using AirFlow And Astro

Krish Naik · 5 min read

Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

ETL is implemented as three ordered Airflow tasks: extract from an API, transform into structured fields, then load into PostgreSQL.

Briefing

Apache Airflow plus Astro is presented as a practical way to automate an end-to-end ETL pipeline that pulls live weather data from an API, transforms it into structured key-value fields, and loads it into PostgreSQL—complete with scheduling, a visual DAG UI, and containerized deployment.

The workflow starts with the broader data-project lifecycle: requirements are gathered by domain experts and product stakeholders, data needs are identified by analysts and scientists, and Big Data engineering turns those requirements into pipelines that can run repeatedly. ETL is defined in concrete terms: extract pulls data from multiple sources (the example uses a weather API), transform reshapes and combines that data (the example converts the API response into structured fields such as temperature, wind speed/direction, weather code, and timestamps), and load writes the results into a target datastore (PostgreSQL, with the option of other stores like MongoDB, S3, or MySQL depending on project needs).
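The transform step described above can be sketched in plain Python. This is an illustrative flattening of an Open-Meteo-style `current_weather` payload; the field names and sample values are assumptions, not the tutorial's exact schema:

```python
from datetime import datetime, timezone

def transform_weather(api_response: dict) -> dict:
    """Flatten an Open-Meteo-style 'current_weather' payload into
    structured fields ready for a database insert. Field names here
    are illustrative, not the tutorial's exact schema."""
    current = api_response["current_weather"]
    return {
        "temperature": current["temperature"],      # degrees Celsius
        "windspeed": current["windspeed"],          # km/h
        "winddirection": current["winddirection"],  # degrees
        "weathercode": current["weathercode"],      # WMO weather code
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Example payload shaped like an Open-Meteo response:
sample = {"current_weather": {"temperature": 21.3, "windspeed": 12.0,
                              "winddirection": 180, "weathercode": 2}}
record = transform_weather(sample)
print(record["temperature"])  # → 21.3
```

Keeping the transform a pure function like this makes it easy to unit-test independently of Airflow.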

Because source data changes over time, the pipeline must run on a schedule rather than as a one-off script. That scheduling and orchestration is handled by Airflow, which runs tasks in a directed order using DAGs (Directed Acyclic Graphs). Each ETL step becomes a task: one task fetches data from the API, another transforms it, and a third loads it into the database. The DAG structure ensures execution order without loops, and Airflow’s UI provides task graphs, durations, and logs.

Astro is introduced as a layer that manages Airflow more smoothly, including Docker-based development and deployment. The tutorial’s implementation uses the Astro CLI to scaffold an Airflow project, then creates a new DAG file (ETL_weather.py) under the dags folder. Airflow hooks are used to integrate external systems: an HTTP hook retrieves weather data from Open-Meteo using an API endpoint built from latitude/longitude parameters, and a Postgres hook writes to a PostgreSQL table. The DAG is configured with default arguments and a daily schedule (schedule_interval="@daily"). The pipeline is designed so the load step creates the target table if it doesn’t exist and then inserts transformed records.
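Put together, a condensed sketch of what dags/ETL_weather.py might contain is shown below. Task names, coordinates, table schema, and connection IDs are assumptions for illustration, not the tutorial's exact code:

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.http.hooks.http import HttpHook
from airflow.providers.postgres.hooks.postgres import PostgresHook

@dag(schedule_interval="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def etl_weather():
    @task
    def extract() -> dict:
        # 'open_meteo_api' is an Airflow HTTP connection whose host
        # points at the Open-Meteo base URL.
        hook = HttpHook(http_conn_id="open_meteo_api", method="GET")
        resp = hook.run("/v1/forecast?latitude=51.5&longitude=-0.1&current_weather=true")
        return resp.json()

    @task
    def transform(payload: dict) -> dict:
        cw = payload["current_weather"]
        return {"temperature": cw["temperature"],
                "windspeed": cw["windspeed"],
                "winddirection": cw["winddirection"],
                "weathercode": cw["weathercode"]}

    @task
    def load(record: dict) -> None:
        pg = PostgresHook(postgres_conn_id="postgres_default")
        # Create the target table on first run, then insert.
        pg.run("""CREATE TABLE IF NOT EXISTS weather_data (
                      temperature FLOAT, windspeed FLOAT,
                      winddirection FLOAT, weathercode INT,
                      inserted_at TIMESTAMP DEFAULT NOW())""")
        pg.run("""INSERT INTO weather_data
                      (temperature, windspeed, winddirection, weathercode)
                  VALUES (%s, %s, %s, %s)""",
               parameters=(record["temperature"], record["windspeed"],
                           record["winddirection"], record["weathercode"]))

    load(transform(extract()))

etl_weather()
```

Because each task returns a value, XCom carries the intermediate dicts between tasks, which is why the transformed output is visible in the Airflow UI.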

Local execution is containerized. A Docker Compose file spins up PostgreSQL alongside the Airflow stack, with persistent storage via a volume so data survives restarts. Running “astro dev start” brings up the Airflow web UI (default admin/admin in the setup) and reveals a broken DAG until missing imports and Airflow connections are fixed. After adding the required Python imports (notably for JSON/request handling) and configuring two Airflow connections—one for PostgreSQL (hosted in the Docker container) and one for the HTTP API—the DAG turns green and runs successfully.

Once triggered, Airflow logs confirm the API fetch, transformation output is visible in XCom, and the database receives new rows. Verification is done using a database client (DBeaver) to query the weather_data table and observe records updating across multiple runs. For deployment, the approach is straightforward: point the HTTP and Postgres connection settings to AWS endpoints, and the same DAG logic continues writing into the AWS-hosted PostgreSQL database.

Cornell Notes

The pipeline automates ETL for weather data using Apache Airflow orchestrated through Astro. Airflow turns ETL steps into tasks inside a DAG: extract weather from Open-Meteo via an HTTP hook, transform the response into structured fields, then load into PostgreSQL using a Postgres hook. The DAG runs on a schedule (@daily) and provides a UI with task graphs, logs, and XCom outputs for intermediate results. Local development uses Astro CLI plus Docker Compose to run Airflow and PostgreSQL together with persistent storage. After configuring Airflow connections (Postgres host/credentials and the API base URL), the DAG executes successfully and inserts new rows into the weather_data table, which can be verified with DBeaver.

Why does the tutorial emphasize ETL running as a scheduled workflow rather than a one-time script?

Source data changes over time, so the pipeline must be rerun daily, weekly, or monthly. In the example, the weather API returns current conditions based on latitude/longitude, so the extract-transform-load cycle needs repeated execution. Airflow provides that orchestration by scheduling the DAG and running tasks in the correct order.

How does Airflow’s DAG structure map to the ETL steps in this implementation?

The DAG breaks ETL into three tasks: (1) extract_weather_data pulls from the Open-Meteo API using an HTTP hook, (2) transform_weather_data converts the API response into structured key-value fields (temperature, wind speed/direction, weather code, timestamp, etc.), and (3) load_weather_data inserts the transformed data into PostgreSQL using a Postgres hook. The DAG’s directed, acyclic graph ensures the extract task runs before transform, and transform runs before load.

What role do Airflow hooks and connections play in making the pipeline portable?

Hooks encapsulate how to talk to external systems (HTTP for API calls; Postgres for database writes). Connections store environment-specific details like the Postgres host/port/credentials and the API base URL. In the tutorial, the DAG fails initially because connections aren’t set; once the PostgreSQL connection (pointing to the Docker container name and port 5432) and the HTTP connection (Open-Meteo URL) are configured, the DAG runs without changing the core code.
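Besides the UI, Airflow can also read connections from `AIRFLOW_CONN_<CONN_ID>` environment variables in URI form. A minimal sketch of the two URIs this pipeline needs — the hostnames and credentials below are placeholders, not the tutorial's values:

```python
import os
from urllib.parse import urlparse

# Airflow resolves AIRFLOW_CONN_<CONN_ID> environment variables as
# connection URIs. Values here are placeholders for illustration.
os.environ["AIRFLOW_CONN_POSTGRES_DEFAULT"] = (
    "postgres://postgres:postgres@postgres_db:5432/postgres"
)
os.environ["AIRFLOW_CONN_OPEN_METEO_API"] = "https://api.open-meteo.com"

# The URI encodes the same fields entered in the Airflow connections UI:
uri = urlparse(os.environ["AIRFLOW_CONN_POSTGRES_DEFAULT"])
print(uri.hostname, uri.port)  # → postgres_db 5432
```

Moving the pipeline to AWS later is then just a matter of swapping the host (and credentials) in the Postgres URI; the DAG code itself is untouched.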

How does the tutorial verify that data actually landed in PostgreSQL?

After running the DAG, it checks Airflow logs and XCom for intermediate outputs, then queries the database directly. Using DBeaver, it connects to PostgreSQL and runs a SQL query like SELECT * FROM weather_data. Re-running the DAG shows additional records, confirming that the load task inserts new transformed rows.

What does containerization add to the development workflow here?

Docker Compose runs PostgreSQL in a container alongside the Astro/Airflow environment. The tutorial also configures a volume so database data persists even if containers restart. This makes local setup reproducible and reduces “works on my machine” issues when developing and testing the ETL pipeline.

What changes are needed to move from local execution to AWS deployment?

The DAG logic stays the same; the key change is updating Airflow connection settings. After creating an AWS-hosted PostgreSQL instance, the tutorial instructs updating the Postgres connection endpoint (host) in Airflow so the load task writes to the AWS database. With the correct connections, the pipeline continues to extract from the API and load into the new target.

Review Questions

  1. What are the three tasks created inside the DAG, and what does each one do?
  2. How do Airflow connections prevent hardcoding environment-specific details like database hostnames?
  3. Why is a Docker Compose volume important for the PostgreSQL container in this setup?

Key Points

  1. ETL is implemented as three ordered Airflow tasks: extract from an API, transform into structured fields, then load into PostgreSQL.

  2. Airflow schedules repeated runs using DAG configuration such as schedule_interval='@daily', which fits continuously changing data sources.

  3. Astro streamlines Airflow development and deployment by managing the Airflow environment through Docker-based tooling.

  4. HTTP and Postgres hooks rely on Airflow connections; missing or incorrect connections cause DAG failures until fixed.

  5. Docker Compose runs PostgreSQL locally with persistent storage so database state survives restarts.

  6. Airflow’s UI, logs, and XCom make it possible to debug each ETL stage and confirm intermediate outputs.

  7. Deployment to AWS mainly requires updating connection endpoints so the same DAG writes into an AWS PostgreSQL database.

Highlights

Airflow turns ETL steps into a DAG where extract → transform → load are explicit tasks, making execution order and debugging visible.
Running astro dev start with Docker Compose provides a complete local environment: Airflow web UI, worker containers, and PostgreSQL running together.
The pipeline’s success hinges on configuring two Airflow connections—one for PostgreSQL (Docker container host/port) and one for the Open-Meteo HTTP endpoint.
XCom is used to inspect transformed data, while DBeaver confirms that load inserts rows into the weather_data table.
Moving to AWS is framed as a connection update exercise rather than rewriting the ETL logic.

Topics

  • Apache Airflow
  • Astro
  • ETL Pipeline
  • DAG Scheduling
  • PostgreSQL Loading
