We observed a problem where dags did not run at the specified time at all but consistently started at a random time. Let’s dig into it.
Expected Behavior:
We have a job chain of three dags which are scheduled for daily execution and should run in succession. For that we defined the start_date at 22.11.2021 11:30 (12:30 / 13:30 repectively).
Observation:
Airflow dags are running almost at the same time, although they are clearly scheduled 1 hour apart from each other. Moreover they do not start nearly at the same time as indicated by start_date even if you consider different time zones.
Background information:
To solve this kind of mystery you need to know 2 important things about the airflow scheduler, especially about the start_date parameter.
- if you set a start_date in the past with a specific time interval, Airflow will try to fill up and execute all missing runs immediately after you activate the dag
- “Note that Airflow simply looks at the latest
execution_date
and adds theschedule_interval
to determine the nextexecution_date
” – taken from the official docs. If there is no last execution_date (because it is the first run) it will run at the specified start_date instead - optional background information: start date just defines the starting time interval, on when the dag should be executed
Now let’s put the pieces together:
- we deployed the dags, and set the start_date at a point in time in the past
- we activated all the above mentioned dags apparently at around 12:02
- Airflow realizes, that it has been missing some runs and executes the dag at 12:02
- now Airflow calculates the next execution_date which would be 12:02 on the next day – for all the dags! (disregarding the start time of 11.00 /12:00 / 13:00)
So there you have it. We actually figured it out just minutes before stumbling upon the blog post Apache Airflow Gotchas which mentions the execution_date calculation and implicitely hints to the understanding of our problem at hand
Learnings:
- a very obvious best practice is, to create a job chain, which controls and triggers the dags to ensure their dependency
- when deploying a new dag, the start_date should be set to a future point in time to ensure, that execution of the first dag takes place at the specified time
Story time:
Part of why we were so confused was, that we had two deployments with that same issue. One of which would start all dags at around 12:00 pm and the other one at around 03:00 am (!). Those hours were just too wild and arbitrary to have any pattern. After actually understanding what is going on, we were seriously confused on who or how a deployment at 3 am took place. Turns out it was I who deployed it on a Wednesday night to our dev system …
If you found this post to be helpful or encountered a similiar problem, feel free to leave a comment behind 🙂