Although the Spark connection is configured correctly, Airflow fails to retrieve the final driver status after a Spark submit completes. This issue arises because, in standalone mode, Spark exposes two different endpoints, while the `SparkSubmitOperator` in Airflow only supports a single `conn_id`.
System Configuration
This blog post is based on the following setup:
- Airflow version 2.10.5
- Spark version 3.5.1 deployed in a master-worker setup using Docker
- Spark is running in standalone cluster mode
- The Airflow connection is of type `Spark`, with the deploy mode set to `cluster` and the port set to `7077`
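For reference, a minimal sketch of how such a connection could be defined programmatically is shown below. The `conn_id` and host are placeholders (in our setup the connection was created through the Airflow UI), and the `deploy-mode` extra is the key the Spark provider hook reads.

```python
# Sketch of the connection described above; conn_id and host are placeholders.
from airflow.models import Connection

spark_conn = Connection(
    conn_id="spark_default",
    conn_type="spark",
    host="spark://spark-master",         # master URL, without the port
    port=7077,                           # standalone submission port
    extra='{"deploy-mode": "cluster"}',  # read by the Spark provider hook
)
print(spark_conn.conn_id, spark_conn.host, spark_conn.port)
```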
While the Spark JDBC hook can also be used to store connection information related to a Spark submit, it doesn’t manage or monitor the job lifecycle. Specifically, it cannot remotely kill the application or retrieve the final driver status. What’s especially frustrating is that Airflow may log a driver status of `FAILED`, yet still mark the task as green (successful).
Problem description
When Airflow’s log level is set to `DEBUG`, we can observe that after a `spark-submit`, Airflow attempts to poll the driver status from the Spark cluster. However, this polling fails and exits with an error.
```
{spark_submit.py:516} DEBUG - Poll driver status cmd: ['spark-submit', '--master', 'spark://zmi-spark-dev.test.med.tu-dresden.de:7077', '--status', 'driver-20250213113217-0020']
{spark_submit.py:664} DEBUG - spark driver status log: 25/02/13 11:32:47 WARN RestSubmissionClient: Unable to connect to server spark://zmi-spark-dev.test.med.tu-dresden.de:7077.
{spark_submit.py:664} DEBUG - spark driver status log: Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server
```
Eventually, this results in the following exception:
```
packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 728, in _start_driver_status_tracking
    raise AirflowException(
airflow.exceptions.AirflowException: Failed to poll for the driver status 10 times: returncode = 1
```
Running the polling command manually produces the same error, confirming that the request itself is the problem rather than anything specific to Airflow.
Digging into the Airflow code
Airflow uses the following (roughly sketched out) command to poll the driver status, defined in the Spark hook source code:

```python
spark_host = self._connection["master"]

# Via the spark-submit binary
connection_cmd = ["spark-submit", "--master", spark_host, "--status", self._driver_id]

# Or via REST (depending on configuration)
connection_cmd = [
    "/usr/bin/curl",
    "--max-time",
    str(curl_max_wait_time),
    f"{spark_host}/v1/submissions/status/{self._driver_id}",
]
```
Whether or not polling is triggered depends on two conditions, as sketched below:
- The deploy mode is set to `cluster` in the Airflow connection
- The `master` URL starts with `spark://`
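In the provider’s hook, this check boils down to roughly the following. Treat the method name and details as a paraphrase; they may differ slightly between provider versions.

```python
# Paraphrased from the provider hook: polling only happens when both conditions hold.
def _resolve_should_track_driver_status(self) -> bool:
    return (
        "spark://" in self._connection["master"]
        and self._connection["deploy_mode"] == "cluster"
    )
```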
Now here’s the crucial part: Spark exposes multiple endpoints in standalone mode:
- Port `7077` is the master’s main endpoint and is used to submit applications.
- Port `6066` provides a REST endpoint and can additionally handle status polling.
As shown above, Airflow constructs the polling logic using `self._connection["master"]`. This `_connection` attribute is created based on the `conn_id` passed into the `SparkSubmitOperator` when instantiating the object, as defined here.
This creates a limitation: Spark in standalone mode relies on two separate endpoints, but the `SparkSubmitOperator` only supports a single `conn_id`. As a result, you cannot configure both the submission and the polling URL through standard means. For driver status polling to work correctly, the endpoint must point to port `6066`, not `7077`.
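To illustrate, this is roughly what a working status poll against the REST endpoint looks like. The host and driver id are simply taken from the log excerpt above, and the `requests` call stands in for the curl command Airflow would issue.

```python
# Illustrative only: host and driver id come from the log excerpt above.
import requests

resp = requests.get(
    "http://zmi-spark-dev.test.med.tu-dresden.de:6066"
    "/v1/submissions/status/driver-20250213113217-0020",
    timeout=30,
)
# The standalone REST API answers with JSON that includes the driver state,
# e.g. RUNNING, FINISHED or FAILED.
print(resp.json().get("driverState"))
```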
Solution
There’s already an open GitHub issue discussing this problem, and several users have resolved it by creating a custom operator.
But what does that actually involve?
By default, Airflow assigns the Spark host like this:

```python
spark_host = self._connection["master"]
```

(Defined in this line in the Spark hook code.)
To work around this, we created a custom operator by subclassing `SparkSubmitOperator` and overriding the relevant method¹. In our version, we explicitly assign `spark_host` to a separate URL that points to port `6066`, but only at the point where driver status polling occurs.
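A minimal sketch of this idea is shown below. It is not our exact implementation: the method names `_build_track_driver_status_command` and `_get_hook`, as well as the class-swap shortcut, are assumptions about the provider’s internals and may need adjusting for your provider version.

```python
from __future__ import annotations

from airflow.providers.apache.spark.hooks.spark_submit import SparkSubmitHook
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


class SparkRestStatusHook(SparkSubmitHook):
    """Hook variant that polls the driver status via the REST port (6066)."""

    def _build_track_driver_status_command(self) -> list[str]:
        # Temporarily rewrite the master URL so the parent implementation builds
        # its curl/REST polling command against port 6066 instead of 7077.
        original_master = self._connection["master"]
        self._connection["master"] = original_master.replace(":7077", ":6066")
        try:
            return super()._build_track_driver_status_command()
        finally:
            self._connection["master"] = original_master


class SparkSubmitRestStatusOperator(SparkSubmitOperator):
    """Submits via port 7077 as usual, but polls the driver status via 6066."""

    def _get_hook(self) -> SparkSubmitHook:
        hook = super()._get_hook()
        # Rebind the hook to our subclass so that only the status-tracking
        # command changes; avoids re-passing the hook's long constructor args.
        hook.__class__ = SparkRestStatusHook
        return hook
```

In a DAG, `SparkSubmitRestStatusOperator` is then used exactly like the stock `SparkSubmitOperator`, with the same `conn_id` still pointing at port `7077` for the actual submit.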
Why Not Just Use Port 6066 in the Airflow Connection?
You might wonder: why not just set the Airflow `conn_id` to point to the `6066` REST endpoint?
The reason is that the `_build_spark_submit_command` function used by the `SparkSubmitOperator` (source) constructs the submission command with the `spark-submit` binary. By default, `DEFAULT_SPARK_BINARY` is set to `"spark-submit"`, which I believe cannot handle the REST endpoint at port `6066`.
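Paraphrased and heavily simplified (this is not the provider’s actual code), the submission command is assembled along these lines, which is why a master URL pointing at the REST port would not be understood:

```python
# Simplified paraphrase: the submit command always starts with the spark-submit
# binary and passes the configured master URL verbatim to it.
def build_spark_submit_command(master: str, application: str) -> list[str]:
    cmd = ["spark-submit", "--master", master, "--deploy-mode", "cluster"]
    # ... conf values, files, jars and application arguments follow in the real hook ...
    cmd += [application]
    return cmd
```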
While both ports `7077` and `6066` are technically capable of handling Spark submits (discussion), they require different handling: `7077` is used with the traditional `spark-submit` binary interface, whereas `6066` expects interaction through the REST API.
If this post helped you—or if you have additional insights or suggestions—feel free to leave a comment!
- Special acknowledgment to my team lead, J.H., for identifying the correct API and finding a way to implement the subclass. ↩︎