Although the Spark connection is configured correctly, Airflow fails to retrieve the final driver status after a Spark submit completes. This issue arises because, in standalone mode, Spark exposes two different endpoints, while the `SparkSubmitOperator` in Airflow only supports a single `conn_id`.
System Configuration
This blog post is based on the following setup:
- Airflow version 2.10.5
- Spark version 3.5.1 deployed in a master-worker setup using Docker
- Spark is running in standalone cluster mode
- The Airflow connection is of type `Spark`, with the deploy mode set to `cluster` and the port set to `7077`
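For reference, a minimal sketch of how such a connection could be defined programmatically is shown below. The `conn_id` and host are placeholders (in our setup the connection was created through the Airflow UI), and the `deploy-mode` extra is the key the Spark provider hook reads.

```python
# Sketch of the connection described above; conn_id and host are placeholders.
from airflow.models import Connection

spark_conn = Connection(
    conn_id="spark_default",
    conn_type="spark",
    host="spark://spark-master",         # master URL, without the port
    port=7077,                           # standalone submission port
    extra='{"deploy-mode": "cluster"}',  # read by the Spark provider hook
)
print(spark_conn.conn_id, spark_conn.host, spark_conn.port)
```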
While the Spark JDBC hook can also be used to store connection information related to a Spark submit, it doesn’t manage or monitor the job lifecycle. Specifically, it cannot remotely kill the application or retrieve the final driver status. What’s especially frustrating is that Airflow may log a driver status of `FAILED`, yet still mark the task as green (successful).
Problem description
When Airflow’s log level is set to `DEBUG`, we can observe that after a `spark-submit`, Airflow attempts to poll the driver status from the Spark cluster. However, this polling fails and exits with an error.
```
{spark_submit.py:516} DEBUG - Poll driver status cmd: ['spark-submit', '--master', 'spark://zmi-spark-dev.test.med.tu-dresden.de:7077', '--status', 'driver-20250213113217-0020']
{spark_submit.py:664} DEBUG - spark driver status log: 25/02/13 11:32:47 WARN RestSubmissionClient: Unable to connect to server spark://zmi-spark-dev.test.med.tu-dresden.de:7077.
{spark_submit.py:664} DEBUG - spark driver status log: Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server
```
Eventually, this results in the following exception:
```
packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 728, in _start_driver_status_tracking
    raise AirflowException(
airflow.exceptions.AirflowException: Failed to poll for the driver status 10 times: returncode = 1
```
Running the polling command manually produces the same error, confirming that the request itself is the problem rather than anything specific to Airflow.
Digging into the Airflow code
Airflow uses the following (roughly sketched out) command to poll the driver status, defined in the Spark hook source code:

```python
spark_host = self._connection["master"]

# Via the spark-submit binary
connection_cmd = ["spark-submit", "--master", spark_host, "--status", self._driver_id]

# Or via REST (depending on configuration)
connection_cmd = [
    "/usr/bin/curl",
    "--max-time",
    str(curl_max_wait_time),
    f"{spark_host}/v1/submissions/status/{self._driver_id}",
]
```
Whether or not polling is triggered depends on two conditions, as sketched below:
- The deploy mode is set to `cluster` in the Airflow connection
- The `master` URL starts with `spark://`
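In the provider’s hook, this check boils down to roughly the following. Treat the method name and details as a paraphrase; they may differ slightly between provider versions.

```python
# Paraphrased from the provider hook: polling only happens when both conditions hold.
def _resolve_should_track_driver_status(self) -> bool:
    return (
        "spark://" in self._connection["master"]
        and self._connection["deploy_mode"] == "cluster"
    )
```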
Now here’s the crucial part: Spark exposes multiple endpoints in standalone mode:
- Port `7077` is the master’s main endpoint and is used to submit applications.
- Port `6066` provides a REST endpoint and can additionally handle status polling.
As shown above, Airflow constructs the polling logic using `self._connection["master"]`. This `_connection` attribute is created based on the `conn_id` passed into the `SparkSubmitOperator` when instantiating the object, as defined here.
This creates a limitation: Spark in standalone mode relies on two separate endpoints, but the `SparkSubmitOperator` only supports a single `conn_id`. As a result, you cannot configure both the submission and the polling URL through standard means. For driver status polling to work correctly, the endpoint must point to port `6066`, not `7077`.
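To illustrate, this is roughly what a working status poll against the REST endpoint looks like. The host and driver id are simply taken from the log excerpt above, and the `requests` call stands in for the curl command Airflow would issue.

```python
# Illustrative only: host and driver id come from the log excerpt above.
import requests

resp = requests.get(
    "http://zmi-spark-dev.test.med.tu-dresden.de:6066"
    "/v1/submissions/status/driver-20250213113217-0020",
    timeout=30,
)
# The standalone REST API answers with JSON that includes the driver state,
# e.g. RUNNING, FINISHED or FAILED.
print(resp.json().get("driverState"))
```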
Solution
There’s already an open GitHub issue discussing this problem, and several users have resolved it by creating a custom operator.
But what does that actually involve?
By default, Airflow assigns the Spark host like this:

```python
spark_host = self._connection["master"]
```

(Defined in this line in the Spark hook code.)
To work around this, we created a custom operator by subclassing `SparkSubmitOperator` and overriding the relevant method¹. In our version, we explicitly assign `spark_host` to a separate URL that points to port `6066`, but only at the point where driver status polling occurs.
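A minimal sketch of this idea is shown below. It is not our exact implementation: the method names `_build_track_driver_status_command` and `_get_hook`, as well as the class-swap shortcut, are assumptions about the provider’s internals and may need adjusting for your provider version.

```python
from __future__ import annotations

from airflow.providers.apache.spark.hooks.spark_submit import SparkSubmitHook
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


class SparkRestStatusHook(SparkSubmitHook):
    """Hook variant that polls the driver status via the REST port (6066)."""

    def _build_track_driver_status_command(self) -> list[str]:
        # Temporarily rewrite the master URL so the parent implementation builds
        # its curl/REST polling command against port 6066 instead of 7077.
        original_master = self._connection["master"]
        self._connection["master"] = original_master.replace(":7077", ":6066")
        try:
            return super()._build_track_driver_status_command()
        finally:
            self._connection["master"] = original_master


class SparkSubmitRestStatusOperator(SparkSubmitOperator):
    """Submits via port 7077 as usual, but polls the driver status via 6066."""

    def _get_hook(self) -> SparkSubmitHook:
        hook = super()._get_hook()
        # Rebind the hook to our subclass so that only the status-tracking
        # command changes; avoids re-passing the hook's long constructor args.
        hook.__class__ = SparkRestStatusHook
        return hook
```

In a DAG, `SparkSubmitRestStatusOperator` is then used exactly like the stock `SparkSubmitOperator`, with the same `conn_id` still pointing at port `7077` for the actual submit.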
Why Not Just Use Port 6066 in the Airflow Connection?
You might wonder: why not just set the Airflow `conn_id` to point to the `6066` REST endpoint?
The reason is that the `_build_spark_submit_command` function used by the `SparkSubmitOperator` (source) constructs the submission command with the `spark-submit` binary. By default, `DEFAULT_SPARK_BINARY` is set to `"spark-submit"`, which I believe cannot handle the REST endpoint at port `6066`.
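Paraphrased and heavily simplified (this is not the provider’s actual code), the submission command is assembled along these lines, which is why a master URL pointing at the REST port would not be understood:

```python
# Simplified paraphrase: the submit command always starts with the spark-submit
# binary and passes the configured master URL verbatim to it.
def build_spark_submit_command(master: str, application: str) -> list[str]:
    cmd = ["spark-submit", "--master", master, "--deploy-mode", "cluster"]
    # ... conf values, files, jars and application arguments follow in the real hook ...
    cmd += [application]
    return cmd
```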
While both ports `7077` and `6066` are technically capable of handling Spark submits (discussion), they require different handling: `7077` is used with the traditional `spark-submit` binary interface, whereas `6066` expects interaction through the REST API.
If this post helped you—or if you have additional insights or suggestions—feel free to leave a comment!
- Special acknowledgment to my team lead, J.H., for identifying the correct API and finding a way to implement the subclass. ↩︎