
TFX Tuner on Vertex AI fails with DEADLINE_EXCEEDED, no custom job launched #7775

@Wolff-Lucas

Description

System information

  • Have I specified the code to reproduce the issue: Yes (pipeline with TFX Tuner on Vertex AI)
  • Environment in which the code is executed: Google Cloud Vertex AI Pipelines
  • TensorFlow version: 2.16
  • TFX Version: 1.16.0
  • Python version: 3.10

Describe the current behavior

When running a pipeline with the TFX Tuner component on Vertex AI Pipelines, the tuning step hangs and no Vertex AI custom job or hyperparameter tuning study is ever launched.

Execution stalls immediately after tuner_fn returns:

return TunerFnResult(tuner=tuner, fit_kwargs=fit_kwargs)
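
For context, the module's tuner_fn follows the shape of the guide's example. Below is a minimal sketch only: RandomSearch, the _build_model hypermodel, and the _input_fn data helper are illustrative placeholders, not the exact module used in this pipeline.

import keras_tuner
import tensorflow as tf
from tfx import v1 as tfx


def _build_model(hp: keras_tuner.HyperParameters) -> tf.keras.Model:
  # Placeholder hypermodel with a single tunable layer width.
  model = tf.keras.Sequential([
      tf.keras.layers.Dense(hp.Int('units', 32, 256, step=32), activation='relu'),
      tf.keras.layers.Dense(3, activation='softmax'),
  ])
  model.compile(optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['sparse_categorical_accuracy'])
  return model


def tuner_fn(fn_args: tfx.components.FnArgs) -> tfx.components.TunerFnResult:
  tuner = keras_tuner.RandomSearch(
      _build_model,
      objective='val_sparse_categorical_accuracy',
      max_trials=10,
      directory=fn_args.working_dir,
      project_name='tuning')
  # _input_fn stands in for the module's own tf.data loading helper.
  train_dataset = _input_fn(fn_args.train_files, fn_args.data_accessor)
  eval_dataset = _input_fn(fn_args.eval_files, fn_args.data_accessor)
  # Execution stalls right after this return when orchestrated on Vertex AI.
  return tfx.components.TunerFnResult(
      tuner=tuner,
      fit_kwargs={
          'x': train_dataset,
          'validation_data': eval_dataset,
          'steps_per_epoch': fn_args.train_steps,
          'validation_steps': fn_args.eval_steps,
      })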

I have tested with AI Vizier, Cloud Tuner, RandomSearch, and all the options listed in the [TFX Tuner documentation](https://www.tensorflow.org/tfx/guide/tuner).
The result is always the same: the job eventually fails with a grpc._channel._InactiveRpcError (StatusCode.DEADLINE_EXCEEDED).

As a workaround, I had to force the Keras tuning to run locally (so results are written to /tmp) and then manually copy the results to GCS afterwards; a sketch follows.
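
Roughly, the workaround looks like this (a sketch only; the /tmp location and the GCS bucket are placeholders): point the Keras Tuner directory at local disk, let the search run in-process, then mirror the results to GCS with tf.io.gfile.

import os
import tensorflow as tf


def copy_dir_to_gcs(local_dir: str, gcs_dir: str) -> None:
  """Recursively mirrors a local tuning directory (e.g. /tmp/tuning) to a GCS prefix."""
  for root, _, files in tf.io.gfile.walk(local_dir):
    rel = os.path.relpath(root, local_dir)
    dest = gcs_dir if rel == '.' else os.path.join(gcs_dir, rel)
    tf.io.gfile.makedirs(dest)
    for name in files:
      tf.io.gfile.copy(os.path.join(root, name),
                       os.path.join(dest, name),
                       overwrite=True)


# After the local tuner.search(...) finishes, e.g.:
# copy_dir_to_gcs('/tmp/tuning', 'gs://my-bucket/tuning')  # bucket name is a placeholder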


Describe the expected behavior

The Tuner component should successfully launch a Vertex AI hyperparameter tuning job (CustomJob/Study) and report the results back to the pipeline, instead of failing with a timeout.


Standalone code to reproduce the issue

A minimal pipeline containing a Tuner component configured to run on Vertex AI, with a tuner_fn similar to the official TFX [Tuner example](https://www.tensorflow.org/tfx/guide/tuner):

from tfx.components import Tuner
from tfx.proto import trainer_pb2

# module_file points to the module that defines tuner_fn.
tuner = Tuner(
    module_file=module_file,
    examples=example_gen.outputs['examples'],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),
    eval_args=trainer_pb2.EvalArgs(num_steps=200),
)

The component runs fine locally, but fails when orchestrated on Vertex AI Pipelines.
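
For completeness, when targeting Vertex AI the Tuner is swapped for the Google Cloud extension's Tuner and configured roughly as in the linked guide. This is only a sketch assuming the tfx.v1 public API: the project, region, machine, and image values are placeholders, and the exact custom_config key names (ENABLE_VERTEX_KEY, VERTEX_REGION_KEY, experimental.TUNING_ARGS_KEY) may differ between TFX versions.

from tfx import v1 as tfx
from tfx.proto import trainer_pb2, tuner_pb2

_gcp = tfx.extensions.google_cloud_ai_platform

tuner = _gcp.Tuner(
    module_file=module_file,  # same tuner_fn module as above
    examples=example_gen.outputs['examples'],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),
    eval_args=trainer_pb2.EvalArgs(num_steps=200),
    tune_args=tuner_pb2.TuneArgs(num_parallel_trials=3),
    custom_config={
        _gcp.ENABLE_VERTEX_KEY: True,           # run tuning as a Vertex AI job
        _gcp.VERTEX_REGION_KEY: 'us-central1',  # placeholder region
        _gcp.experimental.TUNING_ARGS_KEY: {
            'project': 'my-gcp-project',        # placeholder project
            'job_spec': {
                'worker_pool_specs': [{
                    'machine_spec': {'machine_type': 'n1-standard-4'},
                    'replica_count': 1,
                    'container_spec': {'image_uri': 'gcr.io/my-gcp-project/tfx-image'},  # placeholder
                }],
            },
        },
    })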


Other info / logs

Full traceback from Vertex logs:

Traceback (most recent call last):
  ...
  File "/opt/conda/lib/python3.10/site-packages/keras_tuner/src/distribute/oracle_client.py", line 81, in create_trial
    response = self.stub.CreateTrial(
  ...
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
  status = StatusCode.DEADLINE_EXCEEDED
  details = "Deadline Exceeded"
  debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2025-09-12T08:10:49.033761848+00:00"}"
