-
Notifications
You must be signed in to change notification settings - Fork 727
Description
System information
- Have I specified the code to reproduce the issue: Yes (pipeline with TFX Tuner on Vertex AI)
- Environment in which the code is executed: Google Cloud Vertex AI Pipelines
- TensorFlow version: 2.16
- TFX Version: 1.16.0
- Python version: 3.10
Describe the current behavior
When running a pipeline with the TFX Tuner component on Vertex AI, the tuning job gets stuck.
No Vertex AI custom job or study for hyperparameter tuning is launched.
It gets stuck right after returning from:
return TunerFnResult(tuner=tuner, fit_kwargs=fit_kwargs)I have tested with AI Vizier, Cloud Tuner, RandomSearch, and all options listed in the [TFX Tuner documentation](https://www.tensorflow.org/tfx/guide/tuner).
The result is always the same: the job eventually fails with a grpc._channel._InactiveRpcError caused by a Deadline Exceeded error.
As a workaround, I had to force Keras tuning to run locally (so results are written to /tmp) and then manually copy results to GCS afterwards.
Describe the expected behavior
The Tuner component should successfully launch a Vertex AI hyperparameter tuning job (CustomJob/Study) and report the results back to the pipeline, instead of failing with a timeout.
Standalone code to reproduce the issue
A minimal pipeline containing a Tuner component configured to run on Vertex AI.
(tuner_fn similar to the official TFX [Tuner example](https://www.tensorflow.org/tfx/guide/tuner)).
tuner = Tuner(
module_file=module_file,
examples=example_gen.outputs['examples'],
train_args=trainer_pb2.TrainArgs(num_steps=1000),
eval_args=trainer_pb2.EvalArgs(num_steps=200),
)The component runs fine locally, but fails when orchestrated on Vertex AI Pipelines.
Other info / logs
Full traceback from Vertex logs:
Traceback (most recent call last):
...
File "/opt/conda/lib/python3.10/site-packages/keras_tuner/src/distribute/oracle_client.py", line 81, in create_trial
response = self.stub.CreateTrial(
...
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2025-09-12T08:10:49.033761848+00:00"}"