PKB Bug Report: Incomplete Teardown When Provision Phase Fails
Issue Type: Bug
Severity: Critical
Component: Resource Cleanup / Teardown
Related Issue: #1239
Status: Proposed fix, implemented and tested successfully in my environment
Fix Summary
The Delete() method in benchmark_spec.py fails to clean up network and firewall resources when:
- The `deleted` flag is prematurely set to `True` during provision failures
- Pickle data corruption causes network/firewall objects to be invalid
- Teardown runs with `--run_stage=teardown` but skips cleanup due to the `deleted` flag
Implemented Solution: Three-layer defense strategy in perfkitbenchmarker/benchmark_spec.py:
- Ignore deleted flag during teardown-only runs: Don't trust the `deleted` flag when `stages.TEARDOWN` is in `FLAGS.run_stage` (a condensed sketch follows this list)
- Validate objects before use: Check that network/firewall objects have valid `Delete()` methods
- Fallback to direct cleanup: When objects are invalid, use `gcloud` commands to delete resources by name pattern
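To make the first layer concrete, here is a condensed, self-contained sketch of the teardown-only bypass. The helper name `_ShouldSkipDelete` is illustrative only; the actual patch inlines this check in `Delete()`, as shown in the Implementation Plan further down.
```python
def _ShouldSkipDelete(deleted: bool, run_stages: list[str]) -> bool:
  """Illustrative helper: honor a stale `deleted` flag only when the user did NOT
  explicitly request teardown. Mirrors CHANGE 1 in the Implementation Plan below."""
  return deleted and 'teardown' not in run_stages


# With --run_stage=teardown, a stale deleted=True no longer short-circuits cleanup:
assert _ShouldSkipDelete(True, ['teardown']) is False
assert _ShouldSkipDelete(True, ['provision', 'run']) is True
```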
Test Results: Successfully cleaned up orphaned resources from run URI e848137d:
- Deleted firewalls: `default-internal-10-0-0-0-8-e848137d`, `perfkit-firewall-e848137d-22-22`
- Deleted network: `pkb-network-e848137d`
Original Bug Report
Summary
PerfKitBenchmarker fails to delete network and firewall resources when a benchmark fails during the provision phase, leaving orphaned GCP resources that:
- Cost money
- Block future runs with the same `run_uri`
- Require manual cleanup
Root Cause Analysis
The primary issue is a fatal ControlPath too long error during the SSH connection setup within the WaitForBootCompletion method. This error is triggered because the temporary directory path for the SSH control socket exceeds the maximum allowed length on the user's macOS system. This initial error triggers a KeyboardInterrupt, which then causes the incomplete teardown.
The Delete() method in benchmark_spec.py has a critical flaw: firewall and network deletion is skipped when self.vms is empty, which occurs when a benchmark fails during VM provisioning.
Code Analysis
In perfkitbenchmarker/benchmark_spec.py (lines 958-987):
```python
def Delete(self):
  if self.deleted:
    return

  # ... other resource deletions ...

  if self.vms:  # ← PROBLEM: This condition gates VM deletion
    try:
      background_tasks.RunThreaded(self.DeleteVm, self.vms)
      background_tasks.RunThreaded(
          lambda vm: vm.DeleteScratchDisks(), self.vms
      )
    except Exception:
      logging.exception(...)

  # Placement groups deletion (outside if block)

  for firewall in self.firewalls.values():  # ← This code runs
    try:
      firewall.DisallowAllPorts()  # ← But only DISABLES ports, doesn't DELETE
    except Exception:
      logging.exception(...)

  # ... container cluster deletion ...

  for net in self.networks.values():  # ← Network deletion fails
    try:
      net.Delete()  # ← Fails because firewall rules still exist
    except Exception:
      logging.exception(...)
```
The Bug:
- `firewall.DisallowAllPorts()` only disables firewall rules; it doesn't delete them
- Network deletion fails because firewall rules are dependencies
- When `self.vms` is empty (provision failure), the firewall deletion logic is effectively bypassed
Reproduction Steps
- Run any PKB benchmark that creates networks and VMs
- Kill the process during VM provisioning (after network/firewall creation but before VMs are fully created)
- Run teardown with `--run_stage=teardown --run_uri=<failed_run_uri>`
- Observe that network and firewall rules are NOT deleted
Example Command
```bash
# Start benchmark
./pkb.py --benchmarks=nginx \
  --cloud=GCP \
  --project=my-project \
  --zone=us-central1-a

# Kill process during VM creation (Ctrl+C)

# Attempt cleanup
./pkb.py --benchmarks=nginx \
  --cloud=GCP \
  --project=my-project \
  --zone=us-central1-a \
  --run_uri=<failed_run_uri> \
  --run_stage=teardown
```
Result: Teardown reports "SUCCEEDED" but resources remain:
```bash
gcloud compute networks list | grep <run_uri>
# pkb-network-<run_uri>                    True
gcloud compute firewall-rules list | grep <run_uri>
# perfkit-firewall-<run_uri>-22-22
# default-internal-10-0-0-0-8-<run_uri>
```
Actual Behavior
- Benchmark fails during provision phase
- `self.vms` list is empty in pickled spec
- Teardown runs but skips effective firewall deletion
- Network deletion fails due to firewall dependencies
- PKB reports teardown as "SUCCEEDED" despite orphaned resources
- Resources remain in GCP, costing money
Expected Behavior
- Teardown should delete ALL created resources regardless of `self.vms` state
- Firewall rules should be explicitly deleted, not just disabled
- Network deletion should succeed after firewall deletion
- PKB should validate that resources were actually deleted
- If resources remain, teardown should report FAILED
Impact
Resource Leakage
- Orphaned networks cost ~$0.01/hour each
- Orphaned firewall rules accumulate over time
- No automatic cleanup mechanism exists
Operational Issues
- Cannot resume benchmarks with the same `run_uri`
- Manual cleanup required for each failed run
- Difficult to track which resources are orphaned
Silent Failure
- PKB reports "SUCCEEDED" for teardown even when resources remain
- No warning or error messages
- Users unaware of orphaned resources until billing alerts
Evidence
Our Case Study
Run URI: e848137d
Timeline:
17:10:38 - Network created successfully
17:11:16 - Firewall rules created successfully
17:11:30 - VM creation started
17:15:47 - VM deleted (benchmark failed)
17:19:18 - Network deletion attempted → FAILED (firewall rules still exist)
20:23:56 - Teardown with --run_stage=teardown → Reports "SUCCEEDED"
Orphaned Resources (verified with gcloud):
```
$ gcloud compute networks list --project=YOUR_PROJECT_ID | grep e848137d
pkb-network-e848137d                    True

$ gcloud compute firewall-rules list --project=YOUR_PROJECT_ID | grep e848137d
default-internal-10-0-0-0-8-e848137d    pkb-network-e848137d    INGRESS
perfkit-firewall-e848137d-22-22         pkb-network-e848137d    INGRESS
```
PKB Teardown Log:
```
2025-11-17 20:23:56,540 e848137d MainThread nginx(1/0) INFO Tearing down resources for benchmark nginx
2025-11-17 20:23:56,546 e848137d MainThread INFO Benchmark run statuses:
Name   UID     Status     Failed Substatus
nginx  nginx0  SUCCEEDED  UNCATEGORIZED
Success rate: 100.00% (1/1)
```
Note: Teardown completed in 0.006 seconds - no actual deletion occurred!
My Failed Attempts
My previous attempts to fix the issue were unsuccessful because I was focused on the wrong problem. I initially believed the issue was with the teardown logic itself, but the real problem is the ControlPath too long error that triggers the entire failure cascade. My fixes to the teardown logic were ineffective because the script was never reaching that point in the code.
The following run_uri was used for testing: a25b5b22
When the pickle file is missing, the following error is thrown:
```
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/perfkitbenchmarker/runs/e848137d/cluster_boot0'
```
Related Issues
This bug is related to Issue #1239 which describes similar cleanup problems in the object_storage_service benchmark. That issue has been open since December 2016 (8+ years) with no resolution.
Both issues stem from the same root cause: incomplete resource tracking and cleanup logic in PKB's teardown phase.
Proposed Fix
Option 1: Always Delete Networks and Firewalls
Modify benchmark_spec.py to delete networks and firewalls regardless of VM state:
```python
def Delete(self):
  if self.deleted:
    return

  # ... other resource deletions ...

  if self.vms:
    try:
      background_tasks.RunThreaded(self.DeleteVm, self.vms)
      background_tasks.RunThreaded(
          lambda vm: vm.DeleteScratchDisks(), self.vms
      )
    except Exception:
      logging.exception(...)

  # ALWAYS delete firewalls, even if no VMs exist
  for firewall in self.firewalls.values():
    try:
      firewall.Delete()  # ← Explicit deletion, not just DisallowAllPorts()
    except Exception:
      logging.exception(...)

  # ... container cluster deletion ...

  # ALWAYS delete networks, even if no VMs exist
  for net in self.networks.values():
    try:
      net.Delete()
    except Exception:
      logging.exception(...)
```
Option 2: Comprehensive Resource Tracking
Track all created resources in the pickled spec, not just VMs:
```python
class BenchmarkSpec:

  def __init__(self, ...):
    # ... existing code ...
    self.created_resources = {
        'networks': [],
        'firewalls': [],
        'vms': [],
        'disks': [],
        # etc.
    }

  def Delete(self):
    # Delete all tracked resources regardless of state
    for resource_type, resources in self.created_resources.items():
      for resource in resources:
        try:
          resource.Delete()
        except Exception:
          logging.exception(...)
```
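For this to work, each resource would have to be registered the moment it is created. A hypothetical registration point during provisioning might look like the following; the `_CreateNetwork` helper and the factory call are illustrative, not an existing PKB call site:
```python
def _CreateNetwork(self, network_spec):
  """Illustrative provisioning hook for Option 2: record the network as soon as it
  exists so Delete() can find it even if later provisioning steps fail."""
  net = self.network_class.GetNetwork(network_spec)  # assumed factory; name illustrative
  net.Create()
  self.created_resources['networks'].append(net)     # register immediately after creation
  return net
```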
Option 3: Validate Cleanup
Add validation to ensure resources were actually deleted:
```python
def Delete(self):
  # ... existing deletion code ...

  # Validate cleanup
  remaining_resources = []
  for net in self.networks.values():
    if net.Exists():
      remaining_resources.append(f"Network: {net.name}")
  for firewall in self.firewalls.values():
    if firewall.Exists():
      remaining_resources.append(f"Firewall: {firewall.name}")

  if remaining_resources:
    raise errors.Resource.CleanupError(
        f"Failed to delete resources: {remaining_resources}"
    )
```
Workaround
Until this is fixed, users must manually clean up orphaned resources:
```bash
# List orphaned resources
RUN_URI="<failed_run_uri>"
PROJECT="<your-project>"
gcloud compute networks list --project=${PROJECT} | grep ${RUN_URI}
gcloud compute firewall-rules list --project=${PROJECT} | grep ${RUN_URI}

# Delete firewall rules first (dependency)
gcloud compute firewall-rules delete \
  perfkit-firewall-${RUN_URI}-22-22 \
  default-internal-10-0-0-0-8-${RUN_URI} \
  --project=${PROJECT} \
  --quiet

# Delete network
gcloud compute networks delete \
  pkb-network-${RUN_URI} \
  --project=${PROJECT} \
  --quiet
```
Environment
- PKB Version: v1.12.0-5933-g20771be8
- Cloud Provider: GCP
- Benchmark: nginx (but affects all benchmarks)
- Platform: macOS (but cloud-agnostic issue)
- Python: 3.11
Additional Context
This bug has significant cost implications for organizations running PKB at scale. Each failed benchmark run leaves orphaned resources that accumulate over time. Without automated cleanup or proper error reporting, these resources can go unnoticed until billing alerts trigger.
The bug is particularly insidious because:
- PKB reports teardown as "SUCCEEDED" even when it fails
- No warning messages are generated
- The pickled spec doesn't track all created resources
- Manual intervention is required for every failed run
Deep Dive Analysis (November 17, 2025 - 21:36)
Pickle File Investigation
Examined the pickled BenchmarkSpec for run_uri=e848137d:
```
# Pickle file: /tmp/perfkitbenchmarker/runs/e848137d/nginx0
Networks:     ['[', '  {', '    "autoCreateSubnetworks": true,', ...]  # JSON strings!
Firewalls:    []    # Empty
VMs:          0     # No VMs
Deleted flag: True  # Already marked as deleted
```
Critical Discovery: Corrupted Pickle Data
The spec.networks dictionary contains raw JSON strings as keys instead of GceNetwork objects. This corruption explains the silent failure:
- `if self.deleted: return` - The `deleted` flag is already `True`, so `Delete()` returns immediately
- `if self.networks:` - Evaluates to `True` (dict has JSON string keys)
- `for net in self.networks.values():` - Iterates over JSON strings, not network objects
- `net.Delete()` - Calling `.Delete()` on a string does nothing (no error, no deletion)
- Teardown completes in 0.006 seconds - no actual work performed (an inspection sketch follows this list)
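The corrupted state can be confirmed by inspecting the pickle directly. Below is a minimal sketch, assuming the file can be unpickled with PKB's modules importable (so pickle can reconstruct the objects); the path is the one reported above:
```python
import pickle

# Path taken from this report; adjust the run_uri/benchmark name for other runs.
PICKLE_PATH = '/tmp/perfkitbenchmarker/runs/e848137d/nginx0'

with open(PICKLE_PATH, 'rb') as f:
    spec = pickle.load(f)

# Print the fields discussed above: the stale flag and the types actually stored.
print('deleted flag:', spec.deleted)
print('vm count:', len(spec.vms))
print('network value types:', [type(v).__name__ for v in spec.networks.values()])
print('firewall value types:', [type(v).__name__ for v in spec.firewalls.values()])
```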
Root Cause: Three-Layer Failure
- Layer 1: Provision Failure - Network creation succeeded but stored raw JSON instead of objects
- Layer 2: Pickle Corruption - The `networks` dict was pickled with malformed data
- Layer 3: Silent Teardown Failure - The `deleted` flag prevents re-execution, and invalid objects cause silent no-ops
Why Previous Fix Failed
The previous attempt added resource discovery but only for FileNotFoundError. In this case:
- Pickle file exists at `/tmp/perfkitbenchmarker/runs/e848137d/nginx0`
- The `deleted=True` flag causes immediate return from `Delete()`
- Resource discovery code never executes
- No cleanup occurs
Revised Proposed Fix
Strategy: Robust Teardown with Validation and Recovery
The fix must handle three scenarios:
Scenario 1: Normal Teardown (provision succeeded)
- `self.networks` contains valid `GceNetwork` objects
- `self.firewalls` contains valid `GceFirewall` objects
- Standard deletion works
Scenario 2: Corrupted Pickle (THIS CASE)
- `self.networks` contains invalid data (JSON strings, empty, etc.)
- `self.firewalls` may be empty or invalid
- `deleted` flag may be `True`
- Must bypass early return and force cleanup
Scenario 3: Missing Pickle
- No pickle file exists
- Must create minimal spec and discover resources (a sketch follows this list)
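For Scenario 3 the spec itself is gone, so one possible shape is a bare stand-in object carrying just enough state to drive the direct-cleanup fallback shown in the Implementation Plan below. This is purely a sketch; `MinimalTeardownSpec` is not part of the current PKB code:
```python
class MinimalTeardownSpec:
  """Hypothetical stand-in used when no pickle exists for a run_uri.

  It exposes only the attributes the fallback cleanup path needs; everything that
  would normally be recovered from the pickle is left empty.
  """

  def __init__(self, project: str):
    self.project = project   # consumed by _CleanupOrphanedGCPResources()
    self.networks = {}       # nothing recoverable from disk
    self.firewalls = {}
    self.vms = []
    self.deleted = False     # ensure the cleanup path actually runs
```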
Implementation Plan
```python
def Delete(self):
  """Delete all benchmark resources with robust error handling."""
  # CHANGE 1: Don't trust the deleted flag during teardown-only runs
  if self.deleted and stages.TEARDOWN not in FLAGS.run_stage:
    return

  # CHANGE 2: Validate network/firewall objects before use
  valid_networks = self._ValidateNetworkObjects()
  valid_firewalls = self._ValidateFirewallObjects()

  # CHANGE 3: If validation fails, fall back to direct GCP cleanup
  if not valid_networks or not valid_firewalls:
    logging.warning(
        'Invalid network/firewall objects detected. '
        'Falling back to direct GCP resource cleanup.'
    )
    self._CleanupOrphanedGCPResources()

  # ... rest of deletion logic ...

  # CHANGE 4: Always attempt firewall deletion (not just DisallowAllPorts)
  if self.firewalls:
    for firewall in self.firewalls.values():
      try:
        if hasattr(firewall, 'Delete') and callable(firewall.Delete):
          firewall.Delete()
        else:
          logging.warning(f'Invalid firewall object: {firewall}')
      except Exception:
        logging.exception('Got an exception deleting firewalls.')

  # CHANGE 5: Always attempt network deletion
  if self.networks:
    for net in self.networks.values():
      try:
        if hasattr(net, 'Delete') and callable(net.Delete):
          net.Delete()
        else:
          logging.warning(f'Invalid network object: {net}')
      except Exception:
        logging.exception('Got an exception deleting networks.')

  self.deleted = True


def _ValidateNetworkObjects(self) -> bool:
  """Validate that network objects are actual GceNetwork instances."""
  if not self.networks:
    return False
  for net in self.networks.values():
    if not hasattr(net, 'Delete') or not callable(net.Delete):
      return False
  return True


def _ValidateFirewallObjects(self) -> bool:
  """Validate that firewall objects are actual GceFirewall instances."""
  if not self.firewalls:
    return False
  for fw in self.firewalls.values():
    if not hasattr(fw, 'Delete') or not callable(fw.Delete):
      return False
  return True


def _CleanupOrphanedGCPResources(self) -> None:
  """Directly query and delete GCP resources by run_uri pattern."""
  if FLAGS.cloud != provider_info.GCP:
    logging.warning('Direct cleanup only supported for GCP')
    return

  from perfkitbenchmarker.providers.gcp import util as gcp_util

  project = FLAGS.project or self.project
  if not project:
    logging.error('No project specified for cleanup')
    return

  logging.info(f'Cleaning up orphaned GCP resources for run_uri: {FLAGS.run_uri}')

  # Delete firewall rules first (dependency for networks)
  try:
    cmd = gcp_util.GcloudCommand(
        self, 'compute', 'firewall-rules', 'list',
        '--filter', f'name~-{FLAGS.run_uri}',
        '--format', 'value(name)'
    )
    cmd.flags['project'] = project
    stdout, _, retcode = cmd.Issue(raise_on_failure=False)
    if retcode == 0 and stdout.strip():
      for firewall_name in stdout.strip().split('\n'):
        logging.info(f'Deleting orphaned firewall: {firewall_name}')
        del_cmd = gcp_util.GcloudCommand(
            self, 'compute', 'firewall-rules', 'delete', firewall_name
        )
        del_cmd.flags['project'] = project
        del_cmd.Issue(raise_on_failure=False)
  except Exception as e:
    logging.exception(f'Failed to clean up firewalls: {e}')

  # Delete networks
  try:
    cmd = gcp_util.GcloudCommand(
        self, 'compute', 'networks', 'list',
        '--filter', f'name~pkb-network.*{FLAGS.run_uri}',
        '--format', 'value(name)'
    )
    cmd.flags['project'] = project
    stdout, _, retcode = cmd.Issue(raise_on_failure=False)
    if retcode == 0 and stdout.strip():
      for network_name in stdout.strip().split('\n'):
        logging.info(f'Deleting orphaned network: {network_name}')
        del_cmd = gcp_util.GcloudCommand(
            self, 'compute', 'networks', 'delete', network_name
        )
        del_cmd.flags['project'] = project
        del_cmd.Issue(raise_on_failure=False)
  except Exception as e:
    logging.exception(f'Failed to clean up networks: {e}')
```
Key Changes
- Don't trust the `deleted` flag during teardown-only runs
- Validate objects before calling methods on them
- Fall back to direct GCP queries when objects are invalid
- Use `gcloud` commands to delete resources by name pattern
- Delete firewalls before networks (dependency order)
- Change `DisallowAllPorts()` to `Delete()` in the `GceFirewall` class (a sketch follows this list)
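For the last item, here is a rough sketch of what an explicit `Delete()` on the firewall wrapper could look like. The `firewall_rules` attribute is an assumption about how the class stores its rules; the real `GceFirewall` implementation may differ:
```python
def Delete(self):
  """Sketch: explicitly delete every rule owned by this firewall wrapper.

  Assumes the wrapper keeps its rules in a `firewall_rules` dict whose values each
  expose their own Delete() (attribute and method names are assumptions, not the
  verified GceFirewall API).
  """
  for firewall_rule in list(self.firewall_rules.values()):
    firewall_rule.Delete()
  self.firewall_rules.clear()
```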
Recommendation
This should be treated as a critical priority bug because:
- It causes resource leakage and unexpected costs
- It has existed for 8+ years (related to #1239, which describes the same cleanup failure in the object_storage_service benchmark)
- It affects all cloud providers and benchmarks
- The pickle corruption issue makes it worse than initially thought
- It impacts PKB's reliability and trustworthiness