
Bug Report: Incomplete Teardown When Provision Phase Fails #6223

@jimmycgz

Description

PKB Bug Report: Incomplete Teardown When Provision Phase Fails

Issue Type: Bug
Severity: Critical
Component: Resource Cleanup / Teardown
Related Issue: #1239
Status: Proposed fix implemented and tested successfully in my environment

Fix Summary

The Delete() method in benchmark_spec.py fails to clean up network and firewall resources when:

  1. The deleted flag is prematurely set to True during provision failures
  2. Pickle data corruption causes network/firewall objects to be invalid
  3. Teardown runs with --run_stage=teardown but skips cleanup due to the deleted flag

Implemented Solution: Three-layer defense strategy in perfkitbenchmarker/benchmark_spec.py:

  1. Ignore deleted flag during teardown-only runs: Don't trust the deleted flag when stages.TEARDOWN is in FLAGS.run_stage
  2. Validate objects before use: Check that network/firewall objects have valid Delete() methods
  3. Fallback to direct cleanup: When objects are invalid, use gcloud commands to delete resources by name pattern

Test Results: Successfully cleaned up orphaned resources from run URI e848137d:

  • Deleted firewalls: default-internal-10-0-0-0-8-e848137d, perfkit-firewall-e848137d-22-22
  • Deleted network: pkb-network-e848137d

Original Bug Report

Summary

PerfKitBenchmarker fails to delete network and firewall resources when a benchmark fails during the provision phase, leaving orphaned GCP resources that:

  1. Cost money
  2. Block future runs with the same run_uri
  3. Require manual cleanup

Root Cause Analysis

The primary issue is a fatal ControlPath too long error during SSH connection setup within the WaitForBootCompletion method, triggered because the temporary directory path for the SSH control socket exceeds the maximum socket-path length on macOS. This initial failure led to a KeyboardInterrupt, which in turn left the teardown incomplete.

The Delete() method in benchmark_spec.py has a critical flaw: when self.vms is empty, as happens when a benchmark fails during VM provisioning, firewall rules are only disabled, never deleted, so the subsequent network deletion fails on the dependency.

Code Analysis

In perfkitbenchmarker/benchmark_spec.py (lines 958-987):

def Delete(self):
    if self.deleted:
        return
    
    # ... other resource deletions ...
    
    if self.vms:  # ← PROBLEM: This condition gates VM deletion
        try:
            background_tasks.RunThreaded(self.DeleteVm, self.vms)
            background_tasks.RunThreaded(
                lambda vm: vm.DeleteScratchDisks(), self.vms
            )
        except Exception:
            logging.exception(...)
    
    # Placement groups deletion (outside if block)
    
    for firewall in self.firewalls.values():  # ← This code runs
        try:
            firewall.DisallowAllPorts()  # ← But only DISABLES ports, doesn't DELETE
        except Exception:
            logging.exception(...)
    
    # ... container cluster deletion ...
    
    for net in self.networks.values():  # ← Network deletion fails
        try:
            net.Delete()  # ← Fails because firewall rules still exist
        except Exception:
            logging.exception(...)

The Bug:

  1. firewall.DisallowAllPorts() only disables firewall rules, it doesn't delete them
  2. Network deletion fails because firewall rules are dependencies
  3. When self.vms is empty (provision failure), the firewall deletion logic is effectively bypassed
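
The dependency chain can be illustrated with a toy model. The FakeFirewall and FakeNetwork classes below are illustrative stand-ins, not PKB code: disabling rules leaves the network undeletable, while deleting them first lets the network go.

```python
class DependencyError(Exception):
    """Raised when a network is deleted while firewall rules still exist."""

class FakeNetwork:
    def __init__(self):
        self.firewalls = []
        self.exists = True
    def Delete(self):
        # GCP refuses to delete a network that still has rules attached.
        if self.firewalls:
            raise DependencyError('firewall rules still exist')
        self.exists = False

class FakeFirewall:
    def __init__(self, net):
        self.net = net
        self.enabled = True
        net.firewalls.append(self)
    def DisallowAllPorts(self):
        # Mirrors the current PKB behavior: rules are disabled, not removed.
        self.enabled = False
    def Delete(self):
        self.net.firewalls.remove(self)

# Disabling rules is not enough: network deletion still fails.
net = FakeNetwork()
fw = FakeFirewall(net)
fw.DisallowAllPorts()
try:
    net.Delete()
except DependencyError:
    pass
assert net.exists  # network is orphaned

# Deleting the rule first allows the network deletion to succeed.
fw.Delete()
net.Delete()
assert not net.exists
```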

Reproduction Steps

  1. Run any PKB benchmark that creates networks and VMs
  2. Kill the process during VM provisioning (after network/firewall creation but before VMs are fully created)
  3. Run teardown with --run_stage=teardown --run_uri=<failed_run_uri>
  4. Observe that network and firewall rules are NOT deleted

Example Command

# Start benchmark
./pkb.py --benchmarks=nginx \
  --cloud=GCP \
  --project=my-project \
  --zone=us-central1-a

# Kill process during VM creation (Ctrl+C)

# Attempt cleanup
./pkb.py --benchmarks=nginx \
  --cloud=GCP \
  --project=my-project \
  --zone=us-central1-a \
  --run_uri=<failed_run_uri> \
  --run_stage=teardown

Result: Teardown reports "SUCCEEDED" but resources remain:

gcloud compute networks list | grep <run_uri>
# pkb-network-<run_uri>  True

gcloud compute firewall-rules list | grep <run_uri>
# perfkit-firewall-<run_uri>-22-22
# default-internal-10-0-0-0-8-<run_uri>

Actual Behavior

  1. Benchmark fails during provision phase
  2. self.vms list is empty in pickled spec
  3. Teardown runs but skips effective firewall deletion
  4. Network deletion fails due to firewall dependencies
  5. PKB reports teardown as "SUCCEEDED" despite orphaned resources
  6. Resources remain in GCP, costing money

Expected Behavior

  1. Teardown should delete ALL created resources regardless of self.vms state
  2. Firewall rules should be explicitly deleted, not just disabled
  3. Network deletion should succeed after firewall deletion
  4. PKB should validate that resources were actually deleted
  5. If resources remain, teardown should report FAILED

Impact

Resource Leakage

  • Orphaned networks cost ~$0.01/hour each
  • Orphaned firewall rules accumulate over time
  • No automatic cleanup mechanism exists

Operational Issues

  • Cannot resume benchmarks with same run_uri
  • Manual cleanup required for each failed run
  • Difficult to track which resources are orphaned

Silent Failure

  • PKB reports "SUCCEEDED" for teardown even when resources remain
  • No warning or error messages
  • Users unaware of orphaned resources until billing alerts

Evidence

Our Case Study

Run URI: e848137d

Timeline:

17:10:38 - Network created successfully
17:11:16 - Firewall rules created successfully  
17:11:30 - VM creation started
17:15:47 - VM deleted (benchmark failed)
17:19:18 - Network deletion attempted → FAILED (firewall rules still exist)
20:23:56 - Teardown with --run_stage=teardown → Reports "SUCCEEDED"

Orphaned Resources (verified with gcloud):

$ gcloud compute networks list --project=YOUR_PROJECT_ID | grep e848137d
pkb-network-e848137d  True

$ gcloud compute firewall-rules list --project=YOUR_PROJECT_ID | grep e848137d
default-internal-10-0-0-0-8-e848137d  pkb-network-e848137d  INGRESS
perfkit-firewall-e848137d-22-22       pkb-network-e848137d  INGRESS

PKB Teardown Log:

2025-11-17 20:23:56,540 e848137d MainThread nginx(1/0) INFO Tearing down resources for benchmark nginx
2025-11-17 20:23:56,546 e848137d MainThread INFO Benchmark run statuses:
Name   UID     Status     Failed Substatus
nginx  nginx0  SUCCEEDED  UNCATEGORIZED   
Success rate: 100.00% (1/1)

Note: Teardown completed in 0.006 seconds - no actual deletion occurred!

My Failed Attempts

My previous attempts to fix the issue were unsuccessful because I was focused on the wrong problem. I initially believed the issue was with the teardown logic itself, but the real problem is the ControlPath too long error that triggers the entire failure cascade. My fixes to the teardown logic were ineffective because the script was never reaching that point in the code.

The following run_uri was used for testing: a25b5b22

When the pickle file is missing, the following error is thrown:

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/perfkitbenchmarker/runs/e848137d/cluster_boot0'
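
A minimal sketch of handling the missing-pickle case, assuming the caller falls back to direct resource discovery when the file is absent. The load_spec_or_none helper is hypothetical, not the actual PKB API:

```python
import os
import pickle
import tempfile

def load_spec_or_none(path):
    """Return the unpickled spec, or None when the pickle file is missing,
    so the caller can fall back to discovering resources directly."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None

# A run directory with no pickle file reproduces the FileNotFoundError case.
missing = os.path.join(tempfile.mkdtemp(), 'cluster_boot0')
spec = load_spec_or_none(missing)
assert spec is None  # caller should now discover and delete resources by name
```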

Related Issues

This bug is related to Issue #1239 which describes similar cleanup problems in the object_storage_service benchmark. That issue has been open since December 2016 (8+ years) with no resolution.

Both issues stem from the same root cause: incomplete resource tracking and cleanup logic in PKB's teardown phase.

Proposed Fix

Option 1: Always Delete Networks and Firewalls

Modify benchmark_spec.py to delete networks and firewalls regardless of VM state:

def Delete(self):
    if self.deleted:
        return
    
    # ... other resource deletions ...
    
    if self.vms:
        try:
            background_tasks.RunThreaded(self.DeleteVm, self.vms)
            background_tasks.RunThreaded(
                lambda vm: vm.DeleteScratchDisks(), self.vms
            )
        except Exception:
            logging.exception(...)
    
    # ALWAYS delete firewalls, even if no VMs exist
    for firewall in self.firewalls.values():
        try:
            firewall.Delete()  # ← Explicit deletion, not just DisallowAllPorts()
        except Exception:
            logging.exception(...)
    
    # ... container cluster deletion ...
    
    # ALWAYS delete networks, even if no VMs exist
    for net in self.networks.values():
        try:
            net.Delete()
        except Exception:
            logging.exception(...)

Option 2: Comprehensive Resource Tracking

Track all created resources in the pickled spec, not just VMs:

class BenchmarkSpec:
    def __init__(self, ...):
        # ... existing code ...
        self.created_resources = {
            'networks': [],
            'firewalls': [],
            'vms': [],
            'disks': [],
            # etc.
        }
    
    def Delete(self):
        # Delete all tracked resources regardless of state
        for resource_type, resources in self.created_resources.items():
            for resource in resources:
                try:
                    resource.Delete()
                except Exception:
                    logging.exception(...)

Option 3: Validate Cleanup

Add validation to ensure resources were actually deleted:

def Delete(self):
    # ... existing deletion code ...
    
    # Validate cleanup
    remaining_resources = []
    for net in self.networks.values():
        if net.Exists():
            remaining_resources.append(f"Network: {net.name}")
    
    for firewall in self.firewalls.values():
        if firewall.Exists():
            remaining_resources.append(f"Firewall: {firewall.name}")
    
    if remaining_resources:
        raise errors.Resource.CleanupError(
            f"Failed to delete resources: {remaining_resources}"
        )

Workaround

Until this is fixed, users must manually clean up orphaned resources:

# List orphaned resources
RUN_URI="<failed_run_uri>"
PROJECT="<your-project>"

gcloud compute networks list --project=${PROJECT} | grep ${RUN_URI}
gcloud compute firewall-rules list --project=${PROJECT} | grep ${RUN_URI}

# Delete firewall rules first (dependency)
gcloud compute firewall-rules delete \
  perfkit-firewall-${RUN_URI}-22-22 \
  default-internal-10-0-0-0-8-${RUN_URI} \
  --project=${PROJECT} \
  --quiet

# Delete network
gcloud compute networks delete \
  pkb-network-${RUN_URI} \
  --project=${PROJECT} \
  --quiet

Environment

  • PKB Version: v1.12.0-5933-g20771be8
  • Cloud Provider: GCP
  • Benchmark: nginx (but affects all benchmarks)
  • Platform: macOS (but cloud-agnostic issue)
  • Python: 3.11

Additional Context

This bug has significant cost implications for organizations running PKB at scale. Each failed benchmark run leaves orphaned resources that accumulate over time. Without automated cleanup or proper error reporting, these resources can go unnoticed until billing alerts trigger.

The bug is particularly insidious because:

  1. PKB reports teardown as "SUCCEEDED" even when it fails
  2. No warning messages are generated
  3. The pickled spec doesn't track all created resources
  4. Manual intervention is required for every failed run

Deep Dive Analysis (November 17, 2025 - 21:36)

Pickle File Investigation

Examined the pickled BenchmarkSpec for run_uri=e848137d:

# Pickle file: /tmp/perfkitbenchmarker/runs/e848137d/nginx0
Networks: ['[', '  {', '    "autoCreateSubnetworks": true,', ...]  # JSON strings!
Firewalls: []  # Empty
VMs: 0  # No VMs
Deleted flag: True  # Already marked as deleted

Critical Discovery: Corrupted Pickle Data

The spec.networks dictionary contains raw JSON strings as keys instead of GceNetwork objects. This corruption explains the silent failure:

  1. if self.deleted: return - The deleted flag is already True, so Delete() returns immediately
  2. if self.networks: - Evaluates to True (dict has JSON string keys)
  3. for net in self.networks.values(): - Iterates over JSON strings, not network objects
  4. net.Delete() - Calling .Delete() on a string raises AttributeError, which the surrounding except Exception block swallows after logging (no deletion occurs)
  5. Teardown completes in 0.006 seconds - No actual work performed
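
The silent no-op in steps 2-5 can be reproduced in isolation. The networks dict below mimics the corrupted pickle data; it is an illustrative stand-in, not the actual spec contents:

```python
import logging

# Mimic the corrupted pickle: lines of JSON text where GceNetwork objects
# should be (hypothetical stand-in for the real corrupted spec.networks).
networks = {'[': '  {', '"autoCreateSubnetworks": true,': '...'}

deleted_any = False
for net in networks.values():
    try:
        net.Delete()  # AttributeError: 'str' object has no attribute 'Delete'
        deleted_any = True
    except Exception:
        # Mirrors benchmark_spec.py: the blanket except logs and moves on.
        logging.exception('Got an exception deleting networks.')

assert not deleted_any  # nothing is deleted, yet teardown proceeds to SUCCEEDED
```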

Root Cause: Three-Layer Failure

  1. Layer 1: Provision Failure - Network creation succeeded but stored raw JSON instead of objects
  2. Layer 2: Pickle Corruption - The networks dict was pickled with malformed data
  3. Layer 3: Silent Teardown Failure - The deleted flag prevents re-execution, and invalid objects cause silent no-ops

Why Previous Fix Failed

The previous attempt added resource discovery but only for FileNotFoundError. In this case:

  • Pickle file exists at /tmp/perfkitbenchmarker/runs/e848137d/nginx0
  • The deleted=True flag causes immediate return from Delete()
  • Resource discovery code never executes
  • No cleanup occurs

Revised Proposed Fix

Strategy: Robust Teardown with Validation and Recovery

The fix must handle three scenarios:

Scenario 1: Normal Teardown (provision succeeded)

  • self.networks contains valid GceNetwork objects
  • self.firewalls contains valid GceFirewall objects
  • Standard deletion works

Scenario 2: Corrupted Pickle (THIS CASE)

  • self.networks contains invalid data (JSON strings, empty, etc.)
  • self.firewalls may be empty or invalid
  • deleted flag may be True
  • Must bypass early return and force cleanup

Scenario 3: Missing Pickle

  • No pickle file exists
  • Must create minimal spec and discover resources

Implementation Plan

def Delete(self):
    """Delete all benchmark resources with robust error handling."""
    
    # CHANGE 1: Don't trust the deleted flag during teardown-only runs
    if self.deleted and stages.TEARDOWN not in FLAGS.run_stage:
        return
    
    # CHANGE 2: Validate network/firewall objects before use
    valid_networks = self._ValidateNetworkObjects()
    valid_firewalls = self._ValidateFirewallObjects()
    
    # CHANGE 3: If validation fails, fall back to direct GCP cleanup
    if not valid_networks or not valid_firewalls:
        logging.warning(
            'Invalid network/firewall objects detected. '
            'Falling back to direct GCP resource cleanup.'
        )
        self._CleanupOrphanedGCPResources()
    
    # ... rest of deletion logic ...
    
    # CHANGE 4: Always attempt firewall deletion (not just DisallowAllPorts)
    if self.firewalls:
        for firewall in self.firewalls.values():
            try:
                if hasattr(firewall, 'Delete') and callable(firewall.Delete):
                    firewall.Delete()
                else:
                    logging.warning(f'Invalid firewall object: {firewall}')
            except Exception:
                logging.exception('Got an exception deleting firewalls.')
    
    # CHANGE 5: Always attempt network deletion
    if self.networks:
        for net in self.networks.values():
            try:
                if hasattr(net, 'Delete') and callable(net.Delete):
                    net.Delete()
                else:
                    logging.warning(f'Invalid network object: {net}')
            except Exception:
                logging.exception('Got an exception deleting networks.')
    
    self.deleted = True

def _ValidateNetworkObjects(self) -> bool:
    """Validate that network objects are actual GceNetwork instances."""
    if not self.networks:
        return False
    for net in self.networks.values():
        if not hasattr(net, 'Delete') or not callable(net.Delete):
            return False
    return True

def _ValidateFirewallObjects(self) -> bool:
    """Validate that firewall objects are actual GceFirewall instances."""
    if not self.firewalls:
        return False
    for fw in self.firewalls.values():
        if not hasattr(fw, 'Delete') or not callable(fw.Delete):
            return False
    return True

def _CleanupOrphanedGCPResources(self) -> None:
    """Directly query and delete GCP resources by run_uri pattern."""
    if FLAGS.cloud != provider_info.GCP:
        logging.warning('Direct cleanup only supported for GCP')
        return
    
    from perfkitbenchmarker.providers.gcp import util as gcp_util
    
    project = FLAGS.project or self.project
    if not project:
        logging.error('No project specified for cleanup')
        return
    
    logging.info(f'Cleaning up orphaned GCP resources for run_uri: {FLAGS.run_uri}')
    
    # Delete firewall rules first (dependency for networks)
    try:
        cmd = gcp_util.GcloudCommand(
            self, 'compute', 'firewall-rules', 'list',
            '--filter', f'name~-{FLAGS.run_uri}',
            '--format', 'value(name)'
        )
        cmd.flags['project'] = project
        stdout, _, retcode = cmd.Issue(raise_on_failure=False)
        if retcode == 0 and stdout.strip():
            for firewall_name in stdout.strip().split('\n'):
                logging.info(f'Deleting orphaned firewall: {firewall_name}')
                del_cmd = gcp_util.GcloudCommand(
                    self, 'compute', 'firewall-rules', 'delete', firewall_name
                )
                del_cmd.flags['project'] = project
                del_cmd.Issue(raise_on_failure=False)
    except Exception as e:
        logging.exception(f'Failed to clean up firewalls: {e}')
    
    # Delete networks
    try:
        cmd = gcp_util.GcloudCommand(
            self, 'compute', 'networks', 'list',
            '--filter', f'name~pkb-network.*{FLAGS.run_uri}',
            '--format', 'value(name)'
        )
        cmd.flags['project'] = project
        stdout, _, retcode = cmd.Issue(raise_on_failure=False)
        if retcode == 0 and stdout.strip():
            for network_name in stdout.strip().split('\n'):
                logging.info(f'Deleting orphaned network: {network_name}')
                del_cmd = gcp_util.GcloudCommand(
                    self, 'compute', 'networks', 'delete', network_name
                )
                del_cmd.flags['project'] = project
                del_cmd.Issue(raise_on_failure=False)
    except Exception as e:
        logging.exception(f'Failed to clean up networks: {e}')

Key Changes

  1. Don't trust deleted flag during teardown-only runs
  2. Validate objects before calling methods on them
  3. Fall back to direct GCP queries when objects are invalid
  4. Use gcloud commands to delete resources by name pattern
  5. Delete firewalls before networks (dependency order)
  6. Change DisallowAllPorts() to Delete() in GceFirewall class

Recommendation

This should be treated as a critical priority bug because:

  1. It causes resource leakage and unexpected costs
  2. It has existed for 8+ years (see #1239: "object_storage_service fails during cleanup - tries to delete instance, firewall and network multiple times, after successful STDERR of it being deleted")
  3. It affects all cloud providers and benchmarks
  4. The pickle corruption issue makes it worse than initially thought
  5. It impacts PKB's reliability and trustworthiness
