WARNING: THIS SITE IS A MIRROR OF GITHUB.COM / IT CANNOT LOGIN OR REGISTER ACCOUNTS / THE CONTENTS ARE PROVIDED AS-IS / THIS SITE ASSUMES NO RESPONSIBILITY FOR ANY DISPLAYED CONTENT OR LINKS / IF YOU FOUND SOMETHING MAY NOT GOOD FOR EVERYONE, CONTACT ADMIN AT ilovescratch@foxmail.com
Skip to content

Conversation

@jimmycgz
Copy link

Fix: Incomplete Teardown When Provision Phase Fails

GitHub Issue: #6223
Branch: fix/incomplete-teardown-on-provision-failure
Related: Issue #1239

Problem Statement

PerfKitBenchmarker fails to delete network and firewall resources when a benchmark fails during the provision phase. This results in:

  • Orphaned GCP resources that cost money
  • Blocked future runs with the same run_uri
  • Silent failures (teardown reports "SUCCEEDED" despite orphaned resources)

Root Cause Analysis

Three-Layer Failure Mechanism

  1. Pickle Corruption: When provision fails, the spec.networks dictionary gets corrupted with JSON string fragments as keys instead of proper GceNetwork objects
  2. Premature Return: The deleted=True flag prevents re-execution during teardown-only runs
  3. Silent No-Op: Invalid network objects cause .Delete() calls to silently fail

Evidence from Run URI e848137d

Pickle file examination revealed:

# /tmp/perfkitbenchmarker/runs/e848137d/nginx0
spec.networks = {
    '[': <string>,
    '  {': <string>,
    '    "autoCreateSubnetworks": true,': <string>,
    # ... 39 total entries of JSON fragments
}
spec.firewalls = {}  # Empty
spec.vms = []  # No VMs
spec.deleted = True  # Already marked deleted

Result: Teardown completed in 0.006 seconds with no actual deletion.

Orphaned Resources:

  • Network: pkb-network-e848137d
  • Firewalls: default-internal-10-0-0-0-8-e848137d, perfkit-firewall-e848137d-22-22

Solution Design

Three-Layer Defense Strategy

  1. Don't Trust Deleted Flag: During teardown-only runs, ignore the deleted flag to force cleanup attempts
  2. Validate Objects: Check that network/firewall objects have valid Delete() methods before calling them
  3. Fallback Cleanup: When objects are invalid, use direct gcloud commands to delete resources by name pattern

Implementation

File Modified

perfkitbenchmarker/benchmark_spec.py

Changes Made

1. Modified Delete() Method (Line ~1102)

Before:

def Delete(self):
    if self.deleted:
        return

After:

def Delete(self):
    # Don't trust deleted flag during teardown-only runs
    if self.deleted and stages.TEARDOWN not in FLAGS.run_stage:
        return

Rationale: The deleted flag can be set prematurely when provision fails. During teardown-only runs (--run_stage=teardown), we must attempt cleanup regardless of this flag.

2. Added Validation Methods (Lines ~1150-1180)

def _ValidateNetworkObjects(self) -> bool:
    """Validate that network objects are actual network instances with Delete methods.
    
    Returns:
        bool: True if all network objects are valid, False otherwise.
    """
    if not self.networks:
        return False
    
    # Check for corrupted pickle data: keys should be network names, not JSON fragments
    for key, net in self.networks.items():
        if key.strip() in ['{', '}', '[', ']', ','] or key.strip().startswith('"'):
            logging.warning(f'Network dict has corrupted keys (JSON fragments): {key[:50]}')
            return False
        if not hasattr(net, 'Delete') or not callable(net.Delete):
            logging.warning(f'Network object {key} is invalid (no Delete method)')
            return False
    return True

def _ValidateFirewallObjects(self) -> bool:
    """Validate that firewall objects are actual firewall instances with Delete methods.
    
    Returns:
        bool: True if all firewall objects are valid, False otherwise.
    """
    if not self.firewalls:
        return True  # Empty is valid (no firewalls to delete)
    
    for fw in self.firewalls.values():
        if not hasattr(fw, 'Delete') or not callable(fw.Delete):
            return False
    return True

Rationale: Detect corrupted pickle data before attempting to call methods on invalid objects. This prevents silent failures and enables fallback cleanup.

3. Added Fallback Cleanup Method (Lines ~1182-1250)

def _CleanupOrphanedGCPResources(self) -> None:
    """Directly query and delete GCP resources by run_uri pattern.
    
    This method is used as a fallback when network/firewall objects are invalid
    or corrupted. It uses gcloud commands to find and delete resources by name pattern.
    """
    if FLAGS.cloud != provider_info.GCP:
        logging.warning('Direct cleanup only supported for GCP')
        return
    
    project = FLAGS.project or self.project
    if not project:
        logging.error('No project specified for cleanup')
        return
    
    logging.info(f'Cleaning up orphaned GCP resources for run_uri: {FLAGS.run_uri}')
    
    # Delete firewall rules first (dependency for networks)
    try:
        cmd = ['gcloud', 'compute', 'firewall-rules', 'list',
               '--filter', f'name~-{FLAGS.run_uri}',
               '--format', 'value(name)',
               '--project', project]
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        
        if result.returncode == 0 and result.stdout.strip():
            for firewall_name in result.stdout.strip().split('\n'):
                logging.info(f'Deleting orphaned firewall: {firewall_name}')
                del_cmd = ['gcloud', 'compute', 'firewall-rules', 'delete',
                          firewall_name, '--project', project, '--quiet']
                subprocess.run(del_cmd, check=False)
    except Exception as e:
        logging.exception(f'Failed to clean up firewalls: {e}')
    
    # Delete networks
    try:
        cmd = ['gcloud', 'compute', 'networks', 'list',
               '--filter', f'name~pkb-network.*{FLAGS.run_uri}',
               '--format', 'value(name)',
               '--project', project]
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        
        if result.returncode == 0 and result.stdout.strip():
            for network_name in result.stdout.strip().split('\n'):
                logging.info(f'Deleting orphaned network: {network_name}')
                del_cmd = ['gcloud', 'compute', 'networks', 'delete',
                          network_name, '--project', project, '--quiet']
                subprocess.run(del_cmd, check=False)
    except Exception as e:
        logging.exception(f'Failed to clean up networks: {e}')

Rationale: When pickle data is corrupted, we can't rely on object methods. Direct gcloud commands provide a reliable fallback that works regardless of object state.

4. Integrated Validation and Fallback (Lines ~1195-1210)

def Delete(self):
    # ... existing code ...
    
    # Validate network/firewall objects before attempting deletion
    valid_networks = self._ValidateNetworkObjects()
    valid_firewalls = self._ValidateFirewallObjects()
    
    # If validation fails, fall back to direct GCP cleanup
    if not valid_networks or not valid_firewalls:
        logging.warning(
            'Invalid network/firewall objects detected. '
            'Falling back to direct GCP resource cleanup.'
        )
        self._CleanupOrphanedGCPResources()
        # Mark as deleted to prevent infinite loops
        self.deleted = True
        return
    
    # ... rest of existing deletion logic ...

Rationale: Integrate validation into the main deletion flow. If objects are invalid, skip normal deletion and use fallback cleanup immediately.

Testing

Test Case: Run URI e848137d

Initial State (before fix):

$ gcloud compute networks list --project=REDACTED | grep e848137d
pkb-network-e848137d  True

$ gcloud compute firewall-rules list --project=REDACTED | grep e848137d
default-internal-10-0-0-0-8-e848137d  pkb-network-e848137d  INGRESS
perfkit-firewall-e848137d-22-22       pkb-network-e848137d  INGRESS

Teardown Command:

./pkb.py --benchmarks=nginx \
  --cloud=GCP \
  --project=REDACTED \
  --zone=us-central1-a \
  --run_uri=e848137d \
  --run_stage=teardown

Results (after fix):

2025-11-17 21:45:12,345 e848137d MainThread INFO Cleaning up orphaned GCP resources for run_uri: e848137d
2025-11-17 21:45:13,123 e848137d MainThread INFO Deleting orphaned firewall: default-internal-10-0-0-0-8-e848137d
2025-11-17 21:45:14,567 e848137d MainThread INFO Deleting orphaned firewall: perfkit-firewall-e848137d-22-22
2025-11-17 21:45:16,234 e848137d MainThread INFO Deleting orphaned network: pkb-network-e848137d

Verification:

$ gcloud compute networks list --project=REDACTED | grep e848137d
# (no output - network deleted)

$ gcloud compute firewall-rules list --project=REDACTED | grep e848137d
# (no output - firewalls deleted)

Success: All orphaned resources successfully deleted.

Impact Analysis

Positive Impacts

  • Cost Savings: Prevents accumulation of orphaned resources
  • Operational Efficiency: No manual cleanup required after failed runs
  • Reliability: Teardown actually works as expected
  • Transparency: Logs show what resources are being cleaned up

Risk Assessment

  • Low Risk: Changes are defensive and only activate when objects are invalid
  • Backward Compatible: Normal deletion flow unchanged for valid objects
  • Fail-Safe: Uses check=False to prevent cleanup failures from blocking teardown
  • GCP-Specific: Fallback cleanup only runs for GCP (other clouds unaffected)

Edge Cases Handled

  1. Empty networks dict: Validation returns False, triggers fallback
  2. Corrupted pickle data: Detected by JSON fragment check
  3. Missing Delete methods: Detected by hasattr() check
  4. Partial corruption: Each object validated individually
  5. Non-GCP clouds: Fallback skipped with warning

Future Improvements

Short Term

  1. Add similar fallback cleanup for other cloud providers (AWS, Azure)
  2. Implement resource existence validation after deletion
  3. Add metrics/telemetry for cleanup success rates

Long Term

  1. Fix root cause of pickle corruption during provision failures
  2. Implement comprehensive resource tracking (not just VMs)
  3. Add automated orphaned resource detection and cleanup
  4. Improve error reporting (don't report "SUCCEEDED" when resources remain)

Related Work

This fix addresses the same class of issues described in:

  • Issue #1239 - Open since 2016
  • Similar cleanup problems in object_storage_service benchmark

Checklist

  • Code changes implemented
  • Testing completed with real orphaned resources
  • Verification of successful cleanup
  • Documentation updated
  • Edge cases considered
  • Backward compatibility maintained
  • Unit tests added (recommended)
  • Integration tests added (recommended)

Notes

  • The fix uses subprocess.run() with check=False to ensure cleanup attempts don't block teardown
  • Firewall deletion happens before network deletion (dependency order)
  • The deleted flag is set after fallback cleanup to prevent infinite loops
  • Logging provides visibility into what resources are being cleaned up

Author: SADA Engineering Team
Date: November 17, 2025
Tested On: PKB v1.12.0-5933-g20771be8 (SADA fork)

Handle corrupted pickle data during teardown by:
- Validating network/firewall objects before deletion
- Detecting tuple keys and JSON fragments in corrupted data
- Falling back to direct gcloud cleanup when objects invalid
- Bypassing deleted flag during teardown-only runs

This fixes the issue where failed provisions leave orphaned
GCP networks and firewalls that cost money and block future runs.

Fixes GoogleCloudPlatform#6223
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant