Description
Describe the bug
I am building my own images to run CAA in an AWS EKS environment, with PodVMs running on an m6a instance with SNP enabled, in eu-west-1. The prebuilt images are not available in that region, and I cannot copy an existing AMI because its snapshot is not public.
Initially I built from confidential-containers/cloud-api-adaptor/tree/main/src/cloud-api-adaptor/podvm, setting CLOUD_PROVIDER=aws. CAA spins up the podvm and the workload runs, but AA is not aware of my TEE environment (SNP).
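A quick way to narrow down whether the missing SNP awareness comes from the guest image or from AA itself is to check, from inside the podvm, whether the kernel exposes the SEV-SNP guest interface at all. This is only a sketch: it assumes the paths created by the Linux `sev-guest` driver are what matter, the exact check attestation-agent performs may differ, and `check_snp_guest` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: report whether this Linux guest exposes the AMD SEV-SNP guest interface.
# Assumption: the sev-guest driver's device/sysfs paths indicate an SNP guest.
# The optional first argument is a path prefix, mainly so the check can be
# exercised against a scratch directory; by default the live system is checked.
check_snp_guest() {
  local root="${1:-}"
  local p
  for p in /dev/sev-guest /sys/devices/platform/sev-guest; do
    if [[ -e "${root}${p}" ]]; then
      echo "found ${p}"
      return 0
    fi
  done
  echo "no sev-guest device found" >&2
  return 1
}
```

On the guest, `dmesg | grep -i sev` should additionally show which memory encryption features the kernel detected.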
Then, reading the docs, I saw that confidential-containers/cloud-api-adaptor/tree/main/src/cloud-api-adaptor/podvm-mkosi is the right way to go, and that I should set TEE_PLATFORM=amd (or snp).
Building this image works, but the CAA-scheduled podvm never boots, no matter what I try.
So, from confidential-containers/cloud-api-adaptor/tree/main/src/cloud-api-adaptor/podvm-mkosi, I ran:
TEE_PLATFORM=amd make debug (to get a debug image)
TEE_PLATFORM=amd make (regular image)
I uploaded the raw image with uplosi, and also tried doing it manually (uploading, creating the snapshot and registering the AMI) in case uplosi was the culprit.
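For reference, the manual path (upload, snapshot, register) can be sketched with the AWS CLI roughly as follows. Bucket, key, AMI name and the `/dev/xvda` root device name are placeholders; the snapshot import needs the `vmimport` service role that VM Import/Export requires, and SEV-SNP launches need an AMI whose boot mode is UEFI. `run`, `import_snapshot` and `register_ami` are hypothetical helper names; with `DRY_RUN=1` the commands are printed instead of executed.

```shell
#!/usr/bin/env bash
# Sketch: import a raw podvm image as an EBS snapshot and register a UEFI AMI.
# All names (bucket, key, AMI name, device name) are placeholders.
set -euo pipefail

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [[ "${DRY_RUN:-0}" == 1 ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Copy the raw disk to S3 and start an asynchronous snapshot import task.
import_snapshot() {
  local image="$1" bucket="$2" key="$3"
  run aws s3 cp "$image" "s3://${bucket}/${key}"
  run aws ec2 import-snapshot \
    --description "podvm-mkosi" \
    --disk-container "Format=RAW,UserBucket={S3Bucket=${bucket},S3Key=${key}}"
}

# Register an AMI backed by the imported snapshot, booting via UEFI.
register_ami() {
  local snapshot_id="$1" name="$2"
  run aws ec2 register-image \
    --name "$name" \
    --architecture x86_64 \
    --virtualization-type hvm \
    --ena-support \
    --boot-mode uefi \
    --root-device-name /dev/xvda \
    --block-device-mappings "DeviceName=/dev/xvda,Ebs={SnapshotId=${snapshot_id}}"
}
```

`import-snapshot` is asynchronous: wait for it with `aws ec2 wait snapshot-imported --import-task-ids <task-id>` and read the snapshot ID from `aws ec2 describe-import-snapshot-tasks` before calling `register_ami`.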
None of these AMIs boot, whether I schedule a peer pod or launch an EC2 instance from the AMI manually. The EC2 instance consistently shuts itself down about 12 seconds after kernel start, long before systemd reaches the point where process-user-data would run. It stops right where the kernel should pivot to the root filesystem and start userspace.
Happy to provide full boot logs privately if needed, but the failure happens consistently at the same stage (≈12 seconds after kernel start) and always ends in Client.InstanceInitiatedShutdown.
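When the instance stops on its own, the serial console output and the recorded state reason are the first things worth comparing between the debug and regular images. A minimal sketch with the AWS CLI, assuming the AMI ID, instance ID and instance type (`m6a.large` here) are placeholders, and that `run`, `launch_snp` and `collect_boot_info` are hypothetical helper names; subnet, key pair and security group options are omitted. `--cpu-options AmdSevSnp=enabled` is what turns SEV-SNP on for a manually launched instance.

```shell
#!/usr/bin/env bash
# Sketch: launch one SEV-SNP instance from a podvm AMI and collect boot diagnostics.
# AMI ID, instance ID and instance type are placeholders.
set -euo pipefail

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [[ "${DRY_RUN:-0}" == 1 ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Launch a single instance with AMD SEV-SNP enabled.
launch_snp() {
  local ami_id="$1" instance_type="${2:-m6a.large}"
  run aws ec2 run-instances \
    --image-id "$ami_id" \
    --instance-type "$instance_type" \
    --cpu-options AmdSevSnp=enabled \
    --count 1
}

# After the instance stops, fetch the latest console output and the state reason.
collect_boot_info() {
  local instance_id="$1"
  run aws ec2 get-console-output --instance-id "$instance_id" --latest --output text
  run aws ec2 describe-instances --instance-ids "$instance_id" \
    --query 'Reservations[].Instances[].StateReason' --output json
}
```

`--latest` asks for the most recent console output and is only supported on Nitro-based instance types such as m6a.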
Does anyone have any idea what is going on?
I am doing all of this on a checkout of the repo at the v0.16.0 tag.
How to reproduce
From confidential-containers/cloud-api-adaptor/tree/main/src/cloud-api-adaptor/podvm-mkosi, run:
TEE_PLATFORM=amd make debug (to get a debug image)
TEE_PLATFORM=amd make (regular image)
Upload the raw image with uplosi, or manually (upload, create the snapshot, register the AMI); both give the same result.
Launch an instance from the resulting AMI, either as a peer pod or directly in EC2. The instance shuts itself down about 12 seconds after kernel start, at the point where the kernel should pivot to the root filesystem and start userspace, long before systemd reaches process-user-data.
CoCo version information
v0.16.0
What TEE are you seeing the problem on
Snp