Windows video playback optimisations #1454
Conversation
Walkthrough
Added SIMD-parallel YUV→RGBA conversion with progress/cancellation and rayon-based parallelism; refactored GPU texture allocation to be dynamic and GPU-limit aware with zero-copy fallback on Windows; added hardware-decoder capability detection and software-threading configuration for FFmpeg/Media Foundation.
Sequence Diagram(s)
sequenceDiagram
actor User
participant Conv as YuvToRgbaConverter
participant Tex as TextureAllocation
participant GPU as GPU/WGPU
participant D3D as D3D11 (Windows)
User->>Conv: convert_nv12_with_fallback(...)
Conv->>Conv: validate_dimensions(gpu_max_texture_size)
alt Resize needed
Conv->>Tex: ensure_texture_size(new_w,new_h)
Tex->>GPU: create/reallocate textures
GPU-->>Tex: textures ready
end
Note over Conv,D3D: Zero-copy attempt (Windows)
Conv->>D3D: acquire shared handle
D3D-->>Conv: handle or failure
alt Zero-copy succeeds
Conv-->>User: return texture view
else Zero-copy fails
Conv->>Conv: set zero_copy_failed
Conv->>GPU: staged copy of NV12 planes
GPU-->>Conv: staged texture ready
Conv-->>User: return texture view
end
sequenceDiagram
participant Scheduler as rayon Thread Pool
participant Progress as ConversionProgress
participant SIMD as SIMD Dispatch (AVX2/SSE2/Scalar)
participant Output as Output Buffer
Scheduler->>Scheduler: split rows, par_iter
par Threads
Scheduler->>SIMD: detect SimdLevel (cached)
SIMD-->>Scheduler: chosen variant
Scheduler->>SIMD: convert row(s)
SIMD-->>Output: write RGBA pixels
Scheduler->>Progress: fetch_add rows_completed
Scheduler->>Progress: check cancelled -> break if set
end
Scheduler-->>Output: final buffer
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/rendering/src/yuv_converter.rs (1)
1136-1173: Potential mismatch when downscaling is triggered.
The validate_dimensions call returns effective_width/effective_height and a _downscaled flag, but the actual conversion still uses the original width and height for:
- RGBA buffer allocation (line 1141)
- SIMD conversion (lines 1143-1151)
- Texture write (lines 1153-1171)
If the video exceeds GPU limits and downscaling is indicated, the textures will be sized for the downscaled resolution while the data is at original resolution, which could cause issues. This also applies to convert_yuv420p_cpu below.
Consider either:
- Using effective_width/effective_height consistently when downscaling
- Or rejecting the conversion with an error if downscaling would be required for CPU paths
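A minimal sketch of the second option for the CPU path — allocate and convert at the effective size, and bail out when downscaling would be required. Function names follow the review; the exact signatures and error types are assumptions:

fn convert_nv12_cpu_at_effective_size(
    y_plane: &[u8],
    uv_plane: &[u8],
    width: u32,
    height: u32,
    gpu_max_texture_size: u32,
) -> Result<Vec<u8>, String> {
    // Reuse the validated dimensions instead of the raw width/height.
    let (effective_width, effective_height, downscaled) =
        validate_dimensions(width, height, gpu_max_texture_size).map_err(|e| format!("{e:?}"))?;

    // CPU path: refuse rather than silently mismatching buffer and texture sizes.
    if downscaled {
        return Err(format!(
            "CPU path cannot downscale {width}x{height} to {effective_width}x{effective_height}"
        ));
    }

    let mut rgba = vec![0u8; effective_width as usize * effective_height as usize * 4];
    nv12_to_rgba(
        y_plane,
        uv_plane,
        effective_width,
        effective_height,
        effective_width,
        effective_width,
        &mut rgba,
    );
    Ok(rgba)
}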
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
- crates/rendering/Cargo.toml (1 hunks)
- crates/rendering/src/cpu_yuv.rs (9 hunks)
- crates/rendering/src/yuv_converter.rs (17 hunks)
- crates/video-decode/Cargo.toml (1 hunks)
- crates/video-decode/src/ffmpeg.rs (2 hunks)
- crates/video-decode/src/lib.rs (1 hunks)
- crates/video-decode/src/media_foundation.rs (8 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.rs: Never use dbg!() macro; use proper logging (tracing::debug!, etc.) instead
Never write let _ = async_fn() which silently drops futures; await or explicitly handle them
Use duration.saturating_sub(other) instead of duration - other to avoid panics on underflow
Merge nested if statements: write 'if a && b { }' instead of 'if a { if b { } }'
Don't call .clone() on Copy types (integers, bools, etc.); copy them directly
Use function references directly: iter.map(foo) instead of iter.map(|x| foo(x))
Accept &[T] or &str instead of &Vec or &String in function parameters for flexibility
Use .is_empty() instead of .len() == 0 or .len() > 0 / .len() != 0
Don't assign () to a variable: write foo(); instead of let _ = foo(); or let x = foo(); when return is unit
Use .unwrap_or(val) instead of .unwrap_or_else(|| val) when the default is a simple/cheap value
Use 'for item in &collection' or 'for (i, item) in collection.iter().enumerate()' instead of 'for i in 0..collection.len()'
Use value.clamp(min, max) instead of manual if chains or .min(max).max(min) patterns
Always handle Result/Option or types marked #[must_use]; never ignore them
**/*.rs: Use rustfmt and workspace clippy lints for Rust code formatting and linting
Use snake_case for Rust module names and kebab-case for crate names
Never use dbg!() macro in Rust code; use proper logging instead (Clippy: dbg_macro = deny)
Always handle Result/Option or types marked #[must_use]; never ignore them (Rust compiler lint: unused_must_use = deny)
Never write let _ = async_fn() which silently drops futures; await or explicitly handle them (Clippy: let_underscore_future = deny)
Use saturating_sub instead of - for Duration to avoid panics (Clippy: unchecked_duration_subtraction = deny)
Merge nested if statements: use if a && b { } instead of if a { if b { } } (Clippy: collapsible_if = deny)
Don't call .clone() on Copy types; just copy them directly (Clippy: clone_on_copy = deny)
U...
Files:
crates/video-decode/src/lib.rs, crates/video-decode/src/ffmpeg.rs, crates/video-decode/src/media_foundation.rs, crates/rendering/src/cpu_yuv.rs, crates/rendering/src/yuv_converter.rs
**/*.{ts,tsx,js,jsx,rs}
📄 CodeRabbit inference engine (AGENTS.md)
Never add comments to code (//, /* */, ///, //!, #, etc.); code must be self-explanatory through naming, types, and structure
Files:
crates/video-decode/src/lib.rs, crates/video-decode/src/ffmpeg.rs, crates/video-decode/src/media_foundation.rs, crates/rendering/src/cpu_yuv.rs, crates/rendering/src/yuv_converter.rs
🧠 Learnings (2)
📚 Learning: 2025-10-28T08:39:42.230Z
Learnt from: Brendonovich
Repo: CapSoftware/Cap PR: 1305
File: crates/recording/src/output_pipeline/macos.rs:80-90
Timestamp: 2025-10-28T08:39:42.230Z
Learning: In `crates/recording/src/output_pipeline/macos.rs`, the `AVFoundationMp4Muxer` intentionally holds the `Mutex<MP4Encoder>` lock during retry attempts in `send_video_frame()` and `send_audio_frame()`. This blocking behavior is correct because frame processing must happen sequentially to prevent audio and video frames from being interleaved incorrectly in the encoder.
Applied to files:
crates/video-decode/src/lib.rs, crates/video-decode/src/ffmpeg.rs
📚 Learning: 2025-12-07T14:29:40.743Z
Learnt from: CR
Repo: CapSoftware/Cap PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-07T14:29:40.743Z
Learning: Applies to **/*.rs : Use `rustfmt` and workspace clippy lints for Rust code formatting and linting
Applied to files:
crates/rendering/Cargo.toml
🧬 Code graph analysis (1)
crates/video-decode/src/lib.rs (1)
crates/video-decode/src/media_foundation.rs (1)
get_mf_decoder_capabilities(138-140)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Clippy (aarch64-apple-darwin, macos-latest)
- GitHub Check: Clippy (x86_64-pc-windows-msvc, windows-latest)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (rust)
🔇 Additional comments (20)
crates/video-decode/Cargo.toml (1)
12-12: Appropriate dependency choice for CPU-aware threading configuration.
The num_cpus crate version 1.16.0 is a stable choice for dynamically detecting available CPUs to configure software decoder threading, which aligns well with the PR's objective to optimize threading based on available hardware. A newer version 1.17.0 is available, but the caret versioning (^1.16) allows automatic updates within the 1.x series as needed.
crates/rendering/Cargo.toml (1)
21-21: Excellent choice for parallelization. Rayon is a data-parallelism library for Rust that is extremely lightweight and guarantees data-race freedom, making it the standard for this use case. Version 1.10 is current and appropriate.
crates/video-decode/src/lib.rs (1)
10-14: LGTM! Clean extension of the public API to expose hardware decoder capability querying. The re-export structure follows the existing pattern.
crates/video-decode/src/ffmpeg.rs (4)
13-28: LGTM! Well-structured capability struct with sensible defaults for fallback scenarios.
32-134: LGTM! Solid D3D11 capability detection with proper fallback handling. The tiered resolution testing (8K → 4K) correctly identifies hardware limits.
145-172: LGTM! Good resolution-based threading configuration. The thresholds align well with common video resolutions (1080p, 4K).
213-246: LGTM! Good integration of dynamic hardware capability checking. The fallback paths are well-structured, ensuring software threading is configured appropriately in all scenarios.
crates/video-decode/src/media_foundation.rs (4)
34-140: LGTM! Well-structured capability detection with descending resolution tests. The Option return from get_mf_decoder_capabilities appropriately signals when initialization hasn't occurred yet.
160-294: LGTM! Efficient texture pooling implementation that correctly handles dimension changes and uses appropriate formats for NV12 planes.
350-388: LGTM! Clean initialization with one-time capability detection. The warning for oversized videos is appropriate without being blocking.
555-624: LGTM! The texture pool integration and plane extraction logic are correct. The D3D11_BOX regions properly extract Y (full res) and UV (half res) planes from the NV12 texture.
Note: y_handle and uv_handle are always None in this path. If shared handle support is needed for zero-copy in consumers, this would need to be extended.
crates/rendering/src/cpu_yuv.rs (4)
1-33: LGTM! Clean progress tracking with appropriate use of relaxed atomics for UI feedback purposes.
431-475: AVX2/SSE2 functions are placeholders calling scalar code.
The nv12_convert_row_avx2 and nv12_convert_row_sse2 functions both just delegate to the scalar implementation. The SIMD detection logic is in place, but actual SIMD intrinsics aren't implemented yet. This means the parallelization via rayon provides the optimization, but not SIMD vectorization within each row.
Is this intentional as a first step, with SIMD intrinsics to be added later? If so, consider adding a TODO or tracking issue.
282-335: LGTM! Good band-based parallelization with rayon. The MIN_ROWS_PER_THREAD threshold prevents excessive overhead for small images, and per-row progress updates provide responsive UI feedback.
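For context, a minimal, self-contained sketch of band-based row parallelism with rayon; the constant and the per-row work are illustrative stand-ins, not the PR's code:

use rayon::prelude::*;

const MIN_ROWS_PER_THREAD: usize = 16;

fn convert_rows_parallel(width: usize, height: usize, output: &mut [u8]) {
    let row_bytes = width * 4;
    // Band size balances scheduling overhead against parallelism.
    let band_rows = MIN_ROWS_PER_THREAD.max(height / rayon::current_num_threads().max(1));

    output
        .par_chunks_mut(row_bytes * band_rows)
        .enumerate()
        .for_each(|(band_idx, band)| {
            for (row_in_band, row) in band.chunks_mut(row_bytes).enumerate() {
                let global_row = band_idx * band_rows + row_in_band;
                if global_row >= height {
                    break;
                }
                // Stand-in for the per-row YUV→RGBA conversion: set alpha only.
                for px in row.chunks_mut(4) {
                    px[3] = 255;
                }
            }
        });
}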
1038-1224: LGTM! Comprehensive test coverage for the new features: SIMD detection, progress tracking, parallel conversion, and cancellation behavior.
crates/rendering/src/yuv_converter.rs (5)
85-134: LGTM! Solid dimension validation with GPU-awareness. The even dimension enforcement (& !1) is essential for YUV formats, and the aspect-ratio-preserving downscale is well-implemented.
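Roughly, that kind of validation looks like the following sketch; the return type and exact scaling rules here are assumptions, not the PR's code:

fn validate_dimensions_sketch(width: u32, height: u32, gpu_max: u32) -> Option<(u32, u32, bool)> {
    if width == 0 || height == 0 {
        return None;
    }
    if width <= gpu_max && height <= gpu_max {
        // Force even dimensions so chroma planes stay aligned.
        return Some((width & !1, height & !1, false));
    }
    // Preserve aspect ratio by scaling both axes with the same factor.
    let scale = (gpu_max as f64 / width as f64).min(gpu_max as f64 / height as f64);
    if scale < 0.1 {
        // Reject extreme downscaling rather than producing an unusable frame.
        return None;
    }
    let new_w = ((width as f64 * scale) as u32) & !1;
    let new_h = ((height as f64 * scale) as u32) & !1;
    Some((new_w, new_h, true))
}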
347-470: LGTM! Well-factored texture creation helpers with correct formats for each YUV plane type. Double-buffered output is good for pipeline efficiency.
472-510: LGTM! Good grow-only texture reallocation strategy that avoids thrashing. The aligned dimensions ensure GPU-friendly allocation sizes.
943-1049: LGTM! Clean implementation of the zero-copy D3D11 shared handle path with proper error handling for both texture imports.
1051-1122: LGTM! Smart fallback strategy with sticky failure flag to avoid repeated zero-copy attempts. The reset_zero_copy_state method allows recovery when conditions change.
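The sticky-fallback pattern described here, sketched independently of the PR's types (names and error handling are illustrative):

struct ZeroCopyState {
    zero_copy_failed: bool,
}

impl ZeroCopyState {
    fn convert(
        &mut self,
        try_zero_copy: impl FnOnce() -> Result<(), String>,
        staged_copy: impl FnOnce(),
    ) {
        if !self.zero_copy_failed {
            match try_zero_copy() {
                Ok(()) => return,
                Err(err) => {
                    // Remember the failure so later frames skip the doomed attempt.
                    tracing::warn!(%err, "zero-copy import failed, falling back to staged copy");
                    self.zero_copy_failed = true;
                }
            }
        }
        staged_copy();
    }

    // Allows retrying zero-copy if the hardware configuration changes.
    fn reset_zero_copy_state(&mut self) {
        self.zero_copy_failed = false;
    }
}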
Actionable comments posted: 5
🧹 Nitpick comments (5)
crates/rendering/src/decoder/mod.rs (2)
213-216: Refactor to avoid suppressing the redundant_closure lint.
The closure can be eliminated by using .as_deref() followed by a function reference, which aligns with the coding guidelines that specify this lint is set to deny. As per coding guidelines, use function references directly instead of closures.
Apply this diff:
-#[allow(clippy::redundant_closure)]
 pub fn iosurface_backing(&self) -> Option<&cv::ImageBuf> {
-    self.iosurface_backing.as_ref().map(|b| b.inner())
+    self.iosurface_backing.as_deref().map(SendableImageBuf::inner)
 }
278-281: Refactor to avoid suppressing the redundant_closure lint.
The closure can be eliminated by using .as_deref() followed by a function reference, which aligns with the coding guidelines that specify this lint is set to deny. As per coding guidelines, use function references directly instead of closures.
Apply this diff:
-#[allow(clippy::redundant_closure)]
 pub fn d3d11_texture_backing(&self) -> Option<&ID3D11Texture2D> {
-    self.d3d11_texture_backing.as_ref().map(|b| b.inner())
+    self.d3d11_texture_backing.as_deref().map(SendableD3D11Texture::inner)
 }
crates/video-decode/src/media_foundation.rs (2)
57-134: Consider documenting that max_width/max_height represent the maximum resolution supported by any codec.
The current implementation sets max_width and max_height to the highest resolution supported by either H264 or HEVC decoders. This means if H264 supports 8K but HEVC only supports 4K, max_width will be 8K. Callers checking dimensions against these limits should be aware that the limits are codec-agnostic and may not reflect per-codec capabilities.
220-291: Validate that dimensions are even for NV12 texture creation.
Lines 256-257 create UV textures with half dimensions (width / 2, height / 2). If the input dimensions are odd, this will truncate, potentially causing misalignment issues with the NV12 format, which expects even dimensions.
Consider adding a validation check:
 fn get_or_create_yuv_textures(
     &mut self,
     device: &ID3D11Device,
     width: u32,
     height: u32,
 ) -> Result<(&ID3D11Texture2D, &ID3D11Texture2D), String> {
+    if width % 2 != 0 || height % 2 != 0 {
+        return Err(format!("NV12 requires even dimensions, got {}x{}", width, height));
+    }
     use windows::Win32::Graphics::Dxgi::Common::{
crates/rendering/src/yuv_converter.rs (1)
622-638: Consider more idiomatic Option handling.
The early return pattern works, but could be more concise:
-        if self.iosurface_cache.is_none() {
-            return Err(IOSurfaceTextureError::NoMetalDevice.into());
-        }
-
         let io_surface = image_buf
             .io_surf()
             .ok_or(IOSurfaceTextureError::NoIOSurface)?;
         let width = image_buf.width() as u32;
         let height = image_buf.height() as u32;
         let (effective_width, effective_height, _downscaled) =
             validate_dimensions(width, height, self.gpu_max_texture_size)?;
         self.ensure_texture_size(device, effective_width, effective_height);
         self.swap_output_buffer();
-        let cache = self.iosurface_cache.as_ref().unwrap();
+        let cache = self.iosurface_cache.as_ref()
+            .ok_or(IOSurfaceTextureError::NoMetalDevice)?;
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- crates/frame-converter/src/d3d11.rs (11 hunks)
- crates/recording/src/output_pipeline/win.rs (2 hunks)
- crates/rendering/src/cpu_yuv.rs (9 hunks)
- crates/rendering/src/decoder/mod.rs (2 hunks)
- crates/rendering/src/yuv_converter.rs (18 hunks)
- crates/video-decode/src/ffmpeg.rs (2 hunks)
- crates/video-decode/src/media_foundation.rs (8 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.rs: Never use dbg!() macro; use proper logging (tracing::debug!, etc.) instead
Never write let _ = async_fn() which silently drops futures; await or explicitly handle them
Use duration.saturating_sub(other) instead of duration - other to avoid panics on underflow
Merge nested if statements: write 'if a && b { }' instead of 'if a { if b { } }'
Don't call .clone() on Copy types (integers, bools, etc.); copy them directly
Use function references directly: iter.map(foo) instead of iter.map(|x| foo(x))
Accept &[T] or &str instead of &Vec or &String in function parameters for flexibility
Use .is_empty() instead of .len() == 0 or .len() > 0 / .len() != 0
Don't assign () to a variable: write foo(); instead of let _ = foo(); or let x = foo(); when return is unit
Use .unwrap_or(val) instead of .unwrap_or_else(|| val) when the default is a simple/cheap value
Use 'for item in &collection' or 'for (i, item) in collection.iter().enumerate()' instead of 'for i in 0..collection.len()'
Use value.clamp(min, max) instead of manual if chains or .min(max).max(min) patterns
Always handle Result/Option or types marked #[must_use]; never ignore them
**/*.rs: Use rustfmt and workspace clippy lints for Rust code formatting and linting
Use snake_case for Rust module names and kebab-case for crate names
Never use dbg!() macro in Rust code; use proper logging instead (Clippy: dbg_macro = deny)
Always handle Result/Option or types marked #[must_use]; never ignore them (Rust compiler lint: unused_must_use = deny)
Never write let _ = async_fn() which silently drops futures; await or explicitly handle them (Clippy: let_underscore_future = deny)
Use saturating_sub instead of - for Duration to avoid panics (Clippy: unchecked_duration_subtraction = deny)
Merge nested if statements: use if a && b { } instead of if a { if b { } } (Clippy: collapsible_if = deny)
Don't call .clone() on Copy types; just copy them directly (Clippy: clone_on_copy = deny)
U...
Files:
crates/recording/src/output_pipeline/win.rs, crates/video-decode/src/ffmpeg.rs, crates/rendering/src/decoder/mod.rs, crates/video-decode/src/media_foundation.rs, crates/rendering/src/yuv_converter.rs, crates/frame-converter/src/d3d11.rs, crates/rendering/src/cpu_yuv.rs
**/*.{ts,tsx,js,jsx,rs}
📄 CodeRabbit inference engine (AGENTS.md)
Never add comments to code (//, /* */, ///, //!, #, etc.); code must be self-explanatory through naming, types, and structure
Files:
crates/recording/src/output_pipeline/win.rs, crates/video-decode/src/ffmpeg.rs, crates/rendering/src/decoder/mod.rs, crates/video-decode/src/media_foundation.rs, crates/rendering/src/yuv_converter.rs, crates/frame-converter/src/d3d11.rs, crates/rendering/src/cpu_yuv.rs
🧠 Learnings (2)
📚 Learning: 2025-10-28T08:39:42.230Z
Learnt from: Brendonovich
Repo: CapSoftware/Cap PR: 1305
File: crates/recording/src/output_pipeline/macos.rs:80-90
Timestamp: 2025-10-28T08:39:42.230Z
Learning: In `crates/recording/src/output_pipeline/macos.rs`, the `AVFoundationMp4Muxer` intentionally holds the `Mutex<MP4Encoder>` lock during retry attempts in `send_video_frame()` and `send_audio_frame()`. This blocking behavior is correct because frame processing must happen sequentially to prevent audio and video frames from being interleaved incorrectly in the encoder.
Applied to files:
crates/recording/src/output_pipeline/win.rs, crates/video-decode/src/ffmpeg.rs
📚 Learning: 2025-12-07T14:29:40.743Z
Learnt from: CR
Repo: CapSoftware/Cap PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-07T14:29:40.743Z
Learning: Applies to **/*.rs : Use function references directly instead of redundant closures: `iter.map(foo)` instead of `iter.map(|x| foo(x))` (Clippy: `redundant_closure` = deny)
Applied to files:
crates/rendering/src/decoder/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Clippy (x86_64-pc-windows-msvc, windows-latest)
- GitHub Check: Clippy (aarch64-apple-darwin, macos-latest)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (rust)
🔇 Additional comments (27)
crates/recording/src/output_pipeline/win.rs (1)
215-215: LGTM! Improved error message clarity.
The formatting change from positional to named parameters ({poisoned}) makes the error messages more explicit and easier to understand during debugging.
Also applies to: 543-543
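For reference, the two format-argument styles produce identical output; the change is purely about readability (a hypothetical example, not the PR's code):

fn describe(poisoned: &str) -> (String, String) {
    // Positional placeholder: the argument is passed separately.
    let positional = format!("mutex poisoned: {:?}", poisoned);
    // Inline capture: the variable name appears directly in the format string.
    let named = format!("mutex poisoned: {poisoned:?}");
    (positional, named)
}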
crates/frame-converter/src/d3d11.rs (1)
116-116: LGTM! Consistent error formatting improvements.
The switch from positional placeholders ({:?}) to named parameters ({e:?}) across all error sites improves code clarity and follows Rust formatting conventions.
Also applies to: 120-120, 124-124, 168-168, 195-195, 200-200, 226-228, 236-236, 349-349, 381-381, 407-407, 431-431, 455-455, 527-527
crates/video-decode/src/ffmpeg.rs (5)
10-30: LGTM! The HwDecoderCapabilities struct and caching mechanism are well-designed. Using OnceLock ensures thread-safe lazy initialization of hardware capabilities, and the optimistic default values (8192x8192) are safe since the decoder logic validates dimensions against actual capabilities before attempting hardware decode.
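A minimal sketch of that OnceLock caching pattern; field names and the 8192x8192 defaults follow the review text, and the probe function is a placeholder:

use std::sync::OnceLock;

#[derive(Clone, Copy, Debug)]
struct HwDecoderCapabilities {
    max_width: u32,
    max_height: u32,
}

static HW_CAPS: OnceLock<HwDecoderCapabilities> = OnceLock::new();

fn hw_decoder_capabilities() -> HwDecoderCapabilities {
    // The expensive probe runs at most once; later callers read the cached value.
    *HW_CAPS.get_or_init(|| {
        probe_hw_decoder().unwrap_or(HwDecoderCapabilities {
            max_width: 8192,
            max_height: 8192,
        })
    })
}

fn probe_hw_decoder() -> Option<HwDecoderCapabilities> {
    // Placeholder: the real implementation queries the D3D11 video device.
    None
}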
32-133: LGTM - Hardware capability detection is robust.
The Windows D3D11 video decoder capability query is well-structured with proper error handling. The logic correctly:
- Tests for 8K (8192x8192) support first and updates max dimensions to 8192 if successful
- Falls back to testing 4K (4096x4096) support if 8K fails
- Retains the baseline 4096x4096 if neither check succeeds
- Returns safe defaults on any query failure
The use of unsafe Windows APIs is necessary for D3D11 device creation and has appropriate error handling.
135-142: LGTM! The non-Windows stub and public accessor are appropriate. The get_or_init pattern ensures the capability query runs once and the result is cached for subsequent calls.
144-171: LGTM - Software threading configuration is well-designed.
The function appropriately:
- Uses u64 for pixel count calculation to prevent overflow
- Applies resolution-based heuristics: auto-threading for 4K+, half CPU cores (min 2) for Full HD, and 2 threads for smaller resolutions
- Uses frame-level threading (FF_THREAD_FRAME), which is appropriate for video decoding
- Has a null pointer check before accessing the FFmpeg codec context
- Logs the configuration for observability
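Illustrating the thresholds listed above, a hedged sketch of picking a software-decoder thread count with num_cpus; the PR's exact cutoffs and how the value is applied to the codec context may differ:

fn software_decode_threads(width: u32, height: u32) -> i32 {
    // u64 arithmetic avoids overflow for very large frame sizes.
    let pixels = width as u64 * height as u64;
    let cores = num_cpus::get() as i32;

    if pixels >= 3840 * 2160 {
        // 4K and above: 0 lets FFmpeg choose automatically.
        0
    } else if pixels >= 1920 * 1080 {
        // Full HD: roughly half the cores, but never fewer than two.
        (cores / 2).max(2)
    } else {
        // Smaller resolutions gain little from more threads.
        2
    }
}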
212-245: LGTM - Hardware/software decode logic is correct.
The integration properly handles all decode paths:
- Hardware requested, within limits → Uses hardware acceleration (line 228-232)
- Hardware requested, exceeds limits → Falls back to software with threading (line 223)
- Hardware requested, initialization fails → Falls back to software with threading (line 236)
- No hardware requested → Configures software threading (line 244)
The logic correctly avoids duplicate software threading configuration by checking both hw_device.is_none() and hw_device_type.is_none() on line 243. While the pure software decode path (when hw_device_type is None) is not exercised by the current codebase, the implementation handles it correctly.
crates/rendering/src/yuv_converter.rs (11)
85-92: LGTM! Sensible texture dimension constants.
The increased GPU texture limits (8K) and initial allocation at 1080p with 64-byte padding alignment are appropriate for modern hardware.
97-134: LGTM! Robust dimension validation with GPU-aware downscaling.
The validation logic correctly:
- Respects both hardware and configured limits
- Scales down proportionally when needed
- Rejects extreme downscaling (< 0.1)
- Ensures even dimensions for YUV compatibility
- Provides clear logging
152-154: LGTM! Appropriate state tracking fields.
The new fields correctly track:
- Current texture allocation dimensions
- GPU capability limits
- Zero-copy failure state (Windows)
Also applies to: 163-164
169-174: LGTM! Proper GPU capability detection.
The initialization correctly queries GPU limits and logs the configuration.
305-345: LGTM! Sensible initial texture allocation.
Starting at 1080p avoids over-allocation while being sufficient for common use cases. The texture will grow dynamically as needed.
347-470: LGTM! Well-structured texture creation helpers.
The helper functions correctly:
- Create appropriately sized textures for each plane
- Handle Y, UV, U, V planes with correct dimensions (half-size for chroma)
- Create double-buffered output textures
- Return both texture and view for convenience
512-514: LGTM! Clean accessor method.
540-542: LGTM! Consistent dimension validation pattern.
All conversion entry points correctly validate dimensions against GPU limits before processing. The _downscaled flag is currently unused but may be useful for future metrics or warnings.
961-1049: LGTM! Well-implemented zero-copy path with proper error handling.
The implementation correctly:
- Validates dimensions before processing
- Attempts zero-copy import for both Y and UV planes
- Provides detailed logging for success and failure cases
- Returns appropriate errors with context
1051-1112: LGTM! Smart fallback strategy with persistent failure tracking.
The implementation intelligently:
- Attempts zero-copy when available and not previously failed
- Permanently disables zero-copy after first failure to avoid repeated overhead
- Provides clear trace logging for path selection
- Falls back gracefully to staging copy
The persistent failure tracking (zero_copy_failed) is a good optimization to avoid repeated failed attempts on incompatible hardware.
1114-1122: LGTM! Useful zero-copy state management API.
The accessors allow callers to:
- Query current zero-copy status
- Reset state if hardware configuration changes
crates/rendering/src/cpu_yuv.rs (9)
1-33: LGTM! Well-designed progress tracking with atomic state.
The implementation correctly:
- Uses atomic operations for thread-safe progress tracking
- Handles the edge case (zero total_rows) in progress_fraction
- Uses Relaxed ordering appropriately for non-critical progress updates
- Provides cancellation support
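The general shape of such a progress type, as a standalone sketch rather than the PR's exact struct:

use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

pub struct ProgressSketch {
    total_rows: usize,
    rows_completed: AtomicUsize,
    cancelled: AtomicBool,
}

impl ProgressSketch {
    pub fn new(total_rows: usize) -> Self {
        Self {
            total_rows,
            rows_completed: AtomicUsize::new(0),
            cancelled: AtomicBool::new(false),
        }
    }

    // Relaxed ordering is sufficient: the counter only feeds UI feedback.
    pub fn row_done(&self) {
        self.rows_completed.fetch_add(1, Ordering::Relaxed);
    }

    pub fn progress_fraction(&self) -> f32 {
        if self.total_rows == 0 {
            return 1.0;
        }
        self.rows_completed.load(Ordering::Relaxed) as f32 / self.total_rows as f32
    }

    pub fn cancel(&self) {
        self.cancelled.store(true, Ordering::Relaxed);
    }

    pub fn is_cancelled(&self) -> bool {
        self.cancelled.load(Ordering::Relaxed)
    }
}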
129-160: LGTM! Clean SIMD capability detection.
The implementation correctly:
- Detects AVX2/SSE2 support on x86 architectures
- Falls back to scalar on non-x86
- Provides useful metadata via pixels_per_iteration
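A minimal sketch of that detection logic using std's runtime feature detection; the enum and function names here are illustrative:

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SimdLevelSketch {
    Avx2,
    Sse2,
    Scalar,
}

fn detect_simd_level() -> SimdLevelSketch {
    #[cfg(any(target_arch = "x86_64", target_arch = "x86"))]
    {
        // Runtime checks, so a binary built for generic x86_64 still uses AVX2 when present.
        if is_x86_feature_detected!("avx2") {
            return SimdLevelSketch::Avx2;
        }
        if is_x86_feature_detected!("sse2") {
            return SimdLevelSketch::Sse2;
        }
    }
    // Non-x86 targets (and very old x86 CPUs) fall back to scalar code.
    SimdLevelSketch::Scalar
}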
162-165: LGTM! Sensible parallelism thresholds.
Using 1080p as the cutoff for parallel processing and 16 rows per thread are appropriate values that balance overhead with parallelism benefits.
177-255: LGTM! Robust SIMD conversion with proper validation.
The implementation correctly:
- Validates buffer sizes with overflow protection (saturating arithmetic)
- Detects optimal SIMD level
- Chooses parallel/sequential based on frame size
- Falls back to scalar conversion if validation fails
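As an illustration of the overflow-safe buffer validation mentioned above, a standalone sketch (not the PR's exact check):

fn nv12_buffers_are_valid(
    y_len: usize,
    uv_len: usize,
    out_len: usize,
    width: usize,
    height: usize,
    y_stride: usize,
    uv_stride: usize,
) -> bool {
    // Saturating arithmetic: absurd inputs saturate instead of wrapping,
    // so the comparisons below simply fail rather than panic or pass spuriously.
    let y_needed = y_stride.saturating_mul(height);
    let uv_needed = uv_stride.saturating_mul(height / 2);
    let out_needed = width.saturating_mul(height).saturating_mul(4);
    y_len >= y_needed && uv_len >= uv_needed && out_len >= out_needed
}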
257-285: LGTM! Sequential conversion with cancellation support.
Properly checks cancellation before each row and updates progress atomically after processing.
287-342: LGTM! Efficient parallel conversion using rayon.
The implementation correctly:
- Divides work into bands based on thread count
- Uses par_chunks_mut for parallel processing
- Checks cancellation within each band
- Guards against out-of-bounds with a global_row >= height check
- Updates progress atomically
344-413: LGTM! Clean SIMD dispatch pattern.
The dispatch correctly routes to AVX2/SSE2/Scalar implementations based on detected capabilities. Unsafe blocks are properly contained.
537-621: LGTM! Consistent pattern with NV12 conversion.
The YUV420P implementation follows the same robust pattern as NV12 with proper validation and parallel/sequential selection.
Note: Same SIMD placeholder issue applies (see comment on lines 415-495).
623-711: LGTM! Consistent parallel and sequential conversion implementations.
The YUV420P conversion correctly mirrors the NV12 pattern with proper cancellation and progress tracking.
#[cfg(any(target_arch = "x86_64", target_arch = "x86"))]
#[target_feature(enable = "avx2")]
unsafe fn nv12_convert_row_avx2(
    y_data: &[u8],
    uv_data: &[u8],
    width: usize,
    y_row_start: usize,
    uv_row_start: usize,
    out_row_start: usize,
    output: &mut [u8],
) {
    unsafe {
        nv12_convert_row_sse2(
            y_data,
            uv_data,
            width,
            y_row_start,
            uv_row_start,
            out_row_start,
            output,
        );
    }
}

#[cfg(any(target_arch = "x86_64", target_arch = "x86"))]
#[target_feature(enable = "sse2")]
unsafe fn nv12_convert_row_sse2(
    y_data: &[u8],
    uv_data: &[u8],
    width: usize,
    y_row_start: usize,
    uv_row_start: usize,
    out_row_start: usize,
    output: &mut [u8],
) {
    nv12_convert_row_scalar(
        y_data,
        uv_data,
        width,
        y_row_start,
        uv_row_start,
        out_row_start,
        output,
    );
}

#[cfg(any(target_arch = "x86_64", target_arch = "x86"))]
fn nv12_convert_row_scalar(
    y_data: &[u8],
    uv_data: &[u8],
    width: usize,
    y_row_start: usize,
    uv_row_start: usize,
    out_row_start: usize,
    output: &mut [u8],
) {
    for col in 0..width {
        let y_idx = y_row_start + col;
        let uv_idx = uv_row_start + (col / 2) * 2;

        let y = y_data.get(y_idx).copied().unwrap_or(0) as i32;
        let u = uv_data.get(uv_idx).copied().unwrap_or(128) as i32;
        let v = uv_data.get(uv_idx + 1).copied().unwrap_or(128) as i32;

        let c = y - 16;
        let d = u - 128;
        let e = v - 128;

        let r = clamp_u8((298 * c + 409 * e + 128) >> 8);
        let g = clamp_u8((298 * c - 100 * d - 208 * e + 128) >> 8);
        let b = clamp_u8((298 * c + 516 * d + 128) >> 8);

        let out_idx = out_row_start + col * 4;
        if out_idx + 3 < output.len() {
            output[out_idx] = r;
            output[out_idx + 1] = g;
            output[out_idx + 2] = b;
            output[out_idx + 3] = 255;
        }
    }
}
SIMD functions are placeholder implementations calling scalar code.
The AVX2 and SSE2 functions currently just call the scalar implementation, meaning no actual SIMD acceleration is occurring. The dispatch infrastructure is in place, but the vectorized implementations are missing.
This should be documented as a known limitation, or actual SIMD implementations should be provided. If this is intentional (a staged rollout), consider:
- Adding a comment explaining that SIMD implementations are placeholders
- Updating the PR description to clarify the current state
- Or implementing the actual SIMD intrinsics for performance
Based on the PR objectives mentioning "SIMD-parallel YUV→RGBA conversion", users may expect actual SIMD acceleration in this release.
#[cfg(not(any(target_arch = "x86_64", target_arch = "x86")))]
#[allow(clippy::too_many_arguments)]
pub fn nv12_to_rgba_simd_with_progress(
    y_data: &[u8],
    uv_data: &[u8],
    width: u32,
    height: u32,
    y_stride: u32,
    uv_stride: u32,
    output: &mut [u8],
    _progress: Option<Arc<ConversionProgress>>,
) {
    nv12_to_rgba(y_data, uv_data, width, height, y_stride, uv_stride, output);
}
Progress tracking not supported on non-x86 architectures.
The non-x86 fallback ignores the progress parameter, meaning cancellation and progress monitoring won't work on ARM or other architectures. Consider documenting this limitation if it's intentional.
If progress support is desired on all platforms, implement it for the scalar path:
#[cfg(not(any(target_arch = "x86_64", target_arch = "x86")))]
#[allow(clippy::too_many_arguments)]
pub fn nv12_to_rgba_simd_with_progress(
y_data: &[u8],
uv_data: &[u8],
width: u32,
height: u32,
y_stride: u32,
uv_stride: u32,
output: &mut [u8],
- _progress: Option<Arc<ConversionProgress>>,
+ progress: Option<Arc<ConversionProgress>>,
) {
- nv12_to_rgba(y_data, uv_data, width, height, y_stride, uv_stride, output);
+ let height_usize = height as usize;
+ for row in 0..height_usize {
+ if let Some(ref p) = progress && p.is_cancelled() {
+ return;
+ }
+ // Process row (inline the conversion logic or refactor nv12_to_rgba)
+ if let Some(ref p) = progress {
+ p.rows_completed.fetch_add(1, Ordering::Relaxed);
+ }
+ }
}Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
crates/rendering/src/cpu_yuv.rs lines 510-523: the non-x86 fallback
nv12_to_rgba_simd_with_progress currently ignores the progress parameter, so
progress reporting/cancellation doesn't work on ARM; either document this
limitation or implement progress support in the scalar path by removing the
unused underscore from the parameter, propagate the
Option<Arc<ConversionProgress>> into the scalar conversion routine, and update
the scalar nv12_to_rgba implementation to periodically report progress and check
for cancellation (e.g., compute rows/blocks processed, call progress.report(...)
and abort/return early on cancellation). Ensure the function signature stays
compatible and add a short doc comment if you choose to leave it unsupported.
#[test]
fn test_cancellation() {
    let progress = Arc::new(ConversionProgress::new(1080));

    let width = 1920u32;
    let height = 1080u32;
    let y_stride = 1920u32;
    let uv_stride = 1920u32;

    let y_data: Vec<u8> = vec![128; (y_stride * height) as usize];
    let uv_data: Vec<u8> = vec![128; (uv_stride * height / 2) as usize];

    let mut output = vec![0u8; (width * height * 4) as usize];

    progress.cancel();

    nv12_to_rgba_simd_with_progress(
        &y_data,
        &uv_data,
        width,
        height,
        y_stride,
        uv_stride,
        &mut output,
        Some(progress.clone()),
    );

    let rows_done = progress.rows_completed.load(Ordering::Relaxed);
    assert!(rows_done < height as usize);
}
Cancellation test may be non-deterministic.
The test cancels before starting conversion (line 1172) and expects rows_done < height (line 1186). While this should work since cancellation is checked at the start of each row, there's a theoretical race condition if the check happens after row processing begins.
Consider making the test more robust:
#[test]
fn test_cancellation() {
let progress = Arc::new(ConversionProgress::new(1080));
let width = 1920u32;
let height = 1080u32;
let y_stride = 1920u32;
let uv_stride = 1920u32;
let y_data: Vec<u8> = vec![128; (y_stride * height) as usize];
let uv_data: Vec<u8> = vec![128; (uv_stride * height / 2) as usize];
let mut output = vec![0u8; (width * height * 4) as usize];
progress.cancel();
nv12_to_rgba_simd_with_progress(
&y_data,
&uv_data,
width,
height,
y_stride,
uv_stride,
&mut output,
Some(progress.clone()),
);
let rows_done = progress.rows_completed.load(Ordering::Relaxed);
- assert!(rows_done < height as usize);
+ assert!(
+ rows_done < height as usize,
+ "Expected early termination due to cancellation, but {} rows completed",
+ rows_done
+ );
}Or test cancellation mid-stream by canceling from another thread after a delay.
fn align_dimension(dim: u32) -> u32 {
    dim.div_ceil(TEXTURE_SIZE_PADDING) * TEXTURE_SIZE_PADDING
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Read the file to understand align_dimension and its usage
cat -n crates/rendering/src/yuv_converter.rs | head -120
Repository: CapSoftware/Cap
Length of output: 4337
🏁 Script executed:
#!/bin/bash
# Search for GPU-related limit constants and validations
rg -i "gpu.*max|max.*texture|texture.*size.*limit" --type=rs -C2
Repository: CapSoftware/Cap
Length of output: 85
🏁 Script executed:
#!/bin/bash
# Continue reading the file to see more of validate_dimensions and how align_dimension is used
cat -n crates/rendering/src/yuv_converter.rs | sed -n '97,150p'
Repository: CapSoftware/Cap
Length of output: 2262
🏁 Script executed:
#!/bin/bash
# Search for calls to align_dimension to see where it's used
rg "align_dimension" crates/rendering/src/yuv_converter.rs -C3
Repository: CapSoftware/Cap
Length of output: 524
🏁 Script executed:
#!/bin/bash
# Search for where ensure_texture_size is called to confirm validation happens
rg "ensure_texture_size|validate_dimensions" crates/rendering/src/yuv_converter.rs -C3
Repository: CapSoftware/Cap
Length of output: 3252
🏁 Script executed:
#!/bin/bash
# Read more of the file to see the full context of ensure_texture_size and how it's used
cat -n crates/rendering/src/yuv_converter.rs | sed -n '150,250p'
Repository: CapSoftware/Cap
Length of output: 5229
🏁 Script executed:
#!/bin/bash
# Check the full ensure_texture_size implementation to see if it re-validates after alignment
cat -n crates/rendering/src/yuv_converter.rs | sed -n '260,310p'
Repository: CapSoftware/Cap
Length of output: 2701
🏁 Script executed:
#!/bin/bash
# Find where allocated_width and allocated_height are set to understand the bounds
rg "allocated_width\s*=" crates/rendering/src/yuv_converter.rs -C2
Repository: CapSoftware/Cap
Length of output: 240
🏁 Script executed:
#!/bin/bash
# Find the complete ensure_texture_size function
rg -A30 "fn ensure_texture_size" crates/rendering/src/yuv_converter.rs
Repository: CapSoftware/Cap
Length of output: 1438
🏁 Script executed:
#!/bin/bash
# Check what the limits are for texture creation and if alignment could violate them
rg "MAX_TEXTURE" crates/rendering/src/yuv_converter.rs -C1
Repository: CapSoftware/Cap
Length of output: 343
🏁 Script executed:
#!/bin/bash
# Check if texture creation functions handle dimension checks or clamping
rg -A5 "create_y_texture|create_uv_texture|create_u_texture|create_v_texture|create_output_textures" crates/rendering/src/yuv_converter.rs | head -40
Repository: CapSoftware/Cap
Length of output: 1557
🏁 Script executed:
#!/bin/bash
# Verify the math: what is 4320 aligned?
python3 << 'EOF'
max_height = 4320
padding = 64
aligned = ((max_height + padding - 1) // padding) * padding
print(f"MAX_TEXTURE_HEIGHT: {max_height}")
print(f"TEXTURE_SIZE_PADDING: {padding}")
print(f"Aligned height: {aligned}")
print(f"Exceeds original? {aligned > max_height}")
print()
print("Calculation breakdown:")
print(f" {max_height} / {padding} = {max_height / padding}")
print(f" ceil({max_height / padding}) = {-(-max_height // padding)}")
print(f" {-(-max_height // padding)} * {padding} = {-(-max_height // padding) * padding}")
EOF
Repository: CapSoftware/Cap
Length of output: 145
🏁 Script executed:
#!/bin/bash
# Get the full create_y_texture to see if there's any validation
rg -A15 "fn create_y_texture" crates/rendering/src/yuv_converter.rs
Repository: CapSoftware/Cap
Length of output: 617
🏁 Script executed:
#!/bin/bash
# Check if create_texture validates the dimensions against GPU limits
rg -B5 -A10 "device.create_texture" crates/rendering/src/yuv_converter.rs | head -50
Repository: CapSoftware/Cap
Length of output: 1790
Alignment causes validated dimensions to exceed GPU limits.
The validation ensures dimensions don't exceed effective_max_height (4320 in this case), but align_dimension(4320) returns 4352: 4320 is not a multiple of 64, so div_ceil rounds it up to 68 × 64 = 4352, which exceeds the limit. Since alignment happens after validation in ensure_texture_size, dimensions that pass validation can become invalid. Re-validate or clamp dimensions to GPU limits after alignment before creating textures.
🤖 Prompt for AI Agents
In crates/rendering/src/yuv_converter.rs around lines 93 to 95, align_dimension
currently rounds up (e.g., 4320 -> 4352) after validation which can push
dimensions past GPU limits; change ensure_texture_size to perform alignment
first and then re-validate or clamp the aligned dimensions against
effective_max_width/effective_max_height (or return an error) so the final
dimensions used to create textures never exceed GPU limits; apply this for both
width and height and ensure any error messages mention the aligned size when
rejecting or clamping.
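One possible shape of that fix — align first, then clamp to the GPU limit — as a hedged sketch; the constant and field names here are assumptions based on the review, not the PR's code:

fn aligned_and_clamped(dim: u32, padding: u32, gpu_max: u32) -> u32 {
    // Round up to the padding boundary, then never exceed the GPU's hard limit,
    // accepting a slightly under-aligned size near the limit.
    let aligned = dim.div_ceil(padding) * padding;
    aligned.min(gpu_max)
}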
fn ensure_texture_size(&mut self, device: &wgpu::Device, width: u32, height: u32) {
    let required_width = align_dimension(width);
    let required_height = align_dimension(height);

    if required_width <= self.allocated_width && required_height <= self.allocated_height {
        return;
    }

    let new_width = required_width.max(self.allocated_width);
    let new_height = required_height.max(self.allocated_height);

    tracing::info!(
        old_width = self.allocated_width,
        old_height = self.allocated_height,
        new_width = new_width,
        new_height = new_height,
        "Reallocating YUV converter textures for larger video"
    );

    let (y_texture, y_view) = Self::create_y_texture(device, new_width, new_height);
    let (uv_texture, uv_view) = Self::create_uv_texture(device, new_width, new_height);
    let (u_texture, u_view) = Self::create_u_texture(device, new_width, new_height);
    let (v_texture, v_view) = Self::create_v_texture(device, new_width, new_height);
    let (output_textures, output_views) =
        Self::create_output_textures(device, new_width, new_height);

    self.y_texture = y_texture;
    self.y_view = y_view;
    self.uv_texture = uv_texture;
    self.uv_view = uv_view;
    self.u_texture = u_texture;
    self.u_view = u_view;
    self.v_texture = v_texture;
    self.v_view = v_view;
    self.output_textures = output_textures;
    self.output_views = output_views;
    self.allocated_width = new_width;
    self.allocated_height = new_height;
}
Verify aligned dimensions don't exceed GPU limits.
The grow-only strategy is efficient, but consider this edge case:
- Caller validates dimensions → gets (effective_width, effective_height) within GPU limits
- ensure_texture_size aligns these with align_dimension → potentially exceeds limits
For example, if effective_width = gpu_max - 10 and TEXTURE_SIZE_PADDING = 64, the aligned dimension would exceed the GPU limit.
Apply this approach to validate post-alignment:
fn ensure_texture_size(&mut self, device: &wgpu::Device, width: u32, height: u32) {
let required_width = align_dimension(width);
let required_height = align_dimension(height);
+
+ let effective_max = MAX_TEXTURE_WIDTH.min(self.gpu_max_texture_size);
+ let required_width = required_width.min(effective_max);
+ let required_height = required_height.min(effective_max);
     if required_width <= self.allocated_width && required_height <= self.allocated_height {
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In crates/rendering/src/yuv_converter.rs around lines 472 to 510, after
computing required_width/required_height with align_dimension, validate those
post-alignment dimensions against the device GPU limits
(device.limits().max_texture_dimension_2d); if either aligned dimension exceeds
the limit, do not attempt to allocate the larger textures — log an error/tracing
with the attempted and max values and return early (or clamp to the max and log
if you prefer a best-effort allocation), otherwise proceed with the existing
reallocation logic; ensure allocated_width/allocated_height are only updated
when allocation actually succeeds.
Significant improvements to hardware and software video decoding, particularly by dynamically detecting hardware decoder capabilities on Windows and optimizing software decoding threading based on video resolution.