random orthogonal transformation preprocess #1525
Status: Open. enp1s0 wants to merge 53 commits into `rapidsai:release/25.12` from `enp1s0:orth-transform-preprocess`.
Conversation
Add aggregate reporting of NVTX ranges to the output of the benchmark executable.

### Usage

```bash
# Measure the CPU and GPU runtime of all NVTX ranges
nsys launch --trace=cuda,nvtx <ANN_BENCH with arguments>
# Measure only the CPU runtime of all NVTX ranges
nsys launch --trace=nvtx <ANN_BENCH with arguments>
# Do not measure/report any NVTX ranges
<ANN_BENCH with arguments>
# Do not measure/report any NVTX ranges within the benchmark, but use nsys profiling as usual
nsys profile ... <ANN_BENCH with arguments>
```

### Implementation

The PR adds a single module `nvtx_stats.hpp` to the benchmark executable; there are no changes to the library at all. The program leverages the NVIDIA Nsight Systems CLI to collect and export NVTX statistics, and then the SQLite API to aggregate them into the benchmark state:

1. Detect whether the benchmark was run via `nsys launch`; if so, call `nsys start` / `nsys stop` around the benchmark loop; otherwise do nothing.
2. If the report is generated, read it and query all NVTX events and the GPU correlation data using SQLite.
3. Aggregate the NVTX events by their short names (without arguments, to reduce the number of columns).
4. Add them to the benchmark performance counters with the same averaging strategy as the global CPU/GPU runtime.

### Performance cost

If the benchmark is **not** run using `nsys launch`, the new functionality adds virtually zero overhead. Otherwise, there are two sources of overhead:

1. The usual nsys profiling overheads (minimized by disabling unused information via the `nsys start` CLI internally). This affects the reported performance the same way normal nsys profiling does (especially if CUDA tracing is enabled).
2. One or more data collection/export events per benchmark case. These add some extra time to the benchmark run, but do not affect the counters (they are not part of the benchmark loop).

Closes rapidsai#1367

Authors:
- Artem M. Chirkin (https://github.com/achirkin)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#1529
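To make the aggregation step (items 2 and 3 above) concrete, here is a minimal sketch of querying and summing NVTX ranges from an nsys SQLite export. It assumes the report exposes an `NVTX_EVENTS` table with `start`, `end`, and `text` columns; the actual schema can differ across nsys versions, and this is not the benchmark's real code:

```cpp
#include <sqlite3.h>
#include <cstdio>
#include <map>
#include <string>

// Open the exported report and sum the duration of each named NVTX range.
int main() {
  sqlite3* db = nullptr;
  if (sqlite3_open_v2("report.sqlite", &db, SQLITE_OPEN_READONLY, nullptr) != SQLITE_OK)
    return 1;

  const char* sql = "SELECT text, end - start FROM NVTX_EVENTS WHERE end IS NOT NULL;";
  sqlite3_stmt* stmt = nullptr;
  if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) {
    sqlite3_close(db);
    return 1;
  }

  std::map<std::string, long long> total_ns;
  while (sqlite3_step(stmt) == SQLITE_ROW) {
    if (auto* name = sqlite3_column_text(stmt, 0)) {
      std::string key(reinterpret_cast<const char*>(name));
      key = key.substr(0, key.find('('));  // drop arguments: aggregate by short name
      total_ns[key] += sqlite3_column_int64(stmt, 1);
    }
  }
  sqlite3_finalize(stmt);
  sqlite3_close(db);

  for (const auto& [name, ns] : total_ns)
    std::printf("%s: %lld ns\n", name.c_str(), ns);
  return 0;
}
```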
When converting from a DLManagedTensor to an mdspan in our C API, we weren't checking the stride information on the DLManagedTensor. This caused invalid results when passing a strided matrix to functions like cuvsCagraBuild. Fix this and add a unit test. Authors: - Ben Frederickson (https://github.com/benfred) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: rapidsai#1458
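For illustration, here is a sketch (using the standard DLPack header) of the kind of strided input that exposed the bug; the matrix shape and padding are made up for the example:

```cpp
#include <dlpack/dlpack.h>
#include <cstdint>

int main() {
  // A 4x8 float32 matrix whose rows are padded to 16 elements, so the
  // element strides are {16, 1} rather than the contiguous {8, 1}.
  static float data[4 * 16]   = {};
  static int64_t shape[2]     = {4, 8};
  static int64_t strides[2]   = {16, 1};

  DLManagedTensor tensor{};
  tensor.dl_tensor.data    = data;
  tensor.dl_tensor.device  = {kDLCPU, 0};
  tensor.dl_tensor.ndim    = 2;
  tensor.dl_tensor.dtype   = {kDLFloat, 32, 1};  // float32, 1 lane
  tensor.dl_tensor.shape   = shape;
  tensor.dl_tensor.strides = strides;  // the field that was previously not checked

  // Before the fix, passing such a tensor to a function like cuvsCagraBuild
  // read the rows as if they were contiguous, producing invalid results.
  return 0;
}
```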
…ai#1535) This PR supports handling the new main-branch strategy outlined in [RSN 47 - Changes to RAPIDS branching strategy in 25.12](https://docs.rapids.ai/notices/rsn0047/). The `update-version.sh` script now supports two modes, controlled via a CLI param or an ENV var:
- CLI argument: `--run-context=main|release`
- ENV var: `RAPIDS_RUN_CONTEXT=main|release`

xref: rapidsai/build-planning#224 Authors: - Nate Rock (https://github.com/rockhowse) Approvers: - Jake Awe (https://github.com/AyodeAwe) - Corey J. Nolet (https://github.com/cjnolet) - MithunR (https://github.com/mythrocks) URL: rapidsai#1535
This PR introduces **Augmented Core Extraction (ACE)**, an approach proposed by @anaruse for building CAGRA indices on very large datasets that exceed GPU memory capacity. ACE enables users to build high-quality approximate nearest neighbor search indices on datasets that would otherwise be impossible to process on a single GPU. The approach uses host memory if it is large enough and falls back to disk if required. This work is a collaboration: @anaruse, @tfeher, @achirkin, @mfoerste4.

## Algorithm Description

1. **Dataset Partitioning**: The dataset is partitioned using balanced k-means clustering on sampled data. Each vector is assigned to its two closest partition centroids (primary and augmented); a host-side sketch of this assignment follows the usage example below. The primary partitions are non-overlapping, and the augmentation ensures that cross-partition edges are captured in the final graph. Partitions smaller than a minimum threshold are automatically merged with larger partitions to ensure computational efficiency and graph quality; vectors from small partitions are reassigned to the nearest valid partitions.
2. **Per-Partition Graph Building**: For each partition, a sub-index is built independently (the regular `build_knn_graph()` flow) from its primary vectors plus the augmented vectors from neighboring partitions.
3. **Graph Combining**: The per-partition graphs are combined into a single unified CAGRA index. No merging is needed since the primary partitions are non-overlapping. The in-memory variant remaps the local partition IDs to global dataset IDs to create a correct index. The disk variant stores the backward index mappings (`dataset_mapping.bin`), the reordered dataset (`reordered_dataset.bin`), and the optimized CAGRA graph (`cagra_graph.bin`) on disk. The index is then incomplete, as shown by `cuvs::neighbors::index::on_disk()`. The files are stored in `cuvs::neighbors::index::file_directory()`.

The HNSW index serialization was provided by @mfoerste4 in rapidsai#1410, which was merged into this PR. It adds the `serialize_to_hnsw()` routine, which combines the dataset, graph, and mapping. The data is combined on the fly while streaming from disk to disk to minimize the required host memory; the host still needs enough memory to hold the index, though.

## Core Components

- **`ace_build()`**: Main routine that users should call.
- **`ace_get_partition_labels()`**: Performs balanced k-means clustering to assign each vector to its two closest partitions while handling small-partition merging.
- **`ace_create_forward_and_backward_lists()`**: Creates bidirectional ID mappings between original dataset indices and reordered partition-local indices.
- **`ace_set_index_params()`**: Sets the index parameters based on the partition and augmented dataset to ensure efficient KNN graph building.
- **`ace_gather_partition_dataset()`**: In-memory only: gathers the partition and augmented dataset.
- **`ace_adjust_sub_graph_ids()`**: In-memory only: adjusts IDs in the per-partition search graph and stores them in the main search graph.
- **`ace_adjust_final_graph_ids()`**: In-memory only: maps graph neighbor IDs from the reordered space back to original vector IDs.
- **`ace_reorder_and_store_dataset()`**: Disk only: reorders the dataset based on partitions and stores it to disk, using write buffers to improve performance.
- **`ace_load_partition_dataset_from_disk()`**: Disk only: loads the partition and augmented datasets from disk.
- **`file_descriptor` and `ace_read_large_file()` / `ace_write_large_file()`**: RAII file handle and chunked file I/O operations.
- **CAGRA index changes**: Added an `on_disk_` flag and `file_directory_` to the CAGRA index structure to support disk-backed indices.
- **CAGRA parameter changes**: Added `ace_npartitions` and `ace_build_dir` to the CAGRA parameters so users can request ACE and specify which directory should be used if required.

## Usage

### C++ API

```cpp
#include <cuvs/neighbors/cagra.hpp>

using namespace cuvs::neighbors;

// Configure index parameters
cagra::index_params params;
params.ace_npartitions = 10;               // Number of partitions (unset or <= 1 to disable ACE)
params.ace_build_dir   = "/tmp/ace_build"; // Directory for intermediate files (should be a fast NVMe)
params.graph_degree    = 64;
params.intermediate_graph_degree = 128;

// Build ACE index (dataset can be in host memory)
auto dataset = raft::make_host_matrix<float, int64_t>(n_rows, n_cols);
// ... load dataset ...
auto index = cagra::build_ace(res, params, dataset.view(), params.ace_npartitions);

// Search works identically to standard CAGRA if the host has enough memory (index.on_disk() == false)
cagra::search_params search_params;
auto neighbors = raft::make_device_matrix<uint32_t>(res, n_queries, k);
auto distances = raft::make_device_matrix<float>(res, n_queries, k);
cagra::search(res, search_params, index, queries, neighbors.view(), distances.view());
```

### Storage Requirements

1. `cagra_graph.bin`: `n_rows * graph_degree * sizeof(IdxT)`
2. `dataset_mapping.bin`: `n_rows * sizeof(IdxT)`
3. `reordered_dataset.bin`: size of the input dataset
4. `augmented_dataset.bin`: size of the input dataset

Authors:
- Julian Miller (https://github.com/julianmi)
- Anupam (https://github.com/aamijar)
- Tarang Jain (https://github.com/tarang-jain)
- Malte Förster (https://github.com/mfoerste4)
- Jake Awe (https://github.com/AyodeAwe)
- Bradley Dice (https://github.com/bdice)
- Artem M. Chirkin (https://github.com/achirkin)
- Jinsol Park (https://github.com/jinsolp)

Approvers:
- MithunR (https://github.com/mythrocks)
- Robert Maynard (https://github.com/robertmaynard)
- Tamas Bela Feher (https://github.com/tfeher)
- Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1404
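As referenced in step 1 of the algorithm description above, here is a minimal host-side sketch of assigning a vector to its primary and augmented partitions. The function name and the plain linear scan are illustrative only, not the cuvs implementation (which uses balanced k-means on the GPU):

```cpp
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Return the indices of the two closest centroids (squared L2) for one vector:
// the first is the primary partition, the second the augmented partition.
std::pair<std::size_t, std::size_t> two_closest_centroids(
  const std::vector<float>& vec, const std::vector<std::vector<float>>& centroids)
{
  std::size_t best = 0, second = 0;
  float best_d   = std::numeric_limits<float>::max();
  float second_d = std::numeric_limits<float>::max();
  for (std::size_t c = 0; c < centroids.size(); ++c) {
    float d = 0.0f;
    for (std::size_t i = 0; i < vec.size(); ++i) {
      const float diff = vec[i] - centroids[c][i];
      d += diff * diff;
    }
    if (d < best_d) {
      second = best; second_d = best_d;  // demote previous best
      best = c;      best_d = d;
    } else if (d < second_d) {
      second = c; second_d = d;
    }
  }
  return {best, second};  // {primary partition, augmented partition}
}
```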
…i#1538) This updates RMM memory resource includes to use the header path `<rmm/mr/*>` instead of `<rmm/mr/device/*>`. xref: rapidsai/rmm#2141 Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Divye Gala (https://github.com/divyegala) - Corey J. Nolet (https://github.com/cjnolet) URL: rapidsai#1538
Adds new `rocky8-clib-standalone-build` and `rocky8-clib-tests` PR jobs that validate that the C API binaries can be built and that all C tests run correctly. Also adds a new nightly build job that produces the C API binaries. Authors: - Robert Maynard (https://github.com/robertmaynard) - Ben Frederickson (https://github.com/benfred) Approvers: - Jake Awe (https://github.com/AyodeAwe) - Bradley Dice (https://github.com/bdice) URL: rapidsai#1524
Issue: rapidsai/build-planning#130 Ops-Bot-Merge-Barrier: true Authors: - Kyle Edwards (https://github.com/KyleFromNVIDIA) Approvers: - Gil Forsyth (https://github.com/gforsyth) - Bradley Dice (https://github.com/bdice) - Ben Frederickson (https://github.com/benfred) URL: rapidsai#1500
Admin merge as part of NBS cleanup. Replaces rapidsai#1558 --------- Co-authored-by: Nate Rock <[email protected]> Co-authored-by: Bradley Dice <[email protected]> Co-authored-by: Paul Taylor <[email protected]> Co-authored-by: Gil Forsyth <[email protected]>
Forward merge 25.12
Forward-merge release/25.12 into main
Forward-merge release/25.12 into main
Forward-merge release/25.12 into main
…#1566) Updates FAISS patch for RMM memory resource header migration. xref: rapidsai/rmm#2141 Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Divye Gala (https://github.com/divyegala) URL: rapidsai#1566
This ensures that people are properly assigned to review any changes to the C API. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Kyle Edwards (https://github.com/KyleFromNVIDIA) URL: rapidsai#1573
Forward-merge release/25.12 into main
Forward-merge release/25.12 into main
This PR sets conda to use `strict` priority in CI tests. Mixing channel priority is frequently a cause of unexpected errors. Our CI jobs should always use strict priority in order to enforce that conda packages come from local channels with the artifacts built in CI, not mixing with older nightly artifacts from the `rapidsai-nightly` channel or other sources. xref: rapidsai/build-planning#14 Authors: - Bradley Dice (https://github.com/bdice) Approvers: - https://github.com/jakirkham URL: rapidsai#1583
## Summary

- Update FAISS dependency from 1.12.0 to 1.13.0
- Remove thrust include patches already present in FAISS 1.13.0
- All other RMM API compatibility patches still apply cleanly

Verified that updated patches apply cleanly to FAISS v1.13.0. Follow-up to rapidsai#1566. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: rapidsai#1585
`CUVS_ANN_BENCH_USE_FAISS` is now set to OFF if all relevant flags are set to OFF. The status is reported in the CMake log:

-- Finding or building hnswlib
-- Checking for FAISS use in benchmarks...
-- CUVS_ANN_BENCH_USE_FAISS is OFF

Closes rapidsai#1590. Authors: - https://github.com/irina-resh-nvda Approvers: - Tamas Bela Feher (https://github.com/tfeher) URL: rapidsai#1591
Extend the `rocky8-clib-standalone-build` job to include arm64 builds. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - James Lamb (https://github.com/jameslamb) - Kyle Edwards (https://github.com/KyleFromNVIDIA) URL: rapidsai#1570
This PR sets conda to use `strict` priority in CI tests. Mixing channel priority is frequently a cause of unexpected errors. Our CI jobs should always use strict priority in order to enforce that conda packages come from local channels with the artifacts built in CI, not mixing with older nightly artifacts from the `rapidsai-nightly` channel or other sources. xref: rapidsai/build-planning#14 Authors: - Bradley Dice (https://github.com/bdice) Approvers: - James Lamb (https://github.com/jameslamb) URL: rapidsai#1606
`preprocess_data_kernel` in NN Descent had overflow issues; this PR casts `blockIdx.x` to `size_t` to avoid the overflow. Authors: - Jinsol Park (https://github.com/jinsolp) Approvers: - Divye Gala (https://github.com/divyegala) URL: rapidsai#1596
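A small host-side C++ sketch of the arithmetic behind this fix: in CUDA, `blockIdx.x` and `blockDim.x` are 32-bit unsigned values, so their product wraps modulo 2^32 before it is widened unless one operand is cast first (the numbers below are made up to trigger the wrap):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
  // Stand-ins for CUDA's built-in variables, which are 32-bit unsigned.
  uint32_t block_idx = 5'000'000;
  uint32_t block_dim = 1024;

  // The product wraps modulo 2^32 *before* being stored in the wide type.
  uint64_t wrong = block_idx * block_dim;
  // Casting one operand first performs the multiplication in 64 bits.
  uint64_t right = static_cast<size_t>(block_idx) * block_dim;

  std::printf("wrong = %llu, right = %llu\n",
              static_cast<unsigned long long>(wrong),
              static_cast<unsigned long long>(right));
  return 0;
}
```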
Closes rapidsai#1578

This PR refactors the code so that we have pre-compiled code for launching `{unpack/pack}_list_data_kernel`:
- `*.cuh`: declarations for including in other files
- `*_impl.cuh`, `*.cu`: actual implementation

CUDA 12: 1104.44 MB -> 1100.26 MB; CUDA 13: 437.47 MB -> 435.32 MB. Authors: - Jinsol Park (https://github.com/jinsolp) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Divye Gala (https://github.com/divyegala) URL: rapidsai#1609
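A minimal sketch of the declaration/implementation split described above, with hypothetical file and function names: one `.cu` translation unit instantiates the templates so other files can include only the cheap declaration header instead of recompiling the kernels:

```cpp
// pack_list_data.cuh -- declarations only; cheap to include from other files.
template <typename T>
void launch_pack_list_data(const T* in, T* out, int n);

// pack_list_data_impl.cuh -- actual kernel and launcher bodies.
template <typename T>
__global__ void pack_list_data_kernel(const T* in, T* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = in[i]; }
}

template <typename T>
void launch_pack_list_data(const T* in, T* out, int n)
{
  pack_list_data_kernel<T><<<(n + 255) / 256, 256>>>(in, out, n);
}

// pack_list_data.cu -- the single translation unit that compiles the kernels;
// explicit instantiation pins the pre-compiled symbols into the library.
#include "pack_list_data_impl.cuh"
template void launch_pack_list_data<float>(const float*, float*, int);
```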
Closes rapidsai#1586. The NN Descent Python wrapper fails the `_check_input_array` check when given fp16 data. Authors: - Jinsol Park (https://github.com/jinsolp) Approvers: - Divye Gala (https://github.com/divyegala) URL: rapidsai#1616
This makes sure we don't leak unneeded dependencies in our `PUBLIC` target_link_libraries for cuvs_c Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Divye Gala (https://github.com/divyegala) - Kyle Edwards (https://github.com/KyleFromNVIDIA) URL: rapidsai#1614
The sparse/gram APIs were moved from RAFT in rapidsai#463. However, the CMake configuration was never updated to compile the tests. Authors: - Anupam (https://github.com/aamijar) Approvers: - Robert Maynard (https://github.com/robertmaynard) URL: rapidsai#1611
Based on rapidsai/raft#2836 Authors: - Divye Gala (https://github.com/divyegala) Approvers: - Bradley Dice (https://github.com/bdice) - Dante Gama Dessavre (https://github.com/dantegd) URL: rapidsai#1605
This PR removes pre-release upper bound pinnings from non-RAPIDS dependencies. The presence of pre-release indicators like `<...a0` tells pip "pre-releases are OK, even if `--pre` was not passed to pip install." RAPIDS projects currently use such constraints in situations where it's not actually desirable to get pre-releases. xref: rapidsai/build-planning#144 Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: rapidsai#1618
This PR introduces a random orthogonal transformation as a preprocessing step for CAGRA-Q and similar methods. Depending on the dataset, this preprocessing yields higher recall in CAGRA-Q search. The implementation generates an orthogonal matrix via the QR decomposition of a random matrix and then multiplies the dataset matrix by it.
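As a rough illustration of the idea (not the PR's GPU implementation), here is a host-side sketch using Eigen: QR-decompose a random matrix to obtain an orthogonal `Q`, rotate the dataset rows by it, and observe that pairwise distances are preserved:

```cpp
#include <Eigen/Dense>
#include <iostream>

int main() {
  const int n = 1000, d = 64;
  Eigen::MatrixXf X = Eigen::MatrixXf::Random(n, d);  // stand-in dataset

  // QR decomposition of a random d x d matrix yields an orthogonal Q (Q^T Q = I).
  Eigen::MatrixXf A = Eigen::MatrixXf::Random(d, d);
  Eigen::HouseholderQR<Eigen::MatrixXf> qr(A);
  Eigen::MatrixXf Q = qr.householderQ();

  // Preprocess: rotate every dataset row by Q.
  Eigen::MatrixXf Xt = X * Q;

  // Orthogonal transforms preserve pairwise L2 distances, so the exact
  // nearest neighbors are unchanged.
  std::cout << (X.row(0) - X.row(1)).norm() << " == "
            << (Xt.row(0) - Xt.row(1)).norm() << "\n";
  return 0;
}
```

Because `Q` is orthogonal, the exact nearest-neighbor structure is intact; the rotation only mixes coordinates, which tends to balance per-dimension value ranges and is presumably why quantized (CAGRA-Q) search recall improves on some datasets.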