WARNING: THIS SITE IS A MIRROR OF GITHUB.COM / IT CANNOT LOGIN OR REGISTER ACCOUNTS / THE CONTENTS ARE PROVIDED AS-IS / THIS SITE ASSUMES NO RESPONSIBILITY FOR ANY DISPLAYED CONTENT OR LINKS / IF YOU FOUND SOMETHING MAY NOT GOOD FOR EVERYONE, CONTACT ADMIN AT ilovescratch@foxmail.com
Skip to content
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
365 changes: 364 additions & 1 deletion src/bam/mod.rs
Original file line number Diff line number Diff line change
@@ -1,9 +1,12 @@
//! Types and utilities for working with BAM/SAM alignment data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Does this need to be re-exported in src/lib.rs?

//!
//! This module provides the [`Template`] struct for grouping alignment records
//! by query name, following the pattern established by fgbio (Scala) and fgpyo (Python).
//! by query name, and [`TemplateIterator`] for iterating over templates from a
//! query-name grouped BAM file. These follow the patterns established by
//! fgbio (Scala) and fgpyo (Python).

use rust_htslib::bam::Record;
use std::iter::Peekable;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nice that this is part of the std lib!

use thiserror::Error;

/// Errors that can occur when building a [`Template`].
Expand Down Expand Up @@ -33,6 +36,18 @@ pub enum TemplateError {
},
}

/// Errors that can occur when iterating over templates.
#[derive(Error, Debug)]
pub enum TemplateIteratorError {
/// Error building a template from records.
#[error("Failed to build template: {0}")]
TemplateBuildError(#[from] TemplateError),

/// Error reading a BAM record.
#[error("Failed to read BAM record: {0}")]
BamReadError(#[from] rust_htslib::errors::Error),
}

Comment on lines +39 to +50
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nh13 @tfenne I am not sure as to your preference for Error specificity in Rust - is this okay, or would you prefer they be wrapped into FgError?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference, and fwiw, I would have done what you've done

/// A collection of alignment records for a single template (query name).
///
/// A `Template` groups all BAM records sharing the same query name, organizing them
Expand Down Expand Up @@ -229,6 +244,147 @@ impl Template {
}
}

/// An iterator that yields [`Template`] instances from a query-name grouped BAM reader.
///
/// This iterator consumes records from a BAM reader and groups them by query name,
/// yielding one `Template` per unique query name.
///
/// # Requirements
///
/// The input BAM must be **query-name sorted or grouped** (i.e., all records with the
/// same query name must be adjacent). The iterator does NOT sort records internally.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion:

Can we be more explicit about what happens with unsorted input?

/// # Requirements
///
/// The input BAM must be **query-name sorted or grouped** (i.e., all records with the
/// same query name must be adjacent). The iterator does NOT sort records internally.
///
/// **Warning:** If records are not properly grouped, templates may be split across
/// multiple `Template` instances, or worse, different templates' records may be
/// incorrectly combined.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Is the updated documentation sufficient, or would you also like to add a strict flag to raise an error if R1 and R2 do not both appear? (maybe paired_strict?)

or worse, different templates' records may be incorrectly combined.

I don't think this is true, fwiw - I think the only consequence would be a Template containing a partial subset of the alignments associated with a template. I don't think there's a code path that would permit multiple alignments with different querynames being assigned to the same Template.

///
/// # Example
///
/// ```ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: shall we use no_run instead to at least compile this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make this runnable too.

/// use fgoxide::bam::TemplateIterator;
/// use rust_htslib::bam::Reader;
///
/// let reader = Reader::from_path("input.bam")?;
/// for template in TemplateIterator::new(reader.records()) {
/// let template = template?;
/// println!("Template: {:?}", template.name());
/// }
/// ```
pub struct TemplateIterator<I>
where
I: Iterator<Item = Result<Record, rust_htslib::errors::Error>>,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't convinced that this should be an iterator over Result<Record, E> instead of an iterator over Record, because the return type of Reader.records() is Records, but claude pointed out that where Records implements Iterator, it's over Result<Record, E>.

https://docs.rs/rust-htslib/latest/rust_htslib/bam/struct.Records.html

https://github.com/rust-bio/rust-htslib/blob/8f1cdd75c301cb1e0ae54125a33de065c2550225/src/tbx/mod.rs#L328-L346

{
inner: Peekable<I>,
}

impl<I> TemplateIterator<I>
where
I: Iterator<Item = Result<Record, rust_htslib::errors::Error>>,
{
/// Creates a new `TemplateIterator` from a record iterator.
///
/// The input iterator must yield records in query-name sorted or grouped order.
/// All records with the same query name must be adjacent in the input.
///
/// # Arguments
///
/// * `iter` - An iterator over BAM records (e.g., from `Reader::records()`)
///
/// # Example
///
/// ```ignore
/// use fgoxide::bam::TemplateIterator;
/// use rust_htslib::bam::Reader;
///
/// let reader = Reader::from_path("queryname_sorted.bam")?;
/// let templates = TemplateIterator::new(reader.records());
/// ```
pub fn new(iter: I) -> Self {
Self { inner: iter.peekable() }
}
}

impl<I> Iterator for TemplateIterator<I>
where
I: Iterator<Item = Result<Record, rust_htslib::errors::Error>>,
{
type Item = Result<Template, TemplateIteratorError>;

fn next(&mut self) -> Option<Self::Item> {
// Peek to get the current query name
let current_name = match self.inner.peek() {
Some(Ok(rec)) => rec.qname().to_vec(),
Some(Err(_)) => {
// Consume and propagate the error
let err = self.inner.next().unwrap().unwrap_err();
return Some(Err(TemplateIteratorError::BamReadError(err)));
}
None => return None,
};

// Collect all records with the same query name
let mut recs = Vec::new();
loop {
match self.inner.peek() {
Some(Ok(rec)) if rec.qname() == current_name.as_slice() => {
// Safe to unwrap: we just peeked and confirmed it's Ok
recs.push(self.inner.next().unwrap().unwrap());
}
Some(Ok(_)) => {
// Different query name, stop collecting
break;
}
Some(Err(_)) => {
// Error encountered - but we have records collected already
// Build template from what we have, error will be returned on next call
break;
}
None => {
// End of iterator
break;
}
}
}

// Build the template from collected records
Some(Template::build(recs).map_err(TemplateIteratorError::TemplateBuildError))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion:
We could add size_hint() as it helps collect() pre-allocate capacity (eg. https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.size_hint)

size_hint() is primarily intended to be used for optimizations such as reserving space for the elements of the iterator, but must not be trusted to e.g., omit bounds checks in unsafe code. An incorrect implementation of size_hint() should not lead to memory safety violations.

Suggested change
}
}
fn size_hint(&self) -> (usize, Option<usize>) {
// Lower bound is 0 (could be all same qname)
// Upper bound is inner's upper bound (each record could be its own template)
let (_, upper) = self.inner.size_hint();
(0, upper)
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude suggested I do this as well, but I didn't know enough to determine if it was worthwhile or common practice!

Is this a good habit to be getting in? cc @theJasonFan

}

/// Extension trait for creating a [`TemplateIterator`] from BAM record iterators.
///
/// This trait provides an ergonomic way to convert a BAM record iterator into
/// a template iterator using method chaining.
///
/// # Example
///
/// ```ignore
/// use fgoxide::bam::IntoTemplateIterator;
/// use rust_htslib::bam::Reader;
///
/// let reader = Reader::from_path("queryname_sorted.bam")?;
/// for template in reader.records().templates() {
/// let template = template?;
/// // process template...
/// }
/// ```
Comment on lines +339 to +355
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super cute and I would not have discovered it on my own for a while. Thanks Claude!

pub trait IntoTemplateIterator: Sized {
/// The type of the inner iterator.
type InnerIter: Iterator<Item = Result<Record, rust_htslib::errors::Error>>;

/// Converts this iterator into a [`TemplateIterator`].
///
/// The input must be query-name sorted or grouped.
fn templates(self) -> TemplateIterator<Self::InnerIter>;
}

impl<I> IntoTemplateIterator for I
where
I: Iterator<Item = Result<Record, rust_htslib::errors::Error>>,
{
type InnerIter = I;

fn templates(self) -> TemplateIterator<Self::InnerIter> {
TemplateIterator::new(self)
}
}

#[cfg(test)]
mod tests {
use super::*;
Expand Down Expand Up @@ -607,4 +763,211 @@ mod tests {
assert!(recs[5].is_supplementary());
}
}

mod template_iterator_tests {
use super::*;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd feel better if we had one test that read from an actual BAM file. @nh13 @tfenne I know the general preference is to synthesize test data on the fly, but would you be open to having one test case to read a small BAM? Maybe two read pairs? Otherwise we're not directly testing that TemplateIterator can wrap rust-htslib's Reader::from_path

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2cents is that a small SAM is good enough for now, as long as we log an issue that we should replace this with a Sam Builder later.


/// Helper to wrap records in Ok() results for the iterator
fn records_to_results(
recs: Vec<Record>,
) -> Vec<Result<Record, rust_htslib::errors::Error>> {
recs.into_iter().map(Ok).collect()
}

#[test]
fn test_single_template_paired_end() {
let r1 = make_record(b"read1", PAIRED | FIRST_IN_PAIR);
let r2 = make_record(b"read1", PAIRED | LAST_IN_PAIR);

let results = records_to_results(vec![r1, r2]);
let mut iter = TemplateIterator::new(results.into_iter());

let template = iter.next().unwrap().unwrap();
assert_eq!(template.name(), Some(b"read1".as_slice()));
assert!(template.r1.is_some());
assert!(template.r2.is_some());

assert!(iter.next().is_none());
}

#[test]
fn test_multiple_templates() {
let r1a = make_record(b"read1", PAIRED | FIRST_IN_PAIR);
let r2a = make_record(b"read1", PAIRED | LAST_IN_PAIR);
let r1b = make_record(b"read2", PAIRED | FIRST_IN_PAIR);
let r2b = make_record(b"read2", PAIRED | LAST_IN_PAIR);
let r1c = make_record(b"read3", PAIRED | FIRST_IN_PAIR);
let r2c = make_record(b"read3", PAIRED | LAST_IN_PAIR);

let results = records_to_results(vec![r1a, r2a, r1b, r2b, r1c, r2c]);
let templates: Vec<_> =
TemplateIterator::new(results.into_iter()).map(|r| r.unwrap()).collect();

assert_eq!(templates.len(), 3);
assert_eq!(templates[0].name(), Some(b"read1".as_slice()));
assert_eq!(templates[1].name(), Some(b"read2".as_slice()));
assert_eq!(templates[2].name(), Some(b"read3".as_slice()));
}

#[test]
fn test_single_end_reads() {
let r1 = make_record(b"frag1", 0);
let r2 = make_record(b"frag2", 0);

let results = records_to_results(vec![r1, r2]);
let templates: Vec<_> =
TemplateIterator::new(results.into_iter()).map(|r| r.unwrap()).collect();

assert_eq!(templates.len(), 2);
assert!(templates[0].r1.is_some());
assert!(templates[0].r2.is_none());
assert!(templates[1].r1.is_some());
assert!(templates[1].r2.is_none());
}

#[test]
fn test_template_with_secondaries_and_supplementaries() {
let r1_primary = make_record(b"read1", PAIRED | FIRST_IN_PAIR);
let r2_primary = make_record(b"read1", PAIRED | LAST_IN_PAIR);
let r1_sec = make_record(b"read1", PAIRED | FIRST_IN_PAIR | SECONDARY);
let r1_supp = make_record(b"read1", PAIRED | FIRST_IN_PAIR | SUPPLEMENTARY);

let results = records_to_results(vec![r1_primary, r2_primary, r1_sec, r1_supp]);
let mut iter = TemplateIterator::new(results.into_iter());

let template = iter.next().unwrap().unwrap();
assert!(template.r1.is_some());
assert!(template.r2.is_some());
assert_eq!(template.r1_secondaries.len(), 1);
assert_eq!(template.r1_supplementaries.len(), 1);

assert!(iter.next().is_none());
}

#[test]
fn test_empty_iterator() {
let results: Vec<Result<Record, rust_htslib::errors::Error>> = vec![];
let mut iter = TemplateIterator::new(results.into_iter());

assert!(iter.next().is_none());
}

#[test]
fn test_single_record() {
let r1 = make_record(b"read1", PAIRED | FIRST_IN_PAIR);

let results = records_to_results(vec![r1]);
let mut iter = TemplateIterator::new(results.into_iter());

let template = iter.next().unwrap().unwrap();
assert!(template.r1.is_some());
assert!(template.r2.is_none());

assert!(iter.next().is_none());
}

#[test]
fn test_extension_trait() {
let r1 = make_record(b"read1", PAIRED | FIRST_IN_PAIR);
let r2 = make_record(b"read1", PAIRED | LAST_IN_PAIR);

let results = records_to_results(vec![r1, r2]);
// Use the extension trait
let templates: Vec<_> = results.into_iter().templates().map(|r| r.unwrap()).collect();

assert_eq!(templates.len(), 1);
assert!(templates[0].r1.is_some());
assert!(templates[0].r2.is_some());
}

#[test]
fn test_mixed_templates_varying_sizes() {
// Template 1: paired with secondaries
let t1_r1 = make_record(b"template1", PAIRED | FIRST_IN_PAIR);
let t1_r2 = make_record(b"template1", PAIRED | LAST_IN_PAIR);
let t1_sec = make_record(b"template1", PAIRED | FIRST_IN_PAIR | SECONDARY);

// Template 2: single-end
let t2_r1 = make_record(b"template2", 0);

// Template 3: paired only
let t3_r1 = make_record(b"template3", PAIRED | FIRST_IN_PAIR);
let t3_r2 = make_record(b"template3", PAIRED | LAST_IN_PAIR);

let results = records_to_results(vec![t1_r1, t1_r2, t1_sec, t2_r1, t3_r1, t3_r2]);
let templates: Vec<_> =
TemplateIterator::new(results.into_iter()).map(|r| r.unwrap()).collect();

assert_eq!(templates.len(), 3);

// Template 1
assert_eq!(templates[0].name(), Some(b"template1".as_slice()));
assert!(templates[0].r1.is_some());
assert!(templates[0].r2.is_some());
assert_eq!(templates[0].r1_secondaries.len(), 1);

// Template 2
assert_eq!(templates[1].name(), Some(b"template2".as_slice()));
assert!(templates[1].r1.is_some());
assert!(templates[1].r2.is_none());

// Template 3
assert_eq!(templates[2].name(), Some(b"template3".as_slice()));
assert!(templates[2].r1.is_some());
assert!(templates[2].r2.is_some());
}

#[test]
fn test_template_build_error_propagates() {
// Two primary R1s should cause a TemplateBuildError
let r1a = make_record(b"read1", PAIRED | FIRST_IN_PAIR);
let r1b = make_record(b"read1", PAIRED | FIRST_IN_PAIR);

let results = records_to_results(vec![r1a, r1b]);
let mut iter = TemplateIterator::new(results.into_iter());

let result = iter.next().unwrap();
assert!(result.is_err());
assert!(matches!(
result.unwrap_err(),
TemplateIteratorError::TemplateBuildError(TemplateError::MultiplePrimaryR1 { .. })
));
}

#[test]
fn test_iterator_continues_after_valid_templates() {
// First template is valid, second has records too
let r1a = make_record(b"read1", PAIRED | FIRST_IN_PAIR);
let r2a = make_record(b"read1", PAIRED | LAST_IN_PAIR);
let r1b = make_record(b"read2", PAIRED | FIRST_IN_PAIR);

let results = records_to_results(vec![r1a, r2a, r1b]);
let mut iter = TemplateIterator::new(results.into_iter());

// First template
let t1 = iter.next().unwrap().unwrap();
assert_eq!(t1.name(), Some(b"read1".as_slice()));

// Second template
let t2 = iter.next().unwrap().unwrap();
assert_eq!(t2.name(), Some(b"read2".as_slice()));

// No more
assert!(iter.next().is_none());
}

#[test]
fn test_only_r2_template() {
// A template with only R2 (missing R1)
let r2 = make_record(b"read1", PAIRED | LAST_IN_PAIR);

let results = records_to_results(vec![r2]);
let mut iter = TemplateIterator::new(results.into_iter());

let template = iter.next().unwrap().unwrap();
assert!(template.r1.is_none());
assert!(template.r2.is_some());
assert_eq!(template.name(), Some(b"read1".as_slice()));
}
}
}
Loading