WARNING: THIS SITE IS A MIRROR OF GITHUB.COM / IT CANNOT LOGIN OR REGISTER ACCOUNTS / THE CONTENTS ARE PROVIDED AS-IS / THIS SITE ASSUMES NO RESPONSIBILITY FOR ANY DISPLAYED CONTENT OR LINKS / IF YOU FOUND SOMETHING MAY NOT GOOD FOR EVERYONE, CONTACT ADMIN AT ilovescratch@foxmail.com
Skip to content

Conversation

@BradyAJohnston
Copy link
Member

@BradyAJohnston BradyAJohnston commented Dec 22, 2025

Changes made in this Pull Request:

This updates the get_hbond_map function to use MDAnalysis.lib.distances.capped_distance to speed up distance calculations. Nothing changes in the API, but rather than computing a pairwise distance matrix of every residue against every other residue, we can first construct a KDTree for both the N and O atoms and use that for finding relevant pairs within a certain cutoff distance. The returned value from get_hbond_map remains the same but we have avoided having to compute distances and energies between residues that would never form an HBond to begin with.

In my benchmarks running just the get_hbond_map function this produces a speedup from 2-5x, with that difference continuing to increase with size of the protein system:

develop

Scaling benchmark:
Residues     Time (ms)       Time/Res (μs)
------------------------------------------------------------
50           0.175           3.505
100          0.559           5.593
150          1.201           8.007
200          2.537           12.686
250          3.966           15.864
300          5.882           19.608
============================================================

This PR

Scaling benchmark:
Residues     Time (ms)       Time/Res (μs)
------------------------------------------------------------
50           0.077           1.541
100          0.164           1.638
150          0.297           1.980
200          0.490           2.450
250          0.712           2.849
300          1.161           3.870
============================================================

PR Checklist

  • Issue raised/referenced?
  • Tests updated/added?
  • Documentation updated/added?
  • package/CHANGELOG file updated?
  • Is your name in package/AUTHORS? (If it is not, add it!)

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--5182.org.readthedocs.build/en/5182/

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello there first time contributor! Welcome to the MDAnalysis community! We ask that all contributors abide by our Code of Conduct and that first time contributors introduce themselves on GitHub Discussions so we can get to know you. You can learn more about participating here. Please also add yourself to package/AUTHORS as part of this PR.

@codecov
Copy link

codecov bot commented Dec 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.72%. Comparing base (9fb8b0a) to head (bdaadab).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #5182      +/-   ##
===========================================
- Coverage    92.72%   92.72%   -0.01%     
===========================================
  Files          180      180              
  Lines        22472    22470       -2     
  Branches      3188     3189       +1     
===========================================
- Hits         20838    20835       -3     
- Misses        1176     1177       +1     
  Partials       458      458              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

hmap = np.tile(h_1, (1, 1, n)).reshape(n, n, 3)
# We can use KDTree to find candidate pairs within cutoff distance without having to
# compute a full pairwise distance matrix, much faster especially for larger proteins
n_kdtree = KDTree(n_atoms)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the original code was vendored from somewhere else, so it explains the heavy reliance on linalg.norm and the lack of PBC awareness - however, now that we're making changes, should we be using one of our lib.distances calls (maybe capped_distance) instead?

Copy link
Member Author

@BradyAJohnston BradyAJohnston Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not at all experienced with lib.distances are you talking about for the KDTree / potential pair generation or for the actual distance computation? If just for computing distances then on the KDTree generation we would already be discarding some potential across-boundary bonds.

We should be able to provide simple periodic checking with KDTree you can specify a boxsize parameter. But this wouldn't take into account the more complex boundaries we might come across?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I just dug in and found capped_distance which should do exactly what we are after with better periodic boundary checking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The performance isn't as good but also as expected with more checks. Still a large improvement over current.

============================================================
Scaling benchmark:
Residues     Time (ms)       Time/Res (μs)
------------------------------------------------------------
50           0.142           2.835
100          0.291           2.906
150          0.489           3.257
200          0.736           3.682
250          1.056           4.226
300          1.629           5.429
============================================================

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've currently updated the pair search to use capped_distances - but the actual distance computation still isn't taking into account periodic boundaries. Is there a periodic boundary aware distance calculation that doesn't do "everything vs everything" which distance_array seems to do? Ideally we would avoid repeatedly calling capped_distance as we have already constructed the KDTree once and can just use those results for computation later on, but still taking into account periodic boundary crossing.

@marinegor
Copy link
Contributor

@BradyAJohnston wow, that's super impressive results, thanks! Some comments:

but the actual distance computation still isn't taking into account periodic boundaries. Is there a periodic boundary aware distance calculation that doesn't do "everything vs everything" which distance_array seems to do?

Just to make sure I understand this correctly: the current implementation is plain wrong when the boundary condition is crossed, since it just computes the wrong distance, right? I mean this part:

    # compute distances and energy as previously but only for the potential pairs
    # still returning the same energy matrix that would have otherwise been made
    o_indices = pairs[:, 1]
    n_indices = pairs[:, 0]
    d_on = np.linalg.norm(o_atoms[o_indices] - n_atoms[n_indices], axis=-1)
    d_ch = np.linalg.norm(c_atoms[o_indices] - h_1[n_indices], axis=-1)
    d_oh = np.linalg.norm(o_atoms[o_indices] - h_1[n_indices], axis=-1)
    d_cn = np.linalg.norm(c_atoms[o_indices] - n_atoms[n_indices], axis=-1)

so we're looking for periodic contition aware np.linalg.norm replacement? Or can we just put capped_distance or smth instead of -, like

    d_on = np.linalg.norm(capped_distance(o_atoms[o_indices], n_atoms[n_indices]))
    ...

When defaulting to self._trajectory.ts.dimensions, I'd perhaps consider changing / removing it if the structure comes from RCSB -- cryoEM structures default to some hardcoded unit cell which can even be too small, so it might break things.

@BradyAJohnston
Copy link
Member Author

Hi @marinegor yes if DSSP is currently run on a protein that has secondary structure over the periodic cell then it would incorrectly not assign it.

We could just have multiple capped_distance calls which would solve the issue with periodic crossing, but each time we would be re-building a KDTree when ideally we can just re-use the pairs from the first capped_distance call which gives the relevant pairs to check already.

If there was just a periodic_distance(c_atoms[o_indices] - h_1[n_indices]) that we can call then we compute the distance and allow for periodic crossing.

In the current setup while we do detect periodic bonds to check - the results end up being wrong like you say.

@marinegor
Copy link
Contributor

If there was just a periodic_distance(c_atoms[o_indices] - h_1[n_indices]) that we can call then we compute the distance and allow for periodic crossing.

perhaps there are some functions in distopia (namely, distopia.distances_ortho that you could use directly, or am I wrong: https://www.mdanalysis.org/distopia/api)

In the current setup while we do detect periodic bonds to check - the results end up being wrong like you say.

do you see this being a separate issue, perhaps if you don't find a quick solution to use among distopia library? I feel like this PR doesn't have to be blocked by this discussion.

@IAlibay
Copy link
Member

IAlibay commented Dec 23, 2025

If there was just a periodic_distance(c_atoms[o_indices] - h_1[n_indices]) that we can call then we compute the distance and allow for periodic crossing.

@BradyAJohnston have a look at minimize_vectors https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/lib/distances.py#L2106

@IAlibay
Copy link
Member

IAlibay commented Dec 23, 2025

If there was just a periodic_distance(c_atoms[o_indices] - h_1[n_indices]) that we can call then we compute the distance and allow for periodic crossing.

perhaps there are some functions in distopia (namely, distopia.distances_ortho that you could use directly, or am I wrong: https://www.mdanalysis.org/distopia/api)

Please don't try to use distopia directly, it's meant to be an (optional for now) backend replacement.

@BradyAJohnston
Copy link
Member Author

Thanks for the suggestion @IAlibay - this would then just become as such:

    d_on = np.linalg.norm(minimize_vectors(o_atoms[o_indices] - n_atoms[n_indices], box=box), axis=-1)
    d_ch = np.linalg.norm(minimize_vectors(c_atoms[o_indices] - h_1[n_indices], box=box), axis=-1)
    d_oh = np.linalg.norm(minimize_vectors(o_atoms[o_indices] - h_1[n_indices], box=box), axis=-1)
    d_cn = np.linalg.norm(minimize_vectors(c_atoms[o_indices] - n_atoms[n_indices], box=box), axis=-1)

The usage of np.linalg.norm should now be fine we are taking into account the periodic crossing?

@marinegor
Copy link
Contributor

marinegor commented Dec 23, 2025

The usage of np.linalg.norm should now be fine we are taking into account the periodic crossing?

I guess so, yeah -- it's just a norm calculation, there should not be any problem with that as long as the vector itself is boundary-corrected.

As for cryoEM default, the placeholder unit cell is (1, 1, 1, 90, 90, 90), so you can explicitly compare with that and e.g. default to no box calculations if it's set.

UPD: added this remark to changes request.

dssp = assign(
coords,
donor_mask=self._donor_mask,
box=self._trajectory.ts.dimensions,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if dimensions are (1, 1, 1, 90, 90, 90) (cryoEM placeholder), just set to None.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be done here, but rather when loading the PDB - there is an a bad edge case where you might actually just have that unit box.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further thought, I wouldn't be completely against this check - maybe a warning would be ok too if we don't want to be too strict

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further thought, I wouldn't be completely against this check - maybe a warning would be ok too if we don't want to be too strict

@IAlibay
Copy link
Member

IAlibay commented Dec 23, 2025

    d_on = np.linalg.norm(minimize_vectors(o_atoms[o_indices] - n_atoms[n_indices], box=box), axis=-1)
    d_ch = np.linalg.norm(minimize_vectors(c_atoms[o_indices] - h_1[n_indices], box=box), axis=-1)
    d_oh = np.linalg.norm(minimize_vectors(o_atoms[o_indices] - h_1[n_indices], box=box), axis=-1)
    d_cn = np.linalg.norm(minimize_vectors(c_atoms[o_indices] - n_atoms[n_indices], box=box), axis=-1)

The usage of np.linalg.norm should now be fine we are taking into account the periodic crossing?

Yep, don't trust me fully (I'm sleep deprived), but that sounds reasonable.

@marinegor
Copy link
Contributor

marinegor commented Dec 23, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants