-
Notifications
You must be signed in to change notification settings - Fork 761
ENH: Change get_hbond_map to use KDTree for distance search
#5182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello there first time contributor! Welcome to the MDAnalysis community! We ask that all contributors abide by our Code of Conduct and that first time contributors introduce themselves on GitHub Discussions so we can get to know you. You can learn more about participating here. Please also add yourself to package/AUTHORS as part of this PR.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #5182 +/- ##
===========================================
- Coverage 92.72% 92.72% -0.01%
===========================================
Files 180 180
Lines 22472 22470 -2
Branches 3188 3189 +1
===========================================
- Hits 20838 20835 -3
- Misses 1176 1177 +1
Partials 458 458 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| hmap = np.tile(h_1, (1, 1, n)).reshape(n, n, 3) | ||
| # We can use KDTree to find candidate pairs within cutoff distance without having to | ||
| # compute a full pairwise distance matrix, much faster especially for larger proteins | ||
| n_kdtree = KDTree(n_atoms) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know the original code was vendored from somewhere else, so it explains the heavy reliance on linalg.norm and the lack of PBC awareness - however, now that we're making changes, should we be using one of our lib.distances calls (maybe capped_distance) instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not at all experienced with lib.distances are you talking about for the KDTree / potential pair generation or for the actual distance computation? If just for computing distances then on the KDTree generation we would already be discarding some potential across-boundary bonds.
We should be able to provide simple periodic checking with KDTree you can specify a boxsize parameter. But this wouldn't take into account the more complex boundaries we might come across?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I just dug in and found capped_distance which should do exactly what we are after with better periodic boundary checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance isn't as good but also as expected with more checks. Still a large improvement over current.
============================================================
Scaling benchmark:
Residues Time (ms) Time/Res (μs)
------------------------------------------------------------
50 0.142 2.835
100 0.291 2.906
150 0.489 3.257
200 0.736 3.682
250 1.056 4.226
300 1.629 5.429
============================================================
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've currently updated the pair search to use capped_distances - but the actual distance computation still isn't taking into account periodic boundaries. Is there a periodic boundary aware distance calculation that doesn't do "everything vs everything" which distance_array seems to do? Ideally we would avoid repeatedly calling capped_distance as we have already constructed the KDTree once and can just use those results for computation later on, but still taking into account periodic boundary crossing.
|
@BradyAJohnston wow, that's super impressive results, thanks! Some comments:
Just to make sure I understand this correctly: the current implementation is plain wrong when the boundary condition is crossed, since it just computes the wrong distance, right? I mean this part: # compute distances and energy as previously but only for the potential pairs
# still returning the same energy matrix that would have otherwise been made
o_indices = pairs[:, 1]
n_indices = pairs[:, 0]
d_on = np.linalg.norm(o_atoms[o_indices] - n_atoms[n_indices], axis=-1)
d_ch = np.linalg.norm(c_atoms[o_indices] - h_1[n_indices], axis=-1)
d_oh = np.linalg.norm(o_atoms[o_indices] - h_1[n_indices], axis=-1)
d_cn = np.linalg.norm(c_atoms[o_indices] - n_atoms[n_indices], axis=-1)so we're looking for periodic contition aware d_on = np.linalg.norm(capped_distance(o_atoms[o_indices], n_atoms[n_indices]))
...When defaulting to |
|
Hi @marinegor yes if We could just have multiple If there was just a In the current setup while we do detect periodic bonds to check - the results end up being wrong like you say. |
perhaps there are some functions in distopia (namely,
do you see this being a separate issue, perhaps if you don't find a quick solution to use among |
@BradyAJohnston have a look at |
Please don't try to use distopia directly, it's meant to be an (optional for now) backend replacement. |
|
Thanks for the suggestion @IAlibay - this would then just become as such: d_on = np.linalg.norm(minimize_vectors(o_atoms[o_indices] - n_atoms[n_indices], box=box), axis=-1)
d_ch = np.linalg.norm(minimize_vectors(c_atoms[o_indices] - h_1[n_indices], box=box), axis=-1)
d_oh = np.linalg.norm(minimize_vectors(o_atoms[o_indices] - h_1[n_indices], box=box), axis=-1)
d_cn = np.linalg.norm(minimize_vectors(c_atoms[o_indices] - n_atoms[n_indices], box=box), axis=-1)The usage of |
I guess so, yeah -- it's just a norm calculation, there should not be any problem with that as long as the vector itself is boundary-corrected. As for cryoEM default, the placeholder unit cell is UPD: added this remark to changes request. |
| dssp = assign( | ||
| coords, | ||
| donor_mask=self._donor_mask, | ||
| box=self._trajectory.ts.dimensions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if dimensions are (1, 1, 1, 90, 90, 90) (cryoEM placeholder), just set to None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be done here, but rather when loading the PDB - there is an a bad edge case where you might actually just have that unit box.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed we do this already: https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/coordinates/PDB.py#L228-L231
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further thought, I wouldn't be completely against this check - maybe a warning would be ok too if we don't want to be too strict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further thought, I wouldn't be completely against this check - maybe a warning would be ok too if we don't want to be too strict
Yep, don't trust me fully (I'm sleep deprived), but that sounds reasonable. |
|
One thing, it probably shouldn't happen in `_single_frame` but rather in `__init__`, otherwise i'm ok with a warning.
|
Changes made in this Pull Request:
This updates the
get_hbond_mapfunction to useMDAnalysis.lib.distances.capped_distanceto speed up distance calculations. Nothing changes in the API, but rather than computing a pairwise distance matrix of every residue against every other residue, we can first construct a KDTree for both theNandOatoms and use that for finding relevant pairs within a certain cutoff distance. The returned value fromget_hbond_mapremains the same but we have avoided having to compute distances and energies between residues that would never form an HBond to begin with.In my benchmarks running just the
get_hbond_mapfunction this produces a speedup from 2-5x, with that difference continuing to increase with size of the protein system:developThis PR
PR Checklist
package/CHANGELOGfile updated?package/AUTHORS? (If it is not, add it!)Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--5182.org.readthedocs.build/en/5182/