-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: speed up random access by file name by changing how we index into vnodes #490
base: main
Are you sure you want to change the base?
Conversation
…des. NOTE: Will require a migration, which is not implemented yet
Important Review skippedMore than 25% of the files skipped due to max files limit. The review is being skipped to prevent a low-quality review. 43 files out of 135 files are above the max files limit of 75. Please upgrade to Pro plan to get higher limits. You can disable this status message by setting the WalkthroughThis pull request introduces several enhancements across the Oxen CLI and core library, focusing on improving command-line functionality, progress tracking, and node management. The changes span multiple files, adding new arguments to CLI commands, implementing progress tracking for push operations, and refining node retrieval and logging mechanisms. The modifications aim to provide more flexible and informative interactions with the Oxen version control system. Changes
Sequence DiagramsequenceDiagram
participant CLI as CLI Command
participant Repo as Local Repository
participant Remote as Remote Repository
participant Progress as Progress Tracker
CLI->>Repo: Prepare push operation
Repo->>Progress: Initialize progress tracking
Repo->>Remote: Create nodes
Remote-->>Progress: Update progress
Progress->>CLI: Report progress status
Possibly related PRs
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
src/lib/src/api/client/tree.rs (1)
Line range hint
665-674
: Switching from node hash to path-based hashingReplacing
// let bucket = child.node.hash.to_u128() % num_vnodes;
with
let bucket = hasher::hash_buffer_128bit( child.node.maybe_path() .unwrap() .to_str() .unwrap() .as_bytes(), ) % num_vnodes;may yield a more uniform distribution across vnodes and faster random access by file name.
However, note the double
unwrap()
:
- If
maybe_path()
returnsNone
or invalid UTF-8, this code will panic.Consider handling invalid paths gracefully. For instance:
- let bucket = hasher::hash_buffer_128bit( - child.node.maybe_path().unwrap().to_str().unwrap().as_bytes(), - ) % num_vnodes; + if let Some(path) = child.node.maybe_path() { + if let Some(path_str) = path.to_str() { + let bucket = hasher::hash_buffer_128bit(path_str.as_bytes()) % num_vnodes; + vnode_children[bucket as usize].entries.push(child.clone()); + } else { + return Err(OxenError::basic_str("Invalid UTF-8 in path")); + } + } else { + return Err(OxenError::basic_str("Path not found in node")); + }
🧹 Nitpick comments (21)
src/lib/src/core/v0_19_0/index/commit_merkle_tree.rs (5)
269-269
: Potential verbosityA debug statement at line 269 clarifies the depth read completion. This is beneficial for tracing, but consider adjusting log levels if runtime logs risk becoming too large.
554-555
: Unused debug statementLine 554 is a commented-out debug statement. If it’s no longer needed, you can remove it to keep the codebase clean. Otherwise, re-enable it selectively if you plan to use this for troubleshooting.
581-583
: Hash-based indexingThe approach of hashing the path and taking the remainder modulo
num_vnodes
is a smart, uniform distribution strategy. Ensure collisions are acceptable or accounted for, especially in large repos.
584-585
: Commented debugAs with other commented-out debug statements, consider cleaning up if it’s not intended for future debugging.
624-629
: New debug message inread_children_until_depth
Logging the number of children is often helpful. Ensure the log statements remain at a debug level if many directories exist, to avoid excessive logs in production.
src/cli/src/cmd/node.rs (4)
32-38
: Introducing--node
argument
- The short name
-n
withlong("node")
is intuitive.- Consider clarifying if it’s mutually exclusive with
--file
.
39-45
: Introducing--file
argument
- The short name
-f
is a standard choice for specifying a file.- If it’s intended to be exclusive with
--node
, consider adding a conflict rule in Clap arguments.
51-51
: Whitespace lineLine 51 is purely whitespace. Keep it or remove it to maintain style consistency across the code.
60-61
: Ensuring node hash is providedLine 61 calls
expect("Must supply node")
. This is a direct approach. If--node
is optional, consider a more descriptive error message or Clap-level check.src/lib/src/core/v0_19_0/push.rs (1)
203-206
: Updating progress message during node collectionThis real-time feedback assists in long-running operations. Confirm that the message does not spam logs if very large sets of nodes exist.
src/lib/src/core/v0_19_0/branches.rs (1)
424-427
: Enhance context or unify style within error messages.Great job adding details about the file name, source path, and destination. For completeness, consider including even more context (e.g., repo path or commit ID) or ensuring the style of this message remains consistent with similar error messages in the codebase.
src/lib/src/core/v0_19_0/entries.rs (8)
324-329
: Remove or clarify commented-out logs.Commented-out code can cause confusion for future maintainers. Consider removing it or adding a note explaining why it needs to remain commented out.
333-337
: Remove or clarify commented-out logs.Same feedback applies here. If these debug statements are no longer relevant, removing them can make the code cleaner.
349-354
: Remove or clarify commented-out logs.Consider removing them unless they are kept for potential troubleshooting. Documenting their purpose would help future readers.
356-359
: Remove or clarify commented-out logs.Similar to the above. Making a final decision on whether to keep or remove them will streamline maintenance.
367-367
: Remove or clarify commented-out logs.This single commented-out line could be removed if it’s no longer necessary.
381-387
: Remove or clarify commented-out logs.If there is no plan to reuse these log lines, removing them can reduce clutter.
390-394
: Remove or clarify commented-out logs.Same recommendation here. A well-maintained codebase avoids leaving behind commented code unless it’s needed.
397-401
: Remove or clarify commented-out logs.Repeated recommendation: either remove them or add documentation as to why they’re commented out.
src/lib/src/core/v0_19_0/fetch.rs (1)
370-370
: Adding debug log for missing entries fetch looks good.This is a useful log for tracking the commit state. Consider including additional details (e.g., the repo name or size of the commit) to make debugging easier.
src/lib/src/api/client/tree.rs (1)
76-82
: Progress message enrichedUpdating the progress message with the current node index and total children is helpful for user-facing feedback. As a minor enhancement, consider adding timing or throughput details if needed for large sets of nodes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
src/cli/src/cmd/delete_remote.rs
(2 hunks)src/cli/src/cmd/node.rs
(2 hunks)src/lib/src/api/client/tree.rs
(3 hunks)src/lib/src/core/refs/ref_reader.rs
(1 hunks)src/lib/src/core/v0_19_0/branches.rs
(1 hunks)src/lib/src/core/v0_19_0/commits.rs
(1 hunks)src/lib/src/core/v0_19_0/entries.rs
(4 hunks)src/lib/src/core/v0_19_0/fetch.rs
(2 hunks)src/lib/src/core/v0_19_0/index/commit_merkle_tree.rs
(6 hunks)src/lib/src/core/v0_19_0/index/commit_writer.rs
(3 hunks)src/lib/src/core/v0_19_0/push.rs
(2 hunks)src/lib/src/model/merkle_tree/node/vnode.rs
(2 hunks)src/lib/src/repositories/tree.rs
(1 hunks)src/server/src/controllers/file.rs
(1 hunks)
✅ Files skipped from review due to trivial changes (3)
- src/server/src/controllers/file.rs
- src/lib/src/core/refs/ref_reader.rs
- src/lib/src/core/v0_19_0/commits.rs
🔇 Additional comments (33)
src/lib/src/core/v0_19_0/index/commit_merkle_tree.rs (13)
20-20
: New import usage confirmed
The use crate::util::{self, hasher};
import is necessary for the hashing functionality introduced below. This appears valid and consistent with the new code references.
300-304
: Debugging internal state
These debug statements track how many directory hashes are loaded. They look useful for troubleshooting. Continually confirm that sensitive or large-volume data is not exposed in logs.
541-546
: Additional debug clarity for file reading
Logging the path, parent_path, and filename helps diagnose file loading issues. This is beneficial and seems safe as no user credentials or sensitive data appear exposed.
556-558
: Reading directory node at depth 0
Using CommitMerkleTree::read_depth(repo, &node_hash, 0)
efficiently retrieves the directory node without children, aligning with the new strategy of loading child vnodes on demand.
562-567
: Separation of directory node and vnodes
Storing merkle_node.children
into vnodes
clarifies the subsequent logic for calculating child sizes. This step is straightforward and correct.
568-580
: Bucket calculation logic
- Summing
vnode.vnode().unwrap().num_entries.unwrap()
to gettotal_children
is correct, givennum_entries
presumably stores each vnode’s total child count. - Dividing
total_children
byrepo.vnode_size()
and rounding up to getnum_vnodes
is a clear approach for partitioning into vnodes.
Make sure the unwrap calls do not panic in unexpected scenarios (e.g., missingnum_entries
values).
586-588
: Accessing the target vnode
vnodes[bucket as usize]
picks the vnode that should contain the file. This logic is consistent with the hashing approach. Confirm that bucket
never exceeds vnodes.len() - 1
.
589-591
: Depth loading for the vnode
CommitMerkleTree::read_depth(repo, &vnode_without_children.hash, 0)
ensures minimal data loading, aligning with the on-demand approach.
592-595
: Handling missing vnode data
Early return upon None
is good for avoiding runtime errors. The debug lines, if re-enabled, will be helpful for diagnosing missing child data.
596-598
: Binary search for the file node
vnode_with_children.get_by_path(file_name)
finalizes the single-file lookup. This is an effective approach. Confirm that file names are validated or sanitized if external input is possible.
692-697
: Deep read logging
This debug statement in read_children_from_node
clarifies the node DB path, node type, and recursion state. It's helpful for diagnosing. No immediate concerns.
702-702
: Skip recursion check
Adding || !recurse
in the condition prevents load of deeper children when recurse
is false. This is an effective short-circuit, preserving the intended partial load.
708-708
: Child count debugging
log::debug!("read_children_from_node Got {} children", children.len());
effectively informs the user about fetched nodes. Keep it at debug level.
src/lib/src/model/merkle_tree/node/vnode.rs (2)
15-15
: Adding num_entries
field to VNode
The Option<u64>
design allows for dynamic or undefined entries when unavailable. If num_entries
is crucial, ensure the call sites handle None
gracefully.
30-30
: Default implementation for num_entries
Defaulting to None
is consistent with Option<u64>
. This approach helps avoid uninitialized numeric values. No immediate issues.
src/cli/src/cmd/node.rs (3)
6-6
: Import required for repositories usage
The use liboxen::repositories;
import aligns with the new calls to repositories::commits::head_commit
and repositories::entries::get_file
. Looks good.
49-49
: Inline comment for repository loading
The comment “// Find the repository” is succinct. No issues found here.
52-58
: Handling --file
for node retrieval
- Fetching the head commit and retrieving the node associated with the file is straightforward.
- The direct print of
node
provides an immediate user-facing output. - Ensure that
file
is a valid path if user input is untrusted.
src/cli/src/cmd/delete_remote.rs (3)
37-42
: New scheme
argument
- Accepting a user-provided scheme (HTTP/HTTPS) is flexible.
- Validate the scheme if it must be restricted (e.g., only https).
64-68
: Default scheme to “https”
Defaulting to a secure scheme is a strong choice. If plaintext connections are permissible, consider prompting a warning for http
.
70-71
: Constructing remote URL with scheme
Including the scheme directly in the format string is correct. Be cautious if user input might inject unexpected strings.
src/lib/src/core/v0_19_0/push.rs (1)
237-237
: Passing progress
to create_nodes
This integration solidifies the unified progress tracking approach. Ensure that create_nodes
consistently updates progress
for user clarity.
src/lib/src/core/v0_19_0/entries.rs (1)
52-58
: Skip retrieval if path is a directory.
This logic properly checks if the path is a directory before attempting file-based operations. It helps prevent incorrect lookups and potential errors in retrieving file nodes.
Also applies to: 60-60
src/lib/src/core/v0_19_0/fetch.rs (1)
396-401
: Helpful debug info on recursion steps.
Logging the number of child nodes, their hash, and node details is very valuable for debugging. The addition is well-structured.
src/lib/src/repositories/tree.rs (1)
174-174
: Verify depth-based read logic.
Switching to read_depth(repo, hash, 1)
can miss deeper children if the node has multiple nested levels. Ensure that one-level depth reading suffices for your use case or consider verifying that deeper levels are handled elsewhere.
src/lib/src/api/client/tree.rs (6)
8-8
: Dependency on Arc
introduced
Looks good. This addition allows for reference-counted concurrency in handling the newly integrated progress structures.
15-15
: New import for tracking push progress
Importing PushProgress
is appropriate for monitoring operations during node creation.
62-63
: Children count instrumentation
Initializing children_count
and using enumerate()
for indexing helps keep track of node progress. This aligns with the newly introduced progress
parameter.
84-84
: Incrementing total children
children_count += node.children.len();
properly accumulates the total child count for accurate progress reporting.
Line range hint 821-821
: Storing number of entries
Adding num_entries: Some(vnode.entries.len() as u64)
to the VNode
struct provides more metadata for indexing or debugging. Ensure downstream consumers correctly handle this new field.
57-57
: Signature change to accept progress
Adding progress: &Arc<PushProgress>
ensures real-time status updates. Verify that all call-sites to create_nodes
are updated to pass a valid progress tracker.
✅ Verification successful
Let me gather more context about the create_nodes
function and its parameters.
Let me try a different approach to get the function signatures.
Let me check the actual call site to verify the parameter is being passed correctly.
All call sites are properly updated with the progress parameter
Based on the verification results:
- The
create_nodes
function insrc/lib/src/api/client/tree.rs
has been updated to accept the newprogress: &Arc<PushProgress>
parameter - The only call site in
src/lib/src/core/v0_19_0/push.rs
correctly passes the progress parameter - The
create_nodes
function insrc/server/src/controllers/tree.rs
is a different function (HTTP endpoint controller) and is unrelated to this change
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify that all create_nodes function calls now include the progress parameter.
rg -A 2 'create_nodes\s*\('
Length of output: 657
Script:
#!/bin/bash
# Get the full function signature and implementation of create_nodes
ast-grep --pattern 'pub async fn create_nodes($$$)'
Length of output: 53
Script:
#!/bin/bash
# Get more context around create_nodes function definitions
rg -B 2 -A 5 'pub async fn create_nodes'
Length of output: 1047
Script:
#!/bin/bash
# Get more context around the call site in push.rs
rg -B 5 -A 5 'create_nodes\(.*progress' src/lib/src/core/v0_19_0/push.rs
Length of output: 639
src/lib/src/core/v0_19_0/index/commit_writer.rs (2)
35-35
: Importing hasher
from util
Bringing in crate::util::hasher
is aligned with the updated vnode bucket calculation.
821-821
: Added num_entries
in VNode
Populating the num_entries
field captures subtree size within each vnode. This can facilitate improved logging, caching, or balancing.
// let bucket = child.node.hash.to_u128() % num_vnodes; | ||
let bucket = hasher::hash_buffer_128bit( | ||
child | ||
.node | ||
.maybe_path() | ||
.unwrap() | ||
.to_str() | ||
.unwrap() | ||
.as_bytes(), | ||
) % num_vnodes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Utilizing path-based bucket hashing
Same change as in tree.rs
, transitioning from node.hash
to hash_buffer_128bit
over the path, which can produce a better distribution for large numbers of files.
Again, watch out for potential panics from unwrap()
on invalid UTF-8.
NOTE: Will require a migration, which is not implemented yet
Summary by CodeRabbit
Release Notes
New Features
--scheme
for specifying URL scheme in remote deletion.--node
and--file
flags for enhanced node inspection.Bug Fixes
Documentation
Refactor
Style