WIP: speed up random access by file name by changing how we index into vnodes #490

gschoeni · 2024-12-28T00:01:19Z

NOTE: Will require a migration, which is not implemented yet

Summary by CodeRabbit

Release Notes

New Features
- Added command-line argument --scheme for specifying URL scheme in remote deletion.
- Introduced --node and --file flags for enhanced node inspection.
- Implemented progress tracking in node creation and push operations.
Bug Fixes
- Improved error messaging for file restoration when source files are not found.
Documentation
- Enhanced logging for better traceability in various functions.
Refactor
- Streamlined logic for retrieving file nodes and updated argument handling in multiple commands.
Style
- Commented out unnecessary debug log statements to reduce verbosity.

…des. NOTE: Will require a migration, which is not implemented yet

coderabbitai · 2024-12-28T00:01:27Z

Important

Review skipped

More than 25% of the files skipped due to max files limit. The review is being skipped to prevent a low-quality review.

43 files out of 135 files are above the max files limit of 75. Please upgrade to Pro plan to get higher limits.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This pull request introduces several enhancements across the Oxen CLI and core library, focusing on improving command-line functionality, progress tracking, and node management. The changes span multiple files, adding new arguments to CLI commands, implementing progress tracking for push operations, and refining node retrieval and logging mechanisms. The modifications aim to provide more flexible and informative interactions with the Oxen version control system.

Changes

File	Change Summary
`src/cli/src/cmd/delete_remote.rs`	Added `scheme` argument to specify URL scheme for remote repository deletion
`src/cli/src/cmd/node.rs`	Introduced `--node` and `--file` flags for more flexible node inspection
`src/lib/src/api/client/tree.rs`	Added progress tracking to `create_nodes` function
`src/lib/src/core/v0_19_0/branches.rs`	Enhanced error message for file restoration
`src/lib/src/core/v0_19_0/index/commit_merkle_tree.rs`	Optimized file reading and added debug logging
`src/lib/src/core/v0_19_0/push.rs`	Added progress tracking to push commits operation
`src/lib/src/model/merkle_tree/node/vnode.rs`	Added `num_entries` field to `VNode` struct

Sequence Diagram

sequenceDiagram
    participant CLI as CLI Command
    participant Repo as Local Repository
    participant Remote as Remote Repository
    participant Progress as Progress Tracker

    CLI->>Repo: Prepare push operation
    Repo->>Progress: Initialize progress tracking
    Repo->>Remote: Create nodes
    Remote-->>Progress: Update progress
    Progress->>CLI: Report progress status

Possibly related PRs

add rename file feature #484: Rename feature implementation, which shares similarities in command-line argument enhancements

Poem

🐰 Hopping through code with glee,
Oxen's CLI now more free!
Nodes and remotes dance with grace,
Progress tracked at lightning pace.
A rabbit's smile, the changes bright! 🚀

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🔭 Outside diff range comments (1)

src/lib/src/api/client/tree.rs (1)

Line range hint 665-674: Switching from node hash to path-based hashing

Replacing

// let bucket = child.node.hash.to_u128() % num_vnodes;

with

let bucket = hasher::hash_buffer_128bit(
    child.node.maybe_path()
        .unwrap()
        .to_str()
        .unwrap()
        .as_bytes(),
) % num_vnodes;

may yield a more uniform distribution across vnodes and faster random access by file name.

However, note the double unwrap():

If maybe_path() returns None or invalid UTF-8, this code will panic.

Consider handling invalid paths gracefully. For instance:

- let bucket = hasher::hash_buffer_128bit(
-     child.node.maybe_path().unwrap().to_str().unwrap().as_bytes(),
- ) % num_vnodes;

+ if let Some(path) = child.node.maybe_path() {
+     if let Some(path_str) = path.to_str() {
+         let bucket = hasher::hash_buffer_128bit(path_str.as_bytes()) % num_vnodes;
+         vnode_children[bucket as usize].entries.push(child.clone());
+     } else {
+         return Err(OxenError::basic_str("Invalid UTF-8 in path"));
+     }
+ } else {
+     return Err(OxenError::basic_str("Path not found in node"));
+ }

🧹 Nitpick comments (21)

src/lib/src/core/v0_19_0/index/commit_merkle_tree.rs (5)

269-269: Potential verbosity

A debug statement at line 269 clarifies the depth read completion. This is beneficial for tracing, but consider adjusting log levels if runtime logs risk becoming too large.

554-555: Unused debug statement

Line 554 is a commented-out debug statement. If it’s no longer needed, you can remove it to keep the codebase clean. Otherwise, re-enable it selectively if you plan to use this for troubleshooting.

581-583: Hash-based indexing

The approach of hashing the path and taking the remainder modulo num_vnodes is a smart, uniform distribution strategy. Ensure collisions are acceptable or accounted for, especially in large repos.

584-585: Commented debug

As with other commented-out debug statements, consider cleaning up if it’s not intended for future debugging.

624-629: New debug message in read_children_until_depth

Logging the number of children is often helpful. Ensure the log statements remain at a debug level if many directories exist, to avoid excessive logs in production.

src/cli/src/cmd/node.rs (4)

32-38: Introducing --node argument

The short name -n with long("node") is intuitive.

Consider clarifying if it’s mutually exclusive with --file.

39-45: Introducing --file argument

The short name -f is a standard choice for specifying a file.

If it’s intended to be exclusive with --node, consider adding a conflict rule in Clap arguments.

51-51: Whitespace line

Line 51 is purely whitespace. Keep it or remove it to maintain style consistency across the code.

60-61: Ensuring node hash is provided

Line 61 calls expect("Must supply node"). This is a direct approach. If --node is optional, consider a more descriptive error message or Clap-level check.

src/lib/src/core/v0_19_0/push.rs (1)

203-206: Updating progress message during node collection

This real-time feedback assists in long-running operations. Confirm that the message does not spam logs if very large sets of nodes exist.

src/lib/src/core/v0_19_0/branches.rs (1)

424-427: Enhance context or unify style within error messages.

Great job adding details about the file name, source path, and destination. For completeness, consider including even more context (e.g., repo path or commit ID) or ensuring the style of this message remains consistent with similar error messages in the codebase.

src/lib/src/core/v0_19_0/entries.rs (8)

324-329: Remove or clarify commented-out logs.

Commented-out code can cause confusion for future maintainers. Consider removing it or adding a note explaining why it needs to remain commented out.

333-337: Remove or clarify commented-out logs.

Same feedback applies here. If these debug statements are no longer relevant, removing them can make the code cleaner.

349-354: Remove or clarify commented-out logs.

Consider removing them unless they are kept for potential troubleshooting. Documenting their purpose would help future readers.

356-359: Remove or clarify commented-out logs.

Similar to the above. Making a final decision on whether to keep or remove them will streamline maintenance.

367-367: Remove or clarify commented-out logs.

This single commented-out line could be removed if it’s no longer necessary.

381-387: Remove or clarify commented-out logs.

If there is no plan to reuse these log lines, removing them can reduce clutter.

390-394: Remove or clarify commented-out logs.

Same recommendation here. A well-maintained codebase avoids leaving behind commented code unless it’s needed.

397-401: Remove or clarify commented-out logs.

Repeated recommendation: either remove them or add documentation as to why they’re commented out.

src/lib/src/core/v0_19_0/fetch.rs (1)

370-370: Adding debug log for missing entries fetch looks good.

This is a useful log for tracking the commit state. Consider including additional details (e.g., the repo name or size of the commit) to make debugging easier.

src/lib/src/api/client/tree.rs (1)

76-82: Progress message enriched

Updating the progress message with the current node index and total children is helpful for user-facing feedback. As a minor enhancement, consider adding timing or throughput details if needed for large sets of nodes.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e969f02 and 8b40663.

📒 Files selected for processing (14)

src/cli/src/cmd/delete_remote.rs (2 hunks)
src/cli/src/cmd/node.rs (2 hunks)
src/lib/src/api/client/tree.rs (3 hunks)
src/lib/src/core/refs/ref_reader.rs (1 hunks)
src/lib/src/core/v0_19_0/branches.rs (1 hunks)
src/lib/src/core/v0_19_0/commits.rs (1 hunks)
src/lib/src/core/v0_19_0/entries.rs (4 hunks)
src/lib/src/core/v0_19_0/fetch.rs (2 hunks)
src/lib/src/core/v0_19_0/index/commit_merkle_tree.rs (6 hunks)
src/lib/src/core/v0_19_0/index/commit_writer.rs (3 hunks)
src/lib/src/core/v0_19_0/push.rs (2 hunks)
src/lib/src/model/merkle_tree/node/vnode.rs (2 hunks)
src/lib/src/repositories/tree.rs (1 hunks)
src/server/src/controllers/file.rs (1 hunks)

✅ Files skipped from review due to trivial changes (3)

src/server/src/controllers/file.rs
src/lib/src/core/refs/ref_reader.rs
src/lib/src/core/v0_19_0/commits.rs

🔇 Additional comments (33)

src/lib/src/core/v0_19_0/index/commit_merkle_tree.rs (13)

20-20: New import usage confirmed

The use crate::util::{self, hasher}; import is necessary for the hashing functionality introduced below. This appears valid and consistent with the new code references.

300-304: Debugging internal state

These debug statements track how many directory hashes are loaded. They look useful for troubleshooting. Continually confirm that sensitive or large-volume data is not exposed in logs.

541-546: Additional debug clarity for file reading

Logging the path, parent_path, and filename helps diagnose file loading issues. This is beneficial and seems safe as no user credentials or sensitive data appear exposed.

556-558: Reading directory node at depth 0

Using CommitMerkleTree::read_depth(repo, &node_hash, 0) efficiently retrieves the directory node without children, aligning with the new strategy of loading child vnodes on demand.

562-567: Separation of directory node and vnodes

Storing merkle_node.children into vnodes clarifies the subsequent logic for calculating child sizes. This step is straightforward and correct.

568-580: Bucket calculation logic

Summing vnode.vnode().unwrap().num_entries.unwrap() to get total_children is correct, given num_entries presumably stores each vnode’s total child count.
Dividing total_children by repo.vnode_size() and rounding up to get num_vnodes is a clear approach for partitioning into vnodes.
Make sure the unwrap calls do not panic in unexpected scenarios (e.g., missing num_entries values).

586-588: Accessing the target vnode

vnodes[bucket as usize] picks the vnode that should contain the file. This logic is consistent with the hashing approach. Confirm that bucket never exceeds vnodes.len() - 1.

589-591: Depth loading for the vnode

CommitMerkleTree::read_depth(repo, &vnode_without_children.hash, 0) ensures minimal data loading, aligning with the on-demand approach.

592-595: Handling missing vnode data

Early return upon None is good for avoiding runtime errors. The debug lines, if re-enabled, will be helpful for diagnosing missing child data.

596-598: Binary search for the file node

vnode_with_children.get_by_path(file_name) finalizes the single-file lookup. This is an effective approach. Confirm that file names are validated or sanitized if external input is possible.

692-697: Deep read logging

This debug statement in read_children_from_node clarifies the node DB path, node type, and recursion state. It's helpful for diagnosing. No immediate concerns.

702-702: Skip recursion check

Adding || !recurse in the condition prevents load of deeper children when recurse is false. This is an effective short-circuit, preserving the intended partial load.

708-708: Child count debugging

log::debug!("read_children_from_node Got {} children", children.len()); effectively informs the user about fetched nodes. Keep it at debug level.

src/lib/src/model/merkle_tree/node/vnode.rs (2)

15-15: Adding num_entries field to VNode

The Option<u64> design allows for dynamic or undefined entries when unavailable. If num_entries is crucial, ensure the call sites handle None gracefully.

30-30: Default implementation for num_entries

Defaulting to None is consistent with Option<u64>. This approach helps avoid uninitialized numeric values. No immediate issues.

src/cli/src/cmd/node.rs (3)

6-6: Import required for repositories usage

The use liboxen::repositories; import aligns with the new calls to repositories::commits::head_commit and repositories::entries::get_file. Looks good.

49-49: Inline comment for repository loading

The comment “// Find the repository” is succinct. No issues found here.

52-58: Handling --file for node retrieval

Fetching the head commit and retrieving the node associated with the file is straightforward.
The direct print of node provides an immediate user-facing output.
Ensure that file is a valid path if user input is untrusted.

src/cli/src/cmd/delete_remote.rs (3)

37-42: New scheme argument

Accepting a user-provided scheme (HTTP/HTTPS) is flexible.
Validate the scheme if it must be restricted (e.g., only https).

64-68: Default scheme to “https”

Defaulting to a secure scheme is a strong choice. If plaintext connections are permissible, consider prompting a warning for http.

70-71: Constructing remote URL with scheme

Including the scheme directly in the format string is correct. Be cautious if user input might inject unexpected strings.

src/lib/src/core/v0_19_0/push.rs (1)

237-237: Passing progress to create_nodes

This integration solidifies the unified progress tracking approach. Ensure that create_nodes consistently updates progress for user clarity.

src/lib/src/core/v0_19_0/entries.rs (1)

52-58: Skip retrieval if path is a directory.

This logic properly checks if the path is a directory before attempting file-based operations. It helps prevent incorrect lookups and potential errors in retrieving file nodes.

Also applies to: 60-60

src/lib/src/core/v0_19_0/fetch.rs (1)

396-401: Helpful debug info on recursion steps.

Logging the number of child nodes, their hash, and node details is very valuable for debugging. The addition is well-structured.

src/lib/src/repositories/tree.rs (1)

174-174: Verify depth-based read logic.

Switching to read_depth(repo, hash, 1) can miss deeper children if the node has multiple nested levels. Ensure that one-level depth reading suffices for your use case or consider verifying that deeper levels are handled elsewhere.

src/lib/src/api/client/tree.rs (6)

8-8: Dependency on Arc introduced

Looks good. This addition allows for reference-counted concurrency in handling the newly integrated progress structures.

15-15: New import for tracking push progress

Importing PushProgress is appropriate for monitoring operations during node creation.

62-63: Children count instrumentation

Initializing children_count and using enumerate() for indexing helps keep track of node progress. This aligns with the newly introduced progress parameter.

84-84: Incrementing total children

children_count += node.children.len(); properly accumulates the total child count for accurate progress reporting.

Line range hint 821-821: Storing number of entries

Adding num_entries: Some(vnode.entries.len() as u64) to the VNode struct provides more metadata for indexing or debugging. Ensure downstream consumers correctly handle this new field.

57-57: Signature change to accept progress

Adding progress: &Arc<PushProgress> ensures real-time status updates. Verify that all call-sites to create_nodes are updated to pass a valid progress tracker.

✅ Verification successful

Let me gather more context about the create_nodes function and its parameters.

Let me try a different approach to get the function signatures.

Let me check the actual call site to verify the parameter is being passed correctly.

All call sites are properly updated with the progress parameter

Based on the verification results:

The create_nodes function in src/lib/src/api/client/tree.rs has been updated to accept the new progress: &Arc<PushProgress> parameter
The only call site in src/lib/src/core/v0_19_0/push.rs correctly passes the progress parameter
The create_nodes function in src/server/src/controllers/tree.rs is a different function (HTTP endpoint controller) and is unrelated to this change

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify that all create_nodes function calls now include the progress parameter.
rg -A 2 'create_nodes\s*\('

Length of output: 657

Script:

#!/bin/bash
# Get the full function signature and implementation of create_nodes
ast-grep --pattern 'pub async fn create_nodes($$$)'

Length of output: 53

Script:

#!/bin/bash
# Get more context around create_nodes function definitions
rg -B 2 -A 5 'pub async fn create_nodes'

Length of output: 1047

Script:

#!/bin/bash
# Get more context around the call site in push.rs
rg -B 5 -A 5 'create_nodes\(.*progress' src/lib/src/core/v0_19_0/push.rs

Length of output: 639

src/lib/src/core/v0_19_0/index/commit_writer.rs (2)

35-35: Importing hasher from util

Bringing in crate::util::hasher is aligned with the updated vnode bucket calculation.

821-821: Added num_entries in VNode

Populating the num_entries field captures subtree size within each vnode. This can facilitate improved logging, caching, or balancing.

coderabbitai · 2024-12-28T00:05:07Z

src/lib/src/core/v0_19_0/index/commit_writer.rs

+            // let bucket = child.node.hash.to_u128() % num_vnodes;
+            let bucket = hasher::hash_buffer_128bit(
+                child
+                    .node
+                    .maybe_path()
+                    .unwrap()
+                    .to_str()
+                    .unwrap()
+                    .as_bytes(),
+            ) % num_vnodes;


🛠️ Refactor suggestion

Utilizing path-based bucket hashing

Same change as in tree.rs, transitioning from node.hash to hash_buffer_128bit over the path, which can produce a better distribution for large numbers of files.

Again, watch out for potential panics from unwrap() on invalid UTF-8.

speed up random access by file name by changing how we index into vno…

8b40663

…des. NOTE: Will require a migration, which is not implemented yet

coderabbitai bot reviewed Dec 28, 2024

View reviewed changes

gschoeni added 6 commits December 31, 2024 12:22

WIP: Changing to v_latest

c43b815

Merge branch 'main' into feat/speedup-random-access-by-path

696939a

remove redis dependency

b16b116

have different impl's for VNode

7ebf8a3

make some older views load with v19

183d515

unit tests passing for v0.25.0

6c729d3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: speed up random access by file name by changing how we index into vnodes #490

WIP: speed up random access by file name by changing how we index into vnodes #490

gschoeni commented Dec 28, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 28, 2024 •

edited

Loading

Review skipped

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot Dec 28, 2024

WIP: speed up random access by file name by changing how we index into vnodes #490

Are you sure you want to change the base?

WIP: speed up random access by file name by changing how we index into vnodes #490

Conversation

gschoeni commented Dec 28, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

Release Notes

coderabbitai bot commented Dec 28, 2024 • edited Loading

Review skipped

Walkthrough

Changes

Sequence Diagram

Possibly related PRs

Poem

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Dec 28, 2024

Choose a reason for hiding this comment

gschoeni commented Dec 28, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 28, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)