Make arrange_nodes more memory efficient #415
c3300b0
to
6e9e861
Compare
6e9e861
to
512f457
Compare
Can you give a high-level overview of the algorithm? It seems that this changes more than just arrange_nodes, and I'm curious whether the changes in the other methods are compatible. |
I feel the reason to only have partial results is pagination. (this can be generalized for any filtering method instead of
In these tests our ordering is:
The Let me know if this is clear (I fear I've become a broken record around the issue of sorting with a rank and not having parent nodes) |
512f457
to
432be60
Compare
WIP: I'm splitting this up. I also don't like the way the tests are all in 2 methods - which forces comments. |
432be60
to
e61229a
Compare
@kbrock I faced this issue and wrote a loop implementation; maybe you could adapt it to your changes so it could be merged with your patch. Thanks.
def self.sort_by_ancestry(nodes, &block)
  # Index nodes by id for quick lookup.
  nodes_index = {}
  nodes.each { |node| nodes_index[node.id] = node }
  root_key = -1
  # Map each parent id (or root_key) to the ids of its children.
  tree = { root_key => [] }
  nodes.each do |node|
    tree[node.parent_id || root_key] ||= []
    tree[node.parent_id || root_key] << node.id
  end
  # Children whose parent is missing from the input are promoted to roots.
  tree.keys
    .select { |node_id| nodes_index[node_id].blank? && node_id != root_key }
    .each { |node_id| tree[root_key] += tree.delete(node_id) }
  tree[root_key].delete(root_key)
  # Sort each group of siblings, using the block when ancestries match.
  presorted_tree = tree.each do |parent_id, children_nodes|
    sorted_children_nodes = children_nodes.sort do |a_id, b_id|
      a_node = nodes_index[a_id]
      b_node = nodes_index[b_id]
      a_cestry = a_node.ancestry || '0'
      b_cestry = b_node.ancestry || '0'
      if block && a_cestry == b_cestry
        block.call(a_node, b_node)
      else
        a_cestry <=> b_cestry
      end
    end
    tree[parent_id] = sorted_children_nodes
  end
  # Walk the tree depth-first with an explicit stack instead of recursion.
  __result = []
  stack = [root_key]
  while stack.any?
    current_parent_id = stack.pop
    next if presorted_tree[current_parent_id].blank?
    node_id = presorted_tree[current_parent_id].shift
    node = nodes_index[node_id]
    __result << node if node.present?
    stack.push(current_parent_id) if presorted_tree[current_parent_id].present?
    stack.push(node_id) if presorted_tree[node_id].present?
  end
  __result
end
It builds a normalized tree which represents all nodes. |
92a8d99
to
592350a
Compare
@kbrock I don't quite understand what the hell is even happening here - arranging a tree and then flattening it instead of just doing |
@kbrock this is an old PR and I remember I was adding some changes to my local project patch. I’ll check it and comment later. Thanks |
592350a
to
9daee98
Compare
@estepnv Thanks. I stumbled across your OLD comment again (sorry) while trying to close old PRs/issues. We had already gotten a bunch of this PR in, so I wanted to just close this and not think about it anymore. I also feel like it can be a precursor to getting closer to pure database sorting.
@kshnurov This has been part of the code base for over 10 years. It was really necessary back before hashes were ordered (pre-Ruby 1.9), and I'm looking to see what we can remove here. Something like this is necessary for all implementations of materialized path, and it is an advertised advantage in the closure_tree gem.
I am reworking this method because it was left open and it was addressing a memory issue that a few people like @estepnv were seeing. Users who use ancestry for things like threaded comments want order and pagination to work. They want sorting on partial trees returned (read: paginated results). I would like to get to a solution that sorts in the database, but it is currently not as simple as
Also:
{1 => {2 => {}, 3 => {}}}.flatten
# => [1, {2=>{}, 3=>{}}]
flatten_arranged_nodes({1 => {2 => {}, 3 => {}}})
# => [1, 2, 3]
I also tried just returning the nodes and not throwing them into/out of a tree, but as I expected, they come back in a different order. I think possibly introducing ancestry derived columns (e.g.: It would be best not to have to materialize the orders (and force a large database update when node orders change). Stuff like this is why I would like to find a way to better join from ancestry_ids to the associated ancestry records in pure SQL. |
All of it? I don't see how any of this legacy code could be useful when you can do native DB/Ruby sorting.
Pagination is done on top-level elements (you don't want to count nested comments), so you just do something like this:
Why not? It's as simple as that. |
Oh yeah, flatten doesn't flatten hashes. Well:
|
c999289
to
9daee98
Compare
@kshnurov Thank you for your enthusiasm and expertise. As a gem maintainer, I do my best to keep a gem working in a consistent manner for all applications that use it. Changing the way sorting and pagination work is not viable. The goal was to remove recursion and not create extra arrays. That way it would alleviate the out of memory error.
I too wish the second one worked. In theory, we could just sort by ancestry and rank and be done with it. If you can find a way to get that test working, that would be great. |
I'm not suggesting changing it, I'm just telling you that you can do that at the DB level if your sorting is based on db columns.
works perfectly,
But your method is still recursive! |
9daee98
to
b812927
Compare
I felt like the flatten arranged nodes are no longer recursive. I must have been looking at my code for too long because I just don't see the recursion. I revisited the tests. I'm questioning some of the use cases. It is hard to know if some of these use cases are due to locale messing up the ordering, ruby 1.8 messing up the ordering, or |
This is a pure recursion:
|
re recursion: (not really the problem at hand) Do you mean traversal? I agree with you that the approach to solving this problem is to traverse the children. And it needs to be done by looking at each level at a time. By recursion, I mean specifically using the call stack to store temporary state, and calling your own function to set up new state for the next level. And in our case, using recursion creates a lot of temporary variables. So I had rewritten it to not create so many objects. But then I read the suggestion and went ahead to remove recursion altogether. My solution was close to yours (I'll use yours as a base since I don't have mine handy):
def flatten_arranged_nodes(h, final = [])
  h.each { |k, v| final << k ; flatten_arranged_nodes(v, final) unless v.empty? }
  final
end
I hope we are in agreement that the above one is recursive. It is calling itself. But this all seems like an academic debate. |
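For contrast with the recursive version above, here is a minimal sketch of the same traversal driven by an explicit stack. The name flatten_arranged_nodes_iterative is hypothetical and not part of the gem, and the sketch assumes the arranged tree is a nested hash of node => children. It still allocates small temporary arrays while reversing each level, so whether it actually saves memory would need to be benchmarked.

```ruby
# Hypothetical iterative variant (illustrative name, not the gem's API).
# Assumes `arranged` is a nested hash of the form { node => { child => {} } }.
def flatten_arranged_nodes_iterative(arranged)
  final = []
  # The stack holds [node, children] pairs; reversed so the first pair pops first.
  stack = arranged.to_a.reverse
  until stack.empty?
    node, children = stack.pop
    final << node
    # Push children in reverse so they are visited in insertion order.
    children.reverse_each { |pair| stack.push(pair) }
  end
  final
end

tree = { 1 => { 2 => { 4 => {} }, 3 => {} }, 5 => {} }
p flatten_arranged_nodes_iterative(tree)
```

The call stack stays flat regardless of tree depth, which is the property being discussed; the parent-before-children order is the same as in the recursive version.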
Oh, so the problem is memory usage, not recursion.
Well, ruby is a black box here. IDK if |
The core problem we are trying to solve: What is the most efficient way to sort nodes using rank? (memory and time) For this conversation, let's assume that rank is in the database.
proper sort: 2, 6, 5, 1, 3, 4
Question, can we assume the caller has sorted the nodes into ancestry,rank coming in? |
And what exactly is the goal of such "sorting" (flattening in fact) and why this gem should do it after all? It sounds like some really custom "sorting". If I have e.g. nested comments - I need a tree, not a flat array, and a tree is perfectly sorted with |
@kshnurov I appreciate that your use cases allow you to paginate by the parent node, and you do not want to do sql only updates. But please respect that other people have different constraints. Also appreciate that backwards compatibility is a concern. Yes, we are sorting/arranging a tree and then flattening it.
sort_by_ancestry(tree_nodes.take(50)).each do |node|
  puts "#{' ' * node.depth}- #{node.name}"
end
# vs
def print_tree(node_hash, count)
  node_hash.each do |n, children|
    puts "#{' ' * n.depth}- #{n.name}"
    count -= 1
    break if count < 0
    count = print_tree(children, count)
    break if count < 0
  end
  count
end
print_tree(tree_nodes.each_with_object({}) { |n, t| t.merge!(n.arrange) }, 50)
Even Interestingly, if ancestry stored "/ancestry/id", then |
It's just completely broken, I've already explained why. If you want to keep broken functionality in this gem, as you did with 4.2.0/materialized_path2 - I'm not gonna stop you.
I sort everything with a database, arrange into a tree, and then flatten/output with a simple recursion/partials however I want:
This is SO broken, I'm surprised you even write this. What are you sorting here, 50 random nodes? It's as bad as doing |
We are sorting records that come back from a query. That means certain assumptions can be made about the records coming back: it will not contain cousins without at least one parent node in there.
- Only performs one compare (<=>) vs two (== and <=>)
- null ancestry is a space (ensures it sorts first when using materialized path 2)
But it does not actually change the sorting algorithm. It is still sorting levels by id (not optimal), and there are still issues when nodes are missing.
This is the part responsible for taking a tree and producing an array of nodes with the parent node before the children. Since children is always a hash, this will never enter the sorting code and never require the block, so the sorting and the &block were dropped. This works because ruby enumerates hashes in insertion order (no need to sort again).
Since this is in its own method, we no longer have such a big call stack. Also, we are no longer creating 2 new arrays for every set of children.
b812927
to
719a457
Compare
Ran benchmarks on the various implementations of I ran:
I was not able to get a good number on the non-recursive one. It modifies the array, so benchmark-ips was not practical. I agree that it is too complicated, and I had trouble making it non-modifying without allocating a bunch of arrays. My first attempt still used inject, but I think I redid it without
data 19 wide 3 deep = 6859 nodes
data 2 wide 12 deep = 4095 nodes
data 6 wide 5 deep = 7776 nodes
Removing the array allocation made all the difference.
@estepnv This is still recursive, but for any reasonable depth, it takes up 96% less memory. Deeper will have even bigger savings. Let me know if you think this will work for you |
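To make the allocation point concrete, here is a rough sketch that counts allocated objects for two flattening styles. It is not the benchmark that was run for this PR: build_tree, flatten_with_arrays, flatten_with_accumulator and allocations are all hypothetical helpers, and the tree shape (6 wide, 4 deep) is an arbitrary assumption. It only uses GC.stat from core Ruby, so the numbers are indicative at best.

```ruby
# Hypothetical helper: builds an arranged tree with `width` children per node
# and `depth` levels. Ids are assigned in depth-first order.
def build_tree(width, depth, counter = [0])
  return {} if depth.zero?
  (1..width).each_with_object({}) do |_, level|
    id = (counter[0] += 1)
    level[id] = build_tree(width, depth - 1, counter)
  end
end

# Style 1 (assumed, for illustration): builds new arrays at every level.
def flatten_with_arrays(tree)
  tree.inject([]) { |acc, (node, children)| acc + [node] + flatten_with_arrays(children) }
end

# Style 2: appends into a single shared accumulator.
def flatten_with_accumulator(tree, final = [])
  tree.each do |node, children|
    final << node
    flatten_with_accumulator(children, final) unless children.empty?
  end
  final
end

# Counts objects allocated while the block runs.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

tree = build_tree(6, 4)
puts "per-level arrays:   #{allocations { flatten_with_arrays(tree) }} objects"
puts "shared accumulator: #{allocations { flatten_with_accumulator(tree) }} objects"
```

Both styles return the same array; only the number of intermediate objects differs, which is what drives memory use here.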
Let me reiterate: Everyone has a reason for how they act.
Sql only updates
Rails provides callback friendly and database only approaches:
class Model
  has_many :orders, :dependent => :delete_all # DB
  has_many :orders, :dependent => :destroy # Ruby
end
Model.delete # DB
Model.destroy # Ruby
Some problems can be done in pure sql while others require ruby. Ancestry provides both options, as well.
Materialized Path 2
Even though most (all?) presentations out there on tree theory use a single format (i.e.: The code base didn't properly handle globally changing the delimiter, let alone changing the format. So it required a lot of changes ( #296 #333 #455 #457 #458 #459 #460 #472 ). If I had unlimited resources and time then maybe I could have kept the branches up to date. But you saw how hard it would be to merge #481, and that is trivial compared to #571. Thank you for #597, but that doesn't give you the right to attack me.
Keep in mind that a lot has changed in both rails and ancestry since ruby 1.8 and rails 3.x. So my POC ended up with a few problems.
sort_by_ancestry(tree_nodes.take(50)).each do |node|
  puts "#{' ' * node.depth}- #{node.name}"
end
It would be nice if you could focus on trying to get a solution rather than saying that other people don't know what they are doing.
sort_by_ancestry(tree_nodes.limit(50).order(Arel.sql("name || '/' || id"))).each do |node|
  puts "#{' ' * node.depth}- #{node.name}"
end
Yea, I'm starting to like that |
Resurrecting this old PR
The main work was pulled out to a few prs:
When this PR started, hashes had an indeterminate order. So it was tricky to arrange these in a hash and have a consistent output order.
When a sorting block is not provided, the current goal is to have the output match the input order as closely as possible, just with the parent nodes coming before the children nodes.
This is working well, even when partial trees are provided with missing nodes. There are a few issues when the parent nodes are necessary to determine a proper order, but this is really only necessary when a sorting block is provided. As far as I'm concerned, we should be sorting in sql (using arranged_by_ancestry(additional, columns)) and not using a ruby block.
So the only remaining work left in this PR is fixing the code that converts from an arranged tree back to an array. This creates too many array objects. Thanks to the suggestions from @estepnv I was able to reduce the number of arrays and remove the recursive calls.
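As an illustration of the "input order, but parents first" goal, here is a minimal sketch using plain structs. It is not the gem's implementation: SketchNode and parents_first are hypothetical names, and the sketch assumes that a node whose direct parent is missing from the input (a partial tree) is simply treated as a root.

```ruby
# Hypothetical stand-in for a record; not an ancestry model.
SketchNode = Struct.new(:id, :parent_id)

# Returns nodes in input order, except that each parent is moved in front
# of its children. Nodes whose parent is absent are treated as roots.
def parents_first(nodes)
  present = {}
  nodes.each { |node| present[node.id] = true }

  children = {} # parent id => child nodes, in input order
  roots = []
  nodes.each do |node|
    if node.parent_id && present[node.parent_id]
      (children[node.parent_id] ||= []) << node
    else
      roots << node
    end
  end

  result = []
  stack = roots.reverse
  until stack.empty?
    node = stack.pop
    result << node
    # Push in reverse so siblings keep their input order.
    children[node.id]&.reverse_each { |child| stack.push(child) }
  end
  result
end

input = [
  SketchNode.new(5, 2),
  SketchNode.new(2, 1),
  SketchNode.new(1, nil),
  SketchNode.new(3, 1),
  SketchNode.new(9, 8) # parent 8 is not in the input
]
p parents_first(input).map(&:id)
```

Node 9 keeps its relative position among the roots even though its parent was filtered out, which is the partial-tree (pagination) case described above.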
For future me:
FUTURE: This has me wondering if we want to convert most of the recursive calls to linear ones across the codebase. Using an explicit stack instead of the call stack. Would need to run a few benchmarks to determine if it actually saved us that much memory. I can't imagine that a depth of 100 would blow this out of memory if we cut down on the temporary arrays created at each level.
FUTURE: Now that we have nailed down the sort order for canonical using binary columns / C sorting, we probably want to remove the ambiguous sorting and ensure the STRICT cases work. The DONECORRECT cases will probably never be implemented.
FUTURE: Do we want to enforce that this is called with ordered_by_ancestry_and(fields) and not order(fields)? I do like the way the sorting works for the &block case and it would be nice to get that into sql. Maybe make the case named_ancestry = ancestry.map(&:name).join('/') or ranked_ancestry = ancestry.map(&:rank).join('/')
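A small sketch of the ranked_ancestry idea in plain Ruby follows. Everything in it is an assumption for illustration: RankedNode, rank_path and the sample tree are hypothetical, and the tree was chosen so that the result matches the "proper sort: 2, 6, 5, 1, 3, 4" quoted earlier in the thread. The key is an array of integer ranks rather than a '/'-joined string, because string comparison would put "10" before "2".

```ruby
# Hypothetical stand-in for a record; ancestor_ids lists ancestors root-first.
RankedNode = Struct.new(:id, :ancestor_ids, :rank)

# Sort key: the ranks of all ancestors followed by the node's own rank.
# Assumes every ancestor is present in by_id (a full tree, not a partial one).
def rank_path(node, by_id)
  node.ancestor_ids.map { |id| by_id.fetch(id).rank } + [node.rank]
end

nodes = [
  RankedNode.new(1, [], 2),
  RankedNode.new(2, [], 1),
  RankedNode.new(3, [1], 1),
  RankedNode.new(4, [1], 2),
  RankedNode.new(5, [2], 2),
  RankedNode.new(6, [2], 1)
]
by_id = nodes.to_h { |node| [node.id, node] }

# Arrays compare element by element, and a prefix sorts before its
# extensions, so a parent always lands directly before its subtree.
sorted = nodes.sort_by { |node| rank_path(node, by_id) }
p sorted.map(&:id)
```

If such a path were materialized in a column, the same ordering could in principle come straight from the database, at the cost of updating descendants whenever a rank changes. Siblings with equal ranks would still need a tie-breaker.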