Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make arrange_nodes more memory efficient #415

Merged
merged 4 commits into from
Mar 5, 2023

Conversation

kbrock
Copy link
Collaborator

@kbrock kbrock commented Nov 1, 2018

Resurrecting this old PR

The main work was pulled out to a few prs:

When this PR started, hashes had an indeterminant order. So it was tricky to arrange these in a hash and have a consistent output order.

When a sorting block is not provided, the current goal is to have the output match the input order as closely as possible, just with the parent nodes coming before the children nodes.

This is working well, even when partial trees are provide with missing nodes. There are a few issues when the parent nodes are necessary to determine a proper order, but this is really only necessary when a sorting block is provided. As far as I'm concerned, we should be sorting in sql (using arranged_by_ancestry(additional, columns)) and not using a ruby block.

So the only remaining work left in this PR is fixing the code that converts from an arranged tree back to an array. This creates too many array objects. Thanks to the suggestions from @estepnv I was able to reduce the number of arrays and remove the recursive calls.


For future me:

FUTURE: This has me wondering if we want to convert most of the recursive calls to linear ones across the codebase. Using an explicit stack instead of the call stack. Would need to run a few benchmarks to determine if it actually saved us that much memory. I can't imagine that a depth of 100 would blow this out of memory if we cut down on the temporary arrays created at each level.

FUTURE: Now that we have nailed down the sort order for canonical using binary columns / C sorting, we probably want to remove the ambiguous sorting and ensure the STRICT cases work. The CORRECT cases will probably never be implemented. DONE

FUTURE: Do we want to enforce that this is called with ordered_by_ancestry_and(fields) and not order(fields)? I do like the way the sorting works for the &block case and it would be nice to get that into sql. Maybe make the case named_ancestry = ancestry.map(&:name).join('/') or ranked_ancestry = ancestry.map(&:rank).join('/')`

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.009%) to 96.486% when pulling c323a6f on kbrock:partial_tree_sort into cbaa20a on stefankroes:master.

2 similar comments
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.009%) to 96.486% when pulling c323a6f on kbrock:partial_tree_sort into cbaa20a on stefankroes:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.009%) to 96.486% when pulling c323a6f on kbrock:partial_tree_sort into cbaa20a on stefankroes:master.

@coveralls
Copy link

coveralls commented Nov 1, 2018

Coverage Status

Coverage decreased (-0.02%) to 96.438% when pulling e61229a on kbrock:partial_tree_sort into f4610ad on stefankroes:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.009%) to 96.486% when pulling c323a6f on kbrock:partial_tree_sort into cbaa20a on stefankroes:master.

@kbrock kbrock force-pushed the partial_tree_sort branch 2 times, most recently from c3300b0 to 6e9e861 Compare November 1, 2018 06:21
@kbrock kbrock force-pushed the partial_tree_sort branch from 6e9e861 to 512f457 Compare November 1, 2018 19:48
@kbrock kbrock changed the title Fixup Sorting Fixup Sorting across partial trees Nov 1, 2018
@Fryguy
Copy link
Contributor

Fryguy commented Nov 1, 2018

Can you give a high level on the algorithm? It seems that this changes more than just arrange_nodes, and I'm curious if the changes in the other methods are compatible.

@kbrock
Copy link
Collaborator Author

kbrock commented Nov 1, 2018

I feel the reason to only have partial results is pagination. (this can be generalized for any filtering method instead of limit, but choosing this just for simplification of illustration.

model.sort_by_ancestry(model.all).limit(5)} ===
  model.sort_by_ancestry(model.all.limit(5))

In these tests our ordering is:

      # - n1 (0)
      #   - n5 (0)
      #   - n3 (3)
      # - n2 (1)
      #   - n4 (0)
      #   - n6 (1)

The n1 < n2 need to be present to properly place their children (n5, n3) < (n4, n6)
The children have this order not because of any of their own information, but rather because of their parent's information.

Let me know if this is clear (I fear I've become a broken record around the issue of sorting with a rank and not having parent nodes)

@kbrock kbrock force-pushed the partial_tree_sort branch from 512f457 to 432be60 Compare November 2, 2018 00:07
@kbrock kbrock changed the title Fixup Sorting across partial trees [WIP] Fixup Sorting across partial trees Nov 2, 2018
@kbrock
Copy link
Collaborator Author

kbrock commented Nov 2, 2018

WIP: I'm splitting this up. I also don't like the way the tests are all in 2 methods - which forces comments.

@estepnv
Copy link

estepnv commented Apr 26, 2019

@kbrock I faced the issue when #sort_by_ancestry throws stack level too deep error when a too deep tree is given as an argument

I wrote a loop implementation, maybe you could adapt it to your changes so it could be merged with your patch. thanks

  def self.sort_by_ancestry(nodes, &block)
    nodes_index = {}
    nodes.each { |node| nodes_index[node.id] = node }

    root_key = -1
    tree = { root_key => [] }

    nodes.each do |node|
      tree[node.parent_id || root_key] ||= []
      tree[node.parent_id || root_key] << node.id
    end

    tree.keys
      .select { |node_id| nodes_index[node_id].blank? && node_id != root_key }
      .each { |node_id| tree[root_key] += tree.delete(node_id) }

    tree[root_key].delete(root_key)

    presorted_tree = tree.each do |parent_id, children_nodes|
      sorted_children_nodes = children_nodes.sort do |a_id, b_id|
                a_node = nodes_index[a_id]
        b_node = nodes_index[b_id]

        a_cestry = a_node.ancestry || '0'
        b_cestry = b_node.ancestry || '0'

        if block && a_cestry == b_cestry
          block.call(a_node, b_node)
        else
          a_cestry <=> b_cestry
        end
      end

      tree[parent_id] = sorted_children_nodes
    end

    __result = []

    stack = [-1]

    while stack.any?
      current_parent_id = stack.pop

      next if presorted_tree[current_parent_id].blank?

      node_id = presorted_tree[current_parent_id].shift
      node = nodes_index[node_id]
      __result << node if node.present?

      stack.push(current_parent_id) if presorted_tree[current_parent_id].present?
      stack.push(node_id) if presorted_tree[node_id].present?
    end

    __result
  end

it builds a normalized tree which represents all nodes
then it traverses it using stack populating the result array

@kbrock kbrock force-pushed the partial_tree_sort branch 2 times, most recently from 92a8d99 to 592350a Compare February 24, 2023 04:42
@kshnurov
Copy link
Contributor

@kbrock I don't quite understand what the hell is even happening here - arranging a tree and then flattening it instead of just doing .order(:ancestry) or .sort_by(&:ancestry)? Why would anyone need that?
And why do we need to invent a bicycle flatten_arranged_nodes when there's a Ruby's Hash#flatten method?

@estepnv
Copy link

estepnv commented Feb 24, 2023

@kbrock this is an old PR and I remember I was adding some changes to my local project patch. I’ll check it and comment later. Thanks

@kbrock
Copy link
Collaborator Author

kbrock commented Feb 28, 2023

@estepnv Thanks. I stumbled again across your OLD comment (sorry) when trying to close old PRs/Issues. We had already gotten a bunch of this PR in, so wanted to just close this and not think about it anymore. Also feel like it can be a pre-cursor to getting closer to pure database sorting.


@kshnurov This has been part of the code base for over 10 years. It was really necessary back before hashes were ordered (pre ruby 1.9) and I'm looking to see what we can remove here.

Something like this is necessary for all implementations of materialized path, and it is an advertised advantage in the closure_trees gem.

The sort_by_ancestry class method: TreeNode.sort_by_ancestry(array_of_nodes) can be used
to sort an array of nodes as if traversing in preorder. (Note that since materialised path
trees do not support ordering within a rank, the order of siblings is
dependant upon their original array order.)
-- README.md

I am reworking this method because it was left open and it was addressing a memory issue that a few people like @estepnv were seeing.

Users who use ancestry for things like threaded comment want order and pagination to work. They want sorting on partial trees returned (read: paginated results).

I would like to get to a solution that sorts in the database, but it is currently not as simple as order(:ancestry, :name) or arrange_by_ancestry_and(:name). The way this is setup, you need to put nodes into a tree to handle the rank among the various levels.

Also:

{1 => {2 => {}, 3 => {}}}.flatten
# => [1, {2=>{}, 3=>{}}]
flatten_arranged_nodes({1 => {2 => {}, 3 => {}}})
# => [1, 2, 3]

I also tried just returning the nodes and not throwing into/out of a tree, but as I expected, they come back in a different order.


I think possibly introducing ancestry derived columns (e.g.: ancestry.split.map(&:name).join or ancestry.split.map(&:rank).join) would give users some commonly requested features and finally nail down sorting once and for all

It would be best to not have to materializing the orders (and forcing a large database update when node orders change). Stuff like this is why I would like to find a way to better join from ancestry_ids to the associated ancestry records in pure SQL.

@kshnurov
Copy link
Contributor

kshnurov commented Feb 28, 2023

I'm looking to see what we can remove here.

All of it? I don't see how any of this legacy code could be useful when you can do native DB/Ruby sorting.

Users who use ancestry for things like threaded comment want order and pagination to work. They want sorting on partial trees returned (read: paginated results).

Pagination is done on top-level elements (you don't want to count nested comments), so you just do something like this:

comments = obj.comments.where(ancestry: Comment.ancestry_root).preload(:descendants).order(:created_at).page(1)
comments.each { |c| c.descendants.arrange }

I would like to get to a solution that sorts in the database, but it is currently not as simple as order(:ancestry, :name) or arrange_by_ancestry_and(:name).

Why not? It's as simple as that.

@kshnurov
Copy link
Contributor

Oh yeah, flatten doesn't flatten hashes. Well:

def flatten_arranged_nodes(h)
  h.flat_map { |k, v| [k]+flatten_arranged_nodes(v) }
end

@kbrock
Copy link
Collaborator Author

kbrock commented Feb 28, 2023

@kshnurov Thank you for enthusiasm and expertise.

As a gem maintainer, I do my best to keep a gem working in a consistent manner for all applications that use it. Changing the way sorting and pagination works is not viable.

The goal was to remove recursion and not create extra arrays. That way it would alleviate the out of memory error.

  def test_sort_by_ancestry_with_block_full_tree # current code
    AncestryTestDatabase.with_model :extra_columns => {:rank => :integer} do |model|
      n1, n2, n3, n4, n5, n6 = build_ranked_tree(model)
      sort = -> (a, b) { a.rank <=> b.rank }

      records = model.sort_by_ancestry(model.all.order(:id).reverse, &sort)
      assert_equal [n1, n5, n3, n2, n4, n6].map(&:id), records.map(&:id)
    end
  end

  def test_sort_by_ancestry_with_block_full_tree_ks # your suggestion
    AncestryTestDatabase.with_model :extra_columns => {:rank => :integer} do |model|
      n1, n2, n3, n4, n5, n6 = build_ranked_tree(model)

      records = model.all.ordered_by_ancestry_and(:rank)
      assert_equal [n1, n5, n3, n2, n4, n6].map(&:id), records.map(&:id)
    end
  end

I too wish the second one worked. In theory, we could just sort by ancestry and rank and be done with it.

If you can find a way to get that test working, that would be great.
None of your suggestions work. Most just throw errors.

@kshnurov
Copy link
Contributor

kshnurov commented Mar 1, 2023

Changing the way sorting and pagination works is not viable.

I'm not suggesting to change it, I'm just telling you can do that on DB level if your sorting is based on db columns.
If it's not (some really custom block) - I think whoever needs that should introduce his own implementation.

  def test_sort_by_ancestry_with_block_full_tree_sql_sort
    AncestryTestDatabase.with_model :extra_columns => {:rank => :integer} do |model|
      n1, n2, n3, n4, n5, n6 = build_ranked_tree(model)

      records = model.flatten_arranged_nodes(model.all.order(:ancestry, :rank).arrange)
      assert_equal [n1, n5, n3, n2, n4, n6].map(&:id), records.map(&:id)
    end
  end

works perfectly, flatten_arranged_nodes is the one liner I provided above.
However, I don't see how this could be used for pagination, if you paginate on model.all.order(:rank), you'll get incorrect pagination counting all nested comments.
And the sort_by_ancestry method name is quite confusing, since it doesn't just sort but also flattens.

The goal was to remove recursion

But your method is still recursive!

@kbrock kbrock force-pushed the partial_tree_sort branch from 9daee98 to b812927 Compare March 2, 2023 01:42
@kbrock
Copy link
Collaborator Author

kbrock commented Mar 2, 2023

I felt like the flatten arranged nodes are no longer recursive. I must have been looking at my code for too long because I just don't see the recursion.

I revisited the tests. I'm questioning some of the use cases.

It is hard to know if some of these use cases are due to locale messing up the ordering, ruby 1.8 messing up the ordering, or order(:ancestry, :rank) working properly on leaf nodes but not on tree nodes.

@kshnurov
Copy link
Contributor

kshnurov commented Mar 2, 2023

I just don't see the recursion.

This is a pure recursion:

cur = arranged
while cur.present?
  children = cur.first

  if children.present?
    cur = children
  end
end

@kbrock
Copy link
Collaborator Author

kbrock commented Mar 2, 2023

re recursion: (not really the problem at hand)

Do you mean traversal? I agree with you that the approach to solving this problem is to traverse the children. And it needs to be done by looking at each level at a time.

By recursion, I mean specifically using the call stack to store temporary sate, and calling your own function to setup new state for the next level. And in our case, using recursion creates a lot of temporary variables. So I had rewritten it to not create so many objects. But then I read the suggestion and went ahead to remove recursion all together.

https://en.wikipedia.org/wiki/Recursion_(computer_science)#:~:text=In%20computer%20science%2C%20recursion%20is,from%20within%20their%20own%20code.

My solution was close to yours (I'll use yours as a base since I don't have mine handy):

def flatten_arranged_nodes(h, final = [])
  h.each { |k, v| final << k ; flatten_arranged_nodes(v, final) unless v.empty? }
  final
end

I hope we are in agreement that the above one is recursive. It is calling itself.
I am not sure if the explicit stack (my implementation) or the call stack (recursive) one is better for memory. I do wish there were a way to pre allocate the array since we know the number of nodes in there.

But this all seems like an academic debate.

@kshnurov
Copy link
Contributor

kshnurov commented Mar 2, 2023

Oh, so the problem is memory usage, not recursion.
In that case your last solution should be the best. I'd just remove empty?, it's likely a waste.

def flatten_arranged_nodes(h, final = [])
  h.each { |k, v| final << k ; flatten_arranged_nodes(v, final) }
  final
end

I am not sure if the explicit stack (my implementation) or the call stack (recursive) one is better for memory.

Well, ruby is a black box here. IDK if stack.push/pop and flatten_arranged_nodes(v, final) can reference hash part instead of copying it. If it can't - it'll eat memory anyway, but the function is much easier to read. You'll need to test and count allocations to figure it out.

@kbrock
Copy link
Collaborator Author

kbrock commented Mar 2, 2023

The core problem we are trying to solve:

What is the most efficient way to sort nodes using rank? (memory and time)

For this conversation, lets assume that rank is in the database.

  • 1 (rank = 1)
    • 3 (0)
    • 4 (1)
  • 2 (0)
    • 5 (1)
    • 6 (0)

proper sort: 2, 6, 5, 1, 3, 4
order(:ancestry,:rank): 2, 1, 3, 4, 6, 5

  • So we need to find the cheapest way to to the proper tree order
  • If possible, it would be cool if we could do it in the database.
  • If possible, it would be cool if we did not have to do a tree at all.
  • It needs to work without all nodes being present

Question, can we assume the caller has sorted the nodes into ancestry,rank coming in?

@kshnurov
Copy link
Contributor

kshnurov commented Mar 2, 2023

What is the most efficient way to sort nodes using rank? (memory and time)

And what exactly is the goal of such "sorting" (flattening in fact) and why this gem should do it after all? It sounds like some really custom "sorting".

If I have e.g. nested comments - I need a tree, not a flat array, and a tree is perfectly sorted with .order(:ancestry, :rank).arrange. If I'm paginating such comments - I'm paginating on top-level ones. Otherwise, I'll get a broken tree on page 2+ because such an array won't contain necessary parents.

@kbrock
Copy link
Collaborator Author

kbrock commented Mar 3, 2023

@kshnurov I appreciate that your use cases allow you to paginate by the parent node, and you do not want to do sql only updates. But please respect that other people have different constraints. Also appreciate that backwards compatibility is concern.

Yes, we are sorting/arranging a tree and then flattening it.
If you think about it, most of the time when you output a tree, you will output it in a linear fashion, so you basically flatten it.

sort_by_ancestry(tree_nodes.take(50)).each do |node|
  puts "#{node.depth * ' '}- #{node.name}"
end

# vs

def print_tree(node_hash, count)
  node_hash.each do |n, children|
    puts "#{node.depth * ' '}- #{node.name}"
    count -= 1
    break if count < 0
    count = print_tree(children, count)
    break if count < 0
  end
  count
end
print_tree(tree_nodes.each_with_object({}) { |n,t| t.merge(n.arrange) }, 50)

Even order(ancestry,id) (id is basically created_at) does not work for me and many others. It returns all root nodes before returning any child nodes. So you have to perform multiple queries or fetch every node in the tree. And this does not work with users that have deep or big trees.

Interestingly, if ancestry stored "/ancestry/id", then order(ancestry) would work great.

@kbrock kbrock mentioned this pull request Mar 3, 2023
@kshnurov
Copy link
Contributor

kshnurov commented Mar 3, 2023

you do not want to do sql only updates

It's just completely broken, I've already explained why. If you want to keep broken functionality in this gem, as you did with 4.2.0/materialized_path2 - I'm not gonna stop you.

If you think about it, most of the time when you output a tree, you will output it in a linear fashion, so you basically flatten it.

I sort everything with a database, arrange into a tree, and then flatten/output with a simple recursion/partials however I want: flatten_arranged_nodes(model.all.order(:ancestry, :rank).arrange). I don't want to sort with Ruby and I don't need to output a broken tree, but if I would - I'll just write another two lines for it and wouldn't expect it from this gem.

sort_by_ancestry(tree_nodes.take(50)).each do |node|
  puts "#{node.depth * ' '}- #{node.name}"
end

This is SO broken, I'm surprised you even write this. What are you sorting here, 50 random nodes? It's as bad as doing Model.first(50).sort_by(&:ancestry). You'll need to load/sort ALL records and then .take(50). Like all of them, the whole database! You call that "pagination"?

@kbrock kbrock changed the title Make arrange_nodes non-recursive Make arrange_nodes more memory efficient Mar 4, 2023
kbrock added 4 commits March 3, 2023 23:38
We are sorting records that come back from a query.
That means certain assumptions can be made about the records coming back
It will not contain cousins without at least one parent node in there.
- Only preforms one compare (<=>) vs two ( == and <=> )
- null ancestry is a space (ensures it sorts first when using materialized path 2)

But it does not actually change the sorting algorithm
It is still sorting levels by id (not optimal)
And still issues when missing nodes
this is the part responsible for taking a tree and producing an array of nodes
with the parent node before the children

Since children is always a hash, this will never enter the sorting code
and never require the block. so the sorting and the &block was dropped.

This works because ruby enumerates the hashes with insert order (no need to sort again)
Since this is in its own method, we no longer have such a big call stack
Also, we are no longer creating a 2 new arrays for every set of children
@kbrock kbrock force-pushed the partial_tree_sort branch from b812927 to 719a457 Compare March 4, 2023 05:02
@kbrock
Copy link
Collaborator Author

kbrock commented Mar 4, 2023

Ran benchmarks on the various implementations of flatten_arranged_nodes

I ran:

I was not able to get a good number on the non recursive one. It modifies the array so benchmark-ips was not practical. I agree that it is too complicated and I had trouble making it non modifying without allocating a bunch of arrays.

My first attempt still used inject, but I think I redid it without inject. Can't remember. I dropped it to go for the non-recursive version (which was a bust).

data 19 wide 3 deep = 6859 nodes

method ips memsize
orig 13,808.2 i/s 67,240 bytes
pre 16,414.6 i/s 18,768 bytes
[]+arrange_nodes 6,759.3 i/s 77,608 bytes
arrange_nodes([]) empty? 33,238.5 i/s 3,528 bytes
arrange_nodes([]) 20,048.3 i/s 3,528 bytes

data 2 wide 12 deep = 4095 nodes

method ips memsize
orig 635.8 i/s 1,007,040 bytes
pre 771.7 i/s 203,688 bytes
[]+arrange_nodes 489.5 i/s 1,468,384 bytes
arrange_nodes([]) empty? 1,658.1 i/s 39,888 bytes
arrange_nodes([]) 1,349.7 i/s 39,888 bytes

data 6 wide 5 deep = 7776 nodes

method ips memsize
orig 2,331.6 i/s 315,136 bytes
pre 3,094.6 i/s 79,944 bytes
[]+arrange_nodes 1,515.4 i/s 387,552 bytes
arrange_nodes([]) empty? 6,538.3 i/s 17,744 bytes
arrange_nodes([]) 4,315.5 i/s 17,744 bytes

removing the array allocation made all the difference.
Keeping the empty? is ~50% faster.
Depth really increases the memory usage (hence wanting to avoid recursion)

@estepnv This is still recursive, but for any reasonable depth, it takes up 96% less memory. Deeper will have even bigger savings.

Let me know if you think this will work for you

@kbrock
Copy link
Collaborator Author

kbrock commented Mar 4, 2023

I appreciate that your use cases allow you to paginate by the parent node, and you do not want to do sql only updates. But please respect that other people have different constraints. Also appreciate that backwards compatibility is a concern.

It's just completely broken, I've already explained why. If you want to keep broken functionality in this gem, as you did with 4.2.0/materialized_path2 - I'm not gonna stop you.

Let me reiterate: Everyone has a reason for how they act.
Just because you don't understand their reasoning doesn't mean they they are stupid.

Sql only updates

Rails provides callback friendly and database only approaches:

class Model
  has_many :orders, :dependent => :delete  # DB
  has_many :orders, :dependent => :destroy # Ruby
end

Model.delete  # DB
Model.destroy # Ruby

Some problems can be done in pure sql while others require ruby.
I have converted some jobs from many hours to less than a minute by properly leveraging sql.

Ancestry provides both options, as well.
What you call broken is a legitimate and necessary feature.

Materialized Path 2

Even though most (all?) presentations out there on tree theory use a single format (i.e.: 1/2/3) for the path, I had wanted to explore options that were more sql friendly. Wanting a better solution to #246 had me devising a strategy to get my POCs back into the code base.

The code base didn't properly handle globally changing the delimiter, let alone changing the format. So it required a lot of changes ( #296 #333 #455 #457 #458 #459 #460 #472 )
But I was getting tired of keeping the feature branch in sync with master, so come #563 I knew I didn't have much steam left and merged #571 as an experimental and undocumented feature open for comment.

If I had unlimited resources and time then maybe I could have kept the branches up to date. But you saw how hard it would be to merge #481 and that is trivial compared to #571

Thank you for #597 but that doesn't give you the right to attack me. Keep in mind that a lot has changed in both rails and ancestry since ruby 1.8 and rails 3.x. So my POC ended up with a few problems.


sort_by_ancestry(tree_nodes.take(50)).each do |node|
  puts "#{node.depth * ' '}- #{node.name}"
end

This is SO broken, I'm surprised you even write this.

It would be nice if you could focus on trying to get a solution rather than saying that other people don't know what they are doing.

sort_by_ancestry(tree_nodes.limit(50).order(Arel.sql("name || '/' || id")).each do |node|
  puts "#{node.depth * ' '}- #{node.name}"
end

Yea, I'm starting to like that materialized_path_id format more and more. Or maybe it just requires a creative index.

@kbrock kbrock merged commit aeab49d into stefankroes:master Mar 5, 2023
@kbrock kbrock deleted the partial_tree_sort branch March 5, 2023 03:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants