Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider support for segmented reductions and sorts specified by count-value representation #676

Open
jaredhoberock opened this issue May 7, 2012 · 3 comments
Assignees
Labels
thrust For all items related to Thrust.

Comments

@jaredhoberock
Copy link

Consider this API:

// fills an array with runs of keys specified by the count of each run
// count_by_key is the inverse of fill_by_count
template<typename InputIterator1,
         typename InputIterator2,
         typename OutputIterator>
OutputIterator fill_by_count(InputIterator1 counts_first, InputIterator1 counts_last, InputIterator2 values_first, OutputIterator result);

// reduces contiguous runs of keys into their count
// fill_by_count is the inverse of count_by_key
template<typename InputIterator,
         typename OutputIterator,
         typename Compare>
OutputIterator count_by_key(InputIterator first, InputIterator last, OutputIterator result, Compare comp);

// count_by_key may be a bad name because it is not a vectorized version of thrust::count count_unique? unique_count?

// reduces contiguous runs of elements. each run is specified by its count
// counts come first like keys
template<typename InputIterator1,
         typename InputIterator2,
         typename OutputIterator,
         typename BinaryFunction>
OutputIterator reduce_by_count(counts_first, counts_last, values_first, result, binary_op);

// reduces contiguous runs of elements. each run's size is n, except for the last partial run
// values come first (cf. fill_n) because the last run could be partial
// XXX should n come first to match the previous?
template<typename Size,
         typename InputIterator,
         typename OutputIterator,
         typename BinaryFunction>
OutputIterator reduce_by_count(values_first, values_last, Size n, OutputIterator result, BinaryFunction binary_op);

// sorts contiguous runs of elements. each run's is specified by its count
template<typename RandomAccessIterator1,
         typename RandomAccessIterator2,
         typename Compare>
void sort_by_count(RandomAccessIterator1 counts_first, RandomAccessIterator1 counts_last, RandomAccessIterator2 values_first, Compare comp);

// sorts contiguous runs of elements. each run's size is n, except for the last partial run
template<typename Size,
         typename RandomAccessIterator,
         typename Compare>
void sort_by_count(RandomAccessIterator first, RandomAccessIterator last, Size n, Compare comp);

Why a count-value structure?

  • The mapping between keys and counts is obvious (there are two functions which are inverses of the other)
  • The important special case of uniform counts is easy for the user to find (it's just an overload)
    and for the backend to target.

Why not a pairs of offsets, or range of ranges, or similar structure?

Because it forces the user to specify redundant data in the common case (continguous, non-overlapping values). It's easy to mess this up. There's no possibility of this kind of error with a count-value format (unless the user just gets the counts wrong). Indirection could be achieved as normal with a permutation_iterator.

Doesn't the count-value format imply that the implementation of each algorithm first has to perform an exclusive scan?

Under what conditions would an initial scan be expensive compared to the operation being performed?

If we preferred a different representation more convenient for the backend, would that simply force the user to perform the scan himself?

@wnbell
Copy link

wnbell commented May 8, 2012

IMO a "RangeOfRanges" interface is preferable to counts because:

  • it's more general, allowing overlapping, disordered and non-contiguous subranges (without per-element indirection)
  • one can produce a RangeOfRanges immediately from the scanned counts using the existing fancy iterators (e.g. [0,5,10...]->[(0,5),(5,10)...]
  • such algorithms can be implemented directly (e.g. one CUDA block/warp per subrange), avoiding the nontrivial scan precomputation

It's arguable that RangeOfRanges is more prone to error than a count interface, but I think that can be solved by providing convenience wrappers for some index to RoR transformations, such as the one from scanned counts to RoR.

Note that there is some precedent for the scanned count representation in the context of sparse matrices. The "compressed sparse row" (CSR) format uses an array of N + 1 integers to describe the beginning and end of each sparse row.

@FilipeMaia
Copy link

I agree with Nathan, I think the more flexible solution is preferable.

@jaredhoberock
Copy link
Author

It's true that a range-of-ranges formulation would be more general, but I think that this is a case where Worse is Better.

For the purposes of discussion, let's assume that the range-of-ranges proposal looks like

template<typename RangeOfRangesIterator>
void sort_each(RangeOfRangesIterator first, RangeOfRangesIterator last);

Where iterator_value<RangeOfRangesIterator>::type is some type Range such that begin(Range x) and end(Range x) return iterators delimiting a range of values.

One of my objections to sort_each is that it seems too targeted toward current CUDA limitations, i.e., that

  1. nested kernel launch is currently impossible and
  2. even supposing that nested kernel launch were possible, it's not clear how to exploit on-chip memory for the sort.

For any other backend, sort_each seems like sugar for the following:

template<typename RangeOfRangesIterator>
void sort_each(RangeOfRangesIterator first, RangeOfRangesIterator last)
{
  typedef typename iterator_value<RangeOfRangesIterator>::type Range;
  for_each(first, last, [](Range rng){
    sort(begin(rng), end(rng)
  });
}

I'm not sure how this algorithm adds value in any other context. It's true that it would eliminate quite a bit of verbosity, but so would

template<typenam RangeOfRangesIterator1, typename RangeOfRangesIterator2, typename Function>
void transform_each(RangeOfRangesIterator1 first1,
                    RangeOfRangesIterator1 last1,
                    RangeOfRangesIterator2 first2,
                    Function f)
{
  typedef typename iterator_value<RangeOfRangesIterator1>::type Range1;
  typedef typename iterator_value<RangeOfRangesIterator2>::type Range2;

  auto last2 = first2 + distance(first1,last1);

  for_each(zip(first1, first2), zip(last1,first2), [=](Range1 rng1, Range2 rng2){
    transform(begin(rng1),end(rng1),begin(rng2), f);
  });
}

In general I prefer an orthogonal interface, i.e. a small set of composable primitives (rather than the cross product thereof).

I guess my primary reasons for preferring sort_by_count over sort_each are because the interface is

  1. stricter, allowing greater opportunity for exploitation, and
  2. does not suggest the existence of a cross product of additional APIs, and
  3. does not introduce the concept of Range, which does not appear elsewhere in the interface

@jrhemstad jrhemstad added the thrust For all items related to Thrust. label Feb 22, 2023
@jarmak-nv jarmak-nv transferred this issue from NVIDIA/thrust Nov 8, 2023
@github-actions github-actions bot added the needs triage Issues that require the team's attention label Nov 8, 2023
@jrhemstad jrhemstad removed the needs triage Issues that require the team's attention label Nov 8, 2023
@NVIDIA NVIDIA deleted a comment from github-actions bot Nov 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
thrust For all items related to Thrust.
Projects
Status: Todo
Development

No branches or pull requests

5 participants