Consider support for segmented reductions and sorts specified by count-value representation #676

jaredhoberock · 2012-05-07T18:50:53Z

Consider this API:

// fills an array with runs of keys specified by the count of each run
// count_by_key is the inverse of fill_by_count
template<typename InputIterator1,
         typename InputIterator2,
         typename OutputIterator>
OutputIterator fill_by_count(InputIterator1 counts_first, InputIterator1 counts_last, InputIterator2 values_first, OutputIterator result);

// reduces contiguous runs of keys into their count
// fill_by_count is the inverse of count_by_key
template<typename InputIterator,
         typename OutputIterator,
         typename Compare>
OutputIterator count_by_key(InputIterator first, InputIterator last, OutputIterator result, Compare comp);

// count_by_key may be a bad name because it is not a vectorized version of thrust::count count_unique? unique_count?

// reduces contiguous runs of elements. each run is specified by its count
// counts come first like keys
template<typename InputIterator1,
         typename InputIterator2,
         typename OutputIterator,
         typename BinaryFunction>
OutputIterator reduce_by_count(counts_first, counts_last, values_first, result, binary_op);

// reduces contiguous runs of elements. each run's size is n, except for the last partial run
// values come first (cf. fill_n) because the last run could be partial
// XXX should n come first to match the previous?
template<typename Size,
         typename InputIterator,
         typename OutputIterator,
         typename BinaryFunction>
OutputIterator reduce_by_count(values_first, values_last, Size n, OutputIterator result, BinaryFunction binary_op);

// sorts contiguous runs of elements. each run's is specified by its count
template<typename RandomAccessIterator1,
         typename RandomAccessIterator2,
         typename Compare>
void sort_by_count(RandomAccessIterator1 counts_first, RandomAccessIterator1 counts_last, RandomAccessIterator2 values_first, Compare comp);

// sorts contiguous runs of elements. each run's size is n, except for the last partial run
template<typename Size,
         typename RandomAccessIterator,
         typename Compare>
void sort_by_count(RandomAccessIterator first, RandomAccessIterator last, Size n, Compare comp);

Why a count-value structure?

The mapping between keys and counts is obvious (there are two functions which are inverses of the other)
The important special case of uniform counts is easy for the user to find (it's just an overload)
and for the backend to target.

Why not a pairs of offsets, or range of ranges, or similar structure?

Because it forces the user to specify redundant data in the common case (continguous, non-overlapping values). It's easy to mess this up. There's no possibility of this kind of error with a count-value format (unless the user just gets the counts wrong). Indirection could be achieved as normal with a permutation_iterator.

Doesn't the count-value format imply that the implementation of each algorithm first has to perform an exclusive scan?

Under what conditions would an initial scan be expensive compared to the operation being performed?

If we preferred a different representation more convenient for the backend, would that simply force the user to perform the scan himself?

The text was updated successfully, but these errors were encountered:

wnbell · 2012-05-08T02:17:11Z

IMO a "RangeOfRanges" interface is preferable to counts because:

it's more general, allowing overlapping, disordered and non-contiguous subranges (without per-element indirection)
one can produce a RangeOfRanges immediately from the scanned counts using the existing fancy iterators (e.g. [0,5,10...]->[(0,5),(5,10)...]
such algorithms can be implemented directly (e.g. one CUDA block/warp per subrange), avoiding the nontrivial scan precomputation

It's arguable that RangeOfRanges is more prone to error than a count interface, but I think that can be solved by providing convenience wrappers for some index to RoR transformations, such as the one from scanned counts to RoR.

Note that there is some precedent for the scanned count representation in the context of sparse matrices. The "compressed sparse row" (CSR) format uses an array of N + 1 integers to describe the beginning and end of each sparse row.

FilipeMaia · 2012-05-10T23:29:59Z

I agree with Nathan, I think the more flexible solution is preferable.

jaredhoberock · 2012-05-21T21:06:55Z

It's true that a range-of-ranges formulation would be more general, but I think that this is a case where Worse is Better.

For the purposes of discussion, let's assume that the range-of-ranges proposal looks like

template<typename RangeOfRangesIterator>
void sort_each(RangeOfRangesIterator first, RangeOfRangesIterator last);

Where iterator_value<RangeOfRangesIterator>::type is some type Range such that begin(Range x) and end(Range x) return iterators delimiting a range of values.

One of my objections to sort_each is that it seems too targeted toward current CUDA limitations, i.e., that

nested kernel launch is currently impossible and
even supposing that nested kernel launch were possible, it's not clear how to exploit on-chip memory for the sort.

For any other backend, sort_each seems like sugar for the following:

template<typename RangeOfRangesIterator>
void sort_each(RangeOfRangesIterator first, RangeOfRangesIterator last)
{
  typedef typename iterator_value<RangeOfRangesIterator>::type Range;
  for_each(first, last, [](Range rng){
    sort(begin(rng), end(rng)
  });
}

I'm not sure how this algorithm adds value in any other context. It's true that it would eliminate quite a bit of verbosity, but so would

template<typenam RangeOfRangesIterator1, typename RangeOfRangesIterator2, typename Function>
void transform_each(RangeOfRangesIterator1 first1,
                    RangeOfRangesIterator1 last1,
                    RangeOfRangesIterator2 first2,
                    Function f)
{
  typedef typename iterator_value<RangeOfRangesIterator1>::type Range1;
  typedef typename iterator_value<RangeOfRangesIterator2>::type Range2;

  auto last2 = first2 + distance(first1,last1);

  for_each(zip(first1, first2), zip(last1,first2), [=](Range1 rng1, Range2 rng2){
    transform(begin(rng1),end(rng1),begin(rng2), f);
  });
}

In general I prefer an orthogonal interface, i.e. a small set of composable primitives (rather than the cross product thereof).

I guess my primary reasons for preferring sort_by_count over sort_each are because the interface is

stricter, allowing greater opportunity for exploitation, and
does not suggest the existence of a cross product of additional APIs, and
does not introduce the concept of Range, which does not appear elsewhere in the interface

upsj mentioned this issue Mar 18, 2022

Feature Request: Multiway count_if/copy_if #799

Open

jrhemstad added the thrust For all items related to Thrust. label Feb 22, 2023

jarmak-nv transferred this issue from NVIDIA/thrust Nov 8, 2023

github-actions bot assigned elstehle Nov 8, 2023

github-actions bot added the needs triage Issues that require the team's attention label Nov 8, 2023

jrhemstad removed the needs triage Issues that require the team's attention label Nov 8, 2023

NVIDIA deleted a comment from github-actions bot Nov 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider support for segmented reductions and sorts specified by count-value representation #676

Consider support for segmented reductions and sorts specified by count-value representation #676

jaredhoberock commented May 7, 2012

wnbell commented May 8, 2012

FilipeMaia commented May 10, 2012

jaredhoberock commented May 21, 2012

Consider support for segmented reductions and sorts specified by count-value representation #676

Consider support for segmented reductions and sorts specified by count-value representation #676

Comments

jaredhoberock commented May 7, 2012

wnbell commented May 8, 2012

FilipeMaia commented May 10, 2012

jaredhoberock commented May 21, 2012