Transform and reduce alternatives

Transform-reduce is a pattern in which a set of data is first modified by applying a transformation to each of its elements and is then reduced to a single value. In C++, this can be implemented straightforwardly with std::transform and std::accumulate. In C++17, an alternative for std::accumulate is available; std::reduce sums a range of elements just like std::accumulate, except that it does so out of order. That means you cannot use it with operators that are not commutative or associative (including overloads of operator+ that don’t exhibit these properties). On the other hand, there is yet another algorithm called std::transform_reduce that applies a functor to all the elements of a range and then reduces them, all in an out-of-order manner. And then there are also parallel versions of these algorithms. In this post, I will try to compare the performance of these possible alternatives for implementing transform-reduce.

Let’s start with a problem: we have a list of order prices on which we need to apply a discount; if the order price is greater than 100, then a 20% discount applies. We need to compute the total sum of all orders after discounts have been applied.

For convenience, we will use the following function to apply a discount on an order price:
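The original snippet is not reproduced here, but a minimal version could look like the following (using double for prices is an assumption; the threshold and discount rate follow the problem statement above):

// applies a 20% discount to prices greater than 100
inline double apply_discount(double const price)
{
   return price > 100 ? price * 0.8 : price;
}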

The traditional way to solve this problem is to first use std::transform to modify the elements of the range by applying the discount (with apply_discount) and then sum all the resulting values with std::accumulate. That should look like the following:
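A sketch of this first version could be (the name transform_and_reduce matches the function referenced later in the text; the in-place transform and the double element type are assumptions):

#include <algorithm>
#include <numeric>
#include <vector>

auto transform_and_reduce(std::vector<double>& prices)
{
   // apply the discount to each price, writing the result back in place
   std::transform(std::begin(prices), std::end(prices), std::begin(prices),
                  apply_discount);
   // sum all the discounted prices, starting from 0.0
   return std::accumulate(std::cbegin(prices), std::cend(prices), 0.0);
}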

In C++17, we can replace std::accumulate with std::reduce, since summing the elements of the prices range can be done in any order; the result would be the same. std::reduce has various overloads, but for this problem we need the one that takes the iterators bounding the range and an initial value (and implicitly uses std::plus<> to sum the elements).
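Only the final call changes; a sketch (std::reduce also lives in <numeric> in C++17):

auto transform_and_reduce(std::vector<double>& prices)
{
   std::transform(std::begin(prices), std::end(prices), std::begin(prices),
                  apply_discount);
   // overload taking the range bounds and an initial value;
   // std::plus<> is used implicitly to sum the elements
   return std::reduce(std::cbegin(prices), std::cend(prices), 0.0);
}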

C++17 also provides parallel versions of tens of algorithms, including std::transform and std::reduce, but not of std::accumulate. The algorithms that do have parallel versions have overloads that take an execution policy. This can be one of:

  • std::execution::seq: execution of the algorithm is sequential;
  • std::execution::par: execution of the algorithm may be parallelized on the calling thread or on another thread;
  • std::execution::par_unseq: execution of the algorithm may be parallelized, vectorized, or migrated across threads.

When using std::execution::seq, the execution is the same as when using an overload without an execution policy. On the other hand, std::execution::par and std::execution::par_unseq may execute the algorithm in parallel. par_unseq requires stronger guarantees than par: the function calls are unsequenced with respect to each other. Because of that, when using this policy it is not possible to perform vectorization-unsafe operations, such as allocating or deallocating memory, acquiring mutexes, or using non-lock-free std::atomic specializations.

With this in mind, we can rewrite the transform_and_reduce function to also specify an execution policy, as follows:
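A sketch of such an overload, taking the policy as a template parameter (the exact signature from the post is not shown here):

#include <execution>

template <typename ExecutionPolicy>
auto transform_and_reduce(std::vector<double>& prices, ExecutionPolicy&& policy)
{
   std::transform(policy, std::begin(prices), std::end(prices), std::begin(prices),
                  apply_discount);
   return std::reduce(policy, std::cbegin(prices), std::cend(prices), 0.0);
}

It would be called, for example, as transform_and_reduce(prices, std::execution::par).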

This can, however, be replaced with the new C++17 standard algorithm std::transform_reduce. This again has multiple overloads to support different scenarios and needs, but the one we are interested in takes a policy, the iterators bounding the range, an initial value, a binary functor to reduce the values (we can use std::plus<>), and a unary functor to transform the range elements.
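A possible sketch, with an illustrative name (transform_reduce_prices is not from the original post):

#include <functional>  // std::plus
#include <numeric>     // std::transform_reduce

template <typename ExecutionPolicy>
auto transform_reduce_prices(std::vector<double> const& prices, ExecutionPolicy&& policy)
{
   return std::transform_reduce(policy,
                                std::cbegin(prices), std::cend(prices),
                                0.0,              // initial value
                                std::plus<>(),    // binary reduce operation
                                apply_discount);  // unary transform operation
}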

The question is, how do these perform compared to each other for various range sizes? To check that, I have written the following testing program. It generates vectors of random values, from 100 to 100 million elements, calls all these functions with all three possible execution policies, and prints the results.
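The full testing program is not reproduced here; a condensed sketch of the idea, timing only a couple of sizes and policy combinations and relying on the functions sketched above, could look like this:

#include <algorithm>
#include <chrono>
#include <execution>
#include <iostream>
#include <random>
#include <vector>

// measures a single call and returns the elapsed time in microseconds
template <typename F>
auto measure(F&& f)
{
   auto const start = std::chrono::high_resolution_clock::now();
   f();
   auto const end = std::chrono::high_resolution_clock::now();
   return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}

int main()
{
   std::mt19937 gen(std::random_device{}());
   std::uniform_real_distribution<double> dist(1.0, 200.0);

   // a subset of the sizes used in the actual tests (100 to 100 million elements)
   for (size_t const size : { 100u, 100'000u, 100'000'000u })
   {
      std::vector<double> data(size);
      std::generate(std::begin(data), std::end(data), [&] { return dist(gen); });

      auto copy = data;  // transform_and_reduce modifies its input in place
      std::cout << size << '\t'
                << measure([&] { return transform_and_reduce(copy, std::execution::seq); }) << '\t'
                << measure([&] { return transform_reduce_prices(data, std::execution::par); }) << '\n';
   }
}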

Here is an output example (Visual Studio 2017 15.6, release build for x64):

What I can see from here is that:

  • up to 50,000 elements, std::transform + std::accumulate, sequential std::transform + std::reduce, and std::transform_reduce have similar times
  • after 50,000 elements, the parallel version of std::transform_reduce performs the best, with parallel std::transform + std::reduce coming close
  • the par_unseq version of std::transform + std::reduce is slightly better than the par version after more than 10 million elements; that is not the case for std::transform_reduce, whose vectorized version is only better at around 10 million elements.

Of course, you could argue that the call to std::transform is not actually needed here and that the discount can be applied while computing the sum. Although this has different semantics, these functions could simply be rewritten as follows:
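A sketch of such a rewrite (the function names mirror those used above; the lambda folds the discount into the accumulation step):

auto transform_and_reduce(std::vector<double> const& prices)
{
   return std::accumulate(std::cbegin(prices), std::cend(prices), 0.0,
                          [](double const total, double const price) {
                             return total + apply_discount(price);
                          });
}

template <typename ExecutionPolicy>
auto transform_and_reduce(std::vector<double> const& prices, ExecutionPolicy&& policy)
{
   return std::reduce(policy, std::cbegin(prices), std::cend(prices), 0.0,
                      [](double const total, double const price) {
                         return total + apply_discount(price);
                      });
}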

In this case, however, you cannot execute transform_and_reduce with the par or par_unseq policies because they would not yield correct results.

Let us complicate the problem a bit and consider a list of orders, each order having a quantity and a price per item. We should again calculate the total price of the orders, by summing individual order prices (quantity * price) after applying a discount the same way we did earlier. We could use the following order structure:
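A minimal version of the structure, with only the two fields mentioned above (the original may have more members):

struct order
{
   int    quantity;
   double price;
};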

The implementation using std::transform + std::accumulate could look as follows:
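A sketch of this version, applying the discount in place and then accumulating with a binary functor that mixes double and order (which matters for the discussion that follows):

auto transform_and_reduce(std::vector<order>& orders)
{
   // apply the discount to the price of each order, in place
   std::transform(std::begin(orders), std::end(orders), std::begin(orders),
                  [](order o) { o.price = apply_discount(o.price); return o; });

   // sum quantity * price over all orders
   return std::accumulate(std::cbegin(orders), std::cend(orders), 0.0,
                          [](double const total, order const& o) {
                             return total + o.quantity * o.price;
                          });
}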

The alternative with std::transform + std::reduce is not as straightforward as it may seem. First of all, std::reduce cannot use the same binary functor as std::accumulate, because of the following requirements:

  • T, the type of the initial value, must meet the requirements of MoveConstructible, and
  • binary_op(init, *first), binary_op(*first, init), binary_op(init, init), and binary_op(*first, *first) must be convertible to T.

That means we must perform a certain trick to make std::reduce work:

  • the type of the initial value should be order and not long double, and
  • the return type of the binary operation should also be order; this functor would actually return a new order value with the quantity being irrelevant (and set to zero) and the price being the accumulated total price.

However, this no longer makes it possible for std::reduce to execute in parallel and yield correct results. Therefore, the implementation in this case could be as follows:
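A sketch following the trick described above (the accumulator-style lambda treats its first argument as the running total, which is exactly why it is neither commutative nor associative and cannot safely be run out of order or in parallel):

auto transform_and_reduce(std::vector<order>& orders)
{
   std::transform(std::begin(orders), std::end(orders), std::begin(orders),
                  [](order o) { o.price = apply_discount(o.price); return o; });

   // both the initial value and the result of the binary operation are orders;
   // the quantity of the result is irrelevant (set to zero) and the price
   // holds the accumulated total
   auto const total = std::reduce(std::cbegin(orders), std::cend(orders),
                                  order{ 0, 0.0 },
                                  [](order const& acc, order const& o) {
                                     return order{ 0, acc.price + o.quantity * o.price };
                                  });
   return total.price;
}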

This does not look great, and it is exactly where std::transform_reduce comes to the rescue. This standard algorithm allows us to supply a unary operation to transform each element of the input range; we can use a lambda that returns quantity * apply_discount(price). Then, we can use the binary functor std::plus<> to sum the results of the unary operation, starting from an initial value. And this can all be done in parallel or even parallel unsequenced.
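A sketch of this version (the function name is illustrative):

template <typename ExecutionPolicy>
auto transform_reduce_orders(std::vector<order> const& orders, ExecutionPolicy&& policy)
{
   return std::transform_reduce(policy,
                                std::cbegin(orders), std::cend(orders),
                                0.0,             // initial value
                                std::plus<>(),   // binary reduce operation
                                [](order const& o) {
                                   return o.quantity * apply_discount(o.price);
                                });
}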

The results in this case look like the following:

What we can see from here is that:

  • std::transform + std::accumulate performs much better than std::transform + std::reduce
  • std::transform_reduce performs better than either of the two, regardless of whether it is sequential or parallel, after about 10,000 elements
  • the parallel unsequenced version is better or much better compared to the sequential and parallel runs when the size of the input range exceeds about 50,000 elements.

Conclusion

In C++17, there are various alternatives for implementing the transform-reduce pattern: sequentially, in parallel, or even in parallel and vectorized. These can accommodate different needs, but the performance may differ depending on the actual problem you are solving and the size of the input datasets. Therefore, you should use the one that fits your needs the best.

2 Comments on "Transform and reduce alternatives"


  1. Your testing code has a couple of typos: on L60, std::execution::seq should be std::execution::par, and on L65, std::execution::seq should be std::execution::par_unseq.

    In your final test, last bullet, you note that for the larger vector sizes par_unseq usually outperforms par.
    However, for the largest size you tested, par gives 203918 and par_unseq gives the much slower value of 829904.
    Can you comment on why this happens at that size?


  2. Thanks for the good catches. I have fixed the code issues. I have also run the tests again multiple times, and although I have seen similar results in a couple of cases, most of the runs do have better results for the vectorized version of transform_reduce. Perhaps yesterday I did not pay enough attention to all the numbers and ended up with some worse results. I guess some anomaly with external factors generated those large numbers. I have updated the results with an output similar to what I get most of the time.
