A recent question on stackoverflow raised the problem of a fast algorithm to count the unique colors in an image (Faster algorithm to check the colors in a image). My answer what that this kind of problems are suited for parallelization. However, parallelization may only help when used judiciously.
To demonstrate the point I have written a pretty simple implementation in VC++ (with C++11), that uses a std::set to store the unique colors. The size of the set is the number of the colors. The implementation simply iterates over the pixels of the image and adds them to the set.
There are probably better solutions than the one presented in this article. The code shown here is merely for the purpose of showing how parallelization can help, though not everything that runs in parallel is faster that a sequential code.
There are some things to note:
- I used CImage class to load an image from disk
- images can be 1,4,8,16,24 or 32 bit per pixel, but in the code presented here, for simplicity, I assume the image loaded from disk is 24 bits per pixel
- Windows device-independent bitmaps can be stored in memory in two ways: top-down (as you would naturally expect) or bottom-up. The processing of the image varies based on the storage. You can find details here: Top-Down vs. Bottom-Up DIBs.
- CImage::GetPitch method returns the pitch of the bitmap, which is the distance in bytes between the two memory addresses representing the beginning of two consecutive lines of the bitmap. This is important because the bitmap can be stored (align) with padding bytes at the end of a line, and the pitch and the width are used to determine the number of padding bytes. If the pitch is positive, then the bitmap is stored top-down. If the pitch is negative then the image is stored bottom-up.
- the pointer returned by CImage::GetBits points to the beginning of the buffer if the image is top-down, and to the end of the buffer if the image is bottom-up.
- since the image data is stored in contiguous array, the position of each pixel is given by the following algorithm:
let ptr be the beginning of the buffer
foreach row from 0 to height
foreach column from 0 to width
if top-down // ptr point to the beginning of the buffer
pixel = ptr + row * pitch + column * bytesperpixel
else bottom-up // ptr points to the end of the buffer
pixel = ptr - row * pitch + column * bytesperpixel
Since the pitch is positive when the bitmap is stored top-down and negative when then image is stored bottom-up, the above algorithm can be simplified to:
let ptr be the beginning of the buffer
foreach row from 0 to height
foreach column from 0 to width
pixel = ptr + row * pitch + column * bytesperpixel
My simple implementation of the sequential algorithm looks like this:
int CountColors24bpp(unsigned char* data, int width, int height, int pitch)
{
int bytespp = 3;
std::set<unsigned int> colors;
int padding = abs(pitch) - width * bytespp;
for(int i = 0; i < height; ++i)
{
for(int j = 0; j < width; ++j)
{
unsigned char* ptr = data + i*pitch + j * bytespp;
unsigned int color = ((*ptr) << 16) | (*(ptr+1) << 8) | *(ptr+2);
colors.insert(color);
}
}
return colors.size();
}
Loading the image from disk, and timing the execution looks like this:
CImage image;
image.Load(_T("d:\\sample.bmp"));
int width = image.GetWidth();
int height = image.GetHeight();
int pitch = image.GetPitch();
int bpp = image.GetBPP();
unsigned char* data = reinterpret_cast<unsigned char*>(image.GetBits());
{
std::chrono::time_point<std::chrono::high_resolution_clock> start = std::chrono::high_resolution_clock::now();
int colors = CountColors(data, width, height, bpp, pitch);
std::chrono::time_point<std::chrono::high_resolution_clock> end = std::chrono::high_resolution_clock::now();
auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
std::cout << "[seq] colors count: " << colors << std::endl;
std::cout << "[seq] elapsed time: " << elapsed_time << "ms" << std::endl;
}
The result on my machine (Intel Core i7 2.67GHz, 6 GB RAM) on four random bitmaps with sizes 200×150, 1680×1050, 3360×1065 and 3360×2100 look like this (obviously with small variations):
[seq] colors count: 513
[seq] elapsed time: 1ms
[seq] colors count: 10544
[seq] elapsed time: 81ms
[seq] colors count: 33454
[seq] elapsed time: 172ms
[seq] colors count: 33454
[seq] elapsed time: 345ms
The simplest parallelization you can think of is using parallel_for from the Parallel Patterns Library. Especially, because the conversion from a sequential implementation to a parallel one is pretty trivial.
#include <mutex>
std::mutex g_mutex;
int CountColors24bpp_pfor(unsigned char* data, int width, int height, int pitch)
{
int bytespp = 3;
std::set<unsigned int> colors;
int padding = abs(pitch) - width * bytespp;
parallel_for(0, height, [&](int i) {
for(int j = 0; j < width; ++j)
{
unsigned char* ptr = data + i*pitch + j * bytespp;
unsigned int color = ((*ptr) << 16) | (*(ptr+1) << 8) | *(ptr+2);
g_mutex.lock();
colors.insert(color);
g_mutex.unlock();
}
});
return colors.size();
}
When you run this code you get some pretty surprising results.
[pfor] colors count: 513
[pfor] elapsed time: 106ms
[pfor] colors count: 10544
[pfor] elapsed time: 5803ms
[pfor] colors count: 33454
[pfor] elapsed time: 10714ms
[pfor] colors count: 33454
[pfor] elapsed time: 15854ms
It’s not that surprising after all, if you look at how the insertion is performed. The access to the std::set is guarded with std::mutex, allowing only one thread to insert a new element into the set. Because of this synchronization it takes much more, in the order of 50-100 times, than the sequential implementation. Of course, the bigger the image is, the smaller this performance lost.
Obviously, parallelization through the means of parallel_for is not a solution in this case. Using threads directly however, can help, if we can get rid of syncing. This is possible by using a separate std::set for each thread, and merging the results at the end.
#include <thread>
struct thread_data
{
unsigned char* data;
int width;
int h0;
int h1;
int pitch;
};
void CountColors24bpp_threadfunc(thread_data td, std::set<unsigned int>& colors)
{
int bytespp = 3;
int padding = abs(td.pitch) - td.width * bytespp;
for(int i = td.h0; i < td.h1; ++i)
{
for(int j = 0; j < td.width; ++j)
{
unsigned char* ptr = td.data + i*td.pitch + j * bytespp;
unsigned int color = ((*ptr) << 16) | (*(ptr+1) << 8) | *(ptr+2);
colors.insert(color);
}
}
}
int CountColors24bpp_threads(unsigned char* data, int width, int height, int pitch, int threadscount)
{
std::vector<std::set<unsigned int>> colors(threadscount);
std::vector<std::thread> threads(threadscount);
int range = height / threadscount;
for(int i = 0; i < threadscount; ++i)
{
thread_data td;
td.data = data;
td.h0 = range * i;
td.h1 = i == (threadscount - 1) ? height : td.h0 + range;
td.width = width;
td.pitch = pitch;
std::thread t(CountColors24bpp_threadfunc, td, std::ref(colors[i]));
threads[i].swap(t);
}
for(int i = 0; i < threadscount; ++i)
threads[i].join();
std::set<unsigned int> result;
for(int i = 0; i < threadscount; ++i)
result.insert(colors[i].begin(), colors[i].end());
return result.size();
}
A few considerations:
- the std::thread constructor does not take (yet) any number of parameters, so I had to pack the arguments in a structure
- the std::set with the colors is passed by reference, and therefore it has to be packed in a std::ref
- I create N threads, call the thread function with the arguments and then join them all
- when all threads have finished I aggregate the individual sets in a single one
Running this code with 4 threads yields results that look like this:
[threads] colors count: 513
[threads] elapsed time: 1ms
[threads] colors count: 10544
[threads] elapsed time: 28ms
[threads] colors count: 33454
[threads] elapsed time: 61ms
[threads] colors count: 33454
[threads] elapsed time: 110ms
As you can see the performance is better than the sequential code in each case (and implicitly far better than the parallel_for version).
As I said at the beginning, one can imagine better algorithms (both sequential and parallel), but the straight forward conclusion is that you have to beware of parallelization, as not every parallel version of an algorithm can run faster than the sequential version. The more syncing for shared-access takes place, the more performance is affected and the results can get much worse than the sequential version.
C++, C++11, image, parallel, sync, thread Hits for this post: 2613 .