A recent question on stackoverflow raised the problem of a fast algorithm to count the unique colors in an image (Faster algorithm to check the colors in a image). My answer what that this kind of problems are suited for parallelization. However, parallelization may only help when used judiciously.

To demonstrate the point I have written a pretty simple implementation in VC++ (with C++11), that uses a std::set to store the unique colors. The size of the set is the number of the colors. The implementation simply iterates over the pixels of the image and adds them to the set.

There are probably better solutions than the one presented in this article. The code shown here is merely for the purpose of showing how parallelization can help, though not everything that runs in parallel is faster that a sequential code.

There are some things to note:

  • I used CImage class to load an image from disk
  • images can be 1,4,8,16,24 or 32 bit per pixel, but in the code presented here, for simplicity, I assume the image loaded from disk is 24 bits per pixel
  • Windows device-independent bitmaps can be stored in memory in two ways: top-down (as you would naturally expect) or bottom-up. The processing of the image varies based on the storage. You can find details here: Top-Down vs. Bottom-Up DIBs.
  • CImage::GetPitch method returns the pitch of the bitmap, which is the distance in bytes between the two memory addresses representing the beginning of two consecutive lines of the bitmap. This is important because the bitmap can be stored (align) with padding bytes at the end of a line, and the pitch and the width are used to determine the number of padding bytes. If the pitch is positive, then the bitmap is stored top-down. If the pitch is negative then the image is stored bottom-up.
  • the pointer returned by CImage::GetBits points to the beginning of the buffer if the image is top-down, and to the end of the buffer if the image is bottom-up.
  • since the image data is stored in contiguous array, the position of each pixel is given by the following algorithm:

    Since the pitch is positive when the bitmap is stored top-down and negative when then image is stored bottom-up, the above algorithm can be simplified to:

My simple implementation of the sequential algorithm looks like this:

Loading the image from disk, and timing the execution looks like this:

The result on my machine (Intel Core i7 2.67GHz, 6 GB RAM) on four random bitmaps with sizes 200×150, 1680×1050, 3360×1065 and 3360×2100 look like this (obviously with small variations):

The simplest parallelization you can think of is using parallel_for from the Parallel Patterns Library. Especially, because the conversion from a sequential implementation to a parallel one is pretty trivial.

When you run this code you get some pretty surprising results.

It’s not that surprising after all, if you look at how the insertion is performed. The access to the std::set is guarded with std::mutex, allowing only one thread to insert a new element into the set. Because of this synchronization it takes much more, in the order of 50-100 times, than the sequential implementation. Of course, the bigger the image is, the smaller this performance lost.

Obviously, parallelization through the means of parallel_for is not a solution in this case. Using threads directly however, can help, if we can get rid of syncing. This is possible by using a separate std::set for each thread, and merging the results at the end.

A few considerations:

  • the std::thread constructor does not take (yet) any number of parameters, so I had to pack the arguments in a structure
  • the std::set with the colors is passed by reference, and therefore it has to be packed in a std::ref
  • I create N threads, call the thread function with the arguments and then join them all
  • when all threads have finished I aggregate the individual sets in a single one

Running this code with 4 threads yields results that look like this:

As you can see the performance is better than the sequential code in each case (and implicitly far better than the parallel_for version).

As I said at the beginning, one can imagine better algorithms (both sequential and parallel), but the straight forward conclusion is that you have to beware of parallelization, as not every parallel version of an algorithm can run faster than the sequential version. The more syncing for shared-access takes place, the more performance is affected and the results can get much worse than the sequential version.

, , , , , Hits for this post: 30334 .

Microsoft has recently announced at the AMD Fusion Developer Summit the introduction of a new technology called C++ Accelerated Massive Parallelism or shortly C++ AMP that helps C++ developers use the GPU for parallel programming. The new technology will be part of Visual C++, integrated with full support (edit, build, debug, profile) in the next version of Visual Studio. It is built in modern C++ on top of DirectX. It will provide an STL-like library, part of the Concurrency namespace, delivered in a amp.h header.

S.Somasegar (Senior Vice President of the Developer Division) said:

By building on the Windows DirectX platform, our implementation of C++ AMP allows you to target hardware from all the major hardware vendors. We expect that it will be part of the next Visual C++ compiler and fully integrated in the next release of Visual Studio experience.

You can read more about it here:

Looking forward for the first CTP and samples.

UPDATE
Herb Sutter introduces C++ AMP at the AMD Fusion Developer Summit 11

Daniel Moth digs deeper into C++ AMP with code samples and more

Herb Sutter

How long is your password? How long it will take a 100,000,000 GPU cores running at what, a million attempts per second to crack your password just by brute force? That where almost a kid can write that. An that is just a tiny example of how game changing this is.

, , , , Hits for this post: 46223 .

Channel9 recently posted a video with the Parallel Computing Concurrency Runtime team talking, mainly, about tasks and continuations, new features to the Parallel Patterns Library. These are already available through the ConcRT Extra’s Sample Pack. You can watch the half hour interview with the team here.

Besides the new stuff they shown, I particularly liked two things that Artur Laksberg said. The first was about the difference between parallelism and concurrency:

Parallelism is doing the same amount of work faster by utilizing multiple cores, multiple processors. Concurrency is understood as doing more work in the same amount of time.

The other one was about threads and tasks:

We want people to stop thinking about threads and start thinking in terms of independent, or not independent, units of work. You have one piece of work, you compose it with another piece of work and you have two tasks, you join together and what you get as a result is another task. And then, concurrency, as somebody said, just happens. It just happens if the runtime decides it’s beneficial for you, that it is safe to execute those two chunks, tasks, in parallel.

Hopefully people will start understanding that threads are obsolete and they should be thinking in tasks.

UPDATE: Microsoft Technical Computing group announced yesterday the availability of a book called Parallel Programming with Microsoft Visual C++: Design patterns for Decomposition, and Coordination on Multicore Architectures, describing six key patterns for data and task parallelism and how to implement them in VC++ using the Parallel Patterns Library and Asynchronous Agents Library, which shipped with Visual Studio 2010. There is also a printed version for the book. You can read more about it on VC++ team’s blog.

, , , , , Hits for this post: 39049 .

At the beginning of this year I have posted a short questionnaire on concurrency on my blog. I have promised to publish the results, so here they are. The number of respondents was 27 (and not everybody answered all the questions).

Q: What is your role in application development?

Q: How many cores has your primary working station (PC, laptop)?

Q: What type of application(s) are you developing?
A: Windows and web applications, ERP, banking, internal, back-end, integrations (addins: IE, Outlook, etc.), VxWorks kernel-base applications under PPC603, Point-of-Sale, internal, Web site + ERP backend + e-commerce, web applications, Windows applications, Mac applications, control software, visualizations, scada, Linux embedded application, Windows forms, Windows .Net with WPF/WCF and Silverlight, integration technology platforms for CRM and ERP, Kernel mode drivers, filesystem drivers, graphic engines, CAD.

Q: What programming languages are used for your application development?

Q: One a scale from 1 to 5 how do you rate your level of awareness on concurrency issues?

Q: On a scale from 1 to 5 how do you rate your activity in getting informed and acquiring competence on concurrency (trends, tools, etc.)?

Q: What do you expect to happen to developers that won’t learn to program concurrently?

Q: On a scale from 1 to 5, how would you grade the level of parallization of your application?

Q: Does your application performance scale proportionally with the number of cores?

Q: What libraries and tools are used in your application for concurrency?
A: System.Threading, The built-in threading classes, OS support (CRT, Win32 API), VS2010, .Net Framework – Thread, ThreadPool, Task, Monitor, lock() and many own written libraries (synchronized objects, asynchron calls and functions), MFC framework, .NET framework, Qt framework, Java framework, My own using RAII, nothing special: self made priority queues and thread pools, Win32 API’s, Windows threads.

Q: Give examples of routines in your application that are executed in worker threads.
A: loading of grids which care displayed at startup in the main application window, logging, tracing, loading data (in some cases), long running tasks (computations, report generation also in some cases), HTTP lenghtly posts, Outlook inbox scan for text, a job/task manager for things that are triggered by web UI but are not required to output any result to the client, data processing, database interactions, database data loading/uploading, looped data acquisition presented in chart monitors, sattelite data scaning and sectioned for DVB-S systems, any operations where I need to start/stop, all long running operations, application request processing, connection management, I/O, Open large files, communication, file uploads, update checking.

Q: Give examples of routines in your application that execute in the main (UI) thread and can be refactored to execute in worker threads.
A: loading all application modules, SQL queries, draw rows of panels – maybe even threading each individual block per row, all data processing which take more than a few seconds and which are not dependent or influence other events on UI, I’ve been multi-threaded programming for so long, I usually design the app for multi-theading up front, long (minutes) calculations, user triggered opening of multiple files, opening of a single file, certain post-processing of data, database stuff.

Q: Give examples of routines in your application that are parallelized.
A: logging and Tracing with multiple outputs (File System, Debug Window, Activity Logs written to database), UI, background processes, sending emails, invokes an event multithreaded in an own ThreadPool, all data processing in an application running in a sattelite set top box, zapit-thread, epg-thread, timer thread, other threads controlling whichever driver involved: front panel display driver, sound driver etc., remote connection handling, data processing, 3d transformations.

Q: Give examples of routines in your application that could be parallelized to improve performance.
A: cost computation and distribution for productions chains, the rest, all timers routines should turn to threads, all routines that take more than a few seconds and locks the UI, send-request-and-wait-response logic (actually I was thinking of SEDA kind of processing), importing models from project files.

, , , Hits for this post: 30232 .

I think application development faces two challenges nowadays: 64-bit and multi-core/many-core hardware. Switching from 32 to 64 bit is just another step in the evolution of processors. There was a time when we switched from 8 to 16, and then from 16 to 32. There are problems that arise every time, but probably in 10-20 years we will have 128 bit platforms. On the other hand, multi-core/many-core is a different shift not only in development but also thinking. We either run one core or multiple core processors. My working stations has 8 cores (4 physical and 4 virtualized) and so does my laptop. Still, I don’t see an 8 times improvement of the applications running on these machines; yet I know 8 times I not what I should expect. But not even 4 times. For instance, the time for building from scratch the application I’ve working on with Visual Studio 2008 dropped from 20 minutes to 8 minutes; that’s a 2.5 improvement. I know that having N cores doesn’t mean that applications can run N times faster, because not everything can run in several threads, and then we have problems with resource access, synchronization and others. All these prevent applications run N times faster. But the problem is that we are not thinking in parallel. We are still used to program in a single thread; and when I say that I mean not only most developers are not used to do parallelization for boosting performance of some routines, but also many operations are run in the main (UI) thread, making that GUI freeze.

Therefore I would like to get some feedback from people working in applications development to get an idea about awareness, issues, solutions regarding concurrency. Please take several minutes to answer the questions in this survey.

I’d like to quote James Reinders, lead evangelist and Director of Marketing and Business Development for Intel Software Development Products, who said that:

I am still confident that software development in 2016 will not be kind to programmers who have not learned to “Think Parallel.”

The “not parallel” era we are now exiting will appear to be a very primitive time in the history of computers when people look back in a hundred years. The world works in parallel, and it is time for computer programs to do the same.

Doing one thing at a time is “so yesterday.”

The questionnaire bellow is also available here.

Thank you for answering the questionnaire.

, , , Hits for this post: 31067 .

Last week Microsoft published on DevLabs a .NET language for building parallel applications, called Axum, and earlier known as Maestro. This new language is build on the architecture of the web, on the principles of isolation, message-passing, fault-tolerance, loose-coupling. It is said to have a more succinct syntax than Erlang, and have the isolation advantage over MPI, CCR and Asynchronous Agents.

Isolation is key in this architecture and is achieved with:

  • domains, that limits the runtime scope of data to its compile-time scope (objects don’t escape domains)
  • agents, active components that provide access to domains, and live in a thread of their own, different that the callers; their methods are not accessible outside;
  • channels, are the mean to communicate with agents; they are established by the runtime, when agents are created. The most important parts of the channels are the ports (input or output), that can be viewed as queues in which data is placed.

Here are more readings about these topics here:

You can find more about Axum at:

, , , , Hits for this post: 35781 .

Game of Life in F#

The Game of Life is a cellular automaton devised by the John Horton Conway in 1970. It is the best-known example of a cellular automaton. It consists of a collection of cells which, based on a few mathematical rules, can live, die or multiply. Depending on the initial conditions, the cells form various patterns throughout the course of the game.

Here you can learn more about the game:

I decided to implement this in F#. With all the code involving the user interface, it was nearly 300 lines of code. Pretty neat!

Using the game

The life’s arena can take several sizes:

  • Tiny: 15 x 10
  • Small: 30 x 20
  • Medium: 60 x 40
  • Large: 120 x 80
  • Huge: 150 x 100

The size can be changed from the Game > Size menu.

The following commands are available from the menu:

  • Reset (Ctrl + R): Resets the game, all cells die
  • Randomize (Ctrl + G): Randomly initialize alive cells
  • Start/Stop (Ctrl + B): Starts or stops continuos creation of new generations
  • Step (Ctrl + N): Creates a new generation of cells

In addition the followin commands are available from the File menu:

  • Save (Ctrl + S): saves game to a bitmap file called gameolife_.bmp; this file is created in the working folder
  • Save As… (Ctrl + Shift + S): saves the game to a file, giving you the possibility to chose the location and format (bmp, jpg, gif and png)

Note: The state of the game can be changed (killing cells, making others alive) simply by clicking with the mouse on the game panel.

Implementation in F#

Some comments on the implementation of the game. For more have a look at the code.

I defined two records, one called Cell and one called World. The cell reprezents a cell and has a flag called Alive (which is self explanatory), and World represents the ‘arena of life’, containing a matrix of Cells.

Various arena sizes are defined in a descriminated union and the mapping between that and values for width and height are done in this function:

A next generation of cells is computed based on the rules of the game. For each cell the number of alived neighbors is computed and then the state of the cell is changes (if necessary).

For computing the number of neighbors, I considered 9 cases:

  • cell is one of the four corners
  • cell is on one of the four edges (not a corner)
  • cell is anywhere else in the matrix

Depending on this position a list of neighbors is created and folded to compute the number of alive neighboring cells.

The rest is basically user interface code. But I made a parallel version of the game, when then computation of the next generation of cells is parallelized with the Parallel FX framework. Here is how the function looks.

Source code

Here are the available downloads

, , , Hits for this post: 41690 .

In this post I’ll show some F# constructs, all put together in a simple application that modifies file names that match a criteria. This would be an application that is started from a console with the following command line options:

Reading command line

The command line arguments can be retrieved using the Environment class from the .NET framework. This class has a static method called GetCommandLineArgs() that returns a list of the passed arguments.
We can define a type that contains all the parsed arguments.

This mutable record can be instantiated, and the value can be mutated while parsing the arguments. This is how you instantiate it:

Parsing the command line arguments can be done with pattern matching. This is the equivalent of switches in C+/C#/Java, only more powerful.
Basically, I’m checking each argument, and if it’s a flag in the command line (-f, -r, -p, -pre, -sub) I take the next argument and put it in the appropriate property of the record.

There are two things you could notice here. The first is the try … with block that makes sure any possible exception is caught.
The second is the quarding the rules with the contidion that the current argument is not the last one in the list. (-f should be followed by a folder, -suf by a suffix, etc.)
You can see what in the when statement:

Getting the files in a directory

We can get all files in a folder, using the following algorithm:

  • get all the files in the current folder
  • get all the sub-folders in the current folder and for each of them apply the algorithm again

That is spelled “recursion”!. Our function should take several arguments: the path of a folder, a pattern for mathing filenames and a flag indicating whether sub-folders should be parsed or not.

The above function is recursive and returns a sequence of file names. Sequences are lazy, which means that successive elements are computed and returned on demand, when they are needed.
That is the opposite of a list or array, whose elements are created at once. The keyword ‘yield’ here is used to return a new value as the sequence is iterated.

Processing the files

To process the files, we simply iterate over the sequence of files from the specified folder, match it against the provided parttern, and if there is a match, apply the prefix and/or suffix transformation.

Well, I have two cores on my machine, and since the Parallel FX framework is available, I like to use it. So here is the parallel version of that:

The provided (via command line) pattern is a regular expression. Initially, the folder is checked for all files and then these files are matched against this regular expression.

As I was saying in a previous post, if you use PFX, you have to add a reference to the System.threading.dll assembly, which requires a reference to the System.Core.dll assembly.
That should be specified at the project’s propertyes.

-r C:\WINDOWS\assembly\GAC_MSIL\System.Core\3.5.0.0__b77a5c561934e089\System.Core.dll -r “C:\Program Files\Microsoft Parallel Extensions Dec07 CTP\System.Threading.dll”

Putting all together

All that put together looks like this:

Of course, the options available in this application (on file name changes) are pretty limited, but that can be extended at will.

, , , , , Hits for this post: 40665 .