A recent question on Stack Overflow raised the problem of a fast algorithm to count the unique colors in an image (Faster algorithm to check the colors in a image). My answer was that this kind of problem is well suited for parallelization. However, parallelization only helps when used judiciously.

To demonstrate the point I have written a pretty simple implementation in VC++ (with C++11) that uses a std::set to store the unique colors. The size of the set is the number of colors. The implementation simply iterates over the pixels of the image and adds them to the set.

There are probably better solutions than the one presented in this article. The code shown here merely serves to show how parallelization can help, though not everything that runs in parallel is faster than sequential code.

There are some things to note:

  • I used the CImage class to load an image from disk
  • images can be 1, 4, 8, 16, 24, or 32 bits per pixel, but in the code presented here, for simplicity, I assume the image loaded from disk is 24 bits per pixel
  • Windows device-independent bitmaps can be stored in memory in two ways: top-down (as you would naturally expect) or bottom-up. The processing of the image varies based on the storage. You can find details here: Top-Down vs. Bottom-Up DIBs.
  • the CImage::GetPitch method returns the pitch of the bitmap, which is the distance in bytes between the two memory addresses representing the beginning of two consecutive lines of the bitmap. This is important because the bitmap can be stored (aligned) with padding bytes at the end of a line, and the pitch and the width are used to determine the number of padding bytes. If the pitch is positive, then the bitmap is stored top-down. If the pitch is negative, then the image is stored bottom-up.
  • the pointer returned by CImage::GetBits points to the beginning of the buffer if the image is top-down, and to the end of the buffer if the image is bottom-up.
  • since the image data is stored in a contiguous array, the position of each pixel is given by the following algorithm:
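    The original listing is not reproduced here, so the following is a hypothetical sketch of that computation (the function name and parameters are illustrative, not the original code), branching on the storage order described above:

    ```cpp
    // Hypothetical sketch: compute the address of the pixel at (row, col),
    // given the pointer returned by CImage::GetBits() and the pitch returned
    // by CImage::GetPitch().
    unsigned char* pixel_address(unsigned char* bits, int pitch,
                                 int bytes_per_pixel, int row, int col)
    {
        if (pitch > 0)
        {
            // top-down: bits points to the first (top) line of the bitmap
            return bits + row * pitch + col * bytes_per_pixel;
        }
        else
        {
            // bottom-up: pitch is negative and bits points to the line that is
            // first in display order but last in memory; stepping by the
            // absolute pitch moves down the image
            return bits - row * (-pitch) + col * bytes_per_pixel;
        }
    }
    ```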

    Since the pitch is positive when the bitmap is stored top-down and negative when the image is stored bottom-up, the above algorithm can be simplified to:
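    In sketch form (again with illustrative names, not the original listing), the branch collapses to a single expression because the pitch already carries the right sign:

    ```cpp
    // Because GetPitch() is already negative for bottom-up bitmaps, one
    // expression covers both storage orders.
    unsigned char* pixel_address(unsigned char* bits, int pitch,
                                 int bytes_per_pixel, int row, int col)
    {
        return bits + row * pitch + col * bytes_per_pixel;
    }
    ```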

My simple implementation of the sequential algorithm looks like this:
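Since the original listing is not shown here, this is a sketch of the same idea, operating on the raw pixel buffer rather than a CImage object, and assuming 24 bits per pixel (BGR) as stated above; the function name is mine:

```cpp
#include <set>

// Sequential count: walk every pixel, pack its three channels into one
// integer, and insert it into a std::set; the set's size is the color count.
int count_colors(const unsigned char* bits, int width, int height, int pitch)
{
    std::set<unsigned int> colors;
    for (int row = 0; row < height; ++row)
    {
        const unsigned char* p = bits + row * pitch;
        for (int col = 0; col < width; ++col, p += 3)
        {
            unsigned int color = p[0] | (p[1] << 8) | (p[2] << 16);
            colors.insert(color);
        }
    }
    return static_cast<int>(colors.size());
}
```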

Loading the image from disk, and timing the execution looks like this:
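The loading code uses the Windows-specific CImage::Load, so only the timing part is sketched here; in the original, the lambda passed to a wrapper like this would call the counting function with GetBits()/GetPitch()/GetWidth()/GetHeight():

```cpp
#include <chrono>
#include <utility>

// Time a callable and return the elapsed milliseconds (a stand-in for
// whatever timing mechanism the original code used).
template <typename F>
double time_call_ms(F&& f)
{
    auto start = std::chrono::high_resolution_clock::now();
    std::forward<F>(f)();
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```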

The results on my machine (Intel Core i7 2.67GHz, 6 GB RAM) for four random bitmaps with sizes 200×150, 1680×1050, 3360×1065 and 3360×2100 look like this (obviously with small variations):

The simplest parallelization you can think of is using parallel_for from the Parallel Patterns Library, especially because the conversion from a sequential implementation to a parallel one is pretty trivial.
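The original PPL snippet is not reproduced here; the following is a portable approximation using std::thread that keeps the defining trait of that version, namely a single std::set guarded by one std::mutex (the original used concurrency::parallel_for over the rows):

```cpp
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Rows are processed in parallel, but every insertion goes through one
// shared, mutex-guarded std::set -- the synchronization that makes this
// version slow.
int count_colors_locked(const unsigned char* bits, int width, int height,
                        int pitch, unsigned thread_count)
{
    std::set<unsigned int> colors;
    std::mutex m;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < thread_count; ++t)
    {
        threads.emplace_back([&, t] {
            for (int row = static_cast<int>(t); row < height;
                 row += static_cast<int>(thread_count))
            {
                const unsigned char* p = bits + row * pitch;
                for (int col = 0; col < width; ++col, p += 3)
                {
                    unsigned int color = p[0] | (p[1] << 8) | (p[2] << 16);
                    std::lock_guard<std::mutex> lock(m); // serializes inserts
                    colors.insert(color);
                }
            }
        });
    }
    for (auto& th : threads) th.join();
    return static_cast<int>(colors.size());
}
```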

When you run this code you get some pretty surprising results.

It’s not that surprising after all, if you look at how the insertion is performed. Access to the std::set is guarded by a std::mutex, allowing only one thread at a time to insert a new element into the set. Because of this synchronization it takes much longer, on the order of 50-100 times, than the sequential implementation. Of course, the bigger the image, the smaller this performance loss.

Obviously, parallelization by means of parallel_for is not a solution in this case. Using threads directly, however, can help, if we can get rid of the synchronization. This is possible by using a separate std::set for each thread and merging the results at the end.

A few considerations:

  • the std::thread constructor does not (yet) accept an arbitrary number of arguments, so I had to pack the arguments in a structure
  • the std::set with the colors is passed by reference, and therefore it has to be wrapped in a std::ref
  • I create N threads, call the thread function with the arguments and then join them all
  • when all threads have finished I aggregate the individual sets in a single one
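A sketch along those lines (the structure, function, and variable names are illustrative, not the original code):

```cpp
#include <algorithm>
#include <functional>
#include <set>
#include <thread>
#include <vector>

// Arguments for one thread, packed in a structure.
struct params
{
    const unsigned char* bits;
    int width;
    int pitch;
    int first_row;
    int last_row; // exclusive
};

// Each thread fills its own set, so no synchronization is needed.
void count_worker(params p, std::set<unsigned int>& colors)
{
    for (int row = p.first_row; row < p.last_row; ++row)
    {
        const unsigned char* q = p.bits + row * p.pitch;
        for (int col = 0; col < p.width; ++col, q += 3)
            colors.insert(q[0] | (q[1] << 8) | (q[2] << 16));
    }
}

int count_colors_parallel(const unsigned char* bits, int width, int height,
                          int pitch, int thread_count)
{
    std::vector<std::set<unsigned int>> partial(thread_count);
    std::vector<std::thread> threads;
    int rows_per_thread = (height + thread_count - 1) / thread_count;
    for (int t = 0; t < thread_count; ++t)
    {
        params p{bits, width, pitch, t * rows_per_thread,
                 std::min(height, (t + 1) * rows_per_thread)};
        threads.emplace_back(count_worker, p, std::ref(partial[t]));
    }
    for (auto& th : threads) th.join();

    std::set<unsigned int> all; // merge the per-thread results
    for (auto& s : partial) all.insert(s.begin(), s.end());
    return static_cast<int>(all.size());
}
```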

Running this code with 4 threads yields results that look like this:

As you can see the performance is better than the sequential code in each case (and implicitly far better than the parallel_for version).

As I said at the beginning, one can imagine better algorithms (both sequential and parallel), but the straightforward conclusion is that you have to beware of parallelization, as not every parallel version of an algorithm runs faster than the sequential version. The more synchronization for shared access takes place, the more performance is affected, and the results can get much worse than the sequential version.


At the beginning of this year, Microsoft announced a “C++ renaissance”. Quoting from the description of a Channel 9 video with Craig Symonds and Mohsen Agsen:

C++ is currently undergoing a renaissance. This means that, by definition, the language, compilers and compositional tooling are evolving and coalescing into a state that maximizes native developer efficiency, productivity, and creativity across hardware and software domains.

Everybody agrees that Microsoft made C++ a sort of second-class citizen in the past years, while the company invested a lot in the .NET framework. Many developers have switched from native to managed (.NET) development simply because it offers a more productive environment. And the ISO standard committee's delays in releasing the new C++0x standard only made things worse.

However, with the completion of the new C++ standard this year, Microsoft apparently plans to change that and make C++ appealing to developers again. They have already made C++0x features available in the VS2010 C++ compiler and are working on implementing most of the rest for Visual Studio vNext. They are also investing in tools (now labeled Application Lifecycle Management), for instance bringing IntelliSense to C++/CLI. One of the most important areas of development is parallelism, where they are developing the PPL and Agents libraries and the just-announced C++ AMP. There is also the recent Kinect for Windows SDK beta, which provides Kinect capabilities to developers who build applications with C++ (and other languages). And in the meantime they hired Erich Gamma into the Visual Studio team.

But this is not enough, in my opinion. Improvements in the language and tools are an important part, but not everything. It is equally necessary for Microsoft to evangelize C++, using any necessary means. Unless they can spread the word, the work might pass unnoticed. To be honest, I was very reluctant about this part half a year ago, when they announced the “renaissance”. However, looking back at what they have done, I'd say they are on the right track. Of course, there is still a lot of work to match the “advertising” effort put into .NET. But right now C++ is getting more attention at conferences such as PDC or TechEd, and in their publishing assets, such as Channel 9, MSDN, and their team blogs. So I tried to assemble a collection of videos, blogs, books, and code samples related to C++ or native development that they have published since the announcement of the renaissance. So far it looks good, in my opinion.

Channel 9
E2E: Herb Sutter and Erik Meijer – Perspectives on C++
Craig Symonds and Mohsen Agsen: C++ Renaissance
Windows 7 Taskbar Integration for MFC Applications
Tony Goodhew: VC++ Developer Communication – Questions and Answers
Talkin’ C++ with Kate Gregory
MVP Summit 2011: Meet C++ MVPs Angel, PJ, Tom and Sheng
Talkin’ C++ with Alon, Marius, Bruno, and Jim
Talkin’ C++ with Boris Jabes: C++ Intellisense, Game Development, and Boris Faces His Demons
Application Restart and Recovery on Windows 7 in Native Code
Parallel Programming for C++ Developers: Tasks and Continuations, Part 1 of 2
Parallel Programming for C++ Developers: Tasks and Continuations, Part 2 of 2
Conversation with Herb Sutter: Perspectives on Modern C++(0x/11)
First Look: New ALM Tools for VC++ Developers
Modern Native C++ Development for Maximum Productivity
Mohsen Agsen – C++ Today and Tomorrow
Herb Sutter: C++ Questions and Answers
Herb Sutter – Heterogeneous Computing and C++ AMP
Daniel Moth: Blazing-fast code using GPUs and more, with C++ AMP
C9 Lectures: Stephan T Lavavej – Advanced STL, 1 of n
C9 Lectures: Stephan T Lavavej – Advanced STL, 2 of n
C9 Lectures: Stephan T Lavavej – Advanced STL, 3 of n
C9 Lectures: Stephan T Lavavej – Advanced STL, 4 of n
C9 Lectures: Stephan T Lavavej – Advanced STL, 5 of n

Visual C++ Team Blog
Grr… My VC++ Project Is Building Slower in VS2010. What Do I Do Now? (A Step by Step Guide)
C++/CLI IntelliSense in Visual Studio vNext
Exception Boundaries: Working With Multiple Error Handling Mechanisms
Troubleshooting Tips for IntelliSense Slowness
Build Related Improvement in VS2010 SP1
Converting An MFC Ribbon To Designer Format
Enforcing Correct Concurrent Access of Class Data

Parallel Programming in Native Code Blog
Sorting in PPL
How to pick your parallel sort?
The Concurrency Runtime and Visual C++ 2010: Lambda Expressions
The Concurrency Runtime and Visual C++ 2010: Automatic Type Deduction
The Concurrency Runtime and Visual C++ 2010: The decltype Type Specifier
The Concurrency Runtime and Visual C++ 2010: Rvalue References
The Concurrency Runtime and Visual C++ 2010: Transporting Exceptions between Threads
Building Responsive GUI Applications with PPL Tasks

MSDN Magazine
Writing a Debugging Tools for Windows Extension
Writing a Debugging Tools for Windows Extension, Part 2: Output
Writing a Debugging Tools for Windows Extension, Part 3: Clients and Callbacks
Agile C++ Development and Testing with Visual Studio and TFS

Books & Publications
Parallel Programming with Microsoft Visual C++
The Visual C++ Weekly

Code & Samples
Code samples for the Concurrency Runtime and Parallel Pattern Library in Visual Studio 2010
Bing Maps Trip Optimizer
Hilo: Developing C++ Applications for Windows 7
All-in-One Code Framework


Microsoft recently announced at the AMD Fusion Developer Summit the introduction of a new technology called C++ Accelerated Massive Parallelism, or C++ AMP for short, which helps C++ developers use the GPU for parallel programming. The new technology will be part of Visual C++, integrated with full support (edit, build, debug, profile) in the next version of Visual Studio. It is built in modern C++ on top of DirectX. It will provide an STL-like library, part of the concurrency namespace, delivered in an amp.h header.

S. Somasegar (Senior Vice President of the Developer Division) said:

By building on the Windows DirectX platform, our implementation of C++ AMP allows you to target hardware from all the major hardware vendors. We expect that it will be part of the next Visual C++ compiler and fully integrated in the next release of Visual Studio experience.

You can read more about it here:

Herb Sutter introduces C++ AMP at the AMD Fusion Developer Summit 11

Daniel Moth digs deeper into C++ AMP with code samples and more

Looking forward to the first CTP and samples.

Herb Sutter:

How long is your password? How long would it take 100,000,000 GPU cores, running at, what, a million attempts per second, to crack your password just by brute force? And that's something almost a kid can write. And that is just a tiny example of how game-changing this is.


Channel 9 recently posted a video with the Parallel Computing Concurrency Runtime team talking, mainly, about tasks and continuations, new features of the Parallel Patterns Library. These are already available through the ConcRT Extras Sample Pack. You can watch the half-hour interview with the team here.

Besides the new things they showed, I particularly liked two things that Artur Laksberg said. The first was about the difference between parallelism and concurrency:

Parallelism is doing the same amount of work faster by utilizing multiple cores, multiple processors. Concurrency is understood as doing more work in the same amount of time.

The other one was about threads and tasks:

We want people to stop thinking about threads and start thinking in terms of independent, or not independent, units of work. You have one piece of work, you compose it with another piece of work and you have two tasks, you join together and what you get as a result is another task. And then, concurrency, as somebody said, just happens. It just happens if the runtime decides it’s beneficial for you, that it is safe to execute those two chunks, tasks, in parallel.

Hopefully people will start understanding that raw threads are becoming obsolete and that they should be thinking in terms of tasks.

UPDATE: The Microsoft Technical Computing group announced yesterday the availability of a book called Parallel Programming with Microsoft Visual C++: Design Patterns for Decomposition and Coordination on Multicore Architectures, describing six key patterns for data and task parallelism and how to implement them in VC++ using the Parallel Patterns Library and Asynchronous Agents Library, which shipped with Visual Studio 2010. There is also a printed version of the book. You can read more about it on the VC++ team's blog.


At the beginning of this year I posted a short questionnaire on concurrency on my blog. I promised to publish the results, so here they are. The number of respondents was 27 (and not everybody answered all the questions).

Q: What is your role in application development?

Q: How many cores has your primary working station (PC, laptop)?

Q: What type of application(s) are you developing?
A: Windows and web applications, ERP, banking, internal, back-end, integrations (add-ins: IE, Outlook, etc.), VxWorks kernel-based applications under PPC603, point-of-sale, internal, web site + ERP backend + e-commerce, web applications, Windows applications, Mac applications, control software, visualizations, SCADA, Linux embedded applications, Windows Forms, Windows .NET with WPF/WCF and Silverlight, integration technology platforms for CRM and ERP, kernel-mode drivers, filesystem drivers, graphic engines, CAD.

Q: What programming languages are used for your application development?

Q: On a scale from 1 to 5, how do you rate your level of awareness of concurrency issues?

Q: On a scale from 1 to 5 how do you rate your activity in getting informed and acquiring competence on concurrency (trends, tools, etc.)?

Q: What do you expect to happen to developers that won’t learn to program concurrently?

Q: On a scale from 1 to 5, how would you grade the level of parallelization of your application?

Q: Does your application performance scale proportionally with the number of cores?

Q: What libraries and tools are used in your application for concurrency?
A: System.Threading, the built-in threading classes, OS support (CRT, Win32 API), VS2010, .NET Framework (Thread, ThreadPool, Task, Monitor, lock()) and many home-grown libraries (synchronized objects, asynchronous calls and functions), MFC framework, .NET framework, Qt framework, Java framework, my own using RAII, nothing special: self-made priority queues and thread pools, Win32 APIs, Windows threads.

Q: Give examples of routines in your application that are executed in worker threads.
A: loading of grids which are displayed at startup in the main application window, logging, tracing, loading data (in some cases), long-running tasks (computations, report generation, also in some cases), lengthy HTTP posts, Outlook inbox scan for text, a job/task manager for things that are triggered by the web UI but are not required to output any result to the client, data processing, database interactions, database data loading/uploading, looped data acquisition presented in chart monitors, satellite data scanning and sectioning for DVB-S systems, any operations where I need to start/stop, all long-running operations, application request processing, connection management, I/O, opening large files, communication, file uploads, update checking.

Q: Give examples of routines in your application that execute in the main (UI) thread and can be refactored to execute in worker threads.
A: loading all application modules, SQL queries, drawing rows of panels (maybe even threading each individual block per row), all data processing which takes more than a few seconds and which does not depend on or influence other events in the UI, I've been doing multi-threaded programming for so long that I usually design the app for multi-threading up front, long (minutes) calculations, user-triggered opening of multiple files, opening of a single file, certain post-processing of data, database stuff.

Q: Give examples of routines in your application that are parallelized.
A: logging and tracing with multiple outputs (file system, debug window, activity logs written to database), UI, background processes, sending emails, invoking an event multithreaded in an own ThreadPool, all data processing in an application running in a satellite set-top box (zapit thread, EPG thread, timer thread, other threads controlling whichever driver is involved: front panel display driver, sound driver, etc.), remote connection handling, data processing, 3D transformations.

Q: Give examples of routines in your application that could be parallelized to improve performance.
A: cost computation and distribution for production chains, the rest, all timer routines should turn into threads, all routines that take more than a few seconds and lock the UI, send-request-and-wait-for-response logic (actually I was thinking of SEDA-like processing), importing models from project files.


I think application development faces two challenges nowadays: 64-bit and multi-core/many-core hardware. Switching from 32 to 64 bits is just another step in the evolution of processors. There was a time when we switched from 8 to 16 bits, and then from 16 to 32. Problems arise every time, but probably in 10-20 years we will have 128-bit platforms. Multi-core/many-core, on the other hand, is a different kind of shift, not only in development but also in thinking.

My workstation has 8 cores (4 physical and 4 virtualized) and so does my laptop. Still, I don't see an 8-times improvement in the applications running on these machines; and while I know 8 times is not what I should expect, I don't see even 4 times. For instance, the time for building from scratch the application I've been working on with Visual Studio 2008 dropped from 20 minutes to 8 minutes; that's a 2.5-times improvement. I know that having N cores doesn't mean that applications can run N times faster, because not everything can run in several threads, and then there are problems with resource access, synchronization, and others. All these prevent applications from running N times faster. But the real problem is that we are not thinking in parallel. We are still used to programming in a single thread; and by that I mean not only that most developers are not used to parallelizing to boost the performance of some routines, but also that many operations are run in the main (UI) thread, making the GUI freeze.

Therefore I would like to get some feedback from people working in application development, to get an idea about awareness, issues, and solutions regarding concurrency. Please take a few minutes to answer the questions in this survey.

I’d like to quote James Reinders, lead evangelist and Director of Marketing and Business Development for Intel Software Development Products, who said that:

I am still confident that software development in 2016 will not be kind to programmers who have not learned to “Think Parallel.”

The “not parallel” era we are now exiting will appear to be a very primitive time in the history of computers when people look back in a hundred years. The world works in parallel, and it is time for computer programs to do the same.

Doing one thing at a time is “so yesterday.”

The questionnaire below is also available here.

Thank you for answering the questionnaire.


One of the most important challenges in programming nowadays is concurrency. If we don't learn to write programs that are able to run on multiple cores, the progress in hardware will be pointless. But when you run multiple threads for various processing tasks, you may find yourself writing the same or similar code over and over again: creating the threads, setting up the parameters for the threads, joining the threads, checking the results, cleaning up, etc.

In this post I will show how you can create some helpers in C++ to simplify this process. This is not going to be a full solution, nor a solution that fits all needs, but it can be a start.

What I would like to have is a helper class that will take care of:

  • finding how many threads can run (considering each available core can run a thread)
  • creating and starting the threads
  • joining the threads
  • checking the result of the threads execution
  • cleaning up

The class shown below does just that.
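The original listing is not reproduced here; the following is a portable sketch of such a helper using std::thread (the original targeted the Win32 API, where Run() could also inspect the wait results; std::thread has no equivalent of timed-out or abandoned waits, so that part is omitted). Method names follow the description below, the rest is illustrative:

```cpp
#include <thread>
#include <vector>

class ThreadHelper
{
    unsigned count_;
    std::vector<void*> params_;
public:
    // detect the number of available processors and use one thread per core
    ThreadHelper()
        : count_(std::thread::hardware_concurrency())
    {
        if (count_ == 0) count_ = 1; // hardware_concurrency may return 0
        params_.resize(count_, nullptr);
    }

    // or create a specific number of threads
    explicit ThreadHelper(unsigned count)
        : count_(count), params_(count, nullptr) {}

    unsigned Count() const { return count_; }

    // set the parameter that will be passed to thread i
    void SetThreadParams(unsigned i, void* param) { params_[i] = param; }

    // create the threads, run the procedure in each, and join them all
    void Run(void (*proc)(void*))
    {
        std::vector<std::thread> threads;
        for (unsigned i = 0; i < count_; ++i)
            threads.emplace_back(proc, params_[i]);
        for (auto& t : threads)
            t.join();
    }
};
```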

This helper class contains:

  • one parameter-less constructor that identifies the number of available processors and sets the threads count equal to the processors count
  • one constructor that takes the number of threads that should be created
  • one method (SetThreadParams) for setting the parameters for each thread that will be created
  • one method (Run) that creates and runs the threads, waits for them, and checks the results of the execution

As you can see, the Run() method is simplistic. It does not handle, for instance, timed-out or abandoned thread executions. Also, it joins all threads, waiting until all of them have finished execution. A more flexible method could wait only until the first thread finishes and then perhaps close the other threads. But as I said, this is a sample and not a complete solution.

Having this helper set up, I will start several threads to find the prime numbers in a sequence and print them to the console.

The following function computes whether a number is prime:
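A sketch of it, using trial division up to the square root (the function name is illustrative):

```cpp
// Returns true if n is prime; divisors need only be checked up to sqrt(n).
bool is_prime(int n)
{
    if (n < 2) return false;
    for (int i = 2; i * i <= n; ++i)
        if (n % i == 0) return false;
    return true;
}
```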

The thread procedure will run through a sub-sequence of a vector of integers and verify whether each element is prime. I will use the following structure to pass the sequence bounds to the thread procedure:

The thread procedure could look like this:

To print to the console, a locking mechanism is necessary; otherwise prints from two different threads could collide. The critical section will be initialized before the threads are started.
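Putting these pieces together, here is a sketch of the bounds structure, the lock guarding console output (a std::mutex standing in for the Win32 CRITICAL_SECTION), and the thread procedure; all names are illustrative, and the primality test described earlier is repeated so the sketch compiles on its own:

```cpp
#include <cstdio>
#include <mutex>
#include <vector>

std::mutex print_mutex; // guards console output across threads

// the sequence bounds passed to each thread
struct range_params
{
    const std::vector<int>* numbers;
    std::size_t begin;
    std::size_t end; // exclusive
};

bool is_prime(int n)
{
    if (n < 2) return false;
    for (int i = 2; i * i <= n; ++i)
        if (n % i == 0) return false;
    return true;
}

// the thread procedure: test each element of the sub-sequence and print
// the primes, one thread at a time
void find_primes(void* arg)
{
    const range_params* p = static_cast<const range_params*>(arg);
    for (std::size_t i = p->begin; i < p->end; ++i)
    {
        int n = (*p->numbers)[i];
        if (is_prime(n))
        {
            std::lock_guard<std::mutex> lock(print_mutex);
            std::printf("%d\n", n);
        }
    }
}
```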

What remains to be done is generating a sequence of integers, setting up the parameters with the sequence bounds for each thread, and running the threads using the helper.
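A sketch of that driver; to keep it self-contained it launches std::threads directly, whereas in the article the same wiring goes through the helper class (SetThreadParams with one bounds structure per thread, then Run). It also returns the number of primes found, which is convenient for checking:

```cpp
#include <algorithm>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

bool is_prime(int n)
{
    if (n < 2) return false;
    for (int i = 2; i * i <= n; ++i)
        if (n % i == 0) return false;
    return true;
}

int find_primes_parallel(const std::vector<int>& numbers, unsigned thread_count)
{
    std::mutex m; // guards the console and the counter
    int found = 0;
    std::vector<std::thread> threads;
    std::size_t chunk = (numbers.size() + thread_count - 1) / thread_count;
    for (unsigned t = 0; t < thread_count; ++t)
    {
        std::size_t first = t * chunk;
        std::size_t last = std::min(numbers.size(), first + chunk);
        threads.emplace_back([&, first, last] {
            for (std::size_t i = first; i < last; ++i)
                if (is_prime(numbers[i]))
                {
                    std::lock_guard<std::mutex> lock(m);
                    std::printf("%d\n", numbers[i]);
                    ++found;
                }
        });
    }
    for (auto& th : threads) th.join();
    return found;
}
```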

Having this thread helper class, what I need to do when running some processing on several threads is:

  • set up the thread parameters (if needed)
  • write the thread procedure
  • create a ThreadHelper object and initialize it
  • run the threads and collect the results

The helper class prevents writing the same code over and over again and helps you focus on the most important task: writing the thread procedure. As I said earlier, it is not a full solution, nor one that fits all scenarios, but you can develop it to suit your needs.


Microsoft has made available a first beta of an experimental version of .NET 4.0, called .NET Framework 4.0 Beta 1 Enabled for Software Transactional Memory v1.0. Since that is quite a long name, the short one is STM.NET. This is a special version of .NET 4.0 that enables software transactional memory for C#. It allows programmers to demarcate regions of code as operating in an atomic transaction, isolated from other code running concurrently. The means to do this is a delegate called Atomic.Do, or try-catch blocks. Perhaps in the future an ‘atomic’ block will be added to the language(s).

This first version of the framework also comes with additional tools:

  • tooling (debugging, ETW tracing)
  • lock interoperability
  • interoperability with traditional transactions
  • annotations (how methods run in transactions, suppressed transactions on methods, etc.)
  • static and dynamic checking of annotations

On the other hand there are some limitations:

  • only works for C# for now
  • it cannot be installed on a machine with VS 2010, nor the other way around
  • there is only a 32-bit version

More information about it can be found at the STM team blog or MSDN DevLabs.


Last week Microsoft published on DevLabs a .NET language for building parallel applications, called Axum, earlier known as Maestro. This new language is built on the architecture of the web, on the principles of isolation, message-passing, fault-tolerance, and loose-coupling. It is said to have a more succinct syntax than Erlang, and an isolation advantage over MPI, CCR, and Asynchronous Agents.

Isolation is key in this architecture and is achieved with:

  • domains, which limit the runtime scope of data to its compile-time scope (objects don't escape domains)
  • agents, active components that provide access to domains and live in a thread of their own, different from the caller's; their methods are not accessible from outside
  • channels, the means of communicating with agents; they are established by the runtime when agents are created. The most important parts of channels are the ports (input or output), which can be viewed as queues in which data is placed.

There are more readings about these topics here:

You can find more about Axum at:


I recently found an interesting problem on the web about reducing words letter by letter until only one letter remains. Here is a formal definition:

We define word reduction as removing a single letter from a word while leaving the remaining letters in their original order so that the resulting sequence of characters is also a word. A good word can be reduced step-by-step until all that is left is a or i. Your program will answer the question: what is the biggest word in the given dictionary that can be reduced?

A typical example, not very long, is this:

So I thought this could be a good exercise for F#. Several good dictionaries, both small and big, can be found here.

My approach was to read the dictionary and build a list for each word length: one for 1-letter words, one for 2-letter words, etc. This could be a Dictionary<int, list>; let's call it simply dictionary. Then these lists can be traversed to create a second set of lists (let's call it reducedwords), but only with the words that meet the reduction rule.

A good approach would be to take each list of words from the initial dictionary, starting with the list of 2-letter words, and for each word delete one letter at a time. Then check whether the resulting word already exists in the list from reducedwords corresponding to a length smaller by one. If it is there, then this word can be reduced and should be added to the list from reducedwords corresponding to the current length. In other words, the algorithm would be:
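The post implements this in F#, and those snippets are not reproduced here; to make the bottom-up idea concrete, here is a compact C++ re-statement of the same algorithm (names are mine): group the words by length, seed the reducible set with "a" and "i", and then mark a word reducible if deleting some letter yields a reducible word one letter shorter.

```cpp
#include <map>
#include <string>
#include <unordered_set>
#include <vector>

std::map<std::size_t, std::vector<std::string>>
find_reduced_words(const std::vector<std::string>& words)
{
    // group the dictionary by word length
    std::map<std::size_t, std::unordered_set<std::string>> by_length;
    for (const auto& w : words)
        if (!w.empty())
            by_length[w.size()].insert(w);

    std::map<std::size_t, std::vector<std::string>> reduced;
    std::unordered_set<std::string> reducible;

    // 1-letter words reduce only to "a" or "i"
    for (const char* s : { "a", "i" })
        if (by_length.count(1) && by_length[1].count(s))
        {
            reducible.insert(s);
            reduced[1].push_back(s);
        }

    // bottom-up: a word is reducible if removing one letter gives a
    // reducible shorter word
    std::size_t max_len = by_length.empty() ? 0 : by_length.rbegin()->first;
    for (std::size_t len = 2; len <= max_len; ++len)
    {
        auto it = by_length.find(len);
        if (it == by_length.end()) continue;
        for (const auto& w : it->second)
            for (std::size_t i = 0; i < w.size(); ++i)
            {
                std::string shorter = w.substr(0, i) + w.substr(i + 1);
                if (reducible.count(shorter))
                {
                    reducible.insert(w);
                    reduced[len].push_back(w);
                    break;
                }
            }
    }
    return reduced;
}
```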

Here is a function for reading a dictionary file:

Once I have read the dictionary, I can apply the algorithm and generate a second Dictionary structure. The following function also returns the length of the longest word(s) in the reduced dictionary. This is useful for printing.

Since the problem is about printing only the longest such reductions, I'll only consider starting from the list of reduced words that contains the longest words. That's why I needed findReducedWords to return the length of the longest reducible word.

To print these paths I apply the same algorithm as before. The only difference is that I build a list with the words on the reduction path, starting with the longest word and ending with a 1-letter word.

The only thing left to do is calling all these functions:

My results for this dictionary were:
