In my last post I was writing about parallelizing loops with Parallel.For in C#. Today I though it would be nice to try that in F#. So, here is the benchmarking of the matrix multiplication and the bubblesort algorithm in F4.

Matrices Multiplication

I started with a create_matrix function that creates and randomly initializes a matrix of doubles.

let create_matrix rows columns =
   let rnd = System.Random()
   Array2.init rows columns (fun i j -> rnd.NextDouble())

Sequential multiplication could look like this:

let multiply_sequential (m1:float[,]) (m2:float[,]) =
   let rows1 = Array2.length1 m1
   let cols1 = Array2.length2 m1
   let rows2 = Array2.length1 m2
   let cols2 = Array2.length2 m2
   let result = Array2.create rows1 cols2 0.0

   if(cols1 <> rows2) then
      failwith "Matrices size incorrect for multiplication!"

   for i = 0 to rows1-1 do
      for j = 0 to cols2-1 do
         for k = 0 to cols1-1 do
            result.[i,j] <- result.[i,j] + m1.[i,k] * m2.[k,j]
         done
      done
   done
   result

Parallelizing it only implies replacing the outer loop with Parallel.For.

let multiply_parallel (m1:float[,]) (m2:float[,]) =
   let rows1 = Array2.length1 m1
   let cols1 = Array2.length2 m1
   let rows2 = Array2.length1 m2
   let cols2 = Array2.length2 m2
   let result = Array2.create rows1 cols2 0.0

   if(cols1 <> rows2) then
      failwith "Matrices size incorrect for multiplication!"

   Parallel.For(0, rows1, (fun i->
      for j = 0 to cols2-1 do
         for k = 0 to cols1-1 do
            result.[i,j] <- result.[i,j] + m1.[i,k] * m2.[k,j]))
   result

We can test those function and get the same output as I had in my previous post with this:

let main() =
   let step = 100
   let size = ref step
   while (!size <= step*10) do
      let m1 = create_matrix !size !size
      let m2 = create_matrix !size !size
      printfn "Matrices size: %dx%d" !size !size

      printf "Sequential...\t"
      let starts = DateTime.Now
      let ms = multiply_sequential m1 m2
      printfn "%a" output_any (DateTime.Now - starts)

      printf "Parallel...\t"
      let startp = DateTime.Now
      let ms = multiply_parallel m1 m2
      printfn "%a" output_any (DateTime.Now - startp)

      size := !size + step
   done   

main()

Before running, you have to make sure you add System.Threading.dll to the referred assemblies. And since this one depends on System.Core.dll, you also have to add this one. In you are using Visual Studio and a F# project, you can add the two references from the project properties.

-r c:\Windows\assembly\GAC_MSIL\System.Core\3.5.0.0__b77a5c561934e089\System.Core.dll -r "C:\Program Files\Microsoft Parallel Extensions Dec07 CTP\System.Threading.dll"

The results are shown below:

Matrices size: 100x100
Sequential...   00:00:00.1250000
Parallel...     00:00:00.1250000
Matrices size: 200x200
Sequential...   00:00:00.9218750
Parallel...     00:00:00.6093750
Matrices size: 300x300
Sequential...   00:00:03.1093750
Parallel...     00:00:01.9375000
Matrices size: 400x400
Sequential...   00:00:07.5000000
Parallel...     00:00:04.7343750
Matrices size: 500x500
Sequential...   00:00:15.1562500
Parallel...     00:00:09.3125000
Matrices size: 600x600
Sequential...   00:00:25.7031250
Parallel...     00:00:16.5312500
Matrices size: 700x700
Sequential...   00:00:41.9843750
Parallel...     00:00:26.4375000
Matrices size: 800x800
Sequential...   00:01:03.5781250
Parallel...     00:00:39.8281250
Matrices size: 900x900
Sequential...   00:01:32.1093750
Parallel...     00:00:57.3125000
Matrices size: 1000x1000
Sequential...   00:02:07.0468750
Parallel...     00:01:18.9687500

Array Sorting

First, I created a function, create_array, that creates and randomly initializes an array of doubles.

let create_array size =
    let rnd = new Random()
    let arr = Array.create size 0.0
    for i = 0 to arr.Length-1 do
        arr.(i) <- rnd.NextDouble()
    arr

The sequential bubblesort implementation is quite straight forward, of course.

let bubblesort_seq (arr : double array) =
    for i = 0 to arr.Length-1 do
        for j = 0 to arr.Length-1 do
            if (arr.(i).CompareTo(arr.(j)) < 0) then
                let temp = arr.(j)
                arr.(j) <- arr.(i)
                arr.(i) <- temp
    arr

Parallelizing it, again, only means replacing the outer for loop with Parallel.For.

let bubblesort_parallel (arr : double array) =
    Parallel.For(0, arr.Length, (fun i ->
            for j = 0 to arr.Length-1 do
                if (arr.(i).CompareTo(arr.(j)) < 0) then
                    let temp = arr.(j)
                    arr.(j) <- arr.(i)
                    arr.(i) <- temp))

    arr

And this is how the two functions were used:

let main()=
    let step = 5000
    let size = ref step
    while (!size <= step*10) do
        let arr = create_array !size
        printfn "Array size: %d" arr.Length

        printf "Sequential...\t"
        let starts = DateTime.Now
        let arrs = bubblesort_seq arr
        printfn "%a" output_any (DateTime.Now - starts)

        printf "Parallel...\t"
        let startp = DateTime.Now
        let arrp = bubblesort_parallel arr
        printfn "%a" output_any (DateTime.Now - startp)

        size := !size + step
    done

main()

The output for the program is:

Array size: 5000
Sequential...   00:00:00.2343750
Parallel...     00:00:00.1562500
Array size: 10000
Sequential...   00:00:00.8593750
Parallel...     00:00:00.5156250
Array size: 15000
Sequential...   00:00:01.9531250
Parallel...     00:00:01.1718750
Array size: 20000
Sequential...   00:00:03.3125000
Parallel...     00:00:02.1562500
Array size: 25000
Sequential...   00:00:05.4062500
Parallel...     00:00:03.5312500
Array size: 30000
Sequential...   00:00:07.4062500
Parallel...     00:00:05.0312500
Array size: 35000
Sequential...   00:00:10.6562500
Parallel...     00:00:06.8906250
Array size: 40000
Sequential...   00:00:13.2343750
Parallel...     00:00:08.9375000
Array size: 45000
Sequential...   00:00:17.6406250
Parallel...     00:00:11.4687500
Array size: 50000
Sequential...   00:00:20.8281250
Parallel...     00:00:14.2187500

If you compare the output with the one from C#, you'll notice that the times are smaller. It looks like F# is faster than C#. Of course it can get faster if I replace the call to CompareTo() with operator <.

if (arr.(i).CompareTo(arr.(j)) < 0) then
if (arr.(i) < arr.(j)) then

In this case the results look like this:

Array size: 5000
Sequential...   00:00:00.1093750
Parallel...     00:00:00.1093750
Array size: 10000
Sequential...   00:00:00.4843750
Parallel...     00:00:00.2343750
Array size: 15000
Sequential...   00:00:01.1093750
Parallel...     00:00:00.4687500
Array size: 20000
Sequential...   00:00:01.9062500
Parallel...     00:00:00.8437500
Array size: 25000
Sequential...   00:00:03.0156250
Parallel...     00:00:01.2500000
Array size: 30000
Sequential...   00:00:04.3437500
Parallel...     00:00:01.8906250
Array size: 35000
Sequential...   00:00:05.9062500
Parallel...     00:00:02.4375000
Array size: 40000
Sequential...   00:00:07.7656250
Parallel...     00:00:03.3593750
Array size: 45000
Sequential...   00:00:09.8281250
Parallel...     00:00:04.0312500
Array size: 50000
Sequential...   00:00:12.2031250
Parallel...     00:00:05.2343750
Hits for this post: 11692 .

After the MVP Summit in Seattle, I started to dig into the Parallel FX framework (currently under a CTP available here). In just a few words, the framework is composed of:

  • Task Parallel Library (TPL), that provides means to manage task (i.e. units of execution), and exposes a
  • TPL API, represented by the static function in the Parallel class: For, ForEach and Do
  • Parallel LINQ (PLINQ): allows parallelization of LINQ queries (AsParallel)

Since I have a 2 codes CPU (AMP Athlon 64 X2 Dual) I wanted to see how much performance do I actually gain by parallelizing loops. So I benchmarked matrices multiplication and arrays sorting.

Matrices Multiplication

Having this Matrix class:

class Matrix< T >
{
    private T[,] _values;

    public int Rows { get; private set; }
    public int Columns { get; private set; }

    public T this[int row, int column]
    {
        get { return _values[row, column]; }
        set { _values[row, column] = value; }
    }

    public Matrix(int rows, int columns)
    {
        if (rows < 1) throw new ArgumentOutOfRangeException("rows");
        if (columns < 1) throw new ArgumentOutOfRangeException("columns");
        Rows = rows;
        Columns = columns;
        _values = new T[rows, columns];
    }
}

I could write the multiplication routine like this:

public static Matrix< double > MultiplySequential(Matrix< double> m1, Matrix< double > m2)
{
    if (m1.Columns != m2.Rows)
         throw new ArgumentException("incorrect matrix size", "m2");

    Matrix< double > result = new Matrix< double >(m1.Rows, m2.Columns);
    for (int i = 0; i < m1.Rows; i++)
    {
        for (int j = 0; j < m2.Columns; j++)
        {
            result[i, j] = 0;
            for (int k = 0; k < m1.Columns; k++)
            {
                result[i, j] += m1[i, k] * m2[k, j];
            }
         }
    }
    return result;
}

Modifying the function to use Parallel.For was pretty simple, since this is very straight forward:

public static Matrix< double > MultiplyParallel(Matrix< double> m1, Matrix< double > m2)
{
    if (m1.Columns != m2.Rows)
        throw new ArgumentException("incorrect matrix size", "m2");

    Matrix< double > result = new Matrix< double >(m1.Rows, m2.Columns);
    Parallel.For(
        0,
        m1.Rows,
        i =>
        {
            for (int j = 0; j < m2.Columns; j++)
            {
                result[i, j] = 0;
                for (int k = 0; k < m1.Columns; k++)
                {
                    result[i, j] += m1[i, k] * m2[k, j];
                }
            }
        });

    return result;
}

I used matrices of various sizes, started from 100x100 to 1000x1000 and compared the execution time for the sequential version or the parallel version.

for(int SIZE = 100; SIZE <= 1000; SIZE += 100)
{
    Console.WriteLine("Matrices size: {0}x{0}", SIZE);

    Matrix< double > m1 = MatrixOperations.GenerateRandom(SIZE, SIZE);
    Matrix< double > m2 = MatrixOperations.GenerateRandom(SIZE, SIZE);

    Console.Write("Sequential...\t");
    DateTime starts = DateTime.Now;
    Matrix< double > ms = MatrixOperations.MultiplySequential(m1, m2);
    Console.WriteLine("{0}", DateTime.Now - starts);

    Console.Write("Parallel...\t");
    DateTime startp = DateTime.Now;
    Matrix< double > mp = MatrixOperations.MultiplyParallel(m1, m2);
    Console.WriteLine("{0}", DateTime.Now - startp);
}

The results were:

Matrices size: 100x100
Sequential...   00:00:00.0625000
Parallel...     00:00:00.0937500
Matrices size: 200x200
Sequential...   00:00:00.4531250
Parallel...     00:00:00.2343750
Matrices size: 300x300
Sequential...   00:00:01.4843750
Parallel...     00:00:00.7500000
Matrices size: 400x400
Sequential...   00:00:03.7812500
Parallel...     00:00:02.0625000
Matrices size: 500x500
Sequential...   00:00:07.8750000
Parallel...     00:00:04.1406250
Matrices size: 600x600
Sequential...   00:00:14.3437500
Parallel...     00:00:07.2656250
Matrices size: 700x700
Sequential...   00:00:23.4531250
Parallel...     00:00:11.7812500
Matrices size: 800x800
Sequential...   00:00:36.1250000
Parallel...     00:00:18.4218750
Matrices size: 900x900
Sequential...   00:00:51.5312500
Parallel...     00:00:25.8906250
Matrices size: 1000x1000
Sequential...   00:01:11.5156250
Parallel...     00:00:35.6875000

They show that the time for executing the parallel version was only half the time for the sequential multiplication (remember that I have 2 cores). That means reducing the execution time, not 20 or 30%, but 50%, basically the maximum possible.

Array Sorting

For sorting the arrays I tested with the wost sorting algorithm: bubblesort. Here is the implementation:

interface ISort< T >
{
    void SortSequential(T[] array);
    void SortParallel(T[] array);
}

class BubbleSort< T > : ISort< T >
    where T: IComparable
{
    public void SortSequential(T[] array)
    {
        for(int i = 0; i < array.Length; ++i)
        {
            for(int j = 0; j < array.Length; ++j)
            {
                if(array[i].CompareTo(array[j]) < 0)
                {
                    T aux = array[i];
                    array[i] = array[j];
                    array[j] = aux;
                }
            }
        }
    }

    public void SortParallel(T[] array)
    {
        Parallel.For(
            0,
            array.Length,
            i =>
            {
                for (int j = 0; j < array.Length; ++j)
                {
                    if (array[i].CompareTo(array[j]) < 0)
                    {
                        T aux = array[i];
                        array[i] = array[j];
                        array[j] = aux;
                    }
                }
            });
    }
}

First, I tested that with small array, 100 to 1000 elements.

int STEP = 100;
for (int SIZE = STEP; SIZE <= STEP*20; SIZE += STEP)
{
    double[] array = GenerateRandomDoubles(SIZE);

    ISort< double > sorter = new BubbleSort< double >();

    Console.WriteLine("Array size: {0}", SIZE);

    Console.Write("Sequential...\t");
    DateTime starts = DateTime.Now;
    sorter.SortSequential(array);
    Console.WriteLine("{0}", DateTime.Now - starts);

    Console.Write("Parallel...\t");
    DateTime startp = DateTime.Now;
    sorter.SortParallel(array);
    Console.WriteLine("{0}", DateTime.Now - startp);
}

And the results were:

Array size: 100
Sequential...   00:00:00
Parallel...     00:00:00.0781250
Array size: 200
Sequential...   00:00:00
Parallel...     00:00:00
Array size: 300
Sequential...   00:00:00
Parallel...     00:00:00
Array size: 400
Sequential...   00:00:00
Parallel...     00:00:00
Array size: 500
Sequential...   00:00:00.0156250
Parallel...     00:00:00
Array size: 600
Sequential...   00:00:00
Parallel...     00:00:00.0156250
Array size: 700
Sequential...   00:00:00
Parallel...     00:00:00.0156250
Array size: 800
Sequential...   00:00:00.0156250
Parallel...     00:00:00
Array size: 900
Sequential...   00:00:00.0156250
Parallel...     00:00:00.0156250
Array size: 1000
Sequential...   00:00:00.0312500
Parallel...     00:00:00.0156250
Array size: 1100
Sequential...   00:00:00.0156250
Parallel...     00:00:00.0156250
Array size: 1200
Sequential...   00:00:00.0312500
Parallel...     00:00:00.0312500
Array size: 1300
Sequential...   00:00:00.0312500
Parallel...     00:00:00.0156250
Array size: 1400
Sequential...   00:00:00.0468750
Parallel...     00:00:00.0312500
Array size: 1500
Sequential...   00:00:00.0468750
Parallel...     00:00:00.0312500
Array size: 1600
Sequential...   00:00:00.0468750
Parallel...     00:00:00.0312500
Array size: 1700
Sequential...   00:00:00.0625000
Parallel...     00:00:00.0468750
Array size: 1800
Sequential...   00:00:00.0625000
Parallel...     00:00:00.0468750
Array size: 1900
Sequential...   00:00:00.0781250
Parallel...     00:00:00.0468750
Array size: 2000
Sequential...   00:00:00.0937500
Parallel...     00:00:00.0468750

Then, I changed the for loop, so that it creates array from 5000 to 50000 elements. The results were:

Array size: 5000
Sequential...   00:00:00.5312500
Parallel...     00:00:00.4062500
Array size: 10000
Sequential...   00:00:02.0468750
Parallel...     00:00:01.2500000
Array size: 15000
Sequential...   00:00:04.5781250
Parallel...     00:00:02.7187500
Array size: 20000
Sequential...   00:00:08.1562500
Parallel...     00:00:04.7187500
Array size: 25000
Sequential...   00:00:12.7343750
Parallel...     00:00:07.5625000
Array size: 30000
Sequential...   00:00:18.5468750
Parallel...     00:00:10.6406250
Array size: 35000
Sequential...   00:00:25.4687500
Parallel...     00:00:14.5468750
Array size: 40000
Sequential...   00:00:33.6562500
Parallel...     00:00:18.9531250
Array size: 45000
Sequential...   00:00:42.7968750
Parallel...     00:00:23.8906250
Array size: 50000
Sequential...   00:00:52.9218750
Parallel...     00:00:29.4687500

The conclusion is that the parallel sorting is not efficient when the array size is less than 1000 elements; in other words there is no performance gain, but perhaps performance lost due to overhead of creating tasks and executing them and managing the context switching. The same conclusion is expressed in the framework's documentation:

Target areas of your program where algorithms are computationally expensive and/or data sets are large (e.g. > 1000 elements). In such cases, there will likely be benefit to using parallelism.

That means, that when you parallelize an algorithm, you should take into consideration the size of the data structures it performs on, and unless a given threshold is not reached, a non-parallelized version should be used.

Hits for this post: 12993 .

A new installation kit for the Visual C++ 2008 Feature Pack has been released. It’s recommended to uninstall the original pack (version 9.0.30304.0) and install the new one. The new feature pack is available at the Download Center. Before you install, make sure you read the note from the System Requirements paragraph; there is an important notice there related to the order of installation with Windows SDK 6.1.

Comments on the feature pack refresh have been made on the Visual C++ blog.

Hits for this post: 5155 .

The MVP Global Summit ended today. It has been a great week here at Seattle and Redmond. There were 1753 MVPs present at the summit and 1000 Microsoft employees from the product developing teams were involved in the event, delivering presentations, talking about the things they do for the next versions of their products and, most important, taking a lot of feedback from the MVPs. Personally, I spent two days with the Visual C++ team, and the feeling among the MVPs there was great. In the past there has been some frustration among us about the quality of VC++, but now we have a good feeling that things will change and VC++ will retake its place and lead, at least to some extent, the innovations deliver by Microsoft.

Here are some pictures from the summit.

MVP Global Summit

The flags at Microsoft Convention Center…

Flags at MSCC

Welcoming the MVPs…

Welcoming the MVPs

Group picture…

Group picture

CodeGuru MVPs together with Korean MVPs…

CodeGuru and Korean MVPs

Canadian group picture…

Canadian group picture

The battle between Newsgroups and Forums…

Newsgroups vs. Forums

CodeGuru group picture…

CodeGuru group picture

That’s it for this year summit. Looking forward for the next one.

Hits for this post: 17102 .

Today starts the MVP Global Summit in Seattle and at Microsoft’s campus in Redmond. I have arrived yesterday in Seattle, one day later than planned because I lost my connection flight in New York. At least after spending the second part of Saturday waiting in lines (one hour at the Customs, two hours waiting at Delta’s counters for a rebook, 45 minutes at the hotel for checking-in, and at least 30 minutes at the restaurant to get a table to eat) I traveled to Seattle at business class. I already met some friends here, from Romania and Germany and I’m looking forward to meet the others and take part at the sessions planned for this summit.

Here is a picture taking from my room at the North Tower of Westin Hotel.

Seattle at night from the North Tower of Westin Hotel

, , Hits for this post: 23535 .

I had to do some output formatting in C++ for showing the content of a buffer. Take for instance this buffer:

unsigned char buffer [] =
{
    0x00, 0x01, 0x02, 0x03,
    0x04, 0x05, 0x06, 0x07,
    0x08, 0x09, 0x0A, 0x0B,
    0x0C, 0x0D, 0x0E, 0x0F
};

I wanted the output to be like this:

0x00, 0x01, 0x02, 0x03,
0x04, 0x05, 0x06, 0x07,
0x08, 0x09, 0x0a, 0x0b,
0x0c, 0x0d, 0x0e, 0x0f,

The simples way to do it is like this:

for(int index = 0; index < sizeof(buffer); ++index)
{
   std::cout << "0x" << std::hex << std::setw(2) << std::setfill('0')
      << (int)buffer[index] << std::dec << ", ";
   if((index+1) % 4 == 0)
      std::cout << std::endl;
}

That for loop achieves the goal. But then, I though, why not using std::copy? Well, the first thing you can come up with is this:

std::copy(
    &buffer[0],
    &buffer[sizeof(buffer)],
    std::ostream_iterator< unsigned char >(std::cout, " ")
);

But that can only produce this output:

  ? ? ? ? ? ?
 ? ?

That happens because we used unsigned char instead of int for the ostream iterator. You can correct that like this:

std::copy(
    &buffer[0],
    &buffer[sizeof(buffer)],
    std::ostream_iterator< int >(std::cout, " ")
);

and get the output like this:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

But this is still far from what I want. The best approach to solve the problem is to use a class that encapsulates a character (or any other type that we want to print), provide a printing function for it and overload the operator<< to output instances of this class. This is shown bellow:

template < class T >
class Printer
{
   T value;
public:
   Printer(T val): value(val)
   {
   }

   void Print(std::ostream& os) const
   {
      os << "0x" << std::hex << std::setw(2) << std::setfill('0')
         << (int)value << std::dec << ", ";
   }
};

template < class T >
std::ostream& operator<<(std::ostream& os, Printer< T > const& elem)
{
   elem.Print(os);
   return os;
}

We can now transform the copy call to this:

std::copy(
    &buffer[0],
    &buffer[sizeof(buffer)],
    std::ostream_iterator< Printer < unsigned char > >(std::cout)
);

and the output is:

0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F

That is much better, but not yet perfect. To achieve the final goal, we can transform the Printer class to keep a count of the printer elements:

template < class T >
class Printer
{
   static int index;
   T value;
public:
   Printer(T val): value(val)
   {
   }

   void Print(std::ostream& os) const
   {
      os << "0x" << std::hex << std::setw(2) << std::setfill('0')
         << (int)value << std::dec << ", ";
      if((index+1) % 4 == 0)
         os << std::endl;
      index++;
   }
};
template < class T >
int Printer< T >::index = 0;

template < class T >
std::ostream& operator<<(std::ostream& os, Printer< T > const& elem)
{
   elem.Print(os);
   return os;
}

The copy call remains the same, and the output is:

0x00, 0x01, 0x02, 0x03,
0x04, 0x05, 0x06, 0x07,
0x08, 0x09, 0x0a, 0x0b,
0x0c, 0x0d, 0x0e, 0x0f,

Of course, if you use this several times, you have to reset the static index from the Printer class. I'll leave that to you.

Hits for this post: 7415 .

Here is a list of, what I consider, good F# resources. Hopefully you’ll find them helpful.

Official Documentation

Forums and Wikis

  • F# wikicontains articles, tips and sample code
  • hubFSa very good forum focused on F#

Blogs

Hits for this post: 8267 .

Yesterday Microsoft released the Visual C++ 2008 Feature Pack, formally known as MFC Feature Pack beta. The pack can be downloaded from Microsoft’s Download Center, is available only in English and requires Visual Studio 2008 Standard Edition or above. Installation on systems with Visual Studio 2008 Service Pack 1 BETA is not supported.

Hits for this post: 13256 .

As it became a tradition, Google made a few April Fool’s Day jokes. Here is a list of the best ones:

Microsoft did a very good joke also. Channel9 featured an interview with Brian Beckman on “Project Quark”, about a quantum computer developed at Microsoft.

http://channel9.msdn.com/Showpost.aspx?postid=394477

Hits for this post: 7275 .