Archive for the ‘Concurrency and Parallelism’ Category

Measuring traffic on the PCI Express Bus (PCIe)

Friday, April 24th, 2015

During my talk at the parallel 2015 conference I was asked how to measure traffic on the PCI Express bus. For multi-GPU computing it is very important to control the amount of data exchanged over the PCIe bus.

You need the Intel Performance Counter Monitor. Compile it and copy pcm-pcie.exe into a new directory.

Then read this helpful article on how to obtain the missing WinRing0 DLLs and drivers. Copy them into the same directory, start cmd.exe as an administrator, and there you go.

Now you can analyse the traffic on the PCIe bus.

DEBUG: Setting Ctrl+C done.

Intel(r) Performance Counter Monitor: PCIe Bandwidth Monitoring Utility

Copyright (c) 2013-2014 Intel Corporation
This utility measures PCIe bandwidth in real-time

PCIe event definitions (each event counts as a transfer):
   PCIe read events (PCI devices reading from memory – application writes to disk/network/PCIe device):
     PCIePRd   – PCIe UC read transfer (partial cache line)
     PCIeRdCur* – PCIe read current transfer (full cache line)
         On Haswell Server PCIeRdCur counts both full/partial cache lines
     RFO*      – Demand Data RFO
     CRd*      – Demand Code Read
     DRd       – Demand Data Read
     PCIeNSWr  – PCIe Non-snoop write transfer (partial cache line)
     PRd       – MMIO Read [Haswell Server only: PL verify this on IVT] (Partial Cache Line)
   PCIe write events (PCI devices writing to memory – application reads from disk/network/PCIe device):
     PCIeWiLF  – PCIe Write transfer (non-allocating) (full cache line)
     PCIeItoM  – PCIe Write transfer (allocating) (full cache line)
     PCIeNSWr  – PCIe Non-snoop write transfer (partial cache line)
     PCIeNSWrF – PCIe Non-snoop write transfer (full cache line)
     ItoM      – PCIe write full cache line
     RFO       – PCIe parial Write
     WiL       – MMIO Write (Full/Partial)

* – NOTE: Depending on the configuration of your BIOS, this tool may report '0' if the message
           has not been selected.

Starting MSR service failed with error 2 The system cannot find the file specified.
Trying to load winring0.dll/winring0.sys driver…
Using winring0.dll/winring0.sys driver.

Number of physical cores: 6
Number of logical cores: 12
Number of online logical cores: 12
Threads (logical cores) per physical core: 2
Num sockets: 1
Physical cores per socket: 6
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 4
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 3500000000 Hz
Package thermal spec power: 140 Watt; Package minimum power: 47 Watt; Package maximum power: 0 Watt;
2 memory controllers detected with total number of 5 channels.

Detected Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz “Intel(r) microarchitecture codename Haswell-EP/EN/EX”
Update every 1 seconds
delay_ms: 84
Skt | PCIeRdCur |  RFO  |  CRd  |  DRd  |  ItoM  |  PRd  |  WiL
0    4236         576    5980 K  1536 K     48    3456    7116
*    4236         576    5980 K  1536 K     48    3456    7116

[Review] New edition needed: a good book, but unfortunately no longer up to date

Thursday, March 20th, 2014

Review of the book: Barbara Chapman, Gabriele Jost, Ruud van der Pas. Using OpenMP. MIT Press, 2008.

This book was published in 2008, when version 2.5 of OpenMP was current. Back then I would have given it four stars, deducting one star because, for example, there are errors in the example code.

But the past six years are a small eternity in parallel programming, and here I find it a real shortcoming that there is still no second edition of the book. So it is down to three stars.

Version 3.0, also released in 2008, introduced the “task” construct into OpenMP. Today this should really be the starting point for parallelization with OpenMP. The older “for” pragma does not compose efficiently: subroutines must not parallelize themselves as well, because then far more threads are started than there are processors. This often leads to bad code or poor utilization.

And version 4.0 at the latest, released in mid-2013, calls for a new edition of the book. This version supports SIMD instructions, user-defined reductions, and also GPU computing, i.e. offloading computations to graphics cards.

Apart from that, the book is a good introduction with some good performance tips, such as optimizing memory access patterns.

CUDA Real-Time Ray Tracer

Sunday, January 3rd, 2010

During the Christmas holidays I rewrote my ray tracer for the NVIDIA CUDA architecture. CUDA is extremely powerful: with an NVIDIA 285 I achieved more than 250 FPS at 640×480 pixels and 57 FPS at 1080×1030.

Compare this with a similar ray tracer running on an Intel Core i7 920.

Compiling OpenCL programs on Mac OS X Snow Leopard

Monday, September 28th, 2009

I installed Snow Leopard on my laptop yesterday. I was very curious about OpenCL and installed the drivers and the GPU Computing SDK from NVIDIA.

I searched my hard disk after installation and found the following directory: /Developer/GPU Computing/OpenCL. Looks promising.

In the subdirectory src/oclDeviceQuery I found a basic test and I tried to compile it.

$ cd src/oclDeviceQuery
$ make
ld: library not found for -loclUtil
collect2: ld returned 1 exit status
make: *** [../../..//OpenCL//bin//darwin/release/oclDeviceQuery] Error 1

I googled for “-loclUtils” and found nothing. “liboclUtils”? Nothing. So I had found a brand-new problem unknown to mankind. Hurray. 😉

But I remembered a similar situation from using the CUDA SDK, so I searched the other directories. The solution is to build the library manually.

$ pushd ../../common/
$ make
ar: creating archive ../..//OpenCL//common//lib/liboclUtil.a
q - obj/release/oclUtils.cpp.o
$ popd

Now -loclUtil should be found, so I tried to compile again.

$ make
ld: library not found for -lshrutil
collect2: ld returned 1 exit status
make: *** [../../..//OpenCL//bin//darwin/release/oclDeviceQuery] Error 1

Aha, there’s another library missing. I tried the one in /Developer/GPU Computing/shared.

$ cd ../../../shared/
$ make
src/rendercheckGL.cpp: In member function ‘virtual bool CheckBackBuffer::readback(GLuint, GLuint, GLuint)’:
src/rendercheckGL.cpp:523: warning: format ‘%d’ expects type ‘int’, but argument 2 has type ‘GLuint’
src/rendercheckGL.cpp:527: warning: format ‘%d’ expects type ‘int’, but argument 2 has type ‘GLuint’
src/rendercheckGL.cpp: In member function ‘virtual bool CheckFBO::readback(GLuint, GLuint, GLuint)’:
src/rendercheckGL.cpp:1342: warning: format ‘%d’ expects type ‘int’, but argument 2 has type ‘GLuint’
src/rendercheckGL.cpp:1346: warning: format ‘%d’ expects type ‘int’, but argument 2 has type ‘GLuint’
a - obj/release/shrUtils.cpp.o
a - obj/release/rendercheckGL.cpp.o
a - obj/release/cmd_arg_reader.cpp.o

Back into the directory with the device query sources.

$ cd ../OpenCL/src/oclDeviceQuery/
$ make

The compilation succeeds, but where’s the executable? It is not in the current directory.

$ ls
Makefile            obj/                oclDeviceQuery.cpp

I searched the directories again: it is in the bin subfolder of /Developer/GPU Computing/OpenCL.

$ ../../bin/darwin/release/oclDeviceQuery

oclDeviceQuery.exe Starting...

OpenCL SW Info:

 CL_PLATFORM_VERSION: 	OpenCL 1.0 (Jul 15 2009 23:07:32)
 OpenCL SDK Version:

That’s it. OpenCL runs on my laptop. Yeah. :-)

Hyper-Threading with the Intel Core i7

Sunday, June 14th, 2009

I have got a new computer. As always I built it myself. See the following photos and note the impressive size of the CPU cooler by Noctua.


I chose the Intel Core i7 because I was very curious about its technical features. It has four “real” physical cores, but provides eight “virtual” cores with hyper-threading. These “virtual” cores are what the operating systems show in their task/process managers. See the following screenshots for Windows and Linux.

    8 cores on Linux

The question I asked myself is: How do these virtual cores perform? How many programs can I run in parallel without hurting performance? What is the speedup? Is it 4? Is it 8?

So I ran a test. I chose a single-threaded program, the ray tracer pbrt, and started it 1, 2, 3, …, 10 times as concurrent processes under Linux, timing the runs. Here are the results.

Programs  Running times (min:sec)                                                       Speedup  Explanation
 1        1:18.27                                                                       1
 2        1:18.57 1:18.32                                                               1.997
 3        1:18.69 1:18.76 1:19.18                                                       2.97
 4        1:19.62 1:21.88 1:20.12 1:19.68                                               3.83
 5        1:54.01 1:54.38 1:53.47 1:19.33 1:54.90                                       3.41     2 cores with 2 threads each, 1 core with 1 thread
 6        1:56.13 1:22.16 1:23.09 1:54.22 1:55.41 1:54.95                               4.05     2 cores with 2 threads each, 2 cores with 1 thread each
 7        1:53.27 1:25.28 1:53.62 1:53.92 1:56.38 1:55.49 1:54.05                       4.72     3 cores with 2 threads each, 1 core with 1 thread
 8        1:59.50 1:57.72 1:55.16 1:54.96 1:58.60 1:57.72 1:58.46 1:59.62               5.25     4 cores with 2 threads each
 9        2:08.65 2:09.34 1:59.44 2:07.06 2:00.61 2:38.73 2:02.70 2:01.40 2:10.74       4.45     4 cores with 2 threads each
10        2:04.29 2:23.16 2:44.80 2:09.42 2:45.95 2:16.97 2:14.71 2:10.60 2:15.10 2:09.96  4.73  4 cores with 2 threads each

For up to four programs the Core i7 behaves like an ordinary four-core processor: the programs run in parallel with the same performance of about 80 seconds each, and the speedup is almost linear.

When more than four programs run, the processor has to run at least two threads on some core. Two virtual processors then share a single physical core, and those programs take about 114 seconds.

Conclusion: Hyper-threading gives us some extra computing power here. The best speedup of 5.25 was achieved with 8 programs.
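For reference, the speedup column can be approximated from the timings themselves: n identical single-threaded jobs represent n · t₁ seconds of sequential work, which is finished when the slowest job completes. The following sketch (in Haskell, like the listing further down) uses this worst-case formula, which is my assumption; the table's numbers were probably computed from average times, hence the small differences (e.g. 5.23 vs. 5.25).

```haskell
-- Speedup of n identical single-threaded jobs run concurrently:
-- n * tSingle seconds of sequential work, done when the slowest job is.
speedup :: Int -> Double -> Double -> Double
speedup n tSingle tSlowest = fromIntegral n * tSingle / tSlowest

main :: IO ()
main = do
  print (speedup 4 78.27 81.88)    -- 4 jobs, slowest 1:21.88, roughly 3.82
  print (speedup 8 78.27 119.62)   -- 8 jobs, slowest 1:59.62, roughly 5.23
```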

By the way: the following image was the one rendered for the benchmark. See the gallery of pbrt for more.

Parallelization with Haskell – Easy as can be

Sunday, June 7th, 2009

The functional programming language Haskell provides a very easy way to parallelize code. Consider the following naive implementation of the Fibonacci function.

fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

This implementation has exponential time complexity, so it should be improved, for example with memoization. But that is beyond the scope of this article; we just need a function that takes a while to finish.

In Haskell two combinators are used for parallelization: par and pseq. par a b is a kind of “fork” operation: the evaluation of a is sparked in parallel and b is returned. Keep in mind that Haskell has a lazy evaluation strategy, so a is only evaluated if it is needed. The function pseq a b evaluates first a, then b.
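A minimal standalone sketch of the two combinators (here imported from GHC.Conc in base; the parallel package's Control.Parallel provides the same functions):

```haskell
import GHC.Conc (par, pseq)

main :: IO ()
main = print result
  where
    a = sum [1 .. 1000000] :: Int   -- sparked for parallel evaluation
    b = sum [1 .. 2000000] :: Int   -- forced by the current thread
    -- spark a, force b, then combine; without pseq the main thread
    -- could start evaluating a itself and waste the spark
    result = a `par` (b `pseq` (a + b))
```

Compiled with -threaded and run with +RTS -N2, the spark for a can be picked up by a second core while the main thread forces b.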

Equipped with these two operations it is very easy to parallelize fib.

parfib n
  | n < 11    = fib n                           -- for small values of n we use the sequential version
  | otherwise = f1 `par` (f2 `pseq` (f1 + f2))  -- calculate f1 and f2 in parallel, return the sum
  where
    f1 = parfib (n - 1)
    f2 = parfib (n - 2)

The code has to be compiled with the -threaded option.

ghc -O3 -threaded --make -o parfib ParFib.hs

The number of threads is specified at run time with the -N RTS option.

./parfib +RTS -N7 -RTS

On an Intel Core i7 920 this resulted in a speedup of 4.13 for n=38. This processor has four physical cores.

So this is efficient. Haskell is still one of the best programming languages.

An exercise in parallelization with the Cell Broadband Engine

Tuesday, May 19th, 2009

The Cell Broadband Engine is a multi-core processor. One of its cores, the so-called PPE, is a general-purpose processor that handles I/O, memory, etc. There are six so-called SPEs that are specialized for number crunching. All of the cores have 128-bit SIMD units.

So basically there are two ways to parallelize here.

  1. Run the ray tracer on the six SPEs and merge the results.
  2. Rewrite the ray tracer to process 4 rays simultaneously using the SIMD vectors.

At the time of writing I have only implemented the first option. See my homepage for details. The following film shows the ray tracer in action: it simply splits the screen into n parts and uses one SPE for each part.
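The screen-splitting step can be sketched as follows. This is an illustrative helper, not the actual Cell code: chunks is a hypothetical function that divides the image's scan lines into near-equal contiguous ranges, one per SPE.

```haskell
-- Divide h scan lines into n contiguous (startRow, rowCount) chunks,
-- spreading any remainder over the first chunks.
chunks :: Int -> Int -> [(Int, Int)]
chunks n h = [ (i * q + min i r, q + extra i) | i <- [0 .. n - 1] ]
  where
    (q, r)  = h `divMod` n
    extra i = if i < r then 1 else 0

main :: IO ()
main = print (chunks 6 480)   -- six SPEs, a 480-line image
-- [(0,80),(80,80),(160,80),(240,80),(320,80),(400,80)]
```

Each (start, count) pair would then be handed to one SPE, and the PPE merges the rendered strips back into the frame buffer.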

Copyright © 2007-2015 Jörn Dinkla. All rights reserved.