Asaf Shelly is a highly experienced parallel programmer, blogs on Intel.com and has been recognised in the online programming community as an Intel Black Belt. Appropriately enough, he’s working on three projects in parallel at the moment: supporting a USB device-side engine for a medical company, writing a Windows NDIS network driver for a start-up company, and converting communication and management libraries from Borland C++ to C#.Net for Israel Aerospace Industries.
We interviewed him by email to find out more about the parallel programming challenges he’s faced, and about what’s new in the second generation Core i series processors, formerly known by the code name Sandy Bridge.
This is the first part in a two-part interview. Come back again tomorrow, when we’ll discuss programming models of the future, which languages Asaf is most interested in, and the most important thing Asaf tells his parallel programming students. If you want to meet Asaf, come along to the free networking event for programmers in London next month.
Softtalkblog: Give me some examples of your parallel programming projects…
I am now working on supporting a USB device-side engine for a medical company. The PC side was the main reason for producing a .Net component wrapper for Microsoft’s WinUSB driver, which is available with full source code here. If you take a closer look at the .Net WinUSB component you will see uses of ‘Async’ and ‘Invoke’, which mean that one thread is posting a request to another thread. This is because you might have a reader thread querying for data, or you may have a thread pool (user work object). Read operations come from the application, but read-completion events come from the driver, and all of these must be synchronized correctly. In this case the component uses an event callback (delegate) to signal that a new buffer is ready for use or that data was successfully transmitted. Independently of reads and writes, the USB device may be attached or detached at any point in time; this is another event which must be handled by the user code.
It was relatively simple for me to understand parallel systems because the Windows NT kernel is fully parallel and enforces a parallel software design upon drivers. I covered this a while back in this blog post about the Windows NT Kernel.
Most of the work I do is parallel and, once you get used to it, it is very simple to implement and understand. It often even simplifies system design: there are parallel design patterns which solve problems very elegantly, but they are usually replaced by a collection of serial design patterns simply because people are used to working serially and are not familiar with the parallel alternatives.
At the moment, I am working nights at home. I still try to find the time to blog about parallel computing, which involves some setup (code samples, tests, etc.). I also try to keep up with my emails and get back to people asking me technical questions as a result of a blog post or an offline presentation. Most times, if you have the correct design concept then it all just falls into place.
Softtalkblog: What challenges have you had in parallel programming?
This is a very good question. At first the challenge was me: I really struggled to understand when to use a mutex rather than a critical-section as a lock. In time I realised that the problem was not really with me, it was with the information available; there was too much confusion in the texts. For example, a critical-section does not really protect a section of code, it protects a resource. It was a long learning process for me. Eventually I wrote a few chapters for a C++ book about this and was asked to teach a class. That was a few years back, and since then I have given over 30 different classes lasting between one and five days. This was my main motivation for establishing the website AsyncOp.com: there were simply not enough resources online at the time.
Today I can tell you that parallel design is much simpler than object-oriented design, but we are used to working with the latter, and most of the courseware covers objects rather than tasks. Too often people ask me to produce a parallel solution for a serial design, and most times it just doesn’t work this way. Parallel computing belongs at the foundation of the application. You have events coming from the user, the network, devices, the keyboard and mouse, background computation, and so on. All these events arrive asynchronously, and you cannot ignore that. Once it is clear that the top-level design is really parallel, the riddle of design details and implementation solves itself.
Softtalkblog: What have you found is the biggest help when developing parallel programs?
That’s an easy one: thinking like people, not like machines. Go to a fast food restaurant. You have several people taking orders in parallel, and this all adds up to a single queue of tasks. These are then dispatched to different workers: one dealing with the drinks, someone else getting the tomato, and someone else on the grill. The worker on the grill handles the grill section for all orders in the queue. Eventually the orders are buffered so that everyone at the same table gets what they ordered at the same time. Students do this successfully before they learn the definition of ‘object’. People always notice this when we go for lunch during classes, after talking about parallel system design and the importance of queues: suddenly you start identifying work patterns all around you.
Softtalkblog: What does the new Sandy Bridge processor bring for programmers?
We have seen the internal design at the under-NDA event. It is really unique, taking parallel computing two steps ahead. So far we have seen multicore CPUs in which the connection between cores enhanced the existing design. The Sandy Bridge technology introduces a new core-to-core interconnection technology. As far as I can see, this will improve parallel software dramatically by reducing the bottlenecks which present a performance barrier today. The interesting thing is that this will happen automatically, without any software redesign, just as it did 10 years ago with CPU clock speed upgrades.
Softtalkblog: What benefits does AVX bring?
It might be surprising to hear that Intel 386 processors supported the base prototype of what has now matured into AVX. Using AVX it is possible to reduce the number of cores in use. Multicore programming means that we multiply CPU power by utilizing more cores: by using four cores we can potentially cut processing time to a quarter of what it would be for the same task on a single core. In order to use four cores, however, we need to use four threads and synchronize them.
AVX can achieve the same results using a single core, with no synchronization overhead required; in fact, AVX has the ability to replace 16 cores in some cases. AVX technology means that a single CPU core operates on several data variables at the same time, so instead of using four cores each operating on a single variable, we use a single core operating on four variables at once. The limitation is that the total size of the operands is up to 256 bits. The data operations can handle basic types such as byte, int, and quad-word, as well as single- and double-precision floating point variables.
AVX also accelerates mathematical and logical operations, with features such as computing a square root in a single CPU instruction, 256-bit floating point operations, encryption hardware acceleration, average, maximum, minimum, multiply-accumulate, and conditional assignment, all operating on multiple variables at the same time.
By using AVX we can make a single CPU core do the work of up to 16 CPU cores, with no need for synchronization management. Microsoft Visual Studio 2010 already has support for AVX instructions, and you can see a performance demo comparing a serial application with SSE (the predecessor of AVX), reducing computation time from 57 seconds to 2 seconds, in my blog post called Visual Studio 2010 Built-In CPU Acceleration.
Asaf, thank you. We’ll pick up the discussion again tomorrow!