To parallelize the for loop, we add the OpenMP directive #pragma omp parallel for above it; this tells the compiler to parallelize the loop that follows. While parallelizing the loop, however, it is not possible to return from within the if statement when the element is found, because returning there would be an invalid branch out of an OpenMP structured block.
Hence we have to change the implementation a bit. The modified loop keeps scanning the input to the end regardless of a match, so it contains no invalid branches out of the OpenMP block. It really is as simple as that: all that had to be done was to add the compiler directive, and the parallelization is taken care of. The code will also still run serially if the OpenMP directives are removed, albeit with the modification in place.
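A minimal sketch of the modified search is below. The function name, array, and key are illustrative, and the atomic write is an addition of mine to keep the shared result index race-free; instead of returning on a match, each iteration simply records the matching index, so there is no branch out of the OpenMP structured block.

    /* Parallel linear search: scan the whole array, record any match. */
    int parallel_linear_search(const int *a, int n, int key)
    {
        int found = -1;                 /* -1 means "not found" */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            if (a[i] == key) {
                #pragma omp atomic write
                found = i;              /* keep scanning; no early return */
            }
        }
        return found;
    }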
It is worth noting that with the parallel implementation every element is checked regardless of a match, albeit in parallel, because no thread can simply return as soon as it finds the element.
Further, if there is more than one instance of the required element in the array, there is no guarantee that the parallel linear search will return the first match. The order in which threads run and terminate is non-deterministic, so there is no way of knowing which thread will finish first or last. To preserve the order of the matched results, an index attribute has to be added to each result. You can find the complete code of Parallel Linear Search here.
Selection sort is an in-place comparison sorting algorithm. It is noted for its simplicity, and it has performance advantages over more complicated algorithms in certain situations, particularly where auxiliary memory is limited. In selection sort, the list is divided into two parts: the sorted part at the left end and the unsorted part at the right end. Initially, the sorted part is empty and the unsorted part is the entire list. On each pass, the next extreme element of the unsorted part is selected and swapped into place, and the boundary of the unsorted part moves one element to the right.
Selection sort has a time complexity of O(n^2), making it unsuitable for large lists. By parallelizing the implementation, we let multiple threads split the data amongst themselves and then search for the largest element independently on their part of the list; each thread locally stores its own candidate. The outer loop is not parallelizable, because the array is modified on every pass and every i-th iteration needs the (i-1)-th to be completed.
In selection sort, the parallelizable region is the inner loop, where we can spawn multiple threads to look for the maximum element in the unsorted portion of the array. Then we can reduce each thread's local maximum into one final maximum.
However, in the implementation we are not looking for the maximum element itself; we are looking for the index of the maximum element. For this we need to declare a new custom reduction. The ability to describe our own custom reduction is a testament to the flexibility that OpenMP provides. The declared reduction operates on a struct.
So, our custom maximum-index reduction will look something like the sketch below. To verify correctness, we can use a simple verify function that checks whether the array is sorted; it confirms that the parallel implementation is equivalent to the serial implementation and produces the required output.
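As a sketch only (the struct, reduction identifier, and function name are mine, not taken from the original code), a maximum-index reduction can be declared and used like this:

    #include <limits.h>

    typedef struct { int value; int index; } MaxPair;

    /* Custom reduction: keep whichever MaxPair holds the larger value. */
    #pragma omp declare reduction(maxidx : MaxPair :                     \
            omp_out = (omp_in.value > omp_out.value ? omp_in : omp_out)) \
            initializer(omp_priv = (MaxPair){ INT_MIN, -1 })

    /* Index of the largest element in a[0..limit] (the inner loop of
       selection sort). */
    int index_of_max(const int *a, int limit)
    {
        MaxPair best = { INT_MIN, -1 };
        #pragma omp parallel for reduction(maxidx : best)
        for (int j = 0; j <= limit; j++) {
            if (a[j] > best.value) {
                best.value = a[j];
                best.index = j;
            }
        }
        return best.index;
    }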
You can find the complete code of Parallel Selection Sort here.

Mergesort is one of the most popular sorting techniques. It is the typical example for demonstrating the divide-and-conquer paradigm.

A quick way to speed up strcmp is to compare the first characters of the two strings yourself and only call strcmp when they match. Because of the non-uniform distribution of letters in natural languages, the payoff is not as large as you might expect. There is also a double evaluation going on, so this can be counter-productive if the arguments to strcmp aren't just simple variables.
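A sketch of such a macro (the name is mine; the form is the commonly seen one):

    #include <string.h>

    /* Compare the first characters inline; fall back to strcmp only when
       they match.  Note that a and b are each evaluated twice, so they
       must be simple expressions with no side effects. */
    #define QUICK_STRCMP(a, b) (*(a) != *(b) ? *(a) - *(b) : strcmp((a), (b)))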
If you're working on adjacent strings in sorted input, you will almost always have to check past the first character anyway. An entirely different way to speed up strcmp is to place all your strings into a single array, in order. Then you only have to compare the pointers, not the strings. If the point of all the calls to strcmp is to search for a value from a large, known set, and you expect that you'll be doing many such searches, then you'll want to invest in a hash table.
Keep in mind that if the strings you are working with are of substantial size, strlen will dutifully check each character until it finds the NUL terminator, so eliminating unnecessary calls to it can be a substantial improvement when your strings are fairly long.
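A hypothetical example (the function is mine, purely for illustration): calling strlen in a loop condition rescans the whole string on every iteration, while caching the length scans it only once.

    #include <string.h>

    /* Count the spaces in s.  Calling strlen once, outside the loop,
       avoids rescanning the whole string on every iteration, which is
       what a loop condition like "i < strlen(s)" would do. */
    size_t count_spaces(const char *s)
    {
        size_t n = 0;
        size_t len = strlen(s);
        for (size_t i = 0; i < len; i++)
            if (s[i] == ' ')
                n++;
        return n;
    }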
Except for the first one, which terminates the string, the extra zeroes are typically never used, so some time is wasted, though typically not very much. If you are tempted to replace these library routines with your own versions, be sure to verify that your version (a) works and (b) is faster.
This is not a sure thing, and you should try it both ways on all the machines you plan to run on before committing to it. Also, converting back and forth between ints and floats can take up considerable time; I have heard this is an acute problem on Apollos. On many machines, floats work faster than doubles. If the bottleneck involves floating-point arithmetic and you don't need the extra precision, change the pertinent variables to float and see what happens. But, similar to the above, the cost of converting between floats and doubles can outweigh the benefits if applied indiscriminately.
Try compiling your code with all the compilers you have at your disposal and use the fastest one for compiling the one or two functions which are bottlenecks. For the rest just use the one that gives the most informative error messages and produces the most reliable output.
Compilers are evolving even as we speak, so keep up with the latest revision if possible. If a function declares a very large local array, it can eat up the stack; in that case the solution is to rewrite the code so it can use a static or global array, or perhaps allocate it from the heap. A similar solution applies to functions which have large structs as locals or parameters. On machines with small stacks, a stack-bound program may not just run slowly, it might stop entirely when it runs out of stack space.
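A sketch under assumed sizes (the functions and buffer are made up): a large buffer can be moved off the stack either by making it static or by allocating it from the heap.

    #include <stdlib.h>

    void process_static(void)
    {
        static double workspace[1 << 20];   /* off the stack, but not reentrant */
        /* ... use workspace ... */
    }

    void process_heap(void)
    {
        double *workspace = malloc((size_t)(1 << 20) * sizeof *workspace);
        if (workspace == NULL)
            return;
        /* ... use workspace ... */
        free(workspace);
    }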
Recursive functions, even ones which have few and small local variables and parameters, can still affect performance. On some ancient machines there is a substantial amount of overhead associated with function calls, and turning a recursive algorithm into an iterative one can save some time. A related issue is last-call optimization: if the very last thing func1 does is call func2, the compiler can arrange for func2, when it is done, to return directly to func1's caller.
This (a) reduces the maximum depth of the stack and (b) saves the execution of some return-from-subroutine code, as it will get executed only once instead of twice or more, depending on how deeply the function results are passed along.
If func1 and func2 are the same function (that is, the call is recursive), the compiler can do something else: it can turn the call into a jump back to the top of the function, reusing the current stack frame. This is called tail-recursion elimination. Another option for a critical function is to rewrite it by hand in assembly language. Of course, this is not a portable solution, and RISC assembler is especially difficult to write by hand.
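Purely as an illustration (these functions are hypothetical), here is a tail-recursive sum next to its iterative equivalent; a compiler that performs tail-call elimination effectively turns the first form into a loop much like the second.

    /* Tail-recursive: the recursive call is the very last thing done. */
    long sum_rec(const int *a, int n, long acc)
    {
        if (n == 0)
            return acc;
        return sum_rec(a + 1, n - 1, acc + a[0]);
    }

    /* Iterative equivalent: no call overhead, constant stack depth. */
    long sum_iter(const int *a, int n)
    {
        long acc = 0;
        for (int i = 0; i < n; i++)
            acc += a[i];
        return acc;
    }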
Experts vary on the approach one should take. Some suggest writing your own assembler version of a function from scratch, in hopes that you'll find a novel way of calculating the result, while others recommend taking the compiler's version as a starting point and just fine-tuning it.

On another front, calling a dynamically linked function is in many cases slightly slower than calling it statically. The principal part of this extra cost is a one-time thing: the first time each function is called, its address has to be looked up and patched in by the dynamic linker. For applications with thousands of functions, there can be a noticeable lag at startup.
Linking statically will reduce this, but it defeats to some extent the benefits of code sharing that dynamic libraries can bring. Often, you can selectively link some libraries as dynamic and others as static. For example, you'd link the X11, C, and math libraries dynamically, since other processes will also be using these and the program can use the copy already in memory, but still link your own application-specific libraries statically.
Locality of reference is the property of a program to use addresses which are near, in both time and location, to other recent references to memory. The main difference between optimizing for virtual memory and optimizing for cache is scale: VM pages are kilobytes in size and take milliseconds to read in from disk, while cache blocks typically range from 16 bytes up to a few hundred bytes and are read in far more quickly. A program which forces many VM pages or cache lines to load in quick succession is said to be "thrashing." But don't fool yourself into thinking that malloc always allocates adjacent chunks of memory on each successive call.
You have to allocate one giant chunk and dole it out yourself to be certain. But this can lead to other problems. Search and sort algorithms differ widely in their patterns of accessing memory. Merge sort is often considered to have the best locality of reference. Search algorithms may take into consideration that the last few steps of the search are likely to take place in the same page of memory, and select a different algorithm at that point.
If you find yourself writing C code that recomputes the same long chain of array and member references over and over, you probably need to be re-educated: take the address of the sub-object once, keep it in a pointer, and work through that, as sketched below. As long as you're done using the pointer before you modify the string, array, or struct, you're okay.
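A hypothetical example of what this looks like (the struct layout is invented for the sketch):

    struct leaf { int e, f; };
    struct mid2 { struct leaf d; };
    struct mid1 { struct mid2 c; };
    struct deep { struct mid1 b; };

    void update(struct deep *a)
    {
        struct leaf *p = &a->b.c.d;   /* compute the address chain once... */
        p->e = 1;                     /* ...then work through the pointer  */
        p->f = 2;
    }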
ANSI C now requires that structs are passed by value like everything else, so if you have extraordinarily large structs, or are making millions of function calls on medium-sized ones, you might consider passing the struct's address instead, after modifying the called function so that it doesn't perturb the contents of the struct. If you already have an array of structs, but find that the critical part of your program accesses only a small number of fields in each struct, you can split those fields into a separate array so that the unused fields do not get read into the cache unnecessarily.
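A small sketch of the first idea (the struct and functions are illustrative); declaring the parameter const documents that the callee will not perturb the caller's copy:

    struct big { double m[64]; };

    /* Pass by value: all 512 bytes are copied on every call. */
    double sum_by_value(struct big b);

    /* Pass the address instead: only a pointer is copied. */
    double sum_by_ref(const struct big *b)
    {
        double s = 0.0;
        for (int i = 0; i < 64; i++)
            s += b->m[i];
        return s;
    }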
After rearranging a struct's fields to reduce padding, there may still be padding at the end, but eliminating that can destroy the performance gains the alignment was meant to provide. A typical use of char or short variables is to hold a flag or mode bit. You can combine several of these flags into one byte using bit-fields, at the cost of data portability.
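For instance (an invented set of flags), four mode bits can share a single byte:

    /* Several boolean flags packed into one unsigned bit-field member. */
    struct mode_flags {
        unsigned is_open   : 1;
        unsigned is_dirty  : 1;
        unsigned read_only : 1;
        unsigned buffered  : 1;
    };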
It can also pay to size and align frequently used data structures to the cache line size. The rationale is that if the data structure is an odd size relative to the cache line size, it may straddle two cache lines, thus doubling the time needed to read it in from main memory. The size of a data structure can be increased by adding a dummy field onto the end, usually a character array. The alignment is harder to control, but usually one of these techniques will work: use malloc instead of a static array (some mallocs automatically allocate storage suitably aligned for cache lines); allocate a block twice as large as you need, then point wherever inside it satisfies the alignment you need; use an alternate allocator; use the linker to assign specific addresses or alignment requirements to symbols; or wedge the data into a known position inside another block which is already aligned.
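A sketch of the over-allocation trick (the 64-byte line size is an assumption): allocate extra space and round the pointer up to the next cache-line boundary. Note that the original pointer, not the aligned one, is what must eventually be passed to free.

    #include <stdint.h>
    #include <stdlib.h>

    #define LINE 64   /* assumed cache line size */

    void *cache_aligned(size_t n, void **raw_out)
    {
        unsigned char *raw = malloc(n + LINE - 1);
        if (raw == NULL)
            return NULL;
        *raw_out = raw;                               /* keep for free() */
        uintptr_t p = (uintptr_t)raw;
        p = (p + LINE - 1) & ~(uintptr_t)(LINE - 1);  /* round up */
        return (void *)p;
    }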
MARCH FORWARD

Theoretically, it makes no difference whether you iterate over an array forwards or backwards, but some caches are of a "predictive" type that tries to read in successive cache lines even before you need them.
Because these caches must work quickly, they tend to be fairly dim and rarely have the extra logic for predicting backwards traversal of memory pages. Take as an example a hypothetical direct-mapped 1MB cache with 128-byte cache lines and a program which uses 16MB of memory, all happening on a machine with 32-bit addresses. The simplest way for the cache to map the memory into the cache is to mask off the first 12 and the last 7 bits of the address, then shift to the right 7 bits.
If the program happens to use an array of structs whose size is a multiple of the cache size (1MB in this example), and refers to just one element in each one while processing the whole array, every access will map to the same cache line and force a reload, which is a considerable delay. This seems like a contrived situation, but the same problem arises for any struct whose size is a multiple of 1MB in the above example. If the struct were half that size, the problem would be half as severe, and so on, but probably still noticeable.
Since we are touching only one field in each struct, only a tiny fraction of each cache line is ever doing useful work. The solution in this case is probably to isolate that one field into a separate array. Say the field is 4 bytes wide; then, with 128-byte lines, the cache would only need a reload every 32 iterations through the array. This is probably seldom enough that the cache lines can be read in advance of their need, and processing can occur at full speed.
Some mallocs have a way of "tuning" for performance (mallopt, for example). I would emphasize that this has narrow utility.
Programs with long lifetimes (background daemons, for example) and programs which use a significant portion of physical memory should carefully free everything as soon as it is no longer needed.
BE STINGY

Allocate less stuff in the first place. This is kind of trite, since most programmers don't go to the trouble of calling malloc unless they have a good reason, but there might be a few things that could go on the stack instead.
If you have some data structure which is mostly "holes," like a hash table or sparse array, it might be to your advantage to use a smaller table or a list or something. Buddy-system allocators are subject to massive amounts of internal fragmentation, and often their worst case is when you allocate something whose size is near, or slightly larger than, a power of two, which is not exactly uncommon.
They also tend to put things of like size together in preference to putting things allocated in sequence together; this affects locality of reference, sometimes in a good way, sometimes not.
Some allocators provide an option for handling small blocks specially, which can be used to get malloc to allocate them fairly quickly and painlessly. SunOS and perhaps others have madvise and vadvise system calls; these can be used to clue the OS in as to what sort of way you're going to be accessing memory (sequentially, randomly, and so on). Finally, it is entirely possible that rewriting the code won't improve the situation as much as simply throwing money at the problem and buying more memory.
Some common sense is required, since for mass-market software this isn't always practical. A profiler that measures cache behavior directly would help here, but few development environments include one.
Often, the best that one can do is infer from an execution profile that a cache problem might exist in a certain function, especially if large arrays or many data structures are involved.

Turning to file I/O: some gain can be had from using a larger-than-normal stdio buffer; try out setvbuf. If you aren't worried about portability, you can try using lower-level routines like read and write with large buffers and compare their performance to fread and fwrite.
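A minimal sketch of the setvbuf idea (the file name and 64 KB buffer size are arbitrary choices; setvbuf must be called before any other operation on the stream):

    #include <stdio.h>

    int main(void)
    {
        static char iobuf[64 * 1024];
        FILE *fp = fopen("data.bin", "rb");
        if (fp == NULL)
            return 1;
        setvbuf(fp, iobuf, _IOFBF, sizeof iobuf);  /* full buffering, big buffer */
        /* ... fread loop goes here ... */
        fclose(fp);
        return 0;
    }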
Using read or write in a single-character-at-a-time mode is especially slow on Unix machines because of the system call overhead. Consider using mmap if you have it; this can save effort in several ways. The data doesn't have to go through stdio, which saves a buffer copy. Depending on the sophistication of the paging hardware, the data need not even be copied into user space; the program can just access an existing copy.
Lastly, the file can be paged directly off the source disk and doesn't have to use up virtual memory. Though if you take the disk space you'd use to write the records out and just add it to the paging space instead, you'll save yourself a lot of hassle.
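An illustrative mmap sketch (POSIX; the function and file are hypothetical): the file is mapped read-only and walked directly, with no stdio buffering and no explicit read calls.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    long count_newlines(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return -1; }
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                        /* the mapping outlives the fd */
        if (p == MAP_FAILED)
            return -1;
        long n = 0;
        for (off_t i = 0; i < st.st_size; i++)
            if (p[i] == '\n')
                n++;
        munmap(p, st.st_size);
        return n;
    }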
If I/O can be overlapped with computation, significant parallelism may result, at the cost of program complexity.

Terminal output has a cost of its own: waiting for the screen to catch up stops your program.
This doesn't add to the CPU or disk time as reported for accounting purposes, but it sure seems slow to the user. Bitmap displays are subject to a similar problem, since redrawing large areas of the screen takes time. A general solution is to provide a way for the user to squelch irrelevant output. Screen-handling utilities like curses can speed things up by reducing wasted screen updates.

On the network, you may find that sending short, infrequent messages is extremely slow.
This is because small messages may just sit in a buffer for a while waiting for the rest of the buffer to get filled up.
There are socket options you can set to get around this for TCP, or you can switch to a non-streaming protocol such as UDP. But the usual limitation is the network itself and how busy it is. For the most part there's no free lunch, and sending more data simply takes more time. However, on a local network with a switched hub, bandwidth should have a clear theoretical upper limit; if you aren't getting anywhere near it, you might be able to blame the IP implementation being used with your OS.
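For example (a sketch assuming an already-connected TCP socket descriptor), the TCP_NODELAY option disables the buffering behavior described above so that small writes go out immediately:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Returns 0 on success, -1 on error. */
    static int disable_nagle(int sock)
    {
        int one = 1;
        return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
    }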
Try switching to a different "IP stack" and see if that helps a bit. Where the system offers two interfaces to the same facility (read and write versus recv and send on a socket, say), one is usually implemented in terms of the other and may suffer an extra buffer copy or other overhead; try them both and see which is faster. Some implementations have, by slightly changing the calling sequences, enabled several nifty optimizations, and their authors claim a substantial improvement.

Programmers tend to over-estimate the usefulness of the programs they write.
The approximate value of an optimization is, roughly, the time it saves on each run multiplied by the number of times the program will ever be run.

Machines are not created equal. What's fast on one machine may be slow on another.
Extra interface cards, different disks, the number of logged-in users, extra memory, background daemons, and just about anything else can affect the speed of various parts of a program, influencing which part of the program is a bottleneck and its speed as a whole. A particular problem is that many programmers are power users, with machines loaded with memory, a math co-processor, and tons of disk space, while the users get bare-bones CPUs and run over a network. The programmer gets a distorted view of the program's performance and may fail to optimize the parts of the program which are taking up the user's time, such as floating-point or memory-intensive routines.
As I've mentioned along the way, many of these optimizations may already be performed by your compiler! Don't get into the habit of writing code according to the above rules of optimization.
Only apply them after you have discovered exactly which function is the problem. Some of the rules if applied globally would make the program even slower. Nearly all make the intent of the code less obvious to a human reader, which can get you in trouble if you have to fix a bug or something later on. Be sure to comment optimizations--the next programmer may simply assume it's ugly code and rewrite it.
Spending a week optimizing a program can easily cost thousands of dollars in programmer time. Sometimes, it's easier to just buy a faster CPU or more memory or a faster disk and solve the problem that way.
Novices often assume that writing lots of statements on a single line and removing spaces and tabs will speed things up. While that may be a valid technique for some interpreted languages, it doesn't help at all in C.
Buddy-system allocators distribute blocks in memory depending in part on their sizes and not merely on the order in which they were allocated. A program which relies heavily on the availability of the CPU is said to be compute bound; such a program would not be sped up significantly by installing more memory or a faster disk drive. When one function calls another, parameters must be passed and registers saved; this means almost the entire state of the CPU itself is saved and then restored at each function call.
CPUs are designed with this in mind and do all of this very quickly, but additional speed can be gained by inlining, or integrating the called function into the caller. The called function then uses the caller's stack to store local variables and can, when semantics allow, make direct references to parameters rather than copies.
A program which spends most of its time waiting for a slow resource to become available, or for a request to simply finish, is I/O bound; during the wait the CPU is idle or may switch to a different process. A memory-bound program, similarly, will have the CPU biding its time waiting for the data cache to be updated or for VM to get paged in. O-notation is a rough measure of program time complexity.
It is typically expressed as O(f(N)), where f(N) is a mathematical function which defines an upper limit (times an arbitrary constant) on the expected running time of the program for a given input size N. For example, many simple sort programs have a running time which is proportional to the square of the number of elements being sorted, and so are O(N^2). The O-notation can be used to describe space complexity as well.
Optimization, depending on context, may refer to the human process of making improvements at the source level, or to a compiler's efforts to re-arrange code at the assembly level or some other low level. A pipeline is the mechanism by which a CPU overlaps the fetching, decoding, and execution of successive instructions; some CPUs have several distinct stages to their pipeline. A profiler measures where a running program spends its time; typically the measurement takes the form of interval sampling of the location of the program counter or the stack. After the program being measured has finished running, the profiler collects the statistics and generates a report.
The cost of deep recursion is hard to see directly: few profilers can measure it, but it can be inferred when a profiler shows most of the time being spent in a recursive routine whose statements are too simple to account for the total time. When an instruction needs a result that is not yet ready, the CPU must idle; the delay is called a wait state, and it is described by how many instruction cycles pass before the result is available.
A compiler or assembler for a pipelined CPU is expected to fill in the wait states with other productive instructions.

Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, Compilers: Principles, Techniques, and Tools (the "Dragon Book"). The Dragon Book has quite a bit of detail about the inner workings of optimizers and what types of optimizations are easiest for the compiler to perform.
By reading it you may gain some insight into what you can reasonably expect a modern compiler to do for you. There is an entry for the Dragon Book in the Jargon File as well.