Improve Program Efficiency with Threading

CSCE313 Embedded Systems

Questions:

In this programming assignment, you will increase the efficiency of your PA1. You may have noticed that PA1 took a long time to collect 1K data requests per patient. It would take even longer to collect all requests in a single file. Transferring raw files through file messages, although faster than using data messages, was also quite slow because of single channel usage at a time. In PA3, we learned how to create multiple channels, which the server processed independently. However, we could not take advantage of the concurrency because our client program in PA3 was single-threaded. In PA 4, we will attempt to improve on the bottlenecks by incorporating multiple threads to make these operations significantly faster.

Threading for Faster Programs:
The primary bottleneck in PA1 for collecting data points is the fact that we must collect data points sequentially, one after another. This problem is worsened further by the random processing time of each request (i.e.,emulated using a usleep() call) by the server. And since the requests are all sequential, one delayed request-response would affect all subsequent requests. Notice in the BIMDC/ directory that there are 15 files, one for each patient. One way to collect data faster from these files is to use 15 threads from the client side, one for each patient. Each thread would create its own request channel with the server and then independently collect the responses for a person. Since the server already processes each channel in a separate thread, you can get at most 15-fold speed over the sequential version. This technique is shown in Figure 1. Note that file requests can be sped up similarly.

However, there is still a remaining issue: even if there are adequate hardware resources (say, multiple CPU cores), the speedup is still limited to p-fold, where p is the number of patients you are collecting data for. In addition, since each request takes a random time, some patient threads may take much longer compared to the other ones. To avoid this roadblock,we have to make the number of threads somehow independent of the number of patients p.The standard solution to this problem is to separate the tasks of producing these requests and processing them (i.e., sending them to the server), and using a buffer in between these tasks. This way the number of patients p can be decoupled from the number of threads that would be in charge of communicating with the server, simplifying the scaling issue. That means now there are p patient threads pushing requests onto the buffer, which can be thought of an STL::queue.

Then, there are w worker threads, each communicating with the server independently of each other through a dedicated request channel. As a result, the theoretical speedup factor is now w, which is independent of p and can lead to significantly more speedup in scenarios where w >> p. Figure 2 demonstrates this technique.

Independent of number of patients p For this technique to work, you need a special and more complicated buffer than just an STL queue between the patient threads and the worker threads. First, the queue must be thread-safe; otherwise simultaneous accesses from the producers (i.e., patient threads) and consumers (i.e.,worker threads) would lead to a race condition. Second, the buffer must be made “bounded” so that the memory footprint of the buffer is under check and does not grow to infinity. In summary, the buffer must be protected against “race-conditions”, and “overflow” and “underflow” scenarios.

Overflow can happen when the patient threads are much faster than the worker threads, while underflow can happen under the opposite scenario.The BoundedBuffer class is the perfect solution to all these problems Client Requesting Data Points The p patient threads place requests into the request buffer. The w workers threads in this PA work as follows. Each worker thread pops a
request from the request buffer (i.e., the BoundedBuffer), sends it to the server, collects the response, and then puts the response in the response buffer - another BoundedBuffer object. In addition, there are h histogram threads who pop these responses from the response buffer and update p ecg histograms, one for each patient.

There is a histogram per patient that keeps track of that patient’s statistics that multiple histogram threads would potentially update the same histogram leading to another race condition, which must be avoided by using mutual exclusion. In order for the histogram threads to know which response is for which patient, the worker threads make sure to prepend each data response with
the respective patient no. You can do that by making a pair out of the patient number and the data response.

When requesting data messages, the client program should take 4 command line arguments: n for number of data items (try between [1, 15K]), p for number of patients (try between [1, 15]), w for number of worker threads (try between [50, 500]), h for number of histogram threads (try between [5, 100]), and b for bounded buffer size in bytes (try between [256, 2048]). For instance, the following command is requesting 15K ecg data for the first 10 patients using 500 worker threads and a request buffer of size 1KB (i.e., 1024 bytes). It will use a separate 1KB response buffer to collect the responses and then h = 5 histogram threads to make the 10 patient histograms for ecg values.

Notice that there is a total of p + w + h threads in just the client program: p patient threads, w worker threads and h histogram threads. All these threads must be running simultaneously for your program to work; otherwise the request and/or the response buffers will stall after reaching their bounds (buffer full) or after running dry (buffer empty).You cannot just use a huge request buffer where all requests would fit.

Make sure to test your program using small request/response buffer size (e.g., b = 100); your program should work perfectly fine, maybe a bit slower.Smaller b values along with high p, w, n, h increase concurrency and thus manifest race condition bugs that are not visible under easier circumstances. Make sure to stress-test your program under these settings.

You can also run the client program for file requests. First, the client queries the file size just like PA1. After that, instead of sending the requests directly to the server, the client starts a thread that makes and puts all the requests on to the request buffer. The worker threads pop those requests from the buffer, send those out to the server through their dedicated request channels, receive the responses, and write those responses into the appropriate locations of the file. Note that unlike requesting data messages,there is no response buffer in this case; only a request buffer. The structure is shown in Figure 4. Note that while the program is running, there are w + 1 total threads working simultaneously: 1 thread for making the requests and pushing them to the request buffer, the rest w worker threads who keep taking from the request buffer and processing them.

Get instant help from 5000+ experts for