Task:
In this project, you will explore several aspects of performance and parallel execution by running several programs while measuring execution time. Please be aware that these programs will necessarily tax the system, in some cases significantly. When running on zeus or eros, please be considerate of others and do not run the programs repeatedly while the system is in use by multiple users. Also, please do not allow batch files to run for more than a few minutes. Any console output inside the timed calculation loops will slow execution significantly and should be avoided except when debugging very small problem sizes. If you are running this on a local machine such as your laptop, you may experience malloc failures, so save any work in other applications first.
1) Processor/Cache Evaluation
A) On the system you will be using for this project, build a table of the number of cores, threads per core, L1 instruction cache size, L1 data cache size, L2 cache size, and CPU frequencies (e.g., base, turbo, max).
On GNU/Linux systems you can run the lscpu command or view the /proc/cpuinfo file. On Windows or macOS, please research and list the application or method you use.
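If you want to cross-check some of these numbers programmatically on GNU/Linux, the short sketch below uses glibc's sysconf cache queries. These constants are glibc extensions and may report 0 or -1 on systems that do not expose the information, so treat lscpu as the authoritative source.

#include <cstdio>
#include <unistd.h>

int main() {
    // glibc-specific sysconf queries; a value of 0 or -1 means "not reported".
    printf("online processors    : %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
    printf("L1 data cache        : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1 instruction cache : %ld bytes\n", sysconf(_SC_LEVEL1_ICACHE_SIZE));
    printf("L2 cache             : %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("cache line size      : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}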
Download the code MatrixOps.cpp from TRACS.
Read through the code, including the timing measurement.
B) Complete the missing code segments to calculate the size of the array and to do a column-first traversal; a sketch of the two traversal patterns appears at the end of this part. Compile with
$ g++ -O3 MatrixOps.cpp -o matrixOps -std=c++11
Once completed, test the timing with various n x n matrices. For n < 500 the timing will likely be very fast; for n > 20000 the memory allocation may fail. For longer-running processes you can check your array size calculation against the process's actual memory use (for example, with top and the process PID on Linux, or Task Manager on Windows).
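The actual code is in the MatrixOps.cpp file from TRACS and may be organized differently; the minimal sketch below, with hypothetical variable names, only illustrates what a row-first versus column-first traversal of an n x n array looks like with chrono-based timing around each pass.

#include <chrono>
#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[]) {
    long n = (argc > 1) ? atol(argv[1]) : 1000;      // matrix dimension
    long bytes = n * n * sizeof(double);             // size of the n x n array
    double* a = static_cast<double*>(malloc(bytes));
    if (!a) { fprintf(stderr, "malloc of %ld bytes failed\n", bytes); return 1; }

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < n; ++i)                     // row first: inner loop walks a row,
        for (long j = 0; j < n; ++j)                 // touching consecutive addresses
            a[i * n + j] = i + j;
    auto t1 = std::chrono::steady_clock::now();
    for (long j = 0; j < n; ++j)                     // column first: inner loop walks a column,
        for (long i = 0; i < n; ++i)                 // jumping n doubles between accesses
            a[i * n + j] += 1.0;
    auto t2 = std::chrono::steady_clock::now();

    std::chrono::duration<double> rowTime = t1 - t0, colTime = t2 - t1;
    printf("n=%ld  row first %.4f s  column first %.4f s\n",
           n, rowTime.count(), colTime.count());
    free(a);
    return 0;
}

Built with the same g++ line shown above, the column-first pass should become noticeably slower than the row-first pass once the array no longer fits in cache.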
C) Record the timings for row first and column first at the smallest meaningful size (one that does not produce erratic times), the largest size that will still allocate the array, and at least two in-between points. What difference do you see between row first and column first? Present the data in a table in your report.
D) Try to capture the effect of a warm versus cold cache, meaning that the first pass is slower than subsequent passes because of the cache misses incurred on the first pass through memory. Some possibilities: loop several times in succession with a separate timing record for each pass, use malloc and free to move the array to different locations, or experiment with different array sizes (of course you probably don't want a single pass to bust the cache). Explain what you tried and why, plus any results that show the effect (screenshots are great); a sketch of the first approach follows.
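As one illustration of the first suggestion (looping several times in succession with a separate timing record per pass), the sketch below allocates an array and times the same traversal on several consecutive passes. The array size used here is a hypothetical value; you would adjust it using the cache sizes from part A so that later passes can hit in cache.

#include <chrono>
#include <cstdio>
#include <cstdlib>

int main() {
    // Hypothetical size: 256 * 256 doubles = 512 KiB; adjust so the array is
    // comfortably smaller than the cache level whose effect you want to see.
    const long n = 256;
    const int passes = 5;
    double* a = static_cast<double*>(calloc(n * n, sizeof(double)));
    if (!a) return 1;

    double sum = 0.0;
    for (int p = 0; p < passes; ++p) {
        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < n * n; ++i)             // identical traversal every pass
            sum += a[i];
        auto t1 = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::micro> us = t1 - t0;
        printf("pass %d: %.1f us\n", p, us.count()); // printed outside the timed region;
    }                                                // pass 0 is the cold-cache pass
    printf("checksum %g\n", sum);                    // keeps -O3 from removing the loop
    free(a);
    return 0;
}

Freeing and re-allocating between timed batches (the second suggestion) is one way to push the data back out of a warm cache, although the allocator may hand back the same addresses, so results can vary from run to run.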