    #include <iostream>   // std::cout, std::cerr
    extern "C" {
    #include "d4-7/d4.h"
    }
    #include <cstdlib>    // abort

    // Matrix is not defined in this sample. Supply your own type that
    // provides size() and operator()(i, j).

    // Feed one double-sized read of addr into the simulated cache.
    void doread(void* addr, d4cache* Cache) {
        d4memref R;
        R.address = (d4addr)addr;
        R.size = sizeof(double);
        R.accesstype = D4XREAD;
        d4ref(Cache, R);
    }

    // Feed one double-sized write of addr into the simulated cache.
    void dowrite(void* addr, d4cache* Cache) {
        d4memref R;
        R.address = (d4addr)addr;
        R.size = sizeof(double);
        R.accesstype = D4XWRITE;
        d4ref(Cache, R);
    }

    // Placeholder trace: touches only the diagonals, no multiply is performed.
    void matmult_ijk(Matrix& A, Matrix& B, Matrix& C, d4cache* Cache) {
        unsigned N = A.size();
        for (unsigned i = 0; i < N; ++i) {
            doread(&A(i,i), Cache);
            doread(&B(i,i), Cache);
            dowrite(&C(i,i), Cache);
        }
    }

    int main(int argc, char** argv) {
        d4cache* Mem;
        d4cache* L1;

        Mem = d4new(0);
        L1 = d4new(Mem);

        L1->name = "L1";
        L1->lg2blocksize = 8;
        L1->lg2subblocksize = 6;
        L1->lg2size = 20;
        L1->assoc = 2;
        L1->replacementf = d4rep_lru;
        L1->prefetchf = d4prefetch_none;
        L1->wallocf = d4walloc_always;
        L1->wbackf = d4wback_always;
        L1->name_replacement = L1->name_prefetch = L1->name_walloc = L1->name_wback = "L1";

        int r;
        if (0 != (r = d4setup())) {
            std::cerr << "Failed\n";
            abort();
        }

        Matrix A(10), B(10), C(10);
        matmult_ijk(A, B, C, L1);

        // Print total misses "of" total fetches, summed over every access type.
        std::cout << L1->miss[D4XREAD] + L1->miss[D4XWRITE] + L1->miss[D4XINSTRN] + L1->miss[D4XMISC]
                   + L1->miss[D4XREAD+D4PREFETCH] + L1->miss[D4XWRITE+D4PREFETCH]
                   + L1->miss[D4XINSTRN+D4PREFETCH] + L1->miss[D4XMISC+D4PREFETCH]
                  << " of "
                  << L1->fetch[D4XREAD] + L1->fetch[D4XWRITE] + L1->fetch[D4XINSTRN] + L1->fetch[D4XMISC]
                   + L1->fetch[D4XREAD+D4PREFETCH] + L1->fetch[D4XWRITE+D4PREFETCH]
                   + L1->fetch[D4XINSTRN+D4PREFETCH] + L1->fetch[D4XMISC+D4PREFETCH]
                  << "\n";
        return 0;
    }
The sample Dinero code above does not compute a matrix multiply; it only shows how to feed an address trace into Dinero. You will have to write the matrix multiply code yourself.
This code does not model the correct cache parameters. Read the man pages that come with the simulator.
Dinero's man page, distributed with the source, explains the cache parameters and the available API calls. Google will also find the man page.
Be sure to start with a clean cache for each measurement:
    allocate A, B, C
    initialize A, B, C
    flush cache
    do 1 matrix-matrix multiply
You cannot do more than one matrix-matrix multiply per allocation, or you will not get valid numbers. It is fine to put the protocol above in a loop, but you must reallocate A, B, and C each time. This will randomize the starting address of each matrix, which is related to the optional question of why the model and the real machine diverge.
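The protocol above could be sketched for the real-machine measurements roughly as follows. This is a minimal sketch, not the required implementation: `flush_cache` and `run_trial` are hypothetical names, the 64 MiB flush buffer is an assumed size that should exceed the last-level cache, and the matrices are plain row-major `std::vector<double>` blocks. For the simulated cache, the equivalent of the flush step is whatever invalidation or re-creation mechanism the Dinero man page describes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: push matrix data out of the hardware caches by
// streaming through a buffer assumed to be larger than the last-level cache.
// The 64 MiB default is an assumption, not a measured value.
double flush_cache(std::size_t bytes = 64u * 1024 * 1024) {
    std::vector<double> junk(bytes / sizeof(double), 1.0);
    double sum = 0.0;
    for (double x : junk) sum += x;  // touch every element
    return sum;                      // returned so the loop has a visible result
}

// One trial: allocate, initialize, flush, then exactly one multiply.
// Returns C(0,0) so the caller can sanity-check the result.
double run_trial(std::size_t N) {
    assert(N > 0);
    // Fresh allocations on every call, so starting addresses may differ per trial.
    std::vector<double> A(N * N, 1.0), B(N * N, 2.0), C(N * N, 0.0);
    flush_cache();
    // Start the counters here.
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < N; ++k)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
    // Stop the counters here.
    return C[0];
}
```

Calling `run_trial(N)` once per measurement keeps each multiply paired with its own allocation; the matrices are destroyed when the function returns, so the next call allocates again.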
To program with PAPI on lonestar, first load the PAPI module:
module load papi
To see which PAPI counters are available on a host, run:
papi_avail
Example code that uses PAPI follows:
    #include <cstdio>     // printf
    #include <cstdlib>    // exit
    #include <iostream>
    #include <papi.h>

    // DoTest() is the code being measured. It is not defined in this sample.
    void DoTest();

    void handle_error (int retval) {
        printf("PAPI error %d: %s\n", retval, PAPI_strerror(retval));
        exit(1);
    }

    void init_papi() {
        int retval = PAPI_library_init(PAPI_VER_CURRENT);
        // A positive value other than PAPI_VER_CURRENT means a version mismatch.
        if (retval != PAPI_VER_CURRENT && retval > 0) {
            printf("PAPI library version mismatch!\n");
            exit(1);
        }
        // A negative value is a PAPI error code.
        if (retval < 0)
            handle_error(retval);
        std::cout << "PAPI Version Number: MAJOR: " << PAPI_VERSION_MAJOR(retval)
                  << " MINOR: " << PAPI_VERSION_MINOR(retval)
                  << " REVISION: " << PAPI_VERSION_REVISION(retval) << "\n";
    }

    int begin_papi(int Event) {
        int EventSet = PAPI_NULL;
        int rv;
        /* Create the Event Set */
        if ((rv = PAPI_create_eventset(&EventSet)) != PAPI_OK)
            handle_error(rv);
        if ((rv = PAPI_add_event(EventSet, Event)) != PAPI_OK)
            handle_error(rv);
        /* Start counting events in the Event Set */
        if ((rv = PAPI_start(EventSet)) != PAPI_OK)
            handle_error(rv);
        return EventSet;
    }

    long_long end_papi(int EventSet) {
        long_long retval;
        int rv;
        /* get the values */
        if ((rv = PAPI_stop(EventSet, &retval)) != PAPI_OK)
            handle_error(rv);
        /* Remove all events in the eventset */
        if ((rv = PAPI_cleanup_eventset(EventSet)) != PAPI_OK)
            handle_error(rv);
        /* Free all memory and data structures, EventSet must be empty. */
        if ((rv = PAPI_destroy_eventset(&EventSet)) != PAPI_OK)
            handle_error(rv);
        return retval;
    }

    int main(int argc, char** argv) {
        init_papi();
        int EventSet = begin_papi(PAPI_TOT_INS);
        DoTest();
        long_long r = end_papi(EventSet);
        std::cout << "Total instructions: " << r << "\n";
        return 0;
    }
You only need to model the L1 data cache in Dinero. We don't care about the instruction cache, and it would be a lot more work to generate an address trace for that cache anyway.
Don't forget that *C += ... is a read, then a write of C.
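To make the read-then-write point concrete, here is a small sketch that counts the references an ijk multiply would emit if every `+=` in the inner loop is traced literally. `TraceCounter` and `traced_matmult_ijk` are hypothetical names, and the counter is only a stand-in for the simulator: in the real trace, each `read` and `write` call would be replaced by a reference fed to Dinero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the simulator: counts references instead of calling Dinero.
struct TraceCounter {
    std::size_t reads = 0;
    std::size_t writes = 0;
    void read(const void*)  { ++reads; }
    void write(const void*) { ++writes; }
};

// ijk multiply over row-major N x N arrays, reporting every data reference.
TraceCounter traced_matmult_ijk(std::size_t N) {
    assert(N > 0);
    std::vector<double> A(N * N, 1.0), B(N * N, 1.0), C(N * N, 0.0);
    TraceCounter t;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < N; ++k) {
                t.read(&A[i * N + k]);
                t.read(&B[k * N + j]);
                t.read(&C[i * N + j]);   // the read half of +=
                C[i * N + j] += A[i * N + k] * B[k * N + j];
                t.write(&C[i * N + j]);  // the write half of +=
            }
    return t;
}
```

Under this literal model the trace contains 3N^3 reads and N^3 writes. Forgetting the read of C would drop the read count to 2N^3.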
For both parts, run the experiments with N = 1 .. 512.
The L1 cache on lonestar is not 12MB; /proc/cpuinfo only shows the L3 cache. That file has other information that you can use to find the L1 cache parameters, and there are other ways to do it.
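One of those other ways, on Linux kernels that expose cache topology through sysfs, is to read the per-CPU cache directories. The sketch below assumes the usual `/sys/devices/system/cpu/cpu0/cache/indexN/` layout; whether it exists on a given host depends on the kernel, so treat it as something to try rather than a guaranteed source. The function names are hypothetical, and `lg2` is included only because Dinero's cache fields are log2 values.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Convert a sysfs size string such as "32K" into bytes.
// Returns 0 for an empty or unparseable string.
unsigned long parse_cache_size(const std::string& s) {
    if (s.empty()) return 0;
    std::size_t pos = 0;
    unsigned long value = 0;
    try { value = std::stoul(s, &pos); } catch (...) { return 0; }
    if (pos < s.size()) {
        if (s[pos] == 'K') value *= 1024;
        else if (s[pos] == 'M') value *= 1024 * 1024;
    }
    return value;
}

// Read the first line of a file. Returns "" if the file cannot be opened.
std::string read_first_line(const std::string& path) {
    std::ifstream in(path);
    std::string line;
    if (in) std::getline(in, line);
    return line;
}

// Integer log2, intended for power-of-two sizes.
unsigned lg2(unsigned long x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; ++n; }
    return n;
}

// Scan cpu0's cache entries for the level-1 data cache and return its size
// in bytes, or 0 if sysfs does not expose it on this machine.
unsigned long l1d_size_bytes() {
    const std::string base = "/sys/devices/system/cpu/cpu0/cache/index";
    for (int i = 0; i < 8; ++i) {
        const std::string dir = base + std::to_string(i) + "/";
        if (read_first_line(dir + "level") == "1" &&
            read_first_line(dir + "type") == "Data") {
            return parse_cache_size(read_first_line(dir + "size"));
        }
    }
    return 0;
}
```

The same directories also hold `coherency_line_size` and `ways_of_associativity`, which can be read with `read_first_line` in the same way. Cross-check whatever this reports against the processor's documentation before using it as a simulator parameter.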
There is no requirement that you wrap the matrix in a class. The amount of syntactic sugar you want to apply is up to you, as long as it doesn't negatively impact performance (say, by requiring more pointer dereferences than necessary). The memory representation is up to you as long as it is sensible for a dense matrix and is row-major. One can think of two representations (with a couple of variations) that are sensible under these constraints, and they should both give you about the same answer.
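As an illustration, here are two row-major layouts that seem to fit these constraints: a single contiguous block indexed as `i*n + j`, and a table of row pointers into one contiguous block. Whether these are exactly the two representations meant above is a guess, and the class names are hypothetical. The first class also provides the `size()` and `operator()(i, j)` interface that the Dinero sample expects from `Matrix`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Layout 1: one contiguous block, element (i, j) stored at data_[i * n_ + j].
class Matrix {
public:
    explicit Matrix(std::size_t n) : n_(n), data_(n * n, 0.0) {}
    std::size_t size() const { return n_; }
    double& operator()(std::size_t i, std::size_t j) { return data_[i * n_ + j]; }
    const double& operator()(std::size_t i, std::size_t j) const { return data_[i * n_ + j]; }
private:
    std::size_t n_;
    std::vector<double> data_;
};

// Layout 2: a table of row pointers into one contiguous block.
// Each access costs one extra pointer load compared with layout 1.
class RowPtrMatrix {
public:
    explicit RowPtrMatrix(std::size_t n) : n_(n), data_(n * n, 0.0), rows_(n) {
        for (std::size_t i = 0; i < n; ++i) rows_[i] = data_.data() + i * n;
    }
    // Copying would leave rows_ pointing into the source object, so forbid it.
    RowPtrMatrix(const RowPtrMatrix&) = delete;
    RowPtrMatrix& operator=(const RowPtrMatrix&) = delete;
    std::size_t size() const { return n_; }
    double& operator()(std::size_t i, std::size_t j) { return rows_[i][j]; }
private:
    std::size_t n_;
    std::vector<double> data_;
    std::vector<double*> rows_;
};
```

In both layouts consecutive elements of a row are adjacent in memory and consecutive rows are `n` doubles apart, so the data addresses fed to the simulator follow the same pattern.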