Redesigning the finpar infrastructure #1

Open
athas opened this issue Feb 11, 2015 · 12 comments
athas commented Feb 11, 2015

(To eliminate ambiguity, here is the nomenclature: we have a number
of benchmarks (currently CalibGA, CalibVolDiff, and GenericPricer),
each of which has several data sets (typically Small, Medium, and
Large) and several implementations (right now mostly different
versions of C), each of which may have several configurations.
Running a benchmark consists of selecting a data set and an
implementation, and possibly specifying a specific configuration of
the implementation.)

Recently, Martin, Frederik, and I have been implementing the finpar
benchmarks in more diverse programming languages - (streaming) NESL,
APL, and Futhark, at least. Unfortunately, the current finpar
infrastructure is not very easy to work with, and so this work has
not been integrated. I have identified the following problems:

  • Implementations are not cleanly separated from data sets and
    ancillary code.

    Solution: for each benchmark, have a directory that contains
    only implementations.

  • Building an implementation modifies the implementation
    directory, and more importantly, configuring an implementation
    often involves manually modifying files in the directory (see
    CalibGA/includeC/KerConsts.h for an example). This is really
    bad and makes structured and reproducible benchmarking almost
    impossible.

    Solution: when "compiling" an implementation, put everything
    in a new, separate directory, I will call the instantiation
    directory
    . All configuration must be done via passing options to
    the compilation step, and will be reflected in the files put in
    the instantiation directory.

  • Adding new implementations is a mess, because you have to modify
    the global build system.

    Solution: define a setup/run-protocol that each benchmark
    implementation must follow, and which can be used by a generic
    controller script.

  • Validation is done by the benchmark implementations. There is no
    reason to do this.

    Solution: have the implementations produce their results in
    some well-defined format, and have the controller script
    validate them.

  • Everything is done with Makefiles. Nobody likes modifying
    Makefiles, and we don't need incremental rebuilds anyway.

    Solution: write as much as possible in Python or simple shell
    script.

I propose the following rough protocol (a sketch of a conforming
instantiate script follows the list):

  • Each benchmark implementation must include one executable file,
    called instantiate. This can be written in whatever language
    one prefers.

  • When the instantiate program for an implementation is invoked,
    the following environment variables must be set:

    • FINPAR_IMPLEMENTATION, which must point at the
      implementation directory. This is to get around the fact that
      it's not always easy to find the location of the running
      program.
    • FINPAR_DATASET, which must point at a directory containing
      .input and .output files.
  • The instantiate program will instantiate the implementation in
    the current directory, which will become the instantiation
    directory.

  • The instantiate program can be passed command-line options to
    further configure the implementation. These are defined on a
    per-implementation basis, and not standardised.

  • After instantiation, the instantiation directory must contain a
    program run, which, when executed, will run the benchmark
    implementation. The result will be two files in the
    instantiation directory:

    • runtime.txt, which contains the runtime in milliseconds as
      an integer.
    • result.data, which contains the result in our well-defined
      data format.

    I have judged that the runtime should be measured by the
    implementation itself, as it is not possible to black-box
    measure this without possibly measuring the wrong things (like
    kernel compilation, exotic hardware setup, parsing of input
    data, or IO).
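
To make the protocol concrete, here is a rough sketch of what an
instantiate script for a hypothetical C implementation could look
like. The file names, the --block-size option, and the compiler
flags are illustrative only and are not part of the protocol:

    #!/usr/bin/env python
    # Hypothetical instantiate script; file names and options are
    # illustrative, not part of the protocol.
    import os
    import shutil
    import subprocess
    import sys

    impl_dir = os.environ['FINPAR_IMPLEMENTATION']  # set by the controller
    dataset_dir = os.environ['FINPAR_DATASET']      # holds .input/.output files
    inst_dir = os.getcwd()                          # instantiation directory

    # Copy sources and the input data into the instantiation directory.
    for f in os.listdir(impl_dir):
        if f.endswith(('.c', '.h')):
            shutil.copy(os.path.join(impl_dir, f), inst_dir)
    for f in os.listdir(dataset_dir):
        if f.endswith('.input'):
            shutil.copy(os.path.join(dataset_dir, f), inst_dir)

    # Per-implementation configuration arrives as command-line options,
    # here a made-up --block-size=N flag mapped to a preprocessor define.
    cflags = ['-O3']
    for arg in sys.argv[1:]:
        if arg.startswith('--block-size='):
            cflags.append('-DBLOCK_SIZE=' + arg.split('=', 1)[1])

    # Compile; the resulting 'run' binary must leave runtime.txt and
    # result.data in the instantiation directory when executed.
    sources = [f for f in os.listdir(inst_dir) if f.endswith('.c')]
    subprocess.check_call(['gcc'] + cflags + ['-o', 'run'] + sources)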

The following questions have yet to be answered:

  • What data format should we use? Currently, finpar uses the
    Futhark value format, which is pretty simple. We can possibly
    make life even simpler by using JSON, but it is incredibly
    annoying that JSON does not support comments.
  • Should we use environment variables at all? It was mostly to
    avoid having the instantiate-script do command-line parsing unless
    it wants to.

Still, I think this is a good protocol. It will allow us to build an
easy-to-use controller program on top of it that can automatically
generate a bunch of different instantiations with different
configurations and data sets, and maybe draw graphs of the results,
etc. I estimate that the above could be implemented fairly quickly
and sanity-checked with the extant benchmark implementations.
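
For illustration, a minimal controller loop on top of this protocol
could look roughly like the sketch below. The temporary-directory
layout and the validate() comparison are my own assumptions, not
part of the proposal:

    # Minimal sketch of a generic controller using the protocol above.
    import os
    import subprocess
    import tempfile

    def run_benchmark(impl_dir, dataset_dir, config_opts=()):
        inst_dir = tempfile.mkdtemp(prefix='finpar-inst-')
        env = dict(os.environ,
                   FINPAR_IMPLEMENTATION=os.path.abspath(impl_dir),
                   FINPAR_DATASET=os.path.abspath(dataset_dir))
        # Instantiate the implementation in the instantiation directory.
        instantiate = os.path.join(env['FINPAR_IMPLEMENTATION'], 'instantiate')
        subprocess.check_call([instantiate] + list(config_opts),
                              cwd=inst_dir, env=env)
        # Run it; this must leave runtime.txt and result.data behind.
        subprocess.check_call([os.path.join(inst_dir, 'run')], cwd=inst_dir)
        with open(os.path.join(inst_dir, 'runtime.txt')) as f:
            runtime_ms = int(f.read())
        with open(os.path.join(inst_dir, 'result.data')) as f:
            result = f.read()
        return runtime_ms, result

    def validate(result, expected_output_file):
        # Naive textual comparison; a real controller would parse the
        # data format and allow a tolerance for floating-point values.
        with open(expected_output_file) as f:
            return result.strip() == f.read().strip()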

athas commented Feb 11, 2015

There is another unresolved issue: sometimes, it is convenient to share code across implementations. For example, the various C implementations currently share a bunch of boilerplate code related to parsing. I propose we just handle this in an ad-hoc fashion, maybe with a lib directory somewhere that is pointed to by an environment variable.
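
One low-tech way to do that, purely as an illustration (the
FINPAR_LIB variable and the ParseInput.h file name are hypothetical):

    # Hypothetical: an instantiate script pulling shared parsing
    # boilerplate from a lib directory named by FINPAR_LIB.
    import os
    import shutil

    lib_dir = os.environ.get('FINPAR_LIB')
    if lib_dir:
        shutil.copy(os.path.join(lib_dir, 'ParseInput.h'), os.getcwd())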

athas commented Feb 11, 2015

FINPAR_DATASET is overspecified, as only the input file is needed for instantiation. This should be a FINPAR_INPUT variable instead.

athas commented Feb 11, 2015

When executing run, the current directory must be the instantiation directory.

athas self-assigned this Feb 11, 2015
dybber commented Feb 13, 2015

Looks good. The individual benchmark implementers should be free to use Makefiles in their instantiation files if they wish, but the main benchmark-runner should be developed in pure Python. Keep the main benchmark repo low-key: only the simple benchmark runner that outputs flat text files, and maybe have separate projects for visualising benchmark data, generating websites, etc.

Why use environment variables over command line arguments?

runtime.txt: remember that you would want to run each benchmark something like 100 times to calculate the mean and standard deviation. I think a better approach would be to output the running time on standard output, and let the benchmark-runner script collect these running times for each run and create the file. Secondly, we found that we liked to compare our timings over time. So it should probably be calib_futhark_.result, calib_snesl_.txt, to keep old versions and make sure we don't mix up what language we were benchmarking.
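
To sketch what that runner could look like (the 100 repetitions, the
instantiation-directory name, and the output file name are made up
for the example), assuming each invocation of run prints a single
running time in milliseconds on standard output:

    # Sketch of a runner collecting per-run timings; assumes './run'
    # prints one running time (in ms) per invocation on stdout.
    import subprocess

    def collect_timings(inst_dir, runs=100):
        timings = []
        for _ in range(runs):
            out = subprocess.check_output(['./run'], cwd=inst_dir)
            timings.append(int(out.decode().strip()))
        return timings

    timings = collect_timings('calib-futhark-instantiation')
    mean = sum(timings) / float(len(timings))
    stddev = (sum((t - mean) ** 2 for t in timings) / len(timings)) ** 0.5
    with open('calib_futhark.result', 'w') as f:
        f.write('mean=%.1f stddev=%.1f runs=%d\n' % (mean, stddev, len(timings)))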

I think the file-format debate ended like this: the default file format doesn't matter, as long as there is a conversion script to JSON. I will probably make one that converts to CSV, to make it easier to use for R and APL.

athas commented Feb 13, 2015

Martin Dybdal notifications@github.com writes:

Why use environment variables over command line arguments?

This was based on the idea that people might not want to parse command
lines, and the environment is already a key-value store. Another
solution would be a key-value JSON file.

runtime.txt: remember that you would want to run each benchmark like
100 times to calculate mean and std.dev. I think a better approach
will be to output the running time on standard out, and let the
benchmark-runner script collect these running times for each run and
create the file. Secondly, we found that we liked to compare our
timings over time. So it should probably be
calib_futhark_.result, calib_snesl_.txt, to keep
old versions and make sure we don't mix up what language we were
benchmarking.

These are really good hints, thanks!

How do you propose we deal with repeated execution? Should each
benchmark just be expected to repeat the right number of times? I
suppose that if it outputs a list of runtimes, we can just complain if
the number of entries is not as expected.

(I do not expect people to intentionally cheat in their implementations;
this is mostly to guard against bugs.)

I think the file format-debate ended like: the default file format
doesn't matter, as long as there is a conversion script to JSON. I
will probably make one that converts to CSV, to make it easier to use
for R and APL.

As I found out, the current file format is already a perfect subset of
JSON.

\ Troels
/\ Henriksen

athas commented Feb 20, 2015

I have pushed a branch new-design which incorporates some of the ideas above. Only the CalibVolDiff benchmark has been fully ported. I have committed to using JSON and Python for everything. Use the finpar program if you want to try it out.

vinter commented Feb 23, 2015

Science storage can easily handle this - if the data don't change often (and I would assume they don't :)) a link in GitHub should do the trick?

/B

On 20 Feb 2015, at 18:00, Troels Henriksen notifications@github.com wrote:

I have pushed a branch new-design which incorporates some of the ideas above. Only the CalibVolDiff benchmarks have been fully ported. I have committed to using JSON and Python for everything. Use the finpar program if you want to try it out.

dybber commented Feb 27, 2015

How do you propose we deal with repeated execution? Should each
benchmark just be expected to repeat the right number of times? I
suppose that if it outputs a list of runtimes, we can just complain if
the number of entries is not as expected.

I think it should be the job of the benchmarking script to repeat the process and collect the reported timings in a file.

athas commented Feb 27, 2015

What is 'the benchmarking script'?

athas commented Jun 24, 2015

I have been thinking that instead of run/instantiate scripts, maybe a Makefile with well-defined targets is more familiar.

dybber commented Jun 24, 2015

+1

We decided to use such a setup for our 'aplbench' a while ago. Look at this branch: https://github.com/dybber/aplbench/tree/make-setup

athas commented Jun 24, 2015

Martin Dybdal notifications@github.com writes:

+1

We decided to use such a setup for our 'aplbench' a while ago. Look at
this branch: https://github.com/dybber/aplbench/tree/make-setup

I think using a Makefile for the top-level script is bad software
engineering. It makes it way too hard to do any kind of real analysis
and programmatic configuration, and it makes the entire system less
flexible and extensible - consider the added difficulty of adding a new
implementation or benchmark.

\ Troels
/\ Henriksen
