Evaluating artifact

Evidence of bug finding

In the abstract of our paper, we claim:

… has found 65 new bugs in the last seven months for TVM, TensorRT, ONNXRuntime, and PyTorch.

You can find the evidence in our bug finding table.

Coverage experiments

This section walks through the main experiments of Section 5.2 in the paper, which evaluates the end-to-end coverage efficiency of NNSmith and the baselines.

Expected time cost

21 hours machine time;
<1 hour human time;

Experiment ID	NNSmith[1]	GraphFuzzer[2]	LEMON[3]
TVM	E1 (4hr)	E2 (5hr)	E3 (2hr)
ONNXRuntime	E1 (4hr)	E2 (5hr)	E3 (1hr)
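The machine-time figure can be cross-checked against the table. The following is a minimal sketch of that arithmetic, assuming the six runs execute one after another; the dictionary layout and the function name are illustrative and not part of the artifact.

```python
# Hours per (system under test, experiment), copied from the table above.
HOURS = {
    "TVM": {"E1": 4, "E2": 5, "E3": 2},
    "ONNXRuntime": {"E1": 4, "E2": 5, "E3": 1},
}

def hours_per_experiment(hours):
    """Sum the hours of each experiment across all systems under test."""
    totals = {}
    for per_sut in hours.values():
        for exp, h in per_sut.items():
            totals[exp] = totals.get(exp, 0) + h
    return totals

per_experiment = hours_per_experiment(HOURS)
total_hours = sum(per_experiment.values())
print(per_experiment)  # {'E1': 8, 'E2': 10, 'E3': 3}
print(total_hours)     # 21
```

The per-experiment sums match the experiment times listed for E1, E2, and E3, and the total matches the 21 hours of machine time.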

Note

We refer to ONNXRuntime as “ort” for short.

TL;DR

Evaluate the artifact in the fastest way:

Just run this in a tmux session;

Come back 1 day later;
Jump to the result visualization section to verify the results.

If you want to understand the scripts being executed, continue with the following subsections (E1 to E3).

E1: NNSmith[1] Coverage

E1: Evaluating NNSmith on {tvm, ort}

Fuzzer type: NNSmith (with binning);
System under test (SUT):
- TVM (LLVM CPU backend);
- ONNXRuntime (CPU backend);
Experiment time: 8 hours;
Outputs (will be used in visualization section):
- /artifact/nnsmith/nnsmith-tvm-binning/
- /artifact/nnsmith/nnsmith-ort-binning/

E2: GraphFuzzer[2] Coverage

E2: Evaluating GraphFuzzer on {tvm, ort}

Fuzzer type: GraphFuzzer;
System under test (SUT):
- TVM (LLVM CPU backend);
- ONNXRuntime (CPU backend);
Experiment time: 10 hours;
Outputs (will be used in visualization section):
- /artifact/nnsmith/graphfuzzer-tvm/
- /artifact/nnsmith/graphfuzzer-ort/

E3: LEMON[3] Coverage

E3: Evaluating LEMON on {tvm, ort}

Fuzzer type: LEMON;
System under test (SUT):
- TVM (LLVM CPU backend);
- ONNXRuntime (CPU backend);
Experiment time: 3 hours;
Outputs (will be used in visualization section):
- /artifact/nnsmith/lemon-tvm/
- /artifact/nnsmith/lemon-ort/
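
Before moving on to visualization, it may help to confirm that all six output directories listed for E1 to E3 were produced. The sketch below is a hypothetical helper, not a script shipped with the artifact. It assumes that a finished run leaves a non-empty directory; the actual contents of these directories are not described here, so it only checks for presence and non-emptiness.

```python
from pathlib import Path

# Output directory names listed for E1-E3 in this document.
EXPECTED_OUTPUTS = [
    "nnsmith-tvm-binning",
    "nnsmith-ort-binning",
    "graphfuzzer-tvm",
    "graphfuzzer-ort",
    "lemon-tvm",
    "lemon-ort",
]

def missing_outputs(root, names=EXPECTED_OUTPUTS):
    """Return the expected output directories that are absent or empty under root.

    Assumption: a completed experiment leaves at least one entry in its directory.
    """
    root = Path(root)
    missing = []
    for name in names:
        d = root / name
        if not d.is_dir() or not any(d.iterdir()):
            missing.append(name)
    return missing

if __name__ == "__main__":
    # /artifact/nnsmith is the output root used in the lists above.
    for name in missing_outputs("/artifact/nnsmith"):
        print(f"missing or empty: {name}")
```

If nothing is printed, every expected directory exists and contains at least one entry.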