Evaluating the artifact#

Evidence of bug finding#

In the abstract of our paper, we claimed:

… has found 65 new bugs in the last seven months for TVM, TensorRT, ONNXRuntime, and PyTorch.

You can find the evidence in our bug finding table.

Coverage experiments#

We will walk through the main experiments corresponding to Section 5.2 of the paper, which evaluates the end-to-end coverage efficiency of NNSmith and the baselines.

Expected time cost

  • 21 hours of machine time;

  • <1 hour of human time;

Experiment ID | TVM | ONNXRuntime
------------- | --- | -----------
E1            | 4hr | 4hr
E2            | 5hr | 5hr
E3            | 2hr | 1hr

We refer to ONNXRuntime as “ort” for short. The per-SUT times sum to the 21 hours of machine time above: (4 + 4) + (5 + 5) + (2 + 1) = 21.


The fastest way to evaluate the artifact:

  • Just run the following in a tmux session (a short tmux refresher follows the command block below);

bash /artifact/eval_nnsmith.sh;      \
bash /artifact/eval_graphfuzzer.sh;  \
bash /artifact/eval_lemon.sh
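
If you are new to tmux, a minimal workflow might look like the sketch below (the session name nnsmith-eval is arbitrary; any name works):

# Start a named tmux session and run the three scripts inside it.
tmux new -s nnsmith-eval
bash /artifact/eval_nnsmith.sh;      \
bash /artifact/eval_graphfuzzer.sh;  \
bash /artifact/eval_lemon.sh
# Detach with `Ctrl+b` then `d`; re-attach later with:
tmux attach -t nnsmith-eval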

Or, if you want to understand the scripts being executed, continue reading the following sub-sections (E1–E3).

E1: NNSmith [1] Coverage#

E1: Evaluating NNSmith on {tvm, ort}

  • Fuzzer type: NNSmith (with binning);

  • System under test (SUT):

    • TVM (LLVM CPU backend);

    • ONNXRuntime (CPU backend);

  • Experiment time: 8 hours;

  • Outputs (used in the visualization section):

    • /artifact/nnsmith/nnsmith-tvm-binning/

    • /artifact/nnsmith/nnsmith-ort-binning/

cd /artifact # In the container
bash eval_nnsmith.sh
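
To sanity-check E1, you can confirm that the two output directories are non-empty before moving on (the exact file layout inside depends on the artifact's coverage format, so treat this as a quick sketch):

# Both folders should exist and contain coverage data after E1 finishes.
ls /artifact/nnsmith/nnsmith-tvm-binning/
ls /artifact/nnsmith/nnsmith-ort-binning/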

E2: GraphFuzzer [2] Coverage#

“GraphFuzzer” (huh?)

The paper by Luo et al. [2] does not name its fuzzer, so we call it “GraphFuzzer” for convenience.

E2: Evaluating GraphFuzzer on {tvm, ort}

  • Fuzzer type: GraphFuzzer;

  • System under test (SUT):

    • TVM (LLVM CPU backend);

    • ONNXRuntime (CPU backend);

  • Experiment time: 10 hours;

  • Outputs (used in the visualization section):

    • /artifact/nnsmith/graphfuzzer-tvm/

    • /artifact/nnsmith/graphfuzzer-ort/

cd /artifact # In the container
bash eval_graphfuzzer.sh
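
Since E2 runs for about 10 hours, you may want to gauge progress from another shell by watching the output folders grow. This is only a rough heuristic, not an official progress meter; if watch is unavailable in your environment, plain du -sh works too:

# Re-print the folder sizes every 60 seconds.
watch -n 60 du -sh /artifact/nnsmith/graphfuzzer-tvm /artifact/nnsmith/graphfuzzer-ort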

E3: LEMON [3] Coverage#

E3: Evaluating LEMON on {tvm, ort}

Pre-generated LEMON models

Evaluating LEMON in NNSmith’s setting is very complicated (why?). For reviewers’ convenience, the LEMON models are pre-generated and pre-converted (see -v /data/artifact:/... in the docker run command). Alternatively, you can follow Generate LEMON models from scratch to re-generate the models.
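
For reference, the -v flag bind-mounts a host directory into the container. A hypothetical invocation could look like the following; the container-side path and the image name are illustrative placeholders, so use the ones from the artifact's setup instructions:

# Illustrative only: mount the host folder with pre-generated LEMON models.
# Both the container path /artifact/data and <nnsmith-artifact-image> are
# hypothetical placeholders.
docker run -it --name ${USER}-nnsmith -v /data/artifact:/artifact/data <nnsmith-artifact-image>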

  • Fuzzer type: LEMON;

  • System under test (SUT):

    • TVM (LLVM CPU backend);

    • ONNXRuntime (CPU backend);

  • Experiment time: 3 hours;

  • Outputs (used in the visualization section):

    • /artifact/nnsmith/lemon-tvm/

    • /artifact/nnsmith/lemon-ort/

cd /artifact # In the container
bash eval_lemon.sh
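
Before moving on to visualization, you may want to verify that all six result folders from E1–E3 are in place. A quick sketch, using the output paths listed above:

# Any "MISSING" line means the corresponding experiment must be (re-)run.
for d in nnsmith-tvm-binning nnsmith-ort-binning \
         graphfuzzer-tvm graphfuzzer-ort lemon-tvm lemon-ort; do
  test -d "/artifact/nnsmith/$d" && echo "OK       $d" || echo "MISSING  $d"
done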

Visualizing and understanding results#

Visualizing coverage

Run the following script to generate images in /artifact/nnsmith/tvm-cov and /artifact/nnsmith/ort-cov.

$ bash /artifact/viz_main.sh
# Check the outputs.
$ ls /artifact/nnsmith/tvm-cov
# tvm_br_cov_venn.png      tvm_branch_cov-time.png  tvm_opt_branch_cov-iter.png
# tvm_branch_cov-iter.png  tvm_opt_br_cov_venn.png  tvm_opt_branch_cov-time.png
$ ls /artifact/nnsmith/ort-cov/
# ort_br_cov_venn.png      ort_branch_cov-time.png  ort_opt_branch_cov-iter.png
# ort_branch_cov-iter.png  ort_opt_br_cov_venn.png  ort_opt_branch_cov-time.png

Check the results

The image results are still inside the Docker container; we need to pull them out to see how they look.

Copying image outputs from Docker to the local machine

First, temporarily leave the current container. There are three ways to do so:

  1. tmux: press Ctrl+b, then d;

  2. Local (recommended): open a new terminal on the host machine, which is outside the container by default;

  3. Local: type exit to leave the container (you can later resume it with docker start -i ${USER}-nnsmith);

# Now in the local environment
docker cp ${USER}-nnsmith:/artifact/nnsmith/tvm-cov . # copy TVM results to local folder `tvm-cov`
docker cp ${USER}-nnsmith:/artifact/nnsmith/ort-cov . # copy ORT results to local folder `ort-cov`
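
Once copied, the PNGs can be opened with any image viewer; for example (the exact command depends on your OS):

xdg-open tvm-cov/tvm_branch_cov-time.png   # Linux
# open tvm-cov/tvm_branch_cov-time.png     # macOS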

Now let’s check the results corresponding to figures in the paper:

Fuzzing randomness

The sample images below were freshly generated when testing the artifact on the original test-bed (Oct. 14, 2022). They may differ slightly from those in the paper due to fuzzing randomness.

Randomness in fuzzing can come from performance divergence across systems and from random seeds. As a result, reproduced data might not be strictly identical to that presented in the paper, but the overall trend should be consistent over a sufficiently long run (say, 4 hours).

Potential legend style shifting (if you skipped LEMON)

As noted in E3: LEMON [3] Coverage, the curves/pies for the LEMON baseline may be unavailable if you did not start from the original test-bed. Showing only two baselines shifts the legend styles of the figures below; in that case, please distinguish the systems by their text labels, as the colors may not match those in the original paper.

Figure 4: Total branch coverage over time (all files)


Figure 4.(a) ONNXRuntime
In ./ort-cov/ort_branch_cov-time.png


Figure 4.(b) TVM
In ./tvm-cov/tvm_branch_cov-time.png

Figure 5: Total branch coverage over test cases (all files)


Figure 5.(a) ONNXRuntime
In ./ort-cov/ort_branch_cov-iter.png


Figure 5.(b) TVM
In ./tvm-cov/tvm_branch_cov-iter.png

Figure 6: Total branch coverage over time (pass files)


Figure 6.(a) ONNXRuntime
In ./ort-cov/ort_opt_branch_cov-time.png


Figure 6.(b) TVM
In ./tvm-cov/tvm_opt_branch_cov-time.png

Figure 7: Venn diagram of overall coverage (total coverage shown in parenthesis)


Figure 7.(a) ONNXRuntime
See ./ort-cov/ort_br_cov_venn.png


Figure 7.(b) TVM
See ./tvm-cov/tvm_br_cov_venn.png

Congratulations! You have successfully finished the main experiments of NNSmith!!! 🎉🎉🎉

Read more#

You may further refer to Read more for potential questions and extra/non-main experiments.


[1] Liu, Jiawei, et al. “NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers.” Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 2023.

[2] Luo, Weisi, et al. “Graph-Based Fuzz Testing for Deep Learning Inference Engines.” 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 2021.

[3] Wang, Zan, et al. “Deep Learning Library Testing via Effective Model Generation.” Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). 2020.