Benchmarking is a way to assess the performance of software on a
computing platform and to compare performance between different
platforms. The benchmark performance results can also be used to
prioritize software performance optimization efforts on emerging High
Performance Computing (HPC) systems. Optimization of the R interpreter,
its intrinsic functionality, and R packages for specific hardware
architectures will be necessary for data analysts to take full advantage
of the latest HPC clusters, and to obviate the need to reengineer their
analysis workflows. We developed the RHPCBenchmark
package
to determine the single-node run time performance of compute-intensive
linear algebra kernels that are common to many data analytics
algorithms, and the run time performance of machine learning
functionality commonly implemented with linear algebra operations.
The benchmarks are divided into three categories: dense matrix linear
algebra kernels, sparse matrix linear algebra kernels, and machine
learning functionality. All of the dense linear algebra kernels are
implemented around BLAS or LAPACK interfaces. The sparse linear algebra
kernels are members of the R Matrix library. The machine learning
benchmarks currently cover only variants of K-medoids clustering (the
partitioning around medoids algorithm) performed with the cluster
package. The dense matrix linear algebra kernels, sparse matrix linear
algebra kernels, and machine learning functions that are benchmarked are
all part of the R interpreter's intrinsic functionality or of packages
included with the standard R distributions from CRAN.
The following functions are intended to be called by users of this library to perform benchmarking for each of the three categories of benchmarks:
- RunDenseMatrixBenchmark: executes the dense matrix microbenchmarks
- RunSparseMatrixBenchmark: executes the sparse matrix microbenchmarks
- RunMachineLearningBenchmark: executes the machine learning microbenchmarks

Each of these benchmark functions is a top-level function which executes microbenchmarks on several R computational kernels and other functionality relevant to each benchmark category. The top-level benchmarks execute multiple performance trials for each microbenchmark in which the run time (synonymous with wall-clock time in this document) is obtained. The benchmarks then compute the average, standard deviation, maximum, and minimum run time for each microbenchmark. The user may also specify one or more warm-up runs to ensure the R programming environment is settled before executing the performance trials. The results of each microbenchmark performance trial are written to a data frame returned by the top-level benchmark functions. The summary statistics for each microbenchmark are written to files in CSV format as the benchmark function progresses through the microbenchmarks, permitting retention of data if the benchmark function is terminated prematurely. If a CSV file already exists in an output directory, the new results will be concatenated to the existing file.
See the object documentation for the top-level benchmark functions and the microbenchmark definition classes listed below for information on how to configure the individual microbenchmarks. If any of the microbenchmarks fails to run in a timely manner or fails due to memory constraints, the matrix sizes and number of performance trials per matrix can be adjusted. Each benchmark function takes as input microbenchmark definition objects that specify data sets and microbenchmarking functions used to time the functionality being benchmarked. The microbenchmark definition classes supported are listed below.
Microbenchmark definition classes:
- DenseMatrixMicrobenchmark: specifies a dense matrix microbenchmark
- SparseMatrixMicrobenchmark: specifies a sparse matrix microbenchmark
- ClusteringMicrobenchmark: specifies a clustering microbenchmark for machine learning

Each microbenchmark definition class has an allocator function field and a benchmark function field, which specify, respectively, the function that allocates the data for the microbenchmark and the function that times the functionality being tested. The microbenchmark definition classes also have parameters used to specify the dimensions of the data sets to be used with the microbenchmarks. See the object documentation for the microbenchmark definition classes for specifics on the class fields and how they should be populated.
The microbenchmarks supported by each benchmark are described in this section.
The dense matrix microbenchmarks supported by the dense matrix benchmark are: Cholesky factorization, matrix cross product, matrix determinant, eigendecomposition, linear solve with multiple right hand sides, least squares fit, matrix deformation and transpose, matrix-matrix multiplication, matrix-vector multiplication, QR decomposition, and singular value decomposition.
Each microbenchmark is tested with several matrices of increasing size. The tested matrix dimensions are parameterized by N with values of N equal to: 1000, 2000, 4000, 8000, 10000, 15000, and 20000. The lsfit microbenchmark tests matrices of dimension 2N-by-N/2, while the remaining microbenchmarks all test N-by-N (square) matrices.
The table below lists the name of each microbenchmark supported along with a brief description of the functionality it tests. The kernel functions are applied to matrix or vector operands, where A and B are input matrices and b and x are input vectors generated by the corresponding allocator function.
Microbenchmark name | Functionality tested | Kernel functions |
---|---|---|
cholesky | Cholesky factorization | chol(A) |
crossprod | Matrix cross product | crossprod(A) |
deformtrans | Transpose, reshape, and retranspose matrix | B <- t(A); dim(B) <- c(N/2, 2*N); t(B) |
determinant | Matrix determinant | determinant(A) |
eigen | Eigendecomposition | eigen(A, symmetric=FALSE, only.values=FALSE) |
lsfit | Least squares fit | lsfit(A, b, intercept=FALSE) |
solve | Linear solve | solve(A, B) |
transpose | Matrix transpose | t(A) |
matmat | Matrix-matrix multiplication | A %*% B |
matvec | Matrix-vector multiplication | A %*% x |
qr | QR decomposition | qr(A, LAPACK=TRUE) |
svd | Singular value decomposition | svd(A) |
For fast performance of the dense matrix kernels, it is crucial to link the R programming environment with optimized BLAS and LAPACK libraries. It is also important to have substantial amounts of memory (16GB minimum) to run most of the microbenchmarks.
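A quick way to confirm which BLAS and LAPACK libraries a given R instance is using is shown below; recent versions of R report the library paths as part of the session information.

# Print session information; recent versions of R include the paths of the
# BLAS and LAPACK libraries the instance is linked against.
sessionInfo()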
The function GetDenseMatrixDefaultMicrobenchmarks
defines the default microbenchmarks referenced in the table. It returns
a vector of DenseMatrixMicrobenchmark
objects specifying
each microbenchmark. Microbenchmark parameters that can be specified
include the matrix dimensions to be tested, the number of performance
trials per matrix, and the allocator and microbenchmarking functions to
be used. By default, the benchmark
function RunDenseMatrixBenchmark
calls
GetDenseMatrixDefaultMicrobenchmarks
to specify the
microbenchmarks to be executed. See the object documentation for
GetDenseMatrixDefaultMicrobenchmarks
for more details.
The allocator functions for the dense matrix microbenchmarks take a
DenseMatrixMicrobenchmark
object specifying the
microbenchmark and an integer index indicating which of the above values
of N is to be applied. The
allocator must return a list of allocated data objects, including the
matrix, for the microbenchmark to operate on. The microbenchmarking
functions take the DenseMatrixMicrobenchmark
object
defining the microbenchmark and the list of data objects returned by the
allocator function.
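As a concrete illustration of these signatures, the sketch below defines a hypothetical allocator and microbenchmarking function pair for a matrix-multiplication test. The function names are invented for this example, the dimensionParameters field follows the examples later in this vignette, and returning the result of system.time from the timing function is an assumption made for illustration; consult the object documentation for the exact return convention.

# A minimal sketch of a custom allocator/microbenchmark function pair for a
# DenseMatrixMicrobenchmark. The two-argument signatures follow the
# description above; the function names are hypothetical.
MyMatmatAllocator <- function(benchmarkParameters, index) {
  # Allocate two N-by-N operand matrices, where N is the index-th entry of
  # the microbenchmark's dimensionParameters field.
  n <- benchmarkParameters$dimensionParameters[index]
  kernelParameters <- list()
  kernelParameters$A <- matrix(rnorm(n * n), nrow = n)
  kernelParameters$B <- matrix(rnorm(n * n), nrow = n)
  return(kernelParameters)
}

MyMatmatMicrobenchmark <- function(benchmarkParameters, kernelParameters) {
  # Time only the kernel itself, not the data allocation; returning the
  # system.time result is an assumption for illustration.
  timings <- system.time({
    C <- kernelParameters$A %*% kernelParameters$B
  })
  return(timings)
}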
The sparse matrix microbenchmarks supported by the sparse matrix
benchmark are: matrix-vector multiplication, Cholesky factorization, LU
factorization, and QR factorization. Each sparse matrix kernel tested is
from the Matrix
library included with the
R distribution, and each kernel is performance tested
with two or three sparse matrices from different application domains.
The first table below lists each sparse matrix tested in the sparse
matrix benchmark along with some fundamental characteristics of each
matrix; the second table lists the microbenchmarks, their identifiers,
and brief descriptions of the problem domains of the tested matrices.
The Laplacian matrices were generated for use with the benchmark; the
rest of the matrices come from the University of Florida Sparse Matrix
Collection. More detailed descriptions of the matrices from the Sparse
Matrix Collection can be found on its web site. The R data files
containing the sparse matrices can be downloaded in the companion
package RHPCBenchmarkData.
Matrix name | Number of rows | Number of columns | Number of non-zeros | Symmetry and Structure |
---|---|---|---|---|
laplacian7pt_100 | 1000000 | 1000000 | 6940000 | symmetric banded |
laplacian7pt_200 | 8000000 | 8000000 | 55760000 | symmetric banded |
ca2010 | 710145 | 710145 | 3489366 | symmetric block |
ct20stif | 52329 | 52329 | 2600295 | symmetric block |
Andrews | 60000 | 60000 | 760154 | symmetric unstructured |
G3_circuit | 1585478 | 1585478 | 7660826 | symmetric block |
circuit5M_dc | 3523317 | 3523317 | 14865409 | non-symmetric block/banded |
stomach | 213360 | 213360 | 3021648 | non-symmetric banded |
torso3 | 259156 | 259156 | 4429042 | non-symmetric banded |
Maragal_6 | 21255 | 10152 | 537694 | rectangular unstructured |
landmark | 71952 | 2704 | 1146848 | rectangular banded |
The following table lists the microbenchmarks supported by the sparse matrix benchmark, the matrices each microbenchmark is executed with, and the kernel function that is microbenchmarked. The kernel functions are applied to matrix or vector operands, where A and B are input matrices and x is an input vector generated by the corresponding allocator function.
Microbenchmark name | Functionality tested | Problem Domain Description | Kernel function |
---|---|---|---|
matvec_laplacian7pt_100 | Sparse matrix-vector multiplication | Laplacian 7-point stencil applied to 100x100x100 grid | A %*% x |
matvec_laplacian7pt_200 | Sparse matrix-vector multiplication | Laplacian 7-point stencil applied to 200x200x200 grid | A %*% x |
matvec_ca2010 | Sparse matrix-vector multiplication | Undirected weighted graph from congressional redistricting | A %*% x |
cholesky_ct20stif | Sparse Cholesky factorization | Stiffness matrix from a structural engineering problem | Matrix::Cholesky(A) |
cholesky_Andrews | Sparse Cholesky factorization | Eigenvalue problem from computer graphics/vision | Matrix::Cholesky(A) |
cholesky_G3_circuit | Sparse Cholesky factorization | Circuit simulation | Matrix::Cholesky(A) |
lu_circuit5M_dc | Sparse LU factorization | Large circuit DC analysis | Matrix::lu(A) |
lu_stomach | Sparse LU factorization | 3D electro-physical model of a duodenum | Matrix::lu(A) |
lu_torso3 | Sparse LU factorization | 3D electro-physical model of a torso | Matrix::lu(A) |
qr_Maragal_6 | Sparse QR factorization | Rank deficient least squares | Matrix::qr(A) |
qr_landmark | Sparse QR factorization | Least squares | Matrix::qr(A) |
The functions GetSparseMatrixVectorDefaultMicrobenchmarks,
GetSparseCholeskyDefaultMicrobenchmarks,
GetSparseLuDefaultMicrobenchmarks, and
GetSparseQrDefaultMicrobenchmarks define the default microbenchmarks
referenced in the table. Each returns a vector of
SparseMatrixMicrobenchmark objects specifying each microbenchmark.
Microbenchmark parameters that can be specified include the matrix
dimensions to be tested, the number of performance trials to be
conducted per matrix, and the allocator and microbenchmarking functions
to be used. By default, the benchmark function RunSparseMatrixBenchmark
calls each of the GetSparse* functions to specify the microbenchmarks
to be executed. See the object documentation for
RunSparseMatrixBenchmark and the GetSparse* functions for more details.
The allocator functions for the sparse matrix microbenchmarks take a
SparseMatrixMicrobenchmark
object specifying the
microbenchmark and an integer index indicating which matrix is being
tested. The integer index is unused by the microbenchmarks specified by
the GetSparse*
default functions because the sparse matrix
microbenchmarks read the test matrices from files as opposed to
dynamically generating them. The allocator returns a list of data
objects, including the matrix, for the microbenchmark to operate on. The
microbenchmarking functions take the
SparseMatrixMicrobenchmark
object defining the
microbenchmark and the list of data objects returned by the allocator
function.
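For a quick sanity check outside the benchmark harness, the sparse kernels listed in the table can also be timed directly with the Matrix package. The sketch below uses a randomly generated sparse matrix rather than the benchmark's test matrices from RHPCBenchmarkData.

# Time two of the sparse kernels from the table on a randomly generated
# sparse matrix (a stand-in for the benchmark's test matrices).
library(Matrix)
n <- 100000
A <- rsparsematrix(n, n, density = 1e-4)   # random n-by-n sparse matrix
x <- rnorm(n)
system.time(A %*% x)                       # sparse matrix-vector multiplication
# Build a symmetric positive definite matrix so the Cholesky kernel applies.
S <- forceSymmetric(crossprod(A) + Diagonal(n))
system.time(Matrix::Cholesky(S))           # sparse Cholesky factorization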
Currently, clustering microbenchmarks are the only microbenchmarks
supported by the machine learning benchmark. The clustering benchmark
generates clusters of normally distributed feature vectors in an
N-dimensional, real-valued feature space, where the mean of one cluster
is located at the origin and the means of two clusters are each located
at positions -1 and 1 along one or more of the axes. See the object
documentation for the function GenerateClusterData for more details on
how the clusters are generated. Microbenchmarking is performed with two
cluster identification functions, pam and clara, from the cluster
package provided with the R distribution. The pam function implements
the partitioning around medoids algorithm, which has quadratic time
complexity; the clara function is intended for large data sets and runs
in linear time. The pam function is called with cluster.only=TRUE so
that only the assignments of feature vectors to clusters are reported.
The clara function is called with samples=50, keep.data=FALSE,
rngR=TRUE, and pamLike=TRUE. See the object documentation for the
RunMachineLearningBenchmark, pam, and clara functions for more details.
The clustering microbenchmarks and the parameters of their generated
data sets are listed in the table below.
Microbenchmark name | Functionality tested | Kernel function | Number of features | Number of clusters | Number of feature vectors per cluster |
---|---|---|---|---|---|
pam_cluster_3_7_2500 | clustering | cluster::pam | 3 | 7 | 2500 |
pam_cluster_3_7_5000 | clustering | cluster::pam | 3 | 7 | 5000 |
pam_cluster_3_7_5715 | clustering | cluster::pam | 3 | 7 | 5715 |
pam_cluster_16_33_1213 | clustering | cluster::pam | 16 | 33 | 1213 |
pam_cluster_64_33_1213 | clustering | cluster::pam | 64 | 33 | 1213 |
pam_cluster_16_7_2858 | clustering | cluster::pam | 16 | 7 | 2858 |
pam_cluster_32_7_2858 | clustering | cluster::pam | 32 | 7 | 2858 |
pam_cluster_64_7_5715 | clustering | cluster::pam | 64 | 7 | 5715 |
clara_cluster_64_33_1213 | clustering | cluster::clara | 64 | 33 | 1213 |
clara_cluster_1000_99_1000 | clustering | cluster::clara | 1000 | 99 | 1000 |
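The kernel calls in the table can be reproduced directly with the cluster package; the sketch below applies them to a small synthetic data set (the benchmark's own data sets are produced by GenerateClusterData).

# Apply the pam and clara kernels, with the options described above, to a
# small synthetic data set of 1000 3-dimensional feature vectors.
library(cluster)
X <- matrix(rnorm(1000 * 3), ncol = 3)
# pam reporting only the cluster assignments, as in the microbenchmarks
assignments <- pam(X, k = 7, cluster.only = TRUE)
# clara with the options used by the clara microbenchmarks
claraResult <- clara(X, k = 7, samples = 50, keep.data = FALSE,
                     rngR = TRUE, pamLike = TRUE)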
The function GetClusteringDefaultMicrobenchmarks defines the default
microbenchmarks referenced in the table. The function returns a vector
of ClusteringMicrobenchmark objects specifying each microbenchmark. By
default, the benchmark function RunMachineLearningBenchmark calls the
GetClusteringDefaultMicrobenchmarks function to specify the
microbenchmarks given in the table. See the object documentation for
the RunMachineLearningBenchmark and GetClusteringDefaultMicrobenchmarks
functions for more details.
The allocator functions for the clustering microbenchmarks take a
ClusteringMicrobenchmark
object specifying the
microbenchmark. The allocator must return a list of allocated data
objects, including the matrix of feature vectors, for the microbenchmark
to operate on. The microbenchmarking functions take the
ClusteringMicrobenchmark
object defining the microbenchmark
and the list of data objects returned by the allocator function.
There is no explicit support for multithreading by the benchmarking
functions in the R HPC benchmark. However, if the functionality being
microbenchmarked is implemented with support for multithreading, and
the number of threads can be controlled through environment variables,
as is often the case, then the benchmarks can be executed
multithreaded. For example, a user who wishes to benchmark
functionality implemented with a package engineered with OpenMP support
can set the OMP_NUM_THREADS environment variable before running the
benchmarks. If a microbenchmark tests kernel functions implemented with
a multithreaded BLAS or LAPACK library, then the environment variable
specific to that library for controlling the number of threads should
be set. For example, the MKL_NUM_THREADS environment variable should be
set when the dense matrix benchmarks are run with an instance of R
linked to the parallel Intel Math Kernel Library, which provides
multithreaded BLAS and LAPACK functionality.
Even though the benchmark functions do not control the number of
threads to be utilized, the benchmarks must still report the number of
threads used in the CSV files and data frames generated for reporting
results. Thus, the environment variable specifying the number of threads
must be retrievable in a way that is portable regardless of which
multithreaded library the R programming environment is linked with. To
accomplish this, the environment variable
R_BENCH_NUM_THREADS_VARIABLE
must be set by the benchmark
user to specify the name of the particular environment variable
specifying the number of threads to be used. So if
R_BENCH_NUM_THREADS_VARIABLE
is set to
MKL_NUM_THREADS
, and MKL_NUM_THREADS
is set to
16, then the number of threads reported in the timing results will be
16.
For consistent behavior, the user should set the environment variable for the number of threads before the R programming environment is initialized. This is because the multithreading libraries with which the R environment may be linked will not necessarily register changes to the number of threads after initialization.
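As a minimal illustration, assuming the variables were exported in the shell before R was started (for example MKL_NUM_THREADS=16 and R_BENCH_NUM_THREADS_VARIABLE=MKL_NUM_THREADS), the setting the benchmarks will report can be checked from within R:

# Check, from within R, the thread count the benchmarks will report; this
# assumes both variables were set in the shell before R was started.
threadsVariable <- Sys.getenv("R_BENCH_NUM_THREADS_VARIABLE")  # e.g. "MKL_NUM_THREADS"
Sys.getenv(threadsVariable)                                    # e.g. "16"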
Examples are given in this section to show how to run each benchmark. Before running the examples, certain environment variables need to be set to specify the number of threads for parallel processing; see the previous section for how to properly do this.
This example runs all of the default dense matrix microbenchmarks,
saves the summary statistics for each microbenchmark in the directory
DenseMatrixResults
, and saves the data frame returned from
the dense matrix benchmark to a file named
allResultsFrame.RData
.
# Create the directory where the CSV files and data frames will be saved.
resultsDirectory <- "./DenseMatrixResults"
dir.create(resultsDirectory)
# Set the run identifier to be appended to the names of the CSV files generated
# so that they can be identified with this run.
runIdentifier <- "all"
# Run all of the default microbenchmarks
allResults <- RunDenseMatrixBenchmark(runIdentifier, resultsDirectory)
save(allResults, file="./DenseMatrixResults/allResultsFrame.RData")
This example runs all but the matrix transpose microbenchmarks, which tend to run very slowly, and saves the results to the same directory as in the previous example.
# Get the default microbenchmark definition objects and deactivate
# the unwanted matrix transpose microbenchmarks.
myDenseMatrixMicrobenchmarks <- GetDenseMatrixDefaultMicrobenchmarks()
myDenseMatrixMicrobenchmarks[["deformtrans"]]$active <- FALSE
myDenseMatrixMicrobenchmarks[["transpose"]]$active <- FALSE
# Set an appropriate run identifier
runIdentifier <- "no_transpose"
exTransposeResults <- RunDenseMatrixBenchmark(runIdentifier, resultsDirectory, myDenseMatrixMicrobenchmarks)
save(exTransposeResults, file="./DenseMatrixResults/exTransposeResultsFrame.RData")
This example runs only the matrix-matrix microbenchmark, and it modifies the default matrix dimensions to test only a few small matrices.
# Get the default microbenchmark definition objects and select only
# the matrix-matrix multiplication microbenchmark to run.
myMatmatMicrobenchmark <- list()
myMatmatMicrobenchmark[["matmat"]] <- GetDenseMatrixDefaultMicrobenchmarks()[["matmat"]]
# Set the matrix dimensions to be tested to 2000 and 3000. The
# number of trials and number of warmup trials per matrix must also be
# updated.
myMatmatMicrobenchmark[["matmat"]]$dimensionParameters <- as.integer(c(2000, 3000))
myMatmatMicrobenchmark[["matmat"]]$numberOfTrials <- as.integer(c(3, 3))
myMatmatMicrobenchmark[["matmat"]]$numberOfWarmupTrials <- as.integer(c(1, 1))
# Set an appropriate run identifier
runIdentifier <- "matmat"
matmatResults <- RunDenseMatrixBenchmark(runIdentifier, resultsDirectory, microbenchmarks=myMatmatMicrobenchmark)
save(matmatResults, file="./DenseMatrixResults/matmatResultsFrame.RData")
This example runs all of the default sparse matrix microbenchmarks,
saves the summary statistics for each microbenchmark in the directory
SparseMatrixResults
, and saves the data frame returned from
the sparse matrix benchmark to a file named
allResultsFrame.RData
.
# Create the directory where the CSV files and data frames will be saved.
resultsDirectory <- "./SparseMatrixResults"
dir.create(resultsDirectory)
# Set the run identifier to be appended to the names of the CSV files generated
# so that they can be identified with this run.
runIdentifier <- "all"
# Run all of the default microbenchmarks
allResults <- RunSparseMatrixBenchmark(runIdentifier, resultsDirectory)
save(allResults, file="./SparseMatrixResults/allResultsFrame.RData")
This example runs only the Cholesky factorization microbenchmarks and saves the results to the same directory as in the previous example.
# Set an appropriate run identifier
runIdentifier <- "cholesky_only"
# Run only the sparse Cholesky factorization microbenchmarks
choleskyResults <- RunSparseMatrixBenchmark(runIdentifier, resultsDirectory,
matrixVectorMicrobenchmarks=NULL, luMicrobenchmarks=NULL, qrMicrobenchmarks=NULL)
save(choleskyResults, file="./SparseMatrixResults/choleskyResultsFrame.RData")
This example runs all of the default machine learning
microbenchmarks, saves the summary statistics for each microbenchmark in
the directory MachineLearningResults
, and saves the data
frame returned from the machine learning benchmark to a file named
allResultsFrame.RData
.
# Create the directory where the CSV files and data frames will be saved.
resultsDirectory <- "./MachineLearningResults"
dir.create(resultsDirectory)
# Set the run identifier to be appended to the names of the CSV files generated
# so that they can be identified with this run.
runIdentifier <- "all"
# Run all of the default microbenchmarks
allResults <- RunMachineLearningBenchmark(runIdentifier, resultsDirectory)
save(allResults, file="./MachineLearningResults/allResultsFrame.RData")
This example shows how to specify a new clustering microbenchmark and run it.
# Create a new clustering microbenchmark that tests the `clara` method from the
# `cluster` package using a data set with 250 features, 50 clusters, and
# 2000 normally distributed feature vectors per cluster.
claraMicrobenchmark <- list()
claraMicrobenchmark[["clara_cluster_250_50_2000"]] <- methods::new(
"ClusteringMicrobenchmark",
active = TRUE,
benchmarkName = "clara_cluster_250_50_2000",
benchmarkDescription = "Clustering of 100000 250-dimensional feature vectors into 50 clusters using clara function",
numberOfFeatures = as.integer(250),
numberOfClusters = as.integer(50),
numberOfFeatureVectorsPerCluster = as.integer(2000),
dataObjectName = NA_character_,
numberOfTrials = as.integer(3),
numberOfWarmupTrials = as.integer(1),
allocatorFunction = ClusteringAllocator,
benchmarkFunction = ClaraClusteringBenchmark
)
# Set an appropriate run identifier
runIdentifier <- "clara"
# Run the clara microbenchmark
claraResults <- RunMachineLearningBenchmark(runIdentifier, resultsDirectory,
clusteringMicrobenchmarks=claraMicrobenchmark)
save(claraResults, file="./MachineLearningResults/claraResultsFrame.RData")