\chapter{Auditing Rules}
\label{sec:auditing-rules}
\emph{This chapter contains the auditing policies for the LDBC Benchmarks. The initial draft of the auditing policies was published in the EU project deliverable D6.3.3 ``LDBC Benchmark Auditing Policies''.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
This chapter is divided into the following parts:
\begin{itemize}
\item Motivation of benchmark result auditing
\item General discussion of auditable aspects of benchmarks
\item Specific checklists and running rules for \ldbcfinbench workloads
\end{itemize}
Many definitions and general considerations are shared between the benchmarks, hence it is justified to present the
principles first and to refer to these in the context of the benchmark-specific rules. The auditing process, including
the auditor certification exams, the possibility of challenging audited results, \etc, are defined in the LDBC
Byelaws~\cite{ldbc_byelaws}. Please refer to the latest Byelaws document when conducting audits.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Rationale and General Principles}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The purpose of benchmark auditing is to improve the \emph{credibility} and \emph{reproducibility} of benchmark claims by involving a set of detailed execution rules and third-party verification of compliance with these.
Rules may exist separately from auditing but auditing is not meaningful unless the rules are adequately precise.
Aspects like auditor training and qualification cannot be addressed separately from a discussion of the matters the
auditor is supposed to verify. Thus, the credibility of the entire process hinges on a clear and shared understanding
of what a benchmark is expected to demonstrate and on the auditor being capable of understanding the process
and verifying that the benchmark execution is fair and does not abuse the rules or pervert the objectives of
the benchmark.
Due to the open-ended nature of technology and the agenda of furthering innovation via measurement, it is
not feasible or desirable to over-specify the limits of benchmark implementation. Hence, there will always remain
judgment calls for borderline cases. In this respect auditing and the LDBC are not separate. It is expected that
issues of compliance, as well as maintenance of rules, will come before the LDBC as benchmark claims are
made.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Auditing Rules Overview}
\subsection{Auditor Training, Certification, and Selection}
\subsubsection{Auditor Training}
Auditor training consists of familiarization with the benchmark and existing implementations thereof. This involves the auditor candidate running the reference implementations of the benchmark to see what is normal behavior and practice in the workload. The training and practice may involve communication with the benchmark task force for clarifying the intent and details of the benchmark rules. This produces feedback for the task force for further specification of the rules.
\subsubsection{Auditor Certification}
The auditor certification and qualification are done in the form of an examination administered by the task force responsible for the benchmark being audited. The examination may be carried out by teleconference. The task force will subsequently vote on accepting each auditor, by a simple majority. An auditor is certified for a particular benchmark by the task force maintaining the benchmark in question.
\subsubsection{Auditor Selection}
In the default auditor selection, the task force responsible for the benchmark being audited appoints a third-party, impartial auditor. \emph{If needed, a Conflict of Interest Statement will be signed and provided.} The task force may in special cases appoint itself as auditor of a particular result. This is not,
however, the preferred course of action but may be done if no suitable third-party auditor is available.
\subsection{Auditing Process Stages}
\subsubsection{Getting Ready for a Benchmark Audit}
A benchmark result can be audited if it is a \emph{complete implementation} of an LDBC benchmark workload. This includes implementing all operations correctly, using official data sets, using the official LDBC driver (if available), and complying with the auditing rules of the workload (\eg workloads may have different rules regarding query languages, the allowance of materialized views, \etc).
Workloads may specify further requirements such as ACID compliance (checked using the LDBC FinBench ACID test suite).
\subsubsection{Performing a Benchmark Audit}
A benchmark result is to be audited by an LDBC-appointed auditor or the LDBC task force managing the benchmark. An LDBC audit may be performed by remote login and does not require the auditor's physical presence on site. The test sponsor shall grant the auditor any access necessary for validating the benchmark run. This will typically include administrator access to the SUT hardware.
\subsubsection{Benchmark-Specific Checklist}
Each benchmark specifies a checklist to be verified by the auditor. The benchmark run shall be performed by the auditor. The auditor shall make copies of relevant configuration files and test results for future checking and insertion into the full disclosure report.
\subsubsection{Producing the FDR}
The FDR is produced by the auditor or auditors, with any required input from the test sponsor. Each non-default configuration parameter must be included in the FDR, together with a justification for why the parameter was changed.
The auditor produces an attestation letter that verifies the authenticity of the presented results. This letter is to be included in the FDR as an addendum. The attestation letter has no specific format requirements but shall state that the auditor has established compliance with a specified version of the benchmark specification.
\subsubsection{Publishing the FDR}
The FDR and any benchmark-specific summaries thereof shall be published on the LDBC website, \url{https://ldbcouncil.org/}.
\subsection{Challenge Procedure}
A benchmark result may be \emph{challenged} for non-compliance with LDBC rules. The benchmark task force responsible for the maintenance of the benchmark will rule on matters of compliance. A result found to be non-compliant will be withdrawn from the list of official LDBC benchmark results.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Auditable Properties of Systems and Benchmark Implementations}
\subsection{Validation of Query Results}
\label{sec:validation}
A benchmark should be published with a deterministically reproducible validation data set. Validation queries applied to the validation data set will deterministically produce a set of correct answers. This is used in the first stage of the benchmark run to test the correctness of a SUT or benchmark implementation. This validation stage is not timed.
\paragraph{Inputs for validation}
The validation takes the form of a set of data generator parameters, a set of test queries that includes at least one instance of each of the workload's query templates, and the expected results.
\paragraph{Approximate results and error margin}
In certain cases, the results may be approximate. This may happen with non-unique result ordering keys, imprecise numeric data types, randomized behavior in certain graph analytics algorithms, \etc Therefore, a validation set shall specify the degree of allowable error: for example, counts must be exact; sums, averages, and the like require at least 8 significant digits; statistical measures such as graph centralities must be within 1\% of the reference result. Each benchmark shall specify its expectation in an unambiguously verifiable manner.
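These checks are mechanical and can be scripted by the auditor. The following Python sketch illustrates one possible comparison routine; the function names and the exact interpretation of the tolerances (\eg agreement to 8 significant digits) are assumptions of this example, not normative definitions.
\begin{verbatim}
def counts_match(actual, expected):
    # Counts must be exact.
    return actual == expected

def aggregate_matches(actual, expected, significant_digits=8):
    # Sums, averages, and the like: require agreement to at least
    # 8 significant digits (one possible interpretation).
    if expected == 0:
        return actual == 0
    return abs(actual - expected) <= abs(expected) * 0.5 * 10 ** (1 - significant_digits)

def statistic_matches(actual, expected, relative_error=0.01):
    # Statistical measures such as graph centralities: within 1% of
    # the reference result.
    if expected == 0:
        return abs(actual) <= relative_error
    return abs(actual - expected) <= relative_error * abs(expected)
\end{verbatim}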
\subsection{ACID Compliance}
\label{sec:acid-compliance}
As part of the auditing process for the Transaction workload, the auditors ascertain that the SUT satisfies the ACID properties,
\ie it provides atomic transactions, complies with its claimed isolation level, and ensures durability in case of failures.
This section outlines the transactional behaviors of SUTs which are checked in the course of auditing a SUT in a given benchmark.
A benchmark specifies transactional semantics that may be required for different parts of the workload. The requirements will typically be different for the initial bulk load of data and for the workload itself. Different sections of the workload may further be subject to different transactionality requirements.
No finite series of tests can prove that the ACID properties are fully supported. Passing the specified tests is a necessary, but not sufficient, condition for meeting the ACID requirements. However, for fairness of reporting, only the tests specified here are required and must appear in the FDR for a benchmark. (This is taken exactly from the \mbox{TPC-C} specification~\cite{tpcc}.)
The properties for ACID compliance are defined as follows:
\paragraph{Atomicity}
Either all the effects of the transaction are in place after the transaction, or none of them are. By definition, this is only verifiable after the transaction has finished.
\paragraph{Consistency}
ADS such as secondary indices will be consistent among themselves as well as with the table or other PDS, if any. Such consistency (compliance with all constraints, if these are declared in the schema, \eg primary key, foreign key, and cardinality constraints) may be verified
after the commit or rollback of a transaction. If a single thread of control runs within a transaction, then
subsequent operations are expected to see a consistent state across all data indices of a table
or similar object. Multiple threads that share a transaction context are not required to observe a
consistent state at all times during the execution of the transaction. Consistency will, however, always be
verifiable after the commit or rollback of any transaction, regardless of the number of threads that have
implicitly or explicitly participated in the transaction. Any intra-transaction parallelism introduced
by the SUT will preserve transactional semantics statement by statement. If explicit, application-created
sessions share a transaction context, then this definition of consistency does not hold: for example, if
two threads insert into the same table at the same time in the same transaction context, they may or may
not see a consistent image of the (E)ADS for the parts affected by the other thread. Everything will be
consistent after the commit or rollback, however, regardless of the number of threads, implicit or explicit,
that have participated in the transaction.
\paragraph{Isolation}
Isolation is defined as the set of phenomena that may (or may not) be observed by operations running within a single transaction context. The levels of isolation are defined as follows:
\begin{description}
\item[Read uncommitted] No guarantees apply.
\item[Read committed] A transaction will never read a value that has at no point in time been part of a
committed state.
\item[Repeatable read] If a transaction reads a value several times during its execution, then it will see
the original state with its modifications so far applied to it. If the transaction itself consists of
multiple reading and updating threads then the ambiguities that may arise are beyond the scope of transaction isolation.
\item[Serializable] The transactions see values that correspond to a fully serial execution of
all client transactions. This is like repeatable read, except that if the transaction reads something and
repeats the read, it is guaranteed that no new values will appear for the same search condition on a
subsequent read in the same transaction context. For example, a row that was seen not to exist when
first checked will not be seen by a subsequent read. Likewise, counts of items will not be seen to
change.
\end{description}
\paragraph{Durability}
Durability means that once the SUT has confirmed a successful commit, the committed state
will survive any instantaneous failure of the SUT (\eg a power failure, software crash, reboot or
the like). Durability is tied to atomicity in that if one part of the changes made by a transaction survives then
all parts must survive. %This is a special concern in distributed systems which must coordinate durability across multiple physical systems and processes.
\subsection{Data Format and Preprocessing}
\label{sec:auditing-data-format}
When producing the data sets, implementers are allowed to use custom formatting options (\eg use or omission of quotes, separator character, datetime format, \etc).
It is also allowed to convert the output of the DataGen into a format (\eg Parquet) that is loadable by the test-specific implementation of the data importer.
Additional preprocessing steps are also allowed, including adjustments to the CSV files (\eg with shell scripts), splitting and concatenating files, compressing and decompressing files, \etc
However, the preprocessing step shall not include a precomputation of (partial) query results.
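As an illustration of an allowed preprocessing step, the following Python sketch converts one DataGen CSV file to Parquet. The file name, separator, and the use of the pandas library are assumptions of this example; the key point is that it only changes the storage format and does not precompute any query results.
\begin{verbatim}
import pandas as pd  # assumes pandas with a Parquet engine (e.g. pyarrow)

# Hypothetical DataGen output file; adjust the path and separator to the
# formatting options chosen for the run.
df = pd.read_csv("account.csv", sep="|")

# Pure format conversion: no filtering, no joins, no partial query results.
df.to_parquet("account.parquet", index=False)
\end{verbatim}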
\subsection{Query Languages}
\label{sec:query-languages}
In typical RDBMS benchmarks, online transaction processing (OLTP) benchmarks are allowed to be implemented via stored procedures, effectively amounting to explicit query plans.
Meanwhile, online analytical processing (OLAP) benchmarks prohibit the use of general-purpose programming languages (\eg C, C\texttt{++}, Java) for query implementations and only allow domain-specific query languages.
In the graph processing space, there is currently (as of 2022) no standard query language and the systems are considerably more heterogeneous.
Therefore, the situation for LDBC regarding declarative query languages is not as simple as that of, for example, \mbox{TPC-H} (where queries must be specified in SQL, with the additional constraint that optimizer hints are omitted), and individual FinBench workloads specify their own policy of either requiring a domain-specific query language or allowing the queries to be implemented in a general-purpose programming language.
In the case of domain-specific languages, systems are allowed to implement a FinBench query as a sequence of multiple queries.
A typical example of this is the following sequence:
(1)~create a projected graph,
(2)~run query,
(3)~drop projected graph.
However, it is not allowed to use sub-queries in an unrealistic and contrived manner, \ie with the goal of overcoming optimization issues, \eg hard-coding a certain join order in a declarative query language.
It is the responsibility of the auditor to determine whether a sequence of queries can be considered realistic w.r.t.\ how a user would formulate their queries in the language provided by the system.
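As a concrete illustration, the sketch below issues such a three-step sequence through a generic Python client session. The session API and the statement texts are hypothetical and system-specific; the point is that the decomposition mirrors what an ordinary user of the system would plausibly write.
\begin{verbatim}
def run_on_projection(session, query_text, params):
    # (1) create a projected graph (statement wording is system-specific)
    session.run("CREATE PROJECTED GRAPH tmp_graph ...")   # hypothetical
    try:
        # (2) run the actual FinBench query against the projection
        return list(session.run(query_text, params))
    finally:
        # (3) drop the projected graph again
        session.run("DROP PROJECTED GRAPH tmp_graph")      # hypothetical
\end{verbatim}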
\subsubsection{Rules for Imperative Implementations Using a General-Purpose Programming Language}
An implementation where the queries are written in a general-purpose programming language (including imperative and ``API-based'' implementations) may choose between semantically equivalent implementations of an operation based on the query parameters. This simulates the behavior of a query optimizer in the presence of literal values in the query. If an implementation does this, all the code must be disclosed as part of the FDR and the decision must be based on values extracted from the database, not on hard-coded threshold values in the implementation.
The auditor must be able to reliably assess the compliance of the implementation with the guidelines specifying these matters. The actual specification remains benchmark-dependent. Borderline cases may be brought to the responsible task force for arbitration.
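The sketch below illustrates, for a hypothetical API-based implementation, the kind of choice that is allowed: the decision between two semantically equivalent variants is driven by statistics read from the database at run time, not by a constant hard-coded in the implementation. All helper names are assumptions of this example.
\begin{verbatim}
def run_transfer_query(db, account_id, window):
    # Both statistics are extracted from the database itself; no
    # hard-coded threshold values appear in the decision.
    out_degree = db.count_outgoing_transfers(account_id)  # hypothetical API
    avg_degree = db.average_outgoing_transfers()          # hypothetical API

    if out_degree <= avg_degree:
        # Variant A: start the traversal from the account side.
        return run_from_account(db, account_id, window)
    else:
        # Variant B: semantically equivalent plan starting from the
        # time-window side.
        return run_from_time_window(db, account_id, window)
\end{verbatim}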
\subsubsection{Disclosure of Query Implementations in the FDR}
Benchmarks allowing imperative expression of the workload should require full disclosure of all query implementation code.
\subsection{Materialization}
The mix of read and update operations in a workload will determine to which degree precomputation of results is beneficial. The auditor must check that materialized results are kept consistent at the end of each transaction.
\subsection{System Configuration and System Pricing}
\label{sec:system-config}
% The next step is to collect the technical and pricing details of the system under test.
A benchmark execution shall produce a full disclosure report which specifies the hardware and software of the SUT, the benchmark implementation version and any specifics that are detailed in the benchmark specification. This clause gives a general minimum for disclosure for the SUT.
\subsubsection{Details of Machines Driving and Running the Workload}
A SUT may consist of one or more pieces of physical hardware. A SUT may include virtual or bare-metal machines in a cloud service.
For each distinct configuration, the FDR shall disclose the number of units of the type as well as the following:
\begin{enumerate}
\item The used cloud provider (including the region where machines reside, if applicable).
\item Common name of the item, \eg Dell PowerEdge xxxx or i3.2xlarge instance.
\item Type and number of CPUs, cores \& threads per CPU, clock frequency, cache size.
\item Amount of memory, type of memory and memory frequency, \eg 64GB DDR3 1333MHz.
\item Disk controller or motherboard type if the disk controller is on the motherboard.
\item For each distinct type of secondary storage device, the number and specification of the device, \eg 4xSeagate Constellation 2TB SATA 6Gbit/s.
\item Number and type of network controllers, \eg 1x Mellanox QDR InfiniBand HCA, PCIE 2.0, 2x1GbE on motherboard. If the benchmark execution is entirely contained on a single machine, it must be stated, and the description of network controllers can be omitted.
\item Number and type of network switches. If multiple switches are used, the wiring between the switches should be disclosed.
Only the network switches and interfaces that participate in the run need to be reported. If the benchmark execution is entirely contained on a single machine, it must be stated, and the description of network switches can be omitted.
\item Date of availability of the system as a whole, \ie the latest date of availability of any part.
\end{enumerate}
\subsubsection{System Pricing}
The price of the hardware in question must be disclosed. For cloud setups, the price of a dedicated instance for 3 years must be disclosed. The price should reflect the single quantity list price that any buyer could expect when purchasing one system with the given specification. The price may be either an item-by-item price or a package price if the system is sold as a package.
Reported prices should adhere to the TPC Pricing Specification 2.7.0~\cite{pricing,tpc-pricing}.
It is particularly important to ensure that the maintenance contract guarantees 24/7 support and 4~hour response time for problem recognition.
\subsubsection{Details of Software Components in the System}
The SUT software must be described at least as follows:
\begin{enumerate}
\item The units of the SUT software are typically the DBMS and operating system.
\item Name and version of each separately priced piece of the SUT software.
\item If the price of the SUT software is tied to the platform or the count of concurrent users, these parameters must be disclosed.
\item Price of the SUT software.
\item Date of availability.
\end{enumerate}
Reported prices should adhere to the TPC Pricing Specification 2.5.0~\cite{pricing,tpc-pricing}.
The configuration of the SUT must be reported to include the following:
\begin{enumerate}
\item The used LDBC specification, driver and data generator version.
\item Complete configuration files of the DBMS, including any general server configuration files, any configuration scripts run on the DBMS for setting up the benchmark run etc.
\item Complete schema of the DBMS, including any specification of the storage layout.
\item Any OS configuration parameters if other than default, \eg \verb+vm.swappiness+, \verb+vm.max_map_count+ in Linux.
\item Complete source code of any server-side logic, \eg stored procedures, triggers.
\item Complete source code of driver-side benchmark implementation.
\item Description of the benchmark environment, including software versions, OS kernel version, DBMS version as well as versions of other major software components used for running the benchmark (Docker, Java Virtual Machine, Python, etc.).
\item The SUT's highest configurable isolation level and the isolation level used for running the benchmark.
%\item Use of partitioning or replication across multiple machines shall be disclosed if used. The specific partitioning keys or replication criteria, as well as the transactional behavior of said partitioning or replication shall be described. This shall not be inconsistent with the ACID behaviors specified in the benchmark.
\end{enumerate}
\subsubsection{Audit of System Configuration}
The auditor must ascertain that a reported run has indeed taken place on the SUT in the disclosed configuration.
The full disclosure shall contain any relevant parameters of the benchmark execution itself, including:
\begin{enumerate}
\item Parameters, switches, configuration file for data generation.
\item Complete text of any data loading script or program.
\item Parameters, switches, configuration files for any test driver. If the test driver is not an LDBC supplied open source package or is a modification of such, then the complete text or diff against a specific LDBC package must be disclosed.
\item Test driver output files shall be part of the disclosure. In general, these must at least detail the following:
\begin{enumerate}[label=\roman*)]
\item Time and duration of data load and the timed portion of the benchmark execution.
\item Count of each workload item (\eg query, transaction) successfully executed within the measurement window.
\item Min/average/max execution time of each workload item; the specific benchmark shall specify additional details.
\end{enumerate}
\end{enumerate}
Given this information, the number of concurrent database sessions at each point in the execution must be clearly stated. In the case of a cluster database, the possible spreading of connections across multiple server processes must be disclosed.
All parameters included in this section must be reported in the full disclosure report to guarantee that the benchmark run can be reproduced exactly in the future. Similarly, the test sponsor will inform the auditor of the scale factor to test. Finally, a clean test system with enough space to store the initial data set, the update streams, substitution parameters and anything that is part of the input and output as well as the benchmark run must be provided.
\subsection{Benchmark Specifics}
Similarly to TPC benchmarks, the LDBC benchmarks prohibit so-called benchmark specials, \ie extra software modules implemented in the core DBMS logic just to make a selected benchmark run faster. Furthermore, upon request of the auditor, the test sponsor must provide all source code relevant to the benchmark.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Auditing Rules for the Transaction Workload}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
This section specifies a checklist (in the form of individual sections) that a benchmark audit shall cover in the case of the FinBench Transaction workload. An overview of the benchmark audit workflow is shown in \autoref{fig:audit-workflow}. The three major phases of the audit are preparing the input data and validation query results (captured by \emph{Preparations} in the figure), validating the correctness of query results returned by the SUT using the validation scale factor and running the benchmark with all the prescribed workloads (\emph{Benchmarking}), and creating the FDR (\emph{Finalization}). The color codes capture the responsibilities for performing a step or providing some data in the workflow.
\begin{figure}[h]
\centering
\includegraphics[scale=\yedscale]{figures/audit-workflow}
\caption{Benchmark execution and auditing workflow. For non-audited runs, the implementers perform the steps of the auditor.}
\label{fig:audit-workflow}
\end{figure}
A key objective of the auditing guidelines for the Transaction workload is to \emph{allow a broad range of systems} to implement the benchmark.
Therefore, they do not impose constraints on the data model
(graph, relational, triple, \etc representations are allowed)
or on the query language
(both declarative and imperative languages are allowed).
\subsection{Scaling Factors}
\label{sec:transaction-workload-scaling}
The scale factor of a FinBench data set is the size of the data set in GiB of CSV (comma-separated values) files.
The size of a data set is characterized by scale factors: SF0.1, SF1, SF3 \etc (see \autoref{sec:scale-factors}).
All data sets contain data for three years of financial activities.
The \emph{validation run} shall be performed on the SF1 data set (see \autoref{sec:transaction-workload-validation-data-set}). Note that the auditor may perform additional validation runs of the benchmark implementation using smaller data sets (\eg SF0.1) and issue additional queries.
Audited \emph{benchmark runs} of the Transaction workload shall use SF10. The rationale behind this decision is to ensure that there is a sufficient number of update operations available to guarantee 2.5~hours of continuous execution (see \autoref{sec:transaction-workload-measurement-window}).
\subsection{Data Model}
FinBench may be implemented with different data models (\eg relational, RDF, and different graph data models). The reference schema is provided in the specification using a UML-like notation.
\subsection{Precomputation}
Precomputation of query results (both interim and end results) is allowed. However, systems must ensure that precomputed results (\eg materialized views) are kept consistent upon updates.
\subsection{Benchmark Software Components}
\label{sec:finbench-software-components}
LDBC provides a test driver, data generator, and summary reporting scripts. Benchmark implementations shall use a stable
version of the test driver. The SUT's database software should be a stable version that is available publicly or can be
purchased at the time of the release of the audit. Please see \autoref{sec:software-components} for more details.
\subsubsection{Adaptation of the Test Driver to a DBMS}
\label{sec:test-driver}
A qualifying run must use a test driver that adapts the provided test driver to interface with the SUT. Such an implementation, if needed, must be provided by the test sponsor. The parameter generation, result recording, and workload scheduling parts of the test driver should not be changed. The auditor must be given access to the test driver source code used in the reported run.
For each operation executed, the test driver records the following artifacts as a by-product of the run: the start and end timestamps in wall clock time (recorded with microsecond precision), the identifier of the operation, and any substitution parameters.
\subsubsection{Summary of Benchmark Results}
\label{sec:performance-metrics}
A separate test summary tool provided with the test driver analyses the test driver log(s) after a measurement window is completed.
The tool produces for each of the distinct queries and transactions the following summary:
\begin{itemize}
\item Run time of query in wall clock time.
\item Count of executions.
\item Minimum/mean/percentiles/maximum execution time.
\item Standard deviation from the average execution time.
\end{itemize}
The tool produces for the complete run the following summary:
\begin{itemize}
\item Operations per second for a given SF (throughput). This is the primary metric of this workload.
\item The total execution time in wall clock time.
\item The total number of completed operations.
\end{itemize}
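For illustration, a minimal Python sketch of the per-operation part of such a summary is given below; the column names of the driver log are assumptions based on the fields mentioned in this chapter, not a normative file format. The throughput metric is simply the total number of completed operations divided by the length of the measurement window.
\begin{verbatim}
import csv
import statistics
from collections import defaultdict

def summarize(results_log_path):
    durations = defaultdict(list)
    with open(results_log_path, newline="") as f:
        for row in csv.DictReader(f):
            # Assumed columns: operation_type, duration (milliseconds).
            durations[row["operation_type"]].append(float(row["duration"]))

    for op, values in sorted(durations.items()):
        values.sort()
        p95 = values[int(0.95 * (len(values) - 1))]
        stdev = statistics.stdev(values) if len(values) > 1 else 0.0
        print(op, "count:", len(values), "min:", values[0],
              "mean:", round(statistics.mean(values), 2),
              "p95:", p95, "max:", values[-1],
              "stdev:", round(stdev, 2))
\end{verbatim}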
\subsection{Implementation Language and Data Access Transparency}
The queries and updates may be implemented in a domain-specific query language or as procedural code written in a general-purpose programming language (\eg using the API of the database).
\subsubsection{Implementations Using a Domain-Specific Query Language}
\label{sec:finbench-domain-specific-query-language}
If a domain-specific query language is used, \eg SPARQL, SQL, Cypher, or Gremlin, then explicit query plans are prohibited in all read-only queries.%
\footnote{If it is not clear whether the queries are declarative, the auditor must ensure that they do not specify explicit query plans by investigating their source code and experimenting with the query planner of the system (\eg using SQL's \texttt{EXPLAIN} command).}
The update transactions may still consist of multiple statements, effectively amounting to explicit plans.
Explicit query plans include but are not limited to:
\begin{itemize}
\item Directives or hints specifying a join order or join type
\item Directives or hints specifying an access path, \eg which index to use
\item Directives or hints specifying an expected cardinality, selectivity, fanout or any other information that pertains to the expected number of results or cost of all or part of the query.
\end{itemize}
\begin{quote}
\emph{Rationale behind the applied restrictions.} The updates are effectively OLTP and, therefore, the customary freedoms apply, including the use of stored procedures, subject, however, to access transparency. Declarative queries in a benchmark implementation should be such that they could plausibly be written by an application developer. Therefore, their formulation should not contain system-specific aspects that an application developer would be unlikely to know. In other words, producing a benchmark implementation should not require uncommon sophistication on the part of the developer. This is regular practice in analytical benchmarks, \eg \mbox{TPC-H}.
\end{quote}
\subsubsection{Implementations Using a General-Purpose Programming Language}
\label{sec:finbench-general-purpose-programming-language}
Implementations using a general-purpose programming language for specifying the queries (including procedural, imperative, and API-based implementations) are expected to respect the rules described in \autoref{sec:query-languages}.
For these implementations, the rules in \autoref{sec:finbench-domain-specific-query-language} do not apply.
\subsection{Correctness of Benchmark Implementation}
\subsubsection{Validation Data Set}
\label{sec:transaction-workload-validation-data-set}
The scale factor 1 (SF1) data set shall be used as the validation data set.
\subsubsection{ACID Compliance}
\label{sec:transaction-workload-acid-compliance}
The Transaction workload requires full ACID support (\autoref{sec:acid-compliance}) from the SUT.
This is tested using the LDBC ACID test suite.
For the specification of this test suite, see \autoref{sec:acid-test} and the related software repository at \url{https://github.com/ldbc/ldbc_finbench_acid}.
\paragraph{Expected level of isolation}
If a transaction reads the database with the intent to update, the DBMS must guarantee no dirty reads. In other words, this
corresponds to read committed isolation.
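A check of this property is included in the ACID test suite; the sketch below merely illustrates the idea using a hypothetical client API and should not be mistaken for the actual test implementation. A writer transaction updates a value and then aborts; a concurrently running reader must never observe the uncommitted value. In practice such a test is repeated many times to increase the chance of catching a violation.
\begin{verbatim}
import threading
import time

def dirty_read_check(open_session, account_id):
    observed = []

    def writer():
        tx = open_session().begin_transaction()          # hypothetical API
        tx.run("set balance of account {} to 999".format(account_id))
        time.sleep(1.0)                                  # hold the uncommitted write
        tx.rollback()                                    # the value 999 never commits

    def reader():
        tx = open_session().begin_transaction()          # hypothetical API
        balance = tx.run("read balance of account {}".format(account_id))
        observed.append(balance)
        tx.commit()

    w = threading.Thread(target=writer)
    r = threading.Thread(target=reader)
    w.start(); time.sleep(0.5); r.start()
    w.join(); r.join()

    # Under read committed (or stronger), the reader must never see 999.
    assert 999 not in observed
\end{verbatim}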
\paragraph{Durability and checkpoints}
A checkpoint is defined as the operation which causes data persisted in a transaction log to become durable outside the transaction log. Specifically, this means that a SUT restart after an instantaneous failure following the completion of the checkpoint may not have recourse to transaction log entries written before the end of the checkpoint.
A checkpoint typically involves a synchronization barrier at which all data committed before the moment is required to be in durable storage that does not depend on the transaction log.
Not all DBMSs use a checkpoint mechanism for durability. For example, a system may rely on redundant storage of data for durability guarantees against the instantaneous failure of a single server.
The measurement window may contain a checkpoint. If the measurement window does not contain one, then the restart test will involve redoing all the updates in the window as part of the recovery test.
The timed window ends with an instantaneous failure of the SUT. Instantaneously killing all the SUT process(es) is adequate for simulating instantaneous failure. All these processes should be killed within one second of each other with an operating system action equivalent to the Unix \verb+kill -9+. If such is not available, then powering down each separate SUT component that has an independent power supply is also possible.
The restart test consists of restarting the SUT process(es) and finishes when the SUT is back online with all its functionality and the last successful update logged by the driver can be seen to be in effect in the database.
%In the case of a distributed (scale-out) system, a particular partition may be recovered whereas another one is still in the process of recovering. If this is so, then checking for the last update shall not be done until all partitions are online.
If the SUT hardware was powered down, the recovery period does not include the reboot and possible file system check time. The recovery time starts when the DBMS software is restarted.
\paragraph{Recovery}
The SUT is to be restarted after the measurement window and the auditor will verify that the SUT contains the entirety of the last update recorded by the test driver(s) as successfully committed. The driver or the implementation has to make this information available. The auditor may also check the \emph{audit log} of the SUT (if available) to confirm that the operations issued by the driver were saved.
Once an official run has been validated, the recovery capabilities of the system must be tested. The system and the driver must be configured in the same way as during the benchmark execution. After a warm-up period, execution of the benchmark will be performed under the same terms as in the previous measured run.
\paragraph{Measuring recovery time}
At an arbitrary point close to 2 hours of wall clock time during the run, the machine will be shut down. Then, the auditor will restart the database system and will check that the last committed update (in the driver log file) is actually in the database. The auditor will measure the time taken by the system to recover from the failure. Also, all the information about how durability is ensured must be disclosed. If checkpoints are used, these must be performed for a period of 10 minutes at most.
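A possible way to script this measurement is sketched below; the restart and lookup calls are hypothetical. The essential points are that timing starts when the DBMS software is restarted and stops when the last update logged by the driver as committed becomes visible.
\begin{verbatim}
import time

def measure_recovery_time(db, last_committed_update):
    # Timing starts when the DBMS software is restarted; reboot and file
    # system check time (if the hardware was powered down) are excluded.
    start = time.monotonic()
    db.restart()                                   # hypothetical API

    # Poll until the SUT is fully functional and the last committed
    # update recorded by the driver is visible in the database.
    while not db.contains(last_committed_update):  # hypothetical API
        time.sleep(1)

    return time.monotonic() - start
\end{verbatim}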
\subsection{Benchmarking Workflow}
\label{sec:transaction-workload-benchmark-workflow}
A benchmark execution is divided into the following processes (these processes are also shown in \autoref{fig:audit-workflow}):
\begin{description}
\item[Generate data] This includes running the data generator, placing the generated files in a staging area,
configuring storage, setting up the SUT configuration and preparing any data partitions in the SUT. This may include
preallocating database space but may not include loading any data or defining any schema having to do with the
benchmark.
\item[Preprocessing] If needed, the output of the data generator is preprocessed (\autoref{sec:auditing-data-format}).
\item[Create validation data] Using one of the reference implementations of the benchmark, the reference validation data is obtained in JSON format.
\item[Data loading] The test sponsor must provide all the necessary documentation and scripts to load the data set
into the database to test. This includes defining the database schema, if any, loading the initial database
population, making this durably stored and gathering any optimizer statistics. The system under test must support
the different data types needed by the benchmark for each of the attributes at their specified precision. No data
can be filtered out, everything must be loaded. The test sponsor must provide a tool to perform arbitrary checks of
the data or a shell to issue queries in a declarative language if the system supports it.
\item[Run cross-validation] This step uses the data loader to populate the database, but the load is not timed. The
validation data set is used to verify the correctness of the SUT. The auditor must load the provided data set and run the driver in validation mode, which will test that the queries provide the
official results. The benchmarking workflow will not go beyond this point unless the results match the expected
output.
\item[Warm-up] Benchmark runs are preceded by a warm-up which must be performed using the LDBC driver.
\item[Run benchmark] The bulk load time is reported and is equal to the amount of elapsed wall clock time between
starting the schema definition and receiving the confirmation message of the end of statistics gathering. The
workflow runs begin after the bulk load is completed. If the run does not directly follow the bulk load, it must
start at a point in the update stream that has not previously been played into the database. In other words, a run
may only include update events whose timestamp is later than the latest creation timestamp already present in the
database before the start of the run. The run starts when the first of the test drivers sends its first request to the
SUT. If the SUT is running in the same process as the driver, the window starts when the driver starts. Also, make sure
that the \verb|-rl/--results_log| option is enabled. Make sure that all operations are enabled and that the frequencies are
those for the selected scale factor (see the exact specification of the frequencies in
\autoref{sec:sf-statistics}).
\end{description}
\subsubsection{Query Timing During Benchmark Run}
\label{sec:ontime-requirements}
A valid benchmark run must last at least 2 hours of wall clock time and at most 2 hours and 15 minutes.
In order to be valid, a benchmark run needs to meet the ``95\% on-time requirement''.
The \texttt{results\_log.csv} file contains the $\mathsf{actual\_start\_time}$ and the $\mathsf{scheduled\_start\_time}$ of each of the issued queries. To have a valid run, 95\% of the queries must meet the following condition:
\begin{equation*}
\mathsf{actual\_start\_time} - \mathsf{scheduled\_start\_time} < 1\
\mathrm{second}
\end{equation*}
If the execution of the benchmark is valid, the auditor must retrieve all the files from the directory specified by
\verb|--results_dir|, which includes the configuration settings used, the results log, and the results summary, all of
which must be disclosed.
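The on-time requirement can be re-checked mechanically from \texttt{results\_log.csv}; a Python sketch follows, assuming that both timestamps are recorded in milliseconds (the threshold must be adjusted if the log uses a different unit).
\begin{verbatim}
import csv

def on_time_requirement_met(results_log_path, threshold_ms=1000):
    total = on_time = 0
    with open(results_log_path, newline="") as f:
        for row in csv.DictReader(f):
            delay = (float(row["actual_start_time"])
                     - float(row["scheduled_start_time"]))
            total += 1
            if delay < threshold_ms:
                on_time += 1
    # The run is valid only if at least 95% of the operations started
    # within 1 second of their scheduled start time.
    return total > 0 and on_time / total >= 0.95
\end{verbatim}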
\subsubsection{Measurement Window}
\label{sec:transaction-workload-measurement-window}
Benchmark runs execute the workload on the SUT in two phases (\autoref{fig:measurement-window-selection}). First, the
SUT must undergo a warm-up period that takes at least 30 minutes and at most 35 minutes. The goal of this is to put the
system in a steady state which reflects how it would behave in a normal operating environment. The performance of the
operations during warm-up is not considered. Next, the SUT is benchmarked during a two-hour measurement window.
Operation times are recorded and checked to ensure the ``95\% on-time requirement'' is satisfied.
\begin{figure}[h]
\centering
\includegraphics[width=.7\linewidth]{figures/measurement-window-selection}
\caption{Warm-up and measurement window for the benchmark run.}
\label{fig:measurement-window-selection}
\end{figure}
The FinBench \DataGen produces 3~years' worth of data, of which 3\% is used for updates
(\autoref{sec:transaction-workload-data-sets}), \ie approximately $3 \times 365 \times 0.03 = 32.85~\text{days} =
788.4~\text{hours}$. To ensure that the 2.5~hour wall clock period has enough input data, the lower bound of the time
compression ratio (TCR) is defined as 0.001 (if $2628$ hours of updates are played back at more than $1000\times$ speed,
the benchmark framework runs out of updates to execute). A system that achieves better compression (\ie a lower TCR
value) on a given scale factor should use larger SFs for its benchmark runs -- otherwise its total run will be shorter
than 2.5~hours, making it unsuitable for auditing.
%The test summary tool may be used for reading the logs created by a test driver.
\subsection{Full Disclosure Report}
\label{sec:transaction-workload-fdr}
Upon successful completion of the audit, an FDR is compiled. In addition to the general requirements, the full disclosure shall cover the following:
\begin{itemize}
\item General terms: an executive summary and declaration of the credibility of the audit
\item Conflict of Interest Statement between the auditor and the test sponsor, if needed.
\item System description and pricing summary
\item Data generation and data loading
\item Test driver details
\item Performance metrics
\item Validation results
\item ACID compliance
\item List of supplementary materials
\end{itemize}
To ensure the reproducibility of the audited results, a supplementary package is attached to the full disclosure report. This package should contain:
\begin{itemize}
\item A README file with instructions specifying how to set up the system and run the benchmark
\item Configuration files of the database, including database-level configuration such as buffer size and schema descriptors (if necessary)
\item Source code or binary of a generic driver that can be used to interact with the DBMS
\item SUT-specific LDBC driver implementation (similarly to the projects in \url{https://github.com/ldbc/ldbc_finbench_transaction_impls})
\item Script or instructions to compile the LDBC Java driver implementation
\item Instructions on how to reach the server through CLI and/or web UI (if applicable), \eg the URL (including port number), username and password
\item LDBC configuration files (\texttt{.properties}), including the \texttt{time\_compression\_ratio} values used in the audited runs
\item Scripts to preprocess the input files (if necessary) and to load the data sets into the database
\item Scripts to create validation data sets and to run the benchmark
\item The implementations of the queries and the update operations, including their complete source code (\eg declarative query specifications, stored procedures, \etc)
\item Implementation of the ACID test suite
\item Binary package of the DBMS (\eg \texttt{.deb} or \texttt{.rpm})
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%