Automatically collect and track benchmark results over time #633

chippmann · 2024-05-13T20:23:46Z

We have discussed this internally a few times and have already outlined the first steps.
The goal is to track the performance of our binding over time in an automated and reproducible way.

As a reminder: we already have a benchmark project that tests the performance of our binding in key areas against typed and untyped GDScript. It already outputs usable JSON with key metrics and raw data.
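To make the kind of data involved concrete, here is a minimal sketch of how raw samples could be condensed into per-benchmark metrics. The metric names (`avg`, `min`, `max`, `median`, `p05`, `p95`) and the `data -> benchmark -> language` nesting are taken from what the Apps Script in this issue reads. Everything else is an assumption for illustration: the helper names `percentile` and `summarize`, the interpolation method, and the benchmark and language keys. The actual benchmark project may compute these differently.

```javascript
// Hypothetical helper: percentile by linear interpolation between the two
// closest ranks. The interpolation method is an assumption, not the
// benchmark project's confirmed behaviour.
function percentile(sorted, p) {
  const rank = (sorted.length - 1) * p;
  const lower = Math.floor(rank);
  const upper = Math.ceil(rank);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (rank - lower);
}

// Hypothetical helper: condenses raw samples into the metric names that the
// Apps Script in this issue expects.
function summarize(samples) {
  if (samples.length === 0) {
    throw new Error('cannot summarize an empty sample list');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const sum = sorted.reduce((acc, value) => acc + value, 0);
  return {
    avg: sum / sorted.length,
    min: sorted[0],
    max: sorted[sorted.length - 1],
    median: percentile(sorted, 0.5),
    p05: percentile(sorted, 0.05),
    p95: percentile(sorted, 0.95),
  };
}

// Assumed payload shape: data -> benchmark name -> language -> metrics.
// 'example_benchmark' and 'gdscript_typed' are made-up keys.
const payload = {
  data: {
    example_benchmark: {
      gdscript_typed: summarize([12, 10, 11, 13, 14]),
    },
  },
};
console.log(JSON.stringify(payload, null, 2));
```

Keeping the summary step in one small function would make it easy to add further metrics later without touching the upload or aggregation side.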

While this is useful, it was never a clear indication of our performance because it was never run on the same machine over time and had to be executed manually.

Hence we decided on the following, which this issue tracks:

Set up a self-hosted Windows 11 GitHub runner (we cannot use one provided by GitHub, as those do not have consistent performance characteristics, e.g. different CPU models with different IPC)
Set up a GitHub workflow which runs the benchmarks after changes
Aggregate the resulting data automatically in spreadsheets and visualize the performance over time (here we talked about Firebase Cloud Functions and Google Sheets; more on that later)
Make these metrics publicly available for full transparency

The following has already been done:

I set up a self-hosted GitHub runner at home on dedicated hardware. The sole purpose of this runner is to run these benchmarks.
A test spreadsheet is set up (note: the current data in this sheet is NOT representative! It's just TEST DATA!) and data is automatically aggregated with the following Apps Script (a first draft):

// ID of the target spreadsheet (placeholder)
const spreadsheetId = '<sheet_id>';

function doPost(e) {
  var jsonData = JSON.parse(e.postData.contents);
  var timestamp = new Date().toLocaleString();
  var spreadsheet = SpreadsheetApp.openById(spreadsheetId);
  
  // Removing all existing charts in the Dashboard
  var dashboard = spreadsheet.getSheetByName("Dashboard");
  if(dashboard) {
    var charts = dashboard.getCharts();
    for(var i=0; i<charts.length; i++) {
      dashboard.removeChart(charts[i]);
    }
  }
  
  for(var benchmark in jsonData['data']) {
    for(var language in jsonData['data'][benchmark]) {
      // Naming the sheet as 'benchmark|language'
      var sheetName = benchmark+"|"+language;
      var sheet = spreadsheet.getSheetByName(sheetName);
      
      // If the sheet does not exist, create a new one
      if(!sheet) {
        sheet = spreadsheet.insertSheet(sheetName);
        var header = ['Timestamp', 'avg', 'min', 'max', 'median', 'p05', 'p95'];
        sheet.appendRow(header);
      }
      
      // Convert JSON data to row data
      var row = [timestamp];
      row.push(jsonData['data'][benchmark][language]['avg']);
      row.push(jsonData['data'][benchmark][language]['min']);
      row.push(jsonData['data'][benchmark][language]['max']);
      row.push(jsonData['data'][benchmark][language]['median']);
      row.push(jsonData['data'][benchmark][language]['p05']);
      row.push(jsonData['data'][benchmark][language]['p95']);
      
      // Append language data to sheet
      sheet.appendRow(row);
    }
  }

  // Create benchmark comparison line graphs 
  var allSheets = spreadsheet.getSheets();
  if(!dashboard) {
    dashboard = spreadsheet.insertSheet("Dashboard");
  }

  var benchmarkCharts = {};
  var benchmarkChartsSeries = {};

  for(var i=0; i<allSheets.length; i++) {
    var sheetName = allSheets[i].getName();
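    // Only process 'benchmark|language' sheets that have at least one data row;
    // other sheets (e.g. an empty default sheet) would make getRange() below throw
    if(sheetName != "Dashboard" && sheetName.indexOf("|") !== -1 && allSheets[i].getLastRow() > 1){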
      var benchmarkName = sheetName.split("|")[0];
      var languageName = sheetName.split("|")[1];
      var lastRow = allSheets[i].getLastRow();
      
      // If it's a new benchmark, initialize a new chart builder
      if(!(benchmarkName in benchmarkCharts)) {
        benchmarkChartsSeries[benchmarkName] = [{labelInLegend: languageName}];
        benchmarkCharts[benchmarkName] = dashboard.newChart()
          .asLineChart()
          .setOption('title', 'Performance Trend for ' + benchmarkName)
          .setOption('hAxis.title', 'Time')
          .setOption('vAxis.title', 'Average Score');
      } else {
        benchmarkChartsSeries[benchmarkName].push({labelInLegend: languageName});
      }
      
      // Adding the range from current sheet excluding header and adding language as a series
      benchmarkCharts[benchmarkName].addRange(allSheets[i].getRange(2, 1, lastRow - 1, 2));
    }
  }
  
  // Build and insert all benchmark charts
  var position = 1;
  for(var benchmark in benchmarkCharts) {
    // Update position for the new chart.
    benchmarkCharts[benchmark]
      .setPosition(position * 20, 1, 0, 0)
      .setOption('series', benchmarkChartsSeries[benchmark]);
    var chart = benchmarkCharts[benchmark].build();
    dashboard.insertChart(chart);
    position++; // Update position for the next chart
  }
}
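Since the script above depends on `SpreadsheetApp`, it can only run inside Apps Script. One way to test the row-building step locally is to pull it into a pure function, sketched below. The name `payloadToRows` and the validation behaviour (throwing on a missing metric) are assumptions for illustration, not part of the draft above.

```javascript
// Metric columns, in the same order as the sheet header in the draft script.
const METRICS = ['avg', 'min', 'max', 'median', 'p05', 'p95'];

// Hypothetical pure helper: maps a payload to { 'benchmark|language': row }
// without touching SpreadsheetApp, so it can be exercised outside Apps Script.
function payloadToRows(jsonData, timestamp) {
  if (!jsonData || typeof jsonData.data !== 'object' || jsonData.data === null) {
    throw new Error("payload has no 'data' object");
  }
  const rows = {};
  for (const [benchmark, languages] of Object.entries(jsonData.data)) {
    for (const [language, metrics] of Object.entries(languages)) {
      const row = [timestamp];
      for (const metric of METRICS) {
        // Assumed policy: reject incomplete entries instead of writing blanks.
        if (typeof metrics[metric] !== 'number') {
          throw new Error(`${benchmark}|${language} is missing '${metric}'`);
        }
        row.push(metrics[metric]);
      }
      rows[`${benchmark}|${language}`] = row;
    }
  }
  return rows;
}
```

With such a helper, `doPost` would only need to look up or create each sheet by key and call `appendRow` with the prepared row.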

The following needs to be done:

Give RDP access to other maintainers for maintenance
Add the self-hosted runner to the utopia-rise organisation
Set up a workflow to run the benchmarks on changes
Set up the final spreadsheet and Apps Script

The following points can be improved at a later stage:

Possibly migrate from Apps Script and Google Sheets to Firebase Cloud Functions and Firestore (or Supabase equivalents)
Run benchmarks as part of the PR pipeline to catch possible performance problems during the PR process
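For the PR pipeline idea, a check along these lines could compare a PR's results against a baseline run. This is only a sketch under stated assumptions: the function name `findRegressions`, the 10% default threshold, and the premise that a lower `avg` is better are all hypothetical. If `avg` is a score where higher is better, pass `lowerIsBetter = false`.

```javascript
// Hypothetical regression check: lists benchmark/language pairs whose 'avg'
// got worse by more than `threshold` (relative change) compared to a baseline.
// Both arguments use the payload shape read by the Apps Script in this issue.
function findRegressions(baseline, current, threshold = 0.1, lowerIsBetter = true) {
  const regressions = [];
  for (const [benchmark, languages] of Object.entries(current.data)) {
    for (const [language, metrics] of Object.entries(languages)) {
      const base = baseline.data[benchmark]?.[language];
      if (!base || base.avg <= 0) continue; // no usable baseline to compare against
      // Positive `change` always means "got worse", whichever direction is better.
      const change = lowerIsBetter
        ? (metrics.avg - base.avg) / base.avg
        : (base.avg - metrics.avg) / base.avg;
      if (change > threshold) {
        regressions.push({ benchmark, language, change });
      }
    }
  }
  return regressions;
}
```

A workflow step could fail the PR check, or just post a comment, when the returned list is non-empty. Because the dedicated runner still has some run-to-run noise, the threshold would likely need tuning against real data.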

chippmann added the performance label (Related to performance problem) May 13, 2024

chippmann mentioned this issue May 16, 2024

Improve ci/cd cache setup #638

Merged
