Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Missing definition for ldexpf" and hipErrorTbd run-time errors in benchmark involving __builtin_powf #880

Open
jjennychen opened this issue Jun 30, 2024 · 0 comments

Comments

@jjennychen
Copy link
Collaborator

The following error:

CHIP warning [TID 73532] [1719787184.122714146] : Missing definition for 'ldexpf'
CHIP error [TID 73532] [1719787184.191923889] : hipErrorTbd (ZE_RESULT_ERROR_INVALID_MODULE_UNLINKED ) in /home/tsaini.chen/chipStar/src/backend/Level0/CHIPBackendLevel0.cc:2496:compile

CHIP error [TID 73532] [1719787184.192045641] : Caught Error: hipErrorTbd

was found when running a HeCBench benchmark (bn-cuda), which was compiled with -O3 flag. This error comes up even though the program does not call the ldexpf function, and it is checked that chipStar does have a ldexpf definition here that uses OpenCL ldexp function.
The trace was checked, and it is found that the zeModuleCreate returns ZE_RESULT_SUCCESS even though there was an error in the build log (unresolved external symbol ldexpf), and because the kernel was failed to be created with ZE_RESULT_ERROR_INVALID_MODULE_UNLINKED, the kernel launch fails during run-time.

17:15:25.983287715 - chiatta00 - vpid: 125926, vtid: 125926 - lttng_ust_ze:zeModuleCreate_entry: { hContext: 0x00005628e97d0be0, hDevice: 0x00005628e97e3b50, desc: 0x00007fff9017adf0, phModule: 0x00007fff9017afd8, phBuildLog: 0x00007fff9017aeb8, desc_val: { stype: ZE_STRUCTURE_TYPE_MODULE_DESC, pNext: 0x00007fff9017ae40, format: ZE_MODULE_FORMAT_IL_SPIRV, inputSize: 0, pInputModule: 0x0000000000000000, pBuildFlags: 0x0000000000000000, pConstants: 0x0000000000000000 }, desc__pBuildFlags_val: "(null)", desc__pConstants_val: , desc__pConstants__pConstantIds_vals: [], desc__pConstants__pConstantValues_vals: [  ] }
17:15:25.983350717 - chiatta00 - vpid: 125926, vtid: 125949 - lttng_ust_ze:zeEventQueryStatus_entry: { hEvent: 0x00005628e980f3e0 }
17:15:25.983350911 - chiatta00 - vpid: 125926, vtid: 125949 - lttng_ust_ze:zeEventQueryStatus_exit: { zeResult: ZE_RESULT_SUCCESS }
17:15:25.983351729 - chiatta00 - vpid: 125926, vtid: 125949 - lttng_ust_ze:zeEventQueryStatus_entry: { hEvent: 0x00005628ea61a3b0 }
17:15:25.983351994 - chiatta00 - vpid: 125926, vtid: 125949 - lttng_ust_ze:zeEventQueryStatus_exit: { zeResult: ZE_RESULT_SUCCESS }
17:15:25.983607311 - chiatta00 - vpid: 125926, vtid: 125949 - lttng_ust_ze:zeEventQueryStatus_entry: { hEvent: 0x00005628e980f730 }
17:15:25.983607499 - chiatta00 - vpid: 125926, vtid: 125949 - lttng_ust_ze:zeEventQueryStatus_exit: { zeResult: ZE_RESULT_SUCCESS }
17:15:26.539978394 - chiatta00 - vpid: 125926, vtid: 125926 - lttng_ust_ze_build:log: { buildLog: "error : unresolved external symbol ldexpf at offset 1036 in instructions segment #1 (aka kernel : _Z13computeKerneliiPKfPKbiiPfPi)\nerror : unresolved external symbol ldexpf at offset 1076 in instructions segment #1 (aka kernel : _Z13computeKerneliiPKfPKbiiPfPi)\n" }
17:15:26.539981951 - chiatta00 - vpid: 125926, vtid: 125926 - lttng_ust_ze:zeModuleCreate_exit: { zeResult: ZE_RESULT_SUCCESS, phModule_val: 0x00005628e9cd0460, phBuildLog_val: 0x00005628e99e8a50 }
17:15:26.539983380 - chiatta00 - vpid: 125926, vtid: 125926 - lttng_ust_ze:zeModuleGetKernelNames_entry: { hModule: 0x00005628e9cd0460, pCount: 0x00007fff9017afd8, pNames: 0x0000000000000000, pCount_val: 0 }
17:15:26.539983859 - chiatta00 - vpid: 125926, vtid: 125926 - lttng_ust_ze:zeModuleGetKernelNames_exit: { zeResult: ZE_RESULT_SUCCESS, pCount_val: 3, pNames_vals: [  ] }
17:15:26.539984107 - chiatta00 - vpid: 125926, vtid: 125926 - lttng_ust_ze:zeModuleGetKernelNames_entry: { hModule: 0x00005628e9cd0460, pCount: 0x00007fff9017afd8, pNames: 0x00007fff9017ac50, pCount_val: 3 }
17:15:26.539984519 - chiatta00 - vpid: 125926, vtid: 125926 - lttng_ust_ze:zeModuleGetKernelNames_exit: { zeResult: ZE_RESULT_SUCCESS, pCount_val: 3, pNames_vals: [ 0x00005628ea6e7dd0, 0x00005628e99fbaa0, 0x00005628ea26a610 ] }
17:15:26.539987746 - chiatta00 - vpid: 125926, vtid: 125926 - lttng_ust_ze:zeKernelCreate_entry: { hModule: 0x00005628e9cd0460, desc: 0x00007fff9017af50, phKernel: 0x00007fff9017aeb8, desc_val: { stype: ZE_STRUCTURE_TYPE_KERNEL_DESC, pNext: 0x0000000000000000, flags: [  ], pKernelName: 0x00005628e9d4f5c0 }, desc__pKernelName_val: "_Z14genScoreKerneliPfPKiPKf" }
17:15:26.539988726 - chiatta00 - vpid: 125926, vtid: 125926 - lttng_ust_ze:zeKernelCreate_exit: { zeResult: ZE_RESULT_ERROR_INVALID_MODULE_UNLINKED, phKernel_val: 0x00005628e99e8a50 }

After digging in, the error seems to be coming from the powf call in the program (please see the reproducer at the end of this issue for reference), and the __builtin_powf in one of the two powf definitions seems to be causing the error:

#if defined __has_builtin && __has_builtin(__builtin_powf)
static inline __device__ float powf(float x, float y) {
return __builtin_powf(x, y);
}
#else
extern "C++" __device__ float pow(float x, float y); // OpenCL
extern "C++" inline __device__ float powf(float x, float y) {
// TODO This function is affected by the --use_fast_math compiler flag
return ::pow(x, y);
}
#endif

When commented out the __builtin_powf and forced using the OpenCL pow function as powf definition, the error disappeared. However, if we only call powf in a reproducer, no errors are observed, so it seems like the __builtin_powf is not the only source of the error. Also, the program has to be compiled with an optimization flag for the error to show up (tested -O, -O1, -O2, and -O3, all of which produce the error).

[Reproducer]

  1. Clone and build chipStar
  2. Create a reproducer.cu file and paste the following code:
__global__ void kernel() {
  float lsinblock[10000] = { 0 };
  int t = 0;
  //int a = 0; // used for following testing

  for (int i=0; i<10; i++) {
    t = (int)lsinblock[(int)powf(2.0, i)+t]; // error
    //powf(2.0, i); // works
    //a = (int)powf(2.0, i); // works
    //a = (int)lsinblock[(int)powf(2.0,i)+t]; // works
    //a = (int)lsinblock[(int)powf(2.0,i)+0]; // works
    //t = (int)lsinblock[(int)powf(2.0,i)+0]; // works
    //t = (int)lsinblock[(int)powf(2.0,i)+5]; // works
  }
}

int main(int argc, char** argv) {
  int N = 1<<20;

  kernel<<<(N+255)/256, 256, 256 * sizeof(float)>>>();

  printf("done\n");
}
  1. Compile the code with nvcc -O3 reproducer.cu
  2. Run the program with ./a.out

The error shown above should pop up.

[Notes on the reproducer]
The lines with // works were individually tested to run without errors.

@jjennychen jjennychen changed the title "Missing definition for ldexpf" and hipErrorTbd run-time errors "Missing definition for ldexpf" and hipErrorTbd run-time errors in benchmarks involves __builtin_powf Jun 30, 2024
@jjennychen jjennychen changed the title "Missing definition for ldexpf" and hipErrorTbd run-time errors in benchmarks involves __builtin_powf "Missing definition for ldexpf" and hipErrorTbd run-time errors in benchmark involving __builtin_powf Jun 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant