pmproxy[1696354]: segfault at 7 ip 00007f2cd8644e2a sp 00007ffd0335e440 error 4 in libpcp_web.so.1 #1815
It's not about the data: I can load the data with 'pmseries --load ' on a Fedora 38 system with pcp-6.1.0-1.fc38.x86_64, and then executing 'wget "http://localhost:44322/series/instances?series=5f8b5c6d7865695d21729b676d8168cfa070d0a1"' succeeds.
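For context, a minimal sketch of how the archives could be loaded and the failing endpoint queried; the archive path is hypothetical, and the host/port are taken from the command above:

```sh
# Load the PCP archive into the Redis-backed fast query store
# (replace /path/to/archive with the actual archive; hypothetical path)
pmseries --load /path/to/archive

# Query the instances of the affected series ID over the pmproxy REST API
wget -qO- "http://localhost:44322/series/instances?series=5f8b5c6d7865695d21729b676d8168cfa070d0a1"
```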
Before segfaulting, pmproxy logs this:
@christianhorn do you have a way I can reproduce this? I'm not seeing similar behaviour here, doing instances queries on an openmetrics metric series ID ... I must be missing some characteristic of your metrics. Looking at the code, that bad mapping diagnostic suggests some metric/instance metadata has not been indexed in Redis for some reason, and if that's the root cause then the crash may be a cascading error. Re pcp-6.1.0 vs pcp-6.1.1 - does it work OK with one but not the other? Thanks.
On the affected system, whenever I run the query with wget (or from Grafana), I see the crash again. I cannot set up graphs in Grafana for some metrics because of this, as each access fails. Good idea, let me try pcp-6.1.0.
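A small sketch of one way to confirm the crash on each access, assuming systemd and that the kernel segfault message (as in the issue title) lands in the journal:

```sh
# Terminal 1: watch kernel messages for the pmproxy segfault report
journalctl -k -f | grep --line-buffered pmproxy

# Terminal 2: trigger the crash by issuing the failing query
wget -qO- "http://localhost:44322/series/instances?series=5f8b5c6d7865695d21729b676d8168cfa070d0a1"
```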
> Good idea, let me try pcp-6.1.0.

Oh, I thought that's what you were saying in the second comment already?
Ah sorry, I should be more specific. I saw the issue initially on Debian Bookworm with PCP 6.1.1, had for testing loaded these archive files with pmseries into a Fedora system with pcp-6.1.0-1.fc38, and did not see any issues there. More tests:
- Own build of PCP 6.1.1 on that Debian Bookworm: pmproxy crashes.
- On a pristine, freshly installed Debian Bookworm KVM guest, the same PCP 6.0.3 from the Debian repo does not crash.

Checksums of /usr/lib/pcp/bin/pmproxy and all of the libraries it was linked against are the same on both systems. I tried to compare strace runs of pmproxy receiving the request, but did not get far with that. I reckon this must be something fishy about the installation.. but I am not sure how to pin the issue down further.
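As a side note, a sketch of how the binary and its linked libraries could be checksummed for comparison across the two hosts (the use of sha256sum and ldd here is an assumption, not necessarily the exact commands used):

```sh
# Checksum pmproxy and every shared library it links against;
# run on both hosts and diff the output
BIN=/usr/lib/pcp/bin/pmproxy
sha256sum "$BIN" $(ldd "$BIN" | awk '/=>/ {print $3}') | sort -k2
```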
Can you find a minimal openmetrics configuration that I can run locally to reproduce it here? Thanks.
I see the crashing pmproxy without extra pmdas; even reduced to "pmdas: root pmcd pmproxy linux" the crash occurs. I'm not able to replicate it in a newly set up Debian Bookworm amd64 deployment. Both the working and the non-working systems are amd64 KVM guests.
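A sketch of one way to confirm which PMDAs are actually configured and active on each host (the pmcd.conf path and the pmcd.agent metrics are the standard PCP locations, worth double-checking on Debian):

```sh
# List the PMDAs configured for pmcd (ignoring comment lines)
grep -v '^#' /etc/pcp/pmcd/pmcd.conf

# Ask pmcd which agents are currently connected and their status
pminfo -f pmcd.agent.status
```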
On the working system I installed the same packages as on the non-working system, but I am not seeing the crashes there. The pmproxy logfile has this:
Hm.. so I have a plain new deployment of Debian Bookworm with the PCP packages from the Bookworm repos, and do not see the crash there. The last part of an strace on that box (before the strace diverges from the system where pmproxy crashes):
The same part on the system where pmproxy crashes:
Handle 15 is the connection to Redis, so right after communicating with Redis, pmproxy writes the log entry and crashes.
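For reference, a sketch of how the failing pmproxy could be traced live instead of comparing full strace logs; the syscall filter here is an assumption, fd 15 is taken from the comment above:

```sh
# Attach to the running pmproxy and watch only I/O syscalls with timestamps,
# then trigger the failing wget request from another terminal
strace -f -tt -p "$(pidof pmproxy)" -e trace=read,write,recvfrom,sendto
```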
Is there anything in the Redis log file? You can also run 'redis-cli monitor' in another terminal and watch the Redis commands arrive up until the failure point; that may prove helpful. Could there be a difference in Redis versions between the working/failing machines, I wonder? I've audited the PCP code and can see where that diagnostic is produced ... but cannot see any code issue that could result in SIGSEGV around there. Another helpful piece of information would be the pmproxy stack trace from when it crashes (got abrt there?).
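A sketch of the suggested checks, assuming Redis runs locally on its default port:

```sh
# Watch every command Redis receives while the failing request is replayed
redis-cli monitor

# Compare the Redis server versions on the working and failing machines
redis-server --version
redis-cli info server | grep redis_version
```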
The ok/not-ok systems have the same versions of Redis, both package-wise and by checksums of the bits on the file systems. On the system where pmproxy does not crash, monitor reports:
On the system with the crashing pmproxy:
These last 4 are queries from the newly spawned pmproxy, after the crash. If I execute the last query towards pmproxy manually with
..then pmproxy does not crash:
When I look with strace at redis on the box where pmproxy is crashing, I observe this:
Redis is writing that to pmproxy.
Stack trace:
The np@entry=0x7 is the cause of the SIGSEGV - it's not null (0x0), but it is also not a valid pointer address, so when it's dereferenced as such in skip_free_value_set, bad things happen. The question is, how did it get to be this value? It could be a use-after-free (valgrind would tell us more there) - auditing the code once more, I cannot see it just from a visual inspection.
Output from
and then doing the request with wget.
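For completeness, a sketch of how pmproxy might be run under valgrind to chase the suspected use-after-free; the binary path is taken from an earlier comment, while the pmproxy options and the pcp user are assumptions to verify against pmproxy(1):

```sh
# Stop the service so the port is free, then run pmproxy under valgrind
sudo systemctl stop pmproxy
sudo -u pcp valgrind --tool=memcheck --track-origins=yes --error-limit=no \
    /usr/lib/pcp/bin/pmproxy -f -l /tmp/pmproxy.valgrind.log
# (-f/-l assumed: run in foreground, log to file)
# ... then trigger the failing wget request from another terminal
```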
On Debian 12/bookworm amd64, with a self-built PCP as of today (6.1.1), I see this crash when accessing some metrics via Grafana or, for example, via 'wget "http://localhost:44322/series/instances?series=5f8b5c6d7865695d21729b676d8168cfa070d0a1"':
Apparently only some metrics are affected, all of them via pmda-openmetrics; other metrics via pmda-openmetrics work fine. When accessing the metrics which lead to the crash with 'pmrep -a' directly from the archive files, the metrics are available and look OK.
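For illustration, a hedged sketch of the two access paths being compared; the archive path and metric name are placeholders, not the actual ones from this report:

```sh
# Direct archive access (works): replay a metric straight from the archive
# (/path/to/archive and openmetrics.example.metric are hypothetical)
pmrep -a /path/to/archive openmetrics.example.metric

# REST access through pmproxy (crashes for some series IDs)
wget -qO- "http://localhost:44322/series/instances?series=5f8b5c6d7865695d21729b676d8168cfa070d0a1"
```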