Server trouble‐shooting
Each server component (scheduler, feeder, transitioner, etc.) has its own log file.
These files are in the log_HOSTNAME subdirectory of the project directory.
The logs have entries for errors, indicated by CRITICAL; e.g.
make_work_mt.out:1601921:2022-08-16 20:36:53.1759 [CRITICAL] can't find wu wu_multi_thread_nodelete
This generally means that something that needs to happen isn't happening; you need to figure it out and fix it.
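A minimal Python sketch for pulling the CRITICAL entries out of log text. The line layout it assumes (an optional `file:lineno:` prefix as produced by `grep -n`, then a timestamp, then the level in square brackets) is inferred from the single sample entry above, and the names `parse_entry` and `critical_entries` are hypothetical helpers, not part of BOINC.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Assumed layout, inferred from the sample entry:
#   [<file>:<lineno>:]<date> <time> [<LEVEL>] <message>
# The "<file>:<lineno>:" prefix is what `grep -n` adds when searching several files.
LOG_LINE = re.compile(
    r"^(?:(?P<file>[^:\s]+):(?P<lineno>\d+):)?"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"\[(?P<level>[A-Za-z]+)\]\s*(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    file: Optional[str]   # None when the line has no grep-style prefix
    timestamp: str
    level: str
    message: str


def parse_entry(line: str) -> Optional[LogEntry]:
    """Return the parsed entry, or None if the line does not fit the assumed layout."""
    m = LOG_LINE.match(line.rstrip("\n"))
    if m is None:
        return None
    return LogEntry(m["file"], m["timestamp"], m["level"], m["message"])


def critical_entries(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Yield only the entries whose level is CRITICAL."""
    for line in lines:
        entry = parse_entry(line)
        if entry is not None and entry.level == "CRITICAL":
            yield entry
```

Lines that do not match the assumed layout are skipped rather than reported, so adjust the pattern if your logs look different.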
In addition, log entries can describe normal events. To control the verbosity of the log files:
- Scheduler: set the desired logging options
- File upload handler: set fuh_debug_level.
- Daemons: pass the command-line argument "-d N" (1 = least verbose, 4 = most verbose). If you run server components with -d 4, their database queries will be logged. This is useful for tracking down database-level problems.
If you're interested in the history of a particular job, grep for WU#12345 or RESULT#12345 (where 12345 represents the ID) in the log files.
The html/ops pages also provide an interface for this.
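The log search for a job's history can also be scripted. The Python sketch below looks for the `WU#<id>` or `RESULT#<id>` token in every file of a log directory; unlike a plain `grep WU#12345`, it does not also match longer IDs such as `WU#123456`. The function names and the idea of passing the log directory as an argument are assumptions made for illustration.

```python
import re
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def job_pattern(kind: str, job_id: int) -> "re.Pattern[str]":
    """Build a pattern for WU#<id> or RESULT#<id>."""
    if kind not in ("WU", "RESULT"):
        raise ValueError("kind must be 'WU' or 'RESULT'")
    # (?!\d) stops WU#12345 from matching inside WU#123456.
    return re.compile(rf"\b{kind}#{job_id}(?!\d)")


def matching_lines(lines: Iterable[str], kind: str, job_id: int) -> Iterator[str]:
    """Yield the lines that mention the given job."""
    pattern = job_pattern(kind, job_id)
    for line in lines:
        if pattern.search(line):
            yield line.rstrip("\n")


def job_history(log_dir: str, kind: str, job_id: int) -> Iterator[Tuple[str, str]]:
    """Yield (log file name, line) for every mention of the job in log_dir."""
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps a stray non-UTF-8 byte from aborting the scan.
        with path.open(errors="replace") as fh:
            for line in matching_lines(fh, kind, job_id):
                yield path.name, line
```

For example, `job_history("log_HOSTNAME", "RESULT", 12345)` would walk the log_HOSTNAME subdirectory mentioned earlier, assuming the script is run from the project directory.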
The admin web interface lets you browse your project's database.
You can also use MySQL tools such as:
- The mysql interpreter. The 'show processlist;' query is useful for diagnosing DB performance problems.
- mytop: like 'top' for MySQL: shows running queries.
- phpMyAdmin: general-purpose web interface to MySQL
The command bin/show_shmem will print a textual summary of the contents of the shared-memory structure that caches jobs and information about applications.
- Are workunits (jobs) getting created correctly? Examine the database to see. If you're using a work generator, check its log file.
- Are results (job instances) getting created? Examine the database to see. If you don't see results, check the transitioner log file.
- Are jobs getting into shared memory? Use show_shmem (see above). You should see jobs. If not, check the feeder log file.
- Is the scheduler sending jobs? If not, check its log file, preferably with the following log flags:
  - <debug_version/>: show details of app version selection
  - <debug_send/>: show details of job assignment
  - <debug_quota/>: show details of quota enforcement
- Are clients processing jobs correctly? Check the status and stderr output of completed jobs.
- Are output files getting uploaded? Check the file upload handler log file.
- Are jobs getting validated? Check the validator log file.
- Are jobs getting assimilated? Check the assimilator log file.
If the scheduler is acting incorrectly or crashing, and you like mucking around in C++ source code, you can run it under a debugger like gdb.
The scheduler is a CGI program;
it reads a request from stdin and writes a reply to stdout.
So you can debug it as follows:
- Copy the "scheduler_request_X.xml" file from a client to the machine running the scheduler. (X = your project URL)
- Run the scheduler under the debugger, giving it this file as stdin, i.e.:
  gdb cgi
  (set a breakpoint if desired)
  r < scheduler_request_X.xml
- You may have to doctor the database as follows to keep the scheduler from rejecting the request:
  update host set rpc_seqno=0, rpc_time=0 where id=N;
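Because the scheduler reads the request from stdin and writes the reply to stdout, the same replay can be done without a debugger when you only want to see the reply. The following Python sketch feeds a saved request file to a command and captures what it prints. `run_cgi` is a hypothetical helper; the sketch assumes you start it from the directory in which you would otherwise run `gdb cgi`, and it makes no assumption about any other setup the scheduler may need.

```python
import subprocess
from typing import Sequence


def run_cgi(argv: Sequence[str], request_path: str,
            timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run a CGI-style program with the request file connected to its stdin.

    The reply is available as bytes in .stdout, diagnostics in .stderr,
    and the exit status in .returncode.
    """
    with open(request_path, "rb") as request:
        return subprocess.run(
            list(argv),
            stdin=request,            # equivalent of "< scheduler_request_X.xml"
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,          # raises TimeoutExpired if the program hangs
        )
```

For instance, `run_cgi(["./cgi"], "scheduler_request_X.xml").stdout` would hold the scheduler's reply. A negative `returncode` means the process was killed by a signal, which is the case where switching to gdb is worthwhile.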
As an alternative to this, edit sched/handle_request.cpp, and put a call to debug_sched("debug_sched"); just before sreply.write(fout, sreq);.
Then, after recompiling, touch a file called 'debug_sched' in the project root directory.
This will cause transcripts of all subsequent scheduler requests and replies to be written to the cgi-bin/ directory, with separate small files for each request.
The file names are sched_request_H_R and sched_reply_H_R, where H=hostid and R=rpc sequence number.
This can be turned off by deleting the 'debug_sched' file.
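With many hosts contacting the project, the cgi-bin/ directory fills up quickly, so it can help to group the transcripts by host. The Python sketch below parses the sched_request_H_R and sched_reply_H_R names and lists one host's exchanges in sequence order. It assumes H and R are written as plain decimal numbers with no file extension; the helper names are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Assumption: names look exactly like sched_request_<hostid>_<seqno>
# or sched_reply_<hostid>_<seqno>, with decimal numbers and no extension.
TRANSCRIPT_NAME = re.compile(r"^sched_(request|reply)_(\d+)_(\d+)$")


def parse_transcript_name(name: str) -> Optional[Tuple[str, int, int]]:
    """Return (kind, hostid, seqno), or None for unrelated file names."""
    m = TRANSCRIPT_NAME.match(name)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))


def pair_transcripts(names: Iterable[str]) -> Dict[Tuple[int, int], Dict[str, str]]:
    """Group file names by (hostid, seqno); each value maps kind -> file name."""
    pairs: Dict[Tuple[int, int], Dict[str, str]] = {}
    for name in names:
        parsed = parse_transcript_name(name)
        if parsed is None:
            continue  # skip the scheduler binary and anything else in cgi-bin/
        kind, hostid, seqno = parsed
        pairs.setdefault((hostid, seqno), {})[kind] = name
    return pairs


def transcripts_for_host(
    names: Iterable[str], hostid: int
) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Return (seqno, request file, reply file) for one host, ordered by seqno."""
    pairs = pair_transcripts(names)
    return [
        (seqno, files.get("request"), files.get("reply"))
        for (host, seqno), files in sorted(pairs.items())
        if host == hostid
    ]
```

Passing `os.listdir("cgi-bin")` as `names` would cover the directory named above; sorting by the numeric sequence number avoids the ordering problem of a plain `ls`, where `_12` sorts before `_3`.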
To get core files for scheduler crashes, uncomment the following line in sched/sched_main.cpp, and recompile:
#define DUMP_CORE_ON_SEGV 1