---
sidebar_label: Fault Diagnosis and Analysis
sidebar_position: 5
slug: /fault_diagnosis_and_analysis
---

import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';

# Fault Diagnosis and Analysis

## Client log

The JuiceFS client outputs logs for troubleshooting while running. Log levels, in increasing order of severity, are DEBUG < INFO < WARNING < ERROR < FATAL. DEBUG logs are not printed by default; to enable them, explicitly add the `--debug` option when running the JuiceFS client.
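For example, a mount command with DEBUG logging enabled might look like this (the metadata URL and mount point below are placeholders for your own deployment):

```bash
# Enable DEBUG-level logs for this client; replace the Redis URL
# and mount point with your own values.
juicefs mount --debug redis://localhost:6379/1 /mnt/jfs
```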

Different JuiceFS clients print logs in different ways, which are described as follows.

### Mount point

When a JuiceFS file system is mounted with the `-d` option (running in the background), logs are printed to the system log and a local log file simultaneously. The path of the local log file depends on the user who mounts the file system: for root it is `/var/log/juicefs.log`, while for non-root users it is `$HOME/.juicefs/juicefs.log`. Please refer to the `--log` option for details.

Depending on the operating system, there are different commands to retrieve system logs or read local log files directly.

```bash
# Read the local log file directly
tail -n 100 /var/log/juicefs.log

# Retrieve system logs on macOS
syslog | grep 'juicefs'

# Retrieve system logs on Debian/Ubuntu
cat /var/log/syslog | grep 'juicefs'

# Retrieve system logs on CentOS/RHEL
cat /var/log/messages | grep 'juicefs'
```

You can use the grep command to filter different levels of logs for performance analysis or troubleshooting:

```bash
cat /var/log/syslog | grep 'juicefs' | grep '<ERROR>'
```

### Kubernetes CSI Driver

Depending on the version of the JuiceFS CSI Driver, there are different ways to retrieve logs. Please refer to CSI Driver documentation for details.

### S3 Gateway

The S3 gateway can only run in the foreground, so client logs are output directly to the terminal. If you deploy the S3 gateway in Kubernetes, you can get logs from the corresponding pods.
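For instance, assuming the gateway was deployed as a Deployment named `juicefs-s3-gateway` in the `kube-system` namespace (both names are hypothetical; adjust them to your setup), you could fetch its logs with:

```bash
# Deployment name and namespace are placeholders for your deployment.
kubectl -n kube-system logs deploy/juicefs-s3-gateway
```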

### Hadoop Java SDK

The JuiceFS client logs will be mixed into the logs of processes using JuiceFS Hadoop Java SDK, e.g. Spark executor. Thus, you need to use keywords, e.g. juicefs (case-insensitive), to filter out the logs you do not want.
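For example, on a YARN cluster you might pull the aggregated logs of a finished application and filter them (the application ID is a placeholder):

```bash
# <application-id> is a placeholder for your YARN application ID.
yarn logs -applicationId <application-id> | grep -i juicefs
```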

## Access log

Each JuiceFS client has an access log that records all operations on the file system in detail, such as operation type, user ID, group ID, file inodes and time cost. Access logs can be used for various purposes such as performance analysis, auditing, and troubleshooting.

### Access log format

An access log entry looks like this:

```
2021.01.15 08:26:11.003330 [uid:0,gid:0,pid:4403] write (17669,8666,4993160): OK <0.000010>
```

The meaning of each column is:

- `2021.01.15 08:26:11.003330`: the time of the current operation
- `[uid:0,gid:0,pid:4403]`: user ID, group ID, and process ID of the current operation
- `write`: operation type
- `(17669,8666,4993160)`: the input parameters of the current operation type. For example, for the `write` operation they are the inode of the written file, the size of the written data, and the write offset. Different operation types have different parameters; refer to the `vfs.go` file for details.
- `OK`: whether the current operation succeeded. If not, specific failure information is output instead.
- `<0.000010>`: the time (in seconds) the current operation took
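Since the format is line-oriented, standard shell tools work well for quick analysis. As a rough sketch (field positions as described above, mount point assumed to be `/jfs`), the following captures about 10 seconds of access log and lists the slowest operations first:

```bash
# $4 is the operation type and $NF the trailing <latency> column;
# strip the angle brackets, then sort numerically by latency.
timeout 10 cat /jfs/.accesslog |
  awk '{gsub(/[<>]/, "", $NF); print $NF, $4}' |
  sort -rn | head
```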

You can debug and analyze performance issues with the access log, or try `juicefs profile <mount-point>` to see real-time statistics. Run `juicefs profile -h` or refer to Operations Profiling for details.

Different JuiceFS clients obtain access log in different ways, which are described below.

### Mount point

There is a virtual file named `.accesslog` in the root directory of the JuiceFS mount point. Its contents can be viewed with the `cat` command (the command will not exit). For example, assuming the root directory of the mount point is `/jfs`:

```bash
$ cat /jfs/.accesslog
2021.01.15 08:26:11.003330 [uid:0,gid:0,pid:4403] write (17669,8666,4993160): OK <0.000010>
2021.01.15 08:26:11.003473 [uid:0,gid:0,pid:4403] write (17675,198,997439): OK <0.000014>
2021.01.15 08:26:11.003616 [uid:0,gid:0,pid:4403] write (17666,390,951582): OK <0.000006>
```

### Kubernetes CSI Driver

Please refer to the CSI Driver documentation to find the mount pod or CSI Driver pod, depending on the version of the JuiceFS CSI Driver you are using. The `.accesslog` file can then be viewed in the root directory of the JuiceFS mount point inside the pod; the mount point path in the pod is `/jfs/<pv_volumeHandle>`. For example, assuming a mount pod named `juicefs-1.2.3.4-pvc-d4b8fb4f-2c0b-48e8-a2dc-530799435373`, in which `pvc-d4b8fb4f-2c0b-48e8-a2dc-530799435373` is the `<pv_volumeHandle>`, you can use the following command to view the `.accesslog` file:

```bash
kubectl -n kube-system exec juicefs-1.2.3.4-pvc-d4b8fb4f-2c0b-48e8-a2dc-530799435373 -- cat /jfs/pvc-d4b8fb4f-2c0b-48e8-a2dc-530799435373/.accesslog
```

### S3 Gateway

The S3 gateway does not output an access log by default. To enable it, add the `--access-log` option when starting the gateway to specify where the log should be written.
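For example (credentials, metadata URL, listen address, and log path below are all placeholders; depending on your JuiceFS version the credential variables may be `MINIO_ACCESS_KEY`/`MINIO_SECRET_KEY` instead):

```bash
# All values below are placeholders for your own deployment.
export MINIO_ROOT_USER=admin
export MINIO_ROOT_PASSWORD=12345678
juicefs gateway --access-log /var/log/juicefs-gateway-access.log \
  redis://localhost:6379/1 localhost:9000
```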

### Hadoop Java SDK

Similarly, the access log is not output by default. To enable it, add the `juicefs.access-log` configuration item to the client configuration of the JuiceFS Hadoop Java SDK to specify the output path.
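As an example, for a Spark job this configuration item can be set on the command line through Spark's `spark.hadoop.*` configuration passthrough (the class name and JAR below are placeholders):

```bash
# spark.hadoop.* properties are forwarded to the Hadoop configuration,
# so this sets juicefs.access-log for the job. Class and JAR are
# placeholders for your application.
spark-submit \
  --conf spark.hadoop.juicefs.access-log=/tmp/juicefs.access.log \
  --class org.example.MyApp my-app.jar
```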

## Runtime information

By default, JuiceFS clients listen on a local TCP port via pprof to expose runtime information such as Goroutine stack traces, CPU profiles, and memory allocation statistics. You can find the specific port the current JuiceFS client is listening on with a system command such as `lsof`:

:::note
If you mount JuiceFS as the root user, you need to add `sudo` before the `lsof` command.
:::

```bash
$ lsof -i -nP | grep LISTEN | grep juicefs
juicefs   32666 user    8u  IPv4 0x44992f0610d9870b      0t0  TCP 127.0.0.1:6061 (LISTEN)
juicefs   32666 user    9u  IPv4 0x44992f0619bf91cb      0t0  TCP 127.0.0.1:6071 (LISTEN)
juicefs   32666 user   15u  IPv4 0x44992f062886fc5b      0t0  TCP 127.0.0.1:9567 (LISTEN)
```

By default, pprof picks a port in the range 6060 to 6099, which is why the actual port in the example above is 6061. Once you have the port number, you can browse all available runtime information at `http://localhost:<port>/debug/pprof`; the most important endpoints are:

- Goroutine stack information: `http://localhost:<port>/debug/pprof/goroutine?debug=1`
- CPU performance statistics: `http://localhost:<port>/debug/pprof/profile?seconds=30`
- Memory allocation statistics: `http://localhost:<port>/debug/pprof/heap`
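If you want to script against these endpoints, a small sketch like the following can extract the pprof port first (it assumes a single `juicefs` process with one listener in the 6060-6099 range):

```bash
# Parse lsof output and keep the first listening port in 6060-6099.
port=$(lsof -i -nP | grep juicefs | grep LISTEN |
  awk -F'[ :]+' '{print $(NF-1)}' | grep -E '^60[6-9][0-9]$' | head -n 1)
echo "pprof port: $port"
```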

To make it easier to analyze this runtime information, you can save it locally, e.g.:

```bash
$ curl 'http://localhost:<port>/debug/pprof/goroutine?debug=1' > juicefs.goroutine.txt
$ curl 'http://localhost:<port>/debug/pprof/profile?seconds=30' > juicefs.cpu.pb.gz
$ curl 'http://localhost:<port>/debug/pprof/heap' > juicefs.heap.pb.gz
```

If you have the `go` command installed, you can analyze a profile directly with `go tool pprof`. For example, to analyze CPU performance statistics:

```bash
$ go tool pprof 'http://localhost:<port>/debug/pprof/profile?seconds=30'
Fetching profile over HTTP from http://localhost:<port>/debug/pprof/profile?seconds=30
Saved profile in /Users/xxx/pprof/pprof.samples.cpu.001.pb.gz
Type: cpu
Time: Dec 17, 2021 at 1:41pm (CST)
Duration: 30.12s, Total samples = 32.06s (106.42%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 30.57s, 95.35% of 32.06s total
Dropped 285 nodes (cum <= 0.16s)
Showing top 10 nodes out of 192
      flat  flat%   sum%        cum   cum%
    14.73s 45.95% 45.95%     14.74s 45.98%  runtime.cgocall
     7.39s 23.05% 69.00%      7.41s 23.11%  syscall.syscall
     2.92s  9.11% 78.10%      2.92s  9.11%  runtime.pthread_cond_wait
     2.35s  7.33% 85.43%      2.35s  7.33%  runtime.pthread_cond_signal
     1.13s  3.52% 88.96%      1.14s  3.56%  runtime.nanotime1
     0.77s  2.40% 91.36%      0.77s  2.40%  syscall.Syscall
     0.49s  1.53% 92.89%      0.49s  1.53%  runtime.memmove
     0.31s  0.97% 93.86%      0.31s  0.97%  runtime.kevent
     0.27s  0.84% 94.70%      0.27s  0.84%  runtime.usleep
     0.21s  0.66% 95.35%      0.21s  0.66%  runtime.madvise
```

Runtime information can also be rendered as visual charts for more intuitive analysis; charts can be exported in various formats such as HTML, PDF, SVG, and PNG. For example, the command to export memory allocation statistics as a PDF file is as follows:

:::note
Exporting visual charts relies on Graphviz, so please install it first.
:::

```bash
go tool pprof -pdf 'http://localhost:<port>/debug/pprof/heap' > juicefs.heap.pdf
```

For more information about pprof, please see the official documentation.

## Profiling with Pyroscope


Pyroscope is an open source continuous profiling platform. It will help you:

- Find performance issues and bottlenecks in your code
- Resolve issues of high CPU utilization
- Understand the call tree of your application
- Track changes over time

JuiceFS supports the `--pyroscope` option for passing in the Pyroscope server address; metrics are then pushed to the server every 10 seconds. If authentication is enabled on the server, the API key can be supplied via the `PYROSCOPE_AUTH_TOKEN` environment variable:

```bash
export PYROSCOPE_AUTH_TOKEN=xxxxxxxxxxxxxxxx
juicefs mount --pyroscope http://localhost:4040 redis://localhost /mnt/jfs
juicefs dump --pyroscope http://localhost:4040 redis://localhost dump.json
```