Huge memory increase of wazuh-modulesd in Vulnerability Detection performance test #27260
Weird behavior, probably memory corruption. It is as if a function that should only be called once, at startup, is being called again. It may not be related to VD, since the last correct log is that a rootcheck scan ends.
@sebasfalcone run wazuh-modulesd with ASAN.
Attack plan
@MiguelazoDS We are going to install a manager and run modulesd with ASAN to determine if we have some issue with the content updater.
Analysis
Trying to perform the procedure above, I'm not getting any errors, not because there are none, but because the thread sanitizer is not actually executing. I will keep analyzing.
Update
Thanks @pereyra-m for helping set the flags in the right place. I couldn't find any error in the content manager yet, so I deliberately added a leak in the pubSubPublisher class to verify that the flags are working and will detect any anomaly in the content manager.

Direct leak of 4 byte(s) in 1 object(s) allocated from:
#0 0x79aba40b4887 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x79ab9469e8cb in PubSubPublisher::publish(UpdaterContext&) const /home/vagrant/wazuh/src/shared_modules/content_manager/src/components/pubSubPublisher.hpp:66
#2 0x79ab9469edcd in PubSubPublisher::handleRequest(std::shared_ptr<UpdaterContext>) /home/vagrant/wazuh/src/shared_modules/content_manager/src/components/pubSubPublisher

The only leaks we found are related to the configuration:

==195906==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 74 byte(s) in 1 object(s) allocated from:
#0 0x7de74565b9a7 in __interceptor_strdup ../../../../src/libsanitizer/asan/asan_interceptors.cpp:454
#1 0x56853962ba12 in Read_Global config/global-config.c:210
#2 0x568539625821 in read_main_elements config/config.c:85
#3 0x5685396270ff in ReadConfig config/config.c:372
#4 0x56853965ee48 in wm_vulnerability_detection_subnode_read config/wmodules-vulnerability-detection.c:106
#5 0x56853965f260 in Read_Vulnerability_Detection config/wmodules-vulnerability-detection.c:170
#6 0x5685396263a0 in read_main_elements config/config.c:196
#7 0x5685396270ff in ReadConfig config/config.c:372
#8 0x568539602cea in wm_config wazuh_modules/wmodules.c:77
#9 0x5685396000e2 in wm_setup wazuh_modules/main.c:134
#10 0x5685395ffe33 in main wazuh_modules/main.c:82
#11 0x7de743629d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
Direct leak of 48 byte(s) in 1 object(s) allocated from:
#0 0x7de7456b4a57 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x56853962d83e in Read_Global config/global-config.c:525
#2 0x568539625821 in read_main_elements config/config.c:85
#3 0x5685396270ff in ReadConfig config/config.c:372
#4 0x56853965ee48 in wm_vulnerability_detection_subnode_read config/wmodules-vulnerability-detection.c:106
#5 0x56853965f260 in Read_Vulnerability_Detection config/wmodules-vulnerability-detection.c:170
#6 0x5685396263a0 in read_main_elements config/config.c:196
#7 0x5685396270ff in ReadConfig config/config.c:372
#8 0x568539602cea in wm_config wazuh_modules/wmodules.c:77
#9 0x5685396000e2 in wm_setup wazuh_modules/main.c:134
#10 0x5685395ffe33 in main wazuh_modules/main.c:82
#11 0x7de743629d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
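For reference, the kind of deliberate leak mentioned above (added to PubSubPublisher to validate the sanitizer setup) can be as small as the following. This is a minimal, hypothetical sketch, not the actual change; the function name is made up:

// Build with: -fsanitize=address -g -fno-omit-frame-pointer
// LeakSanitizer reports this at exit as "Direct leak of 4 byte(s) in 1 object(s)".
#include <cstdlib>

void publishWithDeliberateLeak()
{
    // Intentionally never freed: used only to confirm that LeakSanitizer
    // is active and instruments this translation unit.
    volatile int* leaked = static_cast<int*>(std::malloc(sizeof(int)));
    *leaked = 42;
}

int main()
{
    publishWithDeliberateLeak();
    return 0;
}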
I came to the same conclusion as @MiguelazoDS: based on the comparison that @sebasfalcone made, it does not seem to be easily replicated in any environment. The increase was not seen in other environments, but since the undefined behavior was seen in these test environments (beta), we need to launch and gain access to that environment to proceed.
Update
Regarding the following logs:
They are generated by the API querying the configuration of the modules. This has no impact. Thanks @pereyra-m for pointing this out.
Performance Vulnerability Tests: Overview and configuration
This document provides insights into the performance vulnerability tests conducted upon request from @sebasfalcone and @Dwordcito.

Environment
Configurations
Local Internal Options
Simulated Agent Deployment
Agents are simulated using ECS Tasks:
Load Simulation
The environment is saturated to simulate 20,000 agents, each generating 1 syslog EPS (event per second).

Sample Syslog Data

{
"type": "dbsync_osinfo",
"data": {
"checksum": "1634140017886803554",
"architecture": "x86_64",
"hostname": "1-Zhnp8XImWsfJR9Mu-debian8",
"os_codename": "focal",
"os_major": "20",
"os_minor": "04",
"os_name": "Ubuntu",
"os_platform": "ubuntu",
"os_patch": "6",
"os_release": "sp1",
"os_version": "20.04.6 LTS (Focal Fossa)",
"os_build": "4.18.0-305.12.1.el8_4.x86_64",
"release": "6.2.6-76060206-generic",
"scan_time": "2023/12/2011:24:58",
"sysname": "Linux",
"version": "#202303130630~1689015125~22.04~ab2190eSMPPREEMPT_DYNAMIC"
},
"operation": "INSERTED"
}

Delete operation

{
"type": "dbsync_osinfo",
"data": { ... (same as above) },
"operation": "DELETED"
}

Agent Vulnerability Simulation
Each agent iterates over a list of mocked vulnerable packages sourced from the repository (100 packages per agent). After processing, agents send deleted events for the same packages.

Manager configurations
Manager configurations are the default, changing only the cluster and Indexer configuration blocks in order to deploy the clusters. In addition, the socket request queue limit is changed on the manager hosts:
Configuration applied running:
Update
We are having trouble with the debugging symbols generated by the workflow.
Update
I initially thought that the huge size of the events queue was somehow affecting the feed update process, but the log shown in this manual test is not present in the execution of the performance test.

2024/12/12 15:15:07 wazuh-modulesd:vulnerability-scanner[63341] databaseFeedManager.hpp:372 at operator()(): ERROR: Error updating feed: Error putting data: IO error: No space left on device: While appending to file: queue/vd/feed/002321.sst: No space left on device.
2024/12/12 15:15:07 wazuh-modulesd:content-updater[63341] pubSubPublisher.hpp:51 at publish(): DEBUG: Failed to publish data
2024/12/12 15:15:07 wazuh-modulesd:content-updater[63341] actionOrchestrator.hpp:177 at run(): WARNING: Offset processing failed. Triggered a snapshot download.

The scan is looping and the databases do not decrease, although that seems not to be related to the huge increase in volatile memory. The alerts file is growing as a consequence of repeated scans over the same packages. There we can see that the events are being re-processed and the events queue does not decrease.

@pereyra-m noticed that some agents do not have OS information. I verified on one manager and found that 68 agents do not have that info (there are more, but the logs rotate). I fixed all DBs with missing information, but the result is the same: the events queue does not decrease and the report queue keeps growing.

This would require extra analysis and it is not related to the problem described in the issue, but I wonder if the slow performance in the offset update is directly related to the misaligned access while working with flatbuffers.

2024/12/16 09:14:09 wazuh-modulesd:vulnerability-scanner[6672] databaseFeedManager.hpp:187 at processMessage(): DEBUG: Processing file: queue/vd_updater/tmp/contents/1130987-api_file.json
/home/vagrant/wazuh/src/external/flatbuffers/include/flatbuffers/verifier.h:173:42: runtime error: member call on misaligned address 0x6290002f3372 for type 'const struct Vector', which requires 4 byte alignment
0x6290002f3372: note: pointer points here
04 00 00 00 07 00 00 00 54 39 00 00 c8 2e 00 00 14 25 00 00 74 1b 00 00 74 12 00 00 54 11 00 00
^
/home/vagrant/wazuh/src/external/flatbuffers/include/flatbuffers/vector.h:164:44: runtime error: member access within misaligned address 0x6290002f3372 for type 'const struct Vector', which requires 4 byte alignment
0x6290002f3372: note: pointer points here
04 00 00 00 07 00 00 00 54 39 00 00 c8 2e 00 00 14 25 00 00 74 1b 00 00 74 12 00 00 54 11 00 00
^
/home/vagrant/wazuh/src/external/flatbuffers/include/flatbuffers/verifier.h:174:22: runtime error: member call on misaligned address 0x6290002f3372 for type 'const struct Vector', which requires 4 byte alignment
0x6290002f3372: note: pointer points here
04 00 00 00 07 00 00 00 54 39 00 00 c8 2e 00 00 14 25 00 00 74 1b 00 00 74 12 00 00 54 11 00 00
^
/home/vagrant/wazuh/src/external/flatbuffers/include/flatbuffers/vector.h:176:5: runtime error: member call on misaligned address 0x6290002f3372 for type 'const struct Vector', which requires 4 byte alignment
0x6290002f3372: note: pointer points here
04 00 00 00 07 00 00 00 54 39 00 00 c8 2e 00 00 14 25 00 00 74 1b 00 00 74 12 00 00 54 11 00 00
^
/home/vagrant/wazuh/src/external/flatbuffers/include/flatbuffers/vector.h:177:40: runtime error: member call on misaligned address 0x6290002f3372 for type 'const struct Vector', which requires 4 byte alignment
0x6290002f3372: note: pointer points here
04 00 00 00 07 00 00 00 54 39 00 00 c8 2e 00 00 14 25 00 00 74 1b 00 00 74 12 00 00 54 11 00 00
^
/home/vagrant/wazuh/src/external/flatbuffers/include/flatbuffers/vector.h:254:47: runtime error: member access within misaligned address 0x6290002f3372 for type 'const struct Vector', which requires 4 byte alignment
0x6290002f3372: note: pointer points here
04 00 00 00 07 00 00 00 54 39 00 00 c8 2e 00 00 14 25 00 00 74 1b 00 00 74 12 00 00 54 11 00 00
^
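The class of error reported above can be reproduced with a trivial misaligned read. A minimal sketch (unrelated to flatbuffers itself) that trips the same 4-byte alignment check; UBSan reports it as a "load of misaligned address" rather than a "member call", but the diagnostic mechanism is the same:

// Build with: -fsanitize=alignment -g (the alignment check is part of UBSan).
#include <cstdint>
#include <cstdio>

int main()
{
    alignas(4) unsigned char buffer[8] = {0, 1, 2, 3, 4, 5, 6, 7};

    // buffer + 2 is only 2-byte aligned; reading a uint32_t through it is UB
    // and UBSan flags it as requiring 4 byte alignment.
    const auto* misaligned = reinterpret_cast<const std::uint32_t*>(buffer + 2);
    std::printf("%u\n", *misaligned);
    return 0;
}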
Update
I've run an analysis to find the bottlenecks in the package scanner; this is the POC. To properly compare the time taken in each step, I instrumented the scanner with Tracy (a sketch of this instrumentation is shown below). Basically, the POC:
The Tracy lib was compiled like this, resulting in
Also, had to add at
For each target, link the library:
For each source file
I wasn't able to compile the capture tool in the old CentOS package builder image (I installed the dependencies). The branch doesn't compile the profiler; this can be done like:
Once all is set, capture before starting the scan
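As an illustration of the per-source-file instrumentation referred to above, Tracy zones are typically added like this. This is a minimal sketch using Tracy's public macros; the function and zone names are hypothetical and the exact zones added in the POC are not shown here:

// Requires building with TRACY_ENABLE defined and linking against the Tracy client.
// The header path may vary with the Tracy version (older releases use "Tracy.hpp").
#include <tracy/Tracy.hpp>

void scanPackage()
{
    // Names the zone after the enclosing function and times it until scope exit.
    ZoneScoped;

    {
        // A named sub-zone makes individual steps visible in the profiler timeline.
        ZoneScopedN("candidates-lookup");
        // ... candidate retrieval would go here ...
    }

    {
        ZoneScopedN("version-match");
        // ... version comparison would go here ...
    }
}

int main()
{
    scanPackage();
    return 0;
}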
Once the scan is finished, stop the manager and open the file with the profiler. Here I highlight the slowest zones for the manager scan with 426 packages. We need to keep analyzing these results, but with other inputs, to compare the performance.

Agent package scan
We suspect that the packages used in the test are particularly slow.

Details
{
"type": "dbsync_osinfo",
"data": {
"checksum": "1634140017886803554",
"architecture": "x86_64",
"hostname": "1-Zhnp8XImWsfJR9Mu-debian8",
"os_codename": "focal",
"os_major": "20",
"os_minor": "04",
"os_name": "Ubuntu",
"os_platform": "ubuntu",
"os_patch": "6",
"os_release": "sp1",
"os_version": "20.04.6 LTS (Focal Fossa)",
"os_build": "4.18.0-305.12.1.el8_4.x86_64",
"release": "6.2.6-76060206-generic",
"scan_time": "2023/12/2011:24:58",
"sysname": "Linux",
"version": "#202303130630~1689015125~22.04~ab2190eSMPPREEMPT_DYNAMIC"
},
"operation": "INSERTED"
} {
"type": "dbsync_packages",
"data": {
"architecture": "",
"checksum": "EDZ2U2EBW7",
"description": "",
"format": "",
"groups": "editors",
"install_time": "2024/06/06 00:00:00",
"item_id": "4EJE7MC1VD",
"location": "",
"multiarch": "null",
"name": "bsd_os",
"priority": "optional",
"scan_time": "2023/12/1915:32:25",
"size": "2",
"source": "",
"vendor": "bsdi",
"version": "3.1"
},
"operation": "INSERTED"
}

In one of them, I send the OS data so the cache can store the information and the request to Wazuh-DB isn't required. In the second one, I only send the package, so the query to Wazuh-DB is performed.

So we can see that:
Update
Update
I detected the same issue in worker 2:

[root@ip-172-31-6-145 event]# grep "Stopping writes because" LOG*
LOG:2024/12/23-21:55:13.066178 140311840192256 [WARN] [db/column_family.cc:933] [default] Stopping writes because we have 4 immutable memtables (waiting for flush), max_write_buffer_number is set to 4
LOG:2024/12/23-21:55:37.629785 140311840192256 [WARN] [db/column_family.cc:933] [default] Stopping writes because we have 4 immutable memtables (waiting for flush), max_write_buffer_number is set to 4
LOG:2024/12/23-21:56:00.856550 140311840192256 [WARN] [db/column_family.cc:933] [default] Stopping writes because we have 4 immutable memtables (waiting for flush), max_write_buffer_number is set to 4
[root@ip-172-31-6-145 event]# ps aux | grep "modulesd"
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 26807 0.0 0.0 119424 916 pts/0 S+ 22:37 0:00 grep --color=auto modulesd
root 28150 87.8 4.3 1596416 338816 ? Sl 20:23 117:47 /var/ossec/bin/wazuh-modulesd
root 29924 0.6 0.2 1034364 21108 ? Sl 20:26 0:50 /bin/python3 /usr/local/bin/wazuh-metrics -p wazuh-analysisd wazuh-remoted wazuh-db wazuh-modulesd wazuh_apid wazuh_clusterd wazuh-authd wazuh-dbd -v 4.10.0 -s 5 -u KB --store /home/ec2-user/B_818_CLUSTER-Workload_benchmarks_metrics/binaries
[root@ip-172-31-6-145 event]# date
Mon Dec 23 22:38:52 UTC 2024

Memory values don't seem abnormal. Tomorrow I will investigate that spike.
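For reference, the stall condition in that log is controlled by RocksDB's memtable settings. A minimal sketch of the relevant knobs (the values and the database path are illustrative, not necessarily what Wazuh configures):

#include <rocksdb/db.h>
#include <rocksdb/options.h>

#include <cassert>

int main()
{
    rocksdb::Options options;
    options.create_if_missing = true;

    // Each memtable holds roughly this many bytes before it becomes immutable.
    options.write_buffer_size = 64 * 1024 * 1024;

    // At most this many memtables (active + immutable) may exist. Once they are
    // all waiting for flush, RocksDB stops writes ("Stopping writes because we
    // have N immutable memtables ...", as in the log above).
    options.max_write_buffer_number = 4;

    // More background jobs let flushes/compactions keep up with the write rate.
    options.max_background_jobs = 4;

    rocksdb::DB* db = nullptr;
    const auto status = rocksdb::DB::Open(options, "/tmp/example-queue", &db);
    assert(status.ok());
    delete db;
    return 0;
}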
QA changes review
From the QA side, we've reviewed the pipeline to track down any change that could explain this behavior.
Checking:
Analysis
After syncing with @jnasselle, some conclusions arose.

Why did the test fail?

QA side
One of the first questions that came to our minds was, 'Why did the high-activity test fail, and not the very high-activity one?' Doing some simple math and assuming a perfectly balanced environment:
Important: this assumes a perfectly balanced environment; in practice some workers receive more load than others. Example (non-balanced):

Server side
First, we need to consider the architecture of the modules involved:
As explained above by @Dwordcito, step 4 came to a stall. This makes the router queue grow indefinitely, thus spiking the memory consumption. If we analyze the code:
Conclusion
Some degradation in the performance of the routers-VD section triggered this behaviour. Increasing the performance of this section will prevent the issue from happening in the current scenario, but it could happen again if the tests are more demanding. I think the main issue lies in the way the events are received by the router: it has no drop mechanism for scenarios where the queue is saturated (for example, remoted drops events when it is saturated). On the other hand, the changes introduced here improved the efficiency of this section to the point that the issue no longer happens in the current setup.

Fixes analysis
I am going to continue the analysis using this script, as it makes the visualization much simpler.

Fixes validation (PR)

Memory
We see a reduction in both the maximum value observed and the mean memory consumption:

Values for RSS (KB)
Mean:
(*)_4.10.0_wazuh-modulesd 317795
_4.10.0-fix_wazuh-modulesd 294091
Max:
(*)_4.10.0_wazuh-modulesd 1106044
_4.10.0-fix_wazuh-modulesd 411700
Min:
(*)_4.10.0_wazuh-modulesd 66148
_4.10.0-fix_wazuh-modulesd 71472
Std:
(*)_4.10.0_wazuh-modulesd 138891
_4.10.0-fix_wazuh-modulesd 39094

The spike can be explained by a stall in RocksDB:

2024/12/23-21:55:13.066178 140311840192256 [WARN] [db/column_family.cc:933] [default] Stopping writes because we have 4 immutable memtables (waiting for flush), max_write_buffer_number is set to 4
2024/12/23-21:55:37.629785 140311840192256 [WARN] [db/column_family.cc:933] [default] Stopping writes because we have 4 immutable memtables (waiting for flush), max_write_buffer_number is set to 4
2024/12/23-21:56:00.856550 140311840192256 [WARN] [db/column_family.cc:933] [default] Stopping writes because we have 4 immutable memtables (waiting for flush), max_write_buffer_number is set to 4

We do see, however, a great improvement in the time to recovery, thanks to the changes implemented.

Analyzing 4.10.0 RC1
If we compare Beta 1 and RC1, we see little to no change in the module's efficiency in processing events. Three PRs are present:
Conclusion
From my point of view, the issue lies in how the modules interact with one another:
The interaction is susceptible to:
The solution proposed in this PR has been shown to improve the efficiency with which modulesd processes events, and therefore prevented this issue from happening. Another possible solution is to drop events when the router queue surpasses some limit (just like remoted does).
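A minimal sketch of the drop-on-saturation policy suggested above (hypothetical class name and limit; not existing Wazuh code):

#include <cstddef>
#include <cstdint>
#include <mutex>
#include <queue>
#include <string>

// Bounded event queue: once the high-water mark is reached, new events are
// dropped and counted instead of letting memory grow without limit.
class BoundedEventQueue
{
public:
    explicit BoundedEventQueue(std::size_t maxSize)
        : m_maxSize {maxSize}
    {
    }

    bool push(std::string event)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_queue.size() >= m_maxSize)
        {
            ++m_dropped; // Same idea as remoted discarding events when saturated.
            return false;
        }
        m_queue.push(std::move(event));
        return true;
    }

    std::uint64_t dropped() const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_dropped;
    }

private:
    std::size_t m_maxSize;
    std::queue<std::string> m_queue;
    std::uint64_t m_dropped {0};
    mutable std::mutex m_mutex;
};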
Update
Update
After re-launching the test from RC1 but reverting the changes introduced here, we see that the issue no longer occurs. A possible explanation is that all this new logic:

wazuh/src/wazuh_modules/vulnerability_scanner/src/scanOrchestrator/buildSingleAgentListContext.hpp Lines 73 to 84 in 00bbbf8
degraded the event processing.

Benchmark
Running the BETA1 and RC1 variations of the buildSingleAgentListContext class:

BETA 1
---------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------------------------------------
BuildSingleAgentListFixture/buildSingleAgentListPerformance/iterations:10000/threads:1 0.209 ms 0.044 ms 10000
RC1
---------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------------------------------------
BuildSingleAgentListFixture/buildSingleAgentListPerformance/iterations:10000/threads:1 0.565 ms 0.117 ms 10000

Continuation
I am going to validate this theory by instantiating the simulated agents locally.

Code criticality
This method is responsible for preparing the context to trigger a re-scan of a specific agent when a prior scan failed:

wazuh/src/wazuh_modules/vulnerability_scanner/src/scanOrchestrator/scanAgentList.hpp Lines 257 to 261 in 381e292
The most probable cause for this failure is an exception while retrieving information from the wazuhDB:

wazuh/src/shared_modules/utils/socketDBWrapper.hpp Lines 187 to 188 in 3a7a2fb
If the DB is not synchronized, it will throw this exception, triggering a re-scan a minute later.
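A minimal sketch of the retry pattern described (hypothetical names; the actual logic lives in scanAgentList.hpp and socketDBWrapper.hpp):

#include <chrono>
#include <exception>
#include <functional>
#include <iostream>

// If the Wazuh-DB query fails (e.g. the DB is not yet synchronized), schedule
// a re-scan of the same agent roughly one minute later instead of giving up.
void scanWithRetry(const std::function<void()>& scanAgent,
                   const std::function<void(std::chrono::seconds)>& scheduleRescan)
{
    try
    {
        scanAgent();
    }
    catch (const std::exception& e)
    {
        std::cerr << "Scan failed (" << e.what() << "), re-scan scheduled\n";
        scheduleRescan(std::chrono::seconds {60});
    }
}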
Update
Attempting to improve the buildSingleAgentList class:

Filter data in the wazuhDB query
This was achieved by applying a filter in the query to the wazuhDB:

TSocketDBWrapper::instance().query(WazuhDBQueryBuilder::builder()
.global()
.selectAll()
.fromTable("agent")
.whereColumn("id")
.equalsTo(agentId.data())
.andColumn("connection_status")
.equalsTo("active")
.andColumn("node_name")
.equalsTo(clusterNodeName.data())
.build(),
response);

The results were positive:

---------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------------------------------------
BuildSingleAgentListFixture/buildSingleAgentListPerformance/iterations:10000/threads:1 0.178 ms 0.030 ms 10000

Update
These modifications radically changed the behaviour. This confirms that this section of the code is being run.
Update
Today we are going to:
Update
I've prepared an environment similar to the one used in the tests:
30m check
1h check
1h 30m check
Related to the above analysis: I don't think that code is being exercised in this test; the delayed queue tables have 0 elements written in the several tests reviewed.

After reviewing more things in 4.10, I see that when these changes were incorporated, the CMakeLists.txt at the root was not taken into account. This means that the libraries compiled here (libindexer_connector, libvulnerability_scanner, librouter and the utils helpers included in each cpp file) are built with -O0 (no optimizations).

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
https://cmake.org/cmake/help/v3.22/manual/cmake-buildsystem.7.html#default-and-custom-configurations

Compilation output v4.10.0:

VERBOSE=1 make TARGET=server -j$(nproc)
...
/usr/bin/c++ ... -std=gnu++17 -fPIC -MD -MT wazuh_modules/vulnerability_scanner/CMakeFiles/vulnerability_scanner.dir/src/vulnerabilityScannerFacade.cpp.o -MF CMakeFiles/vulnerability_scanner.dir/src/vulnerabilityScannerFacade.cpp.o.d -o CMakeFiles/vulnerability_scanner.dir/src/vulnerabilityScannerFacade.cpp.o -c /home/dwordcito/Development/wazuh-pre-5.0/wazuh/src/wazuh_modules/vulnerability_scanner/src/vulnerabilityScannerFacade.cpp
...

Running a simple stress test (using the same input as the Jenkins tests), I can see an improvement of ~100% after applying the fix.

Fix: add to wazuh/src/CMakeLists.txt:

if(NOT CMAKE_BUILD_TYPE)
if(CMAKE_SYMBOLS_IN_RELEASE MATCHES "ON")
set(CMAKE_BUILD_TYPE RelWithDebInfo)
else()
set(CMAKE_BUILD_TYPE Release)
endif()
endif()
Update
I've implemented the changes proposed by @Dwordcito, with no luck. We got stalls on the events DB:

2025/01/02-11:13:03.685418 140602559682304 [WARN] [db/column_family.cc:933] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2025/01/02-11:13:41.083581 140602559682304 [WARN] [db/column_family.cc:933] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2025/01/02-11:13:52.895942 140602559682304 [WARN] [db/column_family.cc:933] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
.
.
.
2025/01/02-12:08:50.886792 140602559682304 [WARN] [db/column_family.cc:933] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2025/01/02-12:08:56.930417 140602559682304 [WARN] [db/column_family.cc:933] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2025/01/02-12:09:03.484502 140602559682304 [WARN] [db/column_family.cc:933] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2

Memory consumption got out of hand and the process was killed due to OOM:

[ 7391.944922] Out of memory: Kill process 28056 (wazuh-modulesd) score 843 or sacrifice child
[ 7391.950544] Killed process 28056 (wazuh-modulesd) total-vm:8347304kB, anon-rss:6821060kB, file-rss:0kB, shmem-rss:0kB
[ 7392.397910] oom_reaper: reaped process 28056 (wazuh-modulesd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

This happens in both workers 1 and 2. @pereyra-m is investigating whether different instances of RocksDB can interfere with one another due to shared resource usage.
Update
We suspect that if the same process opens many RocksDB instances, they share the same thread pool due to the default environment:

https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#sharing-cache-and-thread-pool
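A minimal sketch of how the shared default Env shows up, and one way to bound the background threads per priority (the thread counts and database paths are illustrative, not the configuration Wazuh uses):

#include <rocksdb/db.h>
#include <rocksdb/env.h>
#include <rocksdb/options.h>

int main()
{
    // Every DB opened with default options uses the process-wide Env::Default(),
    // so all instances compete for the same background thread pools.
    rocksdb::Env* sharedEnv = rocksdb::Env::Default();

    // The pools can be sized explicitly (per priority) for the whole process.
    sharedEnv->SetBackgroundThreads(4, rocksdb::Env::Priority::LOW);  // compactions
    sharedEnv->SetBackgroundThreads(2, rocksdb::Env::Priority::HIGH); // flushes

    rocksdb::Options options;
    options.create_if_missing = true;
    options.env = sharedEnv; // This is also the implicit default.

    rocksdb::DB* eventsQueue = nullptr;
    rocksdb::DB* delayedQueue = nullptr;
    rocksdb::DB::Open(options, "/tmp/events-queue", &eventsQueue);
    rocksdb::DB::Open(options, "/tmp/delayed-queue", &delayedQueue);

    delete eventsQueue;
    delete delayedQueue;
    return 0;
}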
This could explain why changes in the way the delayed thread handles the data could affect the events queue.

wazuh/src/wazuh_modules/vulnerability_scanner/src/scanOrchestrator/scanOrchestrator.hpp Lines 311 to 317 in 3a7a2fb
The performance of that method could be improved if we find that this is the problem.

Benchmark tests
I've run some local tests to verify this behavior.

Benchmark test
constexpr auto TEST_DB = "test.db";
constexpr auto TEST_CF_DB = "test_cf.db";
constexpr auto TEST_AGENT_NUMBER {1000};
constexpr auto TEST_EVENTS_PER_AGENT {10000};
constexpr auto TEST_THREADS {4};
class ThreadPoolBenchmarkFixture : public benchmark::Fixture
{
protected:
std::shared_ptr<RocksDBQueue<std::string>> queue; ///< RocksDB queue.
std::vector<std::shared_ptr<RocksDBQueueCF<std::string>>> queueCFs; ///< RocksDB queue with column family.
public:
/**
* @brief Benchmark setup routine.
*
* @param state Benchmark state.
*/
void SetUp(const ::benchmark::State& state) override
{
std::error_code ec;
std::filesystem::remove_all(TEST_DB, ec);
queue = std::make_shared<RocksDBQueue<std::string>>(TEST_DB);
for (int i = 0; i < TEST_THREADS; ++i)
{
std::filesystem::remove_all(TEST_CF_DB + std::to_string(i), ec);
queueCFs.emplace_back(std::make_shared<RocksDBQueueCF<std::string>>(TEST_CF_DB + std::to_string(i)));
}
}
/**
* @brief Benchmark teardown routine.
*
* @param state Benchmark state.
*/
void TearDown(const ::benchmark::State& state) override
{
queue.reset();
for (auto& queueCF : queueCFs)
{
queueCF.reset();
}
}
};
BENCHMARK_DEFINE_F(ThreadPoolBenchmarkFixture, ThreadPoolPerformance)(benchmark::State& state)
{
std::atomic<bool> stop {false};
std::vector<std::thread> threads;
auto cfLambda = [&](int threadIndex)
{
while (!stop)
{
for (int i = 0; i < TEST_AGENT_NUMBER; ++i)
{
if (stop)
{
break;
}
for (int j = 0; j < TEST_EVENTS_PER_AGENT; ++j)
{
if (stop)
{
break;
}
queueCFs.at(threadIndex)->push(std::to_string(i), "test");
}
}
for (int i = 0; i < TEST_AGENT_NUMBER; ++i)
{
if (stop)
{
break;
}
queueCFs.at(threadIndex)->clear(std::to_string(i));
}
}
};
for (int i = 0; i < TEST_THREADS; ++i)
{
threads.emplace_back(cfLambda, i);
}
for (auto _ : state)
{
for (int i = 0; i < TEST_AGENT_NUMBER; ++i)
{
for (int j = 0; j < TEST_EVENTS_PER_AGENT; ++j)
{
queue->push("test");
}
std::cout << "Pushed " << i << std::endl;
}
for (int i = 0; i < TEST_AGENT_NUMBER; ++i)
{
for (int j = 0; j < TEST_EVENTS_PER_AGENT; ++j)
{
queue->pop();
}
std::cout << "Popped " << i << std::endl;
}
}
stop.store(true);
for (auto& thread : threads)
{
if (thread.joinable())
{
thread.join();
}
}
}
BENCHMARK_REGISTER_F(ThreadPoolBenchmarkFixture, ThreadPoolPerformance)->Iterations(1)->Threads(1);

Loaded test
The test is run as shown above.
Details
# grep "compaction_finished" test_results/loaded/LOG.queue
2025/01/02-16:29:12.426701 140432359392960 (Original Log Time 2025/01/02-16:29:12.426643) EVENT_LOG_v1 {"time_micros": 1735835352426627, "job": 6, "event": "compaction_finished", "compaction_time_micros": 1935342, "compaction_time_cpu_micros": 1922159, "output_level": 1, "num_output_files": 1, "total_output_size": 61493772, "num_input_records": 3672165, "num_output_records": 3672165, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 1, 0, 0]}
2025/01/02-16:29:41.323698 140432359392960 (Original Log Time 2025/01/02-16:29:41.323664) EVENT_LOG_v1 {"time_micros": 1735835381323654, "job": 11, "event": "compaction_finished", "compaction_time_micros": 5519622, "compaction_time_cpu_micros": 5471747, "output_level": 1, "num_output_files": 2, "total_output_size": 123061671, "num_input_records": 7343969, "num_output_records": 7343969, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 2, 0, 0]}
2025/01/02-16:30:22.511072 140432359392960 (Original Log Time 2025/01/02-16:30:22.511003) EVENT_LOG_v1 {"time_micros": 1735835422510988, "job": 16, "event": "compaction_finished", "compaction_time_micros": 6051992, "compaction_time_cpu_micros": 5974653, "output_level": 1, "num_output_files": 3, "total_output_size": 150141766, "num_input_records": 11103858, "num_output_records": 8896142, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [1, 3, 0, 0]}
2025/01/02-16:30:57.604270 140432359392960 (Original Log Time 2025/01/02-16:30:57.604231) EVENT_LOG_v1 {"time_micros": 1735835457604220, "job": 21, "event": "compaction_finished", "compaction_time_micros": 2238467, "compaction_time_cpu_micros": 2231745, "output_level": 1, "num_output_files": 1, "total_output_size": 5444946, "num_input_records": 7667247, "num_output_records": 322595, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 3, 0, 0]}
2025/01/02-16:31:34.224484 140432359392960 (Original Log Time 2025/01/02-16:31:34.224452) EVENT_LOG_v1 {"time_micros": 1735835494224441, "job": 26, "event": "compaction_finished", "compaction_time_micros": 2719166, "compaction_time_cpu_micros": 2681834, "output_level": 1, "num_output_files": 1, "total_output_size": 51929384, "num_input_records": 4912941, "num_output_records": 3076901, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [3, 3, 0, 0]}
2025/01/02-16:31:43.309915 140432359392960 (Original Log Time 2025/01/02-16:31:43.309885) EVENT_LOG_v1 {"time_micros": 1735835503309874, "job": 28, "event": "compaction_finished", "compaction_time_micros": 2260477, "compaction_time_cpu_micros": 2247228, "output_level": 1, "num_output_files": 1, "total_output_size": 41889110, "num_input_records": 4316998, "num_output_records": 2481994, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [3, 2, 0, 0]} Total compaction time:
Loaded with no clear
The background threads keep pushing, but the clear call was removed.
We find a small improvement in the run time.

Details
# grep "compaction_finished" test_results/loaded_no_clear/LOG.queue
2025/01/02-16:35:52.628818 140329318880960 (Original Log Time 2025/01/02-16:35:52.628759) EVENT_LOG_v1 {"time_micros": 1735835752628747, "job": 6, "event": "compaction_finished", "compaction_time_micros": 2714394, "compaction_time_cpu_micros": 2699473, "output_level": 1, "num_output_files": 1, "total_output_size": 61491927, "num_input_records": 3672056, "num_output_records": 3672056, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [1, 1, 0, 0]}
2025/01/02-16:36:15.832554 140329318880960 (Original Log Time 2025/01/02-16:36:15.832516) EVENT_LOG_v1 {"time_micros": 1735835775832503, "job": 11, "event": "compaction_finished", "compaction_time_micros": 3963471, "compaction_time_cpu_micros": 3919523, "output_level": 1, "num_output_files": 2, "total_output_size": 123064003, "num_input_records": 7344107, "num_output_records": 7344107, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 2, 0, 0]}
2025/01/02-16:36:54.406449 140329318880960 (Original Log Time 2025/01/02-16:36:54.406403) EVENT_LOG_v1 {"time_micros": 1735835814406392, "job": 16, "event": "compaction_finished", "compaction_time_micros": 6144726, "compaction_time_cpu_micros": 6088112, "output_level": 1, "num_output_files": 3, "total_output_size": 150134791, "num_input_records": 11104271, "num_output_records": 8895729, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [1, 3, 0, 0]}
2025/01/02-16:37:44.152405 140329318880960 (Original Log Time 2025/01/02-16:37:44.152360) EVENT_LOG_v1 {"time_micros": 1735835864152349, "job": 23, "event": "compaction_finished", "compaction_time_micros": 2406994, "compaction_time_cpu_micros": 2367486, "output_level": 1, "num_output_files": 1, "total_output_size": 51925417, "num_input_records": 4913176, "num_output_records": 3076666, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [5, 3, 0, 0]}
2025/01/02-16:37:53.885762 140329318880960 (Original Log Time 2025/01/02-16:37:53.885725) EVENT_LOG_v1 {"time_micros": 1735835873885715, "job": 25, "event": "compaction_finished", "compaction_time_micros": 1724160, "compaction_time_cpu_micros": 1516269, "output_level": 1, "num_output_files": 1, "total_output_size": 36433589, "num_input_records": 3994590, "num_output_records": 2158742, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [5, 3, 0, 0]}
2025/01/02-16:38:05.240737 140329318880960 (Original Log Time 2025/01/02-16:38:05.240710) EVENT_LOG_v1 {"time_micros": 1735835885240703, "job": 27, "event": "compaction_finished", "compaction_time_micros": 896077, "compaction_time_cpu_micros": 876582, "output_level": 1, "num_output_files": 1, "total_output_size": 20943537, "num_input_records": 3076563, "num_output_records": 1240921, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [5, 3, 0, 0]} There is an improvement in the compaction time:
Not loaded
The
This is 3 seconds less than the loaded test.

Details
# grep "compaction_finished" test_results/not_loaded/LOG.queue
2025/01/02-16:40:09.258775 140450726438592 (Original Log Time 2025/01/02-16:40:09.258723) EVENT_LOG_v1 {"time_micros": 1735836009258712, "job": 6, "event": "compaction_finished", "compaction_time_micros": 1737197, "compaction_time_cpu_micros": 1702207, "output_level": 1, "num_output_files": 1, "total_output_size": 61485164, "num_input_records": 3671655, "num_output_records": 3671655, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 1, 0, 0]}
2025/01/02-16:40:31.382071 140450726438592 (Original Log Time 2025/01/02-16:40:31.382016) EVENT_LOG_v1 {"time_micros": 1735836031382002, "job": 11, "event": "compaction_finished", "compaction_time_micros": 3461222, "compaction_time_cpu_micros": 3412208, "output_level": 1, "num_output_files": 2, "total_output_size": 123051844, "num_input_records": 7343381, "num_output_records": 7343381, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 2, 0, 0]}
2025/01/02-16:40:56.323881 140450726438592 (Original Log Time 2025/01/02-16:40:56.323851) EVENT_LOG_v1 {"time_micros": 1735836056323844, "job": 16, "event": "compaction_finished", "compaction_time_micros": 4424581, "compaction_time_cpu_micros": 4362602, "output_level": 1, "num_output_files": 3, "total_output_size": 150143450, "num_input_records": 11103759, "num_output_records": 8896241, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 3, 0, 0]}
2025/01/02-16:41:25.295062 140450726438592 (Original Log Time 2025/01/02-16:41:25.295027) EVENT_LOG_v1 {"time_micros": 1735836085295019, "job": 21, "event": "compaction_finished", "compaction_time_micros": 1434437, "compaction_time_cpu_micros": 1426347, "output_level": 1, "num_output_files": 1, "total_output_size": 5448662, "num_input_records": 7667027, "num_output_records": 322815, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 3, 0, 0]}
2025/01/02-16:41:57.429533 140450726438592 (Original Log Time 2025/01/02-16:41:57.429503) EVENT_LOG_v1 {"time_micros": 1735836117429496, "job": 26, "event": "compaction_finished", "compaction_time_micros": 1661351, "compaction_time_cpu_micros": 1644450, "output_level": 1, "num_output_files": 1, "total_output_size": 51923275, "num_input_records": 4913302, "num_output_records": 3076540, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [3, 3, 0, 0]}
2025/01/02-16:42:05.235331 140450726438592 (Original Log Time 2025/01/02-16:42:05.235276) EVENT_LOG_v1 {"time_micros": 1735836125235261, "job": 28, "event": "compaction_finished", "compaction_time_micros": 1346004, "compaction_time_cpu_micros": 1330165, "output_level": 1, "num_output_files": 1, "total_output_size": 41876290, "num_input_records": 4317477, "num_output_records": 2481233, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [3, 2, 0, 0]} There is a significat improvement in the compaction time:
Summary
The different instances of RocksDB seem to share resources.
Update
I re-ran the tests above, but with the batch improvement in
The benchmark time is lower than in both loaded tests but higher than with no load.

Details
The same happens with the compaction times. I've also proposed a stats command for
Description
During the testing of 4.10.0 RC 1 in the vulnerability detection performance test in this issue, comparing it with previous versions, an increase in wazuh-modulesd memory has been detected and needs to be investigated. It has been detected in the high-activity case on worker 1. Here are the details.
4.10.0-RC 1
4.10.0-Beta 1
4.9.1-RC 3
Environment
The tests have been launched with the CLUSTER-Workload_benchmarks_metrics pipeline, using the following environment:
Build parameters
Build 803
Evidences
4.10.0-RC 1 Artifact
4.10.0-Beta 1 Artifact