Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Inventory::SyncLoop to Avoid Blocking Agent Stop #445

Closed
cborla opened this issue Dec 18, 2024 · 10 comments · Fixed by #502
Closed

Improve Inventory::SyncLoop to Avoid Blocking Agent Stop #445

cborla opened this issue Dec 18, 2024 · 10 comments · Fixed by #502
Assignees
Labels
level/task Task issue module/agent module/inventory Inventory module type/enhancement Enhancement issue

Comments

@cborla
Copy link
Member

cborla commented Dec 18, 2024

Description

The current implementation of the Inventory::SyncLoop function in the Inventory module may cause unnecessary delays during the agent shutdown process. Specifically, the Scan() method is called after the stop condition (m_stopping) is triggered, leading to potential delays if Scan() is a time-consuming operation.

Problem Details

  • The Scan() function is executed after the condition variable (m_cv.wait_for) returns, even if m_stopping is true.
  • This behavior delays the shutdown of the agent when Scan() involves intensive operations or long-running tasks.
  • This could lead to poor user experience during shutdown and inefficient resource handling.

Steps to Reproduce

  • Start the agent with the Inventory module enabled.
  • Trigger a stop request for the agent while SyncLoop is running.
  • Observe that Scan() executes even after the stop request, delaying the shutdown.

Proposed Solution

Update the SyncLoop function to verify the m_stopping condition after the wait operation and before calling Scan(). This ensures that no unnecessary Scan() operation is performed after a stop request is issued.

Option

void Inventory::SyncLoop()
{
    LogInfo("Module started.");

    if (m_scanOnStart && !m_stopping)
    {
        Scan();
    }

    while (!m_stopping)
    {
        {
            std::unique_lock<std::mutex> lock{m_mutex};
            m_cv.wait_for(lock,
                std::chrono::milliseconds{m_intervalValue}, [&]() { return m_stopping; });
        }

        if (!m_stopping)
        {
            Scan();
        }
    }
    std::unique_lock<std::mutex> lock{m_mutex};
    m_spDBSync.reset(nullptr);
}
@jr0me
Copy link
Member

jr0me commented Dec 19, 2024

As a note for when this issue goes into development, the first scan

    if (m_scanOnStart && !m_stopping)
    {
        Scan();
    }

Can happen while another threads calls Stop() on the Inventory, perhaps this scan should also try to capture the mutex before checking m_stopping, and block the Stop call until the scan finishes.

@xkazinx xkazinx self-assigned this Jan 14, 2025
@wazuhci wazuhci moved this from Backlog to In progress in XDR+SIEM/Release 5.0.0 Jan 14, 2025
@xkazinx
Copy link
Contributor

xkazinx commented Jan 16, 2025

Is there something to do in regards to this comment?

@cborla
Copy link
Member Author

cborla commented Jan 16, 2025

Is there something to do in regards to this comment?

Hi @xkazinx
The ideal would be to investigate the case that @jr0me raises, make a test and according to that we see which solution is more convenient.

@wazuhci wazuhci moved this from In progress to Pending review in XDR+SIEM/Release 5.0.0 Jan 17, 2025
@wazuhci wazuhci moved this from Pending review to In review in XDR+SIEM/Release 5.0.0 Jan 21, 2025
@xkazinx
Copy link
Contributor

xkazinx commented Jan 23, 2025

  • Worked fully on environment setup for this issue:
    • Had an Ubuntu 24 installation where initializing apps stopped working.
    • Ended up doing a clean installation, and after updating, the same thing happening.
    • Tried to install it again, without updating, and when booting the pendrive, the PC would reboot automatically.
    • Went to Windows, installed some VMs on VMWare and it started to freeze after the OS initialization.
    • Reinstalled VMWare, and the VM started to throw initialization errors due to it not probably being reinstalled with errors.
    • Installed VirtualBox, with a VM, which worked, but I couldn't ping from host machine to the VM.
    • Finally, created an Ubuntu22 pendrive, installed the OS, installed VMWare and it seems to be working, which I have to finish configuring to continue.

@xkazinx
Copy link
Contributor

xkazinx commented Jan 23, 2025

  • After the Ubuntu 22 installation, an Ubuntu 24 VM with an indexer, manager and dashboard was installed, along with another VM with an agent, meant for development, where various issues appeared in the build process, where it now builds after solving them.
  • Environments are now ready for the further development of the issue.

@wazuhci wazuhci moved this from In review to In progress in XDR+SIEM/Release 5.0.0 Jan 24, 2025
@xkazinx
Copy link
Contributor

xkazinx commented Jan 24, 2025

  • A runtime error was encountered where the SSL certificates weren't functioning between the agent and the mock-server which, after an investigation, it was found that it was due to the config file not being read. After fixing it, the time taken for the process to finish after interrupting the execution varies a lot between different tests made by @LucioDonda @cborla and me:

  • Further investigation has to be made to determine why there is such a different time frame between the tests.

@LucioDonda
Copy link
Member

Based on the previous comment by @xkazinx and the analysis done, the lowest limit can have plenty of variation depending on the host, but IMO the stop signal can't cut Scan() process in the middle.

  • Based on the proposed solution:
        if (!m_stopping)
        {
            Scan();
        }

the m_stopping check is redundant because Scan() calls TryCatchTask that do this same check inside:

void Inventory::TryCatchTask(const std::function<void()>& task) const
{
    try
    {
        if (!m_stopping)
        {
            task();
        }
        else
        {
            LogTrace("No Scanning during stopping");
        }
    }

Regarding the comment of @jr0me this is what's actually happening and that's why we cannot stop the scan earlier.

c.c. @cborla

@LucioDonda
Copy link
Member

Update 27/01

  • By removing the mutex usage on the m_stopping variable the instant reaction to a ctrl+c can be achieved.
  • As far as I could test, some of the packages can be lost by stopping and starting the agent in the middle of the scan.
    • This should be double checked with some metrics, regarding how that behavior actually changes.

@xkazinx
Copy link
Contributor

xkazinx commented Jan 28, 2025

  • Added a benchmark class for testing, which accumulates how much time it takes to call certain functions, as well as their function call count. Since testing on my machine takes ~1 second, the branch will be tested by another developer(s) to determine what is taking >5 seconds to run on their machine.

    • Branch: 445-avoid-blocking-agent-stop-in-SyncLoop-benchmark
  • Next, it will be determined whether removing the mutex is a reliable solution as the above comment describes.

@cborla
Copy link
Member Author

cborla commented Jan 29, 2025

Working update

28/01/2025

The change eliminates unnecessary JSON serialization (dump()) and parsing (parse()), allowing direct processing of the nlohmann::json object.

SendDeltaEvent(const nlohmann::json& data)

2025-01-28T21:41:52.458Z: Benchmark for PACKAGES: took 16432ms in 1 calls
2025-01-28T21:41:52.458Z: Benchmark for PACKAGES_NORMALIZE: took 0ms in 2765 calls
2025-01-28T21:41:52.458Z: Benchmark for PACKAGES_REMOVE_EXCLUDED: took 0ms in 2765 calls
2025-01-28T21:41:52.458Z: Benchmark for PACKAGES_SYNCTXNROW: took 14449ms in 2765 calls
2025-01-28T21:41:52.458Z: Benchmark for PACKAGES_GETDELETEDROWS: took 1ms in 1 calls
2025-01-28T21:41:52.458Z: Benchmark for PROCESSES: took 1154ms in 1 calls
2025-01-28T21:41:52.458Z: Benchmark for PROCESSES_SYNCTXNROW: took 928ms in 278 calls
2025-01-28T21:41:52.458Z: Benchmark for PROCESSES_GETDELETEDROWS: took 0ms in 1 calls
2025-01-28T21:41:52.458Z: Benchmark for SENDDELTAEVENT: took 0ms in 3142 calls

SendDeltaEvent(const std::string& data)

2025-01-28T21:53:11.309Z: Benchmark for PACKAGES: took 19013ms in 1 calls
2025-01-28T21:53:11.309Z: Benchmark for PACKAGES_NORMALIZE: took 0ms in 2765 calls
2025-01-28T21:53:11.309Z: Benchmark for PACKAGES_REMOVE_EXCLUDED: took 0ms in 2765 calls
2025-01-28T21:53:11.309Z: Benchmark for PACKAGES_SYNCTXNROW: took 17034ms in 2765 calls
2025-01-28T21:53:11.309Z: Benchmark for PACKAGES_GETDELETEDROWS: took 1ms in 1 calls
2025-01-28T21:53:11.309Z: Benchmark for PROCESSES: took 1195ms in 1 calls
2025-01-28T21:53:11.309Z: Benchmark for PROCESSES_SYNCTXNROW: took 986ms in 278 calls
2025-01-28T21:53:11.309Z: Benchmark for PROCESSES_GETDELETEDROWS: took 0ms in 1 calls
2025-01-28T21:53:11.309Z: Benchmark for SENDDELTAEVENT: took 17ms in 3140 calls

Although the proposal is really an improvement in how the json object is handled, there is no noticeable time difference.

@wazuhci wazuhci moved this from In progress to Pending review in XDR+SIEM/Release 5.0.0 Jan 30, 2025
@wazuhci wazuhci moved this from Pending review to Done in XDR+SIEM/Release 5.0.0 Jan 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
level/task Task issue module/agent module/inventory Inventory module type/enhancement Enhancement issue
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants