Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add mechanism to invalidate dependent segment archives #22296

Open
wants to merge 8 commits into
base: 5.x-dev
Choose a base branch
from

Conversation

michalkleiner
Copy link
Contributor

@michalkleiner michalkleiner commented Jun 7, 2024

Description:

Add new Archiver method getDependentSegments to return a list of dependent segments, and update the existing getDependentSegmentsToArchive method to use it by default while being backward compatible (see e.g. Goals or Media Analytics archivers).

Add a mechanism to invalidate dependent segments to the ArchiveInvalidator class.

Adds code to invalidate dependent segment archives right before they should be archived (again).

Fixes #18772
Ref. DEV-14109

Massive thanks to @sgiehl for his help, guidance and a productive pairing session!

Review

@michalkleiner michalkleiner requested a review from a team June 7, 2024 11:58
@michalkleiner michalkleiner added this to the 5.2.0 milestone Jun 7, 2024
@sgiehl sgiehl force-pushed the dev-14109 branch 7 times, most recently from d04bb10 to a87fa95 Compare June 13, 2024 22:09
@tsteur
Copy link
Member

tsteur commented Jun 14, 2024

fyi just came across this PR randomly and saw the many code changes. You might have already thought about it, but wanted to mention it in case we haven't.

When we're archiving and processDependentArchive() is called, I believe we can be certain that the dependent archive has to be reprocessed too. It means in those cases we can maybe force to invalidate. We maybe want to test a change like below.

diff --git a/core/ArchiveProcessor.php b/core/ArchiveProcessor.php
index 19ae9c992a..7b1d06de8d 100644
--- a/core/ArchiveProcessor.php
+++ b/core/ArchiveProcessor.php
@@ -771,6 +771,10 @@ class ArchiveProcessor
 
         self::$isRootArchivingRequest = false;
         try {
+            $invalidator = StaticContainer::get('Piwik\Archive\ArchiveInvalidator');
+            $invalidator->markArchivesInvalidated($params->getIdSites(), [date...], $newSegment,
+                            $params->getPeriod()->getLabel() != 'range', $forceInvalidateNonexistentRanges, $plugin);
+
             $parameters = new ArchiveProcessor\Parameters($params->getSite(), $params->getPeriod(), $newSegment);
             $parameters->onlyArchiveRequestedPlugin();

Might be worth a test if we haven't checked that yet.

I believe this would also avoid any possible race conditions should multiple archivers be running concurrently etc.

@sgiehl
Copy link
Member

sgiehl commented Jun 14, 2024

Thanks @tsteur for pointing towards that. Doing the invalidation only in case it should be archived again makes everything a bit easier I guess. I will try to adjust the code and see if everything still works as expected.

@sgiehl
Copy link
Member

sgiehl commented Jun 14, 2024

Ok. So only adding the invalidation right before also doesn't work as expected. While the final result is fine, it actually will archive the dependent segment twice. The invalidation record, which is created during invalidation, isn't removed after processing the dependent segment, causing the archiver to invalidate the data right away and archive it again 🙈

Edit: Actually the data isn't processed again, as the segment hashes in the invalidation are not stored anywhere. Therefor the archiver will produce a lot of warnings as the segments can't be found. I'll add code to avoid adding invalidations in that case.

@sgiehl sgiehl force-pushed the dev-14109 branch 6 times, most recently from 65896a6 to 0818e12 Compare June 18, 2024 13:50
Copy link
Member

@sgiehl sgiehl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR should now be ready for a final review. I've changed the approach to Thomas suggestion, which I hadn't thought about before.
I've also added a couple of comments to make it easier to understand certain additional required changes.

) {
$plugin = null;
if ($name && strpos($name, '.') !== false) {
list($plugin) = explode('.', $name);
} elseif ($name) {
$plugin = $name;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a small fix to the plugin detection. When a name is provided, but doesn't contain a ., it is actually the plugin name only. So we should check that correctly.

<revenue_new_visit>160</revenue_new_visit>
<conversion_rate_new_visit>91.18%</conversion_rate_new_visit>
<conversion_rate_new_visit>94.12%</conversion_rate_new_visit>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is caused by invalidating the VisitsSummary archive, before archiving the goal metrics. As the goal archiver is using data from like nb_visits, it was using outdated data before.

@@ -191,16 +191,17 @@ public function testPluginOnlyArchivingDoesNotRelaunchChildArchivesWhenReusingAl
'date2' => '2020-01-20',
'period' => '1',
],
// archive 4 is missing as VisitsSummary is archived twice, as it doesn't contain data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is caused by some weird behavior of archiving. When archiving is triggered for anything, the first thing that is done is checking if core metrics (aka VisitsSummary) needs to be archived. This is done by checking if there is an archive that contains the visits metric. If this isn't the case VisitsSummary will be archived.
In case we have an archive that doesn't contain any visits, the archive will only contain a done flag, but no other metrics. This causes the archiving to be triggered again, which will create a new (empty) archive, while removing the previous one.
As archiving dependent segments first triggers archiving VisitsSummary, it creates an empty archive. Afterwards when archiving the Goals plugin it will archive VisitsSummary again, as the previous one is empty. Which then causes this missing archive id.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add some of the explanation from the GH comment as part of the code comment?

@@ -417,7 +422,7 @@ public function deleteOlderArchives(Parameters $params, $name, $tsArchived, $idA
$numericTable = ArchiveTableCreator::getNumericTable($dateStart);
$blobTable = ArchiveTableCreator::getBlobTable($dateStart);

$sql = "SELECT idarchive FROM `$numericTable` WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = ? AND name = ? AND ts_archived < ? AND idarchive < ?";
$sql = "SELECT idarchive FROM `$numericTable` WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = ? AND name = ? AND ts_archived <= ? AND idarchive < ?";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a small fix, which caused multiple archives to exist for the same parameters when they were created in the same second. (See update test)

@@ -743,24 +745,6 @@ public function getTestDataForArchiving()
'name' => 'nb_visits',
'value' => '1',
),
array (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those archives are now correctly removed even if they are created in the same second.

@sgiehl sgiehl requested a review from a team June 24, 2024 10:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dependent archives don't get invalidated when invalidating reports making them show outdated data
3 participants