HPCC-30365 Add XRef Sasha service to K8s #19639

Merged
ghalliday merged 1 commit into hpcc-systems:candidate-9.12.x from jackdelv:xrefContainerized
Apr 17, 2025

Conversation

Contributor

@jackdelv jackdelv commented Mar 19, 2025

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@jackdelv jackdelv requested a review from jakesmith March 19, 2025 13:20
@github-actions

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-30365

Jirabot Action Result:
Workflow Transition To: Merge Pending
Additional PR: #19639

Member

@jakesmith jakesmith left a comment

@jackdelv - I think changes look logical, but there are various problems with the changes to expandMask.
Could do with some unit tests added to prove that, for various inputs, it creates the expected output.

Comment thread dali/base/dautils.cpp Outdated
Comment thread dali/sasha/saxref.cpp Outdated
}
if (isDirPerPart) {
// MORE: Should maybe check this doesn't contain any subdirectories to make
// sure it is really a dirPerPart directory. Is an all numbers subdirectory valid in ecl?
Member

in practice yes, a scope must have a leading alpha char. So worth clarifying the comment.

Contributor Author

Clarified comment.

Comment thread dali/sasha/saxref.cpp Outdated
isDirPerPart = false;
}
if (isDirPerPart) {
// MORE: Should maybe check this doesn't contain any subdirectories to make
Member

can check pdir.dirs after the scanDirectory, should be empty.
Let's add a check and throw an exception if not.

Contributor Author

Added check (pdir->dirs.ordinality()>0)

Comment thread dali/sasha/saxref.cpp Outdated
unsigned maxMb = serverConfig->getPropInt("DfuXRef/@memoryLimit", DEFAULT_MAXMEMORY);
unsigned maxMb;
if (isContainerized()) {
const char *resourcedMemory = getComponentConfigSP()->queryProp("resources/@memory");
Member

now 'props' is saved, should use it instead of getComponentConfigSP() here.

Contributor Author

Fixed to use new props member.

Comment thread system/jlib/jutil.cpp Outdated
char next = *(s++);
if (next=='$') {
if (dirPerPart)
{
Member

trivial/formatting: Allman vs K&R

Contributor Author

Removed change.

Comment thread system/jlib/jutil.cpp Outdated
{
const char * start = buf.str();
const char * slash = start + buf.length();
while (slash > start && *slash != PATHSEPCHAR)
Member

'slash' at the start is potentially beyond the end of the allocated buffer I think.

Contributor Author

Reverted change.

Comment thread system/jlib/jutil.cpp Outdated
while (slash > start && *slash != PATHSEPCHAR)
slash--;
buf.insert(slash-start, PATHSEPCHAR);
buf.insert((slash+1)-start, p+1);
Member

not safe, buf may have reallocated after the 1st insert, meaning 'slash' points to freed memory.

Member

These inserts also don't look right in that they're inserting "/", but the mask is typically something like "myfilename._$P$_of_10", so it's going to end up with e.g. "/5myfilename._5_of_10" afaics.

If it did find a slash, it would prob be okay, but there isn't necessarily a full path so there's no guarantee there's a slash.

Member

Also, the mask can be "myfilename._$P$_of_$N$" (see 'N' handling below)
This insert code is going to be hit twice in that case, and therefore insert the dir-per-part directory twice...
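
For illustration only, the three concerns raised in this thread (a pointer left dangling by a reallocating insert, no guarantee of a path separator, and the directory being inserted more than once) can be avoided by working with offsets and doing a single insert per path. The sketch below uses `std::string` rather than the jlib `StringBuffer`, and `insertDirPerPart` is a hypothetical name, not a function in the code base. It also assumes the dir-per-part directory sits immediately in front of the file name.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: inserts a numeric dir-per-part directory in front of
// the file name held in 'path'. Works with offsets, never raw pointers, so a
// reallocation inside insert() cannot leave anything dangling.
std::string &insertDirPerPart(std::string &path, unsigned partNum, char sep = '/')
{
    assert(partNum >= 1);
    // Offset of the file name: just after the last separator, or 0 if the
    // path has no separator at all (a bare name mask).
    size_t lastSep = path.find_last_of(sep);
    size_t nameStart = (lastSep == std::string::npos) ? 0 : lastSep + 1;
    // One insert of "<partNum><sep>"; the caller invokes this once per path,
    // not once per mask token.
    path.insert(nameStart, std::to_string(partNum) + sep);
    return path;
}
```

Calling it on an already expanded name (after `$P$` and `$N$` have been substituted) keeps the directory insertion independent of how many tokens the mask contains.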

Contributor Author

Reverted changes to expandMask in favor of storing prefix and scope directory paths in cFileDesc and building up full paths in getPartName instead.

Comment thread system/jlib/jutil.cpp Outdated
StringBuffer &expandMask(StringBuffer &buf, const char *mask, unsigned p, unsigned n, unsigned stripeNum, bool dirPerPart)
{
if (stripeNum>0)
addPathSepChar(buf.append('d').append(stripeNum));
Member

hm, how is this stripe dir going to be in the right place?

Contributor Author

Removed changes to expandMask.

@jackdelv jackdelv requested a review from jakesmith March 20, 2025 20:48
Member

@jakesmith jakesmith left a comment

As discussed in our meeting, there are some issues with the way the info is being scanned and stored in the hierarchical structure built up during the physical file scan and used subsequently.
Currently too much information is being stored per cFileDesc - as before, the file path shouldn't be needed, only the deduced filename mask.
Lost files are probably being misreported (at least during initial phase), because the paths they walk do not match the physical representation on disk (due to striping and dir-per-part directories).

saxref should:

  • ensure runXRef is dealing with 1 plane at a time (i.e. if multiple selected, process 1 at a time).
  • store plane details (IPT) so accessible to other xref tasks during scan.
  • during scanDirectories, if striped, detect when at stripe level, and 'skip'
  • detect dir-per-part directory and 'skip'. But, check after recursing the dir, that contains no subdirs.
  • Keep the filename mask only in cFileDesc
  • Add scope/lfn to listOrphans. Build up in same way as 'basedir' is now.
    Think we can get rid of baseDir altogether.
  • during scanLogicalFiles, parse file paths fetched from parts (add helper func), to remove stripe and dir-per-part, so can marry the pathing with the cDirDesc tree.
    NB: LATER- may be better to assume that the cDirDesc is a representation of scopes, and walk scopes of logical file (+part endpoints), instead of getting part directories.
  • In listOrphans(cFileDesc), deduce file path from lfn, partNum and plane. Add a utility func that deduces and uses stripe num and dir-per-part if relevant (based on plane details).
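
As a rough sketch of the last bullet, a path could be assembled from plane details, scope directories, part number and stripe number along the following lines. `PlaneLayout` and `buildPhysicalPartPath` are hypothetical names introduced here, the layout (stripe directory directly under the prefix, dir-per-part directory directly above the file) is an assumption drawn from the review comments, and the stripe number is taken as an input because its derivation is not covered in this conversation.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified view of the plane details the scan would hold.
struct PlaneLayout
{
    std::string prefix;   // e.g. "/var/lib/HPCCSystems/hpcc-data"
    bool striped;         // true if files live under d<stripeNum> directories
};

// Sketch: prefix / [d<stripe>] / scope dirs / [partNum] / part file name.
// 'scopePath' is the lfn scopes already converted to a relative directory
// (e.g. "myscope/sub"); 'partFileName' is the already expanded name mask.
std::string buildPhysicalPartPath(const PlaneLayout &plane, const std::string &scopePath,
                                  const std::string &partFileName, unsigned partNum,
                                  unsigned stripeNum, bool dirPerPart)
{
    assert(partNum >= 1);
    assert(!plane.striped || stripeNum >= 1);
    std::string path = plane.prefix;
    // Appends one path component, adding a separator only when needed.
    auto addDir = [&path](const std::string &dir)
    {
        if (dir.empty())
            return;
        if (!path.empty() && path.back() != '/')
            path += '/';
        path += dir;
    };
    if (plane.striped)
        addDir("d" + std::to_string(stripeNum));
    addDir(scopePath);
    if (dirPerPart)
        addDir(std::to_string(partNum));
    addDir(partFileName);
    return path;
}
```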

@jackdelv

Comment thread dali/base/dautils.cpp Outdated
if (mname.isEmpty())
// Assume that if prefix is passed in a match is required
if (prefix && prefix->isEmpty())
throw makeStringExceptionV(-1, "Could not find matching prefix in plane definition for file %s", filename);
Member

I think throwing an exception here, will mean 1 path that fails to match, will cause the whole xref process to fail?

True of any exception in parseFilename. NB: addFile only issues warnings.

addFile should likely have a try/catch and issue warnings in case of any exception in parseFilename.
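
A minimal sketch of that suggestion, assuming nothing about the real `addFile` or `parseFileName` signatures: each file is parsed inside its own try/catch, so a failure becomes a warning and the remaining files are still processed. `parseMaskFromName` and `addFiles` are hypothetical stand-ins written for this example.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical parser standing in for parseFileName: throws on names it
// cannot interpret (here: anything without the "._" part separator).
std::string parseMaskFromName(const std::string &filename)
{
    size_t pos = filename.rfind("._");
    if (pos == std::string::npos)
        throw std::runtime_error("Unrecognised part file name: " + filename);
    return filename.substr(0, pos);
}

// Sketch of an addFile-style loop: a failure on one file is recorded as a
// warning and the scan carries on with the remaining files.
std::vector<std::string> addFiles(const std::vector<std::string> &filenames,
                                  std::vector<std::string> &warnings)
{
    size_t warnedBefore = warnings.size();
    std::vector<std::string> added;
    for (const std::string &filename : filenames)
    {
        try
        {
            added.push_back(parseMaskFromName(filename));
        }
        catch (const std::exception &e)
        {
            warnings.push_back(e.what());
        }
    }
    // Every input is either added or reported, never silently dropped.
    assert(added.size() + (warnings.size() - warnedBefore) == filenames.size());
    return added;
}
```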

Member

There's something that doesn't make sense here in fact I think..

How would the files not match prefix?
And, why is it scanning planes for every file being added, to determine which plane the prefix path is in?
Given, we are specifically xref'ing a given plane, and therefore start at the prefix path..
This relates to HPCC-33151.

Let's discuss.

Contributor Author

Instead of matching the prefix for each file, the XRefManager stores the current storage plane.

Comment thread dali/base/dautils.cpp Outdated
mname.append((d+1)-name, name).append(cur-(tailSlash+1), tailSlash+1);
if (dirs)
dirs->append((d+1)-name, name);
mname.append(cur-(tailSlash+1), tailSlash+1);
Member

common with line 3881, could go outside if/else

Contributor Author

Removed use of parseFileName from addfile. Now the striped and dir-per-part status are deduced from the directory structure while building up cDirDescs in scanDirectory and passed to recursive calls.

Comment thread dali/sasha/saxref.cpp Outdated

StringBuffer &getPartName(StringBuffer &buf,unsigned p)
{
// In baremetal, buf can be prepoulated with replicate directory
Member

The old code was:

    StringBuffer &getPartName(StringBuffer &buf,unsigned p)
    {
        StringBuffer mask;
        getName(mask);
        return expandMask(buf, mask, p, N);
    }

If buf is prepopulated, didn't that mean that as it was, it added the whole path again?

Member

This feels like it's masking a problem, rather than the correct fix.

Something not quite right here..
the cFileDesc (and cDirDesc) contain the name only, not a path.

getPartName should be returning an expanded form of the name mask only, as it was before.
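
To make "an expanded form of the name mask only" concrete, here is a small stand-alone sketch of mask expansion. It is not the jlib `expandMask`; `expandPartMask` is a hypothetical function that substitutes only the `$P$` and `$N$` tokens mentioned in this review, and it takes a 1-based part number as an assumption of the example.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a mask expander: replaces every "$P$" with the
// 1-based part number and every "$N$" with the total number of parts.
// Any other text, including a lone '$', is copied through unchanged.
std::string expandPartMask(const std::string &mask, unsigned partNum, unsigned numParts)
{
    assert(partNum >= 1 && partNum <= numParts);
    std::string out;
    for (size_t i = 0; i < mask.size(); )
    {
        if (mask.compare(i, 3, "$P$") == 0)
        {
            out += std::to_string(partNum);
            i += 3;
        }
        else if (mask.compare(i, 3, "$N$") == 0)
        {
            out += std::to_string(numParts);
            i += 3;
        }
        else
            out += mask[i++];
    }
    return out;
}
```

The result is a file name with no directory component, which is what the comment above asks getPartName to return.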

Contributor Author

Changed cFileDesc to contain just the file name.

@jackdelv jackdelv force-pushed the xrefContainerized branch from 71efafe to 1edf7ce Compare April 2, 2025 14:29
Contributor Author

jackdelv commented Apr 2, 2025

@jakesmith I made the changes to remove adding path and scope to cFileDesc and instead pass the plane information and striped and dir-per-part statuses down to cFileDesc::create. This avoids having to match the prefix and look up storage planes for every file.

It seems in scanLogicalFiles there is more wrong than just not being able to include stripe and dir-per-part numbers. It is trying to match the current directory with the root. In bare-metal the root prefix looks like /var/lib/HPCCSystems/hpcc-data/thor based on which cluster is currently running xref. This will always fail to find files in directories for other components e.g. it will always say files under /var/lib/HPCCSystems/hpcc-data/roxie are potential lost files if doing the search on thor. In containerized this doesn't happen since the prefix won't contain the component name at the tail and will instead be /var/lib/HPCCSystems/hpcc-data. In this case the missing stripeNum and dir-per-part nums do cause an issue.

Other than that, my tests for found/orphan files are working and everything is showing up correctly in results on bm and containerized versions.
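
The bare-metal root mismatch described above can be reproduced with a simple component-wise prefix test. This is only a sketch: `isUnderRoot` is a hypothetical helper, not the comparison scanLogicalFiles actually performs.

```cpp
#include <cassert>
#include <string>

// Returns true if 'dir' is 'root' itself or lies beneath it, matching on
// whole path components so "/data/thor2" is not treated as under "/data/thor".
bool isUnderRoot(const std::string &dir, const std::string &root)
{
    if (dir.compare(0, root.size(), root) != 0)
        return false;
    return dir.size() == root.size() || dir[root.size()] == '/' ||
           (!root.empty() && root.back() == '/');
}
```

With a root of `/var/lib/HPCCSystems/hpcc-data/thor`, a directory under `/var/lib/HPCCSystems/hpcc-data/roxie` never matches, whereas the containerized root `/var/lib/HPCCSystems/hpcc-data` matches both.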

@jackdelv jackdelv requested a review from jakesmith April 2, 2025 14:39
@AttilaVamos
Contributor

@jackdelv There are some Unit test errors. See: https://github.com/hpcc-systems/HPCC-Platform/actions/runs/14222208906/job/39855851609?pr=19639
(I need to fix that test to report this kind of errors. I opened a JIRA to do that.)

@jakesmith
Member

> @jackdelv There are some Unit test errors. See: https://github.com/hpcc-systems/HPCC-Platform/actions/runs/14222208906/job/39855851609?pr=19639 (I need to fix that test to report this kind of errors. I opened a JIRA to do that.)

@streeterd - hit similar problems when testing https://hpccsystems.atlassian.net/browse/HPCC-27051.
He has some pending changes to make to the k8s github action, which provide more insight when a deployment fails like this. See https://hpccsystems.atlassian.net/issues/HPCC-33739

Contributor Author

jackdelv commented Apr 3, 2025

@AttilaVamos @jakesmith I fixed the issue with the DFUXrefLibTests. I had made some changes to parseFileName and didn't notice the tests were failing. If the unittests are reporting errors, why is the github action being marked as "Succeeded"? (I just saw your comment about opening a jira. Disregard.) Also, it seems like the DFUAccessTests failures are occurring in other PRs. Should I ignore them?

Contributor Author

jackdelv commented Apr 3, 2025

@jakesmith I noticed another error is occurring that I have never seen before.

Waiting for deployment "sasha-xref" rollout to finish: 0 of 1 updated replicas are available...
error: timed out waiting for the condition

In the test: https://github.com/hpcc-systems/HPCC-Platform/actions/runs/14222208906/job/39856041969

This probably has something to do with how I instantiated the sasha xref service, but I'm not sure what I did wrong.

Member

@jakesmith jakesmith left a comment

@jackdelv - please see feedback.

Comment thread dali/sasha/saxref.cpp Outdated
Comment thread dali/sasha/saxref.cpp Outdated
f->getPartName(path,pn);
RemoteFilename rfn;
rfn.setPath(grp->queryNode((pn+drv)%numnodes).endpoint(),path.str());
rfn.setPath(grp->queryNode((pn+drv)%numnodes).endpoint(),getPhysicalPartName(path,storagePlane,lfn.get(), pn, f->N, f->isDirPerPart, f->isStriped).str());
Member

quite a long, hard to read line, would benefit from the getPhysicalPartName call being on its own line.

But, can you check if constructPartFilename (dadfs.hpp) doesn't do what's required already?
I think it also covers the replicate case (in BM case only).

Contributor Author

Moved getPhysicalPartName to its own line.

Contributor Author

It looks like constructPartFilename does build up the correct physical part name, and all the information it needs is available in listOrphans. However, right before constructPartFilename returns the finished string, it lowercases the return buffer which causes checkOrphanPhysicalFile to not find the file.

I'm not sure why it does this, but I didn't want to change since it is probably relied on for other uses.

Member

@jakesmith jakesmith Apr 9, 2025

> I'm not sure why it does this, but I didn't want to change since it is probably relied on for other uses.

I think it's because all logical files are caseless by definition.. e.g. a::b::c is same as A::B::C - physical files should be lower cased in DFS.

> which causes checkOrphanPhysicalFile to not find the file.

it failed because there was a physical file with mixed case? (there shouldn't be)

Contributor Author

Changed to use constructPartFilename and removed lower casing from the function.

Comment thread dali/sasha/saxref.cpp Outdated
Comment thread dali/sasha/saxref.cpp Outdated
Comment thread dali/sasha/saxref.cpp Outdated
if (!scanDirectory(node,ep,path,drv,pdir,NULL,!isStriped,isStriped))
return false;
if (!isStriped && pdir->dirs.ordinality()>0)
throw makeStringExceptionV(-1, LOGPFX "Directory Per Part %s contains other subdirectories.", path.str());
Member

need to be careful that 1 malformed file, doesn't cause all efforts to xref to be unrunnable.
Should report and skip, but not abort completely.

Contributor Author

Changed to log an error rather than throw an exception.

Member

See other comment, we need to revisit these rejections, they should appear in the orphaned report I think (needs JIRA to be done after this one).

Contributor Author

Currently, it is calling OERRLOG to report that the dir per part contains sub dirs. Is there a better way to log what is happening?

Comment thread dali/sasha/saxref.cpp Outdated
// sure it is really a dirPerPart directory. Is an all numbers subdirectory valid in ecl?
if (!scanDirectory(node,ep,path,drv,pdir,NULL))
if (isSpecialDir) {
// Files inside a special directory should be scanned with the current directory as the parent
Member

can you expand on this. Explaining why structure is being built up without stripe and dir-per-part cDirDesc's

Contributor Author

Added more comments.

Member

See my comment.

Comment thread dali/sasha/saxref.cpp Outdated
if (!scanDirectory(node,ep,path,drv,pdir,NULL))
if (isSpecialDir) {
// Files inside a special directory should be scanned with the current directory as the parent
if (!scanDirectory(node,ep,path,drv,pdir,NULL,!isStriped,isStriped))
Member

I think isStriped shouldn't be a recursively passed property.
We know whether files should be in stripe directories or not from the outset (from the plane details).
scanDirectories should check that the 1st level is a stripe directory format, skip it like this if it is, report an error if not and ignore directory.

Contributor Author

I'm not sure how to tell if there should be a stripe directory from the plane details.

I was trying to use numDevices, but in bare-metal I have a test file that does not have a stripeDir despite numDevices>1 being in the plane info.

Also, if there is something in the plane details saying there should be stripeDir, does this apply to all files in the plane i.e. could there be a plane where some files are striped on it and some aren't?

Member

> I was trying to use numDevices, but in bare-metal I have a test file that does not have a stripeDir despite numDevices>1 being in the plane info.

If a plane has numDevices and it is not a 'hostGroup' (!plane->hasProp("@hostGroup")) then it is striped.
@hostGroup planes have a list of planes, although supported in containerized, they are generally going to be BM planes mapped over a set of hosts.

> Also, if there is something in the plane details saying there should be stripeDir, does this apply to all files in the plane i.e. could there be a plane where some files are striped on it and some aren't?

No, if a plane implies striped, all files must be. It's because the plane defines the storage, the lfn's only know their layout based on the plane.
You could imagine a different scenario where the lfn's stored more detail - that might imply that the underlying original plane could change, and the existing lfn's remained associated with the old info, but that is not the design. Planes essentially need to be thought of as immutable; if a different storage configuration were needed, then a new plane would need defining, and e.g. set as default, and pre-existing files would either remain on the old plane, or would need migrating.
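
Read literally, together with the later note that the device count check should be `>1`, the rule amounts to a one-line predicate. The sketch below uses a hypothetical flattened `PlaneInfo` struct in place of the plane property tree; the field names and the default of 1 for an unset device count are assumptions of the example.

```cpp
#include <cassert>
#include <string>

// Hypothetical, flattened view of a plane definition; the real code reads
// these values from a property tree.
struct PlaneInfo
{
    unsigned numDevices = 1;   // assumed to be 1 when the plane does not set it
    std::string hostGroup;     // empty if the plane has no hostGroup
};

// A plane is treated as striped when it has more than one device and is not
// a hostGroup plane. The answer is per plane, never per file.
bool isPlaneStriped(const PlaneInfo &plane)
{
    return plane.numDevices > 1 && plane.hostGroup.empty();
}
```

Because the result depends only on the plane, it can be computed once at the start of a scan instead of being stored on each file.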

Member

So:

  1. isStriped should be deduced at the start of the scan (NB: dirPerPart could vary per file).
  2. It should be removed as a member of cFileDesc
  3. scanDirectory should assert that the top-level directory looks like a stripe directory, and reject the files if not.

NB: TBD (open a new sub JIRA place) - rejecting files both for this and other places, e.g. when "Directory Per Part %s contains other subdirectories.", should probably not short-circuit and ignore them, but add them, such that they are picked up as orphans, or misplaced.

Contributor Author

Removed isStriped from cFileDesc and added a check to scanDirectories to log an error when expected stripe directories are not found in the toplevel.
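
One way to picture the resulting check, as a sketch rather than the code in saxref.cpp: directory names are classified using the plane-level striped flag plus the naming rules mentioned in this review (stripe directories are `d` followed by a number, dir-per-part directories are all digits, scopes start with an alphabetic character). `DirKind` and `classifyDir` are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <string>

enum class DirKind { Scope, Stripe, DirPerPart };

// True if s[from..] is non-empty and consists only of decimal digits.
static bool allDigits(const std::string &s, size_t from)
{
    if (from >= s.size())
        return false;
    for (size_t i = from; i < s.size(); i++)
        if (!std::isdigit(static_cast<unsigned char>(s[i])))
            return false;
    return true;
}

// Sketch: only the top level of a striped plane can hold stripe directories;
// an all-digit name is taken to be a dir-per-part directory, which cannot
// clash with a scope because scopes start with an alphabetic character.
DirKind classifyDir(const std::string &name, bool topLevel, bool planeStriped)
{
    assert(!name.empty());
    if (topLevel && planeStriped && name[0] == 'd' && allDigits(name, 1))
        return DirKind::Stripe;
    if (allDigits(name, 0))
        return DirKind::DirPerPart;
    return DirKind::Scope;
}
```

A scan built on this could log an error whenever a top-level directory of a striped plane is not classified as `Stripe`, which is the behaviour the reply above describes.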

Comment thread dali/sasha/saxref.cpp Outdated
StringBuffer path;
if (drv)
setReplicateFilename(path,drv);
setReplicateFilename(addPathSepChar(path.append(basedir)),drv);
Member

may be able to use constructPartFilename vs getPhysicalPartName and not special case here.

Contributor Author

If constructPartFilename didn't return a lowercased string, then it could be used here to avoid the special case.

Member

Why did lower casing cause issues? (see other comment).

Contributor Author

Removed lower casing in constructPartFilename because it was lower casing the prefix along with the filename.
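
The PR resolved this by removing the lower casing altogether. Purely as an illustration of the underlying problem, a variant that lower-cases only the portion after the plane prefix would leave a mixed-case prefix such as `/var/lib/HPCCSystems` intact. `lowerCaseAfterPrefix` is a hypothetical function and is not part of constructPartFilename.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Sketch: lower-cases only the portion of 'path' that follows the plane
// prefix, leaving the prefix itself untouched.
std::string lowerCaseAfterPrefix(const std::string &path, const std::string &prefix)
{
    // Precondition for this sketch: 'path' really does start with 'prefix'.
    assert(path.compare(0, prefix.size(), prefix) == 0);
    std::string out = path;
    for (size_t i = prefix.size(); i < out.size(); i++)
        out[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(out[i])));
    return out;
}
```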

@jackdelv jackdelv requested a review from jakesmith April 3, 2025 21:19
Member

@jakesmith jakesmith left a comment

@jackdelv - changes look sensible. Please see follow up comments.

Comment thread dali/sasha/saxref.cpp Outdated
Comment thread dali/sasha/saxref.cpp Outdated
if (!scanDirectory(node,ep,path,drv,pdir,NULL))
if (isSpecialDir) {
// Files inside a special directory should be scanned with the current directory as the parent
if (!scanDirectory(node,ep,path,drv,pdir,NULL,!isStriped,isStriped))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to use numDevices, but in bare-metal I have a test file that does not have a stripeDir despite numDevices>1 being in the plane info.

If a plane has numDevices and it is not a 'hostGroup' (!plane->hasProp("@HoStGroup")) then it is striped.
@HoStGroup planes have a list of planes, although supported in containerized, they are generally going to be BM planes mapped over a set of hosts.

Also, if there is something in the plane details saying there should be stripeDir, does this apply to all files in the plane i.e. could there be a plane where some files are striped on it and some aren't?

No, if a plane implies striped, all files must be. It's because the plane defined the storage, the lfn's only know they layout based on the plane.
You could imagine a different scenario where the lfn's stored more detail - that might imply that the underlying original plane could change, and the existing lfn's remained associated with the old info, but that is not the design. Plane essentially need to be thought of as immutable, if different storage configuration were needed, then a new plane would need defining, and e.g. set as default, and pre-existing files would either remain on the old plane, or would need migrating.

Comment thread dali/sasha/saxref.cpp
Comment thread dali/sasha/saxref.cpp Outdated
if (!scanDirectory(node,ep,path,drv,pdir,NULL))
if (isSpecialDir) {
// Files inside a special directory should be scanned with the current directory as the parent
if (!scanDirectory(node,ep,path,drv,pdir,NULL,!isStriped,isStriped))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So:

  1. isStriped should be deduced at the start of the scan (NB: dirPerPart could vary per file).
  2. Should be remove as a member of cFileDesc
  3. scanDirectory should assert that the top-level directory looks like a strip-directory, and reject the files if not.

NB: TBD (open a new sub JIRA place) - rejecting files both for this and other places, e.g. when "Directory Per Part %s contains other subdirectories.", should probably not short-circuit and ignore them, but add them, such that they are picked up as orphans, or misplaced.

Comment thread dali/sasha/saxref.cpp Outdated
f->getPartName(path,pn);
RemoteFilename rfn;
rfn.setPath(grp->queryNode((pn+drv)%numnodes).endpoint(),path.str());
rfn.setPath(grp->queryNode((pn+drv)%numnodes).endpoint(),getPhysicalPartName(path,storagePlane,lfn.get(), pn, f->N, f->isDirPerPart, f->isStriped).str());
Copy link
Copy Markdown
Member

@jakesmith jakesmith Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why it does this, but I didn't want to change since it is probably relied on for other uses.

I think it's because all logical files are caseless by definition.. e.g. a::b::c is same as A::B::C - physical files should be lower cased in DFS.

which causes checkOrphanPhysicalFile to not find the file.

it failed because there was a physical file with mixed case? (there shouldn't be)

Comment thread dali/sasha/saxref.cpp Outdated
if (!scanDirectory(node,ep,path,drv,pdir,NULL,!isStriped,isStriped))
return false;
if (!isStriped && pdir->dirs.ordinality()>0)
throw makeStringExceptionV(-1, LOGPFX "Directory Per Part %s contains other subdirectories.", path.str());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other comment, we need to revisit these rejections, they should appear in the orphaned report I think (needs JIRA to be done after this one).

Comment thread dali/sasha/saxref.cpp Outdated
// sure it is really a dirPerPart directory. Is an all numbers subdirectory valid in ecl?
if (!scanDirectory(node,ep,path,drv,pdir,NULL))
if (isSpecialDir) {
// Files inside a special directory should be scanned with the current directory as the parent
Member

See my comment.

Comment thread dali/sasha/saxref.cpp Outdated
StringBuffer path;
if (drv)
setReplicateFilename(path,drv);
setReplicateFilename(addPathSepChar(path.append(basedir)),drv);
Member

Why did lower casing cause issues? (see other comment).

@jackdelv
Contributor Author

jackdelv commented Apr 9, 2025

@jakesmith Removing the call to StringBuffer::lowercase in constructPartFilename provided the expected physical filename, so I have removed getPhysicalPartName. I removed isStriped from the parameters of scanDirectory and the cFileDesc member since it can be deduced from the plane info.

I've taken note of the two JIRAs you have mentioned. Am I missing anything?

  • Revisit errors/exceptions being thrown to ensure XRef keeps running even if a file causes an error
  • Reject Files not under stripe directory if plane details show directories are striped and subdirectories under dir-per-part directories.

@jackdelv jackdelv requested a review from jakesmith April 9, 2025 20:04
Member

@jakesmith jakesmith left a comment

@jackdelv - please see comments.

Comment thread dali/sasha/saxref.cpp Outdated
Comment thread dali/sasha/saxref.cpp Outdated
Comment thread dali/sasha/saxref.cpp Outdated
if (!isdigit(*(dir++)))
isSpecialDir = false;
}
// If numDevices>0, check that top-level directories are striped
Member

should be >1 (see earlier comments)

Comment thread dali/sasha/saxref.cpp
Comment thread dali/sasha/saxref.cpp Outdated
Comment thread dali/sasha/saxref.cpp Outdated
@jackdelv jackdelv requested a review from jakesmith April 10, 2025 14:51
Member

@jakesmith jakesmith left a comment

@jackdelv - 1 issue re. "d" still I think + 2 trivial comments.

Comment thread dali/sasha/saxref.cpp Outdated
if (plane)
{
storagePlane.set(plane);
isPlaneStriped = !storagePlane->hasProp("@hostGroup") && (storagePlane->getPropInt("@numDevices", 1)>1);
Member

trivial: no point in the default value in this context.

Comment thread dali/sasha/saxref.cpp Outdated
unsigned numuniqnodes = 0;
Owned<IUserDescriptor> udesc;
Linked<IPropertyTree> storagePlane;
bool isPlaneStriped;
Member

trivial: even though initialized in the ctor, it's good practice for all members (for new code at least) to have member initializers. It helps if the code is ever refactored etc., and the compiler is usually smart enough not to initialize twice anyway.
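
A small sketch of the member-initializer practice mentioned here. `PlaneScanState` is a hypothetical stand-in class, not the real saxref type, and its member names are only loosely modelled on the diff:

```cpp
// Hypothetical stand-in for a class holding plane state; not the real saxref class.
class PlaneScanState
{
public:
    PlaneScanState() = default;

    // The constructor still sets the values it knows about; the default member
    // initializers below cover any constructor that does not.
    explicit PlaneScanState(unsigned devices)
        : numDevices(devices), isPlaneStriped(devices > 1)
    {
    }

    unsigned queryNumDevices() const { return numDevices; }
    bool queryStriped() const { return isPlaneStriped; }

private:
    unsigned numDevices = 1;      // default member initializer
    bool isPlaneStriped = false;  // never left indeterminate, whichever ctor runs
};
```

If a new constructor is added later and forgets to set `isPlaneStriped`, the member still has a defined value.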

Comment thread dali/sasha/saxref.cpp
Comment thread dali/sasha/saxref.cpp Outdated
}

StringBuffer &getName(StringBuffer &buf)
bool getName(StringBuffer &buf)
Member

trivial/not new: should really be a const method (as should some of the other get members below)

@jakesmith
Member

I've taken note of the two JIRAs you have mentioned. Am I missing anything?

Revisit errors/exceptions being thrown to ensure XRef keeps running even if a file causes an error
Reject Files not under stripe directory if plane details show directories are striped and subdirectories under dir-per-part directories.

Not that I can recall. Please open sub-JIRAs under the parent, if you haven't already.

@jackdelv
Contributor Author

@jakesmith Fixed the trivial items. I added a check to the initializer for isDirStriped that dir[1] != '\0'. It seemed to be the best way to fix it, but is there a better way? Can the whole stripe/dirperpart check be refactored in a cleaner way?

@jackdelv jackdelv requested a review from jakesmith April 10, 2025 16:16
Member

@jakesmith jakesmith left a comment

@jakesmith Fixed the trivial items. I added a check to the initializer for isDirStriped that dir[1] != '\0'. It seemed to be the best way to fix it, but is there a better way? Can the whole stripe/dirperpart check be refactored in a cleaner way?

I think it works as is.
I'd probably have introduced a isRestDigits inline helper and code like:

                bool isDirStriped = false; // format 'd[0-9]+'
                bool isSpecialDir = false; // isDirStriped || dirPerPart format '[0-9]+'
                if (*dir == 'd')
                {
                    dir++;
                    isDirStriped = (*dir != '\0') && isRestDigits(dir);
                    isSpecialDir = isDirStriped;
                }
                else
                    isSpecialDir = isRestDigits(dir);

untested!
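
A self-contained version of the suggestion above, as a sketch only. `isRestDigits` is named in the comment but not defined there, so the body below is an assumption. `DirKind` and `classifyDir` are hypothetical wrappers added so the logic can be exercised on its own. Here `isRestDigits` returns false for an empty string, which folds the `*dir != '\0'` check into the helper:

```cpp
#include <cctype>

// Assumed implementation: true only if 's' is non-empty and all decimal digits.
static inline bool isRestDigits(const char *s)
{
    if (!s || !*s)
        return false;
    for (; *s; ++s)
    {
        if (!std::isdigit(static_cast<unsigned char>(*s)))
            return false;
    }
    return true;
}

// Hypothetical result type for illustration.
struct DirKind
{
    bool isDirStriped = false; // format 'd[0-9]+'
    bool isSpecialDir = false; // isDirStriped || dirPerPart format '[0-9]+'
};

// Hypothetical wrapper around the logic sketched in the review comment.
static DirKind classifyDir(const char *dir)
{
    DirKind kind;
    if (*dir == 'd')
    {
        // A bare "d" is not a stripe directory: at least one digit must follow.
        kind.isDirStriped = isRestDigits(dir + 1);
        kind.isSpecialDir = kind.isDirStriped;
    }
    else
        kind.isSpecialDir = isRestDigits(dir);
    return kind;
}
```

For example, `classifyDir("d12")` reports a stripe directory, `classifyDir("42")` reports a dir-per-part style directory, and `classifyDir("d")` or `classifyDir("data")` report neither.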

@jackdelv - looks good I think. Please squash, and I'll hopefully do a final review in the morning.

Comment thread dali/sasha/saxref.cpp
@jackdelv jackdelv force-pushed the xrefContainerized branch from f989f89 to d9e01f4 Compare April 10, 2025 18:31
@jackdelv jackdelv requested a review from jakesmith April 10, 2025 18:33
Member

@jakesmith jakesmith left a comment

@jackdelv - a few final comments.

Comment thread dali/base/dadfs.cpp Outdated
makePhysicalPartName(lname, partNo+1, max, fullname, 0, DFD_OSdefault, prefix, dirPerPart, stripeNum);

// revisit: constructPartFilename should be refactored not to deal with replicate directories, by pre-determining the alternate prefix if copy>0
// If copy>0 it could do calPartLocation, find the replicate plane, get it's prefix, and pass to makePhysicalPartName
Member

trivial: typo calcPartLocation

Comment thread dali/base/dafdesc.cpp Outdated
IPropertyTree * plane = storage->addPropTree("planes");

// Revisit: Ignore hthor planes when running XRef on storage planes
if (groupType == grp_hthor)
Member

I'm 99% sure these group types only exist in BM/not created in containerized.
Comment should be updated.

Comment thread dali/base/dautils.cpp Outdated
#endif

// takes filename and creates mask filename with $P$ extension
extern da_decl void parseFileName(const char *name,StringBuffer &mname,unsigned &num,unsigned &max,unsigned &stripeNum,unsigned &dirPerPart,bool &replicate)
Member

it would be good to move this back into dfuxreflib.cpp where it was, as not shared anymore.

Comment thread dali/datest/datest.cpp Outdated
for (unsigned m=0; m<n; m++) {
RemoteFilename rfn;
constructPartFilename(group,m+1,n,NULL,partmask.str(),dir.str(),false,1,rfn);
constructPartFilename(group,m+1,1,n,0,0,false,"",dir.str(),partmask.str(),0,rfn);
Member

shouldn't the 3rd parameter be 0? (copy=0)

Comment thread dali/sasha/saxref.cpp Outdated
ndone[0] = 0;
ndone[1] = 0;
for (drv=0;drv<2;drv++) {
unsigned numDevices = storagePlane->hasProp("@HostGroup") ? 1 : storagePlane->getPropInt("@numDevices", 1);
Member

should be @hostGroup

Member

.. but would make sense to add numDevices as a member and set at same time as isPlaneStriped
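
A sketch of computing both values in one place, as suggested. It uses a `std::map` as a hypothetical stand-in for the plane's property tree, so `PlaneAttrs`, `PlaneInfo` and `readPlaneInfo` are illustrative names and not HPCC APIs. The map lookup is case-sensitive, which is why a misspelt `@HostGroup` key would silently fail to match in this stand-in:

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the plane's attributes (not IPropertyTree).
using PlaneAttrs = std::map<std::string, std::string>;

// Both values are derived together so they cannot drift apart.
struct PlaneInfo
{
    unsigned numDevices = 1;
    bool isPlaneStriped = false;
};

static PlaneInfo readPlaneInfo(const PlaneAttrs &plane)
{
    PlaneInfo info;
    // Exact-match lookup: "@HostGroup" would not find an "@hostGroup" entry.
    bool hasHostGroup = plane.count("@hostGroup") != 0;
    auto it = plane.find("@numDevices");
    // Planes with a host group are treated as having a single device.
    if (!hasHostGroup && it != plane.end())
        info.numDevices = static_cast<unsigned>(std::stoul(it->second));
    info.isPlaneStriped = info.numDevices > 1;
    return info;
}
```

Callers would then read `numDevices` and `isPlaneStriped` from the same object rather than re-deriving one of them inside the scan loop.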

Comment thread dali/sasha/saxref.cpp
Comment thread testing/unittests/dalitests.cpp Outdated
ClusterPartDiskMapSpec mspec;
Owned<IGroup> grp = createIGroup("10.150.10.1-3");
RemoteFilename rfn;
IStoragePlane *plane = getDataStoragePlane("mystorageplane", true);
Member

leaks, but also not used - this line should be deleted.

@jackdelv jackdelv requested a review from jakesmith April 11, 2025 14:58
Member

@jakesmith jakesmith left a comment

@jackdelv - looks good. Please squash, and I'll give it one more pass hopefully.

@jackdelv jackdelv force-pushed the xrefContainerized branch from 8f9269d to d0ef745 Compare April 15, 2025 13:09
@jackdelv
Contributor Author

@jakesmith Squashed.

@jackdelv jackdelv requested a review from jakesmith April 15, 2025 13:10
Member

@jakesmith jakesmith left a comment

@jackdelv - looks good.

As discussed, the dfuxreflib should likely be deprecated altogether. It is not the default or ever used in BM afaik (saxref is the standard setup).
It would be good to assert that it is never used in k8s, and remove changes to it in this PR.

Comment thread dali/dfuXRefLib/dfuxreflib.cpp Outdated
if (!plane)
throw makeStringExceptionV(0, "Plane definition \"%s\" is missing for File", grpname.str());
unsigned numDevices = plane->numDevices();
bool r = replicate?1:0;
Member

should be unsigned r =

@jackdelv
Contributor Author

@jakesmith Removed the dfuXRefLib changes. Since the constructPartFilename function was updated and is used in saxref, I changed dfuXRefLib to use the now deprecated version. I also added an assert that containerized should be using the new saxref path.

@jackdelv jackdelv requested a review from jakesmith April 16, 2025 14:54
{
// MORE: Containerized should only be using saxref and dfuXRefLib should probably be removed
assert(xRefNode->useSasha());
xRefNode->setStatus("Submitted");
Member

I think it's better to remove this line and the else, i.e. keep just the assert (which should use assertex).

Member

Good to catch/assert here, but what I meant/was referring to, was the code in saxref.cpp:

        bool eclwatchProviderDefault = false;
        // NB: Default to start Xref using sasha service in containerized without having to change values.yaml
        // eclwatchProvider sets useSasha in call to setSubmittedOk
        if (isContainerized())
            eclwatchProviderDefault = true;
        bool eclwatchprovider = props->getPropBool("@eclwatchProvider",eclwatchProviderDefault);

That should explicitly only support the eclwatch provider option in containerized, e.g.:

        bool eclwatchprovider = true;
        if (!isContainerized()) // NB: containerized does not support xref any other way.
        {
            // eclwatchProvider sets useSasha in call to setSubmittedOk
            eclwatchprovider = props->getPropBool("@eclwatchProvider");
        }

Member

@jakesmith jakesmith left a comment

@jackdelv - a couple of comments.

I notice some whitespaces creeping in (at end of lines).
(more obvious when using git diff on the CLI)
I haven't used it personally, but there is a vscode option:

"files.trimTrailingWhitespace": true

which may be useful

@jackdelv
Contributor Author

@jakesmith Fixed the assert. I also turned on that setting to hopefully prevent that from happening in the future. Which files had trailing whitespaces? I know the dfuXRefLib diff shows whitespace changes still, but those are removals.

@jackdelv jackdelv requested a review from jakesmith April 16, 2025 16:14
@jakesmith
Member

@jakesmith Fixed the assert. I also turned on that setting to hopefully prevent that from happening in the future. Which files had trailing whitespaces? I know the dfuXRefLib diff shows whitespace changes still, but those are removals.

hm, I don't see them now. Maybe I was looking at wrong diff.

Member

@jakesmith jakesmith left a comment

@jackdelv - looks good. Please squash.

@jakesmith
Member

@jackdelv - now that candidate-9.12.x exists, this needs to be rebased onto it.

@jackdelv jackdelv force-pushed the xrefContainerized branch from 706fd44 to 7e55d41 Compare April 16, 2025 17:20
@jackdelv jackdelv changed the base branch from master to candidate-9.12.x April 16, 2025 17:21
@jackdelv
Contributor Author

@jakesmith Squashed, and Rebased on 9.12.x. Back to you.

@jackdelv jackdelv requested a review from jakesmith April 16, 2025 17:22
Member

@jakesmith jakesmith left a comment

@jackdelv - looks good.

@ghalliday - please take a look.

@jakesmith jakesmith requested a review from ghalliday April 17, 2025 09:23
Member

@ghalliday ghalliday left a comment

@jackdelv looks good. One question, otherwise I am happy to merge. (One additional minor comment.)

Comment thread dali/sasha/saxref.cpp
static cFileDesc * create(CLargeMemoryAllocator &mem,const char *_name,unsigned n)
static cFileDesc * create(CLargeMemoryAllocator &mem,const char *_name,unsigned n,bool d,unsigned fnLen)
{
size32_t sl = strlen(_name);
Member

future: add an assert that fnLen <= 255
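
One way such a check could look, as a sketch. The 255 limit comes from the comment above; the guess that it exists because the length ends up in a single byte is an assumption, and `fnLenFits` and `checkedFnLen` are hypothetical helpers rather than anything in saxref.cpp:

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical predicate: does the filename length fit the 255 limit?
static bool fnLenFits(unsigned fnLen)
{
    return fnLen <= 255;
}

// Hypothetical helper: narrow to one byte, rejecting lengths that would not fit.
// ASSUMPTION: the length is stored in a single byte; the source does not say so.
static std::uint8_t checkedFnLen(unsigned fnLen)
{
    if (!fnLenFits(fnLen))
        throw std::length_error("fnLen exceeds 255");
    return static_cast<std::uint8_t>(fnLen);
}
```

In the real code the equivalent would presumably be an assertion at the top of `create`, so that an over-long name fails loudly instead of being truncated.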

Comment thread dali/sasha/saxref.cpp Outdated
if (dir.length()&&isPathSepChar(dir.charAt(dir.length()-1)))
dir.setLength(dir.length()-1);
unsigned drv = getPathDrive(dir.str()); // should match c
unsigned drv = isContainerized() ? getPathDrive(dir.str()) : 0; // should match c
Member

Is this test around the correct way?

Contributor Author

No, that is wrong. It should be isContainerized() ? 0 : getPathDrive(dir.str());

@jackdelv jackdelv requested a review from ghalliday April 17, 2025 14:54
Member

@ghalliday ghalliday left a comment

@jackdelv please squash

@jackdelv jackdelv force-pushed the xrefContainerized branch from 457cef5 to c4c0177 Compare April 17, 2025 14:59
@jackdelv
Contributor Author

@ghalliday Squashed.

Comment thread dali/sasha/saxref.cpp Outdated
ForEachItemIn(i1, list) {
const char *planeName = list.item(i1);
Owned<IPropertyTree> plane = getStoragePlane(planeName);
if (isContainerized) {
Member

isContainerized()?

@jackdelv jackdelv force-pushed the xrefContainerized branch from c4c0177 to 858cef8 Compare April 17, 2025 15:17
@ghalliday ghalliday merged commit 7695468 into hpcc-systems:candidate-9.12.x Apr 17, 2025
28 of 29 checks passed
4 participants