Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor JournalAbbreviationRepository class to use on-disk maps #12606

Draft
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

marwanemad07
Copy link
Contributor

@marwanemad07 marwanemad07 commented Mar 3, 2025

Closes #12571

This PR refactors the JournalAbbreviationRepository to use MVMap for all abbreviation maps and creates these maps in JournalListMvGenerator to use these maps on demand. This change reduces memory overhead and improves consistency by eliminating redundant in-memory structures.

I found that the total execution time decreased to 42ms with memory usage at 2.08MB, compared to the previous 458ms and 146.95MB on my device using IntelliJ Idea's profiler.

Mandatory checks

  • I own the copyright of the code submitted and I licence it under the MIT license
  • Change in CHANGELOG.md described in a way that is understandable for the average user (if change is visible to the user)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

@marwanemad07 marwanemad07 changed the title Fix issue 12571 Refactor JournalAbbreviationRepository class to use on-disk maps Mar 3, 2025
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

}

public void addCustomAbbreviation(Abbreviation abbreviation) {
Objects.requireNonNull(abbreviation);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this comment removed? Please readd it provides helpful insights

private final String abbreviation;
private transient String dotlessAbbreviation;
private final String dotlessAbbreviation;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are these no longer marked as transient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous solution returned null for the name and dotlessAbbreviation fields because they were marked as transient. As a result, these values were lost during deserialization from the h2 database, and new Abbreviation objects were created using only the key and abbreviation. I now need the deserialization process to correctly restore these fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit tests failed, despite the test passes on my local, I think the transient fields may be the problem, I will try with them again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no clue, why I introduced them - I think with the new system, there should be no need for it.

private final TreeSet<Abbreviation> customAbbreviations = new TreeSet<>();

/**
* Initializes the internal data based on the abbreviations found in the given MV file
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readd the javadoc

}

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wha did you remove all the javadoc comments?

@@ -14,111 +12,72 @@
import org.h2.mvstore.MVMap;
import org.h2.mvstore.MVStore;

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not remove the javadoc comments!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry for the oversight. While testing a solution—where I created the MV maps in JournalAbbreviationRepository—I added some debugging comments. Unfortunately, I accidentally removed them during the cleanup. I will roll back those changes.

}

private void openMaps(MVStore store) {
String fullToAbbreviationMapName = "FullToAbbreviation";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract all the strings as constanst

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean to make these as final fields?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private static final

public Optional<Abbreviation> get(String input) {
// Clean up input: trim and unescape ampersand
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this removed?

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

@marwanemad07
Copy link
Contributor Author

I'm getting an out-of-memory error on the unit tests even though they passed locally. The error also mentioned that some tests failed, but when I tried them locally, they passed.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

@marwanemad07 marwanemad07 marked this pull request as draft March 6, 2025 12:25
@koppor koppor requested a review from Copilot March 6, 2025 13:22
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Overview

This PR refactors the JournalAbbreviationRepository to use on-disk MVMap structures instead of in-memory HashMaps, reducing memory overhead and execution time. It also updates JournalListMvGenerator accordingly and adjusts the Abbreviation class to mark some fields as transient final.

  • Refactored repository constructors and map usage in JournalAbbreviationRepository.
  • Updated JournalListMvGenerator to populate MVMaps with cache configuration.
  • Modified the Abbreviation class to declare transient fields as final.

Reviewed Changes

File Description
src/main/java/org/jabref/logic/journals/JournalAbbreviationRepository.java Switched from in-memory maps to MVMap and updated retrieval logic.
src/main/java/org/jabref/cli/JournalListMvGenerator.java Updated map initialization and added caching for MVStore.
src/main/java/org/jabref/logic/journals/Abbreviation.java Declared transient fields as final to support immutability.

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

src/main/java/org/jabref/logic/journals/JournalAbbreviationRepository.java:24

  • [nitpick] The constant name 'ABBREVIATION_TO_ABBREVIATION_MAP_NAME' does not clearly align with the corresponding variable 'abbreviationToNameMap'. Consider renaming the constant to 'ABBREVIATION_TO_NAME_MAP_NAME' for improved clarity.
private static final String ABBREVIATION_TO_ABBREVIATION_MAP_NAME = "AbbreviationToName";

src/main/java/org/jabref/logic/journals/Abbreviation.java:10

  • Marking the 'name' field as both transient and final may lead to issues during deserialization since final transient fields are not reinitialized. Consider removing the final modifier or implementing a proper deserialization mechanism (e.g., a readResolve method) if deserialization is required.
private transient final String name;

Comment on lines 10 to 13
private transient final String name;
private final String abbreviation;
private transient String dotlessAbbreviation;
private transient final String dotlessAbbreviation;

Copy link
Preview

Copilot AI Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, declaring 'dotlessAbbreviation' as transient final can cause deserialization problems since it won’t be restored automatically. Consider removing the final modifier or handling its restoration during deserialization.

Copilot is powered by AI, so mistakes are possible. Review output carefully before use.

Positive Feedback
Negative Feedback

Provide additional feedback

Please help us improve GitHub Copilot by sharing more details about this comment.

Please select one or more of the options
Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this fully works.

Can the Abbreviation object be removed completely removed from the MV Store - and only used when queried at for the AbbreviationViewModel?

@@ -45,8 +45,13 @@ public static void main(String[] args) throws IOException {
MVStore store = new MVStore.Builder().
fileName(journalListMvFile.toString()).
compressHigh().
cacheSize(128).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? Add a Java comment. If unsure, remove this. - The generator is run only once and the map closed upon run completion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function was taking around 1.1GB of memory, I thought the out-of-memory error was from here, so I tried to reduce the cache size of the MVStore. So I will remove it.

}

/**
* Initializes the repository with demonstration data. Used if no abbreviation file is found.
*/
public JournalAbbreviationRepository() {
// this will persist in memory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found in the H2 database docs that when we don't provide a path to the store, the whole maps of the store will persist in the memory. So I wanted to mention this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a link to the documentation.

Even in 2025, the IDEs do not inline comments in PRs in code comments. Thus, we need to do this for ourselves.

Comment on lines 175 to 206
return fullToAbbreviationObject.values();
List<Abbreviation> values = new ArrayList<>();
fullToAbbreviationMap.forEach((name, abbr) -> {
String abbreviationString = abbr.getAbbreviation();
String shortestUniqueAbbreviation = abbr.getShortestUniqueAbbreviation();
Abbreviation newAbbreviation = new Abbreviation(
name,
abbreviationString,
shortestUniqueAbbreviation
);
values.add(newAbbreviation);
});
return values;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This map is deserialized from the store, and since the name and dotlessAbbreviation fields are transient, they will be null. So I needed to transform the values so they would be set properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will close this PR and I will try to think of another good solution trying to learn from my faults in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - no need to close this PR, as everyone has more context via the comments now. You can keep working on this one itself.

private static final String SHORTEST_UNIQUE_TO_ABBREVIATION_MAP_NAME = "ShortestUniqueToName";

private final MVStore store;
private MVMap<String, Abbreviation> fullToAbbreviationMap;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this map still needed? Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Since the name and dotlessAbreviation fields are transient, I use this to create the abbreviation from its name (which is the key), the abbreviation string, and the shortUniqueAbbrevition fields which can be contained from the value of this map.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the abbreviation created? I think, it is only required for the UI when abbreviations are changed. When using the logic module for abbreviations, there should be no need for this object?!

In call cases, when updating an abbreviation in the UI the correct behavior needs to be (manually) tested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the get function, I need to create the abbreviation again for the same reason that the abbreviation contains transient fields. If you can confirm that do we need the fields of the abbreviation to be transient or not this may help a lot. From what I understand, I see we don't need them to be transient.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please investigate the usage for yourself. I have no clue about the transient - i think, this property needs to be removed if they are kept be stored in the mv store, which I doubt!

@koppor
Copy link
Member

koppor commented Mar 17, 2025

This PR will maybe have conflicts with #12544

@koppor
Copy link
Member

koppor commented Mar 17, 2025

PR #12652 modified this class, too.

@marwanemad07 Can you please resolve the conflicts. Then, this PR should be ready for review (isn't it). Then, we should get this in - and then #12544 should be adapted accordingly.

Copy link

trag-bot bot commented Mar 17, 2025

@trag-bot didn't find any issues in the code! ✅✨

Copy link
Contributor

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.

You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

@koppor
Copy link
Member

koppor commented Mar 17, 2025

org.h2.mvstore.MVStoreException: java.lang.OutOfMemoryError: Java heap space [2.3.232/3]
	org.h2.mvstore.DataUtils.newMVStoreException(DataUtils.java:996)
	org.h2.mvstore.FileStore.serializeAndStore(FileStore.java:1445)
	org.h2.mvstore.FileStore.lambda$storeIt$2(FileStore.java:1380)
	java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	java.base/java.lang.Thread.run(Thread.java:1575)
java.lang.OutOfMemoryError: Java heap space

OMG 😅

@@ -47,6 +47,10 @@ public static void main(String[] args) throws IOException {
compressHigh().
open()) {
MVMap<String, Abbreviation> fullToAbbreviation = store.openMap("FullToAbbreviation");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does one really need this - I think, this can be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need them to create the mv file to be reused in the JorunalAbbreviationRepository

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if the whole abbreviation functions can be written without using the full object?!

@marwanemad07
Copy link
Contributor Author

I didn't feel disappointed until I saw it 😂.

@koppor
Copy link
Member

koppor commented Mar 17, 2025

I didn't feel disappointed until I saw it 😂.

I have no "quick" clue how to proceed here. Think, I can dive into it this Friday the earliest; probably next week.

@subhramit
Copy link
Member

I have no "quick" clue how to proceed here. Think, I can dive into it this Friday the earliest; probably next week.

Here's some more good news - this happens only on CI and not on his local.

@marwanemad07
Copy link
Contributor Author

marwanemad07 commented Mar 18, 2025

Here's some more good news - this happens only on CI and not on his local.

The bad news is when I profiled the JournalAbbreviationRepositoryTest.java I found that the function findBestFuzzyMatched took 500MB (this is from streaming the MVMap), despite when I profiled the app itself it took 170MB. I thought the reason was coming from the default constructor, but I tried to play around with it but in no vain.

Followup: A question for anyone who may work on this issue if he follows the same solution or the current solution. when we profile this it takes around 500MB to 1GB although the file itself is around 4MB.

@koppor
Copy link
Member

koppor commented Mar 19, 2025

Fuzzy matching was introduced at #12652

@koppor
Copy link
Member

koppor commented Mar 19, 2025

Is this now the time to move to Apache Lucene or to Postgres for journal abbreviations?

The best thing would be to use postgres and Soundex (https://www.postgresql.org/docs/current/fuzzystrmatch.html) (first google hit)

Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR should rewrite

  • org.jabref.gui.journals.UndoableAbbreviator --> to use methods of the journalAbbreviationRepository not returning Abbreviation object, but strings
  • org.jabref.logic.journals.JournalAbbreviationRepository#getShortestUniqueAbbreviation should not use get, but use the maps
  • org.jabref.logic.journals.JournalAbbreviationRepository#get should be removed
  • org.jabref.gui.journals.UndoableUnabbreviator --> similar rewrite as org.jabref.gui.journals.UndoableAbbreviator
  • Rewrite org.jabref.gui.preferences.journals.JournalAbbreviationsTabViewModel to use the String methods - and create the Abbreviation object for itself
  • Inline org.jabref.logic.journals.Abbreviation#getNext to org.jabref.logic.journals.JournalAbbreviationRepository#getNextAbbreviation with the help of the Maps.

  • org.jabref.gui.preferences.journals.AbbreviationViewModel can stay IMHO
  • org.jabref.logic.journals.AbbreviationWriter can also stay - thus org.jabref.logic.journals.Abbreviation is also staying in logic.

(I was about to say that org.jabref.logic.journals.Abbreviation should be moved to GUI as AbbreviationViewModel, but then I did not know how to code the writer properly)

ALTERNATIVE: Move org.jabref.logic.journals.Abbreviation to GUI and introduce org.jabref.logic.journals.Abbreviation as record without any logic in it.

@marwanemad07
Copy link
Contributor Author

I will look at the PR again and rewrite it. I want to mention something I noticed, the current solution and my solution take too much memory because of iterating over the MVMap (any part iterating over the MVMap takes too much memory). I tried to read H2 database docs but they are not clear on how they handle it. I will read how they do the deserialization for MVMap.

@koppor
Copy link
Member

koppor commented Mar 20, 2025

@marwanemad07 No need to dive into the MVStore implementation. Just rewrite the class interface to be more "dumb", then, the class can be rewritten to use another backend.

Iterating on everything should be necessary only if all abbreviations should be displayed. For abbreviate/unabbreviate it should not be necessary.

Oh, there is fuzzy matching. - With this requirement, another backend should be used. -- Thus, proposal:

  1. Rewrite class interface to return String only (as outlined above)
  2. Rewrite class to use postgres - see https://stackoverflow.com/q/7730027/873282 for hints for fuzzy matching.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rewrite journal abbrevation repository
4 participants