Conversation


@Averroes90 Averroes90 commented Jul 21, 2025

Problem

  • Backend throws MissingDependencyException when processing .docx files because the markitdown[docx] dependency is missing
  • Application fails with an Ollama "model not found" error for nomic-embed-text:137m-v1.5-fp16

Solution

  • Added markitdown[docx] to requirements.txt
  • Added Ollama nomic-embed-text:137m-v1.5-fp16 model setup to setup.sh
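
For anyone hitting the same errors on an existing checkout, the equivalent manual fix is roughly the two commands below (run inside the backend's Python environment; the extra name and model tag are the ones listed above):

# install the docx extra that markitdown's DocxConverter needs
pip install 'markitdown[docx]'

# pull the embedding model that Ollama reported as missing
ollama pull nomic-embed-text:137m-v1.5-fp16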

Testing

  • Verified .docx file processing works after installing markitdown[docx]
  • Confirmed application runs successfully after pulling the required Ollama model
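
A quick smoke test along the same lines, assuming markitdown's standard Python API (sample.docx is a placeholder for any local Word file):

# convert a .docx file; raises MissingDependencyException if the extra is absent
python -c "from markitdown import MarkItDown; print(MarkItDown().convert('sample.docx').text_content[:200])"

# confirm the embedding model is available locally
ollama list | grep nomic-embed-text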

Summary by CodeRabbit

  • Chores
    • Expanded backend dependencies to add new packages and an editable dependency pointing to a remote repository for development/integration.
    • Enhanced setup script to automatically check for and pull two AI models during installation if they are not already present.

- Add markitdown[docx] to requirements.txt to resolve DocxConverter MissingDependencyException
- Add Ollama model pull instructions to setup documentation
- Fixes issue with .docx file processing in backend

coderabbitai bot commented Jul 21, 2025

Walkthrough

Updated backend Python dependencies (added cobble==0.1.4, mammoth==1.9.1, markitdown[docx], and an editable GitHub dependency) and extended setup.sh to check for and pull two Ollama models (gemma3:4b and nomic-embed-text:137m-v1.5-fp16) if missing.

Changes

Cohort / File(s) | Change Summary
Backend dependencies (apps/backend/requirements.txt) | Added cobble==0.1.4, mammoth==1.9.1, updated markitdown to include [docx] extras, and added an editable GitHub dependency pointing to a specific commit/subdirectory.
Setup / Ollama model pulls (setup.sh) | Inserted presence checks and conditional pull blocks for two Ollama models (gemma3:4b and nomic-embed-text:137m-v1.5-fp16), with logging and error handling, placed after the existing Ollama installation logic.
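
For reference, a sketch of the affected region of apps/backend/requirements.txt after this change; the editable dependency's repository, commit, and subdirectory are deliberately left as placeholders rather than reproduced:

cobble==0.1.4
mammoth==1.9.1
markitdown[docx]
# placeholder: see the actual file for the real repository, commit, and subdirectory
-e git+https://github.com/<org>/<repo>.git@<commit>#egg=<package>&subdirectory=<path>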

Sequence Diagram(s)

sequenceDiagram
    participant Setup as setup.sh
    participant Ollama as ollama CLI
    Note over Setup: After Ollama install step
    Setup->>Ollama: ollama list | grep -q "gemma3:4b"
    alt not present
        Setup->>Ollama: ollama pull gemma3:4b
        Ollama-->>Setup: success / failure
    else present
        Ollama-->>Setup: already present
    end
    Setup->>Ollama: ollama list | grep -q "nomic-embed-text:137m-v1.5-fp16"
    alt not present
        Setup->>Ollama: ollama pull nomic-embed-text:137m-v1.5-fp16
        Ollama-->>Setup: success / failure
    else present
        Ollama-->>Setup: already present
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

I hopped through the repo, sniffed new seeds,
Cobble and mammoth fulfilling new needs.
Markitdown learned docx, a clever new trick,
Two Ollama models pulled—perfect and quick.
A joyful rabbit, I dance and I cheer! 🐇✨



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
apps/backend/requirements.txt (1)

28-28: mammoth is already pulled in by markitdown[docx]; consider dropping the direct pin.

markitdown[docx] declares mammoth as a transitive dependency. Keeping a separate
mammoth==1.9.1 line:

  • adds maintenance overhead,
  • risks version conflicts if markitdown upgrades its declared range, and
  • offers no benefit unless you deliberately need a stricter pin.

Unless there is a conscious reason to override the version, delete the redundant line:

-mammoth==1.9.1
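
One way to confirm the transitive declaration before deleting the pin (importlib.metadata is in the standard library, so this needs no extra tooling):

# should print a requirement string such as "mammoth>=... ; extra == 'docx'"
python -c "from importlib.metadata import requires; print([r for r in requires('markitdown') if 'mammoth' in r])"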
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 64371f7 and 7bad0ab.

📒 Files selected for processing (1)
  • apps/backend/requirements.txt (3 hunks)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
setup.sh (2)

131-137: Factor out duplicated model-pull logic into a helper (ensure_ollama_model)

The new block correctly handles the extra Ollama model, but the code now contains two near-identical “check & pull” sequences (for gemma3:4b and nomic-embed-text:137m-v1.5-fp16). Consolidating them will:

• eliminate duplication
• make it trivial to add/remove models in one place
• keep the main script body readable

Example refactor:

+#–– Helper: ensure an Ollama model is present ––#
+ensure_ollama_model() {
+  local model="$1"
+  if ! ollama list | grep -q "^${model}"; then
+    info "Pulling ${model} model…"
+    ollama pull "${model}" || error "Failed to pull ${model}"
+    success "${model} model ready"
+  else
+    info "${model} model already present—skipping"
+  fi
+}
+
-# existing per-model blocks
-if ! ollama list | grep -q 'gemma3:4b'; then
-  ...
-fi
-
-if ! ollama list | grep -q 'nomic-embed-text:137m-v1.5-fp16'; then
-  ...
-fi
+# concise usage
+ensure_ollama_model "gemma3:4b"
+ensure_ollama_model "nomic-embed-text:137m-v1.5-fp16"

Side note: anchoring the grep with ^ avoids accidental matches where the tag appears as part of another string.
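
To illustrate that side note, an unanchored grep happily matches the tag inside a longer model name (the ollama list row below is hypothetical):

line='my-org/gemma3:4b-finetune   abc123   3.3 GB'
echo "$line" | grep -q 'gemma3:4b' && echo "unanchored: false positive"
echo "$line" | grep -q '^gemma3:4b' || echo "anchored: correctly rejected"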


38-45: Update usage/help text to mention the second model

The help block still states “pull gemma3:4b model” (singular). Now that the script always ensures both gemma3:4b and nomic-embed-text:137m-v1.5-fp16, tweak the wording to avoid surprising users.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b8e86b3 and 6bc2423.

📒 Files selected for processing (1)
  • setup.sh (1 hunks)

@MohaimenH

Stumbled upon this issue as well, was about to create a PR but great to see it here already!

Might want to update the README/Setup as well to include the dependency on nomic-embed-text:137m-v1.5-fp16.

Thanks!


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6bc2423 and 98016a3.

📒 Files selected for processing (2)
  • apps/backend/requirements.txt (3 hunks)
  • setup.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • apps/backend/requirements.txt

Comment on lines +138 to +154

if ! ollama list | grep -q 'gemma3:4b'; then
  info "Pulling gemma3:4b model…"
  ollama pull gemma3:4b || error "Failed to pull gemma3:4b"
  success "gemma3:4b model ready"
else
  info "gemma3:4b model already present—skipping"
fi

if ! ollama list | grep -q 'nomic-embed-text:137m-v1.5-fp16'; then
  info "Pulling nomic-embed-text:137m-v1.5-fp16 model…"
  ollama pull nomic-embed-text:137m-v1.5-fp16 || error "Failed to pull nomic-embed-text:137m-v1.5-fp16"
  success "nomic-embed-text:137m-v1.5-fp16 model ready"
else
  info "nomic-embed-text:137m-v1.5-fp16 model already present—skipping"
fi


🛠️ Refactor suggestion

DRY the model pull logic and avoid hard-coded unconditional pulls

You already have ollama_check_or_pull (Lines 110-119). These two blocks duplicate its behavior and hard-code models irrespective of provider/config. Prefer reusing the helper to keep behavior consistent and reduce maintenance. Also consider not pulling models unless the provider is Ollama or an explicit opt-in flag is set, since you already pull models based on backend .env later (Lines 186-196).

Proposed change: replace both blocks (Lines 138-154) with a compact loop that uses the helper.

- if ! ollama list | grep -q 'gemma3:4b'; then
-   info "Pulling gemma3:4b model…"
-   ollama pull gemma3:4b || error "Failed to pull gemma3:4b"
-   success "gemma3:4b model ready"
- else
-   info "gemma3:4b model already present—skipping"
- fi
-
- if ! ollama list | grep -q 'nomic-embed-text:137m-v1.5-fp16'; then
-   info "Pulling nomic-embed-text:137m-v1.5-fp16 model…"
-   ollama pull nomic-embed-text:137m-v1.5-fp16 || error "Failed to pull nomic-embed-text:137m-v1.5-fp16"
-   success "nomic-embed-text:137m-v1.5-fp16 model ready"
- else
-   info "nomic-embed-text:137m-v1.5-fp16 model already present—skipping"
- fi
+# Option A (simple, DRY): always ensure these two models exist
+for model in "gemma3:4b" "nomic-embed-text:137m-v1.5-fp16"; do
+  ollama_check_or_pull "$model"
+done
+
+# Option B (conditional): only ensure when explicitly requested
+# if [[ "${PULL_DEFAULT_OLLAMA_MODELS:-true}" == "true" ]]; then
+#   for model in "gemma3:4b" "nomic-embed-text:137m-v1.5-fp16"; do
+#     ollama_check_or_pull "$model"
+#   done
+# fi

Additionally, to make the presence check more robust (avoid partial matches), improve the helper to match the first column exactly. This change is outside the modified hunk; here is the suggested update for the helper:

# Suggested improvement outside the changed lines (Lines 110-119)
ollama_check_or_pull() {
  local model_name="$1"
  if ! ollama list | awk '{print $1}' | grep -qx "$model_name"; then
    info "Pulling $model_name model…"
    ollama pull "$model_name" || error "Failed to pull $model_name model"
    success "$model_name model ready"
  else
    info "$model_name model already present—skipping"
  fi
}
🤖 Prompt for AI Agents
In setup.sh around lines 138 to 154, the two repeated ollama pull blocks
duplicate existing helper behavior and unconditionally attempt pulls; replace
those blocks with a compact loop that calls ollama_check_or_pull for each model
(e.g., gemma3:4b and nomic-embed-text:137m-v1.5-fp16) and ensure the loop only
runs when Ollama is the configured provider or an explicit opt-in flag is set;
also update the ollama_check_or_pull helper (lines ~110-119) to check the first
column exactly (use ollama list piped to awk '{print $1}' | grep -qx
"$model_name") so presence checks avoid partial matches.

💡 Verification agent

🧩 Analysis chain

Confirm necessity of hard-coding gemma3:4b and document the model requirement

The PR objective was to fix the embedding model error (nomic-embed-text:137m-v1.5-fp16). Adding gemma3:4b here is potentially out of scope unless it’s the default LLM in your .env. If not required, consider removing it to save setup time/bandwidth and rely on the backend’s env-driven pull (Lines 186-196). Also, per the PR comment suggestion, add a note in README/SETUP docs instructing users to pull the embedding model during setup.

I can open a follow-up PR to:

  • Update README/SETUP to mention nomic-embed-text:137m-v1.5-fp16 and markitdown[docx].
  • Gate the default model pulls behind a flag (e.g., PULL_DEFAULT_OLLAMA_MODELS) or provider checks.

Run this script to verify whether gemma3:4b is actually referenced by defaults:


🏁 Script executed:

#!/bin/bash
# Verify whether gemma3:4b and nomic-embed-text:137m-v1.5-fp16 are referenced in env/configs

set -euo pipefail

echo "Searching for LLM/Embedding provider & model defaults…"
rg -n -A2 -B2 '(LLM_PROVIDER|LLM_MODEL|EMBEDDING_PROVIDER|EMBEDDING_MODEL|gemma3:4b|nomic-embed-text:137m-v1.5-fp16)'

echo
echo "Enumerating candidate env files…"
fd -H '(.env|.env.*|env.sample|env.example)$' -t f | sed 's/^/- /'

Length of output: 12324


Consolidate and conditionally gate static Ollama model pulls

Gemma3:4b is indeed the default LLM_MODEL (apps/backend/app/core/config.py L22), so pulling it by default is valid. However, nomic-embed-text:137m-v1.5-fp16 isn’t referenced in any defaults or sample .env (apps/backend/app/core/config.py L26 uses dengcao/Qwen3-Embedding-0.6B:Q8_0), so its unconditional pull in setup.sh (Lines 137–154) may waste time and bandwidth when the embedding model differs.

Please consolidate and gate these pulls behind your existing env-driven logic:

• In setup.sh (Lines 137–154): remove or replace the two hard-coded “ollama pull” blocks and rely on
ollama_check_or_pull $LLM_MODEL / $EMBEDDING_MODEL in the conditional section at Lines 186–196.
• Optionally introduce a flag (e.g. PULL_DEFAULT_OLLAMA_MODELS) or wrap the pull logic in provider checks so you only pull models actually in use.
• Update SETUP.md (and setup.ps1, SETUP.es.md) to document:
– Default LLM_MODEL (“gemma3:4b”) and EMBEDDING_MODEL (“dengcao/Qwen3-Embedding-0.6B:Q8_0”)
– How to customize EMBEDDING_MODEL in .env and trigger its pull
– Any new PULL_DEFAULT_OLLAMA_MODELS flag and its default behavior.

These changes will avoid unnecessary downloads and keep setup.sh aligned with your .env-driven model configuration.
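
A minimal sketch of that env-driven replacement, assuming the backend .env lives at apps/backend/.env and defines LLM_MODEL and EMBEDDING_MODEL (the path and fallback defaults here are illustrative, taken from the config values cited above):

# load backend configuration so the pulls follow the configured models
set -a
. apps/backend/.env
set +a

# ensure only the models the app actually uses, via the existing helper
for model in "${LLM_MODEL:-gemma3:4b}" "${EMBEDDING_MODEL:-dengcao/Qwen3-Embedding-0.6B:Q8_0}"; do
  ollama_check_or_pull "$model"
done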

🤖 Prompt for AI Agents
In setup.sh around lines 138 to 154, the script unconditionally pulls gemma3:4b
and nomic-embed-text:137m-v1.5-fp16 which duplicates or conflicts with the
env-driven model logic later; remove those two hard-coded ollama pull blocks and
instead invoke the existing ollama_check_or_pull $LLM_MODEL and
ollama_check_or_pull $EMBEDDING_MODEL in the conditional section around lines
186–196 so models are only pulled based on .env values; optionally add a
PULL_DEFAULT_OLLAMA_MODELS flag (default true) or wrap pulls in provider checks
to avoid unnecessary downloads; finally update SETUP.md, setup.ps1 and
SETUP.es.md to document the default LLM_MODEL (gemma3:4b), default
EMBEDDING_MODEL (dengcao/Qwen3-Embedding-0.6B:Q8_0), how to customize
EMBEDDING_MODEL in .env and how to control the new PULL_DEFAULT_OLLAMA_MODELS
flag.
