Merged
81 commits
9a96ded
wip
lukaszkolodziejczyk Sep 25, 2025
d1e3f8d
wip
lukaszkolodziejczyk Sep 26, 2025
6a4cf15
wip
lukaszkolodziejczyk Sep 26, 2025
4253039
simplify prepare_training_data
lukaszkolodziejczyk Sep 29, 2025
65d9692
simplify
lukaszkolodziejczyk Sep 29, 2025
101dd55
ref
lukaszkolodziejczyk Sep 29, 2025
2b518f4
Merge remote-tracking branch 'origin/main' into feat-ai-smartselect
lukaszkolodziejczyk Sep 29, 2025
87e6dfc
wip
lukaszkolodziejczyk Sep 29, 2025
7c64679
wip
lukaszkolodziejczyk Sep 29, 2025
d8c817f
wip
lukaszkolodziejczyk Sep 30, 2025
8ebc6a9
our encodings
lukaszkolodziejczyk Sep 30, 2025
e49424e
our encoders - part 2
lukaszkolodziejczyk Oct 1, 2025
8010267
claude code - null support
lukaszkolodziejczyk Oct 1, 2025
7064474
claude code - temperature and top_k
lukaszkolodziejczyk Oct 1, 2025
60a9ac2
claude code - use _is_null column
lukaszkolodziejczyk Oct 2, 2025
1d20535
cleanup
lukaszkolodziejczyk Oct 2, 2025
47bbb91
bump engine
lukaszkolodziejczyk Oct 2, 2025
7de38ca
fix
lukaszkolodziejczyk Oct 2, 2025
84818fe
test
lukaszkolodziejczyk Oct 2, 2025
b6a8d02
fix
lukaszkolodziejczyk Oct 3, 2025
564eb8c
Implement PartitionedDataset for efficient FK processing with memory …
abon-mostly Oct 6, 2025
428f163
Add unlimited memory caching to PartitionedDataset
abon-mostly Oct 6, 2025
80748df
Fix FK interleaving for independent parent pools per child
abon-mostly Oct 7, 2025
75c0688
linting
abon-mostly Oct 7, 2025
03bbc66
new fk_model data pulling logic
abon-mostly Oct 8, 2025
f080e80
reset_index bug fix
abon-mostly Oct 9, 2025
df2edc9
code simplifications
abon-mostly Oct 10, 2025
c83c58d
improved logging and default parameters
abon-mostly Oct 10, 2025
dd24ded
improved parent batch sampling strategy to improve efficiency
abon-mostly Oct 10, 2025
5667dbc
split embedding and similarity layer computations
abon-mostly Oct 10, 2025
9907654
code cleanups
abon-mostly Oct 10, 2025
0ab270d
code cleanups
abon-mostly Oct 12, 2025
3c9e481
adjust temperature, fix shared batch id issue
abon-mostly Oct 12, 2025
8568e42
code cleanup
abon-mostly Oct 13, 2025
71278e3
improvements on partitioned dataset
abon-mostly Oct 13, 2025
f60b66f
moved timeit to utils
abon-mostly Oct 13, 2025
b43a59f
added timing statement
abon-mostly Oct 14, 2025
7214ec9
modify fk model architecture to make similarity layer computation cheap
abon-mostly Oct 15, 2025
85bbdc6
made fk model training deterministic
abon-mostly Oct 15, 2025
98d9fd4
fixed partitioned dataset tests
abon-mostly Oct 15, 2025
4fdf2f4
hyper parameter tuning
abon-mostly Oct 15, 2025
1620b2f
code cleanup
abon-mostly Oct 16, 2025
4dff517
model fine tunings
abon-mostly Oct 16, 2025
c6a5899
code cleanups
abon-mostly Oct 18, 2025
cde72f8
reduced negative samples for faster training
abon-mostly Oct 20, 2025
7e63f7b
FKModelsStore/{table_name}/{parent_key}/*
lukaszkolodziejczyk Oct 20, 2025
f764044
remove set_seeds
lukaszkolodziejczyk Oct 21, 2025
546629a
simplify
lukaszkolodziejczyk Oct 21, 2025
6c2c8d3
move temperature and top_k defaults to fk_models module
lukaszkolodziejczyk Oct 21, 2025
9f4c0bf
remove timeit
lukaszkolodziejczyk Oct 21, 2025
ed13d03
support datetime
lukaszkolodziejczyk Oct 21, 2025
d44d9a8
safe paths
lukaszkolodziejczyk Oct 21, 2025
d3f3168
safe paths
lukaszkolodziejczyk Oct 21, 2025
615f846
remove determinsm
lukaszkolodziejczyk Oct 21, 2025
55ed08b
update engine
lukaszkolodziejczyk Oct 21, 2025
49137c8
Merge remote-tracking branch 'origin/main' into feat-ai-smartselect
lukaszkolodziejczyk Oct 21, 2025
802ebd1
remove prints
lukaszkolodziejczyk Oct 21, 2025
f5f1f5b
added try catch
abon-mostly Oct 21, 2025
f66c7ab
Merge remote-tracking branch 'origin/main' into feat-ai-smartselect
lukaszkolodziejczyk Oct 22, 2025
ea79d6b
revert prohgress change
lukaszkolodziejczyk Oct 22, 2025
17ec81f
remove some hardcode
lukaszkolodziejczyk Oct 22, 2025
d2fb933
remove some hardcode
lukaszkolodziejczyk Oct 22, 2025
37b27ab
improvs
lukaszkolodziejczyk Oct 22, 2025
fce3132
improvs
lukaszkolodziejczyk Oct 22, 2025
79275a8
improvs
lukaszkolodziejczyk Oct 22, 2025
7618802
improvs
lukaszkolodziejczyk Oct 22, 2025
21cb241
merge fk_models.py and non_context.py
lukaszkolodziejczyk Oct 22, 2025
299fe54
refs
lukaszkolodziejczyk Oct 22, 2025
65d62af
refs
lukaszkolodziejczyk Oct 22, 2025
20fa265
refs
lukaszkolodziejczyk Oct 22, 2025
f7f1c81
partition by partition in random fk assignment;
lukaszkolodziejczyk Oct 22, 2025
58a9539
partition by partition in random fk assignment;
lukaszkolodziejczyk Oct 22, 2025
555cace
refs
lukaszkolodziejczyk Oct 22, 2025
1d2ea8b
fix infinite loop
lukaszkolodziejczyk Oct 22, 2025
9926f59
safe
lukaszkolodziejczyk Oct 22, 2025
551b5c6
top_k=20
lukaszkolodziejczyk Oct 22, 2025
1804f72
remove defaults
lukaszkolodziejczyk Oct 22, 2025
e6497ca
normalize probs
lukaszkolodziejczyk Oct 22, 2025
6512691
softmax
lukaszkolodziejczyk Oct 22, 2025
e27c47c
softmax
lukaszkolodziejczyk Oct 22, 2025
d66419c
fix typing
lukaszkolodziejczyk Oct 22, 2025
13 changes: 12 additions & 1 deletion mostlyai/sdk/_data/file/base.py
@@ -18,7 +18,7 @@
 import re
 import time
 from abc import abstractmethod
-from collections.abc import Generator, Iterable
+from collections.abc import Generator, Iterable, Iterator
 from enum import Enum
 from pathlib import Path
 from typing import Any
@@ -375,6 +375,17 @@ def handle_if_exists(self, if_exists: str = "fail") -> str:
             return "a"
         return "w"
 
+    def iter_partitions(self) -> Iterator[tuple[int, Path, pd.DataFrame]]:
+        """Iterate over dataset partitions yielding (index, file_path, dataframe)."""
+        for idx, file_path in enumerate(self.dataset.files):
+            data = pd.read_parquet(file_path)
+            yield idx, Path(file_path), data
+
+    @property
+    def files(self) -> list[Path]:
+        """Get the list of partition files."""
+        return [Path(f) for f in self.dataset.files]
+
 
 
 class FileContainer(DataContainer):
     SCHEMES = ["http", "https"]
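A caller might consume the new `iter_partitions()` generator roughly as sketched below. Because it yields one partition at a time, only the partition currently being processed needs to be in memory. `PartitionedContainer`, `count_rows`, and `make_demo` are hypothetical names used for illustration and are not part of the SDK. The stand-in also takes the reader as a parameter (the diff hard-codes `pd.read_parquet`), which is an assumption made so the demo can run on CSV files without a parquet engine.

```python
import tempfile
from collections.abc import Callable, Iterator
from pathlib import Path

import pandas as pd


class PartitionedContainer:
    """Hypothetical stand-in mirroring the two members added in the diff."""

    def __init__(
        self,
        files: list[Path],
        reader: Callable[[Path], pd.DataFrame] = pd.read_parquet,
    ) -> None:
        self._files = files
        self._reader = reader  # assumption: pluggable reader, unlike the diff

    @property
    def files(self) -> list[Path]:
        """Get the list of partition files."""
        return [Path(f) for f in self._files]

    def iter_partitions(self) -> Iterator[tuple[int, Path, pd.DataFrame]]:
        """Yield (index, file_path, dataframe), reading each file lazily."""
        for idx, file_path in enumerate(self.files):
            yield idx, file_path, self._reader(file_path)


def count_rows(container: PartitionedContainer) -> int:
    """Sum row counts partition by partition, never concatenating the frames."""
    total = 0
    for _idx, _path, df in container.iter_partitions():
        total += len(df)
    return total


def make_demo(tmp_dir: Path) -> PartitionedContainer:
    """Write two small CSV partitions (3 and 2 rows) and wrap them."""
    paths = []
    for i, n_rows in enumerate([3, 2]):
        path = tmp_dir / f"part.{i:06d}.csv"
        pd.DataFrame({"id": range(n_rows)}).to_csv(path, index=False)
        paths.append(path)
    return PartitionedContainer(paths, reader=pd.read_csv)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        container = make_demo(Path(tmp))
        print(count_rows(container))
```

The `files` property is useful when only the paths are needed, for example to count or shuffle partitions, since it avoids reading any data.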