Signed-off-by: Dushyant Behl <[email protected]>
1 parent d19b8ec, commit 615ed74
Showing 2 changed files with 45 additions and 3 deletions.
@@ -1 +1,34 @@
# Advanced Data Processing
Our library also supports a powerful data processing backend that users can use to perform custom data preprocessing, including:
1. Providing multiple datasets.
1. Creating a custom data processing pipeline for the datasets.
1. Combining multiple datasets into one, even when they have different formats.
1. Mixing datasets as required and, if needed, sampling each with a different weight.

These features are supported through what we call a [`data_config`](#data-config), which can be passed as an argument to the sft trainer. The data config is explained in detail next.

## Data Config

The data config is a configuration file that users can provide to `sft_trainer.py`.

What is the data config schema?

How can users write data configs?

What are data handlers?

Preexisting data handlers

Extra data handlers

How can users pass the datasets?

What kinds of datasets can be passed?

How can users perform sampling?
- What does sampling mean?
- How will it affect the datasets?
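
To make the sampling questions above concrete, here is a minimal sketch of one possible interpretation of weighted mixing: each record of the mixed dataset is drawn from a source dataset that is chosen with probability proportional to its weight. The function `mix_datasets`, its parameters, and the record layout are hypothetical illustrations. They are not the library's API and may not match its actual sampling strategy.

```python
import random
from typing import Dict, List, Sequence


def mix_datasets(
    datasets: Dict[str, Sequence[dict]],
    weights: Dict[str, float],
    num_samples: int,
    seed: int = 0,
) -> List[dict]:
    """Draw num_samples records, choosing the source dataset by weight.

    Hypothetical helper for illustration only; not part of the library.
    """
    if set(datasets) != set(weights):
        raise ValueError("every dataset needs exactly one weight")
    if any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    for name, records in datasets.items():
        if weights[name] > 0 and len(records) == 0:
            raise ValueError(f"dataset {name!r} is empty but has a positive weight")

    rng = random.Random(seed)  # seeded so that a run can be reproduced
    names = sorted(datasets)   # fixed order keeps the draws deterministic
    probs = [weights[name] for name in names]

    mixed = []
    for _ in range(num_samples):
        # Pick a source dataset in proportion to its weight...
        name = rng.choices(names, weights=probs, k=1)[0]
        # ...then pick one record from it uniformly, with replacement.
        mixed.append(rng.choice(datasets[name]))
    return mixed
```

Under this interpretation, a dataset with weight 0 never contributes records, and a small dataset with a large weight is oversampled, so its records repeat in the mixed output. Whether the trainer samples with or without replacement is exactly the kind of detail this section still needs to document.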

How can users create a data config for the existing use cases?

Corner cases which need attention.