diff --git a/.readthedocs.yaml b/.readthedocs.yaml deleted file mode 100644 index e515bae..0000000 --- a/.readthedocs.yaml +++ /dev/null @@ -1,30 +0,0 @@ -# .readthedocs.yaml -# Read the Docs configuration file -# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details - -# Required -version: 2 - -# Set the version of Python and other tools you might need -build: - os: ubuntu-22.04 - tools: - python: "3.11" - commands: - - echo $READTHEDOCS_OUTPUT - - mkdir -p $READTHEDOCS_OUTPUT/ - - cp --recursive oxen/docs/build/* $READTHEDOCS_OUTPUT/ - - -# Build documentation in the docs/ directory with Sphinx -sphinx: - configuration: oxen/docs/source/conf.py - -# If using Sphinx, optionally build your docs in additional formats such as PDF -# formats: -# - pdf - -# Optionally declare the Python requirements required to build your docs -python: - install: - - requirements: oxen/docs/requirements.txt diff --git a/Annotation.md b/Annotation.md deleted file mode 100644 index 1b4de05..0000000 --- a/Annotation.md +++ /dev/null @@ -1,11 +0,0 @@ -# Annotation - -When it comes to a supervised learning problem, or simply evaluating your model, data is much more useful if it has annotations or labels. - -There are many tools that can be used to annotate or label data. Oxen is agnostic to what tools you use for annotation, but can track the changes to your data over time so that it is easy to audit or restore an older version. - -Some popular annotation tools include - -* [Label Studio](https://labelstud.io/) Label Studio is a configurable data annotation tool that works with different data types -* [Make Sense](https://github.com/SkalskiP/make-sense) Modern tool for labeling images - diff --git a/Branching.md b/Branching.md deleted file mode 100644 index ab9cc3d..0000000 --- a/Branching.md +++ /dev/null @@ -1,54 +0,0 @@ - -## Branching - -It is probably a good idea to do these changes on a new branch as we have already made a few significant changes to the data. 
- -```bash -oxen checkout -b is_famous_and_smiling -``` - -Now let's add and commit our updated DataFrame. - -```bash -oxen add list_attr_celeba.csv -``` -```bash -oxen commit -m "removing all attributes except Smiling and Is_Famous" -``` - -TODO: ...... - - -## Schemas - -Oxen needs to detect changes to your data schemas over time. To see the schemas that Oxen is tracking you can use the `schemas` command. - -```bash -oxen schemas -``` - -``` -+------+----------------------------------+-------------------------------------+ -| name | hash | fields | -+===============================================================================+ -| ? | 36d0edc8779f42e30b0d630aa83bc83c | [file, label, ..., width, height] | -|------+----------------------------------+-------------------------------------| -| ? | 9d277b6a412ba4890265ec7d2a98e10b | [file, label, ..., height, is_cute] | -+------+----------------------------------+-------------------------------------+ -``` - -We can see that neither of these schemas is named yet. If you want to reference a schema by name you can name it with the `schemas name` subcommand. This is useful, for example, if you are building a tool on top of Oxen with a specific schema in mind. - -```bash -oxen schemas name 9d277b6a412ba4890265ec7d2a98e10b "my_bounding_box" -``` - -Knowing whether a data schema has changed is useful for making sure that there are no breaking changes in your data that could have downstream consequences. Schemas also allow us to index all of the data into [Apache Arrow](https://arrow.apache.org/) DataFrames which will be useful down the line. - -Data point level version control and schema tracking are essential building blocks that can be built upon to enable some powerful workflows. - -We explore one of these workflows by [building a training data annotation tool](BuildingAnAnnotationTool.md) on top of Oxen. 
- - - - diff --git a/DataFrames.md b/DataFrames.md deleted file mode 100644 index 1f88a36..0000000 --- a/DataFrames.md +++ /dev/null @@ -1,529 +0,0 @@ - -# Tabular Data in Oxen - -As data scientists and machine learning engineers, we deal with a lot of tabular data. Whether it is csv, parquet, or line delimited json, it is useful to store your training data in data frames that we can filter, aggregate, slice and dice. - -To follow along with the examples below feel free to grab the example data from our public [CatDogBoundingBox](https://www.oxen.ai/ox/CatDogBoundingBox) repository. - -```bash -oxen clone http://hub.oxen.ai/ox/CatDogBoundingBox -``` - -```bash -cd CatDogBoundingBox -``` - -## oxen df - -Oxen has a convenient `df` (short for "DataFrame") command to deal with tabular data. This example data has 10,000 rows and 6 columns of bounding boxes around cats or dogs. The shape hint at the top of the output can be useful for making sure you are transforming the data correctly. - -```bash -oxen df annotations/data.csv -``` - -``` -shape: (10000, 6) -┌─────────────────────────┬───────┬────────┬────────┬────────┬────────┐ -│ file ┆ label ┆ min_x ┆ min_y ┆ width ┆ height │ -│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ -│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ -╞═════════════════════════╪═══════╪════════╪════════╪════════╪════════╡ -│ images/000000128154.jpg ┆ cat ┆ 0.0 ┆ 19.27 ┆ 130.79 ┆ 129.58 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000544590.jpg ┆ cat ┆ 9.75 ┆ 13.49 ┆ 214.25 ┆ 188.35 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000000581.jpg ┆ dog ┆ 49.37 ┆ 67.79 ┆ 74.29 ┆ 116.08 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000236841.jpg ┆ cat ┆ 115.21 ┆ 96.65 ┆ 93.87 ┆ 42.29 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... 
│ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000257301.jpg ┆ dog ┆ 84.85 ┆ 161.09 ┆ 33.1 ┆ 51.26 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000130399.jpg ┆ dog ┆ 51.63 ┆ 157.14 ┆ 53.13 ┆ 29.75 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000215471.jpg ┆ cat ┆ 126.18 ┆ 71.95 ┆ 36.19 ┆ 47.81 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000251246.jpg ┆ cat ┆ 58.23 ┆ 13.27 ┆ 90.79 ┆ 97.32 │ -└─────────────────────────┴───────┴────────┴────────┴────────┴────────┘ -``` - -Oxen uses a powerful [DataFrame library](https://pola-rs.github.io/polars-book/user-guide/introduction.html) under the hood, and uses the [Apache Arrow](https://arrow.apache.org/) data format to provide powerful cross application functionality. A lot of time and effort can be saved by transforming the data on the command line before writing a single line of application specific code or even opening a Python REPL or Jupyter notebook. - -# Useful Commands - -There are many ways you might want to view, transform, and filter your data on the command line before committing a version of the dataset. - -To quickly see all the options on the `df` command you can run `oxen df --help`. - -## Output Data Formats - -The `--output` option is handy for quickly transforming data files between data formats on disk. Some formats like parquet and arrow are more efficient for different data [tasks](https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a), but are not human readable like tsv or csv. The data format is always a trade-off you will have to decide on for your application. - -Oxen currently supports these file extensions: `csv`, `tsv`, `parquet`, `arrow`, `json`, `jsonl`. 
- -```bash -oxen df annotations/data.csv -o annotations/data.parquet -``` - -``` -shape: (10000, 6) -┌─────────────────────────┬───────┬────────┬────────┬────────┬────────┐ -│ file ┆ label ┆ min_x ┆ min_y ┆ width ┆ height │ -│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ -│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ -╞═════════════════════════╪═══════╪════════╪════════╪════════╪════════╡ -│ images/000000128154.jpg ┆ cat ┆ 0.0 ┆ 19.27 ┆ 130.79 ┆ 129.58 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000544590.jpg ┆ cat ┆ 9.75 ┆ 13.49 ┆ 214.25 ┆ 188.35 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000000581.jpg ┆ dog ┆ 49.37 ┆ 67.79 ┆ 74.29 ┆ 116.08 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000236841.jpg ┆ cat ┆ 115.21 ┆ 96.65 ┆ 93.87 ┆ 42.29 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000257301.jpg ┆ dog ┆ 84.85 ┆ 161.09 ┆ 33.1 ┆ 51.26 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000130399.jpg ┆ dog ┆ 51.63 ┆ 157.14 ┆ 53.13 ┆ 29.75 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000215471.jpg ┆ cat ┆ 126.18 ┆ 71.95 ┆ 36.19 ┆ 47.81 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000251246.jpg ┆ cat ┆ 58.23 ┆ 13.27 ┆ 90.79 ┆ 97.32 │ -└─────────────────────────┴───────┴────────┴────────┴────────┴────────┘ - -Writing "annotations/data.parquet" -``` - -## View Schema - -Sometimes a DataFrame will have many columns and the default command collapses them so they are hard to see. You can use the `--schema` flag to just display the schema of this data frame. Note this is an exclusive flag. 
- -```bash -oxen df annotations/train.csv --schema -``` - -``` -+--------+-------+ -| column | dtype | -+================+ -| file | str | -|--------+-------| -| label | str | -|--------+-------| -| min_x | f64 | -|--------+-------| -| min_y | f64 | -|--------+-------| -| width | f64 | -|--------+-------| -| height | f64 | -+--------+-------+ -``` - -## Slice - -Say you want to take a subset of the datafile and save it in another data file. You can do this with the `--slice` option. This can be handy when creating train, test, and validation sets. The two numbers represent the start and end indices you want to slice into. - -```bash -oxen df annotations/data.csv --slice '0..9000' -o annotations/train.parquet -``` - -``` -shape: (9000, 6) -┌─────────────────────────┬───────┬────────┬────────┬────────┬────────┐ -│ file ┆ label ┆ min_x ┆ min_y ┆ width ┆ height │ -│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ -│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ -╞═════════════════════════╪═══════╪════════╪════════╪════════╪════════╡ -│ images/000000128154.jpg ┆ cat ┆ 0.0 ┆ 19.27 ┆ 130.79 ┆ 129.58 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000544590.jpg ┆ cat ┆ 9.75 ┆ 13.49 ┆ 214.25 ┆ 188.35 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000000581.jpg ┆ dog ┆ 49.37 ┆ 67.79 ┆ 74.29 ┆ 116.08 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000236841.jpg ┆ cat ┆ 115.21 ┆ 96.65 ┆ 93.87 ┆ 42.29 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... 
│ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000431980.jpg ┆ dog ┆ 98.3 ┆ 110.46 ┆ 42.69 ┆ 26.64 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000071025.jpg ┆ cat ┆ 55.33 ┆ 105.45 ┆ 160.15 ┆ 73.57 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000518015.jpg ┆ cat ┆ 43.72 ┆ 4.34 ┆ 72.98 ┆ 129.1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000171435.jpg ┆ dog ┆ 22.86 ┆ 100.03 ┆ 125.55 ┆ 41.61 │ -└─────────────────────────┴───────┴────────┴────────┴────────┴────────┘ -``` - -## Randomize - -Often you will want to randomize data before splitting into train and test sets, or even just to peek at different data values. - -```bash -$ oxen df annotations/data.csv --randomize - -shape: (10000, 6) -┌─────────────────────────┬───────┬────────┬────────┬────────┬────────┐ -│ file ┆ label ┆ min_x ┆ min_y ┆ width ┆ height │ -│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ -│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ -╞═════════════════════════╪═══════╪════════╪════════╪════════╪════════╡ -│ images/000000335955.jpg ┆ dog ┆ 28.98 ┆ 88.35 ┆ 39.22 ┆ 84.05 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000082475.jpg ┆ dog ┆ 0.6 ┆ 23.08 ┆ 200.92 ┆ 198.2 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000515777.jpg ┆ dog ┆ 109.83 ┆ 124.28 ┆ 58.89 ┆ 93.94 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000176089.jpg ┆ cat ┆ 106.62 ┆ 86.23 ┆ 56.53 ┆ 54.44 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... 
│ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000401308.jpg ┆ dog ┆ 21.12 ┆ 0.81 ┆ 202.42 ┆ 221.75 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000105030.jpg ┆ cat ┆ 11.62 ┆ 95.38 ┆ 60.21 ┆ 120.43 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000514890.jpg ┆ dog ┆ 36.76 ┆ 99.58 ┆ 12.27 ┆ 11.18 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000519218.jpg ┆ dog ┆ 71.24 ┆ 58.51 ┆ 8.57 ┆ 22.26 │ -└─────────────────────────┴───────┴────────┴────────┴────────┴────────┘ -``` - -## View Specific Columns - -Maybe you have many columns, and only need to work with a few. You can specify column names in a comma separated list with `--columns`. - -```bash -$ oxen df annotations/data.csv --columns 'file,label' - -shape: (10000, 2) -┌─────────────────────────┬───────┐ -│ file ┆ label │ -│ --- ┆ --- │ -│ str ┆ str │ -╞═════════════════════════╪═══════╡ -│ images/000000128154.jpg ┆ cat │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ images/000000544590.jpg ┆ cat │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ images/000000000581.jpg ┆ dog │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ images/000000236841.jpg ┆ cat │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ ... ┆ ... │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ images/000000257301.jpg ┆ dog │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ images/000000130399.jpg ┆ dog │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ images/000000215471.jpg ┆ cat │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ images/000000251246.jpg ┆ cat │ -└─────────────────────────┴───────┘ -``` - -## Filter Rows - -Oxen has some powerful filter commands built into the CLI. You can quickly filter data down based on an expression involving a column name, an operation, and a row value. 
- -Supported filter operations: ==, !=, >, <, <= , >= - -Supported logical operations: &&, || - -Supported row dtypes: str, i32, i64, f32, f64 - -```bash -$ oxen df annotations/data.csv --filter 'label == dog && height >= 200' - -shape: (5356, 6) -┌─────────────────────────┬───────┬────────┬────────┬───────┬────────┐ -│ file ┆ label ┆ min_x ┆ min_y ┆ width ┆ height │ -│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ -│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ -╞═════════════════════════╪═══════╪════════╪════════╪═══════╪════════╡ -│ images/000000000581.jpg ┆ dog ┆ 49.37 ┆ 67.79 ┆ 74.29 ┆ 216.08 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000001360.jpg ┆ dog ┆ 101.56 ┆ 178.2 ┆ 35.22 ┆ 238.34 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000362567.jpg ┆ dog ┆ 90.96 ┆ 36.65 ┆ 86.2 ┆ 285.08 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000201969.jpg ┆ dog ┆ 167.24 ┆ 73.99 ┆ 37.0 ┆ 264.94 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000237419.jpg ┆ dog ┆ 49.64 ┆ 104.53 ┆ 31.31 ┆ 248.88 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000314708.jpg ┆ dog ┆ 47.17 ┆ 138.18 ┆ 54.72 ┆ 359.55 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000257301.jpg ┆ dog ┆ 84.85 ┆ 161.09 ┆ 33.1 ┆ 251.26 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000130399.jpg ┆ dog ┆ 51.63 ┆ 157.14 ┆ 53.13 ┆ 229.75 │ -└─────────────────────────┴───────┴────────┴────────┴───────┴────────┘ -``` - -## Concatenate (vstack) - -Maybe you have filtered down data, and want to stack the data back into a single frame. The `--vstack` option takes a variable length list of files you would like to concatenate. 
- -```bash -$ oxen df annotations/data.csv --filter 'label == dog' -o /tmp/dogs.parquet -$ oxen df annotations/data.csv --filter 'label == cat' -o /tmp/cats.parquet -$ oxen df /tmp/cats.parquet --vstack /tmp/dogs.parquet -o annotations/data.parquet -``` - -## Take Indices - -Sometimes you have a specific row or set of rows of data you would like to look at. This is where the `--take` option comes in handy. - -```bash -$ oxen df annotations/data.csv --take '1,13,42' - -shape: (3, 6) -┌─────────────────────────┬───────┬───────┬───────┬────────┬────────┐ -│ file ┆ label ┆ min_x ┆ min_y ┆ width ┆ height │ -│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ -│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ -╞═════════════════════════╪═══════╪═══════╪═══════╪════════╪════════╡ -│ images/000000544590.jpg ┆ cat ┆ 9.75 ┆ 13.49 ┆ 214.25 ┆ 188.35 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000279829.jpg ┆ cat ┆ 30.01 ┆ 13.58 ┆ 82.51 ┆ 176.39 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000209289.jpg ┆ dog ┆ 72.75 ┆ 42.06 ┆ 111.52 ┆ 153.09 │ -└─────────────────────────┴───────┴───────┴───────┴────────┴────────┘ -``` - -## Add Column - -Your data might not match the schema of a data frame you want to combine with. In this case you may need to add a column to match the schema. 
You can do this and project default values with `--add-col 'col:val:dtype'` - -```bash -$ oxen df annotations/data.csv --add-col 'is_cute:unknown:str' - -shape: (10000, 7) -┌─────────────────────────┬───────┬────────┬────────┬────────┬────────┬─────────┐ -│ file ┆ label ┆ min_x ┆ min_y ┆ width ┆ height ┆ is_cute │ -│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ -│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │ -╞═════════════════════════╪═══════╪════════╪════════╪════════╪════════╪═════════╡ -│ images/000000128154.jpg ┆ cat ┆ 0.0 ┆ 19.27 ┆ 130.79 ┆ 129.58 ┆ unknown │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ -│ images/000000544590.jpg ┆ cat ┆ 9.75 ┆ 13.49 ┆ 214.25 ┆ 188.35 ┆ unknown │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ -│ images/000000000581.jpg ┆ dog ┆ 49.37 ┆ 67.79 ┆ 74.29 ┆ 116.08 ┆ unknown │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ -│ images/000000236841.jpg ┆ cat ┆ 115.21 ┆ 96.65 ┆ 93.87 ┆ 42.29 ┆ unknown │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ -│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... 
│ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ -│ images/000000257301.jpg ┆ dog ┆ 84.85 ┆ 161.09 ┆ 33.1 ┆ 51.26 ┆ unknown │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ -│ images/000000130399.jpg ┆ dog ┆ 51.63 ┆ 157.14 ┆ 53.13 ┆ 29.75 ┆ unknown │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ -│ images/000000215471.jpg ┆ cat ┆ 126.18 ┆ 71.95 ┆ 36.19 ┆ 47.81 ┆ unknown │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ -│ images/000000251246.jpg ┆ cat ┆ 58.23 ┆ 13.27 ┆ 90.79 ┆ 97.32 ┆ unknown │ -└─────────────────────────┴───────┴────────┴────────┴────────┴────────┴─────────┘ -``` - -## Add Row - -Sometimes it can be a pain to append data to a data file without writing code to do so. The `--add-row` option makes it as easy as a comma separated list and automatically parses the data to the correct dtypes. - -```bash -$ oxen df annotations/data.csv --add-row 'images/my_cat.jpg,cat,0,0,0,0' - -shape: (10001, 6) -┌─────────────────────────┬───────┬────────┬────────┬────────┬────────┐ -│ file ┆ label ┆ min_x ┆ min_y ┆ width ┆ height │ -│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ -│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ -╞═════════════════════════╪═══════╪════════╪════════╪════════╪════════╡ -│ images/000000128154.jpg ┆ cat ┆ 0.0 ┆ 19.27 ┆ 130.79 ┆ 129.58 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000544590.jpg ┆ cat ┆ 9.75 ┆ 13.49 ┆ 214.25 ┆ 188.35 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000000581.jpg ┆ dog ┆ 49.37 ┆ 67.79 ┆ 74.29 ┆ 116.08 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000236841.jpg ┆ cat ┆ 115.21 ┆ 96.65 ┆ 93.87 ┆ 42.29 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... 
│ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000130399.jpg ┆ dog ┆ 51.63 ┆ 157.14 ┆ 53.13 ┆ 29.75 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000215471.jpg ┆ cat ┆ 126.18 ┆ 71.95 ┆ 36.19 ┆ 47.81 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000251246.jpg ┆ cat ┆ 58.23 ┆ 13.27 ┆ 90.79 ┆ 97.32 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/my_cat.jpg ┆ cat ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 │ -└─────────────────────────┴───────┴────────┴────────┴────────┴────────┘ -``` - -## Aggregate - -Oxen DataFrame aggregations can be helpful to quickly get statistics about your data. You can save these statistics to disk and commit them to track stats about your data over time. - -The format for an aggregation query is similar to a lambda function. The inputs to the function are the column name(s) you want to group by. The outputs are functions you want to run over the grouped results. - -``` -('col_0') -> (min('col_1'), max('col_2')) -``` - -A simple example of an aggregation query is finding the distribution of labels in a dataset. - -For example in our cats vs dogs dataset you can group by the `'label'` column, and then run the `count()` function over all the values in the `'file'` column. - -``` -$ oxen df annotations/train.csv -a "('label') -> (count('file'))" - -shape: (2, 2) -┌───────┬───────────────┐ -│ label ┆ count('file') │ -│ --- ┆ --- │ -│ str ┆ u32 │ -╞═══════╪═══════════════╡ -│ cat ┆ 4140 │ -├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ dog ┆ 4860 │ -└───────┴───────────────┘ -``` - -You can specify multiple functions in the output. For example if you wanted the unique file count as well as the raw count you can add the `n_unique()` function. 
- -``` -$ oxen df annotations/train.csv -a "('label') -> (count('file'), n_unique('file'))" - -shape: (2, 3) -┌───────┬───────────────┬──────────────────┐ -│ label ┆ count('file') ┆ n_unique('file') │ -│ --- ┆ --- ┆ --- │ -│ str ┆ u32 ┆ u32 │ -╞═══════╪═══════════════╪══════════════════╡ -│ dog ┆ 4860 ┆ 3798 │ -├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ cat ┆ 4140 ┆ 3525 │ -└───────┴───────────────┴──────────────────┘ -``` - -Here is a list of supported output aggregation functions: - -* `list` aggregate column values into a list -* `count` count the aggregated values -* `n_unique` unique count of the aggregated values -* `min` minimum value of the group -* `max` maximum value of the group -* `arg_min` index of minimum value in the group -* `arg_max` index of maximum value in the group -* `mean` mean value of the group -* `median` median value of the group -* `std` standard deviation of the group -* `var` variance of the group -* `first` first value of the group -* `last` last value of the group -* `head` first 5 values of the group -* `tail` last 5 values of the group - -## Unique - -Oxen can efficiently compute all the unique values given a column name, or comma separated list of column names. - -``` -$ oxen df annotations/train.csv --unique "file" -$ oxen df annotations/train.csv -u "file,label" -``` - -## Sort - -Sorting can be achieved with the `--sort` flag. For example you may want to find the largest bounding boxes by sorting on the height column. 
- -``` -oxen df annotations/train.csv --sort "height" - -shape: (9000, 6) -┌─────────────────────────┬───────┬────────┬────────┬────────┬────────┐ -│ file ┆ label ┆ min_x ┆ min_y ┆ width ┆ height │ -│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ -│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ -╞═════════════════════════╪═══════╪════════╪════════╪════════╪════════╡ -│ images/000000580919.jpg ┆ dog ┆ 61.28 ┆ 88.31 ┆ 2.71 ┆ 1.83 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000577310.jpg ┆ dog ┆ 132.25 ┆ 193.86 ┆ 3.28 ┆ 1.95 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000393384.jpg ┆ dog ┆ 138.85 ┆ 89.89 ┆ 1.25 ┆ 2.11 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000477398.jpg ┆ dog ┆ 185.11 ┆ 195.93 ┆ 2.51 ┆ 2.6 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000069205.jpg ┆ dog ┆ 0.0 ┆ 0.0 ┆ 224.0 ┆ 224.0 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000554737.jpg ┆ cat ┆ 0.0 ┆ 0.0 ┆ 224.0 ┆ 224.0 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000213819.jpg ┆ cat ┆ 8.32 ┆ 0.0 ┆ 207.77 ┆ 224.0 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ images/000000397212.jpg ┆ cat ┆ 0.36 ┆ 0.0 ┆ 115.5 ┆ 224.0 │ -└─────────────────────────┴───────┴────────┴────────┴────────┴────────┘ -``` - -Sort is also useful in the context of aggregations. Aggregated statistics do not come back in a guaranteed order. If you want to see which files have the most labels, you can sort the output of an aggregation by its `count()` column. 
- -``` -$ oxen df annotations/train.csv -a "('file') -> (list('label'), count('label'))" --sort "count('label')" - -shape: (7128, 3) -┌─────────────────────────┬───────────────────────────┬────────────────┐ -│ file ┆ list('label') ┆ count('label') │ -│ --- ┆ --- ┆ --- │ -│ str ┆ list[str] ┆ u32 │ -╞═════════════════════════╪═══════════════════════════╪════════════════╡ -│ images/000000060202.jpg ┆ ["dog"] ┆ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ images/000000518156.jpg ┆ ["cat"] ┆ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ images/000000347879.jpg ┆ ["cat"] ┆ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ images/000000290136.jpg ┆ ["dog"] ┆ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ ... ┆ ... ┆ ... │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ images/000000398285.jpg ┆ ["dog", "dog", ... "dog"] ┆ 14 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ images/000000244933.jpg ┆ ["cat", "cat", ... "cat"] ┆ 17 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ images/000000016950.jpg ┆ ["dog", "dog", ... "dog"] ┆ 19 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ images/000000315555.jpg ┆ ["dog", "dog", ... "dog"] ┆ 19 │ -└─────────────────────────┴───────────────────────────┴────────────────┘ -``` - -## Reverse - -You can also reverse the order of a data table. By default `--sort` sorts in ascending order, but can be reversed with the `--reverse` flag. 
- -``` -oxen df annotations/train.csv -a "('file') -> (count('label'))" --sort "count('label')" --reverse - -shape: (7128, 2) -┌─────────────────────────┬────────────────┐ -│ file ┆ count('label') │ -│ --- ┆ --- │ -│ str ┆ u32 │ -╞═════════════════════════╪════════════════╡ -│ images/000000315555.jpg ┆ 19 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ images/000000016950.jpg ┆ 19 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ images/000000244933.jpg ┆ 17 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ images/000000092869.jpg ┆ 14 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ ... ┆ ... │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ images/000000038827.jpg ┆ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ images/000000470862.jpg ┆ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ images/000000292101.jpg ┆ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ images/000000345432.jpg ┆ 1 │ -└─────────────────────────┴────────────────┘ -``` \ No newline at end of file diff --git a/DataPointLevelVersionControl.md b/DataPointLevelVersionControl.md deleted file mode 100644 index 5982e9d..0000000 --- a/DataPointLevelVersionControl.md +++ /dev/null @@ -1,302 +0,0 @@ - -# Data Point Level Version Control with Oxen - -What is "data point level version control" and why does it matter? - -We built Oxen from the ground up to be able to version large data sets. This includes the ability to version many files at once, as well as the ability to version the data associated with these files which we call "data point level version control". - -Machine learning is all about learning a mapping from inputs to outputs. Each input and output can be thought of as a data point. Examples of inputs could be an image, a video, a piece of text, an audio clip, or just a list of numeric [features](https://en.wikipedia.org/wiki/Feature_(machine_learning)) that represent the input at a higher level. 
Basic outputs are usually labels during [classification](https://en.wikipedia.org/wiki/Statistical_classification) or numeric values during [regression](https://en.wikipedia.org/wiki/Regression_analysis). With the advent of [Generative AI](https://www.sequoiacap.com/article/generative-ai-a-creative-new-world/), outputs are also becoming more complex in the form of human readable text, images, videos and audio. - -## Stop Using Git For Data Files - -Usually version control systems track changes on a file level. The hash of the file contents determines whether this file has been changed, moved, removed, etc. - -Each file could be considered an input or an output, but there is no inherent mapping between them. A quick way to add a mapping between inputs and outputs is to store them within another file such as a csv or line delimited json file. - -An example of this mapping for classifying images as cats or dogs might be a csv file that looks like this: - -``` -filename,label -images/00001.jpg,cat -images/00002.jpg,dog -images/00003.jpg,cat -images/00004.jpg,dog -``` - -At first it seems perfectly reasonable to add this file to your standard version control system. After all, there is usually a `diff` command that can show you added and modified lines within a file. - -The problem is that, unlike code, which is easy to scan line by line for changes, data that is fed into modern machine learning systems can consist of hundreds of thousands if not millions of data points. Imagine adding thousands of new rows or new columns to your data. It quickly becomes unwieldy to debug and manage these changes with a line by line diff. We need to store these mappings in a more efficient data structure. - -## Oxen DataFrames - -Oxen takes version control a step further by versioning each row of structured files such as `csv`, `tsv`, `parquet`, and `arrow`, which it indexes into an [Apache Arrow](https://arrow.apache.org/) DataFrame. 
- -> Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead. - -This allows Oxen to track changes between the inputs and outputs of your system, and gives you fast access to any changes or subsets of your data you need for training, validation, or testing. - -# CelebA Example - -To illustrate the power of Oxen, let's work with the [CelebA dataset](https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html). This dataset has over 200,000 images of celebrities as well as attributes about their faces. Tracking the input images alone is no small feat. If you added the images directory alone to a git repository, it would quickly become apparent that the tool was not built for this use case. - -Staging and committing the images in Oxen is an easy two-step process. - -```bash -$ oxen add images/ -$ oxen commit -m "adding all the images in the dataset" -``` - -If you have ever tried to index hundreds of thousands of images into [git](https://git-scm.com/) or even [git lfs](https://git-lfs.github.com/) you will see that Oxen offers a [significant performance boost](Performance.md). - -## Going Beyond Versioning Files - -It quickly becomes unmanageable to have a file for each attribute, feature, or output you want to predict. This kind of data is much better suited to a DataFrame. 
- -For example if you wanted to extend the cats vs dogs csv from above to include the bounding boxes around each individual cat and dog, you can represent the input files and outputs like this: - -``` -file,label,min_x,min_y,width,height -images/000000128154.jpg,cat,0.0,19.27,130.79,129.58 -images/000000544590.jpg,cat,9.75,13.49,214.25,188.35 -images/000000000581.jpg,dog,49.37,67.79,74.29,116.08 -images/000000236841.jpg,cat,115.21,96.65,93.87,42.29 -``` - -The CelebA dataset includes a similar mapping from inputs to outputs, but is a much more rich and full featured. It contains 202,599 rows and 41 columns of information ranging from "Bushy_Eyebrows" to "Wearing_Necklace". - -A couple lines of the csv file looks like this: - -``` -$ head -n 3 list_attr_celeba.csv -image_id,5_o_Clock_Shadow,Arched_Eyebrows,Attractive,Bags_Under_Eyes,Bald,Bangs,Big_Lips,Big_Nose,Black_Hair,Blond_Hair,Blurry,Brown_Hair,Bushy_Eyebrows,Chubby,Double_Chin,Eyeglasses,Goatee,Gray_Hair,Heavy_Makeup,High_Cheekbones,Male,Mouth_Slightly_Open,Mustache,Narrow_Eyes,No_Beard,Oval_Face,Pale_Skin,Pointy_Nose,Receding_Hairline,Rosy_Cheeks,Sideburns,Smiling,Straight_Hair,Wavy_Hair,Wearing_Earrings,Wearing_Hat,Wearing_Lipstick,Wearing_Necklace,Wearing_Necktie,Young -000001.jpg,-1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,-1,-1,-1,1,1,-1,1,-1,-1,1,-1,-1,1,-1,-1,-1,1,1,-1,1,-1,1,-1,-1,1 -000002.jpg,-1,-1,-1,1,-1,-1,-1,1,-1,-1,-1,1,-1,-1,-1,-1,-1,-1,-1,1,-1,1,-1,-1,1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,-1,-1,-1,-1,1 -``` - -It is hard to make sense of this data table without opening some sort of programming environment. Oxen comes in handy for initial data exploration even before you get to versioning. - -## Exploring the Data - -Oxen has a convenience command `df` for exploring and transforming DataFrames. If we load the data into an Oxen DataFrame out of the gate we get a much more manageable output. 
- -To learn more about the `df` command and common exploratory data analysis operations you might want to perform check out the [Oxen DataFrame documentation](DataFrames.md). - -```bash -$ oxen df list_attr_celeba.csv -shape: (202599, 41) -┌───────┬────────────┬────────────┬──────────┬─────┬────────────┬────────────┬────────────┬───────┐ -│ image ┆ 5_o_Clock_ ┆ Arched_Eye ┆ Attracti ┆ ... ┆ Wearing_Li ┆ Wearing_Ne ┆ Wearing_Ne ┆ Young │ -│ _id ┆ Shadow ┆ brows ┆ ve ┆ ┆ pstick ┆ cklace ┆ cktie ┆ --- │ -│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ i64 │ -│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ │ -╞═══════╪════════════╪════════════╪══════════╪═════╪════════════╪════════════╪════════════╪═══════╡ -│ 00000 ┆ -1 ┆ 1 ┆ 1 ┆ ... ┆ 1 ┆ -1 ┆ -1 ┆ 1 │ -│ 1.jpg ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ -├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ 00000 ┆ -1 ┆ -1 ┆ -1 ┆ ... ┆ -1 ┆ -1 ┆ -1 ┆ 1 │ -│ 2.jpg ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ -├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ 00000 ┆ -1 ┆ -1 ┆ -1 ┆ ... ┆ -1 ┆ -1 ┆ -1 ┆ 1 │ -│ 3.jpg ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ -├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ 00000 ┆ -1 ┆ -1 ┆ 1 ┆ ... ┆ 1 ┆ 1 ┆ -1 ┆ 1 │ -│ 4.jpg ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ -├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │ -├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ 20259 ┆ -1 ┆ -1 ┆ -1 ┆ ... ┆ -1 ┆ -1 ┆ -1 ┆ 1 │ -│ 6.jpg ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ -├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ 20259 ┆ -1 ┆ -1 ┆ -1 ┆ ... ┆ -1 ┆ -1 ┆ -1 ┆ 1 │ -│ 7.jpg ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ -├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ 20259 ┆ -1 ┆ 1 ┆ 1 ┆ ... 
┆ 1 ┆ -1 ┆ -1 ┆ 1 │ -│ 8.jpg ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ -├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ 20259 ┆ -1 ┆ 1 ┆ 1 ┆ ... ┆ 1 ┆ -1 ┆ -1 ┆ 1 │ -│ 9.jpg ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │ -└───────┴────────────┴────────────┴──────────┴─────┴────────────┴────────────┴────────────┴───────┘ -``` - -## Committing DataFrames - -The real power is when you stage and commit a DataFrame. Oxen processes and hashes each row and tracks the data schema over time, so that we can detect changes. Instead of seeing line by line changes on a diff, you can tell the individual rows and columns that changed. - -Committing DataFrames is no different than committing any other file. Behind the scenes Oxen works its magic to track all the changes. - -Let's stage and commit the list_attr_celeba.csv file so that we have the original version tracked. - -```bash -$ oxen add list_attr_celeba.csv -$ oxen commit -m "adding all attributes about the faces" -``` - -## Diff Changes - -Then we can use the [df](DataFrames.md) command with the [--add_col](DataFrames.md#add-column) flag to project a new column onto the data for whether or not this person "Is_Famous". Everyone in this dataset is a celebrity, and by definition is famous, so we will have the default value be "1". - -```bash -$ oxen df list_attr_celeba.csv --add_col 'Is_Famous:1:i64' --output list_attr_celeba.csv - -.... - -$ oxen diff list_attr_celeba.csv - -Added Columns - -shape: (202599, 1) -┌───────────┐ -│ Is_Famous │ -│ --- │ -│ i64 │ -╞═══════════╡ -│ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ ... │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ 1 │ -└───────────┘ -``` - -We can see that Oxen is taking advantage of the structure of the DataFrame and just returning the columns that were added during the diff. 
- -Imagine for the specific application we are working on we actually just care about a few of the attributes in the table. To view all the columns in a non-collapsed view you can use the [--schema](DataFrames.md#view-schema) flag on the [df](DataFrames.md) command. - -``` -$ oxen df list_attr_celeba.csv --schema - -+---------------------+-------+ -| column | dtype | -+=============================+ -| image_id | str | -|---------------------+-------| -| 5_o_Clock_Shadow | i64 | -|---------------------+-------| - -...... -I am collapsing, but feel free -to try in your terminal -...... - -|---------------------+-------| -| Wearing_Necktie | i64 | -|---------------------+-------| -| Young | i64 | -|---------------------+-------| -| Is_Famous | i64 | -+---------------------+-------+ -``` - -Then let's narrow this down to the `image_id`, whether they are `Smiling`, and our new `Is_Famous` column and overwrite the file with the output. - -```bash -$ oxen df list_attr_celeba.csv --columns 'image_id,Smiling,Is_Famous' --output list_attr_celeba.csv - -shape: (202599, 3) -┌────────────┬─────────┬───────────┐ -│ image_id ┆ Smiling ┆ Is_Famous │ -│ --- ┆ --- ┆ --- │ -│ str ┆ i64 ┆ i64 │ -╞════════════╪═════════╪═══════════╡ -│ 000001.jpg ┆ 1 ┆ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ -│ 000002.jpg ┆ 1 ┆ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ -│ 000003.jpg ┆ -1 ┆ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ -│ 000004.jpg ┆ -1 ┆ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ -│ ... ┆ ... ┆ ... │ -├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ -│ 202596.jpg ┆ 1 ┆ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ -│ 202597.jpg ┆ 1 ┆ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ -│ 202598.jpg ┆ 1 ┆ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ -│ 202599.jpg ┆ -1 ┆ 1 │ -└────────────┴─────────┴───────────┘ -``` - -If we use the `oxen diff` command again we will see there is 1 added column, and 39 removed. 
- -```bash -$ oxen diff list_attr_celeba.csv -Added Columns - -shape: (202599, 1) -┌───────────┐ -│ Is_Famous │ -│ --- │ -│ i64 │ -╞═══════════╡ -│ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ ... │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ 1 │ -├╌╌╌╌╌╌╌╌╌╌╌┤ -│ 1 │ -└───────────┘ - - -Removed Columns - -shape: (202599, 39) -┌─────┬─────┬────────────┬─────┬─────┬─────┬─────┬─────┬───────┐ -│ 5_o ┆ Arc ┆ Attractive ┆ Bag ┆ ... ┆ Wea ┆ Wea ┆ Wea ┆ Young │ -│ _Cl ┆ hed ┆ --- ┆ s_U ┆ ┆ rin ┆ rin ┆ rin ┆ --- │ -│ ock ┆ _Ey ┆ i64 ┆ nde ┆ ┆ g_L ┆ g_N ┆ g_N ┆ i64 │ -│ _Sh ┆ ebr ┆ ┆ r_E ┆ ┆ ips ┆ eck ┆ eck ┆ │ -│ ado ┆ ows ┆ ┆ yes ┆ ┆ tic ┆ lac ┆ tie ┆ │ -│ w ┆ --- ┆ ┆ --- ┆ ┆ k ┆ e ┆ --- ┆ │ -│ --- ┆ i64 ┆ ┆ i64 ┆ ┆ --- ┆ --- ┆ i64 ┆ │ -│ i64 ┆ ┆ ┆ ┆ ┆ i64 ┆ i64 ┆ ┆ │ -╞═════╪═════╪════════════╪═════╪═════╪═════╪═════╪═════╪═══════╡ -│ -1 ┆ 1 ┆ 1 ┆ -1 ┆ ... ┆ 1 ┆ -1 ┆ -1 ┆ 1 │ -├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ -1 ┆ -1 ┆ -1 ┆ 1 ┆ ... ┆ -1 ┆ -1 ┆ -1 ┆ 1 │ -├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ -1 ┆ -1 ┆ -1 ┆ -1 ┆ ... ┆ -1 ┆ -1 ┆ -1 ┆ 1 │ -├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ -1 ┆ -1 ┆ 1 ┆ -1 ┆ ... ┆ 1 ┆ 1 ┆ -1 ┆ 1 │ -├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │ -├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ -1 ┆ -1 ┆ -1 ┆ -1 ┆ ... ┆ -1 ┆ -1 ┆ -1 ┆ 1 │ -├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ -1 ┆ -1 ┆ -1 ┆ -1 ┆ ... ┆ -1 ┆ -1 ┆ -1 ┆ 1 │ -├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ -1 ┆ 1 ┆ 1 ┆ -1 ┆ ... ┆ 1 ┆ -1 ┆ -1 ┆ 1 │ -├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤ -│ -1 ┆ 1 ┆ 1 ┆ -1 ┆ ... 
┆ 1 ┆ -1 ┆ -1 ┆ 1 │ -└─────┴─────┴────────────┴─────┴─────┴─────┴─────┴─────┴───────┘ -``` - -Note even though the [--output](DataFrames.md#output-data-formats) flag has overwritten our initial data, we have nothing to worry about since we already versioned our data. With Oxen we can revert to the original at any time 😄. - -Hopefully you can see that taking advantage of the innate structure of the data is already a better option than treating it like code, and sifting line by line through `git diff`. This is just one of many advantages you will see by using Oxen. - -Next up see the power of [Oxen Indices](Indices.md) for fast access to your data. diff --git a/Indices.md b/Indices.md deleted file mode 100644 index d1b5b2d..0000000 --- a/Indices.md +++ /dev/null @@ -1,214 +0,0 @@ -# Oxen Schemas & Indices - -Indexing is a powerful tool within your Oxen toolchain. Oxen combines the concepts of DataFrames, Schemas, and Indices to allow you to quickly explore to subsets of your data at a specific point in time. - -If you have not read about Oxen DataFrames, it would be good to read up on [Data Point Level Version Control](DataPointLevelVersionControl.md) as well as [DataFrames](DataFrames.md) before continuing. - -⛔️ 👷 Caution, these features are still in development. - -# Schemas - -When you add and commit a DataFrame to Oxen, it will automatically track the schema associated with the file. - -TODO: add command to clone data - -For example let's use a subset of the MSCoco Dataset as an example that has been processed into a DataFrame. - -```bash -$ oxen df processed/annotations/coco/bb_train2017.csv - -shape: (860001, 6) -┌─────────────────────────────────────┬──────────────┬────────┬────────┬────────┬────────┐ -│ file ┆ label ┆ min_x ┆ min_y ┆ width ┆ height │ -│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ -│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ -╞═════════════════════════════════════╪══════════════╪════════╪════════╪════════╪════════╡ -│ raw/images/coco/train2017/000000... 
┆ motorcycle ┆ 359.17 ┆ 146.17 ┆ 112.45 ┆ 213.57 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ raw/images/coco/train2017/000000... ┆ person ┆ 339.88 ┆ 22.16 ┆ 153.88 ┆ 300.73 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ raw/images/coco/train2017/000000... ┆ person ┆ 471.64 ┆ 172.82 ┆ 35.92 ┆ 48.1 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ raw/images/coco/train2017/000000... ┆ bicycle ┆ 486.01 ┆ 183.31 ┆ 30.63 ┆ 34.98 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ raw/images/coco/train2017/000000... ┆ cup ┆ 195.73 ┆ 267.76 ┆ 13.14 ┆ 25.15 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ raw/images/coco/train2017/000000... ┆ sink ┆ 270.32 ┆ 260.22 ┆ 114.92 ┆ 67.4 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ raw/images/coco/train2017/000000... ┆ person ┆ 474.76 ┆ 158.66 ┆ 25.24 ┆ 69.33 │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤ -│ raw/images/coco/train2017/000000... ┆ refrigerator ┆ 105.04 ┆ 325.97 ┆ 187.84 ┆ 49.03 │ -└─────────────────────────────────────┴──────────────┴────────┴────────┴────────┴────────┘ -``` - -Add and commit the training data file. - -```bash -$ oxen add processed/annotations/coco/bb_train2017.csv -$ oxen commit -m "adding training data file" -``` - -Then run the `schemas` command to see a list of the detected schemas. - -``` -$ oxen schemas - -+------+----------------------------------+-----------------------------------+ -| name | hash | fields | -+=============================================================================+ -| ? 
| 36d0edc8779f42e30b0d630aa83bc83c | [file, label, ..., width, height] | -+------+----------------------------------+-----------------------------------+ -``` - -You'll see that each schema has a hash that is generated from the combination of its field names and data types. - -To see a more expanded view of a schema you can run the `schema show` subcommand. - -``` -$ oxen schemas show 36d0edc8779f42e30b0d630aa83bc83c - -Schema has no name, to name run: - - oxen schemas name 36d0edc8779f42e30b0d630aa83bc83c "my_schema" - -+----+--------+-------+ -| id | name | dtype | -+=====================+ -| 0 | file | str | -|----+--------+-------| -| 1 | label | str | -|----+--------+-------| -| 2 | min_x | f64 | -|----+--------+-------| -| 3 | min_y | f64 | -|----+--------+-------| -| 4 | width | f64 | -|----+--------+-------| -| 5 | height | f64 | -+----+--------+-------+ -``` - -Schemas can either be referenced by their hash, or their name, to give a schema a more readable name you can use the `schema name` subcommand. - -``` -$ oxen schemas name 36d0edc8779f42e30b0d630aa83bc83c "bounding_box" -``` - -One of the benefits of having schemas versioned is it helps you know how your data changes over time. The other added benefit is indexing the data. - -# Indices - -Indices are a useful construct in Oxen to allow constant time O(1) access to your data. This can become crucial when datasets increase in size to millions if not billions of entries. - -One way of searching and finding a subset of a DataFrame is the [--filter](DataFrames.md#filter-rows) option on the [df](DataFrames.md) command. This works fine for relatively small datasets, but at the end of the day filter scans the entire column of data to find what you need, and can be quite slow to scan larger datasets. - -If you are willing to put some upfront cost in indexing on commit, Oxen has the ability to lookup data in O(1) constant time. 
- -Once you create an index on a repository, it will be updated every time you make a new commit with new data. - -## Primary Key Indices - -Since by default Oxen has no context about the DataFrames or the fields being ingested, the first use case for user generated indices is to create constant time access for an internal id in your system to a versioned row in Oxen. - -Maybe you know what the primary key of this data should be, or you simply have more information stored in an external database that you'll want to map to later. - -Indices can be added with the `schema create_index` subcommand by providing the field name you want to index. - -``` -$ oxen schema create_index my_schema --field 'my_id' -``` - -You can then quickly get back to any row within that schema that contains a specific ID value - -``` -$ oxen schema query my_schema --query 'my_id=1234' -``` - -You can also see the state of the row at a specific commit id by passing the `--source` option. Source can be a commit id or a branch name. This helps build tools to see how the data points evolves over time. - -``` -$ oxen schema query my_schema --query 'my_id=1234' --source $COMMIT_ID -``` - -If you want to see all the indices that exist within a schema, you can run the `schema indices` subcommand. - -``` -$ oxen schema indices my_schema -``` - -TODO: To delete an index within a schema, you can run the `schema delete_index` subcommand. - -``` -$ oxen schema delete_index my_schema --field 'my_id' -``` - - -## Train/Test/Val Indices - -Primary key lookups are one example use case for indices. Another example is to split your data into subsets. Indexed values do not have to be unique. - -You may want a field that indicates whether this example belongs to train, test, or the validation set of the data. Then quickly pull the data from just the evaluation set. 
- -``` -$ oxen schema create_index bounding_box --field 'eval_set' -``` - -To get a quick summary of the distribution of values Oxen indexed you can run a `--query` with the `count()` function. - -``` -$ oxen schema query bounding_box --query 'count(eval_set)' - -shape: (3, 2) -┌───────────┬───────────────────┐ -│ eval_set ┆ count │ -│ --- ┆ --- │ -│ i64 ┆ u32 │ -╞═══════════╪═══════════════════╡ -│ train ┆ 162770 │ -├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ test ┆ 19867 │ -├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ -│ valid ┆ 19962 │ -└───────────┴───────────────────┘ -``` - -To grab the rows that belong to an indexed subset, pass in an `=` expression. - -``` -$ oxen schema query bounding_box --query 'eval_set=train' - -# TODO: implement -``` - - -TODO: Show how you can push, and access these subsets from a remote server (pull, fetch with apache arrow, etc) - - -## File Path Indices - -You may want an index on a field that contains a file path referencing some external data in the system. - -If you create an index with the `file` field, OxenHub will know to map this to a file in the repository and try to render the data in place. - -``` -$ oxen schema create_index bounding_box --field 'file' -``` - -## Classification Label Indices - -Another example index might be on a classification label. Say you want to be able to quickly get to all the entries that are tagged as `"person"`. - -``` -$ oxen schema create_index bounding_box --field 'label' -$ oxen schema query bounding_box --query 'label=person' -``` - -Now let your imagination run wild on the type of data you work with, and how you want to version and see it evolve over time. - -# Conclusion - -Indices are a powerful tool that trade write time for constant time access to your data. They also have the benefit of being versioned so that you know exactly what the state of the index was at any given point in time. This makes time series analysis and debugging your data at any point in time much easier. 
- -Let us know what you think or if you have any other feature requests at hello@oxen.ai. diff --git a/SelfHosting.md b/SelfHosting.md index 96d604b..9b78e34 100644 --- a/SelfHosting.md +++ b/SelfHosting.md @@ -62,7 +62,7 @@ Once you have committed data locally and are ready to share them with colleagues You can either create a remote through the web UI on [OxenHub](https://oxen.ai) or if you have setup a server your self, you will have to run the `create-remote` command. ```bash -$ oxen create-remote MyNamespace MyRepoName +$ oxen create-remote --name MyNamespace/MyRepoName --host 0.0.0.0:3001 --scheme http ``` Repositories that live on an Oxen Server have the idea of a `namespace` and a `name` to help you organize your repositories. diff --git a/Stats.md b/Stats.md deleted file mode 100644 index 860a372..0000000 --- a/Stats.md +++ /dev/null @@ -1,3 +0,0 @@ -# Get some stats on release downloads - -curl -s https://api.github.com/repos/oxen-ai/oxen-release/releases | jq \ No newline at end of file diff --git a/WhyOxen.md b/WhyOxen.md deleted file mode 100644 index 514fd5a..0000000 --- a/WhyOxen.md +++ /dev/null @@ -1,17 +0,0 @@ -# Why Oxen? - -Why build a new version control system for data? 
- -* Speed, these datasets are large - * Fast Hashing - * Native Parallelization - * Fast access kv dbs - * Indexed rows into content addressable data frames - * Data streaming for distributed training -* Explore your data - * Data != code, browsing line by line is not sufficient - * To debug you need native ability to slice, dice, cluster, within a version -* Better than zip files and checkpoints - * Can do EDA and run experiments without downloading everything - -* Just like there are many humans, there will be many models, trained on different data \ No newline at end of file diff --git a/annotation/LabelStudio.md b/annotation/LabelStudio.md deleted file mode 100644 index 6d634a2..0000000 --- a/annotation/LabelStudio.md +++ /dev/null @@ -1,128 +0,0 @@ -# Label Studio Integration - -Label Studio is a popular Open Source data labeling tool in which anyone can build custom UIs or use pre-built labeling templates. - -For this example we will be labeling bounding boxes around Cats and Dogs. - -## Installation - -You can find more ways to install Label Studio in their [Documentation](https://github.com/heartexlabs/label-studio). For now we will just use `pip`. - -```bash -$ pip install label-studio -``` - -## Run - -To run Label Studio and have it save the images to a specified directory, pass the `--data-dir` flag. - -```bash -$ label-studio start --data-dir label-studio/uploads/ -``` - -## Login - -Create a profile for the annotator who will annotate the initial images. Label Studio is nice because you can have multiple accounts annotate the same data and cross validate their work. 
- -Label Studio Login - -## Setup Project - -Next create a project - -Label Studio Create Project - -Name it CatDogBBox and give it an optional description - -Label Studio Name Project - -Pick the annotation task to be "Object Detection" - -Label Studio Annotation Task - -Create the labels to be "Cat" and "Dog" - -Label Studio Create Labels - -## Annotate Data - -Upload your images - -Label Studio Upload images - -Click on the label then drag a bounding box around the dog or cat (or both) in the photos - -Label Studio Label Images - -Once you are done with the first couple of annotations, export the annotations as a csv file. Instead of using the suggested file name with the convoluted timestamp, simply export a file called `annotations.csv` within the same directory you specified as the `--data-dir` when starting Label Studio. - -Label Studio Export Annotations - -## Track in Oxen - -With a couple examples uploaded, annotated, and exported now would be a good time to setup the initial version control. - -Open your terminal and navigate into the directory with all the data and initialize your Oxen repository. - -```bash -$ oxen init -``` - -Then stage the initial annotations file as well as the uploads directory. - -```bash -$ oxen add label-studio/annotations.csv -$ oxen add label-studio/uploads/media/upload/ -$ oxen status -``` - -Oxen Status Staged - -There are some other files in the `label-studio/uploads/` directory that we do not need to track right now including the sqlite database and an export directory. - -Next commit the annotations file and the images. - -```bash -$ oxen commit -m "Adding first couple annotations and images." -``` - -Now feel free to label a few more images and export to the same `label-studio/annotations.csv` file. - -If you run `oxen status` again you will see it detects the changes. - -```bash -$ oxen status - -On branch main -> 61e672a9a88e4dcb - -Modified files: (use "oxen add ..." 
to update what will be committed) - modified: label-studio/annotations.csv -``` - -Then you can run `oxen diff` on the file to see the annotations that were added since the last commit. - -```bash -$ oxen diff label-studio/annotations.csv - -Added Rows - -shape: (3, 8) -┌──────────────┬─────┬──────────────┬───────────┬────────────┬─────────────┬─────────────┬───────────┐ -│ image ┆ id ┆ label ┆ annotator ┆ annotation ┆ created_at ┆ updated_at ┆ lead_time │ -│ --- ┆ --- ┆ --- ┆ --- ┆ _id ┆ --- ┆ --- ┆ --- │ -│ str ┆ i64 ┆ str ┆ i64 ┆ --- ┆ str ┆ str ┆ f64 │ -│ ┆ ┆ ┆ ┆ i64 ┆ ┆ ┆ │ -╞══════════════╪═════╪══════════════╪═══════════╪════════════╪═════════════╪═════════════╪═══════════╡ -│ /data/upload ┆ 7 ┆ [{"x": 25.68 ┆ 1 ┆ 7 ┆ 2022-11-08T ┆ 2022-11-08T ┆ 5.906 │ -│ /1/49e2dae1- ┆ ┆ 421052631579 ┆ ┆ ┆ 04:42:04.85 ┆ 04:42:04.85 ┆ │ -│ 00000000... ┆ ┆ , "y": 2... ┆ ┆ ┆ 2993Z ┆ 3035Z ┆ │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ -│ /data/upload ┆ 5 ┆ [{"x": 19.36 ┆ 1 ┆ 5 ┆ 2022-11-08T ┆ 2022-11-08T ┆ 7.166 │ -│ /1/9cede6c3- ┆ ┆ 842105263158 ┆ ┆ ┆ 04:41:33.50 ┆ 04:41:33.50 ┆ │ -│ 00000000... ┆ ┆ , "y": 2... ┆ ┆ ┆ 5882Z ┆ 5921Z ┆ │ -├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ -│ /data/upload ┆ 6 ┆ [{"x": 9.894 ┆ 1 ┆ 6 ┆ 2022-11-08T ┆ 2022-11-08T ┆ 11.478 │ -│ /1/2eeef79e- ┆ ┆ 736815053673 ┆ ┆ ┆ 04:41:54.20 ┆ 04:41:54.20 ┆ │ -│ 00000000... ┆ ┆ , "y": 9... ┆ ┆ ┆ 9253Z ┆ 9292Z ┆ │ -└──────────────┴─────┴──────────────┴───────────┴────────────┴─────────────┴─────────────┴───────────┘ -``` diff --git a/commands/Restore.md b/commands/Restore.md deleted file mode 100644 index 48a09fc..0000000 --- a/commands/Restore.md +++ /dev/null @@ -1,31 +0,0 @@ -# Oxen Restore - -The `oxen restore` command can be useful for reverting changes in your working directory to the HEAD commit, or some specific point in the history. 
- -For example if you modified or deleted a file that you did not intend to, simply run - -```bash -$ oxen restore path/to/file.txt -``` - -It also works recursively to restore all changes within a directory - -```bash -$ oxen restore path/to/dir -``` - -## Restore to version - -If you want to restore to a specific version in your commit history, you can supply the commit id or branch name with the `--source` flag. - -```bash -$ oxen restore path/to/file.txt --source COMMIT_ID -``` - -## Unstage a file - -If you accidentally staged a file or directory with the `oxen add` command, and want to unstage it, you can also use the `restore` command for this. - -```bash -$ oxen restore --staged path/to/dir -``` diff --git a/data/CatDogBBox.tar.gz b/data/CatDogBBox.tar.gz deleted file mode 100644 index 0f1be5e..0000000 Binary files a/data/CatDogBBox.tar.gz and /dev/null differ