diff --git a/README.md b/README.md
index 37990ad..43bd9ec 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,10 @@
+# Introduction
+
+The text_search project can be used to create ASR (automatic speech recognition) datasets from long-form audio and even longer texts.
+
+The core of text_search is a general audio alignment pipeline that aligns audio files to their corresponding text and splits them into short segments, excluding any audio segments that do not correspond exactly to the aligned text.
+
+
 # Installation
 
 ## With pip
@@ -36,3 +43,23 @@ python3 -c "import textsearch; print(textsearch.__file__)"
 
 
 We only set the environment variable `PYTHONPATH`.
+
+# Recipes
+
+- [libriheavy](examples/libriheavy)
+- [subtitle](examples/subtitle)
+
+
+# References
+More explanations are available in the following paper:
+
+```
+@misc{kang2023libriheavy,
+  title={Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context},
+  author={Wei Kang and Xiaoyu Yang and Zengwei Yao and Fangjun Kuang and Yifan Yang and Liyong Guo and Long Lin and Daniel Povey},
+  year={2023},
+  eprint={2309.08105},
+  archivePrefix={arXiv},
+  primaryClass={eess.AS}
+}
+```
\ No newline at end of file
diff --git a/docs/source/getting-started/index.rst b/docs/source/getting-started/index.rst
new file mode 100644
index 0000000..6e29516
--- /dev/null
+++ b/docs/source/getting-started/index.rst
@@ -0,0 +1,59 @@
+Getting started
+===============
+
+About
+-----
+
+The text_search project can be used to create ASR (automatic speech recognition) datasets from long-form audio and even longer texts.
+
+The core of text_search is a general audio alignment pipeline that aligns audio files to their corresponding text and splits them into short segments, excluding any audio segments that do not correspond exactly to the aligned text.
+
+Installation
+------------
+
+With pip
+********
+
+.. code-block:: bash
+
+  pip install fasttextsearch
+
+
+For developers
+**************
+
+Please use the following commands to install `fasttextsearch`_:
+
+.. code-block:: bash
+
+  pip install numpy
+
+  git clone https://github.com/k2-fsa/text_search
+  cd text_search
+
+  mkdir build
+  cd build
+  cmake ..
+  make -j
+  make test
+
+  # set PYTHONPATH so that you can use "import textsearch"
+
+  export PYTHONPATH=$PWD/../textsearch/python:$PWD/lib:$PYTHONPATH
+
+To test that you have installed `fasttextsearch`_ successfully, please run:
+
+.. code-block:: bash
+
+  python3 -c "import textsearch; print(textsearch.__file__)"
+
+It should print something like the following:
+
+.. code-block:: bash
+
+  /Users/fangjun/open-source/text_search/textsearch/python/textsearch/__init__.py
+
+.. hint::
+  We did not use either ``python3 setup.py install`` or ``pip install``.
+  We only set the environment variable ``PYTHONPATH``.
+
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 63965bb..0e3fec4 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -10,6 +10,6 @@ Welcome to fasttextsearch's documentation!
    :maxdepth: 2
    :caption: Contents:
 
-   ./install/index.rst
+   ./getting-started/index.rst
    ./tutorials/index.rst
    ./python-api/index.rst
diff --git a/docs/source/install/developers.rst b/docs/source/install/developers.rst
deleted file mode 100644
index ba31837..0000000
--- a/docs/source/install/developers.rst
+++ /dev/null
@@ -1,35 +0,0 @@
-For developers
-==============
-
-Please use the following commands to install `fasttextsearch`_:
-
-.. code-block:: bash
-
-  pip install numpy
-
-  git clone https://github.com/danpovey/text_search
-  cd text_search
-
-  mkdir build
-  cd build
-  cmake ..
-  make -j
-  make test
-
-  # set PYTHONPATH so that you can use "import textsearch"
-
-  export PYTHONPATH=$PWD/../textsearch/python:$PWD/lib:$PYTHONPATH
-
-To test the you have installed `fasttextsearch`_ successfully, please run:
-
-.. code-block:: bash
-
-  python3 -c "import textsearch; print(textsearch.__file__)"
-
-It should print something like below:
-
-.. code-block:: bash
-
-  /Users/fangjun/open-source/text_search/textsearch/python/textsearch/__init__.py
-
-
diff --git a/docs/source/install/index.rst b/docs/source/install/index.rst
deleted file mode 100644
index d965bca..0000000
--- a/docs/source/install/index.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-Installation
-============
-
-.. toctree::
-   :maxdepth: 2
-
-   ./developers.rst
diff --git a/docs/source/python-api/index.rst b/docs/source/python-api/index.rst
index b29c790..5e67268 100644
--- a/docs/source/python-api/index.rst
+++ b/docs/source/python-api/index.rst
@@ -3,7 +3,7 @@ Python API
 
 This section lists Python APIs in `fasttextsearch`_.
 
-.. currentmodule:: textsearch
+.. currentmodule:: textsearch.python.textsearch
 
 
 create_suffix_array
@@ -25,3 +25,15 @@ get_nice_alignments
 -------------------
 
 .. autofunction:: get_nice_alignments
+
+align_queries
+-------------------
+.. autofunction:: align_queries
+
+get_longest_increasing_pairs
+----------------------------
+.. autofunction:: get_longest_increasing_pairs
+
+split_aligned_queries
+---------------------
+.. autofunction:: split_aligned_queries
\ No newline at end of file
diff --git a/docs/source/tutorials/index.rst b/docs/source/tutorials/index.rst
index 45ff859..b28ae8a 100644
--- a/docs/source/tutorials/index.rst
+++ b/docs/source/tutorials/index.rst
@@ -1,6 +1,8 @@
 Tutorials
 ============
 
+The following tutorials cover the core concepts of text_search.
+
 .. toctree::
    :maxdepth: 2
 