Videos less than 30 sec not working #3

Open
AlphaHasher opened this issue May 14, 2022 · 4 comments

Comments

@AlphaHasher

I am aware you mentioned that it does not work for videos under 30 seconds, but I wondered if there were any updates to this. Also, if there is anything I can help with, then just let me know

@Farmadupe
Owner

Actually I have draft code that now works with videos under 30 seconds (hopefully will work with any video with at least 64 frames, regardless of duration).

I have also updated the algorithm itself: duplicated videos now hash to a much smaller Hamming distance from each other, so the new code is just 'better' than the old code. (The new algorithm uses a three-dimensional DCT; the old one used ten two-dimensional DCTs.)
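For readers unfamiliar with the idea, here is a rough sketch of how a 3D-DCT hash and a Hamming-distance comparison can fit together. It is not the code in the repository: the function names, the flat cube layout, the choice of the 4x4x4 low-frequency corner, and the median threshold are all assumptions made for illustration.

```rust
use std::f64::consts::PI;

/// Naive, unnormalized 1-D DCT-II of `input` (O(n^2), fine for a sketch).
fn dct_1d(input: &[f64]) -> Vec<f64> {
    let n = input.len();
    (0..n)
        .map(|k| {
            input
                .iter()
                .enumerate()
                .map(|(i, &x)| x * (PI / n as f64 * (i as f64 + 0.5) * k as f64).cos())
                .sum::<f64>()
        })
        .collect()
}

/// Separable 3-D DCT of an n*n*n cube stored flat, index = (t * n + y) * n + x.
/// The 1-D transform is applied along x, then y, then t.
fn dct_3d(cube: &[f64], n: usize) -> Vec<f64> {
    assert_eq!(cube.len(), n * n * n);
    let mut data = cube.to_vec();
    for stride in [1, n, n * n] {
        for start in 0..n * n * n {
            // Visit each line once: skip starts whose coordinate on this axis is not 0.
            if (start / stride) % n != 0 {
                continue;
            }
            let line: Vec<f64> = (0..n).map(|i| data[start + i * stride]).collect();
            for (i, v) in dct_1d(&line).into_iter().enumerate() {
                data[start + i * stride] = v;
            }
        }
    }
    data
}

/// Hypothetical hashing rule: take the 4*4*4 lowest-frequency coefficients and
/// set a bit wherever the coefficient is above the median of those 64 values.
fn hash_bits(coeffs: &[f64], n: usize) -> u64 {
    assert!(n >= 4 && coeffs.len() == n * n * n);
    let mut low = Vec::with_capacity(64);
    for t in 0..4 {
        for y in 0..4 {
            for x in 0..4 {
                low.push(coeffs[(t * n + y) * n + x]);
            }
        }
    }
    let mut sorted = low.clone();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median = (sorted[31] + sorted[32]) / 2.0;
    low.iter()
        .enumerate()
        .fold(0u64, |h, (i, &c)| if c > median { h | (1u64 << i) } else { h })
}

/// Number of differing bits between two hashes.
fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Under these assumptions, two videos would be treated as duplicates when `hamming_distance` between their hashes falls below some tolerance; the tolerance itself is not specified in this thread.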

In fact, the codebase is in a 'mostly-working' state, so I'll see if I can push it to a new branch in github. I haven't worked on it much recently (it's only a hobby project) so I'm not sure when I'll put more time into finishing it off.

Here's my current tasklist. Feel free to contribute on any of these!

Necessary Tasks (mostly boring)

  • Remove more calls to unwrap() in the library. Currently many filesystem operations can cause a panic.
  • The function to create a video hash now takes options. Ideally these should be removed if it is possible to create universal defaults. Otherwise a builder interface should probably be created.
  • I replaced the external calls to FFmpeg in v0.1 with bindings to libgstreamer. The advantages are: 1) it's faster, 2) it has better error reporting, 3) it's less hacky. But after completing the integration, the raw libgstreamer library nondeterministically crashes vid_dup_finder, seemingly due to known quality issues in the raw library and/or plugins/codecs. So I need to reconsider whether to revert to the old FFmpeg interface.
  • If the libgstreamer interface is not abandoned, also check that it is easy to install the libgstreamer shared libraries on Windows (I'm worried it probably isn't easy).
  • Documentation needs to be updated
  • Tests need to be updated.
  • Check that the update is still compatible with Czkawka, because that is how most people use vid_dup_finder_lib.
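
As an illustration of the first two tasks, here is a minimal sketch of a builder whose `build()` returns a `Result`, next to a filesystem helper that propagates errors with `?` instead of calling `unwrap()`. Every name here (`HashError`, `HashOptions`, `HashOptionsBuilder`, `video_file_len`, and both option fields) is hypothetical and not taken from vid_dup_finder_lib; the default of 64 frames only mirrors the figure mentioned above.

```rust
use std::fmt;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical error type for the sketch.
#[derive(Debug)]
pub enum HashError {
    InvalidOption(String),
    Io(io::Error),
}

impl fmt::Display for HashError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            HashError::InvalidOption(msg) => write!(f, "invalid option: {}", msg),
            HashError::Io(err) => write!(f, "I/O error: {}", err),
        }
    }
}

impl std::error::Error for HashError {}

impl From<io::Error> for HashError {
    fn from(err: io::Error) -> Self {
        HashError::Io(err)
    }
}

/// Hypothetical set of options for creating a video hash.
#[derive(Debug, Clone, PartialEq)]
pub struct HashOptions {
    pub frame_count: usize,
    pub skip_forward_secs: f64,
}

/// Builder that starts from defaults and validates on `build()`.
#[derive(Debug, Clone)]
pub struct HashOptionsBuilder {
    frame_count: usize,
    skip_forward_secs: f64,
}

impl Default for HashOptionsBuilder {
    fn default() -> Self {
        Self { frame_count: 64, skip_forward_secs: 0.0 }
    }
}

impl HashOptionsBuilder {
    pub fn frame_count(mut self, n: usize) -> Self {
        self.frame_count = n;
        self
    }

    pub fn skip_forward_secs(mut self, secs: f64) -> Self {
        self.skip_forward_secs = secs;
        self
    }

    /// Returns an error instead of panicking when an option is out of range.
    pub fn build(self) -> Result<HashOptions, HashError> {
        if self.frame_count == 0 {
            return Err(HashError::InvalidOption("frame_count must be at least 1".into()));
        }
        if self.skip_forward_secs.is_nan() || self.skip_forward_secs < 0.0 {
            return Err(HashError::InvalidOption("skip_forward_secs must be >= 0".into()));
        }
        Ok(HashOptions {
            frame_count: self.frame_count,
            skip_forward_secs: self.skip_forward_secs,
        })
    }
}

/// Filesystem access without unwrap(): `?` converts io::Error into HashError.
pub fn video_file_len(path: &Path) -> Result<u64, HashError> {
    let metadata = fs::metadata(path)?;
    Ok(metadata.len())
}
```

A caller that is happy with the defaults would write `HashOptionsBuilder::default().build()?`, and a missing file surfaces as `HashError::Io` rather than a panic.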

Useful tasks

  • Find a way to check that the library works on mac. (this should probably be 'necessary')
  • Investigate why the library currently has very high memory usage. I think this is probably some memory fragmentation/leak in libgstreamer codecs. Unlikely alternative is memory leak/fragmentation in Rayon.
  • Set up some basic CI tests.
  • Check if using 32 frames instead of 64 is acceptable, as this will speed up the hashing process.
  • Check that all expected combinations of cmdline options in the app are actually supported
  • General code tidyup
  • Restore some of the utilities in the GUI application for analyzing which duplicates have the highest quality. They were previously implemented in the library but I deleted them instead of transferring them.
  • If I'm feeling brave, the libgstreamer interface crate could be published on crates.io. There currently isn't a well-documented crate for extracting video frames.

Pie in the sky tasks

  • Update the app GUI so that it's not rubbish. Alternatively cease publishing the GUI portion of the app (IMO the cmdline app is quite good quality but the GUI is very hacky.)
  • If a common reference dataset for finding near-duplicate videos exists, then test against it. Use it to obtain quantitative results relative to other libraries.
  • Enumerate the transformations that the library should be able to cope with (e.g. changes in brightness, resizing, resolution, compression artefacts). Enumerate the transformations that it can't cope with (rotation, embedding, horizontal flipping, differing duration etc.)
  • Learn alternative algorithms for finding duplicate videos. Reimplement the library if such algorithms are quantitatively better.

Ideas for extension

  • Extend the library to find duplicate 'clips' within videos instead of duplicate whole video files. It should be possible to use the 3D-DCT to perform autocorrelation
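
To make the clip idea a little more concrete, here is a sketch of a much simpler stand-in for autocorrelation: slide a short sequence of per-segment hashes along a longer one and keep the offset with the smallest total Hamming distance. The function name and the assumption that each video has already been reduced to one 64-bit hash per segment are mine, not something that exists in the library.

```rust
/// Finds the alignment of `clip` inside `video` with the smallest summed
/// Hamming distance. Both slices hold one hypothetical 64-bit hash per segment.
/// Returns (segment offset, total distance), or None if no alignment is possible.
fn best_clip_offset(video: &[u64], clip: &[u64]) -> Option<(usize, u32)> {
    if clip.is_empty() || clip.len() > video.len() {
        return None;
    }
    video
        .windows(clip.len())
        .enumerate()
        .map(|(offset, window)| {
            // Sum the number of differing bits across the aligned segments.
            let dist: u32 = window
                .iter()
                .zip(clip)
                .map(|(a, b)| (a ^ b).count_ones())
                .sum();
            (offset, dist)
        })
        .min_by_key(|&(_, dist)| dist)
}
```

A real implementation would still need a threshold on the returned distance to decide whether the best alignment is actually a match.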

@IronCraftMan

Any chance you could publish what you have? Would be great to have for others to work on, even if you can't yourself.

@Farmadupe
Owner

I have created branch dct3d in the vid_dup_finder_lib library. It is quite functional and is much better than v0.1.0. It may be good enough quality to publish on crates.io.

Feel free to:

  • Fork the codebase
  • Raise pull requests (I like to think I will be responsive)
  • publish any or all code on crates.io.

There are two choices of backend library: 1) the ffmpeg command-line interface, or 2) linking to the gstreamer shared library. gstreamer is faster, but it is probably difficult to bind on Windows, sometimes crashes the entire process due to bugs, and seems to cause a lot of memory fragmentation when decoding many videos of various formats.

So I suggest keeping ffmpeg.

Remaining tasks on my list are:

  • Check that documentation is accurate
  • Remove any remaining explicit panics
  • Run/fix/delete any existing unit tests
  • Maybe delete the ffmpeg_gst_wrapper subcrate and just use ffmpeg_cmdline_utils

P.S. the vid_frame_iter crate may also be suitable to release, as many people on reddit ask for a simple interface to decode videos from gstreamer. It is memory-safe and zero-copy.

https://github.com/Farmadupe/vid_dup_finder_lib/tree/dct3d

@Th3EvilGod

@Farmadupe any ETA, please?
