Improvements : Speed #3

chris-ha458 · 2023-09-24T23:25:01Z

As per our discussion in #2 for speed improvements the following has been suggested

calc coherence & mess in threads
or calc mess for plugins in threads (or some async?)
or something other...

The paths I had in mind was these:

Related to threads idea : use Rayon
- Replace HashMap with concurrent DashMap (Current std HashMap implements rayon so not strictly necessary, but might be useful to look into regardless)
~~Use replace hashing algorithm used in HashMap~~ with FxHash, AHash, HighwayHash
- aHash implemented Using alternative hash function for HashMaps, HashSets & LRU caching #14
~~Replace sort() with sort_unstable()~~ Sorting was changed to unstable #6
~~Identfiy preallocation opportunities~~. For instance, replace Vec::new() with Vec::with_capacity()
- Seems like most current new() cannot really preallocate due to uncertainty. The basic preallocation algorithm is optimized enough that unless we have a strong idea regarding memory access premature allocation is not helpful.

Many of these are low hanging fruit and related to refactoring the code to idiomatic Rust code.
For example, there are many for loops in this code. Iterator based code is more idiomatic, easier to improve with rayon, and interact better with allocation. (pushing items from within a for loop can cause multiple allocs and copies, while collecting an iterator can allow fewer allocations.)

The text was updated successfully, but these errors were encountered:

nickspring · 2023-09-25T04:40:48Z

Thank you for issue.

Rayon is undoubtedly a good idea for parallel iterations, it should probably be implemented with feature flag.
Do you mean the hashing algorithm for LRU caching in functions (there are a lot of hashing)?
Why do you think that sort_unstable() will be faster than sort()? According to documentation it uses quicksort algorithm (sort uses mergesort). Mergesort has better time complexity in comparison to quicksort, hasn't?

nickspring · 2023-09-25T04:54:15Z

I tried to replace sort() with sort_unstable() and sort_by() with sort_unstable_by(). There is almost no difference in speed, at least for this performance test and benchmark large_datasets. But unstable sorting can produce different results. So I'm not sure it makes sense to use it.

chris-ha458 · 2023-09-25T06:46:03Z

Do you mean the hashing algorithm for LRU caching in functions (there are a lot of hashing)?

In rust, hashmaps themselves can use an alternative hashing method for speed or other reasons.
The default hasher is siphash1-3 which is ddos resistant, but not fast.

Why do you think that sort_unstable() will be faster than sort()

This is what the documentation says :

When applicable, unstable sorting is preferred because it is generally faster than stable sorting and it doesn’t allocate auxiliary memory. See sort_unstable.

Also sort on vec is timsort and sort_unstable is pdqsort

Unstable sort is only relevant if there are other indexes to consider. Atleast for the sorted indexes it should yield the same result.

In case different results are not desirable, I think it would be a good idea to have a unit or integration test to verify expected behavior.

nickspring · 2023-09-25T12:47:37Z

OK, I will push sort_unstable to repo. Concerning the hashing function - we need to try to change it. There are a lot of hash operations because of LRU cache, it can help!

nickspring · 2023-09-25T13:07:06Z

hm, it looks like I tested it wrongly and unstable sort speed up it for 10-20%.

chris-ha458 · 2023-09-26T13:52:09Z

Yes, I understand. I plan to work on hash function replacement at this time. Rayon will be implemented later, not a goal right now.
Originally posted by @nickspring in #8 (comment)

Some recommendations for hash functions :

FxHash : I don't like it particularly, but I thought i'd share since this is often the first choice in this kind of situation.
aHash : This is not stable (the hash values change from version to version) and only seeks to be the fastest.
highway-rs : This is often overlooked but it is very fast, not cryptographically tested but gone through extensive testing and validation(original paper and Smhasher). It also has sample code on how one might implement it.

use std::collections::HashMap;
use std::hash::BuildHasherDefault;
use highway::HighwayHasher;
let mut map =
  HashMap::with_hasher(BuildHasherDefault::<HighwayHasher>::default());

map.insert(1, 2);
assert_eq!(map.get(&1), Some(&2));

If I had to choose, I would have chosen highwayhash.
(No known hash vulernabilities like fxhash,ahash,xxhash,wyhash. Much faster than any cryptographic hash and competitive with non-cryptographic good enough hashes. Pure idiomatic rust implementation with SIMD intrinsic pathways.)

nickspring · 2023-09-26T13:57:03Z

Actually it's good question. I tested standard hasher on big file and it was slow. And I added blake3 hasher (not sure if we need fingerprint, as it is in Python version, at all for comparison, maybe we can compare just decoded_payload)

charset-normalizer-rs/src/entity.rs

Line 178 in 4d9fd71

blake3::hash(

chris-ha458 · 2023-09-29T02:42:05Z

#14 implements aHash. For pure speed with little other considerations, it is good enough.
It is an unstable hash, meaning there is no guarantee that there will be same hashes across platforms or versions.
This might become a problem if/when serde becomes a goal, but the way the repo is used currently it won't be a problem. We could implement highwayhash or others when it does become a problem.

As such for short term, further hasher investigations would be lower priority.
To improve from here we would have to implement benchmarks to compare siphash,fxhash,ahash etc.

nickspring · 2023-09-29T05:26:37Z

Hashing is used only for caching, so it doesn't matter that it's unstable. Also, we don't serialize hashes via serde.
I сhose ahash as it's implemented already in cached library as a feature.

nickspring · 2023-10-02T10:38:13Z

I did some experiments with Rayon. At this time it looks like it has high overhead. So I plan to try to rewrite mess measurement code in more efficient way (I have some ideas).

chris-ha458 · 2023-10-02T11:25:10Z

I was looking into the lib.rs Two main loops.
Especially the named loops 'iana_encodings_loop and 'chunks_loop

Unfortunately, as they are it does not seem like they are easily written into functional Rayon code (if at all)

I agree that some other parts of the code are not good candidates for Rayon due to overhead. Hope you find a good approach!

chris-ha458 · 2023-10-03T12:03:00Z

I'm exploring integrating encoding_rs instead of rust-encoding.

rust-encoding seems to be abandoned and unmaintained. encoding_rs is better supported.
However, they work in fundamentally different ways and encoding_rs does not support many current constructs.

It would obviously take some time.

nickspring · 2023-10-04T08:24:31Z

I'm exploring integrating encoding_rs instead of rust-encoding.

rust-encoding seems to be abandoned and unmaintained. encoding_rs is better supported. However, they work in fundamentally different ways and encoding_rs does not support many current constructs.

It would obviously take some time.

Yes, it doesn't fit to the concept of this library.

nickspring · 2023-10-05T20:18:35Z

It looks like new mess detector approach is faster #31

--> A) charset-normalizer-rs Conclusions
   --> Accuracy: 97.1%
   --> Total time: 1.147958363s
   --> Avg time: 2.806743ms
   --> 50th: 1.044498ms
   --> 95th: 8.228578ms
   --> 99th: 19.302748ms

------------------------------
--> B) chardet Conclusions
   --> Faster than charset-normalizer-rs by 0.6 times
   --> Accuracy: 82.6%
   --> Total time: 2.083252561s
   --> Avg time: 5.093526ms
   --> 50th: 331.699µs
   --> 95th: 2.515793ms
   --> 99th: 16.465157ms

------------------------------
--> C) chardetng Conclusions
   --> Faster than charset-normalizer-rs by 0.9 times
   --> Accuracy: 90.7%
   --> Total time: 1.253477517s
   --> Avg time: 3.064736ms
   --> 50th: 821.598µs
   --> 95th: 8.578877ms
   --> 99th: 24.627836ms

However I broke a lot of idiomatic code in ms.rs.
Tomorrow I will clean code (remove unused functions, rewrite tests for new mess detector, refactor code).

chris-ha458 · 2023-10-12T04:11:30Z

From the list, only Rayon remains.
Unfortunately the current implementation is streamlined enough that naive parallelization is not helpful.

If there are any ideas, let me know. In the meantime, feel free to close this comment.
I think it would be relevant again when there is a major change in this repo or any upstream dependencies (most of which are also highly optimized)

nickspring · 2023-10-12T05:15:53Z

Yes, I have an idea, to refactor and speed up cd.rs module. I think it's possible.

chris-ha458 mentioned this issue Sep 24, 2023

Improvements : Idiomatic code #4

Open

nickspring mentioned this issue Sep 25, 2023

Sorting was changed to unstable #6

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improvements : Speed #3

Improvements : Speed #3

chris-ha458 commented Sep 24, 2023 •

edited

Loading

nickspring commented Sep 25, 2023 •

edited

Loading

nickspring commented Sep 25, 2023 •

edited

Loading

chris-ha458 commented Sep 25, 2023

nickspring commented Sep 25, 2023

nickspring commented Sep 25, 2023

chris-ha458 commented Sep 26, 2023 •

edited

Loading

nickspring commented Sep 26, 2023

chris-ha458 commented Sep 29, 2023

nickspring commented Sep 29, 2023

nickspring commented Oct 2, 2023

chris-ha458 commented Oct 2, 2023

chris-ha458 commented Oct 3, 2023

nickspring commented Oct 4, 2023

nickspring commented Oct 5, 2023

chris-ha458 commented Oct 12, 2023

nickspring commented Oct 12, 2023

Improvements : Speed #3

Improvements : Speed #3

Comments

chris-ha458 commented Sep 24, 2023 • edited Loading

nickspring commented Sep 25, 2023 • edited Loading

nickspring commented Sep 25, 2023 • edited Loading

chris-ha458 commented Sep 25, 2023

nickspring commented Sep 25, 2023

nickspring commented Sep 25, 2023

chris-ha458 commented Sep 26, 2023 • edited Loading

nickspring commented Sep 26, 2023

chris-ha458 commented Sep 29, 2023

nickspring commented Sep 29, 2023

nickspring commented Oct 2, 2023

chris-ha458 commented Oct 2, 2023

chris-ha458 commented Oct 3, 2023

nickspring commented Oct 4, 2023

nickspring commented Oct 5, 2023

chris-ha458 commented Oct 12, 2023

nickspring commented Oct 12, 2023

chris-ha458 commented Sep 24, 2023 •

edited

Loading

nickspring commented Sep 25, 2023 •

edited

Loading

nickspring commented Sep 25, 2023 •

edited

Loading

chris-ha458 commented Sep 26, 2023 •

edited

Loading