My personal use case for this tool was to filter a very large pcap file into smaller files, each representing an individual TCP session (matching the extract command). It succeeded in doing so, but I noticed a few issues that can be resolved in a relatively simple manner:
- Runtime scales linearly (by my guesstimate) with the number of streams in the input pcap file.
- High RAM consumption; in my use it reached about 80 GB.
Converting linear scaling to something closer to O(1) (guesstimate)
```rust
fn read_pcap(input: &str) -> Option<(PcapHeader, Vec<Stream>)> {
    /* snip */
    let mut output = Vec::<Stream>::new();
    /* snip */
    'nextpkt: while let Some(pkt) = pcap_reader.next_packet() {
        /* snip */
        let pkt = pkt.unwrap();
        let packet = PPacket::new(&pkt);
        if let Some(eth) = EthernetPacket::new(&pkt.data) {
            // Validate it is an IPv4 packet
            if eth.get_ethertype() == EtherTypes::Ipv4 {
                // Validate it is a TCP packet and we have extracted it
                if let Some(si) = StreamInfo::new(&eth) {
                    if output.is_empty() {
                        /* snip */
                        output.push(Stream::new(si, packet));
                        continue 'nextpkt;
                    } else {
                        // *This* is where the poor scaling comes from.
                        // Running this for each packet we receive, when we
                        // have 50000 streams, causes a noticeable slowdown.
                        for s in output.iter_mut() {
                            if s.is_stream(&si) {
                                s.add_and_update(si, packet);
                                continue 'nextpkt;
                            }
                        }
                        output.push(Stream::new(si, packet));
                    }
                }
            }
        }
    }
    /* snip */
    if output.is_empty() {
        None
    } else {
        Some((header, output))
    }
}
```
The solution is to replace the vector with a HashMap, HashSet, or similar: anything with O(1) lookup time.

We can do this because `s.is_stream(&si)` simply compares StreamInfo, which is static data for the lifetime of the session.

Here is an example solution. Note that we lose the order in which each stream was first seen; the fix is to add an integer to Stream and increment it each time no matching Stream was found in the map.
```rust
fn read_pcap(input: &str) -> Option<(PcapHeader, Vec<Stream>)> {
    /* snip */
    // Note that we replace "output" with a hashmap. StreamInfo is immutable
    // once a session has been identified, so this is A-O.K.
    let mut entries: HashMap<StreamInfo, Stream> = HashMap::new();
    /* snip */
    while let Some(pkt) = pcap_reader.next_packet() {
        /* snip */
        let pkt = pkt.unwrap();
        let packet = PPacket::new(&pkt);
        if let Some(eth) = EthernetPacket::new(&pkt.data) {
            if eth.get_ethertype() == EtherTypes::Ipv4 {
                // Decode the packet
                if let Some(si) = StreamInfo::new(&eth) {
                    // Just hash it and try to get it.
                    if let Some(s) = entries.get_mut(&si) {
                        // Stream already exists. Extend it.
                        s.add_and_update(si, packet);
                    } else {
                        // No matching stream, create one.
                        entries.insert(si, Stream::new(si, packet));
                    }
                }
            }
        }
    }
    /* snip */
    if entries.is_empty() {
        None
    } else {
        Some((header, entries.into_values().collect()))
    }
}
```
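To sketch the ordering fix mentioned above: store an insertion index in each stream entry and sort by it at the end. The `StreamInfo`/`Stream` definitions here are minimal stand-ins for the crate's real types, just to make the idea concrete:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the crate's StreamInfo/Stream types.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct StreamInfo {
    src_port: u16,
    dst_port: u16,
}

struct Stream {
    info: StreamInfo,
    first_seen: usize, // insertion index, restores first-seen ordering
    packets: usize,
}

fn collect(packets: &[StreamInfo]) -> Vec<Stream> {
    let mut entries: HashMap<StreamInfo, Stream> = HashMap::new();
    let mut next_index = 0usize;
    for &si in packets {
        entries
            .entry(si)
            .and_modify(|s| s.packets += 1)
            .or_insert_with(|| {
                // Only new streams consume an index.
                let s = Stream { info: si, first_seen: next_index, packets: 1 };
                next_index += 1;
                s
            });
    }
    let mut streams: Vec<Stream> = entries.into_values().collect();
    // Sorting by the stored index recovers the order streams were first seen.
    streams.sort_by_key(|s| s.first_seen);
    streams
}

fn main() {
    let a = StreamInfo { src_port: 3, dst_port: 4 };
    let b = StreamInfo { src_port: 1, dst_port: 2 };
    let streams = collect(&[a, b, a]);
    let order: Vec<u16> = streams.iter().map(|s| s.info.src_port).collect();
    assert_eq!(order, vec![3, 1]); // first-seen order preserved
    assert_eq!(streams[0].packets, 2);
    println!("ok");
}
```

This keeps the O(1) lookup while paying only a final O(n log n) sort over the (much smaller) set of streams, not packets.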
High RAM consumption
This stems from the tool retaining all seen pcap packets in an internal vector and only writing them to disk once everything has been extracted.

The solution is to not do that. In the previously proposed hashmap, simply map each StreamInfo to the sending half of a channel. Let the receiving end(s) write the data into the corresponding pcap file, or perform the tallying or scanning of the pcap file, depending on the subcommand issued. Adapting everything to this is simple, since those ends would essentially iterate over a Receiver instead of a vector.
The receiving side can be implemented by:

- Starting a thread per pcap file; this does not scale especially well, but it is less resource-hungry than the current approach.
- Or having one writer thread that multiplexes over all Receivers, keeps track of which data source belongs to which file, and writes into it once data has been received.
I'd recommend the second option: it's slightly more complex, but it limits the tool to two threads (or one if implemented asynchronously, but that rewrite sounds annoying).
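The second option can be sketched with std's mpsc channels. This is a hypothetical outline, not the tool's actual API: the parser tags each packet with its stream key and sends it over one channel, and a single writer thread demultiplexes by key and appends to the right output (a `Vec<u8>` stands in for the per-stream pcap file here):

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// Stand-in for StreamInfo: a (src_port, dst_port) pair.
type StreamKey = (u16, u16);

fn run_writer(rx: mpsc::Receiver<(StreamKey, Vec<u8>)>) -> HashMap<StreamKey, Vec<u8>> {
    let mut files: HashMap<StreamKey, Vec<u8>> = HashMap::new();
    // Iterating over the Receiver ends cleanly once every Sender is dropped.
    for (key, bytes) in rx {
        files.entry(key).or_default().extend_from_slice(&bytes);
    }
    files
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let writer = thread::spawn(move || run_writer(rx));

    // Parser side: packets are handed off as soon as they are decoded,
    // so nothing accumulates in memory here.
    tx.send(((1, 2), vec![0xAA])).unwrap();
    tx.send(((3, 4), vec![0xBB])).unwrap();
    tx.send(((1, 2), vec![0xCC])).unwrap();
    drop(tx); // closing the channel lets the writer thread finish

    let files = writer.join().unwrap();
    assert_eq!(files[&(1, 2)], vec![0xAA, 0xCC]);
    println!("ok");
}
```

With real files, `files` would map each key to an open pcap writer instead of a buffer; the overall shape (one parser thread, one multiplexing writer thread) stays the same.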
Thank you for this detailed issue as well. This tool was originally hacked together quickly and so this issue doesn't really surprise me. Just wanted to respond to say this may take me longer to address than the other issue since it's a bit more complicated.