
High resource consumption on very large pcap files #14

Open
Pommaq opened this issue Feb 26, 2024 · 2 comments

Pommaq commented Feb 26, 2024

My personal use case for this tool was to filter a very, very large pcap file into smaller files representing individual TCP sessions, matching the extract command. It succeeded in doing so, but I noticed a few issues along the way that can be resolved in a relatively simple manner:

  1. Runtime scales linearly (by my guesstimate) with the number of streams in the input pcap file.
  2. High RAM consumption: in my use it peaked at about 80 GB.

Converting linear scaling to something closer to O(1) (guesstimate)

fn read_pcap(input: &str) -> Option<(PcapHeader, Vec<Stream>)> {
    /*snip*/
    let mut output = Vec::<Stream>::new();
    /*snip*/
    'nextpkt: while let Some(pkt) = pcap_reader.next_packet() {
        /*snip*/
        let pkt = pkt.unwrap();
        let packet = PPacket::new(&pkt);
        if let Some(eth) = EthernetPacket::new(&pkt.data) {
            // Validate it is an IPv4 packet
            if eth.get_ethertype() == EtherTypes::Ipv4 {
                // Validate it is a TCP packet and we have extracted it
                if let Some(si) = StreamInfo::new(&eth) {
                    if output.is_empty() {
                        /*snip*/
                        output.push(Stream::new(si, packet));
                        continue 'nextpkt;
                    } else {
                        // *This* is where the poor scaling comes from:
                        // a linear scan over every known stream, run once per
                        // packet. With ~50000 streams the slowdown is very
                        // noticeable.
                        for s in output.iter_mut() {
                            if s.is_stream(&si) {
                                s.add_and_update(si, packet);
                                continue 'nextpkt;
                            }
                        }

                        output.push(Stream::new(si, packet));
                    }
                }
            }
        }
    }
    /*snip*/
    
    if output.is_empty() {
        None
    } else {
        Some((header, output))
    }
}

The solution is to replace the vector with a HashMap, a HashSet, or similar: basically, anything with O(1) lookup time.
We can do this since "s.is_stream(&si)" simply compares StreamInfo, which is static data for the lifetime of the session.
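For this to work, StreamInfo must implement Hash, PartialEq, and Eq so it can serve as a HashMap key. A minimal sketch, assuming StreamInfo is a plain struct of session-identifying fields (the field names are illustrative, not the repo's actual layout):

use std::net::Ipv4Addr;

// Hypothetical field layout, for illustration only. Deriving these traits
// is all a HashMap key needs, as long as every field is itself Hash + Eq.
#[derive(Clone, Copy, Hash, PartialEq, Eq)]
struct StreamInfo {
    src_ip: Ipv4Addr,
    dst_ip: Ipv4Addr,
    src_port: u16,
    dst_port: u16,
}

One caveat: if is_stream() treats the two directions of a TCP session as the same stream, the derives are not enough; Hash and Eq would need manual implementations so that (src, dst) and (dst, src) hash and compare equal (e.g. by hashing the endpoint pair in a canonical order).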

Here is an example solution. Note, however, that we lose the order in which we first saw each stream; the fix is to add an integer field to "Stream", set from a counter that increases each time no existing Stream was found in the map (see the order-preserving sketch after the example).

fn read_pcap(input: &str) -> Option<(PcapHeader, Vec<Stream>)> {
    /*snip*/
    // Note that we replace "output" with a hashmap. StreamInfo is immutable
    // once a session has been identified, so this is A-O.K.
    
    let mut entries: HashMap<StreamInfo, Stream> = HashMap::new();
    /*snip*/
    while let Some(pkt) = pcap_reader.next_packet() {
        /*snip*/
        let pkt = pkt.unwrap();
        let packet = PPacket::new(&pkt);
        
        if let Some(eth) = EthernetPacket::new(&pkt.data) {
            if eth.get_ethertype() == EtherTypes::Ipv4 {
                // Decode the packet
                if let Some(si) = StreamInfo::new(&eth) {
                    
                    // Just hash it and try to get it.
                    if let Some(s) = entries.get_mut(&si) {
                        // Stream already exists. Extend it
                        s.add_and_update(si, packet);
                    } else {
                        // No matching stream; create one. (This uses si as
                        // both key and constructor argument, so StreamInfo
                        // must be Copy or cloned here.)
                        entries.insert(si, Stream::new(si, packet));
                    }
                }
            }
        }
    }
    /*snip*/
    
    if entries.is_empty() {
        None
    } else {
        Some((header, entries.into_values().collect()))
    }
}
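And a minimal, self-contained sketch of the order-preserving variant mentioned above, assuming a hypothetical first_seen counter field on Stream (all types here are simplified stand-ins for the real ones):

use std::collections::HashMap;

// Simplified stand-ins for the real types.
#[derive(Clone, Copy, Hash, PartialEq, Eq, Debug)]
struct StreamInfo(u32);

struct Stream {
    first_seen: u64, // hypothetical insertion-order counter
    info: StreamInfo,
}

fn main() {
    let mut entries: HashMap<StreamInfo, Stream> = HashMap::new();
    let mut next_id: u64 = 0;

    // Stand-in for the pcap read loop. The entry API returns a &mut Stream,
    // so the real code could still call add_and_update on existing entries.
    for si in [StreamInfo(3), StreamInfo(1), StreamInfo(3)] {
        entries.entry(si).or_insert_with(|| {
            let s = Stream { first_seen: next_id, info: si };
            next_id += 1;
            s
        });
    }

    // Restore first-seen order before handing the streams back.
    let mut streams: Vec<Stream> = entries.into_values().collect();
    streams.sort_by_key(|s| s.first_seen);
    assert_eq!(streams[0].info, StreamInfo(3));
}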

High RAM consumption

This stems from the tool retaining every pcap packet it has seen in an internal vector and only writing them to disk once everything has been extracted.

The solution is to not do that. In the previously proposed hashmap, simply map each StreamInfo to a channel Sender. Let the receiving end(s) write the data into the corresponding pcap file, or perform the tallying or scanning of the pcap data, depending on the subcommand issued. Adapting everything to this is simple, since those ends would essentially iterate over a Receiver instead of a vector.
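A minimal sketch of that wiring, using std::sync::mpsc and, for self-containment, one consumer thread per stream (all types are stand-ins; the options for the receiving side are discussed below):

use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread;

// Stand-ins for the real StreamInfo/PPacket types.
#[derive(Clone, Copy, Hash, PartialEq, Eq, Debug)]
struct StreamInfo(u32);
#[derive(Debug)]
struct PPacket;

fn main() {
    let mut writers: HashMap<StreamInfo, Sender<PPacket>> = HashMap::new();
    let mut handles = Vec::new();

    // In the real tool this loop would be the pcap read loop.
    for (si, packet) in [(StreamInfo(1), PPacket), (StreamInfo(1), PPacket)] {
        let tx = writers.entry(si).or_insert_with(|| {
            let (tx, rx) = channel();
            // First packet of a new session: set up its consumer. The real
            // tool would open the per-session pcap file here.
            handles.push(thread::spawn(move || {
                for pkt in rx {
                    // Stand-in for writing pkt to the session's file.
                    println!("{:?}: {:?}", si, pkt);
                }
            }));
            tx
        });
        // Stream the packet out immediately instead of keeping it in RAM.
        tx.send(packet).unwrap();
    }

    drop(writers); // closing all Senders ends the consumer loops
    for h in handles {
        h.join().unwrap();
    }
}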

The channel's reading end can be implemented by:

  1. Starting one thread per pcap file. This does not scale super well, but it's still less resource-hungry than today.
  2. Having one reader thread that multiplexes over each Receiver, keeps track of which data source belongs to which file, and writes the data out as soon as it arrives.

I'd recommend the second option: it's slightly more complex, but it limits this tool to 2 threads (or 1 if it's implemented asynchronously, but that rewrite sounds annoying). A sketch of it follows.
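Since std::sync::mpsc has no select over many Receivers, this sketch of the second option swaps the per-stream channels for a single shared channel whose messages are tagged with the StreamInfo, and the lone writer thread demultiplexes by key (crossbeam-channel's Select would allow keeping separate channels instead). All types are stand-ins again:

use std::collections::HashMap;
use std::sync::mpsc::channel;
use std::thread;

// Stand-ins for the real types.
#[derive(Clone, Copy, Hash, PartialEq, Eq)]
struct StreamInfo(u32);
struct PPacket;

fn main() {
    // One shared channel, tagged with the stream key, instead of one
    // channel per stream: tagging keeps the single writer simple.
    let (tx, rx) = channel::<(StreamInfo, PPacket)>();

    // The single writer thread demultiplexes by StreamInfo.
    let writer = thread::spawn(move || {
        // A Vec<PPacket> stands in for each open per-session pcap file;
        // the real tool would write pkt straight to disk here.
        let mut files: HashMap<StreamInfo, Vec<PPacket>> = HashMap::new();
        for (si, pkt) in rx {
            files.entry(si).or_default().push(pkt);
        }
    });

    // The reader thread just parses packets and forwards them.
    tx.send((StreamInfo(7), PPacket)).unwrap();
    drop(tx); // closing the Sender lets the writer thread exit
    writer.join().unwrap();
}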

@genonullfree (Owner)

Thank you for this detailed issue as well. This tool was originally hacked together quickly and so this issue doesn't really surprise me. Just wanted to respond to say this may take me longer to address than the other issue since it's a bit more complicated.

Pommaq (Author) commented Mar 7, 2024

No probs, I'll throw together a solution :) I think I made a neat one, although it slightly changes the tool's printed output.
