Approximate Aggregation Logic

  1. Get the number of blob IDs from different tables
  2. Pilot Run
    1. Take the columns involved in the aggregation query
    2. Populate samples via the corresponding inference services
      1. Take random samples
        • Construct a query satisfying the existing conditions/predicates
        • Use a fixed number of pilot samples
        • After inference, store the number of samples obtained from each service
    3. Execute the aggregation query on chunks of data
      1. Divide the inferred data into a fixed number of partitions (here 20 strata), with chunk_size = len(inferred_output) / 20
      2. Execute the aggregation query on each chunk and store the resulting estimates
      3. Structure the results by adding weights. For each chunk/stratum:
        • weight = chunk_size / sum(self.num_blob_ids.values())
        • num_items = size of the stratum
    4. Compute the number of samples required, using error_target, confidence, alpha, and num_sampled_items_in_total (a code sketch follows this step)
      • Compute z_score using
        • alpha / 2 (two-sided)
      • Compute p_lb (a lower confidence bound on the fraction of blobs satisfying the predicates) using
        • num_success = num_sampled_items_of_column
        • num_total = num_blob_ids
      • Compute pilot_estimate
        1. Mean
          • Use the weights, statistics, and num_items_in_stratum for counts
          • Compute and return the weighted estimate as per bennet_estimate [doubt: we are using degrees of freedom = 0 for DescrStatsW; is that okay?]
        2. Count/Sum
          • Use the weights and statistics of the strata
          • Compute the weighted estimate as per bennet_estimate
          • Scale the estimate by (num_blob_ids / num_of_samples_in_all_strata) and return std_ub as is
      • num_samples_required = ((z_score ** 2) * (pilot_estimate.std_ub ** 2)) / (error_target ** 2)
      • num_samples /= p_lb (the population lower bound computed above)
      • if agg_type != exp.Avg: num_samples /= pilot_estimate.upper_bound
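The sketch below pulls the sample-size math of step 2.4 into one place. It is a minimal illustration under stated assumptions, not AIDB's actual implementation: `PilotEstimate`, `num_samples_required`, and `is_avg` are hypothetical names introduced here, and scipy/statsmodels are one plausible way to obtain the z-score and p_lb.

```python
# Hypothetical sketch of step 2.4; names and helpers are illustrative.
from dataclasses import dataclass

import scipy.stats
from statsmodels.stats.proportion import proportion_confint


@dataclass
class PilotEstimate:
    """Assumed container for the pilot run's weighted estimate."""
    estimate: float
    std_ub: float        # upper bound on the standard deviation
    upper_bound: float   # upper bound on the estimate itself


def num_samples_required(pilot: PilotEstimate, num_sampled_items: int,
                         num_blob_ids: int, error_target: float,
                         alpha: float, is_avg: bool) -> int:
    # Two-sided z-score at the requested confidence level (alpha / 2 per tail).
    z_score = scipy.stats.norm.ppf(1 - alpha / 2)

    # p_lb: lower confidence bound on the fraction of blobs satisfying the
    # predicates, with num_success = num_sampled_items, num_total = num_blob_ids.
    p_lb, _ = proportion_confint(num_sampled_items, num_blob_ids,
                                 alpha=alpha, method='beta')

    # Core formula: (z^2 * std_ub^2) / error_target^2, corrected by p_lb.
    num_samples = (z_score ** 2) * (pilot.std_ub ** 2) / (error_target ** 2)
    num_samples /= p_lb

    # For SUM/COUNT the error target is relative, so normalize by the
    # pilot estimate's upper bound.
    if not is_avg:
        num_samples /= pilot.upper_bound

    return int(num_samples)
```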
  3. Final query execution and estimation
    1. Populate data during inference, taking the required number of samples computed above
    2. Execute the query on a fixed number of chunks/strata and store the aggregation results (see the first sketch at the end of this page)
      • chunk_size = max(infer_output_length // _NUM_SAMPLES_SPLIT, 1)
    3. Structure the results of the chunks/strata by adding
      • weight = stratum_size / sum(self.num_blob_ids.values())
      • num_items (count) = stratum_size
    4. Using confidence, alpha, and error_target (see the second sketch at the end of this page)
      1. Compute the Avg estimate:
        • Use the weights, statistics, and num_items_in_stratum for counts
        • Compute and return the weighted estimate as per bennet_estimate
      2. Compute the Sum/Count estimates:
        • Use the weights and statistics of the strata
        • Compute the weighted estimate as per bennet_estimate
        • Scale the estimate by (num_blob_ids / num_of_samples_in_all_strata) and return std_ub as is
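To make steps 3.2 and 3.3 concrete, here is a first sketch of the chunking and weighting. It is a hedged illustration: `stratify` and `run_aggregation` are hypothetical stand-ins, and `_NUM_SAMPLES_SPLIT = 20` mirrors the fixed split described above.

```python
# Hypothetical sketch of steps 3.2-3.3; stratify/run_aggregation are
# illustrative names, not AIDB's actual API.
_NUM_SAMPLES_SPLIT = 20  # fixed number of chunks/strata


def stratify(inferred_rows, num_blob_ids_total, run_aggregation):
    """Split inferred rows into strata and attach weights and counts."""
    chunk_size = max(len(inferred_rows) // _NUM_SAMPLES_SPLIT, 1)
    strata = []
    for start in range(0, len(inferred_rows), chunk_size):
        chunk = inferred_rows[start:start + chunk_size]
        strata.append({
            'statistic': run_aggregation(chunk),        # per-stratum aggregate
            'weight': len(chunk) / num_blob_ids_total,  # stratum_size / sum(num_blob_ids)
            'num_items': len(chunk),                    # stratum size (count)
        })
    return strata
```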
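And a second sketch for step 3.4, combining the strata into a final estimate. The weighted mean and standard deviation come from statsmodels' DescrStatsW with ddof=0, matching the note in the pilot section; treating that weighted std directly as std_ub, and the exact bennet_estimate details, are assumptions here.

```python
# Hypothetical sketch of step 3.4; combine_strata is an illustrative name,
# and using DescrStatsW's weighted std as std_ub is an assumption.
import numpy as np
from statsmodels.stats.weightstats import DescrStatsW


def combine_strata(strata, num_blob_ids_total, is_avg):
    stats = np.array([s['statistic'] for s in strata], dtype=float)
    weights = np.array([s['weight'] for s in strata], dtype=float)

    # Weighted combination across strata (ddof=0, as noted above).
    described = DescrStatsW(stats, weights=weights, ddof=0)
    estimate, std_ub = described.mean, described.std

    if not is_avg:
        # SUM/COUNT: scale the estimate by
        # num_blob_ids / num_of_samples_in_all_strata; std_ub is returned as is.
        num_sampled = sum(s['num_items'] for s in strata)
        estimate *= num_blob_ids_total / num_sampled

    return estimate, std_ub
```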