Approximate Aggregation Logic

Sathya Sravya edited this page Oct 29, 2023
- Get the number of blob IDs from the different tables
- Pilot Run
  - Take the columns involved in our aggregation query
  - Populate samples of the corresponding services
    - Take random samples
    - Construct a query satisfying the existing conditions/predicates
    - Consider a fixed number of pilot samples
  - Store the number of samples obtained from each service, after inference
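The pilot sampling step above can be sketched as follows. This is an illustrative sketch, not the project's code: `pilot_sample` and `NUM_PILOT_SAMPLES` are hypothetical names, and blob IDs are assumed to be plain integers.

```python
import random

NUM_PILOT_SAMPLES = 1000  # hypothetical fixed pilot budget

def pilot_sample(blob_ids, num_pilot=NUM_PILOT_SAMPLES, seed=42):
    """Draw a uniform random sample of blob IDs for the pilot run."""
    rng = random.Random(seed)
    # Sample without replacement, capped at the population size.
    k = min(num_pilot, len(blob_ids))
    return rng.sample(list(blob_ids), k)

sample = pilot_sample(range(100_000))
print(len(sample))  # 1000
```

The sampled IDs would then be used to build the predicate-satisfying query that drives inference on only those blobs.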
  - Execute the aggregation query on chunks of data
    - Divide the inferred data into a fixed number of partitions (here 20) ~ 20 strata units, with chunk_size = len(inferred output) / 20
    - Execute the aggregation query and store the resulting estimate for each chunk
    - Structure the results by adding weights. For each chunk/stratum:
      - weight = chunk_size / sum(self.num_blob_ids.values())
      - num_items = size_of_stratum
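A minimal sketch of the chunking-and-weighting step, assuming the inferred output is a Python list; `make_strata` is a hypothetical helper that attaches the weight and item count described above to each chunk.

```python
def make_strata(inferred_rows, num_blob_ids_total, num_strata=20):
    """Split inferred output into ~num_strata equal chunks and attach
    the weight and item count used by the stratified estimator."""
    chunk_size = max(len(inferred_rows) // num_strata, 1)
    strata = []
    for start in range(0, len(inferred_rows), chunk_size):
        chunk = inferred_rows[start:start + chunk_size]
        strata.append({
            "rows": chunk,
            # weight = chunk_size / sum(self.num_blob_ids.values())
            "weight": len(chunk) / num_blob_ids_total,
            "num_items": len(chunk),
        })
    return strata

strata = make_strata(list(range(100)), num_blob_ids_total=1000)
print(len(strata), strata[0]["weight"])  # 20 0.005
```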
  - Compute the number of samples required, using error_target, confidence, alpha, and num_sampled_items_in_total
    - Compute z_score from alpha / 2
    - Compute p_lb using
      - num_success = num_sampled_items_of_column
      - num_total = num_blob_ids
    - Compute pilot_estimate
      - Mean
        - Use the weights, statistics, and num_items_in_stratum for counts
        - Compute and return the weighted estimate as per bennet_estimate [doubt: we're using degrees of freedom = 0 for DescrStatsW; is that okay?]
      - Count/Sum
        - Use the weights and statistics of the strata
        - Compute the weighted estimate as per bennet_estimate
        - Inflate the estimate by a factor of (num_blob_ids / num_of_samples_in_all_strata) and return std_ub as is
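The weighted combination behind the pilot estimate can be sketched in plain Python. Note this is not the project's `bennet_estimate` (which also derives a std upper bound, std_ub, from Bennett's inequality); it only shows the weight-normalized mean and the plain weighted standard deviation with ddof = 0, matching the `DescrStatsW` usage flagged above.

```python
import math

def weighted_mean_estimate(strata):
    """Combine per-stratum means using the stratum weights.

    Each stratum supplies its chunk statistic ("mean") and its weight.
    Variance uses ddof = 0, i.e. the population formula, as in
    DescrStatsW(..., ddof=0).
    """
    total_w = sum(s["weight"] for s in strata)
    mean = sum(s["weight"] * s["mean"] for s in strata) / total_w
    var = sum(s["weight"] * (s["mean"] - mean) ** 2 for s in strata) / total_w
    return mean, math.sqrt(var)

strata = [
    {"mean": 10.0, "weight": 0.5},
    {"mean": 14.0, "weight": 0.5},
]
mean, std = weighted_mean_estimate(strata)
print(mean, std)  # 12.0 2.0
```

On the ddof question: with ddof = 0 the weighted variance is slightly biased low for small pilot samples; ddof = 1 (or a reliability-weighted correction) would be the conservative choice.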
    - Mean
      - num_samples_required = ((z_score ** 2) * (pilot_estimate.std_ub ** 2)) / (error_target ** 2)
      - num_samples = num_samples / population_lb
      - if agg_type != exp.Avg: num_samples /= pilot_estimate.upper_bound
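The sample-size computation above can be sketched with the standard library. The z-score from alpha / 2 and the num_samples_required formula follow the notes directly; the exact method behind p_lb is not spelled out on this page, so `proportion_lower_bound` below uses a normal-approximation lower bound as one common choice, and all function names are hypothetical.

```python
import math
from statistics import NormalDist

def z_score(alpha):
    """Two-sided critical value computed from alpha / 2."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def proportion_lower_bound(num_success, num_total, alpha):
    """Lower confidence bound on the selectivity p = num_success / num_total.

    Normal approximation; the page does not specify the exact method,
    so treat this as an illustrative stand-in for p_lb.
    """
    p_hat = num_success / num_total
    z = z_score(alpha)
    return max(p_hat - z * math.sqrt(p_hat * (1 - p_hat) / num_total), 0.0)

def required_samples(std_ub, error_target, alpha, population_lb,
                     is_avg, upper_bound=None):
    """num_samples_required = z^2 * std_ub^2 / error_target^2, then scaled
    by the population lower bound; non-AVG aggregates are additionally
    divided by the pilot estimate's upper bound."""
    z = z_score(alpha)
    n = (z ** 2) * (std_ub ** 2) / (error_target ** 2)
    n = n / population_lb
    if not is_avg:
        n /= upper_bound
    return math.ceil(n)

print(round(z_score(0.05), 3))  # 1.96
```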
- Final query execution and estimation
  - Populate the data, considering required_number_of_samples during inference
  - Execute the query on a fixed number of chunks/strata and store the aggregation results
    - chunk_size = max(infer_output_length // _NUM_SAMPLES_SPLIT, 1)
  - Structure the results of the chunks/strata by adding
    - weights = stratum_size / sum(self.num_blob_ids.values())
    - num_items (counts) = stratum_size
  - Using conf, alpha, and error_target:
    - Compute the Avg estimate:
      - Use the weights, statistics, and num_items_in_stratum for counts
      - Compute and return the weighted estimate as per bennet_estimate
    - Compute the Sum/Count estimates:
      - Use the weights and statistics of the strata
      - Compute the weighted estimate as per bennet_estimate
      - Inflate the estimate by a factor of (num_blob_ids / num_of_samples_in_all_strata) and return std_ub as is
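The Sum/Count inflation step can be sketched as below. `inflate_count_sum` is a hypothetical name; the weighted point estimate and std_ub are assumed to come from the bennet_estimate step, and only the population scaling described above is shown.

```python
def inflate_count_sum(estimate, std_ub, num_blob_ids, num_samples_in_all_strata):
    """SUM/COUNT path: scale the per-sample weighted estimate up to the
    full population by (num_blob_ids / num_of_samples_in_all_strata);
    std_ub is returned as-is, per the notes."""
    scale = num_blob_ids / num_samples_in_all_strata
    return estimate * scale, std_ub

est, std = inflate_count_sum(estimate=42.0, std_ub=3.0,
                             num_blob_ids=10_000, num_samples_in_all_strata=500)
print(est, std)  # 840.0 3.0
```

Returning std_ub un-inflated means the error bound stays in per-sample units; if the bound should cover the scaled estimate, it would need the same scale factor.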