cleanup task along with validation flag #1602
Summary of Changes
Hello @Avantika-Singh16, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request focuses on enhancing the robustness and maintainability of data import configurations across various datasets. The changes streamline contact information, improve data source traceability, and refine validation and resource management settings for automated data pipelines. The overall goal is to ensure more consistent and efficient data processing.
Highlights
- Curator Email Standardization: The `curator_emails` field has been updated across numerous `manifest.json` files, standardizing the contact email to `support@datacommons.org` for various datasets, including COVID-19 data, Eurostat statistics, Google mobility, and others.
- Source File Specification: Many `import_specifications` now include a `source_files` array, explicitly listing the input files required for the data import process. This improves clarity and dependency tracking for automated data pipelines.
- Validation Flag Configuration: A new `config_override` section with `ignore_validation_status: false` has been added to several `manifest.json` files, primarily for Eurostat, NOAA, OECD, and US Census datasets. Conversely, deprecated validation-related flags (`invoke_import_tool`, `invoke_differ_tool`, `invoke_import_validation`) have been removed from some FBI and US NCES manifests.
- Resource Limit Adjustments and Cleanup: Resource limits (`cpu`, `memory`, `disk`) have been added to some manifests (e.g., Google COVID Mobility, US EPA National Emissions Inventory) and removed from others (e.g., Google Sustainability Financial Incentives, NCSES, Singapore Census, US Crash FARS data), indicating a review and optimization of resource allocation for data processing tasks.
- Manifest Structure Refinements: Minor structural cleanups include the removal of a top-level `import_name` field in one CDC manifest and the removal of test scripts from the `scripts` array in Google COVID Mobility and PCM-DPC COVID-19 manifests.
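To make the highlights above concrete, here is a minimal Python sketch that checks one import specification against the conventions the summary describes. The manifest fragment is hypothetical: the import name, file paths, resource values, and the exact nesting of `config_override` and `resource_limits` are assumptions for illustration, not copied from this PR.

```python
import json

# Hypothetical manifest fragment. Field names come from the summary above;
# the nesting, paths, and numeric values are assumptions for illustration.
MANIFEST_TEXT = """
{
  "import_specifications": [
    {
      "import_name": "ExampleImport",
      "curator_emails": ["support@datacommons.org"],
      "scripts": ["process.py"],
      "source_files": ["input_files/example.csv"],
      "config_override": {"ignore_validation_status": false},
      "resource_limits": {"cpu": 2, "memory": 4, "disk": 10}
    }
  ]
}
"""

STANDARD_EMAIL = "support@datacommons.org"
DEPRECATED_FLAGS = (
    "invoke_import_tool",
    "invoke_differ_tool",
    "invoke_import_validation",
)


def check_spec(spec):
    """Return a list of human-readable problems found in one import spec."""
    problems = []
    if spec.get("curator_emails") != [STANDARD_EMAIL]:
        problems.append("curator_emails is not the standard address")
    if not spec.get("source_files"):
        problems.append("source_files is missing or empty")
    override = spec.get("config_override", {})
    if override.get("ignore_validation_status") is not False:
        problems.append("ignore_validation_status is not explicitly false")
    # Assumption: deprecated flags may sit in the spec or in config_override.
    for flag in DEPRECATED_FLAGS:
        if flag in spec or flag in override:
            problems.append(f"deprecated flag present: {flag}")
    return problems


manifest = json.loads(MANIFEST_TEXT)
for spec in manifest["import_specifications"]:
    print(spec["import_name"], check_spec(spec))
```

A check like this could run over every `manifest.json` in the repository, though the real schema should be confirmed against the import tooling before relying on it.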
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Summary of Changes
Hello @Avantika-Singh16, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request focuses on standardizing and refining the configuration of various data import pipelines. The changes aim to improve maintainability by centralizing curator contact information, clearly defining source file dependencies, and consistently applying validation settings. These updates contribute to a more robust and transparent data ingestion system.
Highlights
- Standardized Curator Emails: Updated the 'curator_emails' field to a consistent 'support@datacommons.org' across numerous manifest files, centralizing contact information for data curators.
- Enhanced Source File Specification: Added or updated the 'source_files' field in many manifest files, explicitly listing the input files required for data import processes. This improves clarity and dependency tracking.
- Validation Flag Configuration: Introduced or modified the 'config_override' block with 'ignore_validation_status: false' in several Eurostat, OECD, US BLS, US CDC, US Census, US HUD, US Fed, US Crash, World Bank, and FBI manifest files, indicating a standardized approach to validation.
- Resource Limit Adjustments: Removed specific 'resource_limits' configurations from several manifest files (e.g., Google Sustainability, NCSES, Singapore Census, US Crash) and added them to others (Google Covid Mobility, US EPA National Emissions), streamlining resource allocation definitions.
- Manifest File Cleanup: Performed general cleanup in various manifest files, including removing redundant top-level 'import_name' fields and obsolete script references, and consolidating configuration overrides.
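The cleanup steps listed above could be applied mechanically. The following is a rough Python sketch of such a normalization pass, not the script used for this PR. It assumes the deprecated flags live inside `config_override` and that `import_name` is redundant only at the top level of the manifest; both are assumptions, and the sample manifest is hypothetical.

```python
import copy

STANDARD_EMAIL = "support@datacommons.org"
# Assumption: these deprecated flags are stored inside config_override.
DEPRECATED_FLAGS = (
    "invoke_import_tool",
    "invoke_differ_tool",
    "invoke_import_validation",
)


def normalize_manifest(manifest):
    """Return a cleaned copy of a manifest dict; the input is left untouched."""
    cleaned = copy.deepcopy(manifest)
    # Drop the redundant top-level import_name (per-spec names are kept).
    cleaned.pop("import_name", None)
    for spec in cleaned.get("import_specifications", []):
        spec["curator_emails"] = [STANDARD_EMAIL]
        override = spec.setdefault("config_override", {})
        for flag in DEPRECATED_FLAGS:
            override.pop(flag, None)
        # Keep validation results binding for this import.
        override["ignore_validation_status"] = False
    return cleaned


# Hypothetical input, for illustration only.
before = {
    "import_name": "ExampleImport",
    "import_specifications": [
        {
            "import_name": "ExampleImport",
            "curator_emails": ["someone@example.com"],
            "config_override": {"invoke_import_validation": True},
        }
    ],
}
after = normalize_manifest(before)
print(after)
```

Returning a deep copy keeps the original dict available for a before/after diff, which is useful when a reviewer wants to see exactly what a cleanup pass changed.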
Code Review
This pull request performs a wide-ranging cleanup across numerous manifest files. The changes primarily involve updating curator emails to a standard address, adding source_files for better dependency tracking, and enabling validation by adding config_override flags. While these are valuable improvements, there are several issues to address. I've found a potential bug where a source_files entry seems to be a copy-paste error. Additionally, there are widespread JSON formatting inconsistencies, mainly related to indentation, and many files are missing a final newline character. I've also noted a few other inconsistencies and minor issues. Please review the detailed comments.
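The formatting issues mentioned in this review (inconsistent indentation, missing final newline) can be detected automatically. Below is a small Python sketch of one way to do it, assuming a canonical style of 2-space indentation with a trailing newline. The indent width is an assumption, since the thread does not state the repository's convention.

```python
import json


def format_issues(text, indent=2):
    """Return formatting problems for a JSON document given as a string."""
    issues = []
    if not text.endswith("\n"):
        issues.append("missing final newline")
    # Re-serialize with the assumed canonical indent; key order is preserved.
    canonical = json.dumps(json.loads(text), indent=indent, ensure_ascii=False)
    if text.rstrip("\n") != canonical:
        issues.append("formatting differs from canonical form")
    return issues


well_formed = '{\n  "scripts": [\n    "process.py"\n  ]\n}\n'
compact = '{"scripts": ["process.py"]}'
print(format_issues(well_formed))
print(format_issues(compact))
```

Comparing against a re-serialized copy also flags differences other than indentation, such as spacing around separators, so a reported mismatch may need a manual look.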
Warning: Gemini encountered an error creating the review. You can try again by commenting
Code Review
This pull request introduces a broad set of cleanup changes across numerous manifest files. Key updates include standardizing curator emails to a general support address, adding source_files properties, and enabling validation flags by adding or modifying config_override sections. While these changes are largely positive for consistency and data integrity, I've identified a few issues, including a potential copy-paste error in a file path, an invalid cron schedule, and several instances of inconsistent JSON formatting. Please review the specific comments for details and suggestions.
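An invalid cron schedule of the kind flagged here can be caught before merge with a simple syntactic check. The sketch below validates a standard five-field cron expression in Python. It is deliberately simplified: it accepts only numbers, `*`, ranges, steps, and comma lists, and rejects month or weekday names. It does not reproduce the rules of any particular scheduler.

```python
# Allowed numeric range per field: minute, hour, day of month, month, day of week.
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]


def _is_int(token):
    return token.isascii() and token.isdigit()


def _valid_part(part, lo, hi):
    """Validate one comma-separated item such as '5', '1-5', '*', or '*/15'."""
    base, slash, step = part.partition("/")
    if slash and not (_is_int(step) and int(step) > 0):
        return False
    if base == "*":
        return True
    start, dash, end = base.partition("-")
    if not _is_int(start):
        return False
    if dash:
        if not _is_int(end):
            return False
    else:
        end = start
    return lo <= int(start) <= int(end) <= hi


def is_valid_cron(expr):
    """Return True if expr is a syntactically valid 5-field cron expression."""
    fields = expr.split()
    if len(fields) != len(FIELD_RANGES):
        return False
    return all(
        _valid_part(part, lo, hi)
        for field, (lo, hi) in zip(fields, FIELD_RANGES)
        for part in field.split(",")
    )


print(is_valid_cron("0 5 * * 1"))
print(is_valid_cron("0 25 * * *"))
```

The first call describes 05:00 every Monday and passes; the second fails because hour 25 is out of range.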
Can we separate the formatting changes from the actual changes (either as a separate PR or as documentation) to help with reviewing?
We are planning to enable the validation check for all the imports globally in the below PR, so we may not have to enable it individually for each import.
Once we merge this PR, we will need to reschedule all the imports with updated manifests to pick up the changes. Please keep that in mind as a follow-up.
We have fixed the lint issues, and we have not made any explicit formatting changes.
We have only synced the existing Cloud Run and Cloud Batch configurations into the manifest JSON in this PR.
We have already planned for this activity. Once the PR is merged, we will plan accordingly.
No description provided.