Hi, thank you for taking the time to improve Snowflake's Snowpark Python or Snowpark pandas APIs!
Many questions can be answered by checking our docs or searching for existing bug reports and enhancement requests on our issue tracker.
Please start by checking these first!
If you can't find what you're looking for, we'd love to hear from you! Please open a new issue to get in touch with us.
We encourage everyone to first open a new issue to discuss any feature work or bug fixes with one of the maintainers. The following should help guide contributors through potential pitfalls.
We require our contributors to sign a CLA, available at https://github.com/snowflakedb/CLA/blob/main/README.md. A GitHub Actions bot will assist you when you open a pull request.
git clone <YOUR_FORKED_REPO>
cd snowpark-python
- Create a new Python virtual environment with any Python version that we support.
  - The Snowpark Python API supports Python 3.8, Python 3.9, Python 3.10, and Python 3.11.
  - The Snowpark pandas API supports Python 3.9, Python 3.10, and Python 3.11. Additionally, Snowpark pandas requires Modin 0.30.1 and pandas 2.2.x.

  For example:

  conda create --name snowpark-dev python=3.9

- Activate the new Python virtual environment. For example:

  conda activate snowpark-dev

- Go to the cloned repository root folder.
- To install the Snowpark Python API in edit/development mode, use:

  python -m pip install -e ".[development, pandas]"

- To install the Snowpark pandas API in edit/development mode, use:

  python -m pip install -e ".[modin-development]"

  The -e flag tells pip to install the library in edit, or development, mode.
You can use PyCharm, VS Code, or any other similar IDE. The steps below assume you use PyCharm or VS Code.
Download the newest community version of PyCharm and follow the installation instructions.
Download and install the latest version of VS Code.
Open the project and browse to the cloned git directory. Then, in PyCharm, right-click the src directory and select "Mark Directory as" -> "Source Root". NOTE: VS Code doesn't have "Source Root", so you can skip this step if you use VS Code.
Configure PyCharm interpreter or Configure VS Code interpreter to use the previously created Python virtual environment.
This section covers guidelines for developers who wish to contribute code to Session, ServerConnection, MockServerConnection, and other related objects that are critical to the correct functionality of snowpark-python.
- If the config parameter is set once during initialization and never changed, it is safe to add the parameter to the Session object.
- If the config parameter can be updated by the user, and the update has side effects during compilation (e.g. analyzer.analyze(), analyzer.resolve()), add a warning at config update using warn_session_config_update_in_multithreaded_mode.
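As a rough sketch of these two cases, consider the following toy class. The names TinySession, eager_schema_validation, and cte_optimization_enabled are hypothetical illustrations, and _warn_config_update is only a stand-in for the real warn_session_config_update_in_multithreaded_mode helper, whose signature may differ:

```python
import threading
import warnings


def _warn_config_update(name):
    # Stand-in for warn_session_config_update_in_multithreaded_mode;
    # the real helper's name and signature live in snowpark-python.
    warnings.warn(
        f"Updating {name} while multiple threads may share this session "
        f"can have side effects during compilation.",
        stacklevel=2,
    )


class TinySession:
    # Toy stand-in for Session, illustrating the two config cases above.
    def __init__(self, eager_schema_validation=True):
        # Case 1: set once during initialization and never changed,
        # so a plain attribute on the session object is safe.
        self.eager_schema_validation = eager_schema_validation
        self._lock = threading.RLock()
        self._cte_optimization_enabled = False

    @property
    def cte_optimization_enabled(self):
        return self._cte_optimization_enabled

    @cte_optimization_enabled.setter
    def cte_optimization_enabled(self, value):
        # Case 2: user-updatable config read during compilation, so warn
        # about concurrent updates and serialize the write.
        _warn_config_update("cte_optimization_enabled")
        with self._lock:
            self._cte_optimization_enabled = value


session = TinySession()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    session.cte_optimization_enabled = True
print(len(caught))  # → 1: the user-updatable config update emitted a warning
```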
Once you have decided that the new component being added requires protection during concurrent access, the following can be used:
- Session._thread_store and ServerConnection._thread_store are threading.local() objects which can be used to store a per-thread instance of the component. The Python connector cursor object is an example of this.
- Session._lock and ServerConnection._lock are RLock objects which can be used to serialize access to shared resources. Session.query_tag is an example of this.
- Session._package_lock is an RLock object which can be used to protect packages and imports for stored procedures and user-defined functions.
- Session._plan_lock is an RLock object which can be used to serialize SnowflakePlan and Selectable method calls. SnowflakePlan.plan_state is an example.
- QueryHistory(session, include_thread_id=True) can be used to log the query history with thread id.
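A minimal sketch of the per-thread-storage and shared-lock patterns above. TinyConnection is a toy class, not the real ServerConnection; only the use of threading.local() and RLock mirrors the primitives described:

```python
import threading


class TinyConnection:
    # Toy stand-in for ServerConnection, showing the two protection patterns.
    def __init__(self):
        # Like ServerConnection._thread_store: each thread lazily gets its
        # own cursor-like object, so cursors are never shared across threads.
        self._thread_store = threading.local()
        # Like Session._lock: a reentrant lock serializing access to shared
        # state such as query_tag.
        self._lock = threading.RLock()
        self._query_tag = None

    @property
    def cursor(self):
        # Lazily create one cursor-like object per thread.
        if not hasattr(self._thread_store, "cursor"):
            self._thread_store.cursor = object()  # pretend connector cursor
        return self._thread_store.cursor

    def set_query_tag(self, tag):
        with self._lock:  # serialize writes to the shared resource
            self._query_tag = tag


conn = TinyConnection()
cursors = []
cursors_lock = threading.Lock()


def worker():
    with cursors_lock:
        cursors.append(conn.cursor)  # keep each thread's cursor alive
    conn.set_query_tag(threading.current_thread().name)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len({id(c) for c in cursors}))  # → 4: every thread got its own cursor
```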
An example PR to make auto temp table cleaner thread-safe can be found here.
The README under the tests folder explains how to set up and run tests.
If this happens to you, do not panic! Any PRs originating from a fork will fail some automated tests, because forks do not have access to our repository's secrets. A maintainer will manually review your changes and then kick off the rest of our testing suite. Feel free to tag @snowflakedb/snowpark-python-api or @snowflakedb/snowpark-pandas-api if you feel we are taking too long to get to your PR.
The following tree diagram shows the high-level structure of Snowpark pandas.
snowflake
└── snowpark
└── modin
└── pandas ← pandas API frontend layer
└── core
├── dataframe ← folder containing abstraction
│                for Modin frontend to DF-algebra
├── execution ← additional patching for I/O
└── plugin
├── _internal ← Snowflake specific internals
├── io ← Snowpark pandas IO functions
├── compiler ← query compiler, Modin -> Snowpark pandas DF
└── utils ← util classes from Modin, logging, …