Merge pull request #112 from jpivarski/jpivarski/many-project-ideas

add 13 new project ideas from Jim
research-software-collaborations · Jan 27, 2025 · 3d2fc93
2 parents 4c9d7af + e03676d
commit 3d2fc93
Showing 13 changed files with 596 additions and 0 deletions.
diff --git a/projects/awkward-date-string-functions.yml b/projects/awkward-date-string-functions.yml
@@ -0,0 +1,48 @@
+---
+name: Dates and strings in Awkward Array
+postdate: 2025-01-20
+categories:
+  - Analysis tools
+durations:
+  - 3 months
+experiments:
+  - Any
+skillset:
+  - Python
+  - C++
+status:
+  - Available
+project:
+  - Any
+location:
+  - Any
+commitment:
+  - Any
+program:
+  - Any
+shortdescription: "More date & string functions and NumPy's new varlen string in Awkward Array"
+description: >
+  Awkward Array has a suite of string functions provided by Apache
+  Arrow (in `ak.str.*`). However, it's missing a few string functions
+  (see
+  [awkward#2703](https://github.com/scikit-hep/awkward/issues/2703))
+  and it could also be useful to similarly wrap Arrow's date-handling
+  functions (see
+  [awkward#2702](https://github.com/scikit-hep/awkward/issues/2702)),
+  taking care to translate between NumPy's date format (which Awkward
+  uses) and Arrow's date format. In addition, NumPy added a new
+  variable-length string format that differs from all other such
+  formats, and it would be useful to convert it to and from Awkward
+  Arrays (see
+  [awkward#3170](https://github.com/scikit-hep/awkward/issues/3170)). Although
+  most functionality can be added in Python, there's a slight chance
+  that accessing NumPy's varlen strings would require C (not C++).
+
+contacts:
+  - name: Jim Pivarski
+    email: [email protected]
+
+mentees: # keep an empty list until the project has started or a student is identified
+# when that happens, add a list with name: and link: attributes for each student
+#  - name: Student's name
+#  - link: #url for project page
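
The date-format translation mentioned in the description above can be sketched with NumPy alone. This is a minimal illustration, not Awkward Array or Arrow code: it assumes the target is a 32-bit count of days since 1970-01-01 (the layout Arrow uses for `date32`), the helper names are hypothetical, and `NaT` values are not handled.

```python
import numpy as np


def datetime64_to_days(values):
    """Convert datetime64 values (any unit) to int32 days since 1970-01-01."""
    # Casting to day resolution drops the time-of-day part.
    days = np.asarray(values).astype("datetime64[D]")
    return days.astype(np.int64).astype(np.int32)


def days_to_datetime64(days, unit="ns"):
    """Convert integer days since 1970-01-01 back to datetime64[unit]."""
    as_days = np.asarray(days).astype(np.int64).astype("datetime64[D]")
    return as_days.astype(f"datetime64[{unit}]")


timestamps = np.array(
    ["1970-01-02T12:00", "2025-01-20T00:00"], dtype="datetime64[ns]"
)
days = datetime64_to_days(timestamps)
roundtrip = days_to_datetime64(days)
```

The round trip is lossy by design: the first timestamp comes back as midnight of 1970-01-02, because a day count cannot carry the time of day.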
diff --git a/projects/awkward-gnn-helper-functions.yml b/projects/awkward-gnn-helper-functions.yml
@@ -0,0 +1,58 @@
+---
+name: ML-ready Awkward Arrays
+postdate: 2025-01-20
+categories:
+  - Analysis tools
+durations:
+  - 3 months
+experiments:
+  - Any
+skillset:
+  - Python
+  - ML
+status:
+  - Available
+project:
+  - Any
+location:
+  - Any
+commitment:
+  - Any
+program:
+  - Any
+shortdescription: "Helper functions to turn Awkward records into array dimensions and PyG indexes"
+description: >
+  Awkward Array has functions to convert to and from TensorFlow and
+  PyTorch, such as
+  [ak.from_raggedtensor](https://awkward-array.org/doc/main/reference/generated/ak.from_raggedtensor.html)
+  and following, with support for TensorFlow's RaggedTensor. However,
+  there are format conversions that still have to be handled manually,
+  such as turning an Awkward Array of records (e.g. muon with pT, eta,
+  phi fields) into an array dimension (e.g. length-3 dimension in the
+  tensor `shape`). NumPy has a function for this,
+  [np.lib.recfunctions.structured_to_unstructured](https://numpy.org/doc/2.1/user/basics.rec.html#numpy.lib.recfunctions.structured_to_unstructured),
+  though the Awkward equivalent can have a different name (since it
+  has different submodules). The labor-intensive steps described in
+  [this StackOverflow
+  answer](https://stackoverflow.com/a/79215978/1623645) and [this
+  tutorial](https://hsf-training.github.io/deep-learning-intro-for-hep/25-ragged-data-and-graphs.html#building-permutation-invariance-into-the-model)
+  could be encapsulated as ready-to-use functions. Also,
+  PyTorch-Geometric (PyG) expects ragged arrays to be represented as
+  an external array of integers, which Awkward Array could generate
+  with a function (see
+  [awkward#3256](https://github.com/scikit-hep/awkward/issues/3256)). Yet
+  another framework, [PyTorch
+  Cluster](https://github.com/rusty1s/pytorch_cluster), expects
+  raggedness to be expressed as a list of tensors (see
+  [awkward#3265](https://github.com/scikit-hep/awkward/issues/3265)). All
+  of these helper functions would simplify the conversion of Awkward
+  Arrays into tensors for fixed-size NNs and GNNs.
+
+contacts:
+  - name: Jim Pivarski
+    email: [email protected]
+
+mentees: # keep an empty list until the project has started or a student is identified
+# when that happens, add a list with name: and link: attributes for each student
+#  - name: Student's name
+#  - link: #url for project page
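
Two of the conversions named in the description above can be illustrated in plain NumPy. This is only a sketch of the target data layouts, not the proposed Awkward functions. The first part uses the NumPy function the description cites. The second assumes that the "external array of integers" is a per-item index saying which list each flattened item belongs to, as in PyG's `batch` vector; `list_index_from_counts` is a hypothetical name.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Records -> array dimension: muons with pt, eta, phi fields become
# rows of a plain float array with a length-3 last dimension.
muons = np.array(
    [(50.0, 0.1, 1.5), (30.0, -1.2, 0.3)],
    dtype=[("pt", "f8"), ("eta", "f8"), ("phi", "f8")],
)
features = rfn.structured_to_unstructured(muons)


def list_index_from_counts(counts):
    """For lists of the given lengths, return the list number of each item."""
    counts = np.asarray(counts)
    return np.repeat(np.arange(len(counts)), counts)


# Three lists of lengths 2, 0, and 3, flattened into five items.
batch = list_index_from_counts([2, 0, 3])
```

Here `features` has shape `(2, 3)` and `batch` is `[0, 0, 2, 2, 2]`; the empty second list contributes no entries.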
diff --git a/projects/awkward-replace-jax-autodiff.yml b/projects/awkward-replace-jax-autodiff.yml
@@ -0,0 +1,46 @@
+---
+name: Custom autodiff in Awkward Array
+postdate: 2025-01-20
+categories:
+  - Analysis tools
+durations:
+  - 3 months
+  - 1 year
+experiments:
+  - Any
+skillset:
+  - Python
+status:
+  - Available
+project:
+  - Any
+location:
+  - Any
+commitment:
+  - Any
+program:
+  - Any
+shortdescription: "Replace JAX with custom autodiff in Awkward Array"
+description: >
+  At an [Analysis Tools](https://indico.cern.ch/event/1387764/)
+  meeting and in
+  [awkward#3349](https://github.com/scikit-hep/awkward/discussions/3349),
+  we've discussed the possibility of replacing JAX with a custom
+  implementation of automatic differentiation (autodiff,
+  also known as autograd). The problems with JAX are related to its
+  interface, which is intended to do much more than just
+  autodiff. Also, implementing eager autodiff is likely not a major
+  project, especially if we take advantage of [complex-step
+  differentiation](https://www.hedonisticlearning.com/posts/complex-step-differentiation.html). This
+  project would implement autodiff either as a module within Awkward
+  Array or as a new Scikit-HEP library (and possibly as a backend for
+  Vector, too).
+
+contacts:
+  - name: Jim Pivarski
+    email: [email protected]
+
+mentees: # keep an empty list until the project has started or a student is identified
+# when that happens, add a list with name: and link: attributes for each student
+#  - name: Student's name
+#  - link: #url for project page
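
For readers unfamiliar with complex-step differentiation, the idea is that for an analytic function `f`, the derivative at a real `x` is approximately `Im(f(x + ih)) / h` for a tiny step `h`. The sketch below uses only the standard library; `complex_step_derivative` is a hypothetical name and this is not a proposed Awkward Array interface.

```python
import cmath
import math


def complex_step_derivative(f, x, h=1e-20):
    """Approximate f'(x) for real x as Im(f(x + ih)) / h."""
    # No subtraction of nearby values occurs, so h can be extremely small
    # without the cancellation error that limits finite differences.
    return f(complex(x, h)).imag / h


def f(z):
    # The function must be analytic and built from complex-aware operations.
    return cmath.exp(z) * cmath.sin(z)


x = 0.7
approx = complex_step_derivative(f, x)
exact = math.exp(x) * (math.sin(x) + math.cos(x))
```

The technique only applies to functions that accept complex input and are analytic; operations such as `abs` or comparisons on the value would break it.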
diff --git a/projects/awkward-sorted_map.yml b/projects/awkward-sorted_map.yml
@@ -0,0 +1,50 @@
+---
+name: Using std::maps in Awkward Array
+postdate: 2025-01-20
+categories:
+  - Analysis tools
+durations:
+  - 3 months
+experiments:
+  - Any
+skillset:
+  - Python
+status:
+  - Available
+project:
+  - Any
+location:
+  - Any
+commitment:
+  - Any
+program:
+  - Any
+shortdescription: "Implement sorted_map type in Awkward Array"
+description: >
+  Awkward Array implements some data types as types with equivalent
+  storage (e.g. lists of uint8 for strings) plus
+  [ak.behavior](https://awkward-array.org/doc/main/reference/ak.behavior.html)
+  to provide specialized functionality (e.g. printing as strings and
+  broadcasting one string as one object). A basic type that has not
+  been implemented is a key-value mapping, such as C++'s
+  `std::map`. This is different from Awkward Array's "record" type,
+  which has a fixed set of field names, each of which can have a
+  different type. A key-value mapping has keys of one type (often but
+  not always strings) and values of another, fixed type (not different
+  for each key), like `std::map<std::string, int>`. When Uproot
+  encounters C++ `std::map<K, V>` in a ROOT file, it produces an
+  Awkward Array of lists of pairs of types `K` and `V` with name
+  `"sorted_map"`. However, the "sorted map" behaviors that would make
+  this data type useful have not yet been implemented in Awkward
+  Array (see
+  [awkward#780](https://github.com/scikit-hep/awkward/issues/780)). This
+  project would add that functionality.
+
+contacts:
+  - name: Jim Pivarski
+    email: [email protected]
+
+mentees: # keep an empty list until the project has started or a student is identified
+# when that happens, add a list with name: and link: attributes for each student
+#  - name: Student's name
+#  - link: #url for project page
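
To make the data layout concrete, the following pure-Python sketch treats each entry as a list of `(key, value)` pairs sorted by key and looks one key up in every entry. It only illustrates the kind of behavior the description above asks for; `lookup` is a hypothetical helper, not an Awkward Array or Uproot function, and returning `None` for a missing key is an assumption.

```python
from bisect import bisect_left


def lookup(pairs, key, default=None):
    """Find key in a list of (key, value) pairs that is sorted by key."""
    # Extracting the keys is linear; an array-based implementation would
    # search the stored key column directly.
    keys = [k for k, _ in pairs]
    i = bisect_left(keys, key)
    if i < len(pairs) and pairs[i][0] == key:
        return pairs[i][1]
    return default


# One sorted map per entry, like a branch of std::map<std::string, int>.
events = [
    [("a", 1), ("c", 3)],
    [],
    [("b", 2), ("c", 30), ("d", 4)],
]
values_for_c = [lookup(pairs, "c") for pairs in events]
```

With this input, `values_for_c` is `[3, None, 30]`: one result per entry, with the empty map yielding the default.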
diff --git a/projects/awkward-units.yml b/projects/awkward-units.yml
@@ -0,0 +1,43 @@
+---
+name: Awkward Arrays with physical units
+postdate: 2025-01-20
+categories:
+  - Analysis tools
+durations:
+  - 3 months
+experiments:
+  - Any
+skillset:
+  - Python
+status:
+  - Available
+project:
+  - Any
+location:
+  - Any
+commitment:
+  - Any
+program:
+  - Any
+shortdescription: 'Adding "units" as Awkward Array metadata and conversions as behaviors'
+description: >
+  Awkward Arrays already have an
+  [ak.Array.attrs](https://awkward-array.org/doc/main/reference/generated/ak.Array.html#ak.Array.attrs)
+  attribute that can carry arbitrary metadata (persistent or
+  transient) and an
+  [ak.behavior](https://awkward-array.org/doc/main/reference/ak.behavior.html)
+  that attaches functionality to arrays. One, the other, or both of
+  these could be used to implement physical units on arrays and
+  convert between units when appropriate, such as putting two arrays
+  into common units before adding
+  them. [awkward#2468](https://github.com/scikit-hep/awkward/issues/2468)
+  is a discussion of this feature and possible implementations.
+
+contacts:
+  - name: Jim Pivarski
+    email: [email protected]
+
+mentees: # keep an empty list until the project has started or a student is identified
+# when that happens, add a list with name: and link: attributes for each student
+#  - name: Student's name
+#  - link: #url for project page
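
The "common units before adding" step described above can be sketched without any library. Everything here is hypothetical: the conversion table, the function names, and the choice to return the result in the first operand's unit are assumptions for illustration, not a design taken from awkward#2468.

```python
# Hypothetical conversion factors to a common base unit (GeV).
TO_GEV = {"MeV": 1e-3, "GeV": 1.0, "TeV": 1e3}


def convert(values, from_unit, to_unit):
    """Rescale a list of numbers from one energy unit to another."""
    factor = TO_GEV[from_unit] / TO_GEV[to_unit]
    return [v * factor for v in values]


def add_with_units(a, a_unit, b, b_unit):
    """Add two equal-length lists, returning the sum in a's unit."""
    b_converted = convert(b, b_unit, a_unit)
    return [x + y for x, y in zip(a, b_converted)], a_unit


total, unit = add_with_units([1.0, 2.0], "GeV", [500.0, 250.0], "MeV")
```

In an Awkward Array implementation, the unit string would presumably live in the array's metadata and the conversion would be triggered by the addition itself; this sketch passes the units explicitly to keep the logic visible.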
diff --git a/projects/ragged-completion.yml b/projects/ragged-completion.yml
@@ -0,0 +1,44 @@
+---
+name: Completing the Ragged library
+postdate: 2025-01-20
+categories:
+  - Analysis tools
+durations:
+  - 3 months
+  - 1 year
+experiments:
+  - Any
+skillset:
+  - Python
+status:
+  - Available
+project:
+  - Any
+location:
+  - Any
+commitment:
+  - Any
+program:
+  - Any
+shortdescription: "Implement the remaining functions to make Ragged an Array-API compliant ragged array library"
+description: >
+  Scikit-HEP's
+  [Ragged](https://github.com/scikit-hep/ragged/discussions/6) library
+  is an interface over Awkward Array that restricts it to ragged
+  arrays only (no records, missing data, etc.) and satisfies
+  DataAPI's [Array API](https://data-apis.org/), which is rapidly
+  becoming the standard interface for array libraries. As such, the
+  requirements for Ragged are very precise: all required functions
+  have already been stubbed out with full docstrings, and about half
+  of them have been implemented. This project would be to complete it
+  and promote it as a fully functional, Array API-compliant ragged
+  array library.
+
+contacts:
+  - name: Jim Pivarski
+    email: [email protected]
+
+mentees: # keep an empty list until the project has started or a student is identified
+# when that happens, add a list with name: and link: attributes for each student
+#  - name: Student's name
+#  - link: #url for project page
diff --git a/projects/scikit-hep-gpu-ecosystem.yml b/projects/scikit-hep-gpu-ecosystem.yml
@@ -0,0 +1,44 @@
+---
+name: Solidify the Scikit-HEP GPU ecosystem
+postdate: 2025-01-20
+categories:
+  - Analysis tools
+durations:
+  - 3 months
+  - 1 year
+experiments:
+  - Any
+skillset:
+  - Python
+  - CUDA
+status:
+  - Available
+project:
+  - Any
+location:
+  - Any
+commitment:
+  - Any
+program:
+  - Any
+shortdescription: "Test and identify missing capabilities in the Scikit-HEP GPU ecosystem"
+description: >
+  Awkward Array's CUDA kernels and Numba-CUDA support exist (see [this
+  training](https://hsf-training.github.io/array-oriented-programming/5-gpu.html#awkward-array)),
+  as well as
+  [cuda-histogram](https://github.com/scikit-hep/cuda-histogram), but
+  these features haven't been heavily tested and probably haven't ever
+  been used in an analysis. This project would be to try using
+  Scikit-HEP libraries (including Vector and any other relevant
+  libraries) in an analysis using GPUs to find out what the pain
+  points are, and then either fix them directly or raise awareness
+  among the developers.
+
+contacts:
+  - name: Jim Pivarski
+    email: [email protected]
+
+mentees: # keep an empty list until the project has started or a student is identified
+# when that happens, add a list with name: and link: attributes for each student
+#  - name: Student's name
+#  - link: #url for project page
diff --git a/projects/uproot-complete-tbranch-update.yml b/projects/uproot-complete-tbranch-update.yml
@@ -0,0 +1,46 @@
+---
+name: Modifying existing TTrees in Uproot
+postdate: 2025-01-20
+categories:
+  - Analysis tools
+durations:
+  - 3 months
+  - 1 year
+experiments:
+  - Any
+skillset:
+  - Python
+status:
+  - Available
+project:
+  - Any
+location:
+  - Any
+commitment:
+  - Any
+program:
+  - Any
+shortdescription: "Add new columns to existing TTrees (99% done) and/or new rows (new project) in Uproot"
+description: >
+  Uproot can add new objects to existing ROOT files through
+  [uproot.update](https://uproot.readthedocs.io/en/latest/uproot.writing.writable.update.html),
+  but it would be even more useful if it could modify existing TTrees
+  in place. Zoë Bilodeau implemented the ability to add new
+  columns/TBranches, which is especially useful for backfilling data
+  (e.g. adding an array of `False` for triggers that didn't exist at
+  the time of data-taking). This implementation is nearly done (see
+  [uproot#1155](https://github.com/scikit-hep/uproot5/pull/1155)),
+  apart from a few corner cases that need to be tested and
+  debugged. It would also be useful to be able to add rows/entries,
+  which would be an entirely new project. Completing the
+  adding-columns project would provide the experience necessary to
+  tackle the adding-rows project.
+
+contacts:
+  - name: Jim Pivarski
+    email: [email protected]
+
+mentees: # keep an empty list until the project has started or a student is identified
+# when that happens, add a list with name: and link: attributes for each student
+#  - name: Student's name
+#  - link: #url for project page