Merge pull request #112 from jpivarski/jpivarski/many-project-ideas
add 13 new project ideas from Jim
davidlange6 authored Jan 27, 2025
2 parents 4c9d7af + e03676d commit 3d2fc93
Showing 13 changed files with 596 additions and 0 deletions.
48 changes: 48 additions & 0 deletions projects/awkward-date-string-functions.yml
@@ -0,0 +1,48 @@
---
name: Dates and strings in Awkward Array
postdate: 2025-01-20
categories:
- Analysis tools
durations:
- 3 months
experiments:
- Any
skillset:
- Python
- C++
status:
- Available
project:
- Any
location:
- Any
commitment:
- Any
program:
- Any
shortdescription: "More date & string functions and NumPy's new varlen string in Awkward Array"
description: >
Awkward Array has a suite of string functions provided by Apache
Arrow (in `ak.str.*`). However, it's missing a few string functions
(see
[awkward#2703](https://github.com/scikit-hep/awkward/issues/2703))
and it could also be useful to similarly wrap Arrow's date-handling
functions (see
[awkward#2702](https://github.com/scikit-hep/awkward/issues/2702)),
taking care to translate between NumPy's date format (which Awkward
uses) and Arrow's date format. In addition, NumPy 2.0 added a new
variable-length string dtype (`StringDType`) that is different from all
other such formats, and it would be useful to convert it to and from
Awkward Arrays
(see
[awkward#3170](https://github.com/scikit-hep/awkward/issues/3170)). Although
most functionality can be added in Python, there's a slight chance
that accessing NumPy's varlen strings would require C (not C++).
contacts:
- name: Jim Pivarski
email: [email protected]

mentees: # keep an empty list until the project has started or a student is identified
# when that happens add a list with name: and link: attributes for each student
# - name: Student's name
# - link: #url for project page
58 changes: 58 additions & 0 deletions projects/awkward-gnn-helper-functions.yml
@@ -0,0 +1,58 @@
---
name: ML-ready Awkward Arrays
postdate: 2025-01-20
categories:
- Analysis tools
durations:
- 3 months
experiments:
- Any
skillset:
- Python
- ML
status:
- Available
project:
- Any
location:
- Any
commitment:
- Any
program:
- Any
shortdescription: "Helper functions to turn Awkward records into array dimensions and PyG indexes"
description: >
Awkward Array has functions to convert to and from TensorFlow and
PyTorch, such as
[ak.from_raggedtensor](https://awkward-array.org/doc/main/reference/generated/ak.from_raggedtensor.html)
and following, with support for TensorFlow's RaggedTensor. However,
there are format conversions that still have to be handled manually,
such as turning an Awkward Array of records (e.g. muon with pT, eta,
phi fields) into an array dimension (e.g. length-3 dimension in the
tensor `shape`). NumPy has a function for this,
[np.lib.recfunctions.structured_to_unstructured](https://numpy.org/doc/2.1/user/basics.rec.html#numpy.lib.recfunctions.structured_to_unstructured),
though the Awkward equivalent can have a different name (since it
has different submodules). The labor-intensive steps described in
[this StackOverflow
answer](https://stackoverflow.com/a/79215978/1623645) and [this
tutorial](https://hsf-training.github.io/deep-learning-intro-for-hep/25-ragged-data-and-graphs.html#building-permutation-invariance-into-the-model)
could be encapsulated as ready-to-use functions. Also,
PyTorch-Geometric (PyG) expects ragged arrays to be represented as
an external array of integers, which Awkward Array could generate
with a function (see
[awkward#3256](https://github.com/scikit-hep/awkward/issues/3256)). Yet
another framework, [PyTorch
Cluster](https://github.com/rusty1s/pytorch_cluster), expects
raggedness to be expressed as a list of tensors (see
[awkward#3265](https://github.com/scikit-hep/awkward/issues/3265)). All
of these helper functions would simplify the conversion of Awkward
Arrays into tensors for fixed-size NNs and GNNs.
contacts:
- name: Jim Pivarski
email: [email protected]

mentees: # keep an empty list until the project has started or a student is identified
# when that happens add a list with name: and link: attributes for each student
# - name: Student's name
# - link: #url for project page
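The two conversions named in the description above (record fields into a fixed-length dimension, and raggedness into an external integer array) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `records_to_rows` and `flatten_with_batch_index` are hypothetical, the data are made up, and the "batch index" convention (one integer per item, giving the index of the event it came from) is an assumption about what a PyG-style helper would produce, not the interface proposed in the linked issues.

```python
def records_to_rows(records, fields):
    """Turn a list of records (dicts) into rows with one column per field."""
    return [[rec[name] for name in fields] for rec in records]


def flatten_with_batch_index(ragged):
    """Flatten a list of lists and record which sublist each item came from."""
    flat, batch = [], []
    for i, sublist in enumerate(ragged):
        flat.extend(sublist)
        batch.extend([i] * len(sublist))
    return flat, batch


# Hypothetical events: a variable number of muons per event, each a record.
events = [
    [{"pt": 40.0, "eta": 0.1, "phi": 1.2}, {"pt": 25.0, "eta": -1.3, "phi": -0.4}],
    [],
    [{"pt": 60.0, "eta": 2.0, "phi": 3.0}],
]

muons, batch = flatten_with_batch_index(events)
rows = records_to_rows(muons, ["pt", "eta", "phi"])

print(rows)   # one length-3 row per muon
print(batch)  # event index of each muon
```

Here `rows` has shape (number of muons, 3) and `batch` says which event each row belongs to; the empty second event contributes no rows and its index never appears in `batch`.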
46 changes: 46 additions & 0 deletions projects/awkward-replace-jax-autodiff.yml
@@ -0,0 +1,46 @@
---
name: Custom autodiff in Awkward Array
postdate: 2025-01-20
categories:
- Analysis tools
durations:
- 3 months
- 1 year
experiments:
- Any
skillset:
- Python
status:
- Available
project:
- Any
location:
- Any
commitment:
- Any
program:
- Any
shortdescription: "Replace JAX with custom autodiff in Awkward Array"
description: >
At an [Analysis Tools](https://indico.cern.ch/event/1387764/)
meeting and in
[awkward#3349](https://github.com/scikit-hep/awkward/discussions/3349),
we've discussed the possibility of replacing JAX with a custom
implementation of automatic differentiation (autodiff,
also known as autograd). The problems with JAX are related to its
interface, which is intended to do much more than just
autodiff. Also, implementing eager autodiff is likely not a major
project, especially if we take advantage of [complex-step
differentiation](https://www.hedonisticlearning.com/posts/complex-step-differentiation.html). This
project would implement autodiff either as a module within Awkward
Array or as a new Scikit-HEP library (and possibly as a backend for
Vector, too).
contacts:
- name: Jim Pivarski
email: [email protected]

mentees: # keep an empty list until the project has started or a student is identified
# when that happens add a list with name: and link: attributes for each student
# - name: Student's name
# - link: #url for project page
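For readers unfamiliar with the complex-step differentiation mentioned in the description above, the idea can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the design proposed for Awkward Array: the function name `complex_step_derivative`, the step size, and the test function `f` are all assumptions chosen for the example.

```python
import cmath
import math


def complex_step_derivative(f, x, h=1e-20):
    """Approximate f'(x) for a real-analytic f by evaluating it at x + i*h."""
    # f(x + i*h) = f(x) + i*h*f'(x) + O(h**2), so the imaginary part divided
    # by h approximates f'(x) without subtracting two nearly equal numbers.
    return f(complex(x, h)).imag / h


def f(z):
    # Hypothetical test function; it must be built from operations that
    # accept complex arguments.
    return cmath.exp(z) * cmath.sin(z)


x = 0.7
approx = complex_step_derivative(f, x)
exact = math.exp(x) * (math.sin(x) + math.cos(x))  # analytic derivative of f
print(approx, exact)
```

Because no difference of function values is taken, the step `h` can be made extremely small, which is why the result can agree with the analytic derivative to roughly machine precision. The technique only applies to functions that are analytic and implemented with complex-capable operations.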
50 changes: 50 additions & 0 deletions projects/awkward-sorted_map.yml
@@ -0,0 +1,50 @@
---
name: Using std::maps in Awkward Array
postdate: 2025-01-20
categories:
- Analysis tools
durations:
- 3 months
experiments:
- Any
skillset:
- Python
status:
- Available
project:
- Any
location:
- Any
commitment:
- Any
program:
- Any
shortdescription: "Implement sorted_map type in Awkward Array"
description: >
Awkward Array implements some data types as types with equivalent
storage (e.g. lists of uint8 for strings) plus
[ak.behavior](https://awkward-array.org/doc/main/reference/ak.behavior.html)
to provide specialized functionality (e.g. printing as strings and
broadcasting one string as one object). A basic type that has not
been implemented is a key-value mapping, such as C++'s
`std::map`. This is different from Awkward Array's "record" type,
which has a fixed set of field names, each of which can have a
different type. A key-value mapping has keys of one type (often but
not always strings) and values of another, fixed type (not different
for each key), like `std::map<std::string, int>`. When Uproot
encounters C++ `std::map<K, V>` in a ROOT file, it produces an
Awkward Array of lists of pairs of types `K` and `V` with name
`"sorted_map"`. However, the "sorted map" behaviors that would make
this data type useful have not yet been implemented in Awkward Array
(see
[awkward#780](https://github.com/scikit-hep/awkward/issues/780)). This
project would add that functionality.
contacts:
- name: Jim Pivarski
email: [email protected]

mentees: # keep an empty list until the project has started or a student is identified
# when that happens add a list with name: and link: attributes for each student
# - name: Student's name
# - link: #url for project page
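To make the "lists of pairs" layout described above concrete, here is a plain-Python sketch of the kind of lookup a "sorted map" behavior might provide. It does not use Awkward Array or Uproot; the helper name `sorted_map_get` and the sample data are hypothetical, and the only assumption about the data is that each list of pairs is sorted by key, as a `std::map` would be.

```python
from bisect import bisect_left


def sorted_map_get(pairs, key, default=None):
    """Look up `key` in a list of (key, value) pairs sorted by key."""
    # (key,) compares less than any (key, value) tuple with the same key,
    # so bisect_left lands on the first pair whose key is >= `key`
    # without ever comparing the values themselves.
    i = bisect_left(pairs, (key,))
    if i < len(pairs) and pairs[i][0] == key:
        return pairs[i][1]
    return default


# One map per event, each stored as a key-sorted list of pairs, analogous
# to std::map<std::string, int>.
events = [
    [("electrons", 2), ("jets", 5), ("muons", 1)],
    [],
    [("jets", 3)],
]

jets_per_event = [sorted_map_get(pairs, "jets", default=0) for pairs in events]
print(jets_per_event)  # [5, 0, 3]
```

Unlike a record, whose fields are fixed for the whole array, each event here can have a different set of keys, which is why a missing key needs a default.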
43 changes: 43 additions & 0 deletions projects/awkward-units.yml
@@ -0,0 +1,43 @@
---
name: Awkward Arrays with physical units
postdate: 2025-01-20
categories:
- Analysis tools
durations:
- 3 months
experiments:
- Any
skillset:
- Python
status:
- Available
project:
- Any
location:
- Any
commitment:
- Any
program:
- Any
shortdescription: 'Adding "units" as Awkward Array metadata and conversions as behaviors'
description: >
Awkward Arrays already have an
[ak.Array.attrs](https://awkward-array.org/doc/main/reference/generated/ak.Array.html#ak.Array.attrs)
attribute that can carry arbitrary metadata (persistent or
transient) and an
[ak.behavior](https://awkward-array.org/doc/main/reference/ak.behavior.html)
that attaches functionality to arrays. Either or both of
these could be used to implement physical units on arrays and
convert between units when appropriate, such as putting two arrays
into common units before adding
them. [awkward#2468](https://github.com/scikit-hep/awkward/issues/2468)
is a discussion of this feature and possible implementations.
contacts:
- name: Jim Pivarski
email: [email protected]

mentees: # keep an empty list until the project has started or a student is identified
# when that happens add a list with name: and link: attributes for each student
# - name: Student's name
# - link: #url for project page
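The "common units before adding" step mentioned in the description above can be illustrated with plain nested lists standing in for ragged arrays. Everything in this sketch is an assumption made for illustration: the conversion table `TO_MEV`, the helpers `convert` and `add_with_units`, and the choice to keep the first operand's unit are hypothetical and do not reflect any existing Awkward Array interface.

```python
# Hypothetical conversion table: how many MeV one of each unit is worth.
TO_MEV = {"MeV": 1.0, "GeV": 1000.0, "TeV": 1000000.0}


def convert(values, from_unit, to_unit):
    """Rescale a list of lists from one energy unit to another."""
    factor = TO_MEV[from_unit] / TO_MEV[to_unit]
    return [[x * factor for x in sublist] for sublist in values]


def add_with_units(a, a_unit, b, b_unit):
    """Add two equally shaped ragged lists, expressing the result in a_unit."""
    b_converted = convert(b, b_unit, a_unit)
    summed = [
        [x + y for x, y in zip(sub_a, sub_b)]
        for sub_a, sub_b in zip(a, b_converted)
    ]
    return summed, a_unit


a = [[500.0, 1500.0], [], [250.0]]  # stored in MeV
b = [[1.0, 2.0], [], [0.5]]         # stored in GeV

total, unit = add_with_units(a, "MeV", b, "GeV")
print(total, unit)  # [[1500.0, 3500.0], [], [750.0]] MeV
```

In a real implementation the unit would travel with the array as metadata rather than as a separate argument; the sketch passes it explicitly only to keep the example self-contained.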
44 changes: 44 additions & 0 deletions projects/ragged-completion.yml
@@ -0,0 +1,44 @@
---
name: Completing the Ragged library
postdate: 2025-01-20
categories:
- Analysis tools
durations:
- 3 months
- 1 year
experiments:
- Any
skillset:
- Python
status:
- Available
project:
- Any
location:
- Any
commitment:
- Any
program:
- Any
shortdescription: "Implement the remaining functions to make Ragged an Array-API compliant ragged array library"
description: >
Scikit-HEP's
[Ragged](https://github.com/scikit-hep/ragged/discussions/6) library
is an interface over Awkward Array that restricts it to ragged
arrays only (no records, missing data, etc.) and satisfies the
Python [Array API](https://data-apis.org/) standard, which is rapidly
becoming the standard interface for array libraries. As such, the
requirements for Ragged are very precise: all required functions
have already been stubbed out with full docstrings, and about half
of them have been implemented. This project would be to complete it
and promote it as a fully functional, Array API-compliant ragged
array library.
contacts:
- name: Jim Pivarski
email: [email protected]

mentees: # keep an empty list until the project has started or a student is identified
# when that happens add a list with name: and link: attributes for each student
# - name: Student's name
# - link: #url for project page
44 changes: 44 additions & 0 deletions projects/scikit-hep-gpu-ecosystem.yml
@@ -0,0 +1,44 @@
---
name: Solidify the Scikit-HEP GPU ecosystem
postdate: 2025-01-20
categories:
- Analysis tools
durations:
- 3 months
- 1 year
experiments:
- Any
skillset:
- Python
- CUDA
status:
- Available
project:
- Any
location:
- Any
commitment:
- Any
program:
- Any
shortdescription: "Test and identify missing capabilities in the Scikit-HEP GPU ecosystem"
description: >
Awkward Array's CUDA kernels and Numba-CUDA support exist (see [this
training](https://hsf-training.github.io/array-oriented-programming/5-gpu.html#awkward-array)),
as does
[cuda-histogram](https://github.com/scikit-hep/cuda-histogram), but
these features haven't been heavily tested and probably haven't ever
been used in an analysis. This project would be to try using
Scikit-HEP libraries (including Vector and any other relevant
libraries) in a GPU-based analysis to find out what the pain
points are, and then either fix them directly or raise awareness
among the developers.
contacts:
- name: Jim Pivarski
email: [email protected]

mentees: # keep an empty list until the project has started or a student is identified
# when that happens add a list with name: and link: attributes for each student
# - name: Student's name
# - link: #url for project page
46 changes: 46 additions & 0 deletions projects/uproot-complete-tbranch-update.yml
@@ -0,0 +1,46 @@
---
name: Modifying existing TTrees in Uproot
postdate: 2025-01-20
categories:
- Analysis tools
durations:
- 3 months
- 1 year
experiments:
- Any
skillset:
- Python
status:
- Available
project:
- Any
location:
- Any
commitment:
- Any
program:
- Any
shortdescription: "Add new columns to existing TTrees (99% done) and/or new rows (new project) in Uproot"
description: >
Uproot can add new objects to existing ROOT files through
[uproot.update](https://uproot.readthedocs.io/en/latest/uproot.writing.writable.update.html),
but it would be even more useful if it could modify existing TTrees
in place. Zoë Bilodeau implemented the ability to add new
columns/TBranches, which is especially useful for backfilling data
(e.g. adding an array of `False` for triggers that didn't exist at
the time of data-taking). This implementation is nearly done (see
[uproot#1155](https://github.com/scikit-hep/uproot5/pull/1155)),
apart from a few corner cases that need to be tested and
debugged. It would also be useful to be able to add rows/entries,
which would be an entirely new project. Completing the
adding-columns project would provide the experience necessary to
tackle the adding-rows project.
contacts:
- name: Jim Pivarski
email: [email protected]

mentees: # keep an empty list until the project has started or a student is identified
# when that happens add a list with name: and link: attributes for each student
# - name: Student's name
# - link: #url for project page