Home

What is this project?

This project aims to develop tools to answer a variety of questions about the usage of scientific software. A few motivating examples include:

What software is out there?
How healthy is any given software project?
Which projects are used frequently, perhaps within a particular domain (e.g., astronomy or biology)?
What code should I use to solve problem X?
What is the dependency structure of a group of software projects?

Applications and use cases

Target applications include:

Health monitoring of software: how "alive" is a given project?
Trend analysis: how does a project's usage change over time?
Search and recommendation: what's out there? what should I use? what do other people use?

Implementation, information sources

As a first pass, we can leverage existing sources of information:

Software distributions (Ubuntu, Anaconda, raw source code) to extract dependency structure between projects
GitHub (and its API) to monitor analytics (e.g., downloads) and forking structure
arXiv papers, to monitor citations and publications

Note: the dependency structure is likely to be crucial here, due to academics' tendency to incompletely cite individual packages. (Everyone uses BLAS, but almost nobody cites it directly.)
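
To illustrate why the dependency structure matters, here is a minimal sketch of how direct citation counts could be propagated to uncited dependencies. The function names (`transitive_dependencies`, `implied_usage`), the package graph, and the citation counts are all hypothetical and chosen for illustration only; real data would come from the sources listed above.

```python
from collections import defaultdict


def transitive_dependencies(deps, package):
    """Return every package reachable from `package` via dependency edges."""
    seen = set()
    stack = list(deps.get(package, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(deps.get(current, ()))
    # Guard against cyclic dependency data pointing back at the start.
    seen.discard(package)
    return seen


def implied_usage(deps, direct_citations):
    """Credit each direct citation to the cited package and all its dependencies."""
    usage = defaultdict(int)
    for package, count in direct_citations.items():
        usage[package] += count
        for dep in transitive_dependencies(deps, package):
            usage[dep] += count
    return dict(usage)


# Hypothetical dependency graph: package -> packages it depends on.
deps = {
    "astropy": ["numpy", "scipy"],
    "scipy": ["numpy", "blas"],
    "numpy": ["blas"],
}

# Hypothetical direct citation counts; note that "blas" is never cited.
direct_citations = {"astropy": 10, "scipy": 5}

usage = implied_usage(deps, direct_citations)
for name in sorted(usage):
    print(name, usage[name])
```

In this toy example "blas" receives no direct citations but still ends up with a nonzero implied usage count. One caveat of this simple scheme: a paper that cites two packages sharing a dependency credits that dependency twice, so a real implementation would probably aggregate per paper rather than per package.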

Points for discussion

Dead links: what happens when graduate students graduate?
Identifying citations and URLs that correspond to software
Entity disambiguation
Should we include data sets, or limit attention purely to software?
Beyond arXiv: how to deal with other domains?
How do we deal with poor citation quality? DOIs tend to be the first thing cut to meet space constraints. Can we motivate publication venues to encourage proper software citation practices?
