This repository has been archived by the owner on May 29, 2018. It is now read-only.
bmcfee edited this page Oct 9, 2014 · 4 revisions

What is this project?

This project aims to develop tools to answer a variety of questions about the usage of scientific software. A few motivating examples include:

  • What software is out there?
  • How healthy is any given software project?
  • Which projects are used most frequently, possibly within a given domain (e.g., astronomy or biology)?
  • What code should I use to solve problem X?
  • What is the dependency structure of a group of software projects?

Applications and use cases

Target applications include:

  • Health monitoring of software: how "alive" is a given project?
  • Trend analysis: how does a project's usage change over time?
  • Search and recommendation: what's out there? what should I use? what do other people use?

Implementation, information sources

As a first pass, we can leverage existing sources of information:

  • Software distributions (Ubuntu, Anaconda, raw source code) to extract the dependency structure between projects
  • GitHub (and its API) to monitor analytics (e.g., downloads) and forking structure
  • arXiv papers, to monitor citations and publications

Note: the dependency structure is likely to be crucial here, due to academics' tendency to incompletely cite individual packages. (Everyone uses BLAS, but almost nobody cites it directly.)
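
To illustrate why the dependency structure matters, here is a minimal sketch of propagating citation counts through a dependency graph, so that a package like BLAS receives credit from citations of the packages built on it. The package names, the dependency edges, and the citation counts are made-up toy data, and the function names (`transitive_dependencies`, `implied_usage`) are hypothetical; this is one possible approach, not a method this project has settled on.

```python
from collections import defaultdict


def transitive_dependencies(deps, package):
    """Return every package that `package` depends on, directly or indirectly."""
    seen = set()
    stack = list(deps.get(package, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(deps.get(current, ()))
    # Guard against dependency cycles crediting a package to itself.
    seen.discard(package)
    return seen


def implied_usage(deps, direct_citations):
    """Credit each citation of a package to that package and all its dependencies."""
    usage = defaultdict(int)
    for package, count in direct_citations.items():
        usage[package] += count
        for dependency in transitive_dependencies(deps, package):
            usage[dependency] += count
    return dict(usage)


# Toy data (assumed for illustration only): package -> direct dependencies.
deps = {
    "scikit-learn": ["scipy", "numpy"],
    "scipy": ["numpy", "blas"],
    "numpy": ["blas"],
}
# Toy direct citation counts; BLAS is never cited directly.
direct_citations = {"scikit-learn": 10, "scipy": 5, "numpy": 3, "blas": 0}

usage = implied_usage(deps, direct_citations)
```

In this toy example BLAS has no direct citations, yet it ends up with the highest implied usage because every cited package depends on it.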

Points for discussion

  • Dead links: what happens when graduate students graduate?
  • Identifying citations and URLs that correspond to software
  • Entity disambiguation
  • Should we include data sets, or limit attention purely to software?
  • Beyond arXiv: how to deal with other domains?
  • How do we deal with poor citation quality? DOIs tend to be the first thing cut to meet space constraints. Can we motivate publication venues to encourage proper software citation practices?
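
As a starting point for the question of identifying URLs that correspond to software, the sketch below pulls GitHub repository references out of free text such as a paper's body. The regular expression, the function name `find_github_repos`, and the sample text are assumptions made for illustration; real papers will also mention software through other hosts, project homepages, and plain names, none of which this handles.

```python
import re

# Assumed pattern: matches http(s)://github.com/<owner>/<repo>.
GITHUB_URL = re.compile(
    r"https?://github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)"
)


def find_github_repos(text):
    """Return a sorted list of unique (owner, repo) pairs mentioned in `text`."""
    repos = set()
    for owner, name in GITHUB_URL.findall(text):
        # Drop sentence-ending periods captured by the pattern.
        name = name.rstrip(".")
        # Treat clone URLs ending in ".git" as the same repository.
        if name.endswith(".git"):
            name = name[:-4]
        if name:
            repos.add((owner, name))
    return sorted(repos)


# Hypothetical sample text; the repositories named here are placeholders.
sample = (
    "Code is available at https://github.com/example-lab/example-tool. "
    "See also http://github.com/example-lab/other.git for the data loader."
)
repos = find_github_repos(sample)
```

Mapping each extracted pair to a canonical project would still require the entity disambiguation step listed above.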