GitHub - jackmo77/boxscores: Capture and processing of historical minor league boxscores

forked from chadwickbureau/data-boxscores

Historical minor league baseball boxscores
Prepared and maintained by Chadwick Baseball Bureau (http://www.chadwick-bureau.com)
Contact: Dr T L Turocy ([email protected])

ABOUT THIS DATA

This package contains transcriptions of historical minor league boxscores.
Please read the description below carefully to be sure you understand what
these data are (and what they aren't).


COPYRIGHT AND LICENSE

These files are copyright by Chadwick Baseball Bureau.
They are licensed under the Creative Commons Attribution 4.0 International license:
https://creativecommons.org/licenses/by/4.0/

The source code to transform the original transcriptions into standardised formats
(found in the src/ directory) is copyright by T L Turocy and Chadwick Baseball Bureau.
It is licensed under the GNU General Public Licence, version 2.0 (or later, at the
user's discretion):
https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html


DETAILS

Historically, the published averages for many minor leagues in baseball omitted
players who appeared in only a handful of games ("less-thans"); other leagues
never published complete averages at all.  In these cases, the only way to
document the participation of these players is by capturing a published boxscore.
In addition, there are other reasons why having a compilation of game-level
data for some historical minor leagues may be of interest.

We have developed a simple, text-based format for capturing boxscores efficiently.
This format is similar to the structure of a typical newspaper boxscore,
allowing transcription of the data, with only a minimum of markup required of
the inputter.  These files are organised in the transcript/ directory.
Each source is included in a separate subdirectory.  For example, a source might
consist of all boxscores found in a particular newspaper in a particular year.

There is a parser, in src/convert.py, which takes all of the boxscore transcriptions
from a source, and processes the data into CSV files, which are placed into
a corresponding directory under processed/.  This process does not add or interpolate
any new information, but simply extracts and interprets the information found
in the original transcriptions.  The resulting CSV files are then suitable
for further editorial processing.

The objective of these files is to render the content of those sources in a way that
is as faithful as possible to the originals.  It is important to recognise that
THE GUIDES AND OTHER SOURCES CONTAIN ERRORS AND INACCURACIES.  This collection of
files does not attempt to identify and/or propose corrections to those errors.
The scope of this collection is to document the contents of sources in a standard
and systematic way, and thereby to provide the inputs required by editors who wish to
produce cleaned, corrected, or improved accounts of the performance data for these
leagues.  The files in this collection therefore provide one essential component
in the chain of evidence required to produce such improved data.


PEOPLE NAMES TABLES

For each source, a table called people.csv is built.  This summarises the number of
appearances for each name on each club, including first and last observed dates,
and games by position.  This table is grouped by name; therefore, if a player
appears under more than one spelling of his name (which is not at all uncommon), his
performance will be split across multiple rows.  Again, making judgments about
proper names and identifications is a task that is carried out downstream from
this dataset.
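
The grouping described above can be sketched in a few lines of Python.  This is
an illustration only, not the code in src/convert.py: the input layout (one tuple
of name, club, date and position per appearance), the function name and the sample
names are all assumptions made for the example.

```python
from datetime import date


def summarise_people(appearances):
    """Group appearances by (name, club); hypothetical layout, not the real people.csv."""
    rows = {}
    for name, club, played_on, position in appearances:
        row = rows.setdefault((name, club), {
            "games": 0,
            "first_date": played_on,
            "last_date": played_on,
            "by_position": {},
        })
        row["games"] += 1
        row["first_date"] = min(row["first_date"], played_on)
        row["last_date"] = max(row["last_date"], played_on)
        row["by_position"][position] = row["by_position"].get(position, 0) + 1
    return rows


# Invented sample data: the same man may appear as "Smith, J" and "Smyth, J".
appearances = [
    ("Smith, J", "Albany", date(1915, 5, 1), "ss"),
    ("Smith, J", "Albany", date(1915, 5, 3), "2b"),
    ("Smyth, J", "Albany", date(1915, 5, 4), "ss"),
    ("Smith, J", "Troy", date(1915, 6, 10), "ss"),
]
people = summarise_people(appearances)
for (name, club), row in sorted(people.items()):
    print(name, club, row["games"], row["first_date"], row["last_date"], row["by_position"])
```

Because the key is the name exactly as printed, the "Smyth, J" boxscore produces its
own row rather than being merged with "Smith, J"; no identification is attempted here.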

Each row is given a person.ref identifier.  These are eight-character strings
of the format LLLLNNTT.  The first four characters are the (double) metaphone encoding
of the last name of the person, padded out to four characters if necessary by adding
'Z' (as 'Z' is not a letter that is used in metaphone).  The digits TT are the total
count of names with the same metaphone encoding, among names observed in that 
player's league.  NN is a sequence number, which can range from 01 up to TT.
The sequence is generated by sorting (lexicographically) on last name, first name,
and then club name.  

For example, suppose there are four separate entries with the surname Smith,
differing by club and/or first name/initial.  The metaphone encoding of Smith is
SM0, which is padded to SM0Z.  The four entries would then have the person.ref values
of SM0Z0104, SM0Z0204, SM0Z0304, and SM0Z0404.  The order in which they are assigned
is determined by the sorting of their first name and club name.
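
As a minimal sketch of the scheme just described (not the actual implementation in
src/), the assignment could look like the following.  It assumes that all entries
come from a single league, that the metaphone encoding of each last name is supplied
by the caller rather than computed, and that encodings longer than four characters
are truncated; the function names, first initials and club names are hypothetical.

```python
from collections import defaultdict


def pad_key(metaphone_key):
    """Pad (or, by assumption, truncate) a metaphone encoding to four characters."""
    return metaphone_key[:4].ljust(4, "Z")


def assign_person_refs(entries):
    """Map (last, first, club) -> person.ref of the form LLLLNNTT.

    Each entry is a dict with keys 'key' (metaphone encoding of the last name),
    'last', 'first' and 'club'.
    """
    groups = defaultdict(list)
    for entry in entries:
        groups[pad_key(entry["key"])].append(entry)

    refs = {}
    for key, members in groups.items():
        # Sort lexicographically on last name, first name, then club name.
        members.sort(key=lambda e: (e["last"], e["first"], e["club"]))
        total = len(members)
        for seq, e in enumerate(members, start=1):
            refs[(e["last"], e["first"], e["club"])] = f"{key}{seq:02d}{total:02d}"
    return refs


# Four invented Smith entries, all with the encoding SM0 given in the text.
smiths = [
    {"key": "SM0", "last": "Smith", "first": "W", "club": "Utica"},
    {"key": "SM0", "last": "Smith", "first": "J", "club": "Troy"},
    {"key": "SM0", "last": "Smith", "first": "A", "club": "Albany"},
    {"key": "SM0", "last": "Smith", "first": "J", "club": "Albany"},
]
refs = assign_person_refs(smiths)
for name, ref in sorted(refs.items(), key=lambda item: item[1]):
    print(ref, name)
```

The input order does not matter: the identifiers depend only on the sorted order
within each group of names sharing an encoding.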

This is a deterministic way to assign these identifiers, and therefore the same
identifier will always be assigned to the same row, as long as the dataset is not
changed.  Also, if new boxscores are added with new names, this will only affect
the person.ref assignments to names with the same metaphone encoding.  Picking up
the Smith example, suppose a new boxscore is added, and there is one new player,
named Baker.  Because Baker has the metaphone encoding of PKR, this will not affect
the person.ref values given to the Smiths.  However, if the new boxscore had a 
new person named Schmitt, which also has metaphone encoding SM0, the person.ref of the 
Smith entries would all change.  Schmitt would become SM0Z0105 (because there are 
now 5 rows with SM0, and Schmitt sorts before Smith).  Then the four Smiths would be
SM0Z0205 through SM0Z0505.  
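
The Baker and Schmitt cases above can be replayed with a short, self-contained
sketch.  The encodings (SM0 for Smith and Schmitt, PKR for Baker) are taken from
the text rather than computed, and the helper names, first initials and club names
are hypothetical.

```python
from itertools import groupby


def person_refs(entries):
    """Map (last, first, club) -> person.ref for (key, last, first, club) tuples."""
    # Pad each encoding to four characters, then sort by key, last, first, club.
    padded = sorted(
        (key[:4].ljust(4, "Z"), last, first, club)
        for key, last, first, club in entries
    )
    refs = {}
    for key, group in groupby(padded, key=lambda e: e[0]):
        members = list(group)
        for seq, (_, last, first, club) in enumerate(members, start=1):
            refs[(last, first, club)] = f"{key}{seq:02d}{len(members):02d}"
    return refs


def changed_refs(before, after):
    """Return the rows present in `before` whose person.ref differs in `after`."""
    return {
        name: (before[name], after[name])
        for name in before
        if before[name] != after[name]
    }


smiths = [
    ("SM0", "Smith", "A", "Albany"),
    ("SM0", "Smith", "J", "Albany"),
    ("SM0", "Smith", "J", "Troy"),
    ("SM0", "Smith", "W", "Utica"),
]
base = person_refs(smiths)
with_baker = person_refs(smiths + [("PKR", "Baker", "T", "Troy")])
with_schmitt = person_refs(smiths + [("SM0", "Schmitt", "H", "Utica")])

print(changed_refs(base, with_baker))    # empty: no Smith identifier changes
print(changed_refs(base, with_schmitt))  # all four Smith identifiers change
```

Comparing the identifiers before and after an update in this way gives a list of
exactly those rows whose identification may need to be revisited.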

The effect of this scheme is to make it possible to collect boxscores incrementally.
On the one hand, it should be possible to refer to a row in a stable way.  On the other
hand, when a new spelling of a name comes into the dataset, it may sometimes prompt a
downstream revision of the identification of a player.  It could be,
for example, that one of those players listed as Smith really is Schmitt, and the
existence of the boxscore with Schmitt leads the researcher to revise the identification.
The use of metaphone means these possible reassignments will get flagged only for
similar-sounding names; the use of the total count in the identifier ensures identifiers
will not get re-used.