Merged
2 changes: 1 addition & 1 deletion .github/workflows/cldf-validation.yml
@@ -12,7 +12,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10"]
python-version: ["3.13"]

steps:
- uses: actions/checkout@v3
25 changes: 15 additions & 10 deletions .zenodo.json
@@ -1,13 +1,6 @@
{
"title": "CLDF dataset derived from Grollemund et al.'s \"Bantu expansion shows habitat alters the route and pace of human dispersals\" from 2015",
"creators": [
{
"name": "Robert Forkel"
},
{
"name": "Tiago Tresoldi"
}
],
"creators": [],
"description": "<p>Cite the source of the dataset as:</p>\n\n<blockquote>\n<p>Grollemund, Rebecca, Branford, Simon, Bostoen, Koen, Meade, Andrew, Venditti, Chris, &amp; Pagel, Mark (2015) Bantu expansion shows habitat alters the route and pace of human dispersals. Proc Natl Acad Sci USA. doi:10.1073/pnas.1503793112.</p>\n</blockquote>",
"access_right": "open",
"keywords": [
@@ -19,12 +12,24 @@
},
"contributors": [
{
"name": "Mark Pagel",
"type": "Distributor"
"name": "Robert Forkel",
"type": "Editor"
},
{
"name": "Tiago Tresoldi",
"type": "Editor"
},
{
"name": "Johann-Mattis List",
"type": "Editor"
},
{
"name": "Rebecca Grollemund",
"type": "Distributor"
},
{
"name": "Mark Pagel",
"type": "Distributor"
}
],
"upload_type": "dataset",
13 changes: 7 additions & 6 deletions CONTRIBUTORS.md
@@ -1,9 +1,10 @@
# Contributors

Name | GitHub user | Role
--- | --- | ---
Robert Forkel | @xrotwang | maintainer
Tiago Tresoldi | @tresoldi | maintainer
Mark Pagel | | Distributor
Rebecca Grollemund | | Distributor
Name | GitHub user | Description | Role
--- | --- | --- | ---
Robert Forkel | @xrotwang | CLDF conversion | Editor
Tiago Tresoldi | @tresoldi | CLDF conversion | Editor
Johann-Mattis List | @lingulist | orthography profile | Editor
Rebecca Grollemund | | data collection | Distributor, Author
Mark Pagel | | data analysis | Distributor, Author

7 changes: 3 additions & 4 deletions FORMS.md
@@ -8,18 +8,17 @@ The value-to-form processing is divided into two steps, implemented as methods:
- `FormSpec.clean`: Normalizes a form chunk.

These methods use the attributes of a `FormSpec` instance to configure their behaviour.

- `brackets`: `{'(': ')'}`
Pairs of strings that should be recognized as brackets, specified as `dict` mapping opening string to closing string
- `separators`: `~,;/`
Iterable of single character tokens that should be recognized as word separator
- `missing_data`: `('?', '-')`
- `missing_data`: `['', '0.0', '?', '-', '- ', '0']`
Iterable of strings that are used to mark missing data
- `strip_inside_brackets`: `False`
Flag signaling whether to strip content in brackets (**and** strip leading and trailing whitespace)
- `replacements`: `[]`
- `replacements`: `[(' ', '_'), ('-~ bilí', 'bilí'), ('-́', '-'), ('-´', '-'), ('-ː', '-'), ('_x001E_thathu', 'thathu')]`
List of pairs (`source`, `target`) used to replace occurrences of `source` in forms with `target` (before stripping content in brackets)
- `first_form_only`: `False`
- `first_form_only`: `True`
Flag signaling whether at most one form should be returned from `split` - effectively ignoring any spelling variants, etc.
- `normalize_whitespace`: `True`
Flag signaling whether to normalize whitespace - stripping leading and trailing whitespace and collapsing multi-character whitespace to single spaces
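
To make the effect of the new settings concrete, here is a minimal sketch of the value-to-form steps in plain Python. It is not pylexibank's implementation: the `split` and `clean` functions below are hypothetical stand-ins, bracket handling is left out, only a subset of the `replacements` is used, and applying the replacements in list order is an assumption. The input strings in the usage note are made up for illustration.

```python
# Hypothetical, simplified stand-in for the split/clean steps described above.
# Values are copied from the new settings in this diff.
MISSING_DATA = ['', '0.0', '?', '-', '- ', '0']
SEPARATORS = '~,;/'
# Subset of the listed replacements, applied in list order (an assumption).
REPLACEMENTS = [(' ', '_'), ('_x001E_thathu', 'thathu')]
FIRST_FORM_ONLY = True


def clean(chunk):
    """Apply each (source, target) replacement to one form chunk."""
    for source, target in REPLACEMENTS:
        chunk = chunk.replace(source, target)
    return chunk


def split(value):
    """Turn a raw cell value into a list of cleaned forms."""
    # Strip and collapse whitespace first (normalize_whitespace: True).
    value = ' '.join(value.split())
    if value in MISSING_DATA:
        return []
    # Split on every single-character separator; brackets are ignored here.
    chunks = [value]
    for sep in SEPARATORS:
        chunks = [part for chunk in chunks for part in chunk.split(sep)]
    forms = [clean(chunk.strip()) for chunk in chunks if chunk.strip()]
    # first_form_only: True keeps at most one form per value.
    return forms[:1] if FIRST_FORM_ONLY else forms
```

Under these assumptions, `split('0.0')` and `split('- ')` both return an empty list, `split('mu ntu, ba ntu')` returns `['mu_ntu']` because only the first variant is kept, and `split('_x001E_thathu')` returns `['thathu']`.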
8 changes: 5 additions & 3 deletions NOTES.md
@@ -1,6 +1,4 @@

Language mapping
----------------
## Language mapping

From Harald Hammarström:

@@ -19,3 +17,7 @@ From Harald Hammarström:
- Based on Philippson's other publications (the data is from him), JE32_Luyia could only be
Masaaba, Isuxa, Logooli, or Saamia. Saamia [lsm] is the largest one and also the one the
missionaries tried to use for standardisation so one might as well guess JE32_Luyia is Saamia [lsm].

## Consonant Clusters

The orthography profile is only an approximation. In quite a few cases, ambiguities prevented us from deciding what the pronunciation is. We left these cases as they are, and kindly ask users to check them before running any analysis in which the phonetic transcriptions of this dataset are important.
31 changes: 17 additions & 14 deletions README.md
@@ -21,9 +21,7 @@ Conceptlists in Concepticon:
- [Grollemund-2015-100](https://concepticon.clld.org/contributions/Grollemund-2015-100)
## Notes


Language mapping
----------------
## Language mapping

From Harald Hammarström:

@@ -43,6 +41,10 @@ From Harald Hammarström:
Masaaba, Isuxa, Logooli, or Saamia. Saamia [lsm] is the largest one and also the one the
missionaries tried to use for standardisation so one might as well guess JE32_Luyia is Saamia [lsm].

## Consonant Clusters

The orthography profile is only an approximation. In quite a few cases, ambiguities prevented us from deciding what the pronunciation is. We left these cases as they are, and kindly ask users to check them before running any analysis in which the phonetic transcriptions of this dataset are important.



## Statistics
@@ -57,24 +59,25 @@ From Harald Hammarström:

- **Varieties:** 424 (linked to 333 different Glottocodes)
- **Concepts:** 100 (linked to 100 different Concepticon concept sets)
- **Lexemes:** 37,846
- **Lexemes:** 37,730
- **Sources:** 217
- **Synonymy:** 1.00
- **Cognacy:** 37,713 cognates in 3,853 cognate sets (1,794 singletons)
- **Cognacy:** 37,712 cognates in 3,853 cognate sets (1,794 singletons)
- **Cognate Diversity:** 0.10
- **Invalid lexemes:** 0
- **Tokens:** 196,920
- **Segments:** 560 (0 BIPA errors, 0 CLTS sound class errors, 555 CLTS modified)
- **Inventory size (avg):** 34.86
- **Tokens:** 183,363
- **Segments:** 606 (0 BIPA errors, 0 CLTS sound class errors, 600 CLTS modified)
- **Inventory size (avg):** 40.85

# Contributors

Name | GitHub user | Role
--- | --- | ---
Robert Forkel | @xrotwang | maintainer
Tiago Tresoldi | @tresoldi | maintainer
Mark Pagel | | Distributor
Rebecca Grollemund | | Distributor
Name | GitHub user | Description | Role
--- | --- | --- | ---
Robert Forkel | @xrotwang | CLDF conversion | Editor
Tiago Tresoldi | @tresoldi | CLDF conversion | Editor
Johann-Mattis List | @lingulist | orthography profile | Editor
Rebecca Grollemund | | data collection | Distributor, Author
Mark Pagel | | data analysis | Distributor, Author


