lexibank · FredericBlum · Mar 24, 2025
diff --git a/.github/workflows/cldf-validation.yml b/.github/workflows/cldf-validation.yml
@@ -12,7 +12,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.10"]
+        python-version: ["3.13"]
 
     steps:
     - uses: actions/checkout@v3

diff --git a/.zenodo.json b/.zenodo.json
@@ -1,13 +1,6 @@
 {
     "title": "CLDF dataset derived from Grollemund et al.'s \"Bantu expansion shows habitat alters the route and pace of human dispersals\" from 2015",
-    "creators": [
-        {
-            "name": "Robert Forkel"
-        },
-        {
-            "name": "Tiago Tresoldi"
-        }
-    ],
+    "creators": [],
     "description": "<p>Cite the source of the dataset as:</p>\n\n<blockquote>\n<p>Grollemund, Rebecca, Branford, Simon, Bostoen, Koen, Meade, Andrew, Venditti, Chris, &amp; Pagel, Mark (2015) Bantu expansion shows habitat alters the route and pace of human dispersals. Proc Natl Acad Sci USA. doi:10.1073/pnas.1503793112.</p>\n</blockquote>",
     "access_right": "open",
     "keywords": [
@@ -19,12 +12,24 @@
     },
     "contributors": [
         {
-            "name": "Mark Pagel",
-            "type": "Distributor"
+            "name": "Robert Forkel",
+            "type": "Editor"
+        },
+        {
+            "name": "Tiago Tresoldi",
+            "type": "Editor"
+        },
+        {
+            "name": "Johann-Mattis List",
+            "type": "Editor"
         },
         {
             "name": "Rebecca Grollemund",
             "type": "Distributor"
+        },
+        {
+            "name": "Mark Pagel",
+            "type": "Distributor"
         }
     ],
     "upload_type": "dataset",

diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
@@ -1,9 +1,10 @@
 # Contributors
 
-Name | GitHub user | Role
- --- | --- | --- 
-Robert Forkel | @xrotwang | maintainer
-Tiago Tresoldi | @tresoldi | maintainer
-Mark Pagel | | Distributor
-Rebecca Grollemund | | Distributor
+Name | GitHub user | Description | Role
+ --- | --- | --- | --- 
+Robert Forkel | @xrotwang | CLDF conversion | Editor
+Tiago Tresoldi | @tresoldi | CLDF conversion | Editor
+Johann-Mattis List | @lingulist | orthography profile | Editor 
+Rebecca Grollemund | | data collection | Distributor | Author
+Mark Pagel | | data analysis | Distributor | Author
 
diff --git a/FORMS.md b/FORMS.md
@@ -8,18 +8,17 @@ The value-to-form processing is divided into two steps, implemented as methods:
 - `FormSpec.clean`: Normalizes a form chunk.
 
 These methods use the attributes of a `FormSpec` instance to configure their behaviour.
-
 - `brackets`: `{'(': ')'}`
   Pairs of strings that should be recognized as brackets, specified as `dict` mapping opening string to closing string
 - `separators`: `~,;/`
   Iterable of single character tokens that should be recognized as word separator
-- `missing_data`: `('?', '-')`
+- `missing_data`: `['', '0.0', '?', '-', '- ', '0']`
   Iterable of strings that are used to mark missing data
 - `strip_inside_brackets`: `False`
   Flag signaling whether to strip content in brackets (**and** strip leading and trailing whitespace)
-- `replacements`: `[]`
+- `replacements`: `[(' ', '_'), ('-~ bilí', 'bilí'), ('-́', '-'), ('-´', '-'), ('-ː', '-'), ('_x001E_thathu', 'thathu')]`
   List of pairs (`source`, `target`) used to replace occurrences of `source` in formswith `target` (before stripping content in brackets)
-- `first_form_only`: `False`
+- `first_form_only`: `True`
   Flag signaling whether at most one form should be returned from `split` - effectively ignoring any spelling variants, etc.
 - `normalize_whitespace`: `True`
   Flag signaling whether to normalize whitespace - stripping leading and trailing whitespace and collapsing multi-character whitespace to single spaces
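
The attributes changed in this hunk can be illustrated with a minimal pure-Python sketch. It is a simplified stand-in for the value-to-form step, not pylexibank's actual `FormSpec` implementation: the function names `clean_form` and `split_value` are hypothetical, bracket handling is omitted, and `REPLACEMENTS` is only a subset of the list in the diff.

```python
import re

# Values copied from the new FORMS.md configuration shown in this hunk.
MISSING_DATA = ['', '0.0', '?', '-', '- ', '0']
SEPARATORS = '~,;/'
# Subset of the configured replacements (assumption: applied in list order).
REPLACEMENTS = [(' ', '_'), ('_x001E_thathu', 'thathu')]


def clean_form(chunk):
    """Normalize whitespace in one chunk, then apply the replacements."""
    form = ' '.join(chunk.split())
    for source, target in REPLACEMENTS:
        form = form.replace(source, target)
    return form


def split_value(value, first_form_only=True):
    """Turn a raw cell value into a list of cleaned forms.

    Hypothetical simplification: separators inside brackets are not
    protected here, which a bracket-aware splitter would have to handle.
    """
    if value in MISSING_DATA:
        return []
    chunks = re.split('[' + re.escape(SEPARATORS) + ']', value)
    forms = [clean_form(chunk) for chunk in chunks]
    # Drop chunks that are empty or that are themselves missing-data markers.
    forms = [form for form in forms if form and form not in MISSING_DATA]
    return forms[:1] if first_form_only else forms


if __name__ == '__main__':
    print(split_value('muntu, bantu'))
    print(split_value('muntu, bantu', first_form_only=False))
    print(split_value('ku lia'))
    print(split_value('0'))
```

With `first_form_only` set to `True`, as in this change, spelling variants after the first separator are dropped, which is one plausible reason for the lower lexeme and token counts reported in the README statistics.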

diff --git a/NOTES.md b/NOTES.md
@@ -1,6 +1,4 @@
-
-Language mapping
-----------------
+## Language mapping
 
 From Harald Hammarström:
 
@@ -19,3 +17,7 @@ From Harald Hammarström:
 - Based on Philippson's other publications (the data is from him), JE32_Luyia could only be
   Masaaba, Isuxa, Logooli, or Saamia. Saamia [lsm] is the largest one and also the one the 
   missionaries tried to use for standardisation so one might as well guess JE32_Luyia is Saamia [lsm].
+
+## Consonant Clusters
+
+The orthography profile is only an approximation. In quite a few cases, ambiguities in the source prevented us from deciding on the pronunciation. We left these forms as they are and kindly ask users to check them before running any analysis in which the phonetic transcriptions of this dataset are important.
diff --git a/README.md b/README.md
@@ -21,9 +21,7 @@ Conceptlists in Concepticon:
 - [Grollemund-2015-100](https://concepticon.clld.org/contributions/Grollemund-2015-100)
 ## Notes
 
-
-Language mapping
-----------------
+## Language mapping
 
 From Harald Hammarström:
 
@@ -43,6 +41,10 @@ From Harald Hammarström:
   Masaaba, Isuxa, Logooli, or Saamia. Saamia [lsm] is the largest one and also the one the 
   missionaries tried to use for standardisation so one might as well guess JE32_Luyia is Saamia [lsm].
 
+## Consonant Clusters
+
+The orthography profile is only an approximation. In quite a few cases, ambiguities in the source prevented us from deciding on the pronunciation. We left these forms as they are and kindly ask users to check them before running any analysis in which the phonetic transcriptions of this dataset are important.
+
 
 
 ## Statistics
@@ -57,24 +59,25 @@ From Harald Hammarström:
 
 - **Varieties:** 424 (linked to 333 different Glottocodes)
 - **Concepts:** 100 (linked to 100 different Concepticon concept sets)
-- **Lexemes:** 37,846
+- **Lexemes:** 37,730
 - **Sources:** 217
 - **Synonymy:** 1.00
-- **Cognacy:** 37,713 cognates in 3,853 cognate sets (1,794 singletons)
+- **Cognacy:** 37,712 cognates in 3,853 cognate sets (1,794 singletons)
 - **Cognate Diversity:** 0.10
 - **Invalid lexemes:** 0
-- **Tokens:** 196,920
-- **Segments:** 560 (0 BIPA errors, 0 CLTS sound class errors, 555 CLTS modified)
-- **Inventory size (avg):** 34.86
+- **Tokens:** 183,363
+- **Segments:** 606 (0 BIPA errors, 0 CLTS sound class errors, 600 CLTS modified)
+- **Inventory size (avg):** 40.85
 
 # Contributors
 
-Name | GitHub user | Role
- --- | --- | --- 
-Robert Forkel | @xrotwang | maintainer
-Tiago Tresoldi | @tresoldi | maintainer
-Mark Pagel | | Distributor
-Rebecca Grollemund | | Distributor
+Name | GitHub user | Description | Role
+ --- | --- | --- | --- 
+Robert Forkel | @xrotwang | CLDF conversion | Editor
+Tiago Tresoldi | @tresoldi | CLDF conversion | Editor
+Johann-Mattis List | @lingulist | orthography profile | Editor 
+Rebecca Grollemund | | data collection | Distributor | Author
+Mark Pagel | | data analysis | Distributor | Author