chore: update plots for April 2024 crawl (CC-MAIN-2024-18)
thunderpoot committed May 1, 2024
1 parent 277bd65 commit f25b07a
Showing 39 changed files with 5,013 additions and 4,556 deletions.
2 changes: 1 addition & 1 deletion _config.yml
@@ -1,6 +1,6 @@
title: Statistics of Common Crawl Monthly Archives
description: Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
-latest_crawl: CC-MAIN-2024-10
+latest_crawl: CC-MAIN-2024-18

show_navigation: True
navlist:
94 changes: 47 additions & 47 deletions plots/charsets-top-100.html
@@ -2,9 +2,9 @@
<thead>
<tr style="text-align: right;">
<th>crawl</th>
-<th>CC-MAIN-2023-40</th>
<th>CC-MAIN-2023-50</th>
<th>CC-MAIN-2024-10</th>
+<th>CC-MAIN-2024-18</th>
</tr>
<tr>
<th>charset</th>
@@ -16,21 +16,21 @@
<tbody>
<tr>
<th>&lt;other&gt;</th>
<td>0.0000</td>
<td>0.0001</td>
<td>0.0000</td>
<td>0.0001</td>
</tr>
<tr>
<th>&lt;unknown&gt;</th>
<td>1.7751</td>
<td>1.9997</td>
<td>1.6892</td>
<td>1.7163</td>
</tr>
<tr>
<th>Big5</th>
<td>0.0622</td>
<td>0.0610</td>
<td>0.0554</td>
<td>0.0364</td>
</tr>
<tr>
<th>Big5-HKSCS</th>
@@ -40,51 +40,51 @@
</tr>
<tr>
<th>EUC-JP</th>
<td>0.1089</td>
<td>0.1110</td>
<td>0.1060</td>
<td>0.1091</td>
</tr>
<tr>
<th>EUC-KR</th>
<td>0.0832</td>
<td>0.0957</td>
<td>0.0806</td>
<td>0.0830</td>
</tr>
<tr>
<th>GB18030</th>
<td>0.0166</td>
<td>0.0204</td>
<td>0.0157</td>
<td>0.0167</td>
</tr>
<tr>
<th>GB2312</th>
<td>0.2485</td>
<td>0.3646</td>
<td>0.2435</td>
<td>0.2712</td>
</tr>
<tr>
<th>GBK</th>
<td>0.0975</td>
<td>0.1331</td>
<td>0.1052</td>
<td>0.1083</td>
</tr>
<tr>
<th>IBM420</th>
<td>0.0060</td>
<td>0.0055</td>
<td>0.0056</td>
<td>0.0055</td>
</tr>
<tr>
<th>IBM424</th>
<td>0.0023</td>
<td>0.0034</td>
<td>0.0022</td>
<td>0.0023</td>
</tr>
<tr>
<th>IBM500</th>
<td>0.0007</td>
<td>0.0008</td>
<td>0.0007</td>
<td>0.0007</td>
</tr>
<tr>
<th>IBM855</th>
@@ -95,44 +95,44 @@
<tr>
<th>IBM866</th>
<td>0.0002</td>
<td>0.0002</td>
<td>0.0001</td>
<td>0.0001</td>
</tr>
<tr>
<th>ISO-2022-JP</th>
<td>0.0008</td>
<td>0.0011</td>
<td>0.0012</td>
<td>0.0014</td>
</tr>
<tr>
<th>ISO-8859-1</th>
<td>2.2454</td>
<td>2.2951</td>
<td>2.3258</td>
<td>2.2744</td>
</tr>
<tr>
<th>ISO-8859-13</th>
<td>0.0001</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000</td>
</tr>
<tr>
<th>ISO-8859-15</th>
<td>0.0584</td>
<td>0.0553</td>
<td>0.0500</td>
<td>0.0484</td>
</tr>
<tr>
<th>ISO-8859-16</th>
<td>0.0002</td>
<td>0.0002</td>
<td>0.0001</td>
<td>0.0001</td>
</tr>
<tr>
<th>ISO-8859-2</th>
<td>0.1236</td>
<td>0.1236</td>
<td>0.1261</td>
<td>0.1140</td>
</tr>
<tr>
<th>ISO-8859-3</th>
@@ -142,153 +142,153 @@
</tr>
<tr>
<th>ISO-8859-4</th>
<td>0.0011</td>
<td>0.0008</td>
<td>0.0007</td>
<td>0.0007</td>
</tr>
<tr>
<th>ISO-8859-5</th>
<td>0.0028</td>
<td>0.0028</td>
<td>0.0026</td>
<td>0.0028</td>
</tr>
<tr>
<th>ISO-8859-6</th>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0001</td>
<td>0.0000</td>
</tr>
<tr>
<th>ISO-8859-7</th>
<td>0.0086</td>
<td>0.0084</td>
<td>0.0095</td>
<td>0.0069</td>
</tr>
<tr>
<th>ISO-8859-8</th>
<td>0.0005</td>
<td>0.0007</td>
<td>0.0009</td>
<td>0.0007</td>
</tr>
<tr>
<th>ISO-8859-9</th>
<td>0.0220</td>
<td>0.0264</td>
<td>0.0261</td>
<td>0.0258</td>
</tr>
<tr>
<th>KOI8-R</th>
<td>0.0060</td>
<td>0.0064</td>
<td>0.0077</td>
<td>0.0072</td>
</tr>
<tr>
<th>KOI8-U</th>
<td>0.0000</td>
<td>0.0001</td>
<td>0.0000</td>
<td>0.0001</td>
</tr>
<tr>
<th>Shift_JIS</th>
<td>0.1604</td>
<td>0.1953</td>
<td>0.1865</td>
<td>0.1905</td>
</tr>
<tr>
<th>TIS-620</th>
<td>0.0074</td>
<td>0.0062</td>
<td>0.0051</td>
<td>0.0053</td>
</tr>
<tr>
<th>US-ASCII</th>
<td>0.0272</td>
<td>0.0323</td>
<td>0.0349</td>
<td>0.0352</td>
</tr>
<tr>
<th>UTF-16</th>
<td>0.0034</td>
<td>0.0034</td>
<td>0.0034</td>
<td>0.0037</td>
</tr>
<tr>
<th>UTF-16BE</th>
<td>0.0008</td>
<td>0.0005</td>
<td>0.0003</td>
<td>0.0003</td>
</tr>
<tr>
<th>UTF-16LE</th>
<td>0.0014</td>
<td>0.0019</td>
<td>0.0015</td>
<td>0.0016</td>
</tr>
<tr>
<th>UTF-32</th>
<td>0.0001</td>
<td>0.0001</td>
<td>0.0000</td>
<td>0.0000</td>
</tr>
<tr>
<th>UTF-32LE</th>
<td>0.0006</td>
<td>0.0006</td>
<td>0.0003</td>
<td>0.0003</td>
</tr>
<tr>
<th>UTF-8</th>
<td>94.0352</td>
<td>93.5115</td>
<td>94.0329</td>
<td>94.0871</td>
</tr>
<tr>
<th>windows-1250</th>
<td>0.0822</td>
<td>0.0758</td>
<td>0.0748</td>
<td>0.0729</td>
</tr>
<tr>
<th>windows-1251</th>
<td>0.5314</td>
<td>0.5618</td>
<td>0.5250</td>
<td>0.5091</td>
</tr>
<tr>
<th>windows-1252</th>
<td>0.1830</td>
<td>0.2031</td>
<td>0.1836</td>
<td>0.1804</td>
</tr>
<tr>
<th>windows-1253</th>
<td>0.0031</td>
<td>0.0029</td>
<td>0.0026</td>
<td>0.0027</td>
</tr>
<tr>
<th>windows-1254</th>
<td>0.0102</td>
<td>0.0123</td>
<td>0.0095</td>
<td>0.0114</td>
</tr>
<tr>
<th>windows-1255</th>
<td>0.0041</td>
<td>0.0069</td>
<td>0.0076</td>
<td>0.0071</td>
</tr>
<tr>
<th>windows-1256</th>
<td>0.0552</td>
<td>0.0478</td>
<td>0.0538</td>
<td>0.0412</td>
</tr>
<tr>
<th>windows-1257</th>
<td>0.0111</td>
<td>0.0096</td>
<td>0.0125</td>
<td>0.0079</td>
</tr>
<tr>
<th>windows-1258</th>
@@ -299,7 +299,7 @@
<tr>
<th>windows-31j</th>
<td>0.0009</td>
<td>0.0009</td>
<td>0.0005</td>
<td>0.0005</td>
</tr>
<tr>
@@ -310,15 +310,15 @@
</tr>
<tr>
<th>x-windows-874</th>
<td>0.0108</td>
<td>0.0102</td>
<td>0.0097</td>
<td>0.0101</td>
</tr>
<tr>
<th>x-windows-949</th>
<td>0.0001</td>
<td>0.0001</td>
<td>0.0001</td>
<td>0.0000</td>
</tr>
</tbody>
</table>