chore: update plots for September 2024 crawl (CC-MAIN-2024-38)
sebastian-nagel committed Sep 24, 2024
1 parent e39dbf7 commit 129c764
Showing 41 changed files with 5,184 additions and 4,751 deletions.
2 changes: 1 addition & 1 deletion _config.yml
@@ -1,7 +1,7 @@
title: Statistics of Common Crawl Monthly Archives
description: Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
repository: commoncrawl/cc-crawl-statistics
-latest_crawl: CC-MAIN-2024-33
+latest_crawl: CC-MAIN-2024-38

show_navigation: True
navlist:
2 changes: 1 addition & 1 deletion plot.sh
@@ -50,7 +50,7 @@ zcat stats/excerpt/size.json.gz \

zcat stats/excerpt/tld.json.gz \
| python3 plot/tld.py CC-MAIN-2008-2009 CC-MAIN-2012 CC-MAIN-2014-10 \
-CC-MAIN-2016-30 CC-MAIN-2019-04 CC-MAIN-2020-40 $LATEST_CRAWL
+CC-MAIN-2016-30 CC-MAIN-2019-09 CC-MAIN-2022-49 $LATEST_CRAWL

zcat stats/excerpt/mimetype.json.gz \
| python3 plot/mimetype.py
92 changes: 46 additions & 46 deletions plots/charsets-top-100.html
@@ -2,9 +2,9 @@
<thead>
<tr style="text-align: right;">
<th>crawl</th>
<th>CC-MAIN-2024-26</th>
<th>CC-MAIN-2024-30</th>
<th>CC-MAIN-2024-33</th>
<th>CC-MAIN-2024-38</th>
</tr>
<tr>
<th>charset</th>
@@ -18,67 +18,67 @@
<th>&lt;other&gt;</th>
<td>0.0001</td>
<td>0.0001</td>
<td>0.0001</td>
<td>0.0000</td>
</tr>
<tr>
<th>&lt;unknown&gt;</th>
<td>1.6322</td>
<td>1.9684</td>
<td>2.0394</td>
<td>1.5057</td>
</tr>
<tr>
<th>Big5</th>
<td>0.0740</td>
<td>0.0779</td>
<td>0.0856</td>
<td>0.0823</td>
</tr>
<tr>
<th>Big5-HKSCS</th>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0001</td>
</tr>
<tr>
<th>EUC-JP</th>
<td>0.1076</td>
<td>0.1063</td>
<td>0.1060</td>
<td>0.1006</td>
</tr>
<tr>
<th>EUC-KR</th>
<td>0.1034</td>
<td>0.1011</td>
<td>0.0984</td>
<td>0.0837</td>
</tr>
<tr>
<th>GB18030</th>
<td>0.0178</td>
<td>0.0171</td>
<td>0.0170</td>
<td>0.0150</td>
</tr>
<tr>
<th>GB2312</th>
<td>0.4719</td>
<td>0.2880</td>
<td>0.2748</td>
<td>0.2230</td>
</tr>
<tr>
<th>GBK</th>
<td>0.1341</td>
<td>0.1383</td>
<td>0.1381</td>
<td>0.1110</td>
</tr>
<tr>
<th>IBM420</th>
<td>0.0052</td>
<td>0.0052</td>
<td>0.0056</td>
<td>0.0040</td>
</tr>
<tr>
<th>IBM424</th>
<td>0.0020</td>
<td>0.0021</td>
<td>0.0022</td>
<td>0.0014</td>
</tr>
<tr>
<th>IBM500</th>
@@ -96,19 +96,19 @@
<th>IBM866</th>
<td>0.0002</td>
<td>0.0002</td>
<td>0.0002</td>
<td>0.0003</td>
</tr>
<tr>
<th>ISO-2022-JP</th>
<td>0.0011</td>
<td>0.0010</td>
<td>0.0012</td>
<td>0.0009</td>
</tr>
<tr>
<th>ISO-8859-1</th>
<td>2.3257</td>
<td>2.2709</td>
<td>2.3973</td>
<td>2.7588</td>
</tr>
<tr>
<th>ISO-8859-13</th>
@@ -118,69 +118,69 @@
</tr>
<tr>
<th>ISO-8859-15</th>
<td>0.0487</td>
<td>0.0506</td>
<td>0.0528</td>
<td>0.0466</td>
</tr>
<tr>
<th>ISO-8859-16</th>
<td>0.0001</td>
<td>0.0002</td>
<td>0.0002</td>
<td>0.0001</td>
</tr>
<tr>
<th>ISO-8859-2</th>
<td>0.1164</td>
<td>0.1177</td>
<td>0.1278</td>
<td>0.1072</td>
</tr>
<tr>
<th>ISO-8859-3</th>
<td>0.0004</td>
<td>0.0005</td>
<td>0.0006</td>
<td>0.0005</td>
</tr>
<tr>
<th>ISO-8859-4</th>
<td>0.0007</td>
<td>0.0006</td>
<td>0.0007</td>
<td>0.0008</td>
</tr>
<tr>
<th>ISO-8859-5</th>
<td>0.0031</td>
<td>0.0021</td>
<td>0.0022</td>
<td>0.0019</td>
</tr>
<tr>
<th>ISO-8859-6</th>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0001</td>
</tr>
<tr>
<th>ISO-8859-7</th>
<td>0.0071</td>
<td>0.0062</td>
<td>0.0067</td>
<td>0.0054</td>
</tr>
<tr>
<th>ISO-8859-8</th>
<td>0.0006</td>
<td>0.0009</td>
<td>0.0010</td>
<td>0.0008</td>
</tr>
<tr>
<th>ISO-8859-9</th>
<td>0.0247</td>
<td>0.0254</td>
<td>0.0275</td>
<td>0.0235</td>
</tr>
<tr>
<th>KOI8-R</th>
<td>0.0072</td>
<td>0.0070</td>
<td>0.0071</td>
<td>0.0065</td>
</tr>
<tr>
<th>KOI8-U</th>
@@ -190,105 +190,105 @@
</tr>
<tr>
<th>Shift_JIS</th>
<td>0.1780</td>
<td>0.1747</td>
<td>0.1866</td>
<td>0.1602</td>
</tr>
<tr>
<th>TIS-620</th>
<td>0.0060</td>
<td>0.0059</td>
<td>0.0060</td>
<td>0.0054</td>
</tr>
<tr>
<th>US-ASCII</th>
<td>0.0327</td>
<td>0.0344</td>
<td>0.0378</td>
<td>0.0338</td>
</tr>
<tr>
<th>UTF-16</th>
<td>0.0049</td>
<td>0.0054</td>
<td>0.0058</td>
<td>0.0047</td>
</tr>
<tr>
<th>UTF-16BE</th>
<td>0.0004</td>
<td>0.0003</td>
<td>0.0003</td>
<td>0.0003</td>
</tr>
<tr>
<th>UTF-16LE</th>
<td>0.0015</td>
<td>0.0016</td>
<td>0.0013</td>
<td>0.0011</td>
</tr>
<tr>
<th>UTF-32</th>
<td>0.0000</td>
<td>0.0001</td>
<td>0.0000</td>
<td>0.0000</td>
</tr>
<tr>
<th>UTF-32LE</th>
<td>0.0002</td>
<td>0.0002</td>
<td>0.0004</td>
<td>0.0002</td>
</tr>
<tr>
<th>UTF-8</th>
<td>93.8361</td>
<td>93.6945</td>
<td>93.4387</td>
<td>93.9167</td>
</tr>
<tr>
<th>windows-1250</th>
<td>0.0782</td>
<td>0.0789</td>
<td>0.0784</td>
<td>0.0696</td>
</tr>
<tr>
<th>windows-1251</th>
<td>0.5168</td>
<td>0.5550</td>
<td>0.5710</td>
<td>0.4984</td>
</tr>
<tr>
<th>windows-1252</th>
<td>0.1721</td>
<td>0.1767</td>
<td>0.1907</td>
<td>0.1552</td>
</tr>
<tr>
<th>windows-1253</th>
<td>0.0026</td>
<td>0.0029</td>
<td>0.0028</td>
<td>0.0022</td>
</tr>
<tr>
<th>windows-1254</th>
<td>0.0114</td>
<td>0.0117</td>
<td>0.0130</td>
<td>0.0111</td>
</tr>
<tr>
<th>windows-1255</th>
<td>0.0062</td>
<td>0.0073</td>
<td>0.0078</td>
<td>0.0067</td>
</tr>
<tr>
<th>windows-1256</th>
<td>0.0487</td>
<td>0.0448</td>
<td>0.0484</td>
<td>0.0380</td>
</tr>
<tr>
<th>windows-1257</th>
<td>0.0096</td>
<td>0.0082</td>
<td>0.0079</td>
<td>0.0075</td>
</tr>
<tr>
<th>windows-1258</th>
@@ -298,21 +298,21 @@
</tr>
<tr>
<th>windows-31j</th>
<td>0.0005</td>
<td>0.0007</td>
<td>0.0006</td>
<td>0.0006</td>
</tr>
<tr>
<th>x-windows-874</th>
<td>0.0087</td>
<td>0.0081</td>
<td>0.0092</td>
<td>0.0070</td>
</tr>
<tr>
<th>x-windows-949</th>
<td>0.0000</td>
<td>0.0001</td>
<td>0.0001</td>
<td>0.0000</td>
</tr>
</tbody>
</table>
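
To make the effect of this table update easier to read, the following minimal sketch computes the change between CC-MAIN-2024-33 and CC-MAIN-2024-38 for a few charsets. The numbers are copied from the rows above; the names `SHARES` and `share_changes` are hypothetical and not part of the repository, and the sketch assumes that each value is the percentage of pages in a crawl with the given charset.

```python
# Values copied from the charset table above, as (CC-MAIN-2024-33, CC-MAIN-2024-38).
# Assumption: each value is the percentage of pages in that crawl.
SHARES = {
    "UTF-8": (93.4387, 93.9167),
    "ISO-8859-1": (2.3973, 2.7588),
    "<unknown>": (2.0394, 1.5057),
    "windows-1251": (0.5710, 0.4984),
    "GB2312": (0.2748, 0.2230),
}


def share_changes(shares):
    """Return {charset: new - old} in percentage points, rounded to 4 decimals."""
    return {name: round(new - old, 4) for name, (old, new) in shares.items()}


if __name__ == "__main__":
    # Print the charsets ordered from largest decrease to largest increase.
    changes = share_changes(SHARES)
    for name, delta in sorted(changes.items(), key=lambda item: item[1]):
        print(f"{name:<14}{delta:+.4f}")
```

Rounding to four decimals matches the precision of the table and avoids floating-point noise in the printed differences.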