From 88617cb03727c9f7dc77a4d2a891a441ce3ade6f Mon Sep 17 00:00:00 2001
From: Gabor Szarnyas
Date: Sat, 7 Dec 2024 10:52:40 +0100
Subject: [PATCH 1/7] Remove duplicate word

---
 _posts/2024-12-05-csv-files-dethroning-parquet-or-not.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2024-12-05-csv-files-dethroning-parquet-or-not.md b/_posts/2024-12-05-csv-files-dethroning-parquet-or-not.md
index 9314015e4a..5926e68f40 100644
--- a/_posts/2024-12-05-csv-files-dethroning-parquet-or-not.md
+++ b/_posts/2024-12-05-csv-files-dethroning-parquet-or-not.md
@@ -64,7 +64,7 @@ Furthermore, the reader became one of the fastest CSV readers in analytical syst

 ## Comparing CSV and Parquet

-With the large boost boost in usability and performance for the CSV reader, one might ask: what is the actual difference in performance when loading a CSV file compared to a Parquet file into a table? Additionally, how do these formats differ when running queries directly on them?
+With the large boost in usability and performance for the CSV reader, one might ask: what is the actual difference in performance when loading a CSV file compared to a Parquet file into a table? Additionally, how do these formats differ when running queries directly on them?

 To find out, we will run a few examples using both CSV and Parquet files containing TPC-H data to shed light on their differences. All scripts used to generate the benchmarks of this blogpost can be found in a [repository](https://github.com/pdet/csv_vs_parquet).

From 119367495550bd9eb666f65ea5d67e974a849b73 Mon Sep 17 00:00:00 2001
From: Gabor Szarnyas
Date: Sat, 7 Dec 2024 10:58:53 +0100
Subject: [PATCH 2/7] Blog: Remove repeated words

---
 _posts/2023-04-14-h2oai.md                       | 2 +-
 _posts/2024-09-27-sql-only-extensions.md         | 2 +-
 _posts/2024-11-29-duckdb-tricks-part-3.md        | 2 +-
 _posts/2024-12-06-duckdb-tpch-sf100-on-mobile.md | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/_posts/2023-04-14-h2oai.md b/_posts/2023-04-14-h2oai.md
index 338ced0e7a..b9095b861d 100644
--- a/_posts/2023-04-14-h2oai.md
+++ b/_posts/2023-04-14-h2oai.md
@@ -47,7 +47,7 @@ The queries have not changed since the benchmark went dormant. The data is gener
 | advanced groupby #2 | `SELECT id3, max(v1)-min(v2) AS range_v1_v2 FROM tbl GROUP BY id3` | Range selection over small cardinality groups, grouped by integer |
 | advanced groupby #3 | `SELECT id6, v3 AS largest2_v3 FROM (SELECT id6, v3, row_number() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS order_v3 FROM x WHERE v3 IS NOT NULL) sub_query WHERE order_v3 <= 2` |Advanced group by query |
 | advanced groupby #4 | `SELECT id2, id4, pow(corr(v1, v2), 2) AS r2 FROM tbl GROUP BY id2, id4` | Arithmetic over medium sized groups, grouped by varchar, integer. |
-| advanced groupby #5 | `SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count(*) AS count FROM tbl GROUP BY id1, id2, id3, id4, id5, id6` | Many many small groups, the number of groups is the cardinality of the dataset |
+| advanced groupby #5 | `SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count(*) AS count FROM tbl GROUP BY id1, id2, id3, id4, id5, id6` | Many small groups, the number of groups is the cardinality of the dataset |
 | join #1 |`SELECT x.*, small.id4 AS small_id4, v2 FROM x JOIN small USING (id1)` | Joining a large table (x) with a small-sized table on integer type |
 | join #2 |`SELECT x.*, medium.id1 AS medium_id1, medium.id4 AS medium_id4, medium.id5 AS medium_id5, v2 FROM x JOIN medium USING (id2)` | Joining a large table (x) with a medium-sized table on integer type |
 | join #3 |`SELECT x.*, medium.id1 AS medium_id1, medium.id4 AS medium_id4, medium.id5 AS medium_id5, v2 FROM x LEFT JOIN medium USING (id2)` | Left join a large table (x) with a medium-sized table on integer type|
diff --git a/_posts/2024-09-27-sql-only-extensions.md b/_posts/2024-09-27-sql-only-extensions.md
index c9c7025c1f..158e654367 100644
--- a/_posts/2024-09-27-sql-only-extensions.md
+++ b/_posts/2024-09-27-sql-only-extensions.md
@@ -131,7 +131,7 @@ git push

 #### Write Your SQL Macros

-It it likely a bit faster to iterate if you test your macros directly in DuckDB.
+It is likely a bit faster to iterate if you test your macros directly in DuckDB.
 After you have written your SQL, we will move it into the extension.
 The example we will use demonstrates how to pull a dynamic set of columns from a dynamic table name (or a view name!).

diff --git a/_posts/2024-11-29-duckdb-tricks-part-3.md b/_posts/2024-11-29-duckdb-tricks-part-3.md
index e137d3a5d4..7d4b3a0728 100644
--- a/_posts/2024-11-29-duckdb-tricks-part-3.md
+++ b/_posts/2024-11-29-duckdb-tricks-part-3.md
@@ -179,7 +179,7 @@ We have now a table with all the data from January to October, amounting to almo
 ## Reordering Parquet Files

 Suppose we want to analyze the average delay of the [Intercity Direct trains](https://en.wikipedia.org/wiki/Intercity_Direct) operated by the [Nederlandse Spoorwegen (NS)](https://en.wikipedia.org/wiki/Nederlandse_Spoorwegen), measured at the final destination of the train service.
-While we can run this analysis directly on the the `.csv` files, the lack of metadata (such as schema and min-max indexes) will limit the performance.
+While we can run this analysis directly on the `.csv` files, the lack of metadata (such as schema and min-max indexes) will limit the performance.
 Let's measure this in the CLI client by turning on the [timer]({% link docs/api/cli/dot_commands.md %}):

 ```plsql
diff --git a/_posts/2024-12-06-duckdb-tpch-sf100-on-mobile.md b/_posts/2024-12-06-duckdb-tpch-sf100-on-mobile.md
index b28e508245..47383d7ae8 100644
--- a/_posts/2024-12-06-duckdb-tpch-sf100-on-mobile.md
+++ b/_posts/2024-12-06-duckdb-tpch-sf100-on-mobile.md
@@ -78,7 +78,7 @@ The table contains a summary of the DuckDB benchmark results.

 ## Historical Context

-So why did we set out to run these these experiments in the first place?
+So why did we set out to run these experiments in the first place?
 Just a few weeks ago, [CWI](https://cwi.nl/), the birthplace of DuckDB, held a ceremony for the [Dijkstra Fellowship](https://www.cwi.nl/en/events/dijkstra-awards/cwi-lectures-dijkstra-fellowship/).
 The fellowship was awarded to Marcin Żukowski for his pioneering role in the development of database management systems and his successful entrepreneurial career that resulted in systems such as [VectorWise](https://en.wikipedia.org/wiki/Actian_Vector) and [Snowflake](https://en.wikipedia.org/wiki/Snowflake_Inc.).

From d5cad87b71a1701a9b2e4f4509078ef5ad45442b Mon Sep 17 00:00:00 2001
From: Gabor Szarnyas
Date: Sat, 7 Dec 2024 11:01:43 +0100
Subject: [PATCH 3/7] Remove repeated words

---
 docs/extensions/spatial/functions.md      | 2 +-
 docs/extensions/spatial/r-tree_indexes.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/extensions/spatial/functions.md b/docs/extensions/spatial/functions.md
index 8665f89055..c4ac78b465 100644
--- a/docs/extensions/spatial/functions.md
+++ b/docs/extensions/spatial/functions.md
@@ -1702,7 +1702,7 @@ VARCHAR ST_QuadKey (col0 GEOMETRY, col1 INTEGER)
 #### Description

 Compute the [quadkey](https://learn.microsoft.com/en-us/bingmaps/articles/bing-maps-tile-system) for a given lon/lat point at a given level.
-Note that the the parameter order is **longitude**, **latitude**.
+Note that the parameter order is **longitude**, **latitude**.

 `level` has to be between 1 and 23, inclusive.

diff --git a/docs/extensions/spatial/r-tree_indexes.md b/docs/extensions/spatial/r-tree_indexes.md
index 6c903bcd3c..8c30bb076d 100644
--- a/docs/extensions/spatial/r-tree_indexes.md
+++ b/docs/extensions/spatial/r-tree_indexes.md
@@ -109,7 +109,7 @@ EXPLAIN SELECT count(*) FROM t1 WHERE ST_Within(geom, ST_MakeEnvelope(45, 45, 65

 Creating R-trees on top of an already populated table is much faster than first creating the index and then inserting the data. This is because the R-tree will have to periodically rebalance itself and perform a somewhat costly splitting operation when a node reaches max capacity after an insert, potentially causing additional splits to cascade up the tree.
 However, when the R-tree index is created on an already populated table, a special bottom up "bulk loading algorithm" (Sort-Tile-Recursive) is used, which divides all entries into an already balanced tree as the total number of required nodes can be computed from the beginning.
-Additionally, using the bulk loading algorithm tends to create a R-tree with a better structure (less overlap between bounding boxes), which usually leads to better query performance. If you find that the performance of querying the R-tree starts to deteriorate after a large number of of updates or deletions, dropping and re-creating the index might produce a higher quality R-tree.
+Additionally, using the bulk loading algorithm tends to create a R-tree with a better structure (less overlap between bounding boxes), which usually leads to better query performance. If you find that the performance of querying the R-tree starts to deteriorate after a large number of updates or deletions, dropping and re-creating the index might produce a higher quality R-tree.

 ### Memory Usage

From 0edc8a6aa5f65914980753cfadabebb7141736f5 Mon Sep 17 00:00:00 2001
From: Gabor Szarnyas
Date: Sat, 7 Dec 2024 11:02:23 +0100
Subject: [PATCH 4/7] Single-file PDF: Patch the eisvogel template to provide '\pandocbounded' for Pandoc

---
 single-file-document/templates/eisvogel2.tex | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/single-file-document/templates/eisvogel2.tex b/single-file-document/templates/eisvogel2.tex
index 64a48a9d94..cf1c2346f8 100644
--- a/single-file-document/templates/eisvogel2.tex
+++ b/single-file-document/templates/eisvogel2.tex
@@ -389,15 +389,17 @@
 $if(graphics)$
 \usepackage{graphicx}
 \makeatletter
-\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
-\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
-\makeatother
-% Scale images if necessary, so that they will not overflow the page
-% margins by default, and it is still possible to overwrite the defaults
-% using explicit options in \includegraphics[width, height, ...]{}
-\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
+\newsavebox\pandoc@box
+\newcommand*\pandocbounded[1]{% scales image to fit in text height/width
+  \sbox\pandoc@box{#1}%
+  \Gscale@div\@tempa{\textheight}{\dimexpr\ht\pandoc@box+\dp\pandoc@box\relax}%
+  \Gscale@div\@tempb{\linewidth}{\wd\pandoc@box}%
+  \ifdim\@tempb\p@<\@tempa\p@\let\@tempa\@tempb\fi% select the smaller of both
+  \ifdim\@tempa\p@<\p@\scalebox{\@tempa}{\usebox\pandoc@box}%
+  \else\usebox{\pandoc@box}%
+  \fi%
+}
 % Set default figure placement to htbp
-\makeatletter
 % Make use of float-package and set default placement for figures to H.
 % The option H means 'PUT IT HERE' (as opposed to the standard h option which means 'You may put it here if you like').
 \usepackage{float}

From ef494cfc2ea7f75b82c888fcb73986605d6ad20d Mon Sep 17 00:00:00 2001
From: Gabor Szarnyas
Date: Sat, 7 Dec 2024 11:19:15 +0100
Subject: [PATCH 5/7] Comment

---
 single-file-document/concatenate_to_single_file.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/single-file-document/concatenate_to_single_file.py b/single-file-document/concatenate_to_single_file.py
index 92d25e2305..ea72623598 100644
--- a/single-file-document/concatenate_to_single_file.py
+++ b/single-file-document/concatenate_to_single_file.py
@@ -105,7 +105,7 @@ def adjust_links_in_doc_body(doc_body):
         "]({% link docs/python/overview.md %})"
     )

-    # replace "`, `" (with its typical surroundings) with "`,` " to allow line breaking
+    # replace "`, `" (with the surrounding characters used for emphasis) with "`,` " to allow line breaking
     # see https://stackoverflow.com/questions/76951040/pandoc-preserve-whitespace-in-inline-code
     doc_body = doc_body.replace("`*`, `*`", "`*`,` *`")

From c9b91b63483fc804db7cb42afc4ed96f2d6ed77f Mon Sep 17 00:00:00 2001
From: Gabor Szarnyas
Date: Sat, 7 Dec 2024 11:19:40 +0100
Subject: [PATCH 6/7] Single-file: Remove div tags

---
 single-file-document/concatenate_to_single_file.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/single-file-document/concatenate_to_single_file.py b/single-file-document/concatenate_to_single_file.py
index ea72623598..c6913e69b2 100644
--- a/single-file-document/concatenate_to_single_file.py
+++ b/single-file-document/concatenate_to_single_file.py
@@ -115,6 +115,9 @@ def adjust_links_in_doc_body(doc_body):
     # replace links to data sets to point to the website
     doc_body = doc_body.replace("](/data/", "](https://duckdb.org/data/")

+    # remove '<div>' HTML tags
+    doc_body = re.sub(r'<div[^>]*?>[\n ]*([^§]*?)[\n ]*</div>', r'\1', doc_body, flags=re.MULTILINE)
+
     # replace '<img>' HTML tags with Markdown's '![]()' construct
     doc_body = re.sub(r'', r'![](\1)', doc_body, flags=re.MULTILINE)

From 76ff6005f82173ef7f10de0954cc70abb4772ab7 Mon Sep 17 00:00:00 2001
From: Gabor Szarnyas
Date: Sat, 7 Dec 2024 11:22:34 +0100
Subject: [PATCH 7/7] Single-file build: Add newline after images

---
 single-file-document/concatenate_to_single_file.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/single-file-document/concatenate_to_single_file.py b/single-file-document/concatenate_to_single_file.py
index c6913e69b2..076c5a7976 100644
--- a/single-file-document/concatenate_to_single_file.py
+++ b/single-file-document/concatenate_to_single_file.py
@@ -119,7 +119,7 @@ def adjust_links_in_doc_body(doc_body):
     doc_body = re.sub(r'<div[^>]*?>[\n ]*([^§]*?)[\n ]*</div>', r'\1', doc_body, flags=re.MULTILINE)

     # replace '<img>' HTML tags with Markdown's '![]()' construct
-    doc_body = re.sub(r'', r'![](\1)', doc_body, flags=re.MULTILINE)
+    doc_body = re.sub(r'', r'![](\1)\n', doc_body, flags=re.MULTILINE)

     # use relative path for images in Markdown
     doc_body = doc_body.replace("](/images", "](../images")