Implemented reading binary data in chunks
hosseinmoein committed May 22, 2024
1 parent f762a0b commit 46a2d17
Showing 8 changed files with 318 additions and 144 deletions.
23 changes: 12 additions & 11 deletions docs/HTML/read.html
@@ -41,7 +41,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
bool
read(const char *file_name,
@@ -76,7 +76,7 @@
.<BR>
.<BR>
All empty lines or lines starting with # will be skipped.<BR>
<B>NOTE:</B> Only in CSV2 format you can specify <I>starting_row</I> and <I>num_rows</I>. This way you can read very large files (that don't fit into memory) in chunks and process them. In this case the reading starts at <I>starting_row</I> and continues until either <I>num_rows</I> rows is read or EOF is reached.<BR><BR>
<B>NOTE:</B> Only in CSV2 and binary formats you can specify <I>starting_row</I> and <I>num_rows</I>. This way you can read very large files (that don't fit into memory) in chunks and process them. In this case the reading starts at <I>starting_row</I> and continues until either <I>num_rows</I> rows is read or EOF is reached.<BR><BR>

-----------------------------------------------<BR>
<B>JSON</B> file format looks like this:<BR>
@@ -96,7 +96,8 @@
<LI>Fields in column dictionaries must be in N (name), T (type), D (data) order</LI>
</OL>
-----------------------------------------------<BR>
<B>Binary</B> format is a proprietary format, that is optimized for compressing algorithms. It also takes care of different endianness. The file is always written with the same endianness as the writing host. But it will be adjusted accordingly when reading it from a different host with a different endianness.<BR><BR>
<B>Binary</B> format is a proprietary format, that is optimized for compressing algorithms. It also takes care of different endianness. The file is always written with the same endianness as the writing host. But it will be adjusted accordingly when reading it from a different host with a different endianness.<BR>
<B>NOTE:</B> Only in CSV2 and binary formats you can specify <I>starting_row</I> and <I>num_rows</I>. This way you can read very large files (that don't fit into memory) in chunks and process them. In this case the reading starts at <I>starting_row</I> and continues until either <I>num_rows</I> rows is read or end-of-column is reached.<BR><BR>

-----------------------------------------------<BR>
In all formats the following data types are supported:
@@ -161,7 +162,7 @@

<B>NOTE:</B>: This version of read() can be substantially faster, especially for larger files, than if you open the file yourself and use the read() version below.
</td>
<td width="30%">
<td width="31%">
<B>file_name</B>: Complete path to the file<BR>
<B>iof</B>: Specifies the I/O format. The default is CSV<BR>
<B>columns_only</B>: If true, the index column is not read. You may want to do that to read multiple files into the same DataFrame. If <I>columns_only</I> is false the index column must exist in the stream. If <I>columns_only</I> is true the index column may or may not exist<BR>
@@ -171,7 +172,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
template&lt;typename S&gt;
bool
@@ -183,12 +184,12 @@
<td>
Same as read() above, but takes a reference to a stream
</td>
<td width="30%">
<td width="31%">
</td>
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
std::future&lt;bool&gt;
read_async(const char *file_name,
@@ -199,12 +200,12 @@
<td>
Same as read() above, but executed asynchronously
</td>
<td width="30%">
<td width="31%">
</td>
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
template&lt;typename S&gt;
std::future&lt;bool&gt;
@@ -216,7 +217,7 @@
<td>
Same as read_async() above, but takes a reference to a stream
</td>
<td width="30%">
<td width="31%">
</td>
</tr>

@@ -239,7 +240,7 @@
</OL>

</td>
<td width="30%">
<td width="31%">
<B>data_frame</B>: A null terminated string that was generated by calling to_string(). It must contain a complete DataFrame<BR>
</td>
</tr>
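A usage sketch of the chunked reading described above: pull a very large file into memory one window at a time. This assumes the read() overload documented on this page, taking file name, io_format, columns_only, starting_row and num_rows; the file name, index type, chunk size, and the end-of-data check are invented for the illustration.

#include <DataFrame/DataFrame.h>

#include <cstddef>

using namespace hmdf;
using ULDataFrame = StdDataFrame<unsigned long>;

int main()  {

    constexpr std::size_t   chunk_size = 1'000'000;  // rows per chunk (made up)

    for (std::size_t start = 0; ; start += chunk_size)  {
        ULDataFrame df;

        // Assumed parameter order, per the parameter column above:
        // file name, format, columns_only, starting_row, num_rows.
        df.read("very_large_file.dat", io_format::binary,
                false, start, chunk_size);
        if (df.get_index().empty())  break;  // nothing more to read
        // ... process this chunk of at most chunk_size rows ...
    }
    return (0);
}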
20 changes: 10 additions & 10 deletions docs/HTML/write.html
@@ -41,7 +41,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
template&lt;typename S, typename ... Ts&gt;
bool
@@ -52,7 +52,7 @@
long max_recs = std::numeric_limits<long>::max()) const; </font>
</B></PRE>
</td>
<td width = "33.3%">
<td width = "35%">
It outputs the content of DataFrame into the stream o. Currently 4 formats (i.e. csv, csv2, json, binary) are supported specified by the iof parameter.<BR><BR><BR>
The <B>CSV</B> file format is written:<BR>
<PRE>
@@ -151,7 +151,7 @@
</PRE>

</td>
<td width="30%">
<td width="31%">
<B>S</B>: Output stream type<BR>
<B>Ts</B>: The list of types for all columns. A type should be specified only once<BR>
<B>o</B>: Reference to an streamable object (e.g. cout, file, ...)<BR>
@@ -163,7 +163,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
template&lt;typename ... Ts&gt;
std::future&lt;bool&gt;
@@ -174,7 +174,7 @@
long max_recs = std::numeric_limits<long>::max()) const; </font>
</B></PRE>
</td>
<td width = "33.3%">
<td width = "35%">
Same as write() above, but it takes a file name<BR><BR>
<B>NOTE:</B>: This version of write() can be substantially faster, especially for larger files, than if you open the file yourself and use the write() version above.
</td>
@@ -183,7 +183,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
template&lt;typename S, typename ... Ts&gt;
std::future&lt;bool&gt;
@@ -202,7 +202,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
template&lt;typename ... Ts&gt;
std::future&lt;bool&gt;
@@ -221,14 +221,14 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
template&lt;typename ... Ts&gt;
std::string
to_string(std::streamsize precision = 12) const; </font>
</B></PRE>
</td>
<td width = "33.3%">
<td width = "35%">
This is a convenient function (simple implementation) to convert a DataFrame into a string that could be restored later by calling from_string(). It utilizes the write() member function of DataFrame.<BR>
These functions could be used to transmit a DataFrame from one place to another or store a DataFrame in databases, caches, ... <BR><BR>

@@ -240,7 +240,7 @@
</OL>

</td>
<td width = "30%">
<td width = "31%">
<B>Ts</B>: The list of types for all columns. A type should be specified only once<BR>
<B>precision</B>: Specifies the precision for floating point numbers<BR>
</td>
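As a small illustration of the to_string()/from_string() pair documented above: serialize a frame to a string and restore it, e.g. to move it across a process boundary or park it in a cache. The index type, column name, and data are invented; the load_data() call is an assumption about the library's loading API and is not part of this diff.

#include <DataFrame/DataFrame.h>

#include <string>
#include <utility>
#include <vector>

using namespace hmdf;
using ULDataFrame = StdDataFrame<unsigned long>;

int main()  {

    ULDataFrame                 df;
    std::vector<unsigned long>  idx = { 1, 2, 3, 4, 5 };
    std::vector<double>         price = { 1.5, 2.5, 3.5, 4.5, 5.5 };

    df.load_data(std::move(idx), std::make_pair("price", price));

    // Round-trip through a string with 6 digits of floating-point precision.
    const std::string   buf = df.to_string<double>(6);
    ULDataFrame         restored;

    restored.from_string(buf.c_str());
    return (0);
}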
6 changes: 3 additions & 3 deletions include/DataFrame/Internals/DataFrame_misc.tcc
@@ -227,11 +227,11 @@ DataFrame<I, H>::print_binary_functor_<Ts ...>::operator() (const T &vec) {
std::strncpy(col_name, name, sizeof(col_name));
os.write(col_name, sizeof(col_name));
if constexpr (std::is_same_v<ValueType, std::string>)
- _write_binary_string_(os, vec);
+ _write_binary_string_(os, vec, start_row, end_row);
else if constexpr (std::is_same_v<ValueType, DateTime>)
- _write_binary_datetime_(os, vec);
+ _write_binary_datetime_(os, vec, start_row, end_row);
else
- _write_binary_data_(os, vec);
+ _write_binary_data_(os, vec, start_row, end_row);

return;
}
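For context on the hunk above: the functor selects a type-specific binary writer at compile time and now forwards the requested row window. Below is a self-contained sketch of that dispatch pattern with stand-in writer functions; it is not the library's actual _write_binary_string_/_write_binary_datetime_/_write_binary_data_ code.

#include <cstddef>
#include <cstdint>
#include <ostream>
#include <string>
#include <type_traits>
#include <vector>

// Stand-in writers, illustrative only: serialize rows [start_row, end_row).
template<typename T>
void write_binary_data(std::ostream &os, const std::vector<T> &vec,
                       std::size_t start_row, std::size_t end_row)  {

    const std::uint64_t n = end_row - start_row;

    os.write(reinterpret_cast<const char *>(&n), sizeof(n));
    os.write(reinterpret_cast<const char *>(vec.data() + start_row),
             static_cast<std::streamsize>(n * sizeof(T)));
}

void write_binary_string(std::ostream &os, const std::vector<std::string> &vec,
                         std::size_t start_row, std::size_t end_row)  {

    for (std::size_t i = start_row; i < end_row; ++i)  {
        const std::uint64_t len = vec[i].size();

        os.write(reinterpret_cast<const char *>(&len), sizeof(len));
        os.write(vec[i].data(), static_cast<std::streamsize>(len));
    }
}

// Compile-time dispatch on the column's value type, mirroring the
// if-constexpr structure of print_binary_functor_ shown above.
template<typename T>
void write_column(std::ostream &os, const std::vector<T> &vec,
                  std::size_t start_row, std::size_t end_row)  {

    if constexpr (std::is_same_v<T, std::string>)
        write_binary_string(os, vec, start_row, end_row);
    else
        write_binary_data(os, vec, start_row, end_row);
}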
5 changes: 4 additions & 1 deletion include/DataFrame/Internals/DataFrame_private_decl.h
@@ -61,7 +61,10 @@ using JoinSortingPair = std::pair<const T *, size_type>;
// ----------------------------------------------------------------------------

void read_json_(std::istream &file, bool columns_only);
- void read_binary_(std::istream &file);
+ void read_binary_(std::istream &file,
+                   bool columns_only,
+                   size_type starting_row,
+                   size_type num_rows);
void read_csv_(std::istream &file, bool columns_only);
void read_csv2_(std::istream &file,
bool columns_only,
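The new parameters on read_binary_() let the front-end reader hand the requested window straight to the binary code path. The snippet below is a rough mock of that forwarding shape with invented names and types; the real dispatch lives elsewhere in the library and is not part of this diff.

#include <istream>
#include <limits>

// Mock of the forwarding only; names and types here are illustrative.
enum class io_format  { csv, csv2, json, binary };

struct Reader  {

    using size_type = unsigned long long;

    void read(std::istream &strm, io_format iof, bool columns_only,
              size_type starting_row = 0,
              size_type num_rows = std::numeric_limits<size_type>::max())  {

        if (iof == io_format::binary)
            read_binary_(strm, columns_only, starting_row, num_rows);
        // Per the read.html note above, CSV2 takes the same chunking
        // parameters; the other formats do not use starting_row/num_rows.
    }

private:
    void read_binary_(std::istream &, bool, size_type, size_type)  {
        // ... deserialize only the requested slice of each column ...
    }
};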