Implemented reading binary data in chunks
hosseinmoein committed May 22, 2024
1 parent f762a0b commit 46a2d17
Showing 8 changed files with 318 additions and 144 deletions.
23 changes: 12 additions & 11 deletions docs/HTML/read.html
@@ -41,7 +41,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
bool
read(const char *file_name,
@@ -76,7 +76,7 @@
.<BR>
.<BR>
All empty lines or lines starting with # will be skipped.<BR>
<B>NOTE:</B> Only in CSV2 format you can specify <I>starting_row</I> and <I>num_rows</I>. This way you can read very large files (that don't fit into memory) in chunks and process them. In this case the reading starts at <I>starting_row</I> and continues until either <I>num_rows</I> rows is read or EOF is reached.<BR><BR>
<B>NOTE:</B> Only in CSV2 and binary formats you can specify <I>starting_row</I> and <I>num_rows</I>. This way you can read very large files (that don't fit into memory) in chunks and process them. In this case the reading starts at <I>starting_row</I> and continues until either <I>num_rows</I> rows is read or EOF is reached.<BR><BR>

-----------------------------------------------<BR>
<B>JSON</B> file format looks like this:<BR>
@@ -96,7 +96,8 @@
<LI>Fields in column dictionaries must be in N (name), T (type), D (data) order</LI>
</OL>
-----------------------------------------------<BR>
<B>Binary</B> format is a proprietary format, that is optimized for compressing algorithms. It also takes care of different endianness. The file is always written with the same endianness as the writing host. But it will be adjusted accordingly when reading it from a different host with a different endianness.<BR><BR>
<B>Binary</B> format is a proprietary format, that is optimized for compressing algorithms. It also takes care of different endianness. The file is always written with the same endianness as the writing host. But it will be adjusted accordingly when reading it from a different host with a different endianness.<BR>
<B>NOTE:</B> Only in CSV2 and binary formats you can specify <I>starting_row</I> and <I>num_rows</I>. This way you can read very large files (that don't fit into memory) in chunks and process them. In this case the reading starts at <I>starting_row</I> and continues until either <I>num_rows</I> rows is read or end-of-column is reached.<BR><BR>

-----------------------------------------------<BR>
In all formats the following data types are supported:
@@ -161,7 +162,7 @@

<B>NOTE:</B>: This version of read() can be substantially faster, especially for larger files, than if you open the file yourself and use the read() version below.
</td>
<td width="30%">
<td width="31%">
<B>file_name</B>: Complete path to the file<BR>
<B>iof</B>: Specifies the I/O format. The default is CSV<BR>
<B>columns_only</B>: If true, the index column is not read. You may want to do that to read multiple files into the same DataFrame. If <I>columns_only</I> is false the index column must exist in the stream. If <I>columns_only</I> is true the index column may or may not exist<BR>
@@ -171,7 +172,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
template&lt;typename S&gt;
bool
@@ -183,12 +184,12 @@
<td>
Same as read() above, but takes a reference to a stream
</td>
<td width="30%">
<td width="31%">
</td>
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
std::future&lt;bool&gt;
read_async(const char *file_name,
@@ -199,12 +200,12 @@
<td>
Same as read() above, but executed asynchronously
</td>
<td width="30%">
<td width="31%">
</td>
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
template&lt;typename S&gt;
std::future&lt;bool&gt;
@@ -216,7 +217,7 @@
<td>
Same as read_async() above, but takes a reference to a stream
</td>
<td width="30%">
<td width="31%">
</td>
</tr>

@@ -239,7 +240,7 @@
</OL>

</td>
<td width="30%">
<td width="31%">
<B>data_frame</B>: A null terminated string that was generated by calling to_string(). It must contain a complete DataFrame<BR>
</td>
</tr>
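A usage sketch of the chunked reading described above: pull a very large file into memory one window at a time. This assumes the read() overload documented on this page, taking file name, io_format, columns_only, starting_row and num_rows; the file name, index type, chunk size, and the end-of-data check are invented for the illustration.

#include <DataFrame/DataFrame.h>

#include <cstddef>

using namespace hmdf;
using ULDataFrame = StdDataFrame<unsigned long>;

int main()  {

    constexpr std::size_t   chunk_size = 1'000'000;  // rows per chunk (made up)

    for (std::size_t start = 0; ; start += chunk_size)  {
        ULDataFrame df;

        // Assumed parameter order, per the parameter column above:
        // file name, format, columns_only, starting_row, num_rows.
        df.read("very_large_file.dat", io_format::binary,
                false, start, chunk_size);
        if (df.get_index().empty())  break;  // nothing more to read
        // ... process this chunk of at most chunk_size rows ...
    }
    return (0);
}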
20 changes: 10 additions & 10 deletions docs/HTML/write.html
@@ -41,7 +41,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
template&lt;typename S, typename ... Ts&gt;
bool
@@ -52,7 +52,7 @@
long max_recs = std::numeric_limits<long>::max()) const; </font>
</B></PRE>
</td>
<td width = "33.3%">
<td width = "35%">
It outputs the content of DataFrame into the stream o. Currently 4 formats (i.e. csv, csv2, json, binary) are supported specified by the iof parameter.<BR><BR><BR>
The <B>CSV</B> file format is written:<BR>
<PRE>
@@ -151,7 +151,7 @@
</PRE>

</td>
<td width="30%">
<td width="31%">
<B>S</B>: Output stream type<BR>
<B>Ts</B>: The list of types for all columns. A type should be specified only once<BR>
<B>o</B>: Reference to an streamable object (e.g. cout, file, ...)<BR>
@@ -163,7 +163,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
template&lt;typename ... Ts&gt;
std::future&lt;bool&gt;
@@ -174,7 +174,7 @@
long max_recs = std::numeric_limits<long>::max()) const; </font>
</B></PRE>
</td>
<td width = "33.3%">
<td width = "35%">
Same as write() above, but it takes a file name<BR><BR>
<B>NOTE:</B>: This version of write() can be substantially faster, especially for larger files, than if you open the file yourself and use the write() version above.
</td>
@@ -183,7 +183,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
template&lt;typename S, typename ... Ts&gt;
std::future&lt;bool&gt;
@@ -202,7 +202,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
template&lt;typename ... Ts&gt;
std::future&lt;bool&gt;
@@ -221,14 +221,14 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<td bgcolor="blue" width = "30%"> <font color="white">
<PRE><B>
template&lt;typename ... Ts&gt;
std::string
to_string(std::streamsize precision = 12) const; </font>
</B></PRE>
</td>
<td width = "33.3%">
<td width = "35%">
This is a convenient function (simple implementation) to convert a DataFrame into a string that could be restored later by calling from_string(). It utilizes the write() member function of DataFrame.<BR>
These functions could be used to transmit a DataFrame from one place to another or store a DataFrame in databases, caches, ... <BR><BR>

@@ -240,7 +240,7 @@
</OL>

</td>
<td width = "30%">
<td width = "31%">
<B>Ts</B>: The list of types for all columns. A type should be specified only once<BR>
<B>precision</B>: Specifies the precision for floating point numbers<BR>
</td>
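As a small illustration of the to_string()/from_string() pair documented above: serialize a frame to a string and restore it, e.g. to move it across a process boundary or park it in a cache. The index type, column name, and data are invented; the load_data() call is an assumption about the library's loading API and is not part of this diff.

#include <DataFrame/DataFrame.h>

#include <string>
#include <utility>
#include <vector>

using namespace hmdf;
using ULDataFrame = StdDataFrame<unsigned long>;

int main()  {

    ULDataFrame                 df;
    std::vector<unsigned long>  idx = { 1, 2, 3, 4, 5 };
    std::vector<double>         price = { 1.5, 2.5, 3.5, 4.5, 5.5 };

    df.load_data(std::move(idx), std::make_pair("price", price));

    // Round-trip through a string with 6 digits of floating-point precision.
    const std::string   buf = df.to_string<double>(6);
    ULDataFrame         restored;

    restored.from_string(buf.c_str());
    return (0);
}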
6 changes: 3 additions & 3 deletions include/DataFrame/Internals/DataFrame_misc.tcc
@@ -227,11 +227,11 @@ DataFrame<I, H>::print_binary_functor_<Ts ...>::operator() (const T &vec) {
std::strncpy(col_name, name, sizeof(col_name));
os.write(col_name, sizeof(col_name));
if constexpr (std::is_same_v<ValueType, std::string>)
- _write_binary_string_(os, vec);
+ _write_binary_string_(os, vec, start_row, end_row);
else if constexpr (std::is_same_v<ValueType, DateTime>)
- _write_binary_datetime_(os, vec);
+ _write_binary_datetime_(os, vec, start_row, end_row);
else
- _write_binary_data_(os, vec);
+ _write_binary_data_(os, vec, start_row, end_row);

return;
}
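For context on the hunk above: the functor selects a type-specific binary writer at compile time and now forwards the requested row window. Below is a self-contained sketch of that dispatch pattern with stand-in writer functions; it is not the library's actual _write_binary_string_/_write_binary_datetime_/_write_binary_data_ code.

#include <cstddef>
#include <cstdint>
#include <ostream>
#include <string>
#include <type_traits>
#include <vector>

// Stand-in writers, illustrative only: serialize rows [start_row, end_row).
template<typename T>
void write_binary_data(std::ostream &os, const std::vector<T> &vec,
                       std::size_t start_row, std::size_t end_row)  {

    const std::uint64_t n = end_row - start_row;

    os.write(reinterpret_cast<const char *>(&n), sizeof(n));
    os.write(reinterpret_cast<const char *>(vec.data() + start_row),
             static_cast<std::streamsize>(n * sizeof(T)));
}

void write_binary_string(std::ostream &os, const std::vector<std::string> &vec,
                         std::size_t start_row, std::size_t end_row)  {

    for (std::size_t i = start_row; i < end_row; ++i)  {
        const std::uint64_t len = vec[i].size();

        os.write(reinterpret_cast<const char *>(&len), sizeof(len));
        os.write(vec[i].data(), static_cast<std::streamsize>(len));
    }
}

// Compile-time dispatch on the column's value type, mirroring the
// if-constexpr structure of print_binary_functor_ shown above.
template<typename T>
void write_column(std::ostream &os, const std::vector<T> &vec,
                  std::size_t start_row, std::size_t end_row)  {

    if constexpr (std::is_same_v<T, std::string>)
        write_binary_string(os, vec, start_row, end_row);
    else
        write_binary_data(os, vec, start_row, end_row);
}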
5 changes: 4 additions & 1 deletion include/DataFrame/Internals/DataFrame_private_decl.h
@@ -61,7 +61,10 @@ using JoinSortingPair = std::pair<const T *, size_type>;
// ----------------------------------------------------------------------------

void read_json_(std::istream &file, bool columns_only);
- void read_binary_(std::istream &file);
+ void read_binary_(std::istream &file,
+                   bool columns_only,
+                   size_type starting_row,
+                   size_type num_rows);
void read_csv_(std::istream &file, bool columns_only);
void read_csv2_(std::istream &file,
bool columns_only,
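The new parameters on read_binary_() let the front-end reader hand the requested window straight to the binary code path. The snippet below is a rough mock of that forwarding shape with invented names and types; the real dispatch lives elsewhere in the library and is not part of this diff.

#include <istream>
#include <limits>

// Mock of the forwarding only; names and types here are illustrative.
enum class io_format  { csv, csv2, json, binary };

struct Reader  {

    using size_type = unsigned long long;

    void read(std::istream &strm, io_format iof, bool columns_only,
              size_type starting_row = 0,
              size_type num_rows = std::numeric_limits<size_type>::max())  {

        if (iof == io_format::binary)
            read_binary_(strm, columns_only, starting_row, num_rows);
        // Per the read.html note above, CSV2 takes the same chunking
        // parameters; the other formats do not use starting_row/num_rows.
    }

private:
    void read_binary_(std::istream &, bool, size_type, size_type)  {
        // ... deserialize only the requested slice of each column ...
    }
};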