Commit

Merge pull request #301 from hosseinmoein/Hossein/BinaryFormat
Implementing reading/writing in binary
hosseinmoein authored May 20, 2024
2 parents 1adfdbf + f7f6358 commit 727303c
Showing 19 changed files with 1,102 additions and 118 deletions.
Binary file added data/SHORT_IBM.dat
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/HTML/DataFrame.html
Expand Up @@ -101,7 +101,7 @@ <H2><font color="blue">Summary</font></H2>
&nbsp;&nbsp;&nbsp;&nbsp;<span style='color:#800000; font-weight:bold; '>class</span> DataFrame;</B><BR><BR>

<B>I</B> specifies the <I>index</I> column type. Index column in a DataFrame is unlike an index in a SQL database. SQL database index makes access efficient. It doesn't give you any more information. The index column in a DataFrame is metadata about the data in the DataFrame. Each entry in the index describes the given row. It could be time, frequency, …, or a set of descriptors in a struct (like temperature, altitude, …).<BR>
<B>H</B> specifies a heterogenous vector type to contain DataFrame columns &#8212; don't get hang up on this too much, instead use the convenient typedef's in <a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/DataFrameTypes.html">DataFrame Library Types</a>.<BR>
<B>H</B> specifies a heterogenous vector type to contain DataFrame columns &#8212; don't get hang up on this too much, instead use the convenient typedef's in <a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/DataFrameTypes.html">DataFrame Library Types</a>. <B>H</B> is a relatively complex construct. You do not need to fully understand <B>H</B> to use the DataFrame library.<BR>
<B>H</B> can only be:<BR>
<UL>
<LI><B>HeteroVector<span style='color:#808030; '>&lt;</span><span style='color:#666616;'>std</span><span style='color:#800080;'>::</span><span style='color:#603000;'>size_t</span> A <span style='color:#808030;'>=</span> <span style='color:#008c00;'>0</span><span style='color:#808030;'>></span></B>: This is an actual heterogenous vector that would contain data. This will result in a "standard" data frame</LI>
Expand Down
2 changes: 1 addition & 1 deletion docs/HTML/io_format.html
Expand Up @@ -47,7 +47,7 @@
csv2 = 2, // Regular csv format (similar to Pandas)
json = 3,
hdf5 = 4, // Not Implemented
binary = 5, // Not Implemented
binary = 5,
}; </B></PRE> </font>
</td>
<td>
Expand Down
8 changes: 5 additions & 3 deletions docs/HTML/read.html
Expand Up @@ -95,7 +95,9 @@
<LI>Column "INDEX" must be the first column, if it exists</LI>
<LI>Fields in column dictionaries must be in N (name), T (type), D (data) order</LI>
</OL>
<BR>
-----------------------------------------------<BR>
<B>Binary</B> format is a proprietary format, that is optimized for compressing algorithms. It also takes care of different endianness. The file is always written with the same endianness as the writing host. But it will be adjusted accordingly when reading it from a different host with a different endianness.<BR><BR>

-----------------------------------------------<BR>
In all formats the following data types are supported:
<PRE>
Expand All @@ -117,8 +119,8 @@
string -- char *
bool -- bool
DateTime -- DateTime data in format of
&lt;Epoch seconds&gt;.&lt;nanoseconds&gt;
(1516179600.874123908)
&lt;Epoch seconds&gt;.&lt;nanoseconds&gt;
(1516179600.874123908)
</PRE>
In case of io_format::csv2 and io_format::csv the following additional types are also supported:
<PRE>
Expand Down
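The Binary format note in this hunk says a file is written in the writing host's byte order and adjusted when read on a host with a different byte order. A minimal sketch of that idea follows, assuming the reader already knows which byte order the writer used (how the real format records this is not visible in the diff). The names `byte_swap`, `host_is_little_endian`, and `to_host_order` are hypothetical and are not DataFrame APIs.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: reverse the bytes of a trivially copyable value.
template<typename T>
T byte_swap(T value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "byte_swap needs a trivially copyable type");

    unsigned char bytes[sizeof(T)];

    std::memcpy(bytes, &value, sizeof(T));
    std::reverse(bytes, bytes + sizeof(T));
    std::memcpy(&value, bytes, sizeof(T));
    return value;
}

// Hypothetical helper: runtime check of the host byte order.
inline bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char       first = 0;

    std::memcpy(&first, &probe, 1);
    return first == 1;  // Low-order byte stored first means little-endian
}

// Assumption: the reader knows the writer's byte order (for example from
// a flag stored in the file). Values are swapped only on a mismatch.
template<typename T>
T to_host_order(T value, bool written_little_endian) {
    return (written_little_endian == host_is_little_endian())
               ? value
               : byte_swap(value);
}
```

With this approach the writer never pays for a swap; only a reader on a host with the opposite byte order does.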
155 changes: 83 additions & 72 deletions docs/HTML/write.html
Expand Up @@ -41,7 +41,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue"> <font color="white">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<PRE><B>
template&lt;typename S, typename ... Ts&gt;
bool
Expand All @@ -52,39 +52,41 @@
long max_recs = std::numeric_limits<long>::max()) const; </font>
</B></PRE>
</td>
<td>
It outputs the content of DataFrame into the stream o. Currently 3 formats (i.e. csv, csv2, json) are supported specified by the iof parameter.<BR><BR><BR>
<td width = "33.3%">
It outputs the content of DataFrame into the stream o. Currently 4 formats (i.e. csv, csv2, json, binary) are supported specified by the iof parameter.<BR><BR><BR>
The <B>CSV</B> file format is written:<BR>
<PRE>
INDEX:&lt;Number of data points&gt;:&lt;Comma delimited list of values&gt;
&lt;Column1 name&gt;:&lt;Number of data points&gt;:&lt;Column1 type&gt;:&lt;Comma delimited list of values&gt;
&lt;Column2 name&gt;:&lt;Number of data points&gt;:&lt;Column2 type&gt;:&lt;Comma delimited list of values&gt;
.
.
.
INDEX:&lt;Number of data points&gt;:&lt;Comma delimited list of values&gt;
&lt;Col1 name&gt;:&lt;Number of data points&gt;:&lt;Col1 type&gt;:&lt;Comma delimited list of values&gt;
&lt;Col2 name&gt;:&lt;Number of data points&gt;:&lt;Col2 type&gt;:&lt;Comma delimited list of values&gt;
.
.
.
</PRE>
All empty lines or lines starting with # will be skipped. For examples, see files in test directory<BR><BR>

The <B>CSV2</B> file format must be (this is similar to Pandas csv format):<BR>
<PRE>
INDEX:&lt;Number of data points&gt;:&lt;Index type&gt;:,&lt;Column1 name&gt;:&lt;Number of data points&gt;:&lt;Column1 type&gt;,&lt;Column2 name&gt;:&lt;Number of data points&gt;:&lt;Column2 type&gt;, . . .
Comma delimited rows of values
.
.
.
INDEX:&lt;Number of data points&gt;:&lt;Index type&gt;:,&lt;Column1 name&gt;:
&lt;Number of data points&gt;:&lt;Column1 type&gt;,&lt;Column2 name&gt;:
&lt;Number of data points&gt;:&lt;Column2 type&gt;, . . .
Comma delimited rows of values
.
.
.
</PRE>
All empty lines or lines starting with # will be skipped. For examples, see IBM and FORD files in test directory<BR><BR>

The <B>JSON</B> file format looks like this:<BR>
<PRE>
{
"INDEX":{"N":3,"T":"ulong","D":[123450,123451,123452]},
"col_3":{"N":3,"T":"double","D":[15.2,16.34,17.764]},
"col_4":{"N":3,"T":"int","D":[22,23,24]},
"col_str":{"N":3,"T":"string","D":["11","22","33"]},
"col_2":{"N":3,"T":"double","D":[8,9.001,10]},
"col_1":{"N":3,"T":"double","D":[1,2,3.456]}
}
{
"INDEX":{"N":3,"T":"ulong","D":[123450,123451,123452]},
"col_3":{"N":3,"T":"double","D":[15.2,16.34,17.764]},
"col_4":{"N":3,"T":"int","D":[22,23,24]},
"col_str":{"N":3,"T":"string","D":["11","22","33"]},
"col_2":{"N":3,"T":"double","D":[8,9.001,10]},
"col_1":{"N":3,"T":"double","D":[1,2,3.456]}
}
</PRE>
Please note DataFrame json does not follow json spec 100%. In json, there is no particular order in dictionary fields. But in DataFrame json:<BR>
<OL>
Expand All @@ -93,66 +95,75 @@
</OL>
<BR>
<BR>

<B>Binary</B> format is a proprietary format, that is optimized for compressing algorithms. It also takes care of different endianness. The file is always written with the same endianness as the writing host. But it will be adjusted accordingly when reading it from a different host with a different endianness.<BR><BR><BR>

In all formats the following data types are supported:
<PRE>
float -- float
double -- double
longdouble -- long double
short -- short int
ushort -- unsigned short int
int -- int
uint -- unsigned int
long -- long int
longlong -- long long int
ulong -- unsigned long int
ulonglong -- unsigned long long int
char -- char
uchar -- unsigned char
string -- std::string
string -- const char *
string -- char *
bool -- bool
DateTime -- DateTime data in format of &lt;Epoch seconds&gt;.&lt;nanoseconds&gt; (1516179600.874123908)
float -- float
double -- double
longdouble -- long double
short -- short int
ushort -- unsigned short int
int -- int
uint -- unsigned int
long -- long int
longlong -- long long int
ulong -- unsigned long int
ulonglong -- unsigned long long int
char -- char
uchar -- unsigned char
string -- std::string
string -- const char *
string -- char *
bool -- bool
DateTime -- DateTime data in format of
&lt;Epoch seconds&gt;.&lt;nanoseconds&gt;
(1516179600.874123908)
</PRE>
In case of io_format::csv2 and io_format::csv the following additional types are also supported:
<PRE>
dbl_vec -- A vector of double precision values, The vector is printed as "s[d1|d2|...]"
where s is the size of the vector and d's are the double values.
str_vec -- A vector of std::string values, The vector is printed as "s[str1|str2|...]"
where s is the size of the vector and str's are the strings.
dbl_set -- A set of double precision values, The set is printed as "s[d1|d2|...]"
where s is the size of the set and d's are the double values.
str_set -- A set of std::string values, The set is printed as "s[str1|str2|...]"
where s is the size of the set and str's are the strings.
str_dbl_map -- A map of string keys to double precision values, The map is printed as "s{k1:v1|k2:v2|...}"
where s is the size of the map and k's and v's are keys and values.
str_dbl_unomap -- An unordered map of string keys to double precision values, The map is printed as "s{k1:v1|k2:v2|...}"
where s is the size of the map and k's and v's are keys and values.
dbl_vec -- A vector of double precision values, The vector is printed
as "s[d1|d2|...]" where s is the size of the vector and d's
are the double values.
str_vec -- A vector of std::string values, The vector is printed as
"s[str1|str2|...]" where s is the size of the vector and
str's are the strings.
dbl_set -- A set of double precision values, The set is printed as
"s[d1|d2|...]" where s is the size of the set and d's
are the double values.
str_set -- A set of std::string values, The set is printed as
"s[str1|str2|...]" where s is the size of the set and
str's are the strings.
str_dbl_map -- A map of string keys to double precision values, The map is
printed as "s{k1:v1|k2:v2|...}" where s is the size of
the map and k's and v's are keys and values.
str_dbl_unomap -- An unordered map of string keys to double precision values,
The map is printed as "s{k1:v1|k2:v2|...}" where s is the
size of the map and k's and v's are keys and values.
</PRE>

In case of io_format::csv2 the following additional types are also supported:
<PRE>
DateTimeAME -- DateTime string printed in American style (MM/DD/YYYY HH:MM:SS.mmm)
DateTimeEUR -- DateTime string printed in European style (YYYY/MM/DD HH:MM:SS.mmm)
DateTimeISO -- DateTime string printed in ISO style (YYYY-MM-DD HH:MM:SS.mmm)
DateTimeAME -- American style (MM/DD/YYYY HH:MM:SS.mmm)
DateTimeEUR -- European style (YYYY/MM/DD HH:MM:SS.mmm)
DateTimeISO -- ISO style (YYYY-MM-DD HH:MM:SS.mmm)
</PRE>

</td>
<td>
<PRE>
<B>S</B>: Output stream type
<B>Ts</B>: The list of types for all columns. A type should be specified only once
<B>o</B>: Reference to an streamable object (e.g. cout, file, ...)
<B>iof</B>: Specifies the I/O format. The default is CSV
<B>precision</B>: Specifies the precision for floating point numbers
<B>columns_only</B>: If true, the index columns is not written into the stream
<B>max_recs</B>: Max number of rows to write. If it is positive, it will write max_recs from the beginning of DataFrame. If it is negative, it will write max_recs from the end of DataFrame
</PRE>
<td width="30%">
<B>S</B>: Output stream type<BR>
<B>Ts</B>: The list of types for all columns. A type should be specified only once<BR>
<B>o</B>: Reference to an streamable object (e.g. cout, file, ...)<BR>
<B>iof</B>: Specifies the I/O format. The default is CSV<BR>
<B>precision</B>: Specifies the precision for floating point numbers<BR>
<B>columns_only</B>: If true, the index columns is not written into the stream<BR>
<B>max_recs</B>: Max number of rows to write. If it is positive, it will write max_recs from the beginning of DataFrame. If it is negative, it will write max_recs from the end of DataFrame<BR>
</td>
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue"> <font color="white">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<PRE><B>
template&lt;typename ... Ts&gt;
std::future&lt;bool&gt;
Expand All @@ -163,7 +174,7 @@
long max_recs = std::numeric_limits<long>::max()) const; </font>
</B></PRE>
</td>
<td>
<td width = "33.3%">
Same as write() above, but it takes a file name<BR><BR>
<B>NOTE:</B>: This version of write() can be substantially faster, especially for larger files, than if you open the file yourself and use the write() version above.
</td>
Expand All @@ -172,7 +183,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue"> <font color="white">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<PRE><B>
template&lt;typename S, typename ... Ts&gt;
std::future&lt;bool&gt;
Expand All @@ -191,7 +202,7 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue"> <font color="white">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<PRE><B>
template&lt;typename ... Ts&gt;
std::future&lt;bool&gt;
Expand All @@ -210,14 +221,14 @@
</tr>

<tr bgcolor="Azure">
<td bgcolor="blue"> <font color="white">
<td bgcolor="blue" width = "33.3%"> <font color="white">
<PRE><B>
template&lt;typename ... Ts&gt;
std::string
to_string(std::streamsize precision = 12) const; </font>
</B></PRE>
</td>
<td>
<td width = "33.3%">
This is a convenient function (simple implementation) to convert a DataFrame into a string that could be restored later by calling from_string(). It utilizes the write() member function of DataFrame.<BR>
These functions could be used to transmit a DataFrame from one place to another or store a DataFrame in databases, caches, ... <BR><BR>

Expand All @@ -229,7 +240,7 @@
</OL>

</td>
<td>
<td width = "30%">
<B>Ts</B>: The list of types for all columns. A type should be specified only once<BR>
<B>precision</B>: Specifies the precision for floating point numbers<BR>
</td>
Expand Down
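The `max_recs` description in write.html says a positive value writes that many rows from the beginning of the DataFrame and a negative value writes that many rows from the end. One way to read that rule is as a mapping to a half-open row range, similar in spirit to the `start_row` / `end_row` members of the print functors. The helper `row_range` below is a hypothetical illustration, not the library's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <utility>

// Hypothetical helper: translate max_recs into a half-open range
// [start_row, end_row) over a container of total_rows rows.
inline std::pair<long, long>
row_range(long total_rows,
          long max_recs = std::numeric_limits<long>::max()) {
    assert(total_rows >= 0);

    // Non-negative: take up to max_recs rows from the beginning.
    if (max_recs >= 0)
        return { 0L, std::min(total_rows, max_recs) };

    // Negative: take up to |max_recs| rows from the end. The comparison
    // is done before negating so that the most negative long is safe.
    const long count = (max_recs < -total_rows) ? total_rows : -max_recs;

    return { total_rows - count, total_rows };
}
```

For a 10-row frame, `row_range(10, 3)` covers rows 0 to 2 and `row_range(10, -3)` covers rows 7 to 9; a magnitude larger than the row count simply selects every row.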
19 changes: 19 additions & 0 deletions include/DataFrame/Internals/DataFrame_functors.h
Expand Up @@ -228,6 +228,25 @@ struct print_csv_functor_ : DataVec::template visitor_base<Ts ...> {
void operator() (const T &vec);
};

// ----------------------------------------------------------------------------

template<typename ... Ts>
struct print_binary_functor_ : DataVec::template visitor_base<Ts ...> {

inline print_binary_functor_ (const char *n,
std::ostream &o,
long sr,
long er)
: name(n), os(o), start_row(sr), end_row(er) { }

const char *name;
std::ostream &os;
const long start_row;
const long end_row;

template<typename T>
void operator() (const T &vec);
};

// ----------------------------------------------------------------------------

Expand Down
44 changes: 34 additions & 10 deletions include/DataFrame/Internals/DataFrame_misc.tcc
Expand Up @@ -197,23 +197,47 @@ DataFrame<I, H>::print_csv_functor_<Ts ...>::operator() (const T &vec) {
using VecType = typename std::remove_reference<T>::type;
using ValueType = typename VecType::value_type;

_write_csv_df_header_<std::ostream, ValueType>(os, name, vec.size()) << ':';
_write_csv_df_header_<std::ostream, ValueType>(os, name, vec.size())
<< ':';

const long vec_size = vec.size();
const long sr = std::min(start_row, vec_size);
const long er = std::min(end_row, vec_size);

if (vec_size > 0) {
for (long i = sr; i < er; ++i)
_write_csv_df_index_(os, vec[i]) << ',';
}
for (long i = sr; i < er; ++i)
_write_csv_df_index_(os, vec[i]) << ',';
os << '\n';

return;
}

// ----------------------------------------------------------------------------

template<typename I, typename H>
template<typename ... Ts>
template<typename T>
void
DataFrame<I, H>::print_binary_functor_<Ts ...>::operator() (const T &vec) {

using VecType = typename std::remove_reference<T>::type;
using ValueType = typename VecType::value_type;

char col_name[64];

std::strncpy(col_name, name, sizeof(col_name));
os.write(col_name, sizeof(col_name));
if constexpr (std::is_same_v<ValueType, std::string>)
_write_binary_string_(os, vec);
else if constexpr (std::is_same_v<ValueType, DateTime>)
_write_binary_datetime_(os, vec);
else
_write_binary_data_(os, vec);

return;
}

// ----------------------------------------------------------------------------

template<typename I, typename H>
template<typename ... Ts>
template<typename T>
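The new `print_binary_functor_::operator()` above copies the column name into a 64-byte buffer with `std::strncpy` and writes the whole buffer, so each column starts with a fixed-width name field. `std::strncpy` zero-pads short names, but a name of 64 or more characters would fill the field with no terminating null. The sketch below is a variation that always reserves the last byte for the terminator, together with a matching reader. `write_name_field` and `read_name_field` are hypothetical names and are not part of this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Matches the size of col_name[64] in the diff above.
constexpr std::size_t NAME_FIELD_SIZE = 64;

// Hypothetical helper: write name as a fixed-width, zero-padded field.
// Names longer than 63 characters are truncated so the field always
// ends with a null byte.
inline void write_name_field(std::ostream &os, const char *name) {
    assert(name != nullptr);

    char field[NAME_FIELD_SIZE] = { };  // Zero-initialized

    std::strncpy(field, name, sizeof(field) - 1);
    os.write(field, sizeof(field));
}

// Hypothetical helper: read the fixed-width field back into a string.
inline std::string read_name_field(std::istream &is) {
    char field[NAME_FIELD_SIZE + 1] = { };  // Extra byte keeps it terminated

    is.read(field, NAME_FIELD_SIZE);
    return std::string(field);
}
```

A fixed-width field lets a reader locate a column's payload without parsing a length prefix, at the cost of padding bytes for short names.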
Expand Down Expand Up @@ -696,7 +720,7 @@ operator() (T &vec) {

using VecType = typename std::remove_reference<T>::type;
using ValueType = typename VecType::value_type;
using ViewType = typename DF::template ColumnVecType<ValueType>;
using ViewType = typename DF::template ColumnVecType<ValueType>;

ViewType new_col;
const size_type vec_size = vec.size();
Expand Down Expand Up @@ -737,12 +761,12 @@ operator() (T &vec) const {
if (sel_indices[i] < vec_s) {
if constexpr (std::is_base_of<HeteroVector<align_value>, H>::value)
vec.erase(vec.begin() + (sel_indices[i] - del_count++));
else
else
vec.erase(sel_indices[i] - del_count++);
}
}
else
break;
}
}
}

// ----------------------------------------------------------------------------
Expand Down Expand Up @@ -772,7 +796,7 @@ random_load_data_functor_<DF, Ts ...>::operator() (const T &vec) {

const size_type vec_s = vec.size();
const size_type n_rows = rand_indices.size();
typename DF::template ColumnVecType<ValueType> new_vec;
typename DF::template ColumnVecType<ValueType> new_vec;
size_type prev_value { 0 };

new_vec.reserve(n_rows);
Expand Down