Commit 684eedc

Merge branch 'branch-0.7' into ci-cudatoolkit-dep
2 parents b9b04a4 + 50bb9c1 commit 684eedc

File tree

117 files changed: 9255 additions, 2959 deletions

.github/CODEOWNERS

Lines changed: 14 additions & 5 deletions
@@ -1,8 +1,17 @@
-#Admins & project code owners
-#* @rapidsai/admins @rapidsai/cudf-admin @rapidsai/cudf-codeowners
-
 #cpp code owners
-#cpp/ @rapidsai/cudf-cpp-codeowners
+cpp/ @rapidsai/cudf-cpp-codeowners

 #python code owners
-#python/ @rapidsai/cudf-python-codeowners
+python/ @rapidsai/cudf-python-codeowners
+
+#cmake code owners
+**/CMakeLists.txt @rapidsai/cudf-cmake-codeowners
+**/cmake/ @rapidsai/cudf-cmake-codeowners
+
+#build/ops code owners
+.github/ @rapidsai/cudf-ops-codeowners
+ci/ @rapidsai/cudf-ops-codeowners
+conda/ @rapidsai/cudf-ops-codeowners
+**/Dockerfile @rapidsai/cudf-ops-codeowners
+**/.dockerignore @rapidsai/cudf-ops-codeowners
+docker/ @rapidsai/cudf-ops-codeowners

CHANGELOG.md

Lines changed: 18 additions & 1 deletion
@@ -1,7 +1,7 @@
 # cuDF 0.7.0 (Date TBD)

 ## New Features
-
+- PR #1142 Add `GDF_BOOL` column type
 - PR #1194 Implement overloads for CUDA atomic operations
 - PR #1292 Implemented Bitwise binary ops AND, OR, XOR (&, |, ^)
 - PR #1235 Add GPU-accelerated Parquet Reader
@@ -20,9 +20,13 @@
 - PR #1441 Add Series level cumulative ops (cumsum, cummin, cummax, cumprod)
 - PR #1461 Add Python coverage test to gpu build
 - PR #1445 Parquet Reader: Add selective reading of rows and row group
+- PR #1532 Parquet Reader: Add support for INT96 timestamps
+- PR #1516 Add Series and DataFrame.ndim
+- PR #1466 Add GPU-accelerated ORC Reader

 ## Improvements

+- PR #1531 Refactor closures as private functions in gpuarrow
 - PR #1404 Parquet reader page data decoding speedup
 - PR #1076 Use `type_dispatcher` in join, quantiles, filter, segmented sort, radix sort and hash_groupby
 - PR #1202 Simplify README.md
@@ -40,6 +44,7 @@
 - PR #1319 CSV Reader: Use column wrapper for gdf_column output alloc/dealloc
 - PR #1376 Change series quantile default to linear
 - PR #1399 Replace CFFI bindings for NVTX functions with Cython bindings
+- PR #1407 Rename and cleanup of `gdf_table` to `device_table`
 - PR #1389 Refactored `set_null_count()`
 - PR #1386 Added macros `GDF_TRY()`, `CUDF_TRY()` and `ASSERT_CUDF_SUCCEEDED()`
 - PR #1435 Rework CMake and conda recipes to depend on installed libraries
@@ -50,10 +55,14 @@
 - PR #1479 Convert Parquet Reader CFFI to Cython
 - PR #1397 Add a utility function for producing an overflow-safe kernel launch grid configuration
 - PR #1382 Add GPU parsing of nested brackets to cuIO parsing utilities
+- PR #1481 Add cudf::table constructor to allocate a set of `gdf_column`s
 - PR #1484 Convert GroupBy CFFI to Cython
 - PR #1463 Allow and default melt keyword argument var_name to be None
 - PR #1486 Parquet Reader: Use device_buffer rather than device_ptr
 - PR #1525 Add cudatoolkit conda dependency
+- PR #1520 Renamed `src/dataframe` to `src/table` and moved `table.hpp`. Made `types.hpp` to be type declarations only.
+- PR #1521 Added `row_bitmask` to compute bitmask for rows of a table. Merged `valids_ops.cu` and `bitmask_ops.cu`
+- PR #1553 Overload `hash_row` to avoid using initial hash values. Updated `gdf_hash` to select between overloads

 ## Bug Fixes

@@ -87,9 +96,16 @@
 - PR #1451 Fix hash join estimated result size is not correct
 - PR #1454 Fix local build script improperly changing directory permissions
 - PR #1490 Require Dask 1.1.0+ for `is_dataframe_like` test or skip otherwise.
+- PR #1491 Use more specific directories & groups in CODEOWNERS
 - PR #1497 Fix Thrust issue on CentOS caused by missing default constructor of host_vector elements
 - PR #1498 Add missing include guard to device_atomics.cuh and separated DEVICE_ATOMICS_TEST
 - PR #1506 Fix csv-write call to updated NVStrings method
+- PR #1510 Added nvstrings `fillna()` function
+- PR #1507 Parquet Reader: Default string data to GDF_STRING
+- PR #1535 Fix doc issue to ensure correct labelling of cudf.series
+- PR #1537 Fix `undefined reference` link error in HashPartitionTest
+- PR #1548 Fix ci/local/build.sh README from using an incorrect image example
+- PR #1551 CSV Reader: Fix integer column name indexing


 # cuDF 0.6.1 (25 Mar 2019)
@@ -151,6 +167,7 @@
 - PR #1155 Add __array_ufunc__ for DataFrame and Series for sqrt
 - PR #1168 to_frame for series accepts a name argument

+
 ## Improvements

 - PR #1218 Add dask-cudf page to API docs

ci/local/README.md

Lines changed: 8 additions & 1 deletion
@@ -23,10 +23,17 @@ where:
2323
```
2424

2525
Example Usage:
26-
`bash build.sh -r ~/rapids/cudf -i gpuci/cuda9.2-ubuntu16.04-gcc5-py3.6`
26+
`bash build.sh -r ~/rapids/cudf -i gpuci/rapidsai-base:cuda9.2-ubuntu16.04-gcc5-py3.6`
2727

2828
For a full list of available gpuCI docker images, visit our [DockerHub](https://hub.docker.com/r/gpuci/rapidsai-base/tags) page.
2929

30+
Style Check:
31+
```bash
32+
$ bash ci/local/build.sh -r ~/rapids/cudf -s
33+
$ source activate gdf #Activate gpuCI conda environment
34+
$ cd rapids
35+
$ flake8 python
36+
```
3037

3138
## Information
3239

cpp/CMakeLists.txt

Lines changed: 10 additions & 6 deletions
@@ -91,16 +91,16 @@ include(FeatureSummary)
 include(CheckIncludeFiles)
 include(CheckLibraryExists)

-include(ConfigureArrow)
-
 ###################################################################################################
 # - find arrow ------------------------------------------------------------------------------------

+include(ConfigureArrow)
+
 if (ARROW_FOUND)
 message(STATUS "Apache Arrow found in ${ARROW_INCLUDE_DIR}")
 else()
 message(FATAL_ERROR "Apache Arrow not found, please check your settings.")
-endif()
+endif(ARROW_FOUND)

 ###################################################################################################
 # - find zlib -------------------------------------------------------------------------------------
@@ -221,8 +221,8 @@ link_directories("${CMAKE_CUDA_IMPLICIT_LINK_DIRECTORIES}" # CMAKE_CUDA_IMPLICIT

 add_library(cudf SHARED
 src/comms/ipc/ipc.cu
-src/dataframe/column.cpp
-src/dataframe/context.cpp
+src/column/column.cpp
+src/column/context.cpp
 src/string/nvcategory_util.cpp
 src/join/joining.cu
 src/orderby/orderby.cu
@@ -241,7 +241,6 @@ add_library(cudf SHARED
 src/binary/jit/util/operator.cpp
 src/binary/jit/util/type.cpp
 src/bitmask/bitmask_ops.cu
-src/bitmask/valid_ops.cu
 src/compaction/stream_compaction_ops.cu
 src/datetime/datetime_ops.cu
 src/hash/hashing.cu
@@ -259,6 +258,11 @@ add_library(cudf SHARED
 src/io/convert/dlpack/cudf_dlpack.cpp
 src/io/csv/csv_reader.cu
 src/io/csv/csv_writer.cu
+src/io/orc/orc_reader.cu
+src/io/orc/orc.cpp
+src/io/orc/timezone.cpp
+src/io/orc/stripe_data.cu
+src/io/orc/stripe_init.cu
 src/io/parquet/page_data.cu
 src/io/parquet/page_hdr.cu
 src/io/parquet/parquet_reader.cu

cpp/include/bitmask.hpp

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+/*
+* Copyright (c) 2019, NVIDIA CORPORATION.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+* http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+#ifndef BITMASK_HPP
+#define BITMASK_HPP
+
+#include <cudf.h>
+#include <types.hpp>
+
+/**
+* @brief Counts the number of valid bits for the specified number of rows
+* in a validity bitmask.
+*
+* If the bitmask is null, returns a count equal to the number of rows.
+*
+* @param[in] masks The validity bitmask buffer in device memory
+* @param[in] num_rows The number of bits to count
+* @param[out] count The number of valid bits in the buffer from [0, num_rows)
+*
+* @returns GDF_SUCCESS upon successful completion
+*
+*/
+gdf_error gdf_count_nonzero_mask(gdf_valid_type const* masks,
+gdf_size_type num_rows, gdf_size_type* count);
+
+/** ---------------------------------------------------------------------------*
+* @brief Concatenate the validity bitmasks of multiple columns
+*
+* Accounts for the differences between lengths of columns and their bitmasks
+* (e.g. because gdf_valid_type is larger than one bit).
+*
+* @param[out] output_mask The concatenated mask
+* @param[in] output_column_length The total length (in data elements) of the
+* concatenated column
+* @param[in] masks_to_concat The array of device pointers to validity bitmasks
+* for the columns to concatenate
+* @param[in] column_lengths An array of lengths of the columns to concatenate
+* @param[in] num_columns The number of columns to concatenate
+* @return gdf_error GDF_SUCCESS or GDF_CUDA_ERROR if there is a runtime CUDA
+error
+*
+---------------------------------------------------------------------------**/
+gdf_error gdf_mask_concat(gdf_valid_type* output_mask,
+gdf_size_type output_column_length,
+gdf_valid_type* masks_to_concat[],
+gdf_size_type* column_lengths,
+gdf_size_type num_columns);
+
+
+#endif
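
The doc comments in `bitmask.hpp` describe two behaviours that are easy to get wrong: a null mask counts as "all rows valid", and concatenation has to cope with column lengths that are not a multiple of the mask element width. The sketch below is a host-side reference only, not the GPU implementation from this commit. The typedefs mirror `cudf/types.h`; the helper names (`mask_elements_for`, `count_valid_host`, `concat_masks_host`) are hypothetical, and two things are assumptions rather than facts from the diff: that validity bits are packed least-significant-bit first, and that a null input mask to the concatenation means "all valid".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Local stand-ins for the typedefs shown in cudf/types.h in this commit.
using gdf_valid_type = unsigned char;
using gdf_size_type  = int;

constexpr gdf_size_type kBitsPerElement = 8 * sizeof(gdf_valid_type);

// Number of mask elements needed to hold one validity bit per row.
inline gdf_size_type mask_elements_for(gdf_size_type num_rows) {
  return (num_rows + kBitsPerElement - 1) / kBitsPerElement;
}

// ASSUMPTION: row i is stored at bit (i % 8) of element (i / 8), LSB first.
inline bool row_is_valid(const gdf_valid_type* mask, gdf_size_type row) {
  return ((mask[row / kBitsPerElement] >> (row % kBitsPerElement)) & 1) != 0;
}

// Host reference for the documented contract of gdf_count_nonzero_mask:
// count set bits in [0, num_rows); a null mask means every row is valid.
inline gdf_size_type count_valid_host(const gdf_valid_type* mask,
                                      gdf_size_type num_rows) {
  assert(num_rows >= 0);
  if (mask == nullptr) return num_rows;
  gdf_size_type count = 0;
  for (gdf_size_type row = 0; row < num_rows; ++row) {
    if (row_is_valid(mask, row)) ++count;
  }
  return count;
}

// Host reference for bitmask concatenation. Bits are copied one row at a
// time, so a column whose length is not a multiple of 8 leaves no gap in
// the output mask.
// ASSUMPTION: a null input mask is treated as "all rows valid".
inline std::vector<gdf_valid_type> concat_masks_host(
    const std::vector<const gdf_valid_type*>& masks,
    const std::vector<gdf_size_type>& column_lengths) {
  assert(masks.size() == column_lengths.size());
  gdf_size_type total_rows = 0;
  for (gdf_size_type len : column_lengths) total_rows += len;

  std::vector<gdf_valid_type> out(mask_elements_for(total_rows), 0);
  gdf_size_type out_row = 0;
  for (std::size_t c = 0; c < masks.size(); ++c) {
    for (gdf_size_type row = 0; row < column_lengths[c]; ++row, ++out_row) {
      const bool valid = masks[c] == nullptr || row_is_valid(masks[c], row);
      if (valid) {
        out[out_row / kBitsPerElement] |=
            static_cast<gdf_valid_type>(1u << (out_row % kBitsPerElement));
      }
    }
  }
  return out;
}
```

Copying bit by bit is slow but makes the alignment handling obvious. A production version would presumably shift whole mask elements instead.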

cpp/include/cudf/functions.h

Lines changed: 0 additions & 15 deletions
@@ -41,21 +41,6 @@ gdf_error gdf_nvtx_range_push_hex(char const * const name, unsigned int color );
 */
 gdf_error gdf_nvtx_range_pop();

-/**
-* @brief Counts the number of valid bits for the specified number of rows
-* in a validity bitmask.
-*
-* If the bitmask is null, returns a count equal to the number of rows.
-*
-* @param[in] masks The validity bitmask buffer in device memory
-* @param[in] num_rows The number of bits to count
-* @param[out] count The number of valid bits in the buffer from [0, num_rows)
-*
-* @returns GDF_SUCCESS upon successful completion
-*
-*/
-gdf_error gdf_count_nonzero_mask(gdf_valid_type const *masks,
-gdf_size_type num_rows, gdf_size_type *count);

 /**
 * Calculates the number of bytes to allocate for a column's validity bitmask

cpp/include/cudf/io_functions.h

Lines changed: 12 additions & 0 deletions
@@ -34,6 +34,18 @@ gdf_error read_csv(csv_read_arg *args);
 */
 gdf_error write_csv(csv_write_arg* args);

+/*
+* @brief Interface to parse ORC data to GDF columns
+*
+* This function accepts an input source for an Apache ORC dataset and outputs
+* an array of gdf_columns.
+*
+* @param[in,out] args Structure containing input and output args
+*
+* @return gdf_error GDF_SUCCESS if successful
+**/
+gdf_error read_orc(orc_read_arg *args);
+
 /*
 * @brief Interface to parse Parquet data to GDF columns
 */

cpp/include/cudf/io_types.h

Lines changed: 29 additions & 0 deletions
@@ -158,6 +158,33 @@ typedef struct

 } csv_write_arg;

+/**---------------------------------------------------------------------------*
+* @brief Input and output arguments to the read_orc interface.
+*---------------------------------------------------------------------------**/
+typedef struct {
+
+/*
+* Output arguments
+*/
+int num_cols_out; ///< Out: Number of columns returned
+int num_rows_out; ///< Out: Number of rows returned
+gdf_column **data; ///< Out: Array of gdf_columns*
+
+/*
+* Input arguments
+*/
+gdf_input_type source_type; ///< In: Type of data source
+const char *source; ///< In: If source_type is FILE_PATH, contains the filepath. If source_type is HOST_BUFFER, points to the host memory buffer
+size_t buffer_size; ///< In: If source_type is HOST_BUFFER, represents the size of the buffer in bytes. Unused otherwise.
+
+const char **use_cols; ///< In: Columns of interest; only these columns will be parsed and returned.
+int use_cols_len; ///< In: Number of columns
+
+int skip_rows; ///< In: Number of rows to skip from the start
+int num_rows; ///< In: Number of rows to read. Actual number of returned rows may be less
+
+} orc_read_arg;
+
 /**---------------------------------------------------------------------------*
 * @brief Input and output arguments to the read_parquet interface.
 *---------------------------------------------------------------------------**/
@@ -185,4 +212,6 @@ typedef struct {
 const char **use_cols; ///< In: Columns of interest; only these columns will be parsed and returned.
 int use_cols_len; ///< In: Number of columns

+bool strings_to_categorical; ///< In: If TRUE, returns string data as GDF_CATEGORY, otherwise GDF_STRING
+
 } pq_read_arg;
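
The `orc_read_arg` struct added above mixes input and output fields, so a caller has to zero the outputs and fill the inputs before passing it to `read_orc`. The following is a rough sketch of that setup for the file-path case. The struct is re-declared locally so the snippet compiles without the cuDF headers; the `gdf_input_type` enumerator names come from the field comments in the diff, but their order and values here are assumptions, and `make_orc_file_args` is a hypothetical helper, not part of cuDF.

```cpp
#include <cassert>
#include <cstddef>

// ASSUMPTION: local mirrors of declarations from cudf/types.h and
// cudf/io_types.h so the sketch is self-contained.
struct gdf_column;                              // opaque in this sketch
enum gdf_input_type { FILE_PATH, HOST_BUFFER }; // names from the field comments;
                                                // order and values assumed

struct orc_read_arg {
  // Output arguments
  int num_cols_out;
  int num_rows_out;
  gdf_column** data;
  // Input arguments
  gdf_input_type source_type;
  const char* source;
  std::size_t buffer_size;
  const char** use_cols;
  int use_cols_len;
  int skip_rows;
  int num_rows;
};

// Hypothetical helper: build arguments for reading selected columns from an
// ORC file on disk. Output fields start zeroed / null.
inline orc_read_arg make_orc_file_args(const char* path,
                                       const char** cols, int num_cols,
                                       int skip_rows, int num_rows) {
  assert(path != nullptr);
  orc_read_arg args{};            // value-initialise every field
  args.source_type  = FILE_PATH;
  args.source       = path;
  args.buffer_size  = 0;          // documented as unused unless HOST_BUFFER
  args.use_cols     = cols;
  args.use_cols_len = num_cols;
  args.skip_rows    = skip_rows;
  args.num_rows     = num_rows;
  return args;
}

// With the real headers, the call declared in cudf/io_functions.h would be
//   gdf_error err = read_orc(&args);
// after which num_cols_out, num_rows_out and data hold the results.
```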

cpp/include/cudf/types.h

Lines changed: 11 additions & 7 deletions
@@ -1,14 +1,17 @@
 #pragma once

 // TODO: Update to use fixed width types when CFFI goes away
-typedef int gdf_size_type; ///< Limits the maximum size of a gdf_column to 2^31-1
+typedef int gdf_size_type; ///< Limits the maximum size of a gdf_column to 2^31-1
 typedef gdf_size_type gdf_index_type;
 typedef unsigned char gdf_valid_type;
-typedef long gdf_date64;
-typedef int gdf_date32;
-typedef int gdf_category;
-typedef long gdf_timestamp;
-typedef int gdf_nvstring_category;
+typedef char gdf_bool8; /*< Storage type for Boolean values.
+char is used to guarantee 8-bit storage.
+zero == false, nonzero == true. */
+typedef long gdf_date64;
+typedef int gdf_date32;
+typedef int gdf_category;
+typedef long gdf_timestamp;
+typedef int gdf_nvstring_category;


 /**
@@ -22,6 +25,7 @@ typedef enum {
 GDF_INT64,
 GDF_FLOAT32,
 GDF_FLOAT64,
+GDF_BOOL8, ///< Boolean stored in 8 bits per Boolean. zero==false, nonzero==true.
 GDF_DATE32, ///< int32_t days since the UNIX epoch
 GDF_DATE64, ///< int64_t milliseconds since the UNIX epoch
 GDF_TIMESTAMP, ///< Exact timestamp encoded with int64 since UNIX epoch (Default unit millisecond)
@@ -32,7 +36,6 @@ typedef enum {
 } gdf_dtype;


-
 /**
 * @brief These are all possible gdf error codes that can be returned from
 * a libgdf function. ANY NEW ERROR CODE MUST ALSO BE ADDED TO `gdf_error_get_name`
@@ -110,6 +113,7 @@ typedef union {
 long si64; /**< GDF_INT64 */
 float fp32; /**< GDF_FLOAT32 */
 double fp64; /**< GDF_FLOAT64 */
+char b08; /**< GDF_BOOL8 */
 gdf_date32 dt32; /**< GDF_DATE32 */
 gdf_date64 dt64; /**< GDF_DATE64 */
 gdf_timestamp tmst; /**< GDF_TIMESTAMP */
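
The new `gdf_bool8` type is documented as "zero == false, nonzero == true", which means any nonzero byte is a valid `true` and raw bitwise operators on the storage values can give the wrong logical answer. The sketch below illustrates that point. The typedef mirrors the one added in this commit; `to_bool`, `from_bool` and `logical_and` are hypothetical helper names for illustration, not functions from cuDF.

```cpp
#include <cassert>

// Mirrors the typedef added to cudf/types.h in this commit.
using gdf_bool8 = char;

static_assert(sizeof(gdf_bool8) == 1, "gdf_bool8 occupies a single byte");

// Interpret a stored value: zero is false, anything else is true.
inline bool to_bool(gdf_bool8 v) { return v != 0; }

// Normalise a C++ bool to a canonical stored value (0 or 1).
inline gdf_bool8 from_bool(bool b) { return b ? 1 : 0; }

// Logical AND on stored values. Converting to bool first matters because
// two "true" values such as 2 and 1 share no set bits, so 2 & 1 is 0.
inline gdf_bool8 logical_and(gdf_bool8 a, gdf_bool8 b) {
  return from_bool(to_bool(a) && to_bool(b));
}
```

Normalising to 0 or 1 on output keeps later equality comparisons between Boolean columns meaningful, at the cost of one extra comparison per element.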
