Troubleshooting
Checking the Schema of the Parquet File
NVTabular expects all input parquet files to have the same schema, which includes column types and the nullable (not null) option. If you encounter the following error when you load the dataset, one of your parquet files might have a different schema:

```
RuntimeError: Schemas are inconsistent, try using to_parquet(..., schema="infer"),
or pass an explicit pyarrow schema. Such as to_parquet(..., schema={"column1": pa.string()})
```

```python
ds = nvt.Dataset(PATH, engine="parquet", part_size="1000MB")
ds.to_ddf().head()
```
The easiest way to fix this is to load your dataset with dask_cudf and save it again in parquet format (`dask_cudf.read_parquet("INPUT_FOLDER").to_parquet("OUTPUT_FOLDER")`) so that the parquet files are standardized and the `_metadata` file is generated.
If you want to identify which parquet files contain columns with different schemas, you can run one of these scripts. These scripts check for schema consistency and generate only the `_metadata` file instead of converting all the parquet files. If the schema is inconsistent across the files, the script will raise an exception. For additional information, see this issue.
Setting the Row Group Size for the Parquet Files
You can use most dataframe frameworks to set the row group size (number of rows) of your parquet files. In the following Pandas and cuDF examples, `row_group_size` is the number of rows that will be stored in each row group (an internal structure within the parquet file):

```python
# Pandas
pandas_df.to_parquet("/file/path", engine="pyarrow", row_group_size=10000)

# cuDF
cudf_df.to_parquet("/file/path", engine="pyarrow", row_group_size=10000)
```
The row group memory size of the parquet files should be smaller than the `part_size` that you set for the NVTabular dataset, for example `nvt.Dataset(TRAIN_DIR, engine="parquet", part_size="1000MB")`. To determine how much memory a row group will hold, slice your dataframe to a specific number of rows and use the following function to get its memory usage in bytes. You can then set `row_group_size` (the number of rows) accordingly when you save the parquet file. A row group memory size close to 128MB is recommended.
```python
import cudf


def _memory_usage(df):
    """This function is a workaround for obtaining memory usage lists
    in cudf 0.16. It can be deleted and replaced with
    `df.memory_usage(deep=True, index=True).sum()` when using cudf 0.17,
    in which this has been fixed as noted in
    https://github.com/rapidsai/cudf/pull/6549."""
    size = 0
    for col in df._data.columns:
        if cudf.utils.dtypes.is_list_dtype(col.dtype):
            for child in col.base_children:
                size += child.__sizeof__()
        else:
            size += col._memory_usage(deep=True)
    size += df.index.memory_usage(deep=True)
    return size
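Putting the two steps together, you can estimate a `row_group_size` that lands near the recommended 128MB by measuring the in-memory size of a sample of rows. The sketch below uses pandas so it runs without a GPU; with cuDF on older versions you would substitute the `_memory_usage` workaround above. The helper name `estimate_row_group_size` and the sample data are assumptions for illustration:

```python
# Sketch: pick a row_group_size so each row group is roughly 128MB,
# based on the average in-memory bytes per row of a sample slice.
import numpy as np
import pandas as pd

TARGET_BYTES = 128 * 1024 * 1024  # recommended ~128MB per row group


def estimate_row_group_size(df, sample_rows=10_000):
    """Estimate how many rows fit in ~128MB from a sample of the dataframe."""
    sample = df.head(sample_rows)
    bytes_per_row = sample.memory_usage(deep=True, index=True).sum() / len(sample)
    return max(1, int(TARGET_BYTES / bytes_per_row))


# Hypothetical dataframe: one int64 and one float64 column
df = pd.DataFrame(
    {
        "x": np.arange(1_000_000, dtype="int64"),
        "y": np.random.rand(1_000_000),
    }
)
row_group_size = estimate_row_group_size(df)
# df.to_parquet("/file/path", engine="pyarrow", row_group_size=row_group_size)
```

Note that the in-memory size is only an approximation of the on-disk row group size, since parquet applies encoding and compression; the estimate is still close enough to keep row groups well under your `part_size`.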