Parquet
| Input | Output | Alias |
|---|---|---|
| ✔ | ✔ | |
Description
Apache Parquet is a columnar storage format widely used in the Hadoop ecosystem. ClickHouse supports both reading and writing this format.
Data types matching
The table below shows how Parquet data types match ClickHouse data types.
| Parquet type (logical, converted, or physical) | ClickHouse data type |
|---|---|
| BOOLEAN | Bool |
| UINT_8 | UInt8 |
| INT_8 | Int8 |
| UINT_16 | UInt16 |
| INT_16 | Int16/Enum16 |
| UINT_32 | UInt32 |
| INT_32 | Int32 |
| UINT_64 | UInt64 |
| INT_64 | Int64 |
| DATE | Date32 |
| TIMESTAMP, TIME | DateTime64 |
| FLOAT | Float32 |
| DOUBLE | Float64 |
| INT96 | DateTime64(9, 'UTC') |
| BYTE_ARRAY, UTF8, ENUM, BSON | String |
| JSON | JSON |
| FIXED_LEN_BYTE_ARRAY | FixedString |
| DECIMAL | Decimal |
| LIST | Array |
| MAP | Map |
| struct | Tuple |
| FLOAT16 | Float32 |
| UUID | FixedString(16) |
| INTERVAL | FixedString(12) |
| Point (GeoParquet) | Point |
| LineString (GeoParquet) | LineString |
| Polygon (GeoParquet) | Polygon |
| MultiLineString (GeoParquet) | MultiLineString |
| MultiPolygon (GeoParquet) | MultiPolygon |
| mixed/unknown geometry (GeoParquet) | Geometry |
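
Because a Parquet UUID column arrives as FixedString(16), client code has to turn the 16 raw bytes back into a textual UUID itself. The following is a minimal Python sketch of that step. It assumes the bytes are passed through exactly as stored in the Parquet file (the Parquet specification stores UUIDs big-endian); the helper name `fixed_string_to_uuid` is hypothetical and not part of ClickHouse.

```python
import uuid


def fixed_string_to_uuid(raw: bytes) -> uuid.UUID:
    """Interpret 16 raw big-endian bytes (assumed Parquet UUID layout) as a UUID."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    return uuid.UUID(bytes=raw)


# Hypothetical FixedString(16) value as it might be returned to a client.
raw = bytes.fromhex("00112233445566778899aabbccddeeff")
print(fixed_string_to_uuid(raw))  # 00112233-4455-6677-8899-aabbccddeeff
```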
When writing a Parquet file, data types that don't have a matching Parquet type are converted to the nearest available type:
| ClickHouse data type | Parquet type |
|---|---|
| IPv4 | UINT_32 |
| IPv6 | FIXED_LEN_BYTE_ARRAY (16 bytes) |
| Date (16 bits) | DATE (32 bits) |
| DateTime (32 bits, seconds) | TIMESTAMP (64 bits, milliseconds) |
| Int128/UInt128/Int256/UInt256 | FIXED_LEN_BYTE_ARRAY (16/32 bytes, little-endian) |
| Point | BYTE_ARRAY (WKB) + GeoParquet metadata |
| LineString | BYTE_ARRAY (WKB) + GeoParquet metadata |
| Polygon | BYTE_ARRAY (WKB) + GeoParquet metadata |
| MultiLineString | BYTE_ARRAY (WKB) + GeoParquet metadata |
| MultiPolygon | BYTE_ARRAY (WKB) + GeoParquet metadata |
Arrays can be nested and can have a value of Nullable type as an argument. Tuple and Map types can also be nested.
Data types of ClickHouse table columns can differ from the corresponding fields of the inserted Parquet data. When inserting data, ClickHouse interprets data types according to the tables above and then casts the data to the data type set for the ClickHouse table column. E.g. a UINT_32 Parquet column can be read into an IPv4 ClickHouse column.
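
To make the fallback encodings concrete, here is a small Python sketch of how a consumer outside ClickHouse might interpret two of them: an IPv4 address written as UINT_32, and a wide integer written as a little-endian FIXED_LEN_BYTE_ARRAY. The function names are hypothetical, and the sketch assumes the UINT_32 value holds the first octet in its most significant byte (so 0x01020304 is 1.2.3.4).

```python
import ipaddress


def uint32_to_ipv4(value: int) -> ipaddress.IPv4Address:
    """Convert a UINT_32 value to an IPv4 address (assumes 0x01020304 == 1.2.3.4)."""
    return ipaddress.IPv4Address(value)


def decode_wide_int(raw: bytes, signed: bool) -> int:
    """Decode a 16- or 32-byte little-endian integer (Int128/UInt128/Int256/UInt256)."""
    if len(raw) not in (16, 32):
        raise ValueError(f"expected 16 or 32 bytes, got {len(raw)}")
    return int.from_bytes(raw, byteorder="little", signed=signed)


print(uint32_to_ipv4(0x01020304))  # 1.2.3.4

# Round trip of an Int128 value through its 16-byte little-endian form.
encoded = (-1).to_bytes(16, byteorder="little", signed=True)
print(decode_wide_int(encoded, signed=True))  # -1
```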
For some Parquet types there's no closely matching ClickHouse type. We read them as follows:
- TIME (time of day) is read as a timestamp. E.g. 10:23:13.000 becomes 1970-01-01 10:23:13.000.
- TIMESTAMP/TIME with isAdjustedToUTC=false is a local wall-clock time (year, month, day, hour, minute, second and subsecond fields in a local timezone, regardless of what specific time zone is considered local), same as SQL TIMESTAMP WITHOUT TIME ZONE. ClickHouse reads it as if it were a UTC timestamp instead. E.g. 2025-09-29 18:42:13.000 (representing a reading of a local wall clock) becomes 2025-09-29 18:42:13.000 (DateTime64(3, 'UTC') representing a point in time). If converted to String, it shows the correct year, month, day, hour, minute, second and subsecond, which can then be interpreted as being in some local timezone instead of UTC. Counterintuitively, changing the type from DateTime64(3, 'UTC') to DateTime64(3) would not help as both types represent a point in time rather than a clock reading, but DateTime64(3) would incorrectly be formatted using local timezone.
- INTERVAL is currently read as FixedString(12) with raw binary representation of the time interval, as encoded in the Parquet file.
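
Since INTERVAL values come back as 12 raw bytes, a client has to unpack them itself. The Parquet specification defines the legacy INTERVAL layout as three little-endian unsigned 32-bit integers: months, days and milliseconds. The Python sketch below decodes that layout; the names `ParquetInterval` and `decode_parquet_interval` are hypothetical, and the sketch assumes the FixedString(12) value is passed through unchanged.

```python
import struct
from typing import NamedTuple


class ParquetInterval(NamedTuple):
    months: int
    days: int
    millis: int


def decode_parquet_interval(raw: bytes) -> ParquetInterval:
    """Unpack a 12-byte Parquet INTERVAL: three little-endian uint32 fields."""
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    months, days, millis = struct.unpack("<III", raw)
    return ParquetInterval(months, days, millis)


# Hypothetical raw value: 14 months, 3 days, 5000 milliseconds.
raw = struct.pack("<III", 14, 3, 5000)
print(decode_parquet_interval(raw))
```

The three fields are independent and are not normalized against each other, so the sketch returns them as-is instead of collapsing them into a single duration.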
Geo types (GeoParquet)
ClickHouse supports reading and writing geometry columns according to the GeoParquet specification. Geometry columns are stored as BYTE_ARRAY payloads encoded in WKB (or WKT on read), with a JSON geo key in the file-level Parquet metadata describing each geometry column's encoding, geometry type and CRS.
Read behavior
On read, geometry columns are mapped to the corresponding ClickHouse geo data types:
- A column declared as Point, LineString, Polygon, MultiLineString or MultiPolygon is read into the matching ClickHouse geo type.
- A column with multiple or unknown geometry types is read into the Geometry type, which is a Variant over all supported geo types.
- If the requested column type is String, the GeoParquet metadata is ignored and the raw encoded geometry payload is returned as-is (WKB or WKT bytes, matching whichever encoding the GeoParquet column declares). This is also true if the setting input_format_parquet_allow_geoparquet_parser is set to 0.
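
The mapping rules in this section can be sketched in Python as a simplified model. It parses a GeoParquet-style geo JSON document and picks a ClickHouse type name for a column. This is an illustration of the documented rules, not ClickHouse's implementation; the function name, the sample metadata and the column names are made up, and details such as dimension suffixes ("Point Z") are ignored.

```python
import json

# Geo types that the read rules map one-to-one.
SUPPORTED = {"Point", "LineString", "Polygon", "MultiLineString", "MultiPolygon"}


def clickhouse_geo_type(geo_json: str, column: str) -> str:
    """Return the ClickHouse type name implied by a column's declared geometry types."""
    meta = json.loads(geo_json)
    declared = meta["columns"][column].get("geometry_types", [])
    if len(declared) == 1 and declared[0] in SUPPORTED:
        return declared[0]
    # Multiple, missing, or unrecognized geometry types fall back to Geometry.
    return "Geometry"


# Hypothetical file-level metadata with two geometry columns.
GEO_METADATA = json.dumps({
    "version": "1.0.0",
    "primary_column": "shape",
    "columns": {
        "shape": {"encoding": "WKB", "geometry_types": ["Polygon"]},
        "mixed": {"encoding": "WKB", "geometry_types": ["Point", "LineString"]},
    },
})

print(clickhouse_geo_type(GEO_METADATA, "shape"))  # Polygon
print(clickhouse_geo_type(GEO_METADATA, "mixed"))  # Geometry
```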
Write behavior
On write, top-level columns of type Point, LineString, Polygon, MultiLineString or MultiPolygon are encoded as BYTE_ARRAY (WKB) and the appropriate geo JSON metadata is appended to the Parquet file footer. A top-level Geometry Variant is also encoded as a WKB BYTE_ARRAY payload (its sub-values are converted to WKB and stored as a Nullable(String) column), but no geo metadata is emitted for it, so the result is not recognized as a GeoParquet geometry column on read. Other geo-related types, such as Ring, are written using their native underlying representation with no GeoParquet metadata. This behavior can be disabled entirely by setting output_format_parquet_geometadata to 0, in which case even the supported geo types are written using their native underlying representation (Point as Tuple(Float64, Float64), LineString as Array(Point), Polygon as Array(Array(Point)), etc.) and no GeoParquet metadata is emitted.
Geometry columns must appear at the root of the schema or nested inside Tuple (struct); nesting them inside Array or Map is not supported. Nullable is not supported for geo columns either.
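
For readers unfamiliar with WKB, the Python sketch below encodes and decodes a single 2D Point using the standard WKB layout: one byte-order byte, a 32-bit geometry type (1 for Point), then two 64-bit floats. The function names are hypothetical, and the choice of little-endian for encoding is an assumption of this example; the source does not state which byte order ClickHouse emits.

```python
import struct

WKB_POINT = 1  # geometry type code for Point in the WKB standard


def encode_wkb_point(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (byte order chosen for this sketch)."""
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def decode_wkb_point(raw: bytes) -> tuple:
    """Decode a 2D WKB point, honoring the leading byte-order marker."""
    if raw[0] not in (0, 1):
        raise ValueError(f"invalid WKB byte order marker: {raw[0]}")
    endian = "<" if raw[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", raw, 1)
    if geom_type != WKB_POINT:
        raise ValueError(f"not a WKB Point (type {geom_type})")
    return struct.unpack_from(endian + "dd", raw, 5)


payload = encode_wkb_point(1.5, -2.0)
print(len(payload))               # 21
print(decode_wkb_point(payload))  # (1.5, -2.0)
```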
Example usage
Inserting data
Using a Parquet file named football.parquet with the following data:
Insert the data:
Reading data
Read data using the Parquet format:
Parquet is a binary format and is not human-readable when printed to the terminal. Use the INTO OUTFILE clause to write Parquet output to a file.
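
One quick way to confirm that an output file really is Parquet, without printing its contents, is to check the 4-byte magic marker `PAR1` that the Parquet specification places at both the start and the end of a file with an unencrypted footer. A minimal Python sketch follows; the function name is hypothetical, and the check only looks at the markers without validating the footer.

```python
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(data: bytes) -> bool:
    """Check for the PAR1 marker at both ends of the file contents.

    12 bytes is the smallest conceivable size: two 4-byte markers plus
    the 4-byte footer length that precedes the trailing marker.
    """
    return (
        len(data) >= 12
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )


# Synthetic byte strings standing in for file contents.
print(looks_like_parquet(b"PAR1" + b"\x00" * 8 + b"PAR1"))  # True
print(looks_like_parquet(b"id,name\n1,foo\n"))              # False
```

For a real file, the contents could be obtained with `open(path, "rb").read()`, or by reading only the first and last four bytes for large files.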
To exchange data with Hadoop, you can use the HDFS table engine.
Format settings
| Setting | Description | Default |
|---|---|---|
| input_format_parquet_case_insensitive_column_matching | Ignore case when matching Parquet columns with ClickHouse columns. | 0 |
| input_format_parquet_preserve_order | Avoid reordering rows when reading from Parquet files. Usually makes reading much slower. | 0 |
| input_format_parquet_filter_push_down | When reading Parquet files, skip whole row groups based on the WHERE/PREWHERE expressions and min/max statistics in the Parquet metadata. | 1 |
| input_format_parquet_bloom_filter_push_down | When reading Parquet files, skip whole row groups based on the WHERE expressions and bloom filter in the Parquet metadata. | 0 |
| input_format_parquet_allow_missing_columns | Allow missing columns while reading Parquet input formats. | 1 |
| input_format_parquet_local_file_min_bytes_for_seek | Minimum number of bytes required for a local (file) read to seek instead of reading and ignoring data in the Parquet input format. | 8192 |
| input_format_parquet_enable_row_group_prefetch | Enable row group prefetching during Parquet parsing. Currently, only single-threaded parsing can prefetch. | 1 |
| input_format_parquet_skip_columns_with_unsupported_types_in_schema_inference | Skip columns with unsupported types during schema inference for the Parquet format. | 0 |
| input_format_parquet_max_block_size | Max block size for the Parquet reader. | 65409 |
| input_format_parquet_prefer_block_bytes | Average block size in bytes output by the Parquet reader. | 16744704 |
| input_format_parquet_enable_json_parsing | When reading Parquet files, parse JSON columns as ClickHouse JSON columns. | 1 |
| input_format_parquet_allow_geoparquet_parser | When reading Parquet files, recognize the GeoParquet geo metadata and decode geometry columns (WKB or WKT, per the column's declared encoding) as ClickHouse geo data types. If 0, geometry columns are exposed as their raw physical (String) representation. | 1 |
| output_format_parquet_row_group_size | Target row group size in rows. | 1000000 |
| output_format_parquet_row_group_size_bytes | Target row group size in bytes, before compression. | 536870912 |
| output_format_parquet_string_as_string | Use Parquet String type instead of Binary for String columns. | 1 |
| output_format_parquet_fixed_string_as_fixed_byte_array | Use Parquet FIXED_LEN_BYTE_ARRAY type instead of Binary for FixedString columns. | 1 |
| output_format_parquet_compression_method | Compression method for Parquet output format. Supported codecs: snappy, lz4, brotli, zstd, gzip, none (uncompressed). | zstd |
| output_format_parquet_parallel_encoding | Do Parquet encoding in multiple threads. | 1 |
| output_format_parquet_data_page_size | Target page size in bytes, before compression. | 1048576 |
| output_format_parquet_batch_size | Check page size every this many rows. Consider decreasing if you have columns with an average value size above a few KB. | 1024 |
| output_format_parquet_write_page_index | Write a page index into Parquet files. | 1 |
| output_format_parquet_geometadata | Write GeoParquet geo metadata into the Parquet file footer and encode top-level ClickHouse geo columns (Point, LineString, Polygon, MultiLineString, MultiPolygon) as WKB. If 0, those columns are written using their native underlying representation (e.g. Point as Tuple(Float64, Float64)) and no GeoParquet metadata is emitted. | 1 |
| input_format_parquet_import_nested | Obsolete setting, does nothing. | 0 |
| input_format_parquet_local_time_as_utc | Determines the data type used by schema inference for Parquet timestamps with isAdjustedToUTC=false. If true: DateTime64(..., 'UTC'), if false: DateTime64(...). Neither behavior is fully correct as ClickHouse doesn't have a data type for local wall-clock time. Counterintuitively, 'true' is probably the less incorrect option, because formatting the 'UTC' timestamp as String will produce a representation of the correct local time. | true |
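
The effect of reading a local wall-clock timestamp as UTC can be illustrated outside ClickHouse. In the Python sketch below, a value that was read with a UTC label keeps its clock fields but is re-labelled with the time zone the wall clock was actually in, which yields the true point in time. The function name is hypothetical, and the +02:00 offset is an assumption made for the example; the Parquet file itself does not carry this information.

```python
from datetime import datetime, timedelta, timezone


def reinterpret_wall_clock(utc_labeled: datetime, local_tz: timezone) -> datetime:
    """Keep the clock fields, but attach the zone the wall clock was really in."""
    return utc_labeled.replace(tzinfo=local_tz)


# Value as read: correct clock fields, but labelled as UTC.
read_value = datetime(2025, 9, 29, 18, 42, 13, tzinfo=timezone.utc)

# Assumption for this example: the wall clock was in a UTC+02:00 zone.
local_tz = timezone(timedelta(hours=2))

instant = reinterpret_wall_clock(read_value, local_tz)
print(instant)                           # 2025-09-29 18:42:13+02:00
print(instant.astimezone(timezone.utc))  # 2025-09-29 16:42:13+00:00
```

`replace(tzinfo=...)` changes only the label, which is the operation needed here, whereas `astimezone` would shift the clock fields and preserve the (incorrect) instant.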