When z-order optimizing, keep partition in only one row_group (if possible) #2769

Open
deanm0000 opened this issue Aug 13, 2024 · 1 comment
Labels
binding/python Issues for the Python package enhancement New feature or request

Comments

@deanm0000

Description

I'd like to see z-order optimization extended to the row_group level: a row_group should end when the next partition won't fit in it, rather than splitting that partition across two row_groups. As it is now, if I query the table for such a partition value, the reader has to read 2 row_groups instead of just 1.

I don't have an MRE for the data setup but here's a demo with my own data behind it.

from deltalake import DeltaTable, WriterProperties
import pyarrow.parquet as pq
import fsspec

dt_path = ...
abfs = fsspec.filesystem(...)  # fsspec itself is a module; arguments elided as in the original
dt = DeltaTable(dt_path)
dt.optimize.z_order(
    ["node_id"],
    writer_properties=WriterProperties(compression="ZSTD"),
)

# Inspect the row-group statistics of the first data file.
dtfile = dt.files()[0]
with abfs.open(f"{dt_path}/{dtfile}", "rb") as ff:
    # The Parquet footer is read while the file is still open.
    metadata = pq.ParquetFile(ff).metadata

stats = []
for rg in range(metadata.num_row_groups):
    # Column index 3 is the node_id column in this table.
    col_stats = metadata.row_group(rg).column(3).statistics
    stats.append({
        "rg": rg,
        "min_node": col_stats.min,
        "max_node": col_stats.max,
        "num_values": col_stats.num_values,
    })
stats[0:5]  # Just first 5 for brevity
[{'rg': 0, 'min_node': 1, 'max_node': 49202, 'num_values': 1048576},
 {'rg': 1, 'min_node': 49202, 'max_node': 49636, 'num_values': 1048576},
 {'rg': 2, 'min_node': 49636, 'max_node': 50496, 'num_values': 1048576},
 {'rg': 3, 'min_node': 50496, 'max_node': 52458, 'num_values': 1048576},
 {'rg': 4, 'min_node': 52458, 'max_node': 1048072, 'num_values': 1048576}]

Notice how the max_node in each row_group is the min_node of the next row_group. That means the rows for that node span two row_groups, so if I query that node, the reader has to download 2 row_groups instead of just 1.
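
To make that check repeatable, here is a minimal sketch that scans a stats list shaped like the one above and reports every value shared by two adjacent row groups. The helper name `straddling_values` is made up for illustration, and it assumes the file is sorted by node_id so that only neighbouring row groups can overlap.

```python
def straddling_values(stats):
    """Return (value, first_rg, second_rg) for each value that ends one
    row group and also starts the next one."""
    shared = []
    for prev, nxt in zip(stats, stats[1:]):
        # Equal max/min means rows for that value sit in both row groups.
        if prev["max_node"] == nxt["min_node"]:
            shared.append((prev["max_node"], prev["rg"], nxt["rg"]))
    return shared


# The five row groups quoted in this issue.
stats = [
    {"rg": 0, "min_node": 1, "max_node": 49202, "num_values": 1048576},
    {"rg": 1, "min_node": 49202, "max_node": 49636, "num_values": 1048576},
    {"rg": 2, "min_node": 49636, "max_node": 50496, "num_values": 1048576},
    {"rg": 3, "min_node": 50496, "max_node": 52458, "num_values": 1048576},
    {"rg": 4, "min_node": 52458, "max_node": 1048072, "num_values": 1048576},
]

print(straddling_values(stats))
# [(49202, 0, 1), (49636, 1, 2), (50496, 2, 3), (52458, 3, 4)]
```

Every boundary in the sample straddles a node, so four of the node values shown here cost two row-group reads each.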

It'd be better if the first row group stopped at 49201 (or whatever came before 49202) and then 49202 was solely in the second rg.
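
A rough sketch of the requested behaviour, written against pyarrow rather than the delta-rs writer (this is not how deltalake does it today, just an illustration). It assumes the table is already sorted by the key column. `boundary_aligned_slices` and `write_aligned` are hypothetical names; the only pyarrow calls used are `pq.ParquetWriter`, `write_table`, and `Table.slice`, relying on the fact that each `write_table` call starts a new row group.

```python
def boundary_aligned_slices(values, max_rows):
    """Yield (start, length) row ranges over sorted `values` such that no
    run of equal keys is split, unless a single run exceeds max_rows."""
    n = len(values)
    start = 0      # first row of the row group being built
    run_start = 0  # first row of the current run of equal keys
    for i in range(1, n + 1):
        if i == n or values[i] != values[run_start]:
            # The run [run_start, i) just ended. If it does not fit in the
            # current group, close the group right before this run.
            if i - start > max_rows and run_start > start:
                yield start, run_start - start
                start = run_start
            run_start = i
    if n > start:
        yield start, n - start


def write_aligned(table, key, path, max_rows):
    """Write a key-sorted pyarrow Table with row groups aligned to key runs."""
    # Imported here so the slicing logic above works without pyarrow installed.
    import pyarrow.parquet as pq

    keys = table[key].to_pylist()
    with pq.ParquetWriter(path, table.schema) as writer:
        for start, length in boundary_aligned_slices(keys, max_rows):
            # A slice longer than max_rows (one oversized key) is still
            # split by pyarrow into several row groups.
            writer.write_table(table.slice(start, length), row_group_size=max_rows)


keys = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4]
print(list(boundary_aligned_slices(keys, max_rows=5)))
# [(0, 5), (5, 5)]
```

With a limit of 5 rows, keys 1 and 2 share the first row group and keys 3 and 4 share the second, so no key appears in two row groups. The trade-off is that row groups come out slightly smaller than the limit.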

Use Case
Faster, more efficient queries of nodes that would otherwise be straddling 2 row_groups.

Related Issue(s)
unknown, maybe page index?

@deanm0000 deanm0000 added the enhancement New feature or request label Aug 13, 2024
@rtyler rtyler added the binding/python Issues for the Python package label Aug 13, 2024
@deanm0000
Author

I made this, which does the above with pyarrow, although just one partition at a time. It's on PyPI.
