
On improving the efficiency of CLP compression and decompression #501

Open
lihaoZhang1234 opened this issue Jul 30, 2024 · 2 comments
Labels: enhancement (New feature or request)

@lihaoZhang1234

Request

Why can't we fix the templates during compression and then compress according to those templates? This would improve the efficiency of CLP's compression and decompression. For example, when an enterprise compresses the daily logs of the same application, the templates are essentially fixed, so there is no need to extract them repeatedly; the logs could be compressed and decompressed directly against the known templates, which would save compression time and computational resources.

Possible implementation

Extract templates from an application's logs once, prior to compression; afterwards, compress the application's daily logs directly against that fixed template set.
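
A minimal sketch of the proposed workflow, assuming a hypothetical template format in which "<*>" marks a variable position; the template strings and helper functions below are invented for illustration and are not CLP's internal representation:

```python
import re
import json

# Hypothetical template set, extracted offline from historical logs of the
# same application. "<*>" marks a variable position.
TEMPLATES = [
    "Connected to <*> on port <*>",
    "Request <*> completed in <*> ms",
]

# Pre-compile each template into a regex that captures the variable values.
_COMPILED = [
    (tid, re.compile("^" + re.escape(t).replace(re.escape("<*>"), "(.+?)") + "$"))
    for tid, t in enumerate(TEMPLATES)
]

def encode_line(line: str):
    """Encode a log line as (template id, variable values); None if no template matches."""
    for tid, pattern in _COMPILED:
        m = pattern.match(line)
        if m:
            return {"id": tid, "vars": list(m.groups())}
    return None  # unseen message type; would need some fallback handling

def decode_line(record) -> str:
    """Reconstruct the original line from a template id and its variables."""
    text = TEMPLATES[record["id"]]
    for value in record["vars"]:
        text = text.replace("<*>", value, 1)
    return text

if __name__ == "__main__":
    line = "Request 42 completed in 317 ms"
    enc = encode_line(line)
    print(json.dumps(enc))          # {"id": 1, "vars": ["42", "317"]}
    assert decode_line(enc) == line
```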

@jackluo923 (Member) commented Jul 30, 2024

It is technically possible to use a fixed "template" during compression, but we don't have much motivation to do so. Generating "templates" on the fly in CLP is extremely cheap. Sharing a "template" introduces additional complexity, such as managing shared templates and handling new templates discovered in the logs, which may outweigh the benefits of using a fixed template. Keeping everything self-contained, with an independent template for each archive, keeps the design simple and makes compression and search embarrassingly parallelizable, with no data dependencies across archives.
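
For illustration, here is a rough sketch of what "no data dependencies across archives" means in practice: each batch of logs is compressed into its own archive, with its own templates, so batches can be processed in any order and in parallel. The directory names are made up, and the `clp c <archives-dir> <input-path>` invocation should be checked against your CLP build.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

# Hypothetical daily log batches; each one becomes an independent archive.
DAILY_BATCHES = ["logs/2024-07-28", "logs/2024-07-29", "logs/2024-07-30"]

def compress_batch(path: str) -> str:
    # Each run builds its own template/variable dictionaries from scratch,
    # so there is no shared state between archives and no required ordering.
    archive_dir = f"{path}-archive"
    subprocess.run(["clp", "c", archive_dir, path], check=True)
    return archive_dir

with ThreadPoolExecutor() as pool:
    for archive in pool.map(compress_batch, DAILY_BATCHES):
        print("wrote", archive)
```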

That said, we have internally experimented with dictionary pre-training (an improved version of the "fixed template" idea). In this approach, compression dictionaries are pre-trained on one dataset and then reused across multiple datasets; only the "delta" (new dictionary entries) needs to be saved in each archive. IIRC, as expected, there was no noticeable performance gain, only a compression-ratio gain. If customers have a strong need to achieve maximum compression ratio (we have many options and tuning knobs for achieving a higher compression ratio) and the additional complexity involved is justifiable, then we can consider moving this experimental feature into production code.
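
For concreteness, here is a toy illustration of the "pre-trained dictionary plus delta" idea described above; the data structures and function names are invented for the example and are not CLP's actual dictionary format:

```python
def compress_templates(templates_in_logs, pretrained):
    """Encode templates against a shared pre-trained dictionary.

    Returns (encoded_ids, delta): the archive only needs to persist `delta`,
    i.e. templates that were not already in the pre-trained dictionary.
    """
    table = dict(pretrained)   # shared entries, shipped once, reused by many archives
    delta = {}                 # new entries discovered in this batch of logs
    ids = []
    for t in templates_in_logs:
        if t not in table:
            table[t] = len(table)
            delta[t] = table[t]
        ids.append(table[t])
    return ids, delta

# Example: two templates already known, one new one ends up in the delta.
pretrained = {"Connected to <*>": 0, "Request <*> completed in <*> ms": 1}
ids, delta = compress_templates(
    ["Connected to <*>", "Disk <*> is full", "Request <*> completed in <*> ms"],
    pretrained,
)
print(ids)    # [0, 2, 1]
print(delta)  # {"Disk <*> is full": 2}
```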

@lihaoZhang1234 (Author)

Will this improved fixed-template method be open-sourced in the future? And does the fixed-template method increase compression and decompression time as the compression ratio goes up?
