
[#4938] feat(lakehouse-paimon): Support S3 filesystem for Paimon catalog. #4939

Open
wants to merge 10 commits into base: main

Conversation

yuqi1129 (Contributor):

What changes were proposed in this pull request?

Add support for the S3 filesystem in the Paimon catalog.

Note: related documentation will be added in another PR.

Why are the changes needed?

For a better user experience.

Fix: #4938

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

Tested locally and with integration tests (IT).

@yuqi1129 self-assigned this on Sep 13, 2024
@@ -61,6 +62,12 @@ public class PaimonCatalogPropertiesMetadata extends BaseCatalogPropertiesMetadata
          AuthenticationConfig.AUTH_TYPE_KEY,
          AuthenticationConfig.AUTH_TYPE_KEY);

  private static final Map<String, String> S3_CONFIGURATION =
      ImmutableMap.of(
          PaimonS3FileSystemConfig.S3_ACCESS_KEY, PaimonS3FileSystemConfig.S3_ACCESS_KEY,
Contributor:

I prefer to use the properties defined in S3Properties to unify the storage configuration for Gravitino.

Contributor Author:

Please see #4939 (comment)

throw new IllegalArgumentException("Unsupported file system type: " + type);
}

public static FileSystemType fromStoragePath(String storagePath) {
Contributor:

It seems this couldn't be extended to support other storage implementations of FileIO.

Contributor Author:

let me check it.

Contributor Author:

As noticed by @FANNG1, the Iceberg catalog uses the gravitino.bypass mechanism to pass properties through to the underlying storage. This gives it very good extensibility: it can support other storage systems without code changes (a minimal sketch of this pass-through follows below). So I wonder:

  1. Should we follow this design for the Paimon catalog so it can support multiple filesystem storages?
  2. Is it advantageous to adopt this strategy for the fileset catalog to support multiple storages?

@FANNG1 @jerryshao
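
For reference, a minimal sketch of how such a prefix-based pass-through could look; the class and method names here are hypothetical, and only the gravitino.bypass. prefix comes from the discussion above:

import java.util.Map;
import java.util.stream.Collectors;

public class BypassPropertySketch {
  private static final String BYPASS_PREFIX = "gravitino.bypass.";

  // Collect properties carrying the bypass prefix, strip the prefix, and hand them to the
  // underlying storage untouched, so new storage options need no catalog code changes.
  public static Map<String, String> extractBypassProperties(Map<String, String> catalogProperties) {
    return catalogProperties.entrySet().stream()
        .filter(e -> e.getKey().startsWith(BYPASS_PREFIX))
        .collect(
            Collectors.toMap(
                e -> e.getKey().substring(BYPASS_PREFIX.length()), Map.Entry::getValue));
  }
}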

Contributor:

I would suggest leveraging the configurations defined by @FANNG1 for Iceberg in Paimon too, so that we can unify the cloud storage configuration across different catalogs.

For other configurations that Gravitino cannot cover, we can use "gravitino.bypass". But for the important configurations we should have a unified manner in Gravitino.

Contributor Author:

I was aware of this and checked it before opening the PR. The keys the two catalogs use are not the same:

public class S3Properties {
  // An alternative endpoint of the S3 service, This could be used to for S3FileIO with any
  // s3-compatible object storage service that has a different endpoint, or access a private S3
  // endpoint in a virtual private cloud
  public static final String GRAVITINO_S3_ENDPOINT = "s3-endpoint";
  // The static access key ID used to access S3 data.
  public static final String GRAVITINO_S3_ACCESS_KEY_ID = "s3-access-key-id";
  // The static secret access key used to access S3 data.
  public static final String GRAVITINO_S3_SECRET_ACCESS_KEY = "s3-secret-access-key";
  // The region of the S3 service.
  public static final String GRAVITINO_S3_REGION = "s3-region";

  private S3Properties() {}
}
public class PaimonS3FileSystemConfig extends Config {

  public static final String S3_ENDPOINT = "s3.endpoint";
  public static final String S3_ACCESS_KEY = "s3.access-key";
  public static final String S3_SECRET_KEY = "s3.secret-key";
}
FANNG1 (Contributor), Sep 18, 2024:

Different catalogs may use different S3 properties, like s3.access-key-id in Iceberg, s3.access-key in Paimon, XX in Hadoop, etc. Given this variety, it seems reasonable to propose a unified set of cloud properties; this approach would simplify the usage of S3 across the various catalogs for ordinary users.

jerryshao (Contributor), Sep 19, 2024:

Yeah, they should be the same.

We should unify all the cloud-storage-related configurations and provide a unified way to the users, hiding the differences between catalogs.

This is important; we should think about it carefully and treat it seriously.
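
To make this concrete, one possible shape of such a unified mapping is sketched below; the converter class and method names are hypothetical, and the key strings are exactly the ones quoted in this thread (the unified set also carries s3-region, which has no direct Paimon counterpart here):

import com.google.common.collect.ImmutableMap;
import java.util.HashMap;
import java.util.Map;

public class S3PropertyTranslationSketch {
  // Unified Gravitino key -> Paimon-specific key.
  private static final Map<String, String> GRAVITINO_TO_PAIMON =
      ImmutableMap.of(
          "s3-endpoint", "s3.endpoint",
          "s3-access-key-id", "s3.access-key",
          "s3-secret-access-key", "s3.secret-key");

  // Users supply the unified keys; the catalog translates them before handing them to Paimon.
  public static Map<String, String> toPaimonS3Properties(Map<String, String> gravitinoProperties) {
    Map<String, String> paimonProperties = new HashMap<>();
    GRAVITINO_TO_PAIMON.forEach(
        (gravitinoKey, paimonKey) -> {
          String value = gravitinoProperties.get(gravitinoKey);
          if (value != null) {
            paimonProperties.put(paimonKey, value);
          }
        });
    return paimonProperties;
  }
}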

}

@AfterAll
public void stop() {
Contributor:

Should we stop the Spark session here?

Contributor Author:

It should be stopped in the super class's stop method; let me check.

Contributor Author:

Yeah, the Spark session has already been closed in the super class.

  @Override
  protected void startNecessaryContainers() {
    localStackContainer =
        new LocalStackContainer(DockerImageName.parse("localstack/localstack")).withServices(S3);
Contributor:

Any reason to use LocalStack for the test?

Contributor Author:

So, do you have a suggestion for which S3 simulator to use instead?
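
For context, a rough sketch of how a LocalStack container could feed the Paimon S3 settings in the IT; the helper method is hypothetical, the property keys follow PaimonS3FileSystemConfig, and the Testcontainers calls are the standard LocalStack accessors:

import com.google.common.collect.ImmutableMap;
import java.util.Map;
import org.testcontainers.containers.localstack.LocalStackContainer;
import org.testcontainers.utility.DockerImageName;

public class LocalStackS3Sketch {
  // Start a LocalStack S3 service and expose its endpoint/credentials as Paimon catalog properties.
  public static Map<String, String> s3CatalogProperties() {
    LocalStackContainer localStack =
        new LocalStackContainer(DockerImageName.parse("localstack/localstack"))
            .withServices(LocalStackContainer.Service.S3);
    localStack.start();
    return ImmutableMap.of(
        "s3.endpoint",
        localStack.getEndpointOverride(LocalStackContainer.Service.S3).toString(),
        "s3.access-key",
        localStack.getAccessKey(),
        "s3.secret-key",
        localStack.getSecretKey());
  }
}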

LOCAL_FILE,
HDFS,
S3,
OSS;
Collaborator:

If we do not plan to support OSS in this PR, it would be better to remove it first to avoid confusing other users.

Contributor Author:

@FANNG1 Do you have any other suggestions?

Contributor:

seems reasonable to remove OSS, since it's not supported.

}
}

private static void checkS3FileSystemConfig(Map<String, String> resultConf) {

Contributor Author:

https://github.com/apache/gravitino/pull/4939/files#diff-0c775b587b52f59019d196d56c6de96c190a6320085c91ad7757e5ac8255bef5R54 aims to ensure that if a user sets one of these keys, its value must not be null or empty; if the key is not set at all, the check is skipped.

Collaborator:

got it, I misunderstood.
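
For illustration, the check described above might look roughly like this (the method body is only a sketch, not the actual diff; key names follow PaimonS3FileSystemConfig): a value is validated only when the user has actually supplied the key.

import com.google.common.base.Preconditions;
import java.util.Map;

public class S3ConfigCheckSketch {
  private static void checkS3FileSystemConfig(Map<String, String> resultConf) {
    for (String key : new String[] {"s3.endpoint", "s3.access-key", "s3.secret-key"}) {
      if (resultConf.containsKey(key)) {
        String value = resultConf.get(key);
        // Reject explicitly-set keys whose values are null or empty; unset keys are skipped.
        Preconditions.checkArgument(
            value != null && !value.isEmpty(), "The value of %s must not be null or empty", key);
      }
    }
  }
}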

import org.apache.gravitino.config.ConfigEntry;
import org.apache.gravitino.connector.PropertyEntry;

public class PaimonS3FileSystemConfig extends Config {
Collaborator:

I wonder if it is possible to merge these with the same configurations in Iceberg, to provide a common set of configurations.

Contributor Author:

The S3 keys in Iceberg are not the same as those used here; please see #4939 (comment).

    String serverPath = ITUtils.joinPath(gravitinoHome, "libs");
    String paimonCatalogPath =
        ITUtils.joinPath(gravitinoHome, "catalogs", "lakehouse-paimon", "libs");
    JdbcDriverDownloader.downloadJdbcDriver(PAIMON_S3_JAR_URL, serverPath, paimonCatalogPath);
Collaborator:

JdbcDriverDownloader.downloadJdbcDriver

Should we adjust the method name so it's clear it can download any jar, not just JDBC drivers?

Contributor Author:

Yeah, you are right, I need to rename it.
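
Something along these lines could work for the rename; the class and method names below are only illustrative, not the actual change:

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class FileDownloaderSketch {
  // Download any jar (or file) once and copy it into each destination directory.
  public static void downloadFile(String fileUrl, String... destinationDirs) throws Exception {
    String fileName = fileUrl.substring(fileUrl.lastIndexOf('/') + 1);
    Path tempFile = Files.createTempFile("download-", fileName);
    try (InputStream in = new URL(fileUrl).openStream()) {
      Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
    }
    for (String dir : destinationDirs) {
      Files.copy(tempFile, Paths.get(dir, fileName), StandardCopyOption.REPLACE_EXISTING);
    }
  }
}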

public enum FileSystemType {
LOCAL_FILE,
HDFS,
S3,
Collaborator:

Is rename in S3 an atomic operation?
FilesystemCatalog depends on atomic rename to avoid commit conflicts.


Contributor Author:

Is rename in S3 an atomic operation? FilesystemCatalog depends on atomic rename to avoid commit conflicts.

I don't have much knowledge about this; let me verify it.
