[#4938] feat(lakehouse-paimon): Support S3 filesystem for Paimon catalog #4939
base: main
Conversation
@@ -61,6 +62,12 @@ public class PaimonCatalogPropertiesMetadata extends BaseCatalogPropertiesMetada
          AuthenticationConfig.AUTH_TYPE_KEY,
          AuthenticationConfig.AUTH_TYPE_KEY);

  private static final Map<String, String> S3_CONFIGURATION =
      ImmutableMap.of(
          PaimonS3FileSystemConfig.S3_ACCESS_KEY, PaimonS3FileSystemConfig.S3_ACCESS_KEY,
I prefer to use the properties defined in S3Properties to unify the storage configuration for Gravitino.
Please see #4939 (comment)
      throw new IllegalArgumentException("Unsupported file system type: " + type);
  }

  public static FileSystemType fromStoragePath(String storagePath) {
It seems this couldn't be extended to support other storage FileIO implementations.
let me check it.
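For illustration, a hard-coded scheme dispatch like the hypothetical sketch below (not the PR's actual code) is what makes such a method hard to extend: every new storage backend needs another branch in the source.

```java
// Hypothetical sketch of a scheme-based fromStoragePath. A switch like this
// cannot pick up a new storage/FileIO implementation without a code change,
// which is the extensibility concern raised above.
public class FileSystemTypeSketch {
  public enum FileSystemType { LOCAL_FILE, HDFS, S3 }

  public static FileSystemType fromStoragePath(String storagePath) {
    if (storagePath.startsWith("s3://") || storagePath.startsWith("s3a://")) {
      return FileSystemType.S3;
    } else if (storagePath.startsWith("hdfs://")) {
      return FileSystemType.HDFS;
    } else if (storagePath.startsWith("file://") || storagePath.startsWith("/")) {
      return FileSystemType.LOCAL_FILE;
    }
    // Any unrecognized scheme falls through to an error.
    throw new IllegalArgumentException("Unsupported file system type for path: " + storagePath);
  }
}
```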
As @FANNG1 noted, the Iceberg catalog uses the gravitino.bypass mechanism to transfer properties to the underlying storage. This gives it very good scalability and lets it support other storage systems without changing the code, so I wonder:
- Should we follow this design for the Paimon catalog to support multiple file system storages?
- Is it advantageous to adopt this strategy for the fileset catalog to support multiple storages?
I would suggest leveraging the configurations defined by @FANNG1 for Iceberg in Paimon too, so we can unify the cloud storage configuration across the different catalogs.
For other configurations that Gravitino cannot support, we can use "gravitino.bypass". But for the important configurations we should have a unified manner in Gravitino.
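The bypass idea described here can be sketched roughly as follows; the helper class and method names are illustrative assumptions, not Gravitino's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the pass-through mechanism: any property prefixed
// with "gravitino.bypass." is forwarded verbatim (prefix stripped) to the
// underlying storage, so new storage systems need no catalog code changes.
public class BypassPropertySketch {
  static final String BYPASS_PREFIX = "gravitino.bypass.";

  public static Map<String, String> extractBypassProperties(Map<String, String> catalogProps) {
    Map<String, String> storageProps = new HashMap<>();
    for (Map.Entry<String, String> e : catalogProps.entrySet()) {
      if (e.getKey().startsWith(BYPASS_PREFIX)) {
        // Strip the prefix and hand the rest straight to the storage layer.
        storageProps.put(e.getKey().substring(BYPASS_PREFIX.length()), e.getValue());
      }
    }
    return storageProps;
  }
}
```

The trade-off discussed in this thread is that such a pass-through is maximally flexible but leaves users exposed to each engine's raw key names.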
I received it and checked it before launching the PR. The keys they use are not necessarily the same:
public class S3Properties {
  // An alternative endpoint of the S3 service. This could be used for S3FileIO with any
  // S3-compatible object storage service that has a different endpoint, or to access a
  // private S3 endpoint in a virtual private cloud.
  public static final String GRAVITINO_S3_ENDPOINT = "s3-endpoint";
  // The static access key ID used to access S3 data.
  public static final String GRAVITINO_S3_ACCESS_KEY_ID = "s3-access-key-id";
  // The static secret access key used to access S3 data.
  public static final String GRAVITINO_S3_SECRET_ACCESS_KEY = "s3-secret-access-key";
  // The region of the S3 service.
  public static final String GRAVITINO_S3_REGION = "s3-region";

  private S3Properties() {}
}

public class PaimonS3FileSystemConfig extends Config {
  public static final String S3_ENDPOINT = "s3.endpoint";
  public static final String S3_ACCESS_KEY = "s3.access-key";
  public static final String S3_SECRET_KEY = "s3.secret-key";
}
Different catalogs may use different S3 properties, like s3.access-key-id in Iceberg, s3.access-key in Paimon, and XX in Hadoop, etc. Given this variety, it seems reasonable to propose a unified set of cloud properties. This approach would simplify the usage of S3 across various catalogs for normal users.
Yeah, it should be the same.
We should unify all the cloud-storage-related configurations and provide a unified way to users, hiding the differences between catalogs.
This is important; we should think about it carefully and treat it seriously.
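As a rough illustration of the unification being proposed, a translation layer could map the Gravitino-level keys onto the Paimon-specific ones. The class and method names below are hypothetical; only the property key strings come from the snippets quoted in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: accept the unified Gravitino S3 keys (from S3Properties)
// and translate them to the Paimon-specific keys (from PaimonS3FileSystemConfig),
// so users configure S3 the same way regardless of catalog.
public class S3PropertyTranslatorSketch {
  private static final Map<String, String> GRAVITINO_TO_PAIMON = new HashMap<>();

  static {
    GRAVITINO_TO_PAIMON.put("s3-endpoint", "s3.endpoint");
    GRAVITINO_TO_PAIMON.put("s3-access-key-id", "s3.access-key");
    GRAVITINO_TO_PAIMON.put("s3-secret-access-key", "s3.secret-key");
  }

  public static Map<String, String> toPaimonConfig(Map<String, String> gravitinoProps) {
    Map<String, String> paimonProps = new HashMap<>();
    for (Map.Entry<String, String> e : gravitinoProps.entrySet()) {
      String paimonKey = GRAVITINO_TO_PAIMON.get(e.getKey());
      if (paimonKey != null) {
        paimonProps.put(paimonKey, e.getValue());
      }
    }
    return paimonProps;
  }
}
```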
  }

  @AfterAll
  public void stop() {
The Spark session should be stopped here.
It should be stopped in the super stop method; let me check.
Yeah, the Spark session has already been closed in the superclass.
  @Override
  protected void startNecessaryContainers() {
    localStackContainer =
        new LocalStackContainer(DockerImageName.parse("localstack/localstack")).withServices(S3);
Any reason to use localstack for the test?
So, do you have a suggestion for which S3 simulator to use instead?
  LOCAL_FILE,
  HDFS,
  S3,
  OSS;
If we do not plan to support OSS in this PR, it would be better to remove it first to avoid confusing other users.
@FANNG1 Do you have any other suggestions?
Seems reasonable to remove OSS, since it's not supported.
    }
  }

  private static void checkS3FileSystemConfig(Map<String, String> resultConf) {
https://github.com/apache/gravitino/pull/4939/files#diff-0c775b587b52f59019d196d56c6de96c190a6320085c91ad7757e5ac8255bef5R54 aims to solve the problem that, if the user sets the key, the value should not be null or empty. If the key is not set, the check does nothing. In other words, a user who sets the key must supply a value that is neither null nor empty.
got it, I misunderstood.
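For reference, the validation being discussed might look roughly like the following sketch; the key list and method shape are assumptions based on the snippets in this thread, not the PR's exact code.

```java
import java.util.Map;

// Hypothetical sketch of checkS3FileSystemConfig: keys that are absent are
// ignored, but any key the user did set must carry a non-null, non-empty value.
public class S3ConfigCheckSketch {
  public static void checkS3FileSystemConfig(Map<String, String> resultConf) {
    for (String key : new String[] {"s3.endpoint", "s3.access-key", "s3.secret-key"}) {
      if (resultConf.containsKey(key)) {
        String value = resultConf.get(key);
        if (value == null || value.trim().isEmpty()) {
          throw new IllegalArgumentException(key + " is set but its value is null or empty");
        }
      }
    }
  }
}
```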
import org.apache.gravitino.config.ConfigEntry;
import org.apache.gravitino.connector.PropertyEntry;

public class PaimonS3FileSystemConfig extends Config {
I wonder if it is possible to merge the configurations that Iceberg also uses, to provide a common set of configurations.
The keys for S3 in Iceberg are not the same as those here, please see #4939 (comment)
    String serverPath = ITUtils.joinPath(gravitinoHome, "libs");
    String paimonCatalogPath =
        ITUtils.joinPath(gravitinoHome, "catalogs", "lakehouse-paimon", "libs");
    JdbcDriverDownloader.downloadJdbcDriver(PAIMON_S3_JAR_URL, serverPath, paimonCatalogPath);
JdbcDriverDownloader.downloadJdbcDriver
Should we adjust the method name to make it clear that it can download any jar, not just JDBC drivers?
Yeah, you are right, I need to rename it.
public enum FileSystemType {
  LOCAL_FILE,
  HDFS,
  S3,
Is the rename in S3 an atomic operation? FilesystemCatalog depends on the atomic rename to avoid commit conflicts.
https://github.com/apache/hadoop/blob/ea6e0f7cd58d0129897dfc7870aee188be80a904/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L2365-L2385
I am not sure whether it will fail when the destination path exists.
> Is the rename in S3 an atomic operation? FilesystemCatalog depends on the atomic rename to avoid commit conflicts.

I don't have much knowledge about it; let me verify this.
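As background on why this matters: a commit protocol built on rename relies on the move failing when the destination already exists, so only one of two concurrent committers can win. The sketch below (illustrative names, local filesystem only) demonstrates that pattern with java.nio; S3A's rename is implemented as copy plus delete, so it does not provide the same guarantee.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a rename-based commit. On a POSIX or HDFS-like
// filesystem, Files.move without REPLACE_EXISTING fails if the destination
// exists, which is exactly the conflict detection an atomic rename provides.
public class AtomicCommitSketch {
  // Returns true if this committer won; false if another commit already landed.
  public static boolean tryCommit(Path temp, Path committed) throws IOException {
    try {
      Files.move(temp, committed);
      return true;
    } catch (FileAlreadyExistsException e) {
      return false;
    }
  }
}
```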
What changes were proposed in this pull request?
Add support for Paimon S3 filesystem.
Note: related documents will be added in another PR.
Why are the changes needed?
For a better user experience.
Fix: #4938
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
Tested locally and with integration tests (ITs).