[Bug report] federation query using 2 hive metastores does not work when using gravitino #4932

Open
foryou7242 opened this issue Sep 13, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@foryou7242

Version

main branch

Describe what's wrong

I want to run federated queries across Hive metastores that live in 2 Hadoop clusters.

So we added two Hive catalogs to the metalake.

The location path shown by show create table differs from the location actually used when querying through spark-sql.

image

It seems to be an effect of the spark.sql.metastore.uris option that the spark-sql session actually uses, so I'm wondering whether it is possible to run a federated query across 2 Hive metastores.

Error message and/or stacktrace

>  show create table   portal_test_schema;
CREATE TABLE portal_test_schema (
...
  month INT,
  day INT,
  hour INT
)
PARTITIONED BY (month, day, hour)
LOCATION 'hdfs://test1/test1'
TBLPROPERTIES (
  'bucketing_version' = '2',
  'discover.partitions' = 'true',
  'input-format' = 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',

explain query

spark-sql ()> EXPLAIN show create table   portal_test_schema;
== Physical Plan ==
ShowCreateTable [createtab_stmt#0], HiveTable(org.apache.spark.sql.SparkSession@14144cc9,CatalogTable(
Database: ladp
Table: portal_test_schema
Created Time: Thu Jan 26 18:40:15 JST 2023
Last Access: UNKNOWN
Created By: Spark 2.2 or prior
Type: EXTERNAL
Provider: hive
Table Properties: [bucketing_version=2, numFilesErasureCoded=0, transient_lastDdlTime=1725947686]
Location: hdfs://test2/portal_test_schema
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Schema: root
...
),org.apache.kyuubi.spark.connector.hive.HiveTableCatalog@64cbc28e)
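
The mismatch between the two outputs above can be checked mechanically. The following is a minimal Python sketch, assuming both outputs are available as plain text. The helper names (ddl_location, plan_location, same_cluster) are hypothetical and are not part of Spark, Kyuubi, or Gravitino.

```python
import re
from urllib.parse import urlparse


def ddl_location(show_create_output: str) -> str:
    """Return the path in the LOCATION '...' clause of SHOW CREATE TABLE output."""
    match = re.search(r"^\s*LOCATION\s+'([^']+)'", show_create_output, re.MULTILINE)
    if match is None:
        raise ValueError("no LOCATION clause found")
    return match.group(1)


def plan_location(explain_output: str) -> str:
    """Return the path on the 'Location:' line of the CatalogTable in EXPLAIN output."""
    match = re.search(r"^\s*Location:\s*(\S+)", explain_output, re.MULTILINE)
    if match is None:
        raise ValueError("no Location line found")
    return match.group(1)


def same_cluster(path_a: str, path_b: str) -> bool:
    """True when both URIs share scheme and authority (for HDFS, the nameservice)."""
    a, b = urlparse(path_a), urlparse(path_b)
    return (a.scheme, a.netloc) == (b.scheme, b.netloc)


# Fragments copied from the two outputs in this report.
ddl = ddl_location("PARTITIONED BY (month, day, hour)\nLOCATION 'hdfs://test1/test1'\n")
plan = plan_location("Provider: hive\nLocation: hdfs://test2/portal_test_schema\n")
print(ddl, plan, same_cluster(ddl, plan))
```

For the values in this report the two locations point at different HDFS nameservices (test1 and test2), which is the discrepancy being reported.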

How to reproduce

gravitino branch main

Additional context

No response

@foryou7242 foryou7242 added the bug Something isn't working label Sep 13, 2024
@jerqi
Collaborator

jerqi commented Sep 14, 2024

@FANNG1
Contributor

FANNG1 commented Sep 14, 2024

@foryou7242, could you help clarify the questions below?

  1. In your environment, is there only one Hive metastore but two HDFS clusters?
  2. The main problem is that you create a table with location hdfs://hdfs1/xxx, but show create table reports the location as hdfs://hdfs2/xxx, correct?
  3. Could you share the catalog properties used when creating the Hive catalogs and the Spark configuration used with SparkSQL?
@foryou7242
Author

foryou7242 commented Sep 19, 2024

@FANNG1

  1. No, there are 2 Hive metastores and 2 HDFS clusters.
  2. Yes.

test1 cluster catalog

  • metastore.uris : thrift://test1:9083

test2 cluster catalog

  • metastore.uris : thrift://test2:9083

spark sql configuration
spark-sql --master yarn --queue batch --deploy-mode client \
 --conf spark.executor.cores=2 --conf spark.executor.instances=10 \
 --conf spark.plugins="org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin" \
 --conf spark.sql.gravitino.uri=http://gravitino.stage.com \
 --conf spark.sql.gravitino.metalake=TEST \
 --jars hdfs://test1/user/spark/application/gravitino-spark-connector-runtime-3.4_2.12-0.7.0-incubating-SNAPSHOT.jar

My suspicion is that Gravitino uses kyuubihivetable for the Hive metastore table connection.
But Kyuubi only supports connecting to one Hive metastore, which seems to be the problem. Am I right?
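
For reference, the two catalogs described above could be registered in the metalake roughly as follows. This is a sketch only: the REST path, the Accept header, and the payload fields reflect my understanding of the Gravitino catalog API and should be checked against the Gravitino documentation. The catalog names hive_test1 and hive_test2 are hypothetical, while the server URI, metalake name, and metastore URIs are the ones quoted in this comment. The requests are built but never sent.

```python
import json
from urllib.request import Request

GRAVITINO_URI = "http://gravitino.stage.com"  # from the spark-sql configuration above
METALAKE = "TEST"                             # from the spark-sql configuration above


def create_hive_catalog_request(name: str, metastore_uri: str) -> Request:
    """Build (but do not send) a request that registers one Hive catalog.

    Endpoint and field names are assumptions based on the Gravitino REST API.
    """
    payload = {
        "name": name,
        "type": "RELATIONAL",
        "provider": "hive",
        "comment": "",
        "properties": {"metastore.uris": metastore_uri},
    }
    return Request(
        url=f"{GRAVITINO_URI}/api/metalakes/{METALAKE}/catalogs",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/vnd.gravitino.v1+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical catalog names, one per Hive metastore.
catalog_requests = [
    create_hive_catalog_request("hive_test1", "thrift://test1:9083"),
    create_hive_catalog_request("hive_test2", "thrift://test2:9083"),
]
for req in catalog_requests:
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Each catalog carries its own metastore.uris property, so the two catalogs differ only in the metastore they point at.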

@FANNG1
Contributor

FANNG1 commented Sep 19, 2024

My suspicion is that Gravitino uses kyuubihivetable for the Hive metastore table connection. But Kyuubi only supports connecting to one Hive metastore, which seems to be the problem. Am I right?

The Kyuubi Hive connector can support multiple Hive metastores, because Gravitino creates a separate Kyuubi Hive connector instance for each catalog, and each instance carries its own Hive metastore URI. In the initial POC phase I tested two Hive metastores with a shared HDFS cluster, and that worked well.

Also, could you share the SQL used to create the table? Does querying data work well?
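
To illustrate the idea of one connector instance per catalog: when the Kyuubi Hive connector is used on its own, each Spark catalog is configured through catalog-scoped properties. The sketch below generates such properties for two catalogs. The catalog names hive_test1 and hive_test2 are hypothetical, the metastore URIs are the ones quoted in this thread, and whether Gravitino sets exactly these keys internally is an assumption.

```python
# Catalog implementation class, as it appears in the EXPLAIN output in this thread.
KYUUBI_HIVE_CATALOG = "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog"


def kyuubi_hive_catalog_conf(catalogs: dict[str, str]) -> dict[str, str]:
    """Map {catalog name: metastore URI} to catalog-scoped Spark properties."""
    conf = {}
    for name, metastore_uri in catalogs.items():
        prefix = f"spark.sql.catalog.{name}"
        conf[prefix] = KYUUBI_HIVE_CATALOG
        conf[f"{prefix}.hive.metastore.uris"] = metastore_uri
    return conf


# Hypothetical catalog names; URIs taken from the thread.
conf = kyuubi_hive_catalog_conf({
    "hive_test1": "thrift://test1:9083",
    "hive_test2": "thrift://test2:9083",
})
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

Because every property is scoped under its catalog name, the two metastore URIs do not overwrite each other.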

@foryou7242
Author

Also, could you share the SQL used to create the table? Does querying data work well?

The table is the same as the one in the issue description, because it is an existing table.

>  show create table   portal_test_schema;
CREATE TABLE portal_test_schema (
...
  month INT,
  day INT,
  hour INT
)
PARTITIONED BY (month, day, hour)
LOCATION 'hdfs://test1/test1'
TBLPROPERTIES (
  'bucketing_version' = '2',
  'discover.partitions' = 'true',
  'input-format' = 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
@FANNG1
Contributor

FANNG1 commented Sep 19, 2024

I set up two Hive metastores with separate HDFS clusters and couldn't reproduce this issue with the following SQL in either of the two catalogs. @foryou7242, could you try simple SQL like the following?

create table a(a int) location 'hdfs://localhost:9000/user/hive/warehouse/t1.db/a';
show create table a;
CREATE TABLE t1.a (
  a INT)
LOCATION 'hdfs://localhost:9000/user/hive/warehouse/t1.db/a'
TBLPROPERTIES (
  'input-format' = 'org.apache.hadoop.mapred.TextInputFormat',
  'output-format' = 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
  'serde-lib' = 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
  'serde-name' = 'a',
  'table-type' = 'MANAGED_TABLE',
  'transient_lastDdlTime' = '1726756779')
explain show create table a;
explain show create table a
== Physical Plan ==
ShowCreateTable [createtab_stmt#36], HiveTable(org.apache.spark.sql.SparkSession@7752f9fe,CatalogTable(
Database: t1
Table: a
Owner: hive
Created Time: Thu Sep 19 22:39:39 CST 2024
Last Access: UNKNOWN
Created By: Spark 2.2 or prior
Type: MANAGED
Provider: hive
Table Properties: [gravitino.identifier=gravitino.v1.uid1812375099371418513, owner=hive, transient_lastDdlTime=1726756779]
Location: hdfs://localhost:9000/user/hive/warehouse/t1.db/a
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider: Catalog
Schema: root
 |-- a: integer (nullable = true)
),org.apache.kyuubi.spark.connector.hive.HiveTableCatalog@61563a91)
3 participants