PySpark startup initialization logs:

```log
2025-06-18 15:43:14,147 INFO conf.HiveConf: Found configuration file file:/D:/pyspark/hive/hive-3.1.1/conf/hive-site.xml
Hive Session ID = ae15e233-5595-4035-9a63-90e6fef3164c
2025-06-18 15:43:15,369 INFO SessionState: Hive Session ID = ae15e233-5595-4035-9a63-90e6fef3164c
Logging initialized using configuration in jar:file:/D:/pyspark/hive/hive-3.1.1/lib/hive-common-3.1.1.jar!/hive-log4j2.properties Async: true
2025-06-18 15:43:15,415 INFO SessionState: Logging initialized using configuration in jar:file:/D:/pyspark/hive/hive-3.1.1/lib/hive-common-3.1.1.jar!/hive-log4j2.properties Async: true
2025-06-18 15:43:16,270 INFO session.SessionState: Created HDFS directory: /tmp/hive/aaa/ae15e233-5595-4035-9a63-90e6fef3164c
2025-06-18 15:43:16,274 INFO session.SessionState: Created local directory: D:/pyspark/hive/hive-3.1.1/data/scratch/ae15e233-5595-4035-9a63-90e6fef3164c
2025-06-18 15:43:16,277 INFO session.SessionState: Created HDFS directory: /tmp/hive/aaa/ae15e233-5595-4035-9a63-90e6fef3164c/_tmp_space.db
2025-06-18 15:43:16,287 INFO conf.HiveConf: Using the default value passed in for log id: ae15e233-5595-4035-9a63-90e6fef3164c
2025-06-18 15:43:16,287 INFO session.SessionState: Updating thread name to ae15e233-5595-4035-9a63-90e6fef3164c main
2025-06-18 15:43:17,092 INFO metastore.HiveMetaStore: 0: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
2025-06-18 15:43:17,111 WARN metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored
2025-06-18 15:43:17,114 INFO metastore.ObjectStore: ObjectStore, initialize called
2025-06-18 15:43:17,116 INFO conf.MetastoreConf: Found configuration file file:/D:/pyspark/hive/hive-3.1.1/conf/hive-site.xml
2025-06-18 15:43:17,117 INFO conf.MetastoreConf: Unable to find config file hivemetastore-site.xml
2025-06-18 15:43:17,118 INFO conf.MetastoreConf: Found configuration file null
2025-06-18 15:43:17,119 INFO conf.MetastoreConf: Unable to find config file metastore-site.xml
2025-06-18 15:43:17,119 INFO conf.MetastoreConf: Found configuration file null
2025-06-18 15:43:17,256 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
2025-06-18 15:43:17,450 INFO hikari.HikariDataSource: HikariPool-1 - Starting...
2025-06-18 15:43:17,626 INFO hikari.HikariDataSource: HikariPool-1 - Start completed.
2025-06-18 15:43:17,679 INFO hikari.HikariDataSource: HikariPool-2 - Starting...
2025-06-18 15:43:17,682 INFO hikari.HikariDataSource: HikariPool-2 - Start completed.
2025-06-18 15:43:17,799 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
2025-06-18 15:43:17,898 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL
2025-06-18 15:43:17,899 INFO metastore.ObjectStore: Initialized ObjectStore
2025-06-18 15:43:18,059 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
2025-06-18 15:43:18,060 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
2025-06-18 15:43:18,061 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
2025-06-18 15:43:18,061 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
2025-06-18 15:43:18,061 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
2025-06-18 15:43:18,062 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
2025-06-18 15:43:19,914 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
2025-06-18 15:43:19,914 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
2025-06-18 15:43:19,915 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
2025-06-18 15:43:19,915 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
2025-06-18 15:43:19,916 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
2025-06-18 15:43:19,917 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
2025-06-18 15:43:23,624 INFO metastore.HiveMetaStore: Added admin role in metastore
2025-06-18 15:43:23,626 INFO metastore.HiveMetaStore: Added public role in metastore
2025-06-18 15:43:24,075 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
2025-06-18 15:43:24,231 INFO metastore.RetryingMetaStoreClient: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=aaa (auth:SIMPLE) retries=1 delay=1 lifetime=0
2025-06-18 15:43:24,248 INFO metastore.HiveMetaStore: 0: get_all_functions
2025-06-18 15:43:24,250 INFO HiveMetaStore.audit: ugi=aaa ip=unknown-ip-addr cmd=get_all_functions
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = 69d15420-0a4b-4d2a-934e-e5662db4697f
2025-06-18 15:43:24,766 INFO SessionState: Hive Session ID = 69d15420-0a4b-4d2a-934e-e5662db4697f
2025-06-18 15:43:24,767 INFO CliDriver: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
2025-06-18 15:43:24,775 INFO session.SessionState: Created HDFS directory: /tmp/hive/aaa/69d15420-0a4b-4d2a-934e-e5662db4697f
2025-06-18 15:43:24,777 INFO session.SessionState: Created local directory: D:/pyspark/hive/hive-3.1.1/data/scratch/69d15420-0a4b-4d2a-934e-e5662db4697f
2025-06-18 15:43:24,779 INFO session.SessionState: Created HDFS directory: /tmp/hive/aaa/69d15420-0a4b-4d2a-934e-e5662db4697f/_tmp_space.db
2025-06-18 15:43:24,780 INFO metastore.HiveMetaStore: 1: get_databases: @hive#
2025-06-18 15:43:24,780 INFO HiveMetaStore.audit: ugi=aaa ip=unknown-ip-addr cmd=get_databases: @hive#
2025-06-18 15:43:24,781 INFO metastore.HiveMetaStore: 1: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
2025-06-18 15:43:24,781 INFO metastore.ObjectStore: ObjectStore, initialize called
2025-06-18 15:43:24,786 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL
2025-06-18 15:43:24,786 INFO metastore.ObjectStore: Initialized ObjectStore
2025-06-18 15:43:24,791 INFO metastore.HiveMetaStore: 1: get_tables_by_type: db=@hive#bclcredits pat=.*,type=MATERIALIZED_VIEW
2025-06-18 15:43:24,791 INFO HiveMetaStore.audit: ugi=aaa ip=unknown-ip-addr cmd=get_tables_by_type: db=@hive#bclcredits pat=.*,type=MATERIALIZED_VIEW
2025-06-18 15:43:24,796 INFO metastore.HiveMetaStore: 1: get_multi_table : db=bclcredits tbls=
2025-06-18 15:43:24,796 INFO HiveMetaStore.audit: ugi=aaa ip=unknown-ip-addr cmd=get_multi_table : db=bclcredits tbls=
2025-06-18 15:43:24,798 INFO metastore.HiveMetaStore: 1: get_tables_by_type: db=@hive#default pat=.*,type=MATERIALIZED_VIEW
2025-06-18 15:43:24,798 INFO HiveMetaStore.audit: ugi=aaa ip=unknown-ip-addr cmd=get_tables_by_type: db=@hive#default pat=.*,type=MATERIALIZED_VIEW
2025-06-18 15:43:24,799 INFO metastore.HiveMetaStore: 1: get_multi_table : db=default tbls=
2025-06-18 15:43:24,799 INFO HiveMetaStore.audit: ugi=aaa ip=unknown-ip-addr cmd=get_multi_table : db=default tbls=
2025-06-18 15:43:24,800 INFO metadata.HiveMaterializedViewsRegistry: Materialized views registry has been initialized
```
### PySpark Hive Initialization Logs and Configuration Issues
When PySpark connects to the Hive Metastore, DataNucleus-related logs and configuration problems are a common failure point. The key troubleshooting steps are:
---
#### 1. **Core configuration checks**
- **`hive-site.xml` location**
  It must be on Spark's configuration path:
  `$SPARK_HOME/conf/hive-site.xml`
  Confirm it contains the Metastore connection settings:
```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/metastore_db</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <!-- with MySQL Connector/J 8.x, use com.mysql.cj.jdbc.Driver instead -->
  <value>com.mysql.jdbc.Driver</value>
</property>
```
- **Driver JAR dependency**
  Put the database driver JAR (e.g. `mysql-connector-java.jar`) into:
  `$SPARK_HOME/jars/`
  Otherwise you get a `No suitable driver found` error[^1].
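To confirm quickly which connection settings Spark will actually see, the `javax.jdo.option.*` properties can be pulled out of `hive-site.xml` with the standard library (a minimal sketch; the helper name is my own, and the file path must point at your actual install):

```python
import xml.etree.ElementTree as ET

def metastore_props(path):
    """Collect the javax.jdo.option.* properties from a hive-site.xml file."""
    props = {}
    for prop in ET.parse(path).getroot().iter("property"):
        name = prop.findtext("name", default="")
        if name.startswith("javax.jdo.option."):
            props[name] = prop.findtext("value", default="")
    return props
```

Run it against `$SPARK_HOME/conf/hive-site.xml` and compare the reported URL, driver class, and credentials with what the database actually expects.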
---
#### 2. **DataNucleus log analysis**
Add a logging option to `spark-submit`:
```bash
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///path/to/log4j.properties"
```
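A matching `log4j.properties` that turns up metastore and DataNucleus output might look like this (a sketch; the logger names are the ones that appear in the log above):

```properties
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n

# Verbose output for the metastore client and the DataNucleus ORM layer
log4j.logger.org.apache.hadoop.hive.metastore=DEBUG
log4j.logger.DataNucleus=DEBUG
```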
**Typical error example**:
```log
ERROR DataNucleus.Persistence: Error creating validator
javax.jdo.JDOFatalInternalException: Found 1 persistable classes but they are not enhanced
```
**Cause**: Hive's JDO persistence classes were not bytecode-enhanced by DataNucleus; this usually points to mismatched or duplicate `datanucleus-*.jar` versions on the classpath rather than a schema problem.
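Since this failure usually traces back to mixed `datanucleus-*.jar` versions, a quick scan of the jars directory makes clashes visible (a minimal sketch; the function name and the directory you pass in are assumptions):

```python
import os
import re
from collections import defaultdict

def datanucleus_versions(jars_dir):
    """Group datanucleus-*.jar files by artifact name so version clashes stand out."""
    pattern = re.compile(r"(datanucleus-[a-z-]+)-(\d[\w.]*)\.jar$")
    found = defaultdict(list)
    for name in sorted(os.listdir(jars_dir)):
        m = pattern.match(name)
        if m:
            found[m.group(1)].append(m.group(2))
    return dict(found)
```

Point it at `$SPARK_HOME/jars`; any artifact that lists two versions is a likely culprit.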
---
#### 3. **Metastore database initialization**
Initialize the Hive metastore schema:
```bash
$HIVE_HOME/bin/schematool -dbType mysql -initSchema
```
Output on success:
```log
Metastore schema initialized to 3.1.0
```
---
#### 4. **SparkSession configuration**
Make sure Hive support is enabled:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("HiveExample") \
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .enableHiveSupport() \
    .getOrCreate()  # enableHiveSupport() is the key setting

spark.sql("SHOW DATABASES").show()
```
---
#### 5. **Common problems and fixes**
| Symptom | Fix |
|----------|----------|
| `ClassNotFoundException: org.datanucleus.*` | Put the `datanucleus-*.jar` files into `$SPARK_HOME/jars` |
| `Hive metastore database is not initialized` | Run `schematool -initSchema` |
| `Permission denied` | Check HDFS directory permissions: `hdfs dfs -chmod -R 777 /user/hive` |
| `Connection refused` | Verify the Metastore service is listening: `netstat -tuln \| grep 9083` |
> **Note**: With Hive support enabled, Spark 3.x falls back to an embedded Derby database by default; an external Metastore must be configured explicitly[^2].
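For the `Connection refused` row, when `netstat` is not available (e.g. on Windows, as in the log above), a plain TCP probe against the Metastore port works too (a minimal sketch; 9083 is the default Thrift metastore port, and the hostname is an assumption):

```python
import socket

def metastore_reachable(host, port=9083, timeout=3.0):
    """Return True if a TCP connection to the metastore port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns `False` for your metastore host, fix connectivity (service down, firewall, wrong host/port) before digging into Hive configuration.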