
Optimizing PySpark Job with Large Parquet Data and High Disk Usage


I’m currently working on optimizing a PySpark job that involves a couple of aggregations across large datasets. I’m fairly new to processing large-scale data and am encountering issues with disk usage and job efficiency. Here are the details:

Chosen cluster:

•   Worker Nodes: 6
•   Cores per Worker: 48
•   Memory per Worker: 384 GB

Data:

•   Table A: 158 GB
•   Table B: 300 GB
•   Table C: 32 MB

Process:

1.  Read the DataFrames from the Delta tables.
2.  Perform a broadcast join between Table B and the small Table C.
3.  Join the resulting DataFrame with Table A on three columns: id, family, and part_id.
4.  Upsert the final result into the destination (a simplified sketch of the whole job follows this list).
5.  The destination table is partitioned by id, family, and date.
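
For reference, here is a simplified sketch of the job. The table paths, the join key between Table B and Table C (part_id), and the exact merge condition are placeholders, since I have trimmed down the real code:

```python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Placeholder paths -- not the real Delta locations.
df_a = spark.read.format("delta").load("/mnt/delta/table_a")   # ~158 GB
df_b = spark.read.format("delta").load("/mnt/delta/table_b")   # ~300 GB
df_c = spark.read.format("delta").load("/mnt/delta/table_c")   # ~32 MB

# Step 2: broadcast the 32 MB table so this join avoids shuffling Table B.
b_enriched = df_b.join(F.broadcast(df_c), on="part_id", how="inner")

# Step 3: join with Table A on the three key columns (this is the expensive shuffle).
joined = b_enriched.join(df_a, on=["id", "family", "part_id"], how="inner")

# Steps 4-5: MERGE (upsert) into the destination Delta table,
# which is partitioned by id, family, date.
target = DeltaTable.forPath(spark, "/mnt/delta/destination")
(
    target.alias("t")
    .merge(joined.alias("s"), "t.id = s.id AND t.family = s.family AND t.date = s.date")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```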

The only thing that comes to mind is switching the cluster to more disk-optimized instances. My question: how should I interpret the Storage tab in the Spark UI, and how can I use it to figure out how to optimize this job?
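
For context, this is the kind of tuning I have been considering so far (the shuffle partition count is just a guess based on 6 workers × 48 cores = 288 cores; none of these values are validated):

```python
# Rough starting points only.
spark.conf.set("spark.sql.adaptive.enabled", "true")                      # let AQE re-plan joins at runtime
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # merge tiny post-shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")             # split skewed partitions in the big join
spark.conf.set("spark.sql.shuffle.partitions", "1152")                    # a guess: ~4x the 288 total cores

# Check that the broadcast join and the big sort-merge join are planned as expected.
joined.explain(mode="formatted")
```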
