{"id":1394,"date":"2025-09-07T17:35:42","date_gmt":"2025-09-07T17:35:42","guid":{"rendered":"https:\/\/dataforma.tech\/?p=1394"},"modified":"2025-09-07T17:35:44","modified_gmt":"2025-09-07T17:35:44","slug":"otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks","status":"publish","type":"post","link":"https:\/\/dataforma.tech\/en\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/","title":{"rendered":"Optimizing Memory Management in PySpark: A Practical Guide for Databricks Notebooks"},"content":{"rendered":"<figure class=\"wp-block-image size-large\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"584\" src=\"https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-4-1024x584.png\" alt=\"\" class=\"wp-image-1399\" srcset=\"https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-4-1024x584.png 1024w, https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-4-300x171.png 300w, https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-4-768x438.png 768w, https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-4-18x10.png 18w, https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-4.png 1300w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Anyone who&#039;s worked with PySpark knows how frustrating it is when an application crashes due to lack of memory. You&#039;re there, processing data peacefully, when suddenly everything stops and that error message no one wants to see appears. It turns out that memory management in PySpark can reduce resource consumption by up to 2x when you know what you&#039;re doing.<\/p>\n\n\n\n<p>Until recently, figuring out memory issues in distributed PySpark applications was like looking for a needle in a haystack, especially when the problem was in the Spark executors. 
You were left guessing which part of the code was consuming all the available memory.<\/p>\n\n\n\n<p>Fortunately, since Databricks Runtime 12.0, this headache has been greatly reduced with the arrival of memory analysis tools directly on the executors.<\/p>\n\n\n\n<p>Today, we have access to memory profiling tools that show exactly which lines of code in your UDFs are hogging resources. This completely changes how we can make the necessary improvements.<\/p>\n\n\n\n<p>There is also Adaptive Query Execution, known as AQE, which came with Spark 3.x and helps improve query execution plans while they are running, using the statistics it collects.<\/p>\n\n\n\n<p>The question remains: how can we use these tools in a practical way to improve the performance of our notebooks in Databricks?<\/p>\n\n\n\n<p>We&#039;ll see how to better organize your data layout, how to manage shuffles without breaking everything, how to solve those annoying skew issues, and how to use cache in a way that really makes a difference in your application&#039;s performance.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Organize Data and Write Better in Delta Tables<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1024\" height=\"599\" src=\"https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-5-1024x599.png\" alt=\"\" class=\"wp-image-1400\" srcset=\"https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-5-1024x599.png 1024w, https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-5-300x176.png 300w, https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-5-768x450.png 768w, https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-5-18x12.png 18w, https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-5.png 1300w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Image Source:\u00a0<a href=\"https:\/\/bigdataboutique.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">BigData 
Boutique<\/a><\/figcaption><\/figure>\n\n\n\n<p>You&#039;ve probably experienced that situation where a query that should be fast ends up taking forever. Often the culprit is &quot;small files&quot; in Delta Lake. When you&#039;re writing data frequently, especially in streams or batch updates, thousands of small files end up creating a mess in the metadata and slowing down your queries.<\/p>\n\n\n\n<p>Fortunately, Databricks has some very useful features to solve this data layout issue in Delta tables.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">delta.autoOptimize.<a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/synapse-analytics\/spark\/optimize-write-for-apache-spark\" target=\"_blank\" rel=\"noreferrer noopener\">optimizeWrite<\/a>&nbsp;and delta.autoOptimize.autoCompact<\/h3>\n\n\n\n<p>There is a feature called&nbsp;<strong>optimizeWrite<\/strong>&nbsp;that reduces the number of files written, making each file larger during write operations. The idea is simple: it takes several small writes to the same partition and combines them into a single operation before executing, creating larger, more efficient files.<\/p>\n\n\n\n<p>To enable this functionality, you have two options:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>At the table level:\u00a0<code>delta.autoOptimize.optimizeWrite = true<\/code><\/li>\n\n\n\n<li>At the session level:\u00a0<code>spark.databricks.delta.optimizeWrite.enabled = true<\/code><\/li>\n<\/ul>\n\n\n\n<p>The&nbsp;<strong>autoCompact<\/strong>&nbsp;feature works as a complement, automatically running a small optimize command after each write operation. 
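<\/p>\n\n\n\n<p>As a concrete illustration, both settings can be switched on together at the table level. A minimal sketch, assuming a Delta table named events (the name is just a placeholder):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><em># Enable optimizeWrite and autoCompact for a single Delta table<\/em>\nspark.sql(&quot;&quot;&quot;\n    ALTER TABLE events SET TBLPROPERTIES (\n        &#039;delta.autoOptimize.optimizeWrite&#039; = &#039;true&#039;,\n        &#039;delta.autoOptimize.autoCompact&#039; = &#039;true&#039;\n    )\n&quot;&quot;&quot;)<\/code><\/pre>\n\n\n\n<p>With autoCompact enabled, Delta then inspects the files it has just written. 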
Basically, it takes data from files that are under a certain size and merges it into a larger file, immediately after the write completes successfully.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The&nbsp;<a href=\"https:\/\/docs.databricks.com\/aws\/en\/delta\/tune-file-size\" target=\"_blank\" rel=\"noreferrer noopener\">delta.targetFileSize<\/a>&nbsp;setting for file size control<\/h3>\n\n\n\n<p>To adjust the size of the files in your Delta tables, you can configure the property&nbsp;<code>delta.targetFileSize<\/code>&nbsp;to whatever size makes the most sense for your workload. Once you set this property, all layout optimization operations will do their best to generate files of the specified size.<\/p>\n\n\n\n<p>Databricks itself automatically adjusts the file size based on the table size:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>For tables smaller than 2.56 TB: 256 MB<\/li>\n\n\n\n<li>For tables between 2.56 TB and 10 TB: grows linearly from 256 MB to 1 GB<\/li>\n\n\n\n<li>For tables larger than 10 TB: 1 GB<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Use ZORDER to keep related data in the same place<\/h3>\n\n\n\n<p>Z-Ordering is an interesting technique that places related information in the same set of files. This organization is automatically used by Delta Lake&#039;s data-skipping algorithms, drastically reducing the amount of data that needs to be read during queries.<\/p>\n\n\n\n<p>To apply Z-Ordering, you specify the columns in the OPTIMIZE command:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>OPTIMIZE events\nWHERE date &gt;= current_timestamp() - INTERVAL 1 day\nZORDER BY (eventType)<\/code><\/pre>\n\n\n\n<p>This technique works well for columns you frequently use in filters and that have high cardinality, meaning many different values. 
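<\/p>\n\n\n\n<p>The same maintenance can be scripted from a notebook. A minimal sketch, reusing the events table from the example above, with a 128 MB target chosen purely for illustration:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><em># Pin a target file size, then rewrite the layout with Z-Ordering<\/em>\nspark.sql(&quot;ALTER TABLE events SET TBLPROPERTIES (&#039;delta.targetFileSize&#039; = &#039;128mb&#039;)&quot;)\nspark.sql(&quot;OPTIMIZE events ZORDER BY (eventType)&quot;)<\/code><\/pre>\n\n\n\n<p>Both statements are plain SQL, so they also work in a %sql cell. 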
You can even specify multiple columns for ZORDER BY by separating them with commas, but the effectiveness decreases as you add more columns.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Shuffles and Partitions: How to Prevent Data Spills<\/h2>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/wsstgprdphotosonic01.blob.core.windows.net\/photosonic\/b5594843-1538-4fa3-83e4-607df027259e.WEBP?st=2025-09-07T17%3A21%3A27Z&amp;se=2025-09-14T17%3A21%3A27Z&amp;sp=r&amp;sv=2025-07-05&amp;sr=b&amp;sig=YUfEkJdYlY\/apBbcPVMxYqd1qus9iOMHc7qVuUZOE%2Bo%3D\" alt=\"Flowchart explaining Apache Spark shuffling, showing wide transformations with shuffling and narrow transformations without shuffling.\"\/><\/figure>\n\n\n\n<p><sub>Image Source:&nbsp;<a href=\"https:\/\/www.linkedin.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">LinkedIn<\/a><\/sub><\/p>\n\n\n\n<p>Shuffle in PySpark is one of those things that can completely destroy your application&#039;s performance. Whenever you perform joins, aggregations, or sorts, Spark needs to reorganize the data across the cluster&#039;s nodes. When this isn&#039;t done correctly, data starts spilling to disk, slowing everything down.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">spark.sql.shuffle.partitions: adjusting by hand according to data volume<\/h3>\n\n\n\n<p>Spark is configured by default to use&nbsp;<a href=\"https:\/\/www.databricks.com\/blog\/2020\/10\/21\/faster-sql-adaptive-query-execution-in-databricks.html\" target=\"_blank\" rel=\"noreferrer noopener\">200 partitions during shuffle operations<\/a>. This number is almost never ideal for what you&#039;re doing. If you use too few partitions, each task will process too much data and can overflow memory, causing that annoying spill to disk. 
On the other hand, if you use too many partitions, you end up creating very small tasks that spend more time on overhead than processing data.<\/p>\n\n\n\n<p>To adjust this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>spark.conf.set(&quot;spark.sql.shuffle.partitions&quot;, 50) <em># Example for 5GB of data on a 10-core cluster<\/em><\/code><\/pre>\n\n\n\n<p>There&#039;s a rule of thumb that works well: each task should process between 128MB and 200MB of data. You can calculate the ideal number of partitions by dividing the total volume of data that will be shuffled by this value. Another option is to set it to 2 to 3 times the total number of CPU cores you have available.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">spark.sql.adaptive.autoOptimizeShuffle.preshufflePartitionSizeInBytes<\/h3>\n\n\n\n<p>Databricks has created a feature called Auto-Optimized Shuffle (AOS) that automatically tries to figure out the optimal number of partitions:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>spark.conf.set(&quot;spark.databricks.adaptive.autoOptimizeShuffle.enabled&quot;, &quot;true&quot;)<\/code><\/pre>\n\n\n\n<p>However, this functionality has its limitations. 
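<\/p>\n\n\n\n<p>When the automatic estimate is off, the rule of thumb above gives you a manual fallback. A small sketch, assuming a 128MB-per-task target (the low end of the 128MB to 200MB guideline):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import math\n\ndef ideal_shuffle_partitions(total_shuffle_bytes, target_bytes=128 * 1024 * 1024):\n    <em># One partition per ~128MB of shuffled data, never fewer than one<\/em>\n    return max(1, math.ceil(total_shuffle_bytes \/ target_bytes))\n\nnum_partitions = ideal_shuffle_partitions(5 * 1024**3)  <em># 5GB of shuffle data -&gt; 40<\/em>\nspark.conf.set(&quot;spark.sql.shuffle.partitions&quot;, num_partitions)<\/code><\/pre>\n\n\n\n<p>This manual value is especially useful in the situations where AOS tends to get it wrong. 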
When you work with tables that have&nbsp;<a href=\"https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide\" target=\"_blank\" rel=\"noreferrer noopener\">exceptionally high compression ratios (20x to 40x)<\/a>, AOS can make a serious mistake in estimating the number of partitions needed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">AQE for automatic adjustment of shuffle partitions<\/h3>\n\n\n\n<p>Adaptive Query Execution is enabled by default since Apache Spark 3.2.0 and one of the most useful things it does is dynamically adjust the number of shuffle partitions.<\/p>\n\n\n\n<p>To configure AQE:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>spark.conf.set(&quot;spark.sql.adaptive.enabled&quot;, &quot;true&quot;)\nspark.conf.set(&quot;spark.sql.adaptive.coalescePartitions.enabled&quot;, &quot;true&quot;)\nspark.conf.set(&quot;spark.sql.adaptive.advisoryPartitionSizeInBytes&quot;, &quot;67108864&quot;)  <em># 64MB<\/em><\/code><\/pre>\n\n\n\n<p>What happens is that AQE analyzes the actual data size after the shuffle and adjusts the number of partitions on the fly. Imagine you have a 10GB DataFrame with 200 original partitions\u2014AQE can dynamically shrink it to 50 well-balanced partitions during runtime, improving performance without you having to do anything.<\/p>\n\n\n\n<p>AQE can also split skewed partitions into smaller chunks, avoiding bottlenecks that cause one task to take much longer than the others and cause spillage. 
This is especially valuable when you&#039;re working with data with unpredictable characteristics.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Identify and Correct Skew and Data Explosion<\/h2>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/wsstgprdphotosonic01.blob.core.windows.net\/photosonic\/35e871cb-1255-4644-b721-f9afaa4067e8.WEBP?st=2025-09-07T17%3A21%3A29Z&amp;se=2025-09-14T17%3A21%3A29Z&amp;sp=r&amp;sv=2025-07-05&amp;sr=b&amp;sig=6TTvm3C4FiIVdX5FIWIOYcE2uT5jBmh032k\/4pXdc7g%3D\" alt=\"Diagram showing six Apache Spark optimization techniques for data processing including caching, partitioning, and adaptive query execution.\"\/><\/figure>\n\n\n\n<p><sub>Image Source:&nbsp;<a href=\"https:\/\/vlinkinfo.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">VLink Inc.<\/a><\/sub><\/p>\n\n\n\n<p>One of the most annoying things about distributed processing is data skew. This is where data becomes unbalanced across Spark partitions, causing some tasks to process much more data than others.<\/p>\n\n\n\n<p>You might be there, thinking everything is going smoothly, when suddenly one task takes forever to finish while the others are long gone.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to find skew using Spark UI<\/h3>\n\n\n\n<p>Spark UI has become my best friend for tracking down these issues. 
When you suspect skew, you go straight to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>In the\u00a0<strong>Stages<\/strong>\u00a0tab, tasks that take much longer than average<\/li>\n\n\n\n<li>In\u00a0<strong>Summary Metrics<\/strong>, huge differences between the percentiles<\/li>\n\n\n\n<li>A partition is considered problematic when it is\u00a0<a href=\"https:\/\/spark.apache.org\/docs\/latest\/sql-performance-tuning.html\" target=\"_blank\" rel=\"noreferrer noopener\">greater than 5 times the median and exceeds 256MB<\/a><\/li>\n<\/ul>\n\n\n\n<p>The tip is to keep an eye on skewness metrics to identify bottlenecks. When the distribution is healthy, the values are similar across all percentiles. But when there&#039;s skew, you see a dramatic difference between the 75th percentile and the maximum value.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Resolve skew with hints and salting<\/h3>\n\n\n\n<p>Since Spark 3.0, AQE can automatically handle joins that have skew when you set&nbsp;<code>spark.sql.adaptive.enabled=true<\/code>&nbsp;and&nbsp;<code>spark.sql.adaptive.skewJoin.enabled=true<\/code>.<\/p>\n\n\n\n<p>But there are cases it can&#039;t solve on its own. 
Then you need to intervene:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Use explicit skew hints to warn Spark about problematic columns<\/li>\n\n\n\n<li>Apply the \u201csalting\u201d technique \u2013 which basically adds a bit of randomness to the keys:<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>from pyspark.sql.functions import col, concat, lit, rand\n\n<em># Add a random 0-9 salt and append it to the join key<\/em>\ndf_salted = df.withColumn(&quot;salt&quot;, (rand() * 10).cast(&quot;int&quot;))\ndf_salted = df_salted.withColumn(&quot;salted_key&quot;, concat(col(&quot;key&quot;), lit(&quot;_&quot;), col(&quot;salt&quot;)))<\/code><\/pre>\n\n\n\n<p>Salting works by redistributing data more evenly, preventing some resources from becoming overloaded.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide\" target=\"_blank\" rel=\"noreferrer noopener\">Explode() and joins<\/a>: the villains of the data explosion<\/h3>\n\n\n\n<p>There are two operations that are champions in making data explode:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>explode()<\/strong>: takes columns that are collections and transforms them into individual rows, multiplying the volume of data<\/li>\n\n\n\n<li><strong>joins<\/strong>: especially when they produce many more rows than you expected (you can check this in the SortMergeJoin node in the Spark UI)<\/li>\n<\/ol>\n\n\n\n<p>I&#039;ve seen 128MB partitions turn into gigabytes because of these explosions, and then the available memory can&#039;t handle it.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Use repartitioning to control explosions<\/h3>\n\n\n\n<p>When you&#039;re dealing with data explosions, a few strategies can save the day:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Decrease\u00a0<code>spark.sql.files.maxPartitionBytes<\/code>\u00a0from the default 128MB to 16MB or 32MB<\/li>\n\n\n\n<li>Run\u00a0<code>repartition()<\/code>\u00a0right after reading the data<\/li>\n\n\n\n<li>For explosions that occur in joins, increase the number of shuffle partitions<\/li>\n<\/ul>\n\n\n\n<p>These 
techniques can prevent data from being flushed to disk and greatly improve memory management when you&#039;re processing large volumes.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Cache and Persistence: How to Really Gain Performance<\/h2>\n\n\n\n<p>Caching is one of those things everyone talks about as important, but in practice, many people don&#039;t quite know when and how to use it. It turns out that caching data can make a huge difference in PySpark&#039;s memory management, especially when you need to reuse the same data multiple times.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Databricks Delta Cache: The Cache That Runs Itself<\/h3>\n\n\n\n<p>Databricks&#039; disk cache, formerly known as Delta Cache, is a feature that significantly speeds up data reads. It creates local copies of your Parquet files using an optimized intermediate format. The cool thing is that this feature is automatically enabled on nodes with SSD volumes, and it uses at most&nbsp;<a href=\"https:\/\/learn.microsoft.com\/pt-br\/azure\/databricks\/optimizations\/disk-cache\" target=\"_blank\" rel=\"noreferrer noopener\">half the available space<\/a>&nbsp;on these devices.<\/p>\n\n\n\n<p>To check if it is working or to change the settings:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><em># View how it is configured<\/em>\nspark.conf.get(&quot;spark.databricks.io.cache.enabled&quot;)\n\n<em># Turn the cache on or off<\/em>\nspark.conf.set(&quot;spark.databricks.io.cache.enabled&quot;, &quot;true&quot;)<\/code><\/pre>\n\n\n\n<p>One interesting thing about this disk cache is that it automatically detects when files change, so you don&#039;t have to worry about manually invalidating the cache. 
This is quite different from Apache Spark&#039;s default cache.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to use cache() and persist() without breaking everything<\/h3>\n\n\n\n<p>While Delta Cache takes care of the files, the&nbsp;<code>cache()<\/code>&nbsp;and&nbsp;<code>persist()<\/code>&nbsp;methods improve performance when you are going to use the same DataFrame multiple times:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from pyspark import StorageLevel\n\n<em># Basic cache (uses MEMORY_AND_DISK by default)<\/em>\ndf.cache()\n\n<em># Persist with more control over where to store it<\/em>\ndf.persist(storageLevel=StorageLevel.MEMORY_ONLY)<\/code><\/pre>\n\n\n\n<p>Here&#039;s an important catch: these two operations are&nbsp;<a href=\"https:\/\/sparkbyexamples.com\/pyspark\/pyspark-cache-explained\/\" target=\"_blank\" rel=\"noreferrer noopener\">lazy evaluation operations<\/a>, meaning they will only execute when you call an action. If you want caching to happen immediately, you need to force it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>df.persist()\ndf.count()  <em># This forces the cache to be materialized<\/em><\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Temporary views or tables? It depends on what you want<\/h3>\n\n\n\n<p>Temporary views are virtual, so every time you access them, the query runs again. Temporary tables, on the other hand, materialize the results. The choice depends on your situation:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Temporary views<\/strong>: when it is something simple or that you will only use once<\/li>\n\n\n\n<li><strong>Temporary tables<\/strong>: for heavy transformations that you will access multiple times<\/li>\n<\/ul>\n\n\n\n<p>There&#039;s an interesting strategy for computationally expensive operations that you&#039;ll access frequently. 
You can create a temporary view and then cache it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>dataframe.createOrReplaceTempView(&quot;view_name&quot;)\nspark.sql(&quot;CACHE TABLE view_name&quot;)<\/code><\/pre>\n\n\n\n<p>This approach combines the best of both worlds: the flexibility of views with the performance benefits of caching.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>After exploring all these memory management techniques in PySpark, it&#039;s clear that there&#039;s no magic solution that solves everything at once. Each problem has its own unique characteristics and requires a specific approach.<\/p>\n\n\n\n<p>The memory analysis tools introduced with Databricks Runtime 12.0 have truly changed the game. While it used to be very difficult to understand where the bottlenecks were, now we can see exactly which lines of code are consuming the most resources. This makes it much easier to make the necessary improvements.<\/p>\n\n\n\n<p>We&#039;ve seen how data layout in Delta tables can make a huge difference in performance. optimizeWrite and autoCompact solve that annoying problem of small files, while ZORDER is a great help when you need to query specific data frequently.<\/p>\n\n\n\n<p>Shuffles and partitions remain one of the most important aspects of preventing data from being spilled to disk. AQE helps a lot with this, but we still need to understand our data and adjust settings as needed.<\/p>\n\n\n\n<p>It&#039;s important to note that skew and data explosion issues can arise in unexpected ways, especially when working with large volumes. Knowing how to identify these issues in Spark UI and having salting techniques up your sleeve makes all the difference.<\/p>\n\n\n\n<p>Caching strategies, both Delta Cache and persist() on DataFrames, complete this toolkit. 
When used correctly, they eliminate a lot of unnecessary reprocessing and save resources.<\/p>\n\n\n\n<p>All of these techniques are now part of the daily routine of anyone working with big data, and we need to keep adapting them as our data and workloads evolve.<\/p>\n\n\n\n<p>The important thing to remember is that memory optimization isn&#039;t something you do once and forget about. It&#039;s an ongoing process that needs to keep pace with your data growth and the evolution of your applications. Each new project brings its own challenges, and having these tools well-understood makes it much easier to address bottlenecks that arise along the way.<\/p>\n\n\n\n<p>For those who want to delve deeper into this world of large-scale data processing, it&#039;s worth continuing to study and practice these techniques in real-world scenarios. Practical experience is what truly solidifies this knowledge.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">FAQs<\/h2>\n\n\n\n<p><strong>Q1. How does Adaptive Query Execution (AQE) improve performance in PySpark?<\/strong>&nbsp;AQE optimizes queries during execution by dynamically adjusting the number of shuffle partitions and splitting skewed partitions. This results in better resource utilization and reduced performance bottlenecks.<\/p>\n\n\n\n<p><strong>Q2. What are the best practices to avoid the \u201csmall files\u201d issue in Delta Lake?<\/strong>&nbsp;Leverage features like optimizeWrite and autoCompact, adjust delta.targetFileSize, and apply ZORDER to frequently filtered columns. These techniques optimize data layout, improving query performance.<\/p>\n\n\n\n<p><strong>Q3. How to identify and fix data skew issues in Spark?<\/strong>&nbsp;Analyze the Spark UI to identify tasks that take longer than average and use shuffle metrics. To fix this, use skew hints, implement salting, or leverage AQE to automatically manage skewed joins.<\/p>\n\n\n\n<p><strong>Q4. 
What is the difference between Databricks disk cache and Apache Spark cache?<\/strong>&nbsp;Databricks&#039; disk cache (Delta Cache) optimizes reading of Parquet files by automatically detecting changes. Spark&#039;s cache (cache() and persist()) improves the performance of repeated transformations on DataFrames.<\/p>\n\n\n\n<p><strong>Q5. When should I use temporary views instead of intermediate tables in Spark?<\/strong>&nbsp;Use temporary views for simple queries or single-use, and temporary tables for complex transformations or multiple accesses. For expensive and frequent operations, consider creating a temporary view with caching to combine flexibility and performance.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<p>[1] &#8211;&nbsp;<a href=\"https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.databricks.com\/discover\/pages\/optimize-data-workloads-guide<\/a><br>[2] &#8211;&nbsp;<a href=\"https:\/\/www.databricks.com\/blog\/2022\/11\/30\/memory-profiling-pyspark.html\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.databricks.com\/blog\/2022\/11\/30\/memory-profiling-pyspark.html<\/a><br>[3] &#8211;&nbsp;<a href=\"https:\/\/blog.dataengineerthings.org\/a-quick-guide-to-spark-and-databricks-optimization-engines-1d2089185cf2\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/blog.dataengineerthings.org\/a-quick-guide-to-spark-and-databricks-optimization-engines-1d2089185cf2<\/a><br>[4] &#8211;&nbsp;<a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/synapse-analytics\/spark\/optimize-write-for-apache-spark\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/learn.microsoft.com\/en-us\/azure\/synapse-analytics\/spark\/optimize-write-for-apache-spark<\/a><br>[5] &#8211;&nbsp;<a href=\"https:\/\/delta.io\/blog\/delta-lake-optimize\/\" target=\"_blank\" rel=\"noreferrer 
noopener\">https:\/\/delta.io\/blog\/delta-lake-optimize\/<\/a><br>[6] &#8211;&nbsp;<a href=\"https:\/\/docs.databricks.com\/aws\/en\/delta\/tune-file-size\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/docs.databricks.com\/aws\/en\/delta\/tune-file-size<\/a><br>[7] &#8211;&nbsp;<a href=\"https:\/\/docs.databricks.com\/aws\/en\/delta\/data-skipping\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/docs.databricks.com\/aws\/en\/delta\/data-skipping<\/a><br>[8] &#8211;&nbsp;<a href=\"https:\/\/community.databricks.com\/t5\/data-engineering\/what-is-z-ordering-in-delta-and-what-are-some-best-practices-on\/td-p\/26639\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/community.databricks.com\/t5\/data-engineering\/what-is-z-ordering-in-delta-and-what-are-some-best-practices-on\/td-p\/26639<\/a><br>[9] &#8211;&nbsp;<a href=\"https:\/\/www.databricks.com\/blog\/2020\/10\/21\/faster-sql-adaptive-query-execution-in-databricks.html\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.databricks.com\/blog\/2020\/10\/21\/faster-sql-adaptive-query-execution-in-databricks.html<\/a><br>[10] &#8211;&nbsp;<a href=\"https:\/\/www.sparkcodehub.com\/pyspark\/performance\/shuffle-optimization\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.sparkcodehub.com\/pyspark\/performance\/shuffle-optimization<\/a><br>[11] &#8211;&nbsp;<a href=\"https:\/\/spark.apache.org\/docs\/3.5.0\/sql-performance-tuning.html\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/spark.apache.org\/docs\/3.5.0\/sql-performance-tuning.html<\/a><br>[12] &#8211;&nbsp;<a href=\"https:\/\/www.databricks.com\/notebooks\/gallery\/SparkAdaptiveQueryExecution.html\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.databricks.com\/notebooks\/gallery\/SparkAdaptiveQueryExecution.html<\/a><br>[13] &#8211;&nbsp;<a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ReleaseGuide\/emr-spark-performance.html\" target=\"_blank\" rel=\"noreferrer 
noopener\">https:\/\/docs.aws.amazon.com\/emr\/latest\/ReleaseGuide\/emr-spark-performance.html<\/a><br>[14] &#8211;&nbsp;<a href=\"https:\/\/aws.amazon.com\/blogs\/big-data\/detect-and-handle-data-skew-on-aws-glue\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/aws.amazon.com\/blogs\/big-data\/detect-and-handle-data-skew-on-aws-glue\/<\/a><br>[15] &#8211;&nbsp;<a href=\"https:\/\/www.linkedin.com\/pulse\/what-data-skewness-spark-how-handle-code-soutir-sen-xf6hf\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.linkedin.com\/pulse\/what-data-skewness-spark-how-handle-code-soutir-sen-xf6hf<\/a><br>[16] &#8211;&nbsp;<a href=\"https:\/\/docs.aws.amazon.com\/prescriptive-guidance\/latest\/tuning-aws-glue-for-apache-spark\/optimize-shuffles.html\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/docs.aws.amazon.com\/prescriptive-guidance\/latest\/tuning-aws-glue-for-apache-spark\/optimize-shuffles.html<\/a><br>[17] &#8211;&nbsp;<a href=\"https:\/\/spark.apache.org\/docs\/latest\/sql-performance-tuning.html\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/spark.apache.org\/docs\/latest\/sql-performance-tuning.html<\/a><br>[18] &#8211;&nbsp;<a href=\"https:\/\/spark.apache.org\/docs\/3.5.3\/sql-performance-tuning.html\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/spark.apache.org\/docs\/3.5.3\/sql-performance-tuning.html<\/a><br>[19] &#8211;&nbsp;<a href=\"https:\/\/www.linkedin.com\/pulse\/handling-data-skewness-spark-power-salting-pyspark-kommanaboina-vskic\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.linkedin.com\/pulse\/handling-data-skewness-spark-power-salting-pyspark-kommanaboina-vskic<\/a><br>[20] &#8211;&nbsp;<a href=\"https:\/\/docs.databricks.com\/aws\/en\/optimizations\/disk-cache\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/docs.databricks.com\/aws\/en\/optimizations\/disk-cache<\/a><br>[21] &#8211;&nbsp;<a 
href=\"https:\/\/learn.microsoft.com\/pt-br\/azure\/databricks\/optimizations\/disk-cache\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/learn.microsoft.com\/pt-br\/azure\/databricks\/optimizations\/disk-cache<\/a><br>[22] &#8211;&nbsp;<a href=\"https:\/\/sparkbyexamples.com\/pyspark\/pyspark-cache-explained\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/sparkbyexamples.com\/pyspark\/pyspark-cache-explained\/<\/a><br>[23] &#8211;&nbsp;<a href=\"https:\/\/community.databricks.com\/t5\/data-engineering\/temp-table-vs-temp-view-vs-temp-table-function-which-one-is\/td-p\/4087\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/community.databricks.com\/t5\/data-engineering\/temp-table-vs-temp-view-vs-temp-table-function-which-one-is\/td-p\/4087<\/a><br>[24] &#8211;&nbsp;<a href=\"https:\/\/www.chaosgenius.io\/blog\/databricks-temporary-table\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.chaosgenius.io\/blog\/databricks-temporary-table\/<\/a><br>[25] &#8211;&nbsp;<a href=\"https:\/\/stackoverflow.com\/questions\/50716772\/spark-tempview-performance\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/stackoverflow.com\/questions\/50716772\/spark-tempview-performance<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>\ud83d\ude80 Optimize your memory management in PySpark with these valuable tips for Databricks notebooks! 
\ud83d\udcbb Since Databricks Runtime 12.0, memory analysis tools make it easier to identify bottlenecks and improve performance.<\/p>","protected":false},"author":1,"featured_media":1399,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[1],"tags":[26,35,38,37,36],"class_list":["post-1394","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-data-lake","tag-databricks","tag-delta-lake","tag-memoria","tag-pyspark"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.7 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Optimizing Memory Management in PySpark: A Practical Guide for Databricks Notebooks - Dataforma | Business Intelligence<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" 
href=\"https:\/\/dataforma.tech\/en\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Optimizing Memory Management in PySpark: A Practical Guide for Databricks Notebooks - Dataforma | Business Intelligence\" \/>\n<meta property=\"og:description\" content=\"\ud83d\ude80 Optimize your memory management in PySpark with these valuable tips for Databricks notebooks! \ud83d\udcbb Since Databricks Runtime 12.0, memory analysis tools make it easier to identify bottlenecks and improve performance.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/dataforma.tech\/en\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/\" \/>\n<meta property=\"og:site_name\" content=\"Dataforma | Business Intelligence\" \/>\n<meta property=\"article:published_time\" content=\"2025-09-07T17:35:42+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-09-07T17:35:44+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-4.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1300\" \/>\n\t<meta property=\"og:image:height\" content=\"742\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Guilherme Rodrigues\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@dataformacomp\" \/>\n<meta name=\"twitter:site\" content=\"@dataformacomp\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Guilherme Rodrigues\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"15 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/dataforma.tech\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/dataforma.tech\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/\"},\"author\":{\"name\":\"Guilherme Rodrigues\",\"@id\":\"https:\/\/dataforma.tech\/#\/schema\/person\/35d2d0ba7ea231ccd708f76e6a84b572\"},\"headline\":\"Optimizing Memory Management in PySpark: A Practical Guide for Databricks Notebooks\",\"datePublished\":\"2025-09-07T17:35:42+00:00\",\"dateModified\":\"2025-09-07T17:35:44+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/dataforma.tech\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/\"},\"wordCount\":2977,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/dataforma.tech\/#organization\"},\"image\":{\"@id\":\"https:\/\/dataforma.tech\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-4.png\",\"keywords\":[\"Data Lake\",\"DataBricks\",\"Delta 
Lake\",\"Memory\",\"Pyspark\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/dataforma.tech\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/dataforma.tech\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/\",\"url\":\"https:\/\/dataforma.tech\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/\",\"name\":\"Optimizing Memory Management in PySpark: A Practical Guide for Databricks Notebooks - Dataforma | Business Intelligence\",\"isPartOf\":{\"@id\":\"https:\/\/dataforma.tech\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/dataforma.tech\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/dataforma.tech\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-4.png\",\"datePublished\":\"2025-09-07T17:35:42+00:00\",\"dateModified\":\"2025-09-07T17:35:44+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/dataforma.tech\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/dataforma.tech\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/dataforma.tech\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/#primaryimage\",\"url\":\"https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/09\/image-4.png\",\"contentUrl\":\"https:\/\/dataforma.tech\/w
p-content\/uploads\/2025\/09\/image-4.png\",\"width\":1300,\"height\":742},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/dataforma.tech\/blog\/otimizando-o-gerenciamento-de-memoria-no-pyspark-guia-pratico-para-notebooks-databricks\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/dataforma.tech\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Optimizing Memory Management in PySpark: A Practical Guide for Databricks Notebooks\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/dataforma.tech\/#website\",\"url\":\"https:\/\/dataforma.tech\/\",\"name\":\"Dataforma | Business Intelligence\",\"description\":\"BI Consulting\",\"publisher\":{\"@id\":\"https:\/\/dataforma.tech\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/dataforma.tech\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/dataforma.tech\/#organization\",\"name\":\"Dataforma | Business Intelligence\",\"url\":\"https:\/\/dataforma.tech\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/dataforma.tech\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/07\/Logo-Dataforma-01.svg\",\"contentUrl\":\"https:\/\/dataforma.tech\/wp-content\/uploads\/2025\/07\/Logo-Dataforma-01.svg\",\"width\":300,\"height\":42,\"caption\":\"Dataforma | Business Intelligence\"},\"image\":{\"@id\":\"https:\/\/dataforma.tech\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/dataformacomp\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/dataforma.tech\/#\/schema\/person\/35d2d0ba7ea231ccd708f76e6a84b572\",\"name\":\"Guilherme 
Rodrigues\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/dataforma.tech\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/2b3128f6bce22b9a5c34983edfb84d9cdffc980bf397edf478a5cc4b8b271e75?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/2b3128f6bce22b9a5c34983edfb84d9cdffc980bf397edf478a5cc4b8b271e75?s=96&d=mm&r=g\",\"caption\":\"Guilherme Rodrigues\"},\"sameAs\":[\"https:\/\/dataforma.tech\"],\"url\":\"https:\/\/dataforma.tech\/en\/blog\/author\/heyoguilherme-com\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->"}